
Source filter synthesis in SuperVP


The manual page says:

The module supervp.sourcefilter~ performs source-filter cross-synthesis to two incoming sound streams, imprinting the spectral envelope of the ‘filter’ stream (right inlets) to the ‘source’ stream (left inlets).

I am trying to wrap my head around the notion of a spectral envelope. Having played with the supervp.sourcefilter~ overview patch, it is clear to me that source stream definitely obtains “an” envelope from the filter stream. However, it does not sound like a conventional gain envelope and it sounds different than what a gain envelope would do. What is a spectral envelope in layman’s terms, if possible? :)


A spectral envelope is a filter passing through the spectral peaks. It represents the sound’s timbre or, for speech, the formants. So -Gmul extracts the spectral envelope, that is, the filter representing the timbre of the second track’s sound, and applies it to the first track. So if your first track contains the sound of a washing machine and the second track contains speech, you can in principle generate a speaking washing machine.

For this to work you need a good-quality formant estimation, and you would also do better to remove the original timbre of the washing machine. But this may be another story.
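To make the idea concrete, here is a minimal Python/NumPy sketch of source-filter cross-synthesis on a single frame. It is not SuperVP's implementation: the envelope is estimated by plain cepstral smoothing, a simpler stand-in for the estimators discussed in this thread, and the names `cepstral_envelope`, `cross_frame` and the default `order=30` are assumptions made up for illustration.

```python
import numpy as np

def cepstral_envelope(mag, order):
    """Smooth a magnitude spectrum by keeping only low-quefrency cepstral bins.

    `mag` is the output of np.abs(np.fft.rfft(frame)); `order` must be >= 1.
    """
    if order < 1:
        raise ValueError("order must be >= 1")
    log_mag = np.log(np.maximum(mag, 1e-12))
    cep = np.fft.irfft(log_mag)            # real cepstrum of the frame
    lifter = np.zeros_like(cep)
    lifter[:order + 1] = 1.0               # keep quefrencies 0..order
    lifter[-order:] = 1.0                  # and their mirrored counterparts
    return np.exp(np.fft.rfft(cep * lifter).real)

def cross_frame(source, filt, order=30):
    """Imprint the spectral envelope of `filt` onto `source` (one frame)."""
    win = np.hanning(len(source))
    S = np.fft.rfft(source * win)
    F = np.fft.rfft(filt * win)
    env_s = cepstral_envelope(np.abs(S), order)
    env_f = cepstral_envelope(np.abs(F), order)
    # Divide out the source's own timbre, then multiply by the filter's.
    Y = S / env_s * env_f
    return np.fft.irfft(Y, n=len(source))
```

Dividing by `env_s` is the "remove the original timbre" step mentioned above; without it, the output carries both timbres at once. A real system would process overlapping frames and overlap-add the results.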

If you are interested, here is a paper explaining the best spectral envelope estimation algorithm available in SuperVP.


The simpler algorithm that is available is the LPC algorithm; you can find the ideas here:

https://web.ece.ucsb.edu/Faculty/Rabiner/ece259/digital%20speech%20processing%20course/lectures_new/Lecture%2013_winter_2012.pdf notably page 6
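For readers who prefer code to slides, the LPC idea can be sketched as follows. This is the textbook autocorrelation method with the Levinson-Durbin recursion, written in Python/NumPy; it is an illustration, not SuperVP's code, and the function names and the `n_fft` default are assumptions.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC via Levinson-Durbin.

    Returns (a, err) where a[0] == 1 and A(z) = sum_k a[k] z^-k is the
    prediction-error filter; err is the residual energy.
    """
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def lpc_envelope(frame, order, n_fft=1024):
    """Spectral envelope sqrt(err) / |A(e^jw)| on n_fft // 2 + 1 bins."""
    a, err = lpc_coefficients(frame * np.hanning(len(frame)), order)
    return np.sqrt(err) / np.abs(np.fft.rfft(a, n_fft))
```

The LPC order plays the role of a smoothness control: a low order gives a few broad resonances, a high order starts to trace individual partials.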

Thanks! I think you meant to say -Gcross though.

Also I am curious if you are familiar with http://www.akirarabelais.com/o/software/al.html. The tool is enigmatic by design :) but it has a module called “Faltung In Zeit” which is said to provide “convolution with a time domain aspect”. It does something that I feel like is similar to source filtering but maybe a bit more.

I think you meant to say -Gcross though.

No - cross synthesis has three modes: -Gcross is the cross mode that works with amplitude frequency encoding of spectra. There are no spectral envelopes there! It is the mul mode that you select with -Gmul that works with spectral envelopes.

but it has a module called “Faltung In Zeit”

I cannot tell you whether their algorithm is good or not. I never worked with it.

As a preliminary note: it is entirely possible to use the most basic algorithms, like granular synthesis and filtering, to create wonderful audio effects, and it may well be that the software you mention uses these fundamental algorithms in innovative ways and achieves unheard results, or is otherwise a very good piece of software.

That said, the description is a great marketing piece: it creates mystery and value out of thin air! As they say, they use granular-type synthesis, which is the most fundamental and simplest algorithm you can think of. Following their marketing strategy, they should have named it Zeitschneider = time cutter, or cutting with a time aspect :) . Concerning Faltung In Zeit: my intuition tells me that in 99.9% of all cases where convolution is used with audio, the convolution is performed in the time domain and would therefore qualify as Faltung in Zeit. Basically, unless you optimize for computational efficiency, every digital filter, e.g. a low-pass filter, is doing Faltung in Zeit.
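As a small illustration of that last point, here is a Python/NumPy sketch (the signal and the 8-tap moving-average low-pass are arbitrary choices) showing that an ordinary FIR filter is a time-domain convolution, and that the FFT route used for efficiency gives the same result:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(256)          # any input signal
h = np.ones(8) / 8.0                  # moving-average low-pass impulse response

# Time-domain convolution: this is what an FIR filter computes.
y_time = np.convolve(x, h)

# Same operation done via the frequency domain, purely as an optimization.
n = len(x) + len(h) - 1
y_freq = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

max_diff = np.max(np.abs(y_time - y_freq))   # tiny, rounding error only
```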

I personally don’t like this kind of marketing.

Ah! Let me get my mapping straight. Is my understanding below accurate?

  • supervp.cross~ -> -Gcross
  • supervp.sourcefilter~ -> -Gmul
  • I did not notice a supervp-max equivalent option/command for -Gadd and -Gpmul options.

Thanks for your thoughts! Interesting, would it be possible to transfer the characteristics of one sound stream to another without cross synthesis but via granular synthesis and filtering? I had always approached granular synthesis as a technique/approach which operates on a singular stream.

Personally I kind of like the presentation of the tool in the way that it forces you to tinker with it and somehow build an intuition for how you use it. If it were a paid product, it could be another story.

Correct for -Gcross and -Gmul.

I did not notice a supervp-max equivalent option/command for -Gadd and -Gpmul options.

-Gadd simply adds two sounds together. I think nobody ever really used it.
-Gpmul is missing.

Thanks for confirming. As a side question, when using the command line SuperVP for cross synthesis, I hit a wall with some sound files because of srate_different, which I believe indicates that the sampling rates are different. I do not observe this in SuperVP for Max; synthesis just works with any file irrespective of the sample rate.

  1. Is there a practical way to discover the sample rate of the files and convert them to one another’s rate using some of the IRCAM tools?

  2. Is command line SuperVP better in terms of processing quality compared to SuperVP for Max? This might be an irrelevant question but I presume there might be some technical limitations in either of the options that might affect processing quality.

Granular synthesis takes sound grains and assembles them according to a strategy. Where these grains come from (one or multiple sounds), and how you assemble them, depends on the system. Diemo Schwarz has built many such systems, see his CataRT system here. For cross synthesis you can create a grain database, annotate all grains with sound descriptors, take another sound and calculate the descriptors and then synthesize from your grain database using the descriptors to control the grain selection. If I remember right that is what the CataRT system does.
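A toy version of that descriptor-driven selection might look like the sketch below. It is not CataRT: the two descriptors (normalized spectral centroid and RMS), the unweighted Euclidean distance, and all function names are assumptions chosen to keep the example short.

```python
import numpy as np

def grains(x, size=512, hop=256):
    """Cut x into Hann-windowed grains, one per row."""
    win = np.hanning(size)
    starts = range(0, len(x) - size + 1, hop)
    return np.array([x[s:s + size] * win for s in starts])

def descriptors(g):
    """Per-grain descriptors: normalized spectral centroid and RMS."""
    mag = np.abs(np.fft.rfft(g, axis=1))
    bins = np.arange(mag.shape[1])
    centroid = (mag * bins).sum(axis=1) / (mag.sum(axis=1) + 1e-12)
    rms = np.sqrt((g ** 2).mean(axis=1))
    return np.stack([centroid / mag.shape[1], rms], axis=1)

def select_grains(corpus, target, size=512, hop=256):
    """For each target grain, pick the corpus grain with the closest descriptors."""
    gc, gt = grains(corpus, size, hop), grains(target, size, hop)
    dc, dt = descriptors(gc), descriptors(gt)
    dist = ((dt[:, None, :] - dc[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1), gc

def resynthesize(indices, corpus_grains, hop=256):
    """Overlap-add the selected corpus grains in target order."""
    size = corpus_grains.shape[1]
    out = np.zeros(hop * (len(indices) - 1) + size)
    for i, idx in enumerate(indices):
        out[i * hop:i * hop + size] += corpus_grains[idx]
    return out
```

The output is built entirely from corpus material, while the target only steers which grain is played when; richer descriptor sets and distance weightings change the character considerably.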

Another cross synthesis approach (for which I don’t know of any implementation): you could take pairs of grains from two different sources and convolve each pair to get a convolution with a time aspect. There are very many possibilities.
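A rough Python/NumPy sketch of this grain-pair idea, purely as an illustration (grain size, hop, the Hann window and the function name are arbitrary assumptions, and no output normalization is attempted):

```python
import numpy as np

def grain_pair_convolution(a, b, grain=512, hop=256):
    """Convolve time-aligned grains of a and b and overlap-add the results."""
    n = min(len(a), len(b))
    win = np.hanning(grain)
    out = np.zeros(n + grain)              # room for the convolution tails
    for start in range(0, n - grain + 1, hop):
        ga = a[start:start + grain] * win
        gb = b[start:start + grain] * win
        conv = np.convolve(ga, gb)         # length 2 * grain - 1
        out[start:start + len(conv)] += conv
    return out
```

Because each grain pair is convolved separately, the spectral product follows both sounds over time instead of being one fixed convolution of the two whole files.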

Personally I kind of like the presentation of the tool in the way that it forces you to tinker with it and somehow build an intuition for how you use it.

Sounds good. Once again, please note: I have reacted to the description only. This does not imply any judgement of the quality of the tool!


I am not using Max that much. I know we added a resampling option to the library interface that allows resampling the input sound on the fly. This is not available as an automatic procedure on the command line; you need to do this by hand.


supervp -S file.aiff -E0 -v /dev/null 

displays the sample rate and a few other parameters of the sound file.aiff.

supervp -S file.aiff -H16000 file_16kHz.aiff

converts the sample rate of file.aiff to 16kHz and saves this file into file_16kHz.aiff

I have already discussed that here, in the section on EStudio under real time engine, as well as directly in relation to Max here.

I see. So there are some differences with respect to frequency domain based transposition and potentially the choice of default params across different tools (tool=Max, command line, AudioSculpt etc.) can be different, leading to differences in output.

So there are some differences with respect to frequency domain based transposition

Frequency domain transposition is available on the command line and in Max, not in AudioSculpt.
This is also a question of parameter defaults because the transposition mode is a parameter and you can switch it off in Max.

I don’t have Max so I cannot check now what the default is, but from memory I think it was “auto” and not “time”.

Thanks! Super useful answers as always :)

Next I am looking forward to getting a practical understanding of window functions and window size in SuperVP, like “if I choose rectangle and a smaller window size, I will likely hear something grittier” type of an intuition, which sounds pretty reductive on its own and may not even be possible, I admit though :)


Window functions and window size determine the frequency and time resolution of the analysis. The larger the window, the better you can resolve closely spaced partials but the worse you can resolve sequences of attacks (transient preservation will only work if there is a single transient within each window). For monophonic sounds the rule of thumb is: choose the window size such that it covers about 4 periods of the smallest f0 you want to be able to treat. For polyphonic instrumental sounds a good compromise is choosing a window size of 100ms. This works OK most of the time.
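The rule of thumb is simple arithmetic. A small Python helper, with function names invented for this example, makes it easy to try values:

```python
def window_size_ms(min_f0_hz, periods=4):
    """Window length in milliseconds covering `periods` periods of min_f0_hz."""
    return 1000.0 * periods / min_f0_hz

def window_size_samples(min_f0_hz, sample_rate_hz, periods=4):
    """The same window length expressed in samples."""
    return int(round(periods * sample_rate_hz / min_f0_hz))

# Lowest f0 of 100 Hz at 44.1 kHz:
print(window_size_ms(100))              # 40.0
print(window_size_samples(100, 44100))  # 1764
```

By the same arithmetic, the 100 ms compromise for polyphonic material corresponds to 4410 samples at 44.1 kHz, or to four periods of a 40 Hz fundamental.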

The window form is much less critical. It determines the attenuation of the side lobes. Put simply: the more side lobes you have, the more energy of each sinusoid will be dispersed into the side lobes, which then will not be handled coherently in the phase vocoder. The rectangular window is worst here, the Blackman window best, but the difference between Blackman and Hanning is rather small and perceptually not very relevant.
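The side lobe attenuation can be checked numerically. The sketch below (Python/NumPy, with an invented helper name and an arbitrary window length of 256) measures the highest side lobe relative to the main lobe for the three windows mentioned:

```python
import numpy as np

def peak_sidelobe_db(window, pad=64):
    """Level of the highest side lobe relative to the main-lobe peak, in dB."""
    spec = np.abs(np.fft.rfft(window, len(window) * pad))
    spec_db = 20 * np.log10(np.maximum(spec / spec[0], 1e-12))
    # Walk down the main lobe until the spectrum starts rising again.
    i = 1
    while i < len(spec_db) - 1 and spec_db[i + 1] < spec_db[i]:
        i += 1
    return spec_db[i:].max()

n = 256
for name, w in [("rectangular", np.ones(n)),
                ("hanning", np.hanning(n)),
                ("blackman", np.blackman(n))]:
    print(f"{name:12s} {peak_sidelobe_db(w):7.1f} dB")
```

The commonly quoted textbook figures are roughly -13 dB for the rectangular window, -31 dB for Hanning and -58 dB for Blackman. Note that the better attenuation comes with a wider main lobe, so frequency resolution for a given window size gets slightly worse.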


Can you please explain the use of the maximal f0 for envelope estimation in regards to different spectral content?

Should this be related to the harmonic content (eg fundamental frequency) of the sound or is it about the frequency range that is used in the processing?

Are there optimal settings overall when working with harmonic vs non-harmonic material in the supervp.sourcefilter? For example, should complex polyphonic harmonic material use the trueenv mode with a certain f0 value while non-harmonic ‘noisy’ material should use lpc mode with a certain LPC order?

I’ve tried reading the paper but, as requested in the original post, an explanation in layman’s terms would be really helpful.


Hello srs

Should this be related to the harmonic content (eg fundamental frequency) of the sound or is it about the frequency range that is used in the processing?

The spectral envelope represents the filter component of a source-filter model; for voice sounds it corresponds approximately to the vocal tract filter. Now we need to estimate this (vocal tract) filter from a given sound signal, that is, from the excitation source signal after it has passed through the filter. In the source-filter model the harmonic structure is supposed to be part of the excitation, and the excitation is further assumed to be white (all spectral peaks have approximately the same amplitude). That means we need to find a filter that is defined by the harmonic peaks, while ignoring the valleys between the sinusoids.

The true envelope algorithm exploits the fact that the harmonic peaks of the excitation can be understood to create a down-sampled version of the log amplitude filter transfer function, where the F0 is the sampling period.
Exactly as is the case when sampling audio signals, sampling the filter creates aliasing, so some details of the filter are lost. Because we sample (the log amplitude) in the frequency domain, aliasing takes place in the time (more exactly, cepstral) domain. The best reconstruction you can achieve uses band-limited interpolation with 0.5 SR/F0 as the “cepstral” Nyquist limit. Choosing this order for a harmonic spectrum will not allow the filter to follow the valleys between the harmonic peaks. If you make it smaller, the filter will follow the peaks less precisely; if you make it larger, the peaks will be followed precisely but the filter will start to oscillate between the harmonic peaks. So the order should be limited as a function of the F0: the larger the F0 becomes, the smaller the order needs to be.
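For the curious, the iteration behind the true envelope can be sketched in a few lines of Python/NumPy. This is a simplified reading of the published algorithm (cepstral smoothing repeated on the maximum of the spectrum and the previous envelope), without the acceleration tricks a real implementation would use; the function names, the 2 dB stopping tolerance and the iteration cap are assumptions.

```python
import numpy as np

def smooth(log_mag, order):
    """Cepstral low-pass: keep quefrencies 0..order of a log spectrum."""
    cep = np.fft.irfft(log_mag)
    cep[order + 1:len(cep) - order] = 0.0
    return np.fft.rfft(cep).real

def true_envelope(mag, order, tol_db=2.0, max_iter=500):
    """Iterative cepstral smoothing that converges towards the spectral peaks."""
    log_mag = np.log(np.maximum(mag, 1e-9))
    tol = tol_db * np.log(10) / 20          # tolerance in natural-log units
    target = log_mag.copy()
    env = smooth(target, order)
    for _ in range(max_iter):
        if np.max(log_mag - env) <= tol:    # envelope covers all peaks
            break
        # Fill the valleys with the current envelope, then smooth again.
        target = np.maximum(target, env)
        env = smooth(target, order)
    return np.exp(env)

# Order from the cepstral Nyquist limit, e.g. SR = 16 kHz and F0 = 200 Hz:
sr, f0 = 16000, 200.0
order = int(0.5 * sr / f0)                  # 40
```

A plain cepstral envelope of the same order runs through the average of peaks and valleys; the iteration progressively lifts it onto the peaks while the order keeps it from dipping into the valleys.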

Now in AudioSculpt you can specify the true envelope order directly in terms of the F0 you expect to be present in your audio signal. You can either directly use an F0 analysis you have performed to adapt the order to the local signal, or you take a constant value which is used everywhere. In the latter case you need to take the maximum F0 over the signal, because this value ensures that the envelope will nowhere be able to capture the harmonic structure. On the other hand, using a global value will result in an over-smoothed filter wherever the F0 is smaller. Depending on the signal, you may lower the max F0 a little to get a more precise filter if the maximum F0 does not occur often.
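To put numbers on this trade-off, here is a short Python example using the 0.5 SR/F0 limit from above. The sample rate and the F0 values are made up for illustration:

```python
def cepstral_order(sample_rate_hz, f0_hz):
    """Largest envelope order that does not resolve harmonics spaced f0_hz apart."""
    return int(0.5 * sample_rate_hz / f0_hz)

sr = 44100
f0_track_hz = [50.0, 200.0, 1000.0]      # hypothetical F0 values in one sound

local_orders = [cepstral_order(sr, f) for f in f0_track_hz]
global_order = cepstral_order(sr, max(f0_track_hz))

print(local_orders)   # [441, 110, 22]
print(global_order)   # 22
```

With a single max F0 of 1000 Hz the whole sound is analysed with order 22, so the passages at 50 Hz get an envelope far smoother than the order 441 they could support. That is the over-smoothing described above.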

Is this clearer?

Yes, thanks. That does help explain it though I think I’ll need to study that paper further to really understand it.

I’m using it in Max so I’m assuming I can only have a constant f0 value. Does a smaller f0 mean greater frequency resolution in the processing? A bit like how a larger FFT size increases frequency resolution but reduces time resolution.

What would be the difference in terms of audio quality with an f0 of 50 (or less) versus an f0 of 1000 (or more)?

In my case, both the source and filter sounds are very harmonically rich so I want as much frequency resolution as possible.