Source filter synthesis in SuperVP

Thanks! I think you meant to say -Gcross though.

Also I am curious if you are familiar with http://www.akirarabelais.com/o/software/al.html. The tool is enigmatic by design :slight_smile: but it has a module called “Faltung In Zeit” which is said to provide “convolution with a time domain aspect”. It does something that feels similar to source filtering to me, but maybe a bit more.

I think you meant to say -Gcross though.

No - cross synthesis has three modes: -Gcross is the cross mode that works with the amplitude/frequency encoding of the spectra. There are no spectral envelopes there! It is the mul mode, selected with -Gmul, that works with spectral envelopes.
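
To make the distinction concrete, here is a minimal sketch of the two ideas for a single analysis frame. This is my own toy illustration, not SuperVP’s actual implementation: the crude cepstral smoothing standing in as envelope estimator, the order, and all names are assumptions, and a real system would run this per frame inside an STFT with overlap-add.

import numpy as np

def cross_frame(frame_a, frame_b):
    # cross-style idea: combine the amplitudes of one spectrum
    # with the phases of the other; no spectral envelopes involved
    A = np.fft.rfft(frame_a)
    B = np.fft.rfft(frame_b)
    return np.fft.irfft(np.abs(A) * np.exp(1j * np.angle(B)))

def mul_frame(frame_src, frame_filt, order=50):
    # mul-style idea: multiply the source spectrum by a smooth
    # spectral envelope of the filter sound (cepstral smoothing here
    # stands in for SuperVP's LPC / true-envelope estimators)
    S = np.fft.rfft(frame_src)
    log_mag = np.log(np.abs(np.fft.rfft(frame_filt)) + 1e-12)
    cep = np.fft.irfft(log_mag)            # real cepstrum of the filter sound
    cep[order:len(cep) - order] = 0.0      # keep only the slow envelope part
    env = np.exp(np.fft.rfft(cep).real)    # smoothed spectral envelope
    return np.fft.irfft(S * env)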

but it has a module called “Faltung In Zeit”

I cannot tell you whether their algorithm is good or not; I have never worked with it.

As a preliminary note: it is entirely possible to use the most basic algorithms, like granular synthesis and filtering, to create wonderful audio effects, and it may well be that the software you mention uses these fundamental algorithms in innovative ways, achieves unheard-of results, or is otherwise a very good piece of software.

That said, the description is a great marketing piece: it creates mystery and value out of thin air! Given that they say they use granular-type synthesis, which is one of the most fundamental and simplest algorithms you can think of, following their marketing strategy they should have named it: Zeitschneider = time cutter, or cutting with a time aspect :slight_smile: . Concerning Faltung In Zeit (German for “convolution in time”): my intuition tells me that in 99.9% of all cases where convolution is used with audio, the convolution is performed in the time domain and would therefore qualify as Faltung in Zeit. Basically, unless you optimize for computational efficiency, every digital filter, e.g. a low-pass filter, is doing Faltung in Zeit.
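
To make that last point concrete: the most basic FIR low-pass filter is literally a convolution carried out in the time domain. A minimal sketch (the test signal and the 5-tap averaging kernel are arbitrary choices of mine):

import numpy as np

x = np.random.randn(44100)   # one second of noise at 44.1 kHz
h = np.ones(5) / 5           # impulse response of a crude low-pass (moving average)
y = np.convolve(x, h)        # Faltung in Zeit: convolution, sample by sample, in time

FFT-based (frequency-domain) convolution computes the same result; it is only a speed optimization.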

I personally don’t like this kind of marketing.

Ah! Let me get my mapping straight. Is my understanding below accurate?

  • supervp.cross~ -> -Gcross
  • supervp.sourcefilter~ -> -Gmul
  • I did not notice a supervp-max equivalent option/command for -Gadd and -Gpmul options.

Thanks for your thoughts! Interesting - would it be possible to transfer the characteristics of one sound stream to another without cross synthesis, but via granular synthesis and filtering? I had always approached granular synthesis as a technique that operates on a single stream.

Personally I kind of like the presentation of the tool, in the way that it forces you to tinker with it and somehow build an intuition for how to use it. If it were a paid product, it could be another story.

Correct for -Gcross and -Gmul.

I did not notice a supervp-max equivalent option/command for -Gadd and -Gpmul options.

-Gadd simply adds two sounds together. I think nobody ever really used it.
-Gpmul is missing.

Thanks for confirming. As a side question: when using the command line SuperVP for cross synthesis, I hit a wall with some sound files because of an srate_different error, which I believe indicates that the sampling rates are different. I do not observe this in SuperVP for Max; synthesis just works with any file irrespective of the sample rate.

  1. Is there a practical way to discover the sample rate of the files and convert them to one another’s rate using some of the IRCAM tools?

  2. Is command line SuperVP better in terms of processing quality compared to SuperVP for Max? This might be an irrelevant question, but I presume there might be some technical limitations in either of the options that might affect processing quality.

Granular synthesis takes sound grains and assembles them according to a strategy. Where these grains come from (one or multiple sounds), and how you assemble them, depends on the system. Diemo Schwarz has built many such systems; see his CataRT system here. For cross synthesis you can create a grain database, annotate all grains with sound descriptors, take another sound, calculate its descriptors, and then synthesize from your grain database using the descriptors to control the grain selection. If I remember right, that is what the CataRT system does.
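
Sketched very roughly, that descriptor-driven idea looks like this. It is a toy of my own, not CataRT’s code: the grain size, the absence of overlap, and the single spectral-centroid descriptor are deliberate simplifications.

import numpy as np

def centroid(grain, sr=44100):
    # spectral centroid: a single, crude sound descriptor
    mag = np.abs(np.fft.rfft(grain))
    freqs = np.fft.rfftfreq(len(grain), 1 / sr)
    return np.sum(freqs * mag) / (np.sum(mag) + 1e-12)

def descriptor_cross(corpus, target, n=1024):
    # cut the corpus into grains and annotate each with its descriptor
    grains = [corpus[i:i + n] for i in range(0, len(corpus) - n, n)]
    descs = np.array([centroid(g) for g in grains])
    out = np.zeros(len(target))
    for i in range(0, len(target) - n, n):
        # pick the corpus grain whose descriptor is closest
        # to the descriptor of the current target grain
        best = np.argmin(np.abs(descs - centroid(target[i:i + n])))
        out[i:i + n] = grains[best]
    return out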

Another cross-synthesis approach (for which I don’t know of any implementation): you could take pairs of grains from two different sources and then convolve these pairs together to get a convolution with a time aspect. There are very many possibilities.
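
In the same toy spirit, a sketch of that grain-pair convolution (grain size, hop, window, and normalization are all assumptions of mine):

import numpy as np

def grain_pair_convolve(a, b, n=1024, hop=512):
    # convolve time-aligned grain pairs from the two sources and
    # overlap-add the results: a convolution "with a time aspect"
    out = np.zeros(len(a) + n)
    win = np.hanning(n)
    for i in range(0, min(len(a), len(b)) - n, hop):
        pair = np.convolve(a[i:i + n] * win, b[i:i + n] * win)
        out[i:i + len(pair)] += pair / n   # crude normalization
    return out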

Personally I kind of like the presentation of the tool, in the way that it forces you to tinker with it and somehow build an intuition for how to use it.

Sounds good. But please note once again: I have reacted to the description only. This does not imply any judgement of the quality of the tool!

I am not using Max that much. I know we added a resampling option to the library interface that allows resampling the input sound on the fly. This is not available as an automatic procedure on the command line; you need to do it by hand.

Sure

supervp -S file.aiff -E0 -v /dev/null 

displays the sample rate and a few other parameters of the sound file file.aiff.

supervp -S file.aiff -H16000 file_16kHz.aiff

converts the sample rate of file.aiff to 16 kHz and saves the result to file_16kHz.aiff.

I have already discussed that here, in the EStudio section under real-time engine, as well as directly related to Max here.

I see. So there are some differences with respect to frequency domain based transposition, and the choice of default parameters across the different tools (Max, command line, AudioSculpt, etc.) can be different, leading to differences in output.

So there are some differences with respect to frequency domain based transposition

Frequency domain transposition is available on the command line and in Max, not in AudioSculpt.
This is also a question of parameter defaults because the transposition mode is a parameter and you can switch it off in Max.

I don’t have Max so I cannot check now what the default is, but from memory I think it was “auto” and not “time”.

Thanks! Super useful answers as always :slight_smile:

Next I am looking forward to getting a practical understanding of window functions and window size in SuperVP, the kind of “if I choose rectangle and a smaller window size, I will likely hear something grittier” intuition, which I admit sounds pretty reductive on its own and may not even be possible :slight_smile:

Hello

Window functions and window size determine the frequency and time resolution of the analysis. The larger the window, the better you can resolve closely spaced partials, but the worse you can resolve sequences of attacks (transient preservation will only work if there is a single transient within each window). For monophonic sounds the rule of thumb is: choose the window size such that it covers about 4 periods of the smallest f0 you want to be able to treat. For polyphonic instrumental sounds a good compromise is a window size of 100 ms. This will work OK most of the time.
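
As a worked example of that rule of thumb (my own numbers, purely illustrative):

sr, f0_min = 44100, 110                 # sample rate and the lowest f0 you want to treat
win_seconds = 4.0 / f0_min              # about 4 periods: ~36 ms
win_samples = round(sr * win_seconds)   # ~1604 samples at 44.1 kHz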

The window form is much less critical. It determines the attenuation of the side lobes. Put simply: the more side lobes a window has, the more energy of each sinusoid will be dispersed into the side lobes, which then will not be handled coherently in the phase vocoder. The rectangular window is the worst here and the Blackman window the best, but the difference between Blackman and Hanning is rather small and perceptually not very relevant.

Can you please explain the use of the maximal f0 for envelope estimation with regard to different spectral content?

Should this be related to the harmonic content (eg fundamental frequency) of the sound or is it about the frequency range that is used in the processing?

Are there optimal settings overall when working with harmonic vs non-harmonic material in the supervp.sourcefilter? For example, should complex polyphonic harmonic material use the trueenv mode with a certain f0 value while non-harmonic ‘noisy’ material should use lpc mode with a certain LPC order?

I’ve tried reading the paper but, as requested in the original post, a layman’s terms explanation would be really helpful.

Thanks

Hello srs

Should this be related to the harmonic content (eg fundamental frequency) of the sound or is it about the frequency range that is used in the processing?

The spectral envelope represents the filter component of a source-filter model; for voice sounds it corresponds approximately to the vocal tract filter. Now we need to estimate this (vocal tract) filter from a given sound signal, that is, after the excitation source has passed through it. In the source-filter model the harmonic structure is supposed to be part of the excitation, which is further assumed to be white (all spectral peaks have approximately the same amplitude). That means we need to find a filter that is defined by the harmonic peaks while ignoring the valleys between the sinusoids.

The true envelope algorithm exploits the fact that the harmonic peaks of the excitation can be understood to create a down-sampled version of the log-amplitude filter transfer function, where the F0 is the sampling period. Exactly as when sampling audio signals, this sampling of the filter creates aliasing, and so some details of the filter are lost. Because we sample (log amplitude) in the frequency domain, the aliasing takes place in the time (more exactly: cepstral) domain. The best reconstruction you can achieve uses band-limited interpolation with 0.5 SR/F0 as the “cepstral” Nyquist limit. Choosing this order for a harmonic spectrum will not allow the filter to follow the valleys between the harmonic peaks. If you make it smaller, the filter will be less good at following the peaks (it will be less precise); if you make it larger, the peaks will be followed precisely, but the filter will start to oscillate between the harmonic peaks. So the order should be limited as a function of the F0, where the order needs to be smaller the larger the F0 becomes.
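
In numbers (my example values, not defaults): with SR = 44100 Hz and F0 = 200 Hz, the band limit gives a cepstral order of about 110, and doubling the F0 halves the admissible order:

sr, f0 = 44100, 200
order = int(0.5 * sr / f0)   # 110: the "cepstral" Nyquist limit discussed above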

Now in AudioSculpt you can specify the true envelope order directly in terms of the F0 you expect to be present in your audio signal. You can either directly use an F0 analysis you have performed to adapt the order to the local signal, or you take a constant value that is used everywhere. In the latter case you need to take the maximum F0 over the signal, because this value ensures that the envelope will nowhere be able to capture the harmonic structure. On the other hand, using a global value will result in an over-smoothed filter wherever the F0 is smaller. Depending on the signal, you may lower the max F0 a little bit to get a more precise filter if the maximum F0 does not occur often.

Is this clearer?

Yes, thanks. That does help explain it though I think I’ll need to study that paper further to really understand it.

I’m using it in Max so I’m assuming I can only have a constant f0 value. Does a smaller f0 mean greater frequency resolution in the processing? A bit like how a larger FFT size increases frequency resolution but reduces time resolution.

What would be the difference in terms of audio quality with an f0 of 50 (or less) versus an f0 of 1000 (or more)?

In my case, both the source and filter sounds are very harmonically rich so I want as much frequency resolution as possible.

Hello,

sorry I lost track of the question.

Does a smaller f0 mean greater frequency resolution in the processing?

Yes, that is correct. A smaller F0 means you have more samples of the spectral envelope on the frequency axis: the sampling step of the envelope is the F0, so a smaller F0 means a smaller step, which means higher resolution. For example, an F0 of 50 Hz gives you an envelope sample every 50 Hz, while an F0 of 1000 Hz gives you one only every 1000 Hz.

In fact this question is a bit confusing. You probably need to distinguish between the F0 that is actually in the sound (which you cannot choose anyway) and the F0 that you choose as a parameter to get a spectral envelope for a given sound. I will denote these as SndF0 and EnvF0.

So first the SndF0: the higher the pitch, the lower the resolution. While you don’t hear this as a quality reduction (because you are used to it), in fact it is one, and it leads to problems in perception. You probably know this effect from soprano singers, where you have difficulty understanding the sung text. A soprano sings so high that your perception does not manage to get the formant positions correctly, which in turn hinders understanding.

Now the EnvF0. Here we try to find the formants from the sampled filter envelope. We don’t do this for understanding (except, for example, in speech recognition) but for sound modification. If we use

EnvF0 == SndF0

the spectral envelope will gather all the details that are available, so you get the best quality for all transformations. Let’s assume we want to do transposition. If we don’t transpose, all errors will be compensated, so it does not matter. If we transpose up, the necessary resolution is reduced, so we have not lost anything, and to a first approximation you don’t perceive problems due to the sub-sampling of the envelope. If you transpose down, you would need more resolution to construct the features of a voice with the new pitch; you cannot get that, and the sound will be perceived as strange. Most of the time this generates an effect that resembles a voice produced while pinching the nose closed with the fingers from both sides. The more you transpose down from the original F0, the stronger the effect will be.

Now if you choose

EnvF0 > SndF0

problems with the effect of the closed nose will start earlier and will be stronger.

if you want to be very clever and choose

EnvF0 < SndF0

so you want to extract more resolution than there is, then you will get even more problems, because the filter you estimate now also encodes the partial positions (which it should not), and when you transpose (up or down) you will get additional timbre modulations depending on the transposition you choose.

Obviously, all these are only approximations, but they should be good for explaining the principal effects.

Best
Axel

Thanks for such a detailed clarification Axel, this is very helpful.

For the case of doing cross-synthesis using the supervp.sourcefilter (instead of transposition as in your example), would you recommend setting the maximal f0 for envelope estimation as the ‘fundamental frequency’ of the source and filter material?

And if either the source or filter material has melodic content so the fundamental frequency is constantly changing, is there an f0 that is ideal?

For example, when using a melodic solo guitar part as the filter for an orchestral texture source? Or is supervp.sourcefilter better suited to using tonal<>noise or noise<>tonal source/filter combinations?

Hello @srs

For the case of doing cross-synthesis using the supervp.sourcefilter (instead of transposition as in your example), would you recommend setting the maximal f0 for envelope estimation as the ‘fundamental frequency’ of the source and filter material?

The envF0 flag always needs to be adapted to the sound for which you want to estimate the filter component of a source-filter model. When you work with the source-filter model, all sounds are seen as composed of source and filter. For your example you would normally first remove the filter component of the source sound to get only the source component of the source sound. For this you would set the envF0 parameter to the F0 of the source sound. Then, to extract the filter component of the filtering sound, you would estimate the filter using the F0 of the filter sound as the envF0 parameter. Then for the filtering itself you don’t need any other parameter, because the filtering works independently of any estimation.

Now to the question of the dynamic F0s. As explained above, you lose details if you are too high, and you risk artifacts that may become rather extreme if you are too low. So you would better stay at the upper end; I often position the envF0 parameter 30% below the maximum F0, because at 30% too low you generally don’t hear any artifacts.
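
As a minimal sketch of that recipe (the f0 track is a hypothetical result of whatever F0 analysis you use):

import numpy as np

f0_track = np.array([180.0, 220.0, 310.0, 290.0])  # hypothetical F0 analysis of the sound, in Hz
env_f0 = 0.7 * f0_track.max()                      # ~30% below the maximum F0: 217 Hz here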

For example, when using a melodic solo guitar part as the filter for an orchestral texture source? Or is supervp.sourcefilter better suited to using tonal<>noise or noise<>tonal source/filter combinations?

No. While the parameter settings are rather uncritical if you want to extract the filter component of a noise sound, with a bit more effort invested into parameter selection all these effects work exactly the same with tonal sounds. For noise sources it is better (mathematically speaking) to use the LPC estimator to extract the envelope (the filter component). For estimating envelopes of orchestral sounds you need to see whether there is one instrument standing out (in which case true env would be better) or whether the partials are rather dense (in which case you can favor LPC).

If you want to apply the guitar as a filter to the orchestral sound (and you don’t want to remove the sound color of the orchestral sound), you only need to consider envelope estimation for the guitar.

Best
Axel