Retrieving frequency and amplitude data from "niveau de gris" (grayscale) sonograms

Hi all,
with an SVP 1GB4 analysis I’ve extracted the frequency and amplitude sonograms of a sound (see attachments). The sonograms are grayscale images, so I assume the frequency and amplitude floats are scaled into a range between 0.0 and 1.0. I would like to know what calculation can be applied to these floats (encoded as pixels in the images) in order to retrieve the frequency and amplitude data from the “niveau de gris” (gray level). In other words, I need to know which formulas perform the correct reverse-engineering operations to get from the “niveau de gris” back to frequency and amplitude.
Best,
Francesco

Dear Francesco,

I don’t quite understand what you are doing here. The 1GB4 files contain amplitude and frequency values that are perfectly readable, e.g. with sdiftotext, so why would you first encode these values into a grayscale image and then ask us how to get them back? Basically, if you put these values into a grayscale image yourself, simply dump the values before you transform them into an image. If it was not you who created the images, then I am sorry to tell you that I see no way to work out how the values were transformed (frequencies, in fact, are not between 0 and 1 originally). Depending on the quantization that was used to transform values into pixels, it is not even certain that the small differences between neighboring frequency bins are resolved at all.
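
(For the dump, a typical invocation would be something like `sdiftotext analysis.sdif analysis.txt`, though the exact options depend on your version of the SDIF command-line tools.)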

Best
Axel

Dear Axel,
thanks for your message. You ask: why would I first encode these values in a grayscale image only to get them back? Because I’m manipulating sound in a visual way, on the basis of its sonogram: this is an idea on which AudioSculpt itself is grounded. By the way: isn’t it a resynthesis of sound from pixel data that we get when copying an “instantaneous spectrum” snippet and pasting it into a new soundfile? How do AS or SVP work in that case? This is what I was asking, reformulated from a different point of view. And then, if you say that for the frequency data the reverse process is problematic, what about amplitude? I read somewhere that sound amplitude = ((pixel height - (image height / 2)) / image height) * 2,147,483,648 (32-bit sound resolution). Is it correct to apply this calculation to the amplitude sonogram to get the amplitude values back? I hope that what I’m saying doesn’t sound totally unreasonable to you.
Best,
Francesco

> thanks for your message. You ask: why would I first encode these values in a grayscale image only to get them back? Because I’m manipulating sound in a visual way, on the basis of its sonogram: this is an idea on which AudioSculpt itself is grounded. By the way: isn’t it a resynthesis of sound from pixel data that we get when copying an “instantaneous spectrum” snippet and pasting it into a new soundfile?

What you see is not what is happening. Indeed, you are correct that graphical manipulation of the spectrogram is the fundamental idea of AS. But internally we always keep the precise amplitude and frequency values of the spectrogram, and whatever you do on the images is then performed internally, not by decoding the images, but by directly applying the equivalent operations to the internal data.

Please note as well that we have never shown an image of the internal frequency code used for the phase vocoder; these values are very hard to map to anything perceptually meaningful.

> …somewhere that sound amplitude = ((pixel height - (image height / 2)) / image height) * 2,147,483,648 (32-bit sound resolution). Is it correct to apply this calculation to the amplitude sonogram to get the amplitude values back? I hope that what I’m saying doesn’t sound totally unreasonable to you.

No, this formula is not correct. I don’t see what “pixel height” would be here, nor what image height and width would do in there anyway! In AS we transform amplitude into dB, then scale the maximum amplitude (which is 1 in our case) to 0 dB and cut everything below -116 dB.
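
To illustrate the kind of mapping I mean, here is a minimal sketch in Python of one possible amplitude-to-gray-level encoding (just an illustration, not the actual AS code; only the -116 dB floor comes from what I described above):

```python
import numpy as np

def amp_to_gray(amps, floor_db=-116.0):
    """One possible encoding: linear amplitude -> dB -> gray level in [0, 1].

    The maximum amplitude (1.0) maps to 0 dB, values below floor_db are
    clipped, and the remaining dB range is rescaled so that 0 dB becomes 1.0.
    """
    db = 20.0 * np.log10(np.maximum(amps, 1e-12))  # avoid log(0)
    db = np.clip(db, floor_db, 0.0)                # cut everything below -116 dB
    return (db - floor_db) / -floor_db             # rescale to [0, 1]
```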

As I said, there is no standard for translating spectrogram data into image data. Given the total freedom in mapping amplitude to gray level, it is impossible to invert this process!

Best
Axel

Thanks Axel. You gave me a new insight (not provided by the official documentation) into how AS actually works when manipulating spectrograms. You say: “I don’t see what pixel height would be here and what image height and width would do in there anyway!” In the images attached to my first post, you can easily check that the image width = numframes (138 in this example) and the image height = 513 (the number of bins, given that the 1GB4 analysis is set to 1024). So width and height do carry some information (as happens also with the frequency image). Would the formula for sound amplitude still be useless even having specified what image height and width are? The “pixel height” is the brightness value of the pixel, and the higher the brightness, the higher the amplitude values in the 1GB4 analysis. So, even with these specifications, do you still think the above formula is completely meaningless?
Thanks again,
Francesco

Dear Francesco,

I still don’t see what you are trying to achieve. You started with a sound and a 1GB4 analysis, so why do you then want to pass via an image?

Anyway - you gave this formula:

> sound amplitude = ((pixel height - (image height / 2)) / image height) * 2,147,483,648 (32-bit sound resolution)

If “pixel height” is the luminance value of a pixel, then this is the only value in your equation that has a relation to the amplitude, as you already described:

> The “pixel height” is the brightness value of the pixel, and the higher the brightness, the higher the amplitude values in the 1GB4 analysis

Subtracting half the image height and then dividing by the image height is certainly not justifiable. The scaling factor 2,147,483,648 is completely arbitrary. In fact, the absolute scaling of the spectrogram amplitudes is an arbitrary factor selected by the programmer of the underlying code and can therefore not be retrieved from the image (which has been scaled to obtain the limitation to 1 later anyway). The only information you can get (assuming there is no logarithm (dB scale) or squaring (power spectrum) involved) is the relative amplitude levels of the different bins and frames: you can set the maximum amplitude level to 1 and derive all other amplitudes relative to it according to the gray level. Any error in the maximum would only scale the volume of the result.

So this gives you the formula

A(k,n) = H(k,n)

where A is the amplitude level, H is the “pixel height”, k is the bin index, and n the frame index (time position).

For completeness: if the image shows a power spectrum, you would need to do

A(k,n) = sqrt(H(k,n))

and if it shows a dB spectrum (which is somewhat unlikely)

A(k,n) = 10**((H(k,n)-H_max)/20)

where H_max would be the maximum “pixel height” in the whole image.
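
In Python, these three decoding cases might look like this (a sketch assuming H is a 2-D array of gray levels, bins along the rows and frames along the columns):

```python
import numpy as np

def decode_linear(H):
    # gray level taken directly as the relative amplitude level
    return H

def decode_power(H):
    # the image shows a power spectrum: relative amplitude is the square root
    return np.sqrt(H)

def decode_db(H):
    # the image shows a dB spectrum: undo the log, relative to the brightest
    # pixel (an unknown dB-per-gray-level factor would still remain)
    return 10.0 ** ((H - H.max()) / 20.0)
```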

Unfortunately, now you need to know where k and n are in the real world. The image size may inform you about the frequency and time location of the pixels, but only if you know the original sound’s sample rate, duration, and analysis window size and overlap!
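
As a small sketch of that mapping (assuming you do know the sample rate, the FFT size, and the analysis step in samples; the names here are just illustrative):

```python
def bin_to_hz(k, sr, fft_size):
    # center frequency of bin k; e.g. 513 image rows correspond to
    # fft_size = 1024 (bins 0 .. fft_size / 2)
    return k * sr / fft_size

def frame_to_seconds(n, hop_samples, sr):
    # time position of frame n, given the analysis step (hop) in samples
    return n * hop_samples / sr
```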

With respect to the frequency part of the 1GB4 representation, you could search for a scaling factor which, applied to the average value of the highest line in your image, yields a value equal to sr/2; using this scaling factor you can then scale all other frequencies accordingly. The overlap you may retrieve using the autocovariance, in the time direction, of the highest image pixels: the autocovariance is maximal at zero delay and should settle to a value close to zero once you time-shift further than the overlap. The window size you could then guess if you know the analysis step (which you get by dividing the sound length by the image width). Like this - with a lot of guessing - you may get a more or less correct 1GB4 equivalent of your images, which may allow you to recreate a 1GB4 file from which you can get back to the sound.
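
A rough sketch of this guessing procedure in Python (assuming F_img and A_img are the frequency and amplitude images as 2-D arrays with the highest bin in the last row; the threshold and all names are just illustrative):

```python
import numpy as np

def freq_scale_factor(F_img, sr):
    # scale factor that maps the average of the highest line to sr / 2
    return (sr / 2.0) / F_img[-1, :].mean()

def hop_in_samples(num_sound_samples, image_width):
    # analysis step: sound length in samples divided by the number of frames
    return num_sound_samples / image_width

def estimate_overlap_frames(A_img, threshold=0.1):
    # autocovariance of the highest image row along time: maximal at lag 0,
    # it should settle near zero for lags beyond the analysis overlap
    x = A_img[-1, :] - A_img[-1, :].mean()
    acov = np.correlate(x, x, mode='full')[x.size - 1:]
    acov /= acov[0]
    below = np.where(acov < threshold)[0]
    return int(below[0]) if below.size else None
```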

Good luck,
Axel

Great Axel,
your articulate and brilliant response shows me exactly the direction I need to follow, and, in spite of your opening remark (“I still don’t see what you are trying to achieve”), you seem to fully understand the problem: as you said, the point is getting a more or less correct 1GB4 equivalent of the images, from which one can get back to the sound. This operation would test the correctness of the whole “sound to sonogram and back” chain of encoding and decoding. You ask why I want to pass via an image. The answer: even if AS doesn’t allow, as you said, the direct visual manipulation of sounds in terms of pixels, SVP, in all its power, can be used to perform such experiments (I’ve done many). One last question: do you think the 1GB4 analysis is the most appropriate for this task, or would you recommend other analysis types, like 1GB5? Would the reconstruction process be easier and more correct, or not?
All the best,
Francesco

Dear Axel,
to be more precise about the direct visual manipulation of sounds in terms of pixels, I can cite as a reference (attached here) section 8.5 in chapter 8 of Diemo Schwarz’s 1997/1998 thesis “Spectral Envelopes in Sound Analysis and Synthesis” (supervised by X. Rodet). I guess these experiments were done at IRCAM more than 20 years ago!

Hi Francesco,

since I know many composers, I tried to guess the compositional context of the question and replied to the question that seemed to emerge from it - I am happy to hear that my guessing was apparently correct.

The usual problem with such image examples is the creation of the phase. Here, with frequency, you get at least some sort of horizontal phase coherence, even if the vertical phase coherence will be completely missing. If you would like to try something else, then better use 1GB3, but you would need to construct the phase from the amplitude. There are algorithms for this, but we don’t have anything publicly available.
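
(For reference: one classic, publicly documented family of such algorithms is Griffin-Lim style iterative phase estimation - not what we use internally, but e.g. librosa ships an implementation. A minimal sketch, assuming the analysis parameters of the magnitude spectrogram are known:)

```python
import numpy as np
import librosa

# Build a magnitude spectrogram from a test sound, throw away the phase,
# and reconstruct audio by iteratively estimating a phase for it.
y, sr = librosa.load(librosa.example('trumpet'))
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
y_rec = librosa.griffinlim(S, n_iter=32, n_fft=1024, hop_length=256)
```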

Best
Axel

Dear Francesco,

> Diemo extract…
yes, but if you reread the section carefully you will see that Diemo discusses the creation of the spectral envelope (in the sense of a filter) and not the complete spectrum including phases. You may also note that Diemo uses normal images and tries to convert them to envelopes (i.e. filters); he does not start from a spectrogram including frequency, trying to reproduce the original sound. Still, I admit that this quite complex process could create interesting effects on sounds.

BTW, if you want to use an image as a filter you can just use AS - this effect has been available for about 10 years under Processing -> Image filter!

Best
Axel

Dear Axel,
you’re completely right to point out that Schwarz discusses the “standard way” of manipulating sounds visually, i.e. using the image as a filter mask - this is the method popularized by E. Wenger, R. D. James, and so on. The image filter in AS is the clever and serious version of the many toy programs we all know. The method I’ve chosen shares with Diemo’s only the premise (the possibility of translating pixel data into frequency and amplitude data) and is more direct and radical - but, after all, we can’t remain stuck with the same 20-year-old tricks!
Best,
Francesco