
Onset detection based on mel flux

Here is a short flute arpeggio which is really difficult to segment.

I needed to lower onseg.threshold to 100; then I got most of the markers at the beginning of each note. Perhaps that is quite similar to the melodica.

But it's really sensitive to the kind of input and to the amplitude. I guess there are a few more tweaks available which would make the detection more adaptable to different source material.

Detection is definitely challenging with wind-based instruments with continuous changes between notes. I’d like to give Sequenza a try, but first I am just trying to get my head around basic tests like the flute arpeggio here.

Here’s a cleaner version of the segmenting test patch if interested. I separated the attrui into FFT / YIN / MEL columns for better quick comparisons, and can add more to MEL for sure. Which are most pertinent?

I will try the flute arpeggio too and can propose a few other short sound files. And perhaps we can compare notes on which segmentation and calibration work best?

Segmenting-test-cleaner.maxpat (122.5 KB)

I can say, for instance, that yin~ segmentation with the default values and only the slightest tweak (yin threshold lowered to 0.15) seems to segment this flute line perfectly.

I agree, with the flute arpeggio yin~ was best (mel 2nd, standard onset 3rd):



Very nice testing patch, by the way!

For repeated notes with the same pitch, using this file

the yin came out a little bit worse (skipped an onset), and the FFT performed well, because the repetition adds a transient which is more like an energy-type onset. The mel also captured all the onsets, but was a little bit trigger-happy at the end and sliced the last note.



I have not yet concluded which would be the best compromise for capturing many different kinds of material. But for now:

  • The standard FFT is quite reliable, but tends to underperform when there is no clear transient in the onset.
  • The yin works very well for pitch related transitions, but tends to skip some onsets where there is no pitch change.
  • The mel seems to work better than FFT on continuous pitch changes, but not as well as yin. However, mel captures more onsets that are not related to pitch change than yin does.

Another example here, with REALLY encouraging results via MEL separation: a passage of overlapping notes and dyads on Fender Rhodes

You can see that the YIN analysis is useless for segmentation, and FFT gets some attacks but not all, while the mel flux segmentation does the job very well:

Wow, that is encouraging!

Hi Notto and Chris, fantastic thread, thanks for the amazing contributions!
I mean to write a “segmentation lab” patch to compare several methods, can I base it on your great patch, @ctrapani?
I’d also like to add numeric analysis of segmentation quality based on human ground truth. Would you mind sending me a hand-corrected “perfect” segmentation of your respective example sounds?
Cheers!

Hi Diemo,

Happy to know the thread is useful to you. Here is my ground truth for the two flute examples.

Octave arpeggio:
Onset markers at 0, 590, 1020, 1495, 1920, 2370, 2820, 3270, 3690, 4120, 4580, 5020, 5410, 5870, 6310

This example has a final breath intake at around 6850 which the mel onset interpreted as a new final onset. I would say it is not necessarily “wrong” to isolate the inhalation as its own onset, but with the flute note still reverberating in the background it makes more sense to ignore this.

Same notes repetition:
Onset markers at 0, 400, 620, 880, 1140, 1380, 1640, 1920, 2170, 2660, 2890, 3150
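For the numeric comparison against ground truth, here is a minimal sketch of how a detector's markers could be scored against hand-made markers like these. Each ground-truth onset is greedily matched to the nearest unused detection within a tolerance, and precision, recall and F-measure follow from the counts. The function name `evaluateOnsets`, the 50 ms tolerance and the reading of the marker times as milliseconds are assumptions for illustration. The detected list is hypothetical (the octave arpeggio ground truth plus the extra breath marker near 6850 described above), not real detector output.

```javascript
// Sketch only: score detected onsets against ground-truth onsets (times in ms).
// Greedy matching: each ground-truth marker takes the nearest unused detection
// that lies within toleranceMs; every detection can be used at most once.
function evaluateOnsets(truth, detected, toleranceMs)
{
    var used = [];   // used[j] becomes true once detected[j] has been matched
    var hits = 0;

    for (var i = 0; i < truth.length; i++)
    {
        var best = -1;
        var bestDist = toleranceMs;
        for (var j = 0; j < detected.length; j++)
        {
            var dist = Math.abs(detected[j] - truth[i]);
            if (!used[j] && dist <= bestDist)
            {
                best = j;
                bestDist = dist;
            }
        }
        if (best >= 0)
        {
            used[best] = true;
            hits++;
        }
    }

    var precision = detected.length > 0 ? hits / detected.length : 0;
    var recall = truth.length > 0 ? hits / truth.length : 0;
    var fmeasure = (precision + recall) > 0
        ? 2 * precision * recall / (precision + recall) : 0;

    return {
        hits: hits,
        misses: truth.length - hits,             // ground-truth onsets not found
        falsePositives: detected.length - hits,  // detections matching nothing
        precision: precision,
        recall: recall,
        fmeasure: fmeasure
    };
}

// Ground truth for the octave arpeggio, copied from the post above
var groundTruth = [0, 590, 1020, 1495, 1920, 2370, 2820, 3270, 3690, 4120,
                   4580, 5020, 5410, 5870, 6310];
// HYPOTHETICAL detection: the same markers plus one at the final breath intake
var detectedExample = groundTruth.concat([6850]);
var score = evaluateOnsets(groundTruth, detectedExample, 50);
```

With a score like this, the extra breath marker shows up as one false positive while recall stays at 1, which matches the "not necessarily wrong, but better ignored" judgement above.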

Thanks! If you still have it in the patch, you can just do writetrack and save as sdif or txt.

I looked into some details of onseg:
First, it is astonishing that it works so well, even without using the abs delta. You can actually compare this with onseg.odfmode square or rms which use the square difference to the median.
Second, the sum hides a lot of information from onseg. Without it, each band’s difference to its median is taken and then summed (onseg.numcols starting at onseg.colindex).
More to come…
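To make the point about the sum concrete, here is a rough sketch of one reading of the description above. It is not the actual onseg code, and the names `median`, `bandDeltas`, `odfSigned` and `odfSquare` are made up for illustration; onseg's real formulas (normalisation, handling of negative deltas, and so on) may well differ. Each band's current value is compared to that band's median over the preceding frames, and the per-band deltas are then combined either as a plain signed sum or as a sum of squares.

```javascript
// Sketch only, NOT the onseg implementation.
// Median of an array of numbers (the input array is left untouched).
function median(values)
{
    var sorted = values.slice().sort(function (a, b) { return a - b; });
    var mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// history: array of previous frames, each an array of band values
// frame:   current frame (same number of bands)
// returns: per-band difference between the current value and the band median
function bandDeltas(history, frame)
{
    var deltas = [];
    for (var b = 0; b < frame.length; b++)
    {
        var column = [];
        for (var t = 0; t < history.length; t++)
            column.push(history[t][b]);
        deltas.push(frame[b] - median(column));
    }
    return deltas;
}

// Signed sum: opposite changes in different bands cancel out
function odfSigned(deltas)
{
    var sum = 0;
    for (var b = 0; b < deltas.length; b++)
        sum += deltas[b];
    return sum;
}

// Sum of squares: every band that moves contributes, whatever the direction
function odfSquare(deltas)
{
    var sum = 0;
    for (var b = 0; b < deltas.length; b++)
        sum += deltas[b] * deltas[b];
    return sum;
}

// Invented two-band example: band 0 rises by 3 while band 1 falls by 3
var exampleHistory = [[1, 5], [1, 5], [1, 5]];
var exampleDeltas = bandDeltas(exampleHistory, [4, 2]);
var signedValue = odfSigned(exampleDeltas);  // the two changes cancel
var squareValue = odfSquare(exampleDeltas);  // both changes are counted
```

In this toy case the signed sum sees no change at all, while the squared version does, which is the kind of information a plain sum can hide when energy moves from one band to another.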

Dear Diemo, of course it is fine with me to use my patch as a starting point.

Here is the .mubu and a screenshot of adjusted Mel parameters for the Rhodes example.

Rhodes-Mel-segmentation.mubu (1.0 MB)

Regarding yin segmentation: Actually I am surprised that the flute example works so well because I thought that onseg was only looking for upward changes when pointed to the frequency column (and thought I had verified that!).

Can I propose a tougher example? I have trouble with this one and would be curious to know if either of you have suggestions or success:

Hi Chris,

That last one was tough! I didn’t get satisfactory results with any of the detectors.

Right now, I’m on the verge of going full circle and concluding that, for all its flaws, the standard FFT detector is the best compromise when dealing with large amounts of audio, because the yin and mel based detectors are more specialised for the types of onsets which FFT can’t handle.

Wouldn’t it be great to have a machine learning algorithm that could decide which algorithms to use based on categories of input? Now that’s a project…

Thanks for having a look. Yes, this is where it gets complicated… Maybe another option, more within reach than machine learning, would be some system that performs different types of analyses and compares results - confirming clear onsets when data points coincide or rejecting those that seem not to work.
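As a rough sketch of that idea: pool the markers from all detectors, group markers that fall within a small tolerance of each other, and keep a group only if enough different detectors agree on it. The function name `fuseOnsets`, the 30 ms tolerance and the marker lists below are all invented for illustration; nothing here comes from the patch.

```javascript
// Sketch only: fuse onset lists from several detectors by coincidence voting.
// detections: object mapping a detector name to an array of onset times (ms)
// returns: array of { time, votes } for groups confirmed by >= minVotes detectors
function fuseOnsets(detections, toleranceMs, minVotes)
{
    // Pool all markers into one time-sorted list, remembering their source
    var all = [];
    for (var name in detections)
    {
        if (!detections.hasOwnProperty(name))
            continue;
        for (var k = 0; k < detections[name].length; k++)
            all.push({ time: detections[name][k], source: name });
    }
    all.sort(function (a, b) { return a.time - b.time; });

    var fused = [];
    var i = 0;
    while (i < all.length)
    {
        // Collect every marker within toleranceMs of the first one in the group
        var start = all[i].time;
        var sum = 0, count = 0, votes = 0, seen = {};
        while (i < all.length && all[i].time - start <= toleranceMs)
        {
            sum += all[i].time;
            count++;
            if (!seen[all[i].source])   // one vote per detector, not per marker
            {
                seen[all[i].source] = true;
                votes++;
            }
            i++;
        }
        if (votes >= minVotes)
            fused.push({ time: sum / count, votes: votes });
    }
    return fused;
}

// HYPOTHETICAL marker lists, invented for illustration only
var exampleDetections = {
    fft: [0, 400, 900, 2000],
    yin: [10, 910, 1500],
    mel: [5, 405, 905, 1495]
};
var confirmed = fuseOnsets(exampleDetections, 30, 2);  // at least two detectors agree
var unanimous = fuseOnsets(exampleDetections, 30, 3);  // all three detectors agree
```

Raising `minVotes` trades recall for fewer false positives: a marker seen by only one detector (2000 in the invented data) is rejected as soon as two votes are required.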

For me, the ideal would be to somehow replicate the slider in AudioSculpt:

Of course this is kind of equivalent to changing the threshold for onseg segmentation, but the dream would be a slider that allows one to move between combined (weighted!?) data from FFT / YIN / and spectral differencing analyses…

Yep, that’s called fusion. Would it be a workable first step to combine all of the segmentations into one stream? In your experience, does yinseg catch stuff which loudness doesn’t, or would it produce too many false positives for too many sound classes?

Based on my limited experience, yin tends to capture less overall than loudness, but sometimes it captures continuous pitch changes which loudness ignores.

The mel detection algorithm, on the other hand, is for now trickier to control. There is a very fine line (dependent on threshold) between capturing many false positives and capturing nothing at all.

Exactly: yinseg can catch slurred or continuous notes better than loudness detection. However, yinseg is not useful if there is contrapuntal motion, for instance, whereas mel can catch changes in individual voices.

If we can get all segmentations into one stream, that would be helpful, but weeding out the false positives will be tricky.

One idea might be to create a series of defined presets: “midrange percussive,” “treble legato,” “contrapuntal” that would give different weights to the various detection algorithms.
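A minimal sketch of how such presets could work: each preset assigns a weight to every detector, a candidate time collects the weights of all detectors that placed a marker near it, and a single `minScore` value then plays the role of the slider. The preset names are the ones proposed above, but every weight, the helper names `hasOnsetNear` and `selectOnsets`, and the marker lists are invented for illustration.

```javascript
// Sketch only: the weights are invented, not tuned on any material
var presets = {
    "midrange percussive": { fft: 1.0, yin: 0.2, mel: 0.5 },
    "treble legato":       { fft: 0.3, yin: 1.0, mel: 0.6 },
    "contrapuntal":        { fft: 0.5, yin: 0.0, mel: 1.0 }
};

// True if the onset list has a marker within toleranceMs of timeMs
function hasOnsetNear(onsets, timeMs, toleranceMs)
{
    for (var i = 0; i < onsets.length; i++)
    {
        if (Math.abs(onsets[i] - timeMs) <= toleranceMs)
            return true;
    }
    return false;
}

// Keep the candidate times whose weighted detector support reaches minScore
function selectOnsets(candidates, detections, weights, toleranceMs, minScore)
{
    var kept = [];
    for (var c = 0; c < candidates.length; c++)
    {
        var score = 0;
        for (var name in weights)
        {
            if (weights.hasOwnProperty(name) && detections[name]
                && hasOnsetNear(detections[name], candidates[c], toleranceMs))
                score += weights[name];
        }
        if (score >= minScore)
            kept.push(candidates[c]);
    }
    return kept;
}

// HYPOTHETICAL marker lists (ms), invented for illustration only
var presetDetections = {
    fft: [0, 500],
    yin: [0, 1000],
    mel: [0, 500, 1000, 1500]
};
var presetCandidates = [0, 500, 1000, 1500];

var contrapuntalOnsets = selectOnsets(presetCandidates, presetDetections,
                                      presets["contrapuntal"], 20, 1.2);
var legatoOnsets = selectOnsets(presetCandidates, presetDetections,
                                presets["treble legato"], 20, 1.2);
```

With the same invented markers, the two presets keep different onsets, and lowering `minScore` lets more of them through, much like moving a threshold slider.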

So here is a first version of the Segmentation Laboratory:


Here you can run 4 segmenters, and look at the data in various processing steps. It should use only features in v1.9.14, but we found that the editor is kind of stubborn, so you’ll have to resend the view setup quite often.

segmentation-lab.maxpat (244.0 KB)
attrchanged.js (1.1 KB) [EDIT: this file doesn’t seem to be downloadable as .js, here is its source:]

// attrchanged.js: listen to change in obj of given list of attrs

function valuechanged (data)
{
    if (data.attrname)
    {
	outlet(0, data.attrname, data.value);
    }
}

function loadbang ()
{
    this.bang();
}

function bang ()
{
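    if (jsarguments.length < 2)
    {
	error("first argument must be scripting name of listened to object");
	return; // nothing to listen to without a scripting name
    }

    var ob = this.patcher.getnamed(jsarguments[1]); // get the object in 1st arg
    if (!ob)
    {
	error("no object with scripting name "+ jsarguments[1]);
	return; // avoid calling methods on a missing object
    }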
    var obattrs = ob.getattrnames();
    var listento = [];
    var listeners = [];

    if (jsarguments.length == 2)
	listento = obattrs; // no attr arg: listen to all
    else
	listento = jsarguments.slice(2);

    for (var i = 0; i < listento.length; i++)
    {
	if (obattrs.indexOf(listento[i]) >= 0)
	    listeners[i] = new MaxobjListener(ob, listento[i], valuechanged);
	else
	    error(ob.maxclass +" object "+ jsarguments[1] +" does not have attribute "+ listento[i]);
    }
    gc();
}

Have fun!
…Diemo