Onset detection based on mel flux

I am trying to find out if there is a way to create a PiPo chain that detects onsets based on transitions in the distribution of energy across mel frequency bands. The help file provides the standard example of slice:fft:sum:scale:onseg, which is basically a loudness-based onset detection algorithm. I have also seen slice:yin:scale:onseg. However, these algorithms are not optimized to detect transitions where the overall spectral energy remains quite even but there is clearly a perceptible “edge”, such as in note transitions.

I tried mel:scale:onseg but there is clearly a qualifier missing here. I do not want to sum the mel values, I want the detector to look for sudden shifts in the mel band distribution (either direction).

Hope someone can point me in the right direction!

Hi Notto,

great question! What comes close is called spectral flux, i.e. the absolute difference between spectral envelopes from one frame to the next. pipo.onseg will be able to detect peaks of spectral flux.
So what you’d need here is pipo.delta. You also might need to exclude the first MFCC value (energy) via @onseg.colindex 1 and maybe work in log domain:


The missing link in the current MuBu version is the abs; I can provide you with a prerelease for this.
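To make the idea of spectral flux concrete, here is a minimal Python sketch of the definition given above: the absolute difference between band envelopes from one frame to the next, summed over bands. It is not the pipo implementation. The function name `spectral_flux`, the list-of-lists frame layout, and the `floor` value used to avoid log(0) are all assumptions made for illustration.

```python
import math

def spectral_flux(frames, use_log=True, floor=1e-10):
    """Return one flux value per frame transition.

    frames: list of equal-length lists of non-negative band energies
    (for example mel-band energies). This layout is an assumption.
    """
    if use_log:
        # Optionally work in the log domain; floor avoids log(0).
        frames = [[math.log(max(v, floor)) for v in frame] for frame in frames]
    flux = []
    for prev, cur in zip(frames, frames[1:]):
        # Absolute difference per band, so shifts in either direction count.
        flux.append(sum(abs(c - p) for p, c in zip(prev, cur)))
    return flux

# Toy frames with equal total energy but a different band distribution.
frames = [
    [1.0, 4.0, 1.0],
    [1.0, 4.0, 1.0],
    [4.0, 1.0, 1.0],
]
print(spectral_flux(frames, use_log=False))  # [0.0, 6.0]
```

The last transition keeps the summed energy constant, so a loudness-based detector would see nothing, while the flux value jumps because the energy moved between bands.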

This would make a great example, do you have a nice sound example for this?


Hi Diemo,

Yes, that sounds great. I could send you a sound example. What I have right now is recorded on the computer’s internal mic so the sound quality is poor. Should I rerecord with a better mic first?

I wonder though, wouldn’t mel:scale:delta:onseg be the equivalent process for mel frequency bands instead of MFCC bands?


This seems similar to something I have also been searching for: an implementation of AudioSculpt-style spectral differencing for pipo segmentation — so I am following with great interest and would be happy to be involved with testing examples too!

Hi @schwarz @ctrapani

I got it working with a little extra help from this paper.

My solution for this mubu.process object for now is:

mubu.process melfluxonsets audio slice:fft:bands:delta:sum:onseg @name markers @process 0 @progressoutput input @prepad 3000 @slice.wind blackman @slice.norm power @slice.size 1024 @slice.hop 64 @fft.mode power @bands.mode mel @bands.num 32 @bands.log 1 @delta.size 5 @delta.normalize 0 @sum.colname Loudness @onseg.filtersize 21 @onseg.colindex 0 @onseg.numcols -1 @onseg.duration 1 @onseg.max 0 @onseg.min 0 @onseg.mean 0 @onseg.stddev 0 @onseg.threshold 300 @onseg.offthresh -2000 @onseg.durthresh 100 @onseg.mininter 100 @onseg.startisonset 1 @onseg.maxsize 15000 @info gui "interface markers, autobounds 1, paramcols Cue Label Duration"
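For readers who want to reason about the last stage outside Max, the following Python sketch shows one simple way to turn a flux curve into onset times: a threshold crossing combined with a minimum inter-onset interval. It is only loosely analogous to the @onseg.threshold and @onseg.mininter attributes above and does not reproduce pipo.onseg's filtering or off-threshold logic. The function `pick_onsets`, its parameter names, and the toy values are hypothetical.

```python
def pick_onsets(flux, hop_ms, threshold, min_inter_ms):
    """Return onset times in ms where flux rises above threshold.

    A new onset is accepted only if at least min_inter_ms have passed
    since the previous accepted onset.
    """
    onsets = []
    above = False
    for i, value in enumerate(flux):
        t = i * hop_ms
        # Only react to an upward crossing, not to every frame above threshold.
        if value > threshold and not above:
            if not onsets or t - onsets[-1] >= min_inter_ms:
                onsets.append(t)
        above = value > threshold
    return onsets

# Toy flux curve with three peaks; the second is too close to the first.
flux = [0, 0, 5, 6, 1, 0, 7, 0, 0, 0, 0, 8, 0]
print(pick_onsets(flux, hop_ms=10, threshold=4, min_inter_ms=50))  # [20, 110]
```

In this toy run the peak at 60 ms is rejected because it follows the onset at 20 ms by less than the minimum interval, which is the kind of behaviour a minimum inter-onset setting is meant to give.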

@ctrapani Do you have any testing examples of your own?

What about one of the Sequenzas by Berio?

Here is a short sound file I use for testing: six notes played on an out-of-tune melodica. Even with this I have trouble getting satisfactory results, particularly with yin.

melodica-1.aif (794.1 KB)

I also attach a version of the segmenting patch I have been using. There are options for an onseg algorithm using yin for pitch segmentation and a second one for fft attack detection — and I have hastily added a third option with the melflux arguments from @sirnotto.

Segmenting-test.maxpat (137.7 KB)

I will experiment with mel flux parameters and do some tweaking. Would be curious to compare notes — C

Here is a short flute arpeggio which is really difficult to segment.

I needed to lower onseg.threshold to 100; then I got most of the markers at the beginning of each note. Perhaps that is quite similar to the melodica.

But it's really sensitive to the kind of input and its amplitude. I guess there are a few more tweaks available which will make the detection more adaptable to different source material.

Detection is definitely challenging with wind-based instruments with continuous changes between notes. I’d like to give Sequenza a try, but first I am just trying to get my head around basic tests like the flute arpeggio here.

Here’s a cleaner version of the segmenting test patch if interested. I separated the attrui into FFT / YIN / MEL columns for better quick comparisons, and can add more to MEL for sure. Which are most pertinent?

I will try the flute arpeggio too and can propose a few other short sound files. And perhaps we can compare notes on which segmentation and calibration work best?

Segmenting-test-cleaner.maxpat (122.5 KB)

I can say, for instance, that yin~ segmentation with the default values and only the slightest tweak (yin threshold lowered to 0.15) seems to segment this flute line perfectly.

I agree, with the flute arpeggio yin~ was best (mel 2nd, standard onset 3rd):

Very nice testing patch, by the way!

For repeated notes with the same pitch, using this file

the yin came out a little bit worse (it skipped an onset), and the fft performed well (because the repetition adds a transient which is more like an energy-type onset). The mel also captured all the onsets, but was a little bit trigger-happy at the end and sliced the last note.

I have not yet concluded which would be a best compromise for capturing many different kinds of material. But for now:

  • The standard FFT is quite reliable, but tends to underperform when there is no clear transient in the onset.
  • The yin works very well for pitch-related transitions, but tends to skip some onsets where there is no pitch change.
  • The mel seems to work better than FFT on continuous pitch changes, but not as well as yin. However, mel captures more onsets that are not related to pitch change than yin does.

Another example here, with REALLY encouraging results via MEL separation: a passage of overlapping notes and dyads on Fender Rhodes

You can see that the YIN analysis is useless for segmentation, and FFT gets some attacks but not all, but the mel flux segmentation does the job very well:

Wow, that is encouraging!

Hi Notto and Chris, fantastic thread, thanks for the amazing contributions!
I mean to write a “segmentation lab” patch to compare several methods, can I base it on your great patch, @ctrapani?
I’d also like to add numeric analysis of segmentation quality based on human ground truth. Would you mind sending me a hand-corrected “perfect” segmentation of your respective example sounds?

Hi Diemo,

Happy to know the thread is useful to you. Here is my ground truth for the two flute examples.

Octave arpeggio:
Onset markers at 0, 590, 1020, 1495, 1920, 2370, 2820, 3270, 3690, 4120, 4580, 5020, 5410, 5870, 6310

This example has a final breath intake at around 6850 which the mel onset interpreted as a new final onset. I would say it is not necessarily “wrong” to isolate the inhalation as its own onset, but with the flute note still reverberating in the background it makes more sense to ignore this.

Same-note repetition:
Onset markers at 0, 400, 620, 880, 1140, 1380, 1640, 1920, 2170, 2660, 2890, 3150

Thanks! If you still have it in the patch, you can just do writetrack and save as sdif or txt.
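
The numeric quality analysis proposed earlier in the thread could be sketched as follows, using the octave-arpeggio ground truth posted above. This is a minimal sketch under assumptions, not an existing MuBu feature. The marker times are assumed to be in milliseconds, the 50 ms tolerance is an arbitrary choice, the matching is a simple greedy one-to-one scheme, and the `detected` list is hypothetical data made up to illustrate the breath-intake case, not real detector output.

```python
def evaluate_onsets(detected, reference, tolerance_ms=50):
    """Match detected to reference onsets one-to-one within a tolerance.

    Returns (precision, recall, f_measure).
    """
    unmatched = sorted(reference)
    hits = 0
    for d in sorted(detected):
        # Closest still-unmatched reference onset, if any remain.
        best = min(unmatched, key=lambda r: abs(r - d), default=None)
        if best is not None and abs(best - d) <= tolerance_ms:
            unmatched.remove(best)
            hits += 1
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f

# Ground truth for the octave arpeggio, as posted in the thread.
reference = [0, 590, 1020, 1495, 1920, 2370, 2820, 3270,
             3690, 4120, 4580, 5020, 5410, 5870, 6310]

# Hypothetical detector output: every reference onset shifted by 10 ms,
# plus one extra marker at the breath intake around 6850.
detected = [t + 10 for t in reference] + [6850]

precision, recall, f = evaluate_onsets(detected, reference)
print(precision, recall, round(f, 3))  # 0.9375 1.0 0.968
```

Here the extra marker at the breath intake lowers precision but not recall, which matches the intuition that it is a spurious segment rather than a missed note.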