
GMM training: best practices

Hello everybody,

I am using mubu.gmm as part of a real-time installation, in much the same way as the gmm_vowels.maxpat example (i.e. using MFCC coefficients to detect particular sounds in the human voice).

I’ve tried a few different things and settings before asking, but I’m still getting inconsistent results, so I thought I’d ask for an expert’s opinion.
Taking the gmm_vowels example as a reference:

  • What would be the best way to create a new training set? Or, in other words, what’s the best audio you can record (to generate a meaningful MFCC set)?
    Should it be a super clean, well edited voice (i.e. removing silence before and after the actual voice sound)?
    Or, more importantly, would it make sense to include many different consecutive examples of the same vowel in the SAME buffer? For example, recording a long audio file where all my family members and I pronounce the letter ‘E’, one immediately after the other (LOL)?
    Does this ‘variety’ in the training set make the actual detection ‘more robust’?

  • Vowel detection works quite well with my own voice (as the original training set used a male speaker) but doesn’t work at all with e.g. my girlfriend.
    What would you do in this case to fix this ‘gender’ issue? Just record and add a new buffer for EACH vowel and rename the label, e.g. /e_female in contrast to /e_male? Or (see above) record a totally different buffer which contains BOTH my voice and my girlfriend’s?
    Performance-wise, would it be a major issue to use e.g. 100 classes instead of 3?

I sense there is something about how a GMM works that I still can’t quite grasp, so any help in this regard would be very much appreciated.

Thank you in advance

L

Hi

Should it be a super clean, well edited voice (i.e. removing silence before and after the actual voice sound)?
Yes, and since a GMM is a "static" recognizer (i.e. temporal shapes are averaged), it might be better to avoid transitions as well.
Or, more important, would it make any sense to include in the SAME buffer many different consecutive examples of the same vowel? For example recording a long audio file where me and all my family members pronounce the letter ‘E’, one immediately after the other (LOL)?
Yes, but avoid the silences between them, or make separate files.
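For intuition on the "static" point: a GMM scores each analysis frame independently and sums the log-likelihoods, so the order of the frames (the temporal shape) is invisible to it. A minimal pure-Python sketch with a hypothetical single 1-D Gaussian component (not the actual mubu.gmm internals):

```python
import math
import random

def gauss_logpdf(x, mean, var):
    # Log-density of a 1-D Gaussian; stands in for one GMM component.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def sequence_loglik(frames, mean, var):
    # A GMM scores each frame independently and sums the log-likelihoods,
    # so the temporal order of the frames does not matter.
    return sum(gauss_logpdf(f, mean, var) for f in frames)

rising = [0.1, 0.3, 0.5, 0.7, 0.9]  # e.g. a feature moving through a transition
shuffled = list(rising)
random.seed(0)
random.shuffle(shuffled)

a = sequence_loglik(rising, mean=0.5, var=0.1)
b = sequence_loglik(shuffled, mean=0.5, var=0.1)
print(abs(a - b) < 1e-9)  # True: frame order is ignored
```

This is why transition frames in the training audio only blur the class: they get averaged into the same static distribution as the steady vowel frames.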
Vowel detection works quite well with my own voice (as the original training set used a male speaker) but doesn’t work at all with e.g. my girlfriend. What would you do in this case to fix this ‘gender’ issue? Just record and add a new buffer for EACH vowel and rename the label, e.g. /e_female in contrast to /e_male? Or (see above) record a totally different buffer which contains BOTH my voice and my girlfriend’s?

Both options are possible; it really depends on the goal of the final application, since the types of errors will differ (false positives vs. false negatives).
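To make that trade-off concrete, here is a toy pure-Python sketch (hypothetical 1-D feature values, not real MFCC data): a single pooled class ends up with a wider Gaussian that readily accepts an in-between voice, while tight per-speaker classes may reject it, and conversely the wide class will also match more non-target sounds:

```python
import math

def gauss_logpdf(x, mean, var):
    # Log-density of a 1-D Gaussian; stands in for one GMM component.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# Toy 1-D feature: the male /e/ sits around 0.2, the female /e/ around 0.8.
male_mean, female_mean, tight_var = 0.2, 0.8, 0.01

# Option A: one pooled /e/ class -> a wider Gaussian covering both voices.
pooled_mean = 0.5
pooled_var = tight_var + ((male_mean - female_mean) / 2) ** 2  # between-speaker spread added

x = 0.5  # a new voice halfway between the two training speakers

pooled_score = gauss_logpdf(x, pooled_mean, pooled_var)
# Option B: separate /e_male and /e_female classes -> take the better of the two.
split_score = max(gauss_logpdf(x, male_mean, tight_var),
                  gauss_logpdf(x, female_mean, tight_var))

# The pooled class accepts the in-between voice far more readily (fewer false
# negatives), but it would also match more non-/e/ sounds (more false positives).
print(pooled_score > split_score)  # True
```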

Performance-wise, would it be major issue to use e.g. 100 classes instead of 3?
It should work, especially with the latest release (we improved the performance). If not, please let us know.

Other things to consider in the gmm_vowels example:

  • the first MFCC coefficient is basically the intensity; you might try the recognition without it, or be careful about audio normalization
  • you can try changing the value of @varianceoffset, for example try 1 0.2, to make the model less specific

I hope this helps
Fred

Thank you very much Fred, I really appreciate your answer! I will let you guys know if I find any performance-related issues; I’m using it a lot right now…

I had already thought of increasing the @varianceoffset (while also slightly increasing @mixture, even though I know it’s a bit counterintuitive), and it worked a bit better.
The goal of the final application is to recognise syllables (but mainly vowels) in real-time speech, no matter whether a man or a woman is talking, during an installation (so also in quite a noisy environment).

I didn’t know about removing the first MFCC: it seems like a great idea, especially because I find it quite hard to implement proper volume normalization in such audio installations (people can speak loudly or quietly, there can be many audio peaks, background noise, etc.)…
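As a sanity check on why dropping the first coefficient helps: scaling the audio adds a constant in the log domain, and in a plain DCT-II (the transform used to go from log mel-band energies to cepstral coefficients) only coefficient 0 picks up a constant offset. A small pure-Python sketch with made-up log mel values:

```python
import math

def dct(x):
    # Plain DCT-II, as used to turn log mel energies into cepstral coefficients.
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * k * (n + 0.5) / N) for n in range(N))
            for k in range(N)]

log_mel = [1.0, 2.0, 1.5, 0.5, 1.2, 0.8]        # hypothetical log mel-band energies
gain = 4.0
louder = [v + math.log(gain) for v in log_mel]  # scaling the audio adds a constant in log domain

quiet_c = dct(log_mel)
loud_c = dct(louder)

# Only coefficient 0 changes with level; coefficients 1.. are gain-invariant.
print(all(abs(a - b) < 1e-9 for a, b in zip(quiet_c[1:], loud_c[1:])))  # True
```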

Thanks a lot again

L