Hello everybody,
I am using mubu.gmm as part of a real-time installation, in a very similar way to the gmm_vowels.maxpat example (i.e. using MFCC coefficients to detect particular sounds in the human voice).
I’ve tried a few different things / settings before asking, but I’m still getting some inconsistent results, so I thought I’d ask for some expert opinions.
Take the gmm_vowels example as a reference:
-
What would be the best way to create a new training set? Or, in other words, what’s the best audio you can record (to generate a meaningful MFCC set)?
Should it be a super clean, well-edited voice recording (i.e. with silence removed before and after the actual vowel sound)?
Or, more importantly, would it make any sense to include many different consecutive examples of the same vowel in the SAME buffer? For example, recording a long audio file where all my family members and I pronounce the letter ‘E’, one immediately after the other (LOL)?
Does this ‘variety’ in the training set make the actual detection ‘more robust’?
-
Vowel detection works quite well with my own voice (as the original training set used a male speaker), but doesn’t work at all with my girlfriend’s voice, for example.
What would you do in this case to fix this ‘gender’ issue? Just record and add a new buffer for EACH vowel and label it e.g. /e_female, in contrast to /e_male? Or (see above) record a totally different buffer which contains BOTH my voice and my girlfriend’s?
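To make the question concrete, here is a rough sketch of the intuition I’m asking about, using scikit-learn’s GaussianMixture as a stand-in (this is NOT the mubu.gmm internals, and the 2-D points are toy stand-ins for real MFCC frames): a class trained on one voice only can score another voice poorly, while pooling both voices into the SAME class, with enough Gaussian components, lets the mixture cover both.

```python
# Hedged sketch: scikit-learn as a stand-in, NOT mubu.gmm internals.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy 2-D stand-ins for MFCC frames: the same vowel spoken by two
# speakers lands in nearby but distinct regions of feature space.
male_e = rng.normal([0.0, 0.0], 0.3, (200, 2))
female_e = rng.normal([1.5, 0.5], 0.3, (200, 2))

# Class model trained on the male voice only, one component.
gmm_male_only = GaussianMixture(n_components=1, random_state=0).fit(male_e)

# Same class, but trained on BOTH voices pooled, with two components so
# the mixture can place one Gaussian on each speaker's cluster.
gmm_pooled = GaussianMixture(n_components=2, random_state=0).fit(
    np.vstack([male_e, female_e]))

# Unseen female frames of the same vowel.
test = rng.normal([1.5, 0.5], 0.3, (50, 2))

# Average log-likelihood per frame: the pooled model fits the female
# voice far better than the male-only model does.
print(gmm_pooled.score(test) > gmm_male_only.score(test))  # True
```

If this toy picture is right, then pooling into one /e class (rather than splitting into /e_male and /e_female) would keep the label set small, provided the model is allowed enough components per class.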
Performance-wise, would it be a major issue to use e.g. 100 classes instead of 3?
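My (possibly wrong) mental model of why this might scale acceptably, again sketched with scikit-learn as a hypothetical stand-in rather than mubu.gmm itself: at recognition time the classifier evaluates one likelihood per class and takes the argmax, so per-frame cost should grow roughly linearly with the number of classes.

```python
# Hedged sketch: scikit-learn as a stand-in, NOT mubu.gmm internals.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

def make_models(n_classes, n_mfcc=13, n_components=3):
    """One GMM per class, each fit on random stand-in 'MFCC' frames."""
    models = []
    for _ in range(n_classes):
        frames = rng.normal(size=(60, n_mfcc))
        models.append(
            GaussianMixture(n_components=n_components, random_state=0,
                            reg_covar=1e-3).fit(frames))
    return models

def classify(models, frame):
    # One likelihood evaluation per class, then argmax: O(n_classes)
    # work per incoming frame.
    scores = [m.score(frame.reshape(1, -1)) for m in models]
    return int(np.argmax(scores))

models_100 = make_models(100)
frame = rng.normal(size=13)
label = classify(models_100, frame)  # some class index in 0..99
```

So 100 classes would mean roughly 33x the per-frame scoring work of 3 classes, which sounds tractable, but I don’t know whether mubu.gmm behaves this way in practice.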
I understand there is something about how a GMM works that I still can’t quite grasp, so any help in this regard would be very much appreciated.
Thank you in advance
L