
Best practices for short gesture recognition?


I am trying to achieve reliable and responsive detection of a few short gestures (up to 5-6 eventually, but currently 2): let’s say swipe left, swipe right, and idle. My current input is the accelerometer data (with gravity). I want to build a dataset that shows enough variance to the model. Currently I have three labels, “left”, “right”, and “still”, the latter for everything that is not a swipe-left or swipe-right gesture. I have 10 instances of each class. So here are some questions:

  1. Which one is the best object for this use case? mubu.gmm, mubu.hhmm or mubu.xmm?

  2. How many examples per class are needed for consistent results? The case is 2 short gestures (let’s say 0.3-0.6 s long) plus the class for filtering out everything else (for which the examples are 5-6 s of random movement). My background is working with LSTMs, so I am used to much bigger datasets, but my hunch was that maybe that is not necessary in this case. Should 10 be enough? Less, more?

  3. I started with mubu.gmm, then moved on to mubu.hhmm. With some tweaking of the parameters I get fairly OK results without much noise or misfiring. But it always takes some time for mubu.hhmm to go back from either “left” or “right” to “still”. So I cannot quickly retrigger the same gesture over and over, since it always takes around 0.5-1.5 s to go back to “still”. Which setting could make this transition faster (without losing precision)?

  4. About the training data. I noticed that I got better results with mubu.hhmm when I recorded the gesture with not so much “idle” before and after the action. How extreme should I be with this? Should I precisely select the gesture with markers, removing as much idle from the recording as possible? Would this possibly also help with the previous issue?

  5. I also noticed that when training gestures which are similar (swipe left is, let’s say, trough-then-peak, and swipe right is peak-then-trough in the sensor readout), it is much easier to confuse the model than in a situation like swipe up vs swipe left. Any advice on how to improve accuracy in these cases?

  6. How important is scaling? Will the model (or the mubu.track?) autoscale everything anyway, or should I be careful about that? -1 to 1 or 0 to 1 as a normalized range? (Does it matter?) When I scale, should the signal be as “hot” as possible (as if it were a mic signal)? Some accelerometers are scaled extremely broadly in order not to clip even with the most extreme movements - but that means that most “normal” gestures will barely move the numbers from the middle position. Would “zooming into” the useful range (i.e. gaining up the signal) help? Or does it not matter?

  7. Any other advice, comment or suggestion you have, I am super grateful to get!

Thanks a lot!!!

Hi Balint,
sorry for the late reply

  1. If you consider “gestures” as “time-profiles”, the best one is mubu.hhmm
    For static “postures”, mubu.gmm is the best

  2. It can actually work with a single example, if it represents well what you want to achieve.
    In this case, you can “manually” set the variability (variance) using the regularization parameters:
    relative -> multiplies the computed variance
    absolute -> adds an absolute value (independent of the computed variance)
    With a single example, you probably need to increase the absolute regularization
    (using the regularization message).
    You need to retrain after you modify these parameters (send the train message).
    Generally I use a very small number of examples (1 to 3), but record them with care.
    You can of course use many more…

  3. You can try to reduce the number of states (10 by default).
    To improve the recognition, also try increasing the likelihood window, likelihoodwindow (setting it roughly equal to the number of states).

  4. Yes, the idle is definitely adding a lot of issues; the model will integrate it as a “feature”.
    Using markers is one good solution; setting a gate to automatically start/stop the recording is another.

  5. I’m not sure I see what you mean by “trough-then-peak”?
    Good practice is to visualize the gestures and check how closely you can actually reproduce them.

  6. The scaling doesn’t matter in principle.
    But… with a small amount of data, it might be important to scale the “absolute regularization” correctly, since it needs to be scaled with the data (it’s a variance, so it scales with the square of the data scaling).
    To reuse the examples and the help patch, the easiest is to scale the data to [-1, 1].

  7. With accelerometers, it is usually very helpful to apply some pre-processing before the recognition. The offset values due to gravity (the values when you stand still) can significantly alter the recognition.
    Typically I join to the accelerometer data an additional parameter such as a “gesture intensity” (which goes to zero when still).
    You can also try to filter your data using pipo.biquad.
    I can try to provide an example if this is helpful. Actually, we would like to augment the number of examples soon.
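To make the “gesture intensity” idea concrete, here is a minimal sketch in Python (outside Max, purely illustrative - in a patch this would be done with pipo modules or Max objects). It assumes raw 3-axis accelerometer frames in g units that include gravity; the smoothing constant is an assumed value to be tuned:

```python
import math

def gesture_intensity(frames, alpha=0.1):
    """Compute a smoothed 'gesture intensity' from raw 3-axis
    accelerometer frames (including gravity, in g units).
    Returns one value per frame; stays near zero when still."""
    intensity = []
    smoothed = 0.0
    for x, y, z in frames:
        # Magnitude of the acceleration vector; ~1 g when still.
        mag = math.sqrt(x * x + y * y + z * z)
        # Distance from the static (gravity-only) magnitude.
        deviation = abs(mag - 1.0)
        # One-pole low-pass smoothing (alpha is an assumed constant).
        smoothed += alpha * (deviation - smoothed)
        intensity.append(smoothed)
    return intensity

# When the device is still, the intensity stays at zero.
still = [(0.0, 0.0, 1.0)] * 50
print(max(gesture_intensity(still)))  # prints 0.0
```

Joined to the accelerometer frames as an extra input dimension, a channel like this gives the model an explicit “is something happening” feature, independent of the device orientation.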

I hope this helps


Thanks, Fred for the awesome feedback!!

  1. OK, hhmm it is!
  2. okay, this makes sense. I tried to fiddle with these before but never understood them exactly. :slight_smile:
  3. great, will try this!
  4. okay, I tried gating previously, but it made things worse, probably because of my gating implementation (maybe the gate was slightly too late to switch on). Will go back to this and try with the markers too.
  5. So the issue here was that if let’s say I keep my hand in the same orientation and steady, and swipe left or right, I will mostly have movement on one axis (in the sensor vector). If I go left, the number first goes up (peak) then dips down (trough). If I go right, it first dips then peaks. I noticed that this was much more problematic for the model than distinguishing between let’s say up and left. But maybe with some more careful recordings this will go away?
  6. Okay, will go with [-1, 1].
  7. Oh I see, I thought acceleration was better than linear acc. because it also indicates the orientation of the device, but then will try with linear acceleration now. The gesture intensity trick sounds great! And I assume it can also help with wiping the idle data with some looped back mubu.process maybe? (delete leading/trailing zeros). I never tried [pipo.biquad], I tried simple [slide] and [pipo.mvavrg] (slide should be fairly similar to biquad, no?). Looking forward to the new examples, keep up the great work!
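For the leading/trailing idle removal mentioned in point 7, something like this is the idea (a rough Python sketch, outside Max; the threshold is an arbitrary value that would have to be tuned to the sensor):

```python
def trim_idle(samples, threshold=0.05):
    """Trim leading and trailing 'idle' samples from a recording.
    samples: per-frame intensity values (e.g. deviation from the
    resting accelerometer magnitude).
    threshold: assumed value; frames below it count as idle."""
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]

print(trim_idle([0.0, 0.01, 0.3, 0.8, 0.4, 0.02, 0.0]))
# prints [0.3, 0.8, 0.4]
```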


about 5. It might be that the variation in your orientation is relatively large compared to the magnitude of the (dynamic) acceleration. You could check this by visualizing the data

about 7. about linear vs raw acc: it really depends on your choice of gestures.
about pipo.biquad: it is useful for high-pass, band-pass, etc… Of course, for a low-pass, slide or pipo.mvavrg are perfectly fine
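To illustrate what a high-pass biquad does to the gravity offset, here is a standalone Python sketch using the standard RBJ “audio EQ cookbook” high-pass coefficients (this is not the actual pipo.biquad implementation; the sample rate, cutoff and Q are assumed values):

```python
import math

def highpass_biquad(samples, fs=100.0, f0=0.5, q=0.707):
    """High-pass biquad filter (RBJ cookbook coefficients).
    Removes a constant offset (e.g. gravity) from sensor data.
    fs, f0, q are assumed values to be tuned for the sensor."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    cw = math.cos(w0)
    a0 = 1.0 + alpha
    # Coefficients, pre-normalized by a0.
    b0 = (1.0 + cw) / 2.0 / a0
    b1 = -(1.0 + cw) / a0
    b2 = (1.0 + cw) / 2.0 / a0
    a1 = -2.0 * cw / a0
    a2 = (1.0 - alpha) / a0
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        # Direct form I difference equation.
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

# A constant 1 g offset passes through at first, then decays to zero.
filtered = highpass_biquad([1.0] * 2000)
```

The point is that the constant gravity component disappears from the filtered signal, while the short dynamic bursts of a swipe gesture pass through.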
