
Best Ambisonic - Binaural combination?

Hi Thibaut,

I have been browsing several threads as well as Spat tutorials about HOA and binaural, but I’m still wondering about the most efficient strategy, in terms of both CPU economy and creativity. I need to spatialize a scene with many sources, as well as a 4-channel B-format audio stream (a room ambiance that I recorded with a tetramic and that won’t require any processing).

  1. It seems that using a combination of HOA with binaural is more CPU efficient when it comes to spatializing multiple sources. I have tried to compare the three options (spat5.hoa.binaural vs. spat5.virtualspeakers vs. spat5.binaural) in the spat5.hoa.binaural~ example of the overview, but I don’t see an obvious difference.
    What would you suggest? Is spat5.hoa.binaural reliable? I read in one of the tutorials “use at your own risk”…

  2. I have so far used spat5.spat with the binaural panning type. This solution was OK since I didn’t have many sources, but I now think I should apply the HOA-binaural combination. I will have at least 15 sources, including a 4-channel B-format audio stream. Is this combination actually better in this case?
    So far, I have used spat5.spat, along with spat5.oper, to manage the positioning and perceptual parameters of the sources in binaural.
    If I were to use a combination such as spat5.hoa.encoder / spat5.hoa.binaural or spat5.virtualspeakers, how can I preserve control over the positioning and perceptual parameters (without spat5.spat)? Should spat5.spat intervene in the processing chain, or how can it be replaced?

I believe I could use the approach of nm0 here Ambisonic Synthesis + Decoding in spat5.spat
and place a spat5.spat object upstream, before the spat5.decoder?

There are many tutorials and examples involved. Let me know if I should provide a patch. I’d rather understand the underlying logic beforehand.

Thanks in advance!
Coralie

Hi Coralie,

Well, it depends…

Basically you can use three approaches :
#1 “native” binaural synthesis, using [spat5.binaural~] or [spat5.spat~ @initwith “/panning/type binaural”]
#2 using HOA encoding and virtual speakers binauralization i.e. [spat5.hoa.encoder~]+[spat5.hoa.decoder~]+[spat5.virtualspeakers~] or [spat5.spat~ @initwith “/panning/type hoa”]+[spat5.hoa.decoder~]+[spat5.virtualspeakers~]
#3 using HOA encoding and HOA-to-binaural transcoding i.e. [spat5.hoa.encoder~]+[spat5.hoa.binaural~] or [spat5.spat~ @initwith “/panning/type hoa”]+[spat5.hoa.binaural~]

Pros/Cons :
#1 is “the best”, in terms of quality (timbre and localization accuracy).
But it is expensive in terms of CPU, as the CPU load is proportional to the number of sources you need to spatialize.

#2 The CPU load depends 1) on the HOA order you’re using and 2) on the number of virtual speakers.
Adding more sources will only have a minor impact on CPU load.
For rotating sources, you can apply rotations in the HOA-encoded domain, which is CPU efficient.
Drawbacks : the rendering quality depends crucially 1) on the HOA decoder settings you’re using and 2) on the chosen grid of virtual speakers.

#3 The most “elegant” approach, and often the most CPU efficient.
The CPU load depends on the HOA order.
Adding more sources will only have a minor impact on CPU load.
For rotating sources, you can apply rotations in the HOA-encoded domain, which is CPU efficient.
Drawbacks : at the moment, spat5.hoa.binaural~ is not giving full satisfaction (therefore it is marked as “prototype, use at your own risks”).
We know it could be improved – in terms of rendering quality – and it will be in the future.
If the current version is OK for you – for timbre and localization – then great. Use option #3 !
Note that new options were added in v5.2.0 (still at an experimental stage, but worth trying out).
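The rotation point above (for approaches #2 and #3) can be made concrete with a small sketch. This is not spat5 code; it assumes a 1st-order stream in ACN channel ordering (W, Y, Z, X) with SN3D normalization, and shows that a yaw rotation of the whole encoded scene is just a tiny matrix operation, independent of how many sources were encoded:

```python
import math

def yaw_rotate_foa(frame, angle_rad):
    """Rotate a 1st-order ACN/SN3D frame [W, Y, Z, X] about the vertical axis."""
    w, y, z, x = frame
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    # W and Z are invariant under a yaw rotation;
    # X and Y mix like a 2D rotation of the source azimuth.
    return [w, y * c + x * s, z, x * c - y * s]

# A source encoded on the +X axis (straight ahead), rotated by 90 degrees,
# ends up on +Y (up to floating-point rounding):
print(yaw_rotate_foa([1.0, 0.0, 0.0, 1.0], math.pi / 2))
```

Whatever the number of sources mixed into the stream, the rotation cost stays the same, which is why rotating in the encoded domain is so cheap.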

In any case (#1, #2 or #3), you can use spat5.spat~ + spat5.oper + perceptual parameters. No problem.

For actual CPU load, you need to benchmark.
As a ballpark figure: approach #3 with 4th-order HOA involves (4+1)^2 = 25 HRTFs. Thus it should have roughly the same CPU load as approach #1 with 25 (static) sources.
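For a back-of-the-envelope check of that figure, here is a minimal sketch (assuming, as above, that CPU cost is dominated by the number of HRTF filters to run):

```python
# Ballpark comparison of approaches #1 and #3; an illustration of the
# arithmetic above, not a measurement of spat5 itself.

def hoa_channels(order: int) -> int:
    """Number of 3D HOA channels, hence HRTF filters in approach #3."""
    return (order + 1) ** 2

def approach1_filters(n_sources: int) -> int:
    """Approach #1: one HRTF filter per source (counted per ear)."""
    return n_sources

print(hoa_channels(4))        # 25 -> approach #3 at 4th order
print(approach1_filters(25))  # 25 -> same ballpark as 25 static sources
```

The key difference: in approach #3 the count is fixed by the order, while in approach #1 it keeps growing with every added source.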

I hope this clarifies a bit.
Feel free to comment.

Best,
T.


Hi Thibaut,
Thank you for the thorough reply. Does the spat5.hoa.binaural object require speaker coordinates? Looking at the comparison patch from the help file, it seems that it only requires a norm and an HRTF folder reference. Is that the case? Does this imply there are no actual virtual speakers at all?

Thank you.

Correct: spat5.hoa.binaural~ does not require loudspeaker coordinates.

Thank you Thibaut. Does spat5.hoa.binaural~ use virtual speakers? I am wondering about getting the speaker coordinates / the possible use of phantom speakers at 90° and -90° elevation, as this option is offered for spat5.hoa.decoder~. Does it make sense to use it? Or does spat5.hoa.binaural~ provide the best layout possible, regardless of whether we resort to the phantom speakers?

Actually, to be more “square”, I should ask about the difference between the options offered in the spat5.hoa.decoder~ in general, such as the decoding technique and optimization possibilities. I assume these are not reproduced in the spat5.ambisonics.hoa~ object?

spat5.hoa.binaural~ uses the best possible layout of virtual speakers, and therefore the phantom option is not necessary.
Given the optimal virtual speaker layout, the object internally uses the sampling decoder method for decoding.
Other options such as /blocksize and /hrir/length refer to the convolution process, and they impact the CPU load.

I assume these are not reproduced in the spat5.ambisonics.hoa~ object?

Which object are you referring to ?

Hi Thibaut,
Sorry, my mistake. I was talking about the spat5.hoa.decoder~ : method, type, energy compensation and so on.
I am afraid my understanding of the convolution parameters is still too limited to make the most of spat5.hoa.binaural~ , and since I’m using the project for clients who need a robust solution, I’ll stick to approach #2.
I have another question regarding the HRTFs though. I had been using raw Ircam HRTFs in a former project, and found them quite satisfactory in 2D. I was less convinced for the 3D (I had tested them all, and picked the individual set that worked best for me).
Have you compared them with the Kemar, or have you read anything about possible comparisons? Since the patch will be used by quite a lot of different people, I wonder which bank is more efficient.
If I should switch to another bank, are the Ircam hrtfs on the SOFA format? Is it OK to load a folder that contains a set of Ircam individualized Hrtfs?
Thank you!
Coralie

Hi again,

Well, I strongly recommend not to use “raw” HRTFs.
As the name suggests, these are raw data from the measurement, without any post-processing to “clean up” the acoustical measurement.
This is essentially useful for scientific analysis, but not really for “listening” purposes.
You should rather use the “Compensated” HRTFs, which have been post-processed, cleaned-up, equalized, diffuse-field equalized, etc.
(see spat5.sofa.loader)

I was less convinced for the 3D

That’s not entirely surprising : perception of height strongly depends on individual cues.
So non-individualized HRTFs usually work better in the horizontal plane.

Have you compared them with the Kemar, or have you read anything about possible comparisons?

No, we haven’t compared. It is likely that some people have made comparisons, but I don’t have any reference to recommend.

I wonder which bank is more efficient

Have a look here :

are the Ircam hrtfs on the SOFA format?

Yes. (see spat5.sofa.loader)

Is it OK to load a folder that contains a set of Ircam individualized Hrtfs?

Why not ? What do you mean “is it OK” ?

Best,
T.

Hi @tcarpent and @coraliediatkine.
I have a similar workflow as @coraliediatkine and I’m struggling trying to wrap my head around the issue. I have a B-format 4 chan ambient recording, plus some additional sounds that I want to render as sound sources in a binaural scene.
So it makes more sense to encode the B-format file to a higher order and then decode it with the hoa.binaural~ object, and to use another encoder for the other sound sources? Basically, one will have two parallel encoders, one for the B-format to HOA and another for the sound sources to HOA. And both outputs of the encoders get summed at the input of the hoa.binaural object. And that is it?

Does this make sense? Or should I add the patch? I tried it already and to my ears it sounds just fine. It blends well and I don’t hear anything particularly glitchy.
Thanks!

Nico

Hi Nico,

Why would you “encode the b-format file to a higher order” ?
Not sure what you mean by that, but it seems like it would just waste (a bit of) CPU without improving the rendering/spatial resolution.

On one hand : encode your additional sounds into N-order HOA (say ACN/SN3D).

On the other hand : make sure your b-format stream is “converted” to the same Ambisonic format (ACN / SN3D), thanks to spat5.hoa.sorting~ and spat5.hoa.converter~.
(also apply 90° rotation if needed – have a look at spat5.tuto-bformat.maxpat)

Then sum up the two streams.
HOA is hierarchical, so it is legit to sum several HOA streams even if their orders differ.
i.e. you can sum a 1st order stream and a N-order stream.

Finally, decode this N-order stream to binaural. (using e.g. spat5.hoa.binaural~)
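The summing step can be sketched as follows. This is only an illustration of the channel bookkeeping, not of spat5 internals; it assumes both streams already share the same convention (ACN ordering, SN3D normalization), as ensured by the conversion step above:

```python
# The "hierarchical" property: a 1st-order stream (4 ACN channels) can be
# summed into an N-order stream (here 3rd order, 16 channels) by treating
# its missing higher-order channels as zero.

def sum_hoa(stream_a, stream_b):
    """Sum two HOA sample frames (lists of channel values),
    zero-padding the lower-order one to the higher order."""
    n = max(len(stream_a), len(stream_b))
    pad = lambda s: s + [0.0] * (n - len(s))
    return [x + y for x, y in zip(pad(stream_a), pad(stream_b))]

first_order = [1.0, 0.2, 0.0, 0.3]  # W, Y, Z, X (ACN) from the B-format
third_order = [0.5] * 16            # (3+1)^2 = 16 channels
mixed = sum_hoa(first_order, third_order)
print(len(mixed))  # 16: the mix is a valid 3rd-order stream
```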

So, yes your workflow makes sense.
You can send a (stripped-down) patch if you really have a doubt.

Best,
T.


Thanks for your response Thibaut.
Yes, my idea was to “encode” the B-format stream to an N-order HOA so it will match the order of the additional sounds.
But that also means that I could encode my additional sounds in first-order Ambisonics, to save CPU, right? Does this affect the precision of the localization? Or is it less noticeable because it’s going through hoa.binaural anyway?
Thanks,
Nico

For your additional sounds, you probably want the best (spatial) precision, so you’d rather encode them in higher order. (although this will obviously increase the cpu load)
It should be noticeable, because the HRTFs you’ll use in spat5.hoa.binaural~ are able to provide very high orders.
You will likely notice a difference between 1st and say 4th order encoding. Due to perceptual limits, it is however likely that the differences between 4th and 5th (for example) are much more subtle.
So, you need to make experiments and adjust the encoding order to your needs/taste.
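To make the order/precision tradeoff concrete, here is a minimal sketch of 1st-order encoding of one mono source, using the standard real spherical harmonics in ACN/SN3D (an illustration, not spat5 code). At order 1 a source is described by only 4 fairly coarse spatial functions; each extra order adds finer ones, for (N+1)^2 channels in total:

```python
import math

def encode_foa(sample, azimuth, elevation):
    """Encode one mono sample at (azimuth, elevation) in radians
    into 1st-order ACN/SN3D channels [W, Y, Z, X]."""
    w = sample                                        # omnidirectional
    y = sample * math.sin(azimuth) * math.cos(elevation)
    z = sample * math.sin(elevation)
    x = sample * math.cos(azimuth) * math.cos(elevation)
    return [w, y, z, x]

# A source straight ahead (azimuth 0, elevation 0): all energy on W and X.
print(encode_foa(1.0, 0.0, 0.0))  # [1.0, 0.0, 0.0, 1.0]
```

Higher orders use the same principle with more (and sharper) spherical-harmonic gains per source, which is where both the extra precision and the extra CPU come from.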

Ok that makes sense. Thank you!

Hi Thibaut,
Sorry, I hadn’t understood the use of spat5.sofa.loader. It’s a bit clearer to me now. So my question about loading one specific set of HRTFs didn’t make sense at all.
If I understand correctly, one has to filter the type of HRTFs that they want to use, download it into the Ircam/sofa directory, and then load it through a dialogue window, for instance.

I am not certain I understand the difference between the SOS and HRIR data types. So far, I knew about the Listen and the Bili projects, but I had no idea about the SOS data type and its meaning. I have looked at the wiki about the SOFA convention, but I’m stuck when it comes to understanding the ins and outs of that denomination. I don’t want to waste your time, so do you think this is important, and could you just give me a few directions to make a choice?
Edit: I have used the help patch to compare SOS and non-SOS files, and SOS12, SOS24, SOS36. I have the feeling that the SOS format maybe allows a more precise localization (the curves look less smooth), and that this is also increased with the SOS order. But I may be wrong. I’d like to make sure this makes sense from an aural standpoint.
Thanks
Coralie


Hi,

Yes, spat5.sofa.loader gives you access to all HRTFs files on our servers.
You can download files, and later use them in your patch.

SOS stands for “second-order-sections”.
This means these are IIR filters, while HRIRs are FIR filters.

SOS files are approximated versions of the original HRIR data.
The image you posted illustrates this approximation (on the frequency spectrum).
The red curve (SOS) is an approximation of the blue one (HRIR). (assuming you load the files corresponding to the same individual)

SOS rendering is A LOT less expensive than HRIR rendering.

SOS12 means 12th order IIR filter. It’s a rough approximation. But extremely cheap for CPU load.
SOS24 means 24th order IIR filter. Provides a more accurate approximation than SOS12. That’s my recommendation: the best tradeoff between quality and CPU load.
SOS36 means 36th order IIR filter. Usually, this only provides a marginal quality improvement compared to SOS24.

As you can see, SOS filters are usually a lot smoother than HRIR. This might also be a benefit for non-individualized listening. (SOS smoothes out strong peaks or notches that might be very specific to the measured individual)

In most situations, my recommendation is to use SOS24 files.
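A rough cost model may help see why SOS rendering is so much cheaper. The per-sample figures below are illustrative assumptions (one multiply per FIR tap, five per biquad section), not measurements of spat5:

```python
# SOS-vs-HRIR tradeoff: an HRIR is an FIR filter whose per-sample cost
# grows with its length, while an SOS filter is a cascade of biquads
# whose cost depends only on its (much smaller) order.

def fir_mults_per_sample(hrir_length: int) -> int:
    return hrir_length                  # one multiply per tap

def sos_mults_per_sample(order: int) -> int:
    return (order // 2) * 5             # five multiplies per biquad section

print(fir_mults_per_sample(512))  # e.g. a 512-tap HRIR: 512 multiplies
print(sos_mults_per_sample(24))   # SOS24: 12 biquads -> 60 multiplies
```

Under these assumptions a SOS24 filter costs roughly an order of magnitude less per sample than a typical HRIR convolution, which matches the “A LOT less expensive” remark above.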

Best,
T.

Very interesting discussion…

@coraliediatkine where do you find this display ?

[screenshot]

Best,

Jerome

Hi Jérôme,
It’s just a screenshot of the “comparison” tab of the spat5.sofa.loader help patch. I used the automated azimuth panning in the patch, and loaded different HRTF files.
Regards
Coralie

Hi Coralie,

Ok, thank you very much…

I would like to use the same spectrographic object you used to make comparisons; which one is it? Sorry if this question is a bit silly…

As an HRTF model, I really like the 1040 SOS24…

Thank you in advance,

Best,

Jerome

spat5.frequencyresponse or spat5.frequencyresponse.embedded
Check spat5.hrtf.infos.maxhelp

Cheers,
T.


Thank you Thibaut; like @coraliediatkine, I want to compare several curves… I suppose I have to do it with this: [screenshot]

how many curves are available ?

Thanks,

Jerome