
Corpus-Based Synthesis Blind Spot - Researching New Tool

Hi Everyone,

I am trying to get a team together to build a concatenative tool that would blend a few existing technologies to fill what I see as a blind spot in the tools currently available.

The idea is hopefully somewhat simple: to build a tool that allows someone to replace an audio target with samples of their choosing, outputting a Reaper project (.RPP) file laid out in a specific way instead of a WAV.

Audioguide, for example, can already do this, but it is missing some features that are key for my application. It seems the technology to do this in the specific way I’m looking for already exists, but not all in one program, or not in a very user-friendly form. The goal is to make a Frankenstein of this existing tech!

Here is an in-depth explanation:

For a use case, let’s say I’m trying to recreate what Björk did with Vulnicura and make an all-strings version of her electronic album. So I input Vulnicura as the target and then point to a corpus of tons and tons of string improvisations. The program would populate a Reaper session with all those samples in a way that gets close to the original Vulnicura source. It’s not so different from the audio mosaicing approach, except that the output is not a WAV but a Reaper session populated with a ton of audio items, overlapped and volume-adjusted to recreate the target.
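To make the output side concrete, here is a rough sketch of writing placements into an RPP-style text file. It is only an illustration under assumptions: the `Placement` class and `render_rpp` function are hypothetical names, the chunk and field names (`TRACK`, `ITEM`, `POSITION`, `LENGTH`, `VOLPAN`, `SOURCE WAVE`, `FILE`) follow what projects saved by Reaper appear to contain, and I am assuming the first `VOLPAN` field is the item volume. I have not checked that Reaper opens a file this minimal, so it should be compared against a project saved by Reaper itself.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    """One unprocessed corpus file placed in the session (hypothetical helper)."""
    path: str        # corpus file, used whole
    position: float  # start time in the session, in seconds
    length: float    # item length, in seconds
    gain: float      # linear item volume, 1.0 = unity


def render_rpp(tracks):
    """Render {track name: [Placement, ...]} as RPP-style text.

    Assumption: the layout mirrors what Reaper writes, reduced to the
    few fields needed here; real projects carry many more fields.
    """
    lines = ["<REAPER_PROJECT 0.1"]
    for name, items in tracks.items():
        lines.append("  <TRACK")
        lines.append(f'    NAME "{name}"')
        for item in items:
            lines += [
                "    <ITEM",
                f"      POSITION {item.position:.6f}",
                f"      LENGTH {item.length:.6f}",
                # Assumption: first VOLPAN field is item volume (linear).
                f"      VOLPAN {item.gain:.6f} 0 1 -1",
                "      <SOURCE WAVE",
                f'        FILE "{item.path}"',
                "      >",
                "    >",
            ]
        lines.append("  >")
    lines.append(">")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    session = {"strings for stem 1": [Placement("improv_001.wav", 0.0, 2.5, 0.5)]}
    print(render_rpp(session))
```

Only position, length, and volume are written, which matches the idea that items are placed and clip gained but never pitch shifted or filtered.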

Where I would want this to differ from audio mosaicing is that it should use whole files without repetition instead of grains, or be able to intelligently chop a long corpus file into whole phrases by stripping silence and cutting at zero crossings. Obviously we are aiming for some sort of melodic and harmonic recreation (most of this music is diatonic), but I would also like a timbral match, achieved not by filtering or manipulating samples from the corpus but by layering so many corpus samples, each at its own volume, that together they recreate the spectrum of the target. In the same way that any sound can be described as many stacked sine tones at varying amplitudes, my hope is that many stacked but unprocessed whole-file samples at varying amplitudes could spectrally approximate the timbre and character of the target.
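The layering idea can be sketched as a non-negative fitting problem: find one volume per corpus sample so that the weighted sum of their spectra gets close to the target spectrum. This is a minimal sketch, not a finished method. It assumes each sample is summarised by a single average magnitude spectrum and that magnitudes add roughly linearly, both of which are big simplifications (time and phase are ignored). The function name `fit_gains` is hypothetical, and the update used is the standard multiplicative rule for non-negative least squares.

```python
def fit_gains(target, spectra, iterations=200, eps=1e-12):
    """Find non-negative gains so that sum(gain_i * spectra[i]) ~ target.

    target  : list of magnitudes, one per frequency band
    spectra : list of magnitude lists, one per corpus sample
    Assumption: all values are non-negative, which the multiplicative
    update needs in order to keep the gains non-negative.
    """
    bands = range(len(target))
    gains = [1.0] * len(spectra)
    # Correlation of each sample's spectrum with the target (fixed).
    numerators = [sum(s[k] * target[k] for k in bands) for s in spectra]
    for _ in range(iterations):
        # Current mix of all layered samples, band by band.
        mix = [sum(g * s[k] for g, s in zip(gains, spectra)) for k in bands]
        for i, s in enumerate(spectra):
            denominator = sum(s[k] * mix[k] for k in bands) + eps
            gains[i] *= numerators[i] / denominator
    return gains


if __name__ == "__main__":
    # Toy example: the target is built from the two "samples" below.
    corpus = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
    print(fit_gains([2.0, 2.5, 0.5], corpus))
```

Each sample gets exactly one gain, so nothing is repeated; a gain that ends up near zero simply means that sample is left out of the session.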

It would be convenient if this were done within Reaper through a ReaScript, but that is not essential. It would also be great if you could use selected tracks of multiple actual stems as the target polyphony, instead of running NMF on one stereo WAV. For my tastes, I would also like the result to be as non-destructive and unprocessed-sounding as possible. That means things like the option to slice at transients or zero crossings and use the whole phrase instead of fixed windows, infinite timesparsity or an equivalent so that no collaged sample is ever repeated, and no spectral processing. Since there will be an option without NMF or other spectral processing, I’m hoping it will also be lighter on resources when that option is selected. I’m also thinking that even if the target polyphony is only 5 tracks, the final populated Reaper session could be hundreds and hundreds of tracks, possibly grouped into folders to represent which of the 5 stems they are aiming to recreate.
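As an illustration of slicing a long corpus file into whole phrases, here is a minimal sketch that strips silence with a simple amplitude threshold and then moves each cut outward to the nearest zero crossing. The names `snap` and `slice_phrases` and the parameters `threshold` and `min_silence` are hypothetical; a real tool would more likely use an RMS envelope or a transient detector, and would work in seconds rather than sample indices.

```python
def snap(samples, index, step):
    """Walk from index in direction step (-1 or +1) to a zero crossing.

    Returns the first position where the signal is zero or changes sign,
    or the start/end of the file if none is found.
    """
    n = len(samples)
    i = index
    while 0 < i < n:
        if samples[i] == 0.0 or samples[i - 1] * samples[i] <= 0.0:
            return i
        i += step
    return max(0, min(i, n))


def slice_phrases(samples, threshold=0.01, min_silence=4):
    """Return (start, end) index pairs, end exclusive, one per phrase.

    A phrase ends once min_silence consecutive samples stay below
    threshold; both cut points are then snapped to zero crossings.
    """
    phrases = []
    start = None       # first loud sample of the current phrase
    last_loud = None   # most recent loud sample seen
    for i, x in enumerate(samples):
        if abs(x) >= threshold:
            if start is None:
                start = i
            last_loud = i
        elif start is not None and i - last_loud >= min_silence:
            phrases.append((snap(samples, start, -1), snap(samples, last_loud + 1, +1)))
            start = None
    if start is not None:
        phrases.append((snap(samples, start, -1), snap(samples, last_loud + 1, +1)))
    return phrases


if __name__ == "__main__":
    audio = [0.0] * 5 + [0.5, -0.5, 0.5] + [0.0] * 6 + [0.3, -0.3] + [0.0] * 2
    print(slice_phrases(audio))
```

Snapping outward rather than to the nearest crossing keeps the whole phrase intact, at the cost of including a little of the surrounding quiet material.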

Essentially it would be as if a person were locked in a room and all they had was a huge folder of audio they could slice musically and clip gain to create a montage resembling the source. The result would be a Reaper session populated with a ton of unprocessed items that are clip gained, etc., but not pitch shifted or filtered, and that sounds like Vulnicura (or whatever) harmonically and somewhat timbrally.

It could even be fun to have an option that is just sample placement with no clip gaining. Another option would be to offset the dB lost to clip gain by sending that amount to a reverb bus.

Lastly, whenever I do almost exactly this in Audioguide, the results would benefit from some iterations, as it is not so accurate harmonically (though amazing in other regards). Part of me wonders if some light ML might help get us closer to the target in this case, maybe with something as little as 1000 iterations of training.

Do let me know if you think you might know where to start in terms of accomplishing this or if you know anyone I should reach out to!

Hi Jonathan,

This seems to me like a rather ambitious research project, way beyond just putting together existing technologies. With the precision you seek, it is closer to computer-aided orchestration (with tools from the Orchidea/Orchids family) than to mosaicing.

The idea of keeping longer phrases from the source material is the original motivation behind the concatenative part of corpus-based concatenative synthesis. In my PhD, started in 1999, it was expressed as a concatenation cost that would be zero when taking consecutive units from the corpus. However, it was very unwieldy to weight this against the target cost, and for CataRT only the latter was used, together with mostly granular-style synthesis.
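A toy sketch of that trade-off, not the actual implementation from the thesis or from CataRT: each unit is reduced to a single descriptor value, the target cost is the distance to the target descriptor, and the concatenation cost is zero for consecutive corpus units and an arbitrary 1.0 otherwise. The function name `select_units`, the weight `concat_weight`, and the cost values are assumptions made for illustration; the best sequence is found with a standard Viterbi-style dynamic programme.

```python
def select_units(targets, corpus, concat_weight=1.0):
    """Pick one corpus unit per target frame, minimising total cost.

    targets : descriptor value per target frame
    corpus  : descriptor value per corpus unit, in file order
    Returns a list of corpus indices, one per target frame.
    """
    n = len(corpus)

    def target_cost(t, u):
        return abs(t - corpus[u])

    def concat_cost(prev, cur):
        # Zero when cur directly follows prev in the corpus.
        return 0.0 if cur == prev + 1 else 1.0

    cost = [target_cost(targets[0], u) for u in range(n)]
    back = []  # back[k][u] = best predecessor of unit u at frame k + 1
    for t in targets[1:]:
        new_cost, pointers = [], []
        for u in range(n):
            steps = [cost[p] + concat_weight * concat_cost(p, u) for p in range(n)]
            best_prev = min(range(n), key=steps.__getitem__)
            new_cost.append(steps[best_prev] + target_cost(t, u))
            pointers.append(best_prev)
        cost = new_cost
        back.append(pointers)

    # Trace the cheapest path backwards from the last frame.
    u = min(range(n), key=cost.__getitem__)
    path = [u]
    for pointers in reversed(back):
        u = pointers[u]
        path.append(u)
    return path[::-1]


if __name__ == "__main__":
    corpus = [1.0, 2.0, 3.0, 2.4, 0.5]
    targets = [1.0, 2.3, 3.0]
    print(select_units(targets, corpus, concat_weight=0.0))  # target cost only
    print(select_units(targets, corpus, concat_weight=1.0))  # favours phrases
```

With the weight at zero each frame just takes its closest unit; raising it makes the selection prefer runs of consecutive units, that is, longer intact phrases, even when an individual unit matches less well. Choosing that weight is exactly the unwieldy part.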

As you mention NMF, source separation is a third research area that might be relevant here, as in the Let It Bee approach by Driedger, Prätzlich, and Müller, where they integrated the continuity constraint into the optimisation algorithm.

Progress can be made by trying out the different existing tools and pinpointing where their built-in assumptions don’t match yours. Sometimes it might be just a technical limitation, sometimes this might reveal an interesting novel research question.

Best regards,