
Multi-channel learning in RAVE training

Hi there (I got some answers from Gemini, but I'm curious to hear from real humans).

Suppose I manually combine multiple mono tracks into one multichannel file, say a stereo file to start with, and train with --channel 2 in the config.

Will these two channels learn from each other, or do they just learn separately and independently? Will I get a stereo RAVE model with correlation between the R and L channels? Say I only feed mono audio to the R channel: can I take the response of the L channel as the correlated result?

For example, can I predict the L channel by feeding the R channel to the nn~ object in Max/MSP, plus its compressed latent features?

Additionally, if the above is true, what happens if I use more than 2 channels, like --channel 9 or --channel 299?

Here is the answer from Gemini:
Here are some relevant papers that discuss the correlation between L and R channels in a trained stereo model:

  1. Real-Time Audio Variational Autoencoders for Speech Processing (2019): [2111.05011] RAVE: A variational autoencoder for fast and high-quality neural audio synthesis
    This paper by Antoine Caillon et al. introduces the RAVE model, which is specifically designed for real-time audio processing using VAEs. While not explicitly focusing on correlation analysis, it mentions the shared latent space used by RAVE for stereo audio:

“For stereo audio data (x_L, x_R), the encoder takes both channels as input and projects them into a single latent space z.” (Section 3.1 - Encoder Architecture)

This shared latent space implies that the model learns a compressed representation capturing information from both L and R channels, which inherently reflects some level of correlation between them.

  2. Analyzing the Latent Spaces of Audio Variational Autoencoders (2021): https://arxiv.org/pdf/2302.09893
    This paper by Boi-Kien Trieu et al. investigates the properties of latent spaces in VAEs trained on audio data. It doesn’t directly address stereo models, but the concept of VAEs capturing relationships between audio features applies to stereo RAVE models as well:

“VAEs learn a latent representation z of the input data x that captures the most important statistical properties of x.” (Section 2.1 - Variational Autoencoders)

In a stereo RAVE model, these “important statistical properties” would include the relationships between the L and R channels.

  3. Learning Representations for Audio Inpainting (2020): Audio and Speech Processing
    This paper by Justin Park et al. explores using VAEs for audio inpainting, which involves reconstructing missing parts of an audio signal. While not directly related to stereo models, it emphasizes the ability of VAEs to capture relationships within audio data:

“VAEs are powerful tools for learning latent representations of complex data, such as audio. These representations capture the underlying structure of the data, allowing for tasks such as generation, compression, and anomaly detection.” (Section 2 - Related Work)

In a stereo RAVE model, the “underlying structure” would include the correlation between the L and R channels.

Important Note:
These papers don’t explicitly discuss using the latent space for direct correlation analysis between L and R channels. However, they provide a foundation for understanding how stereo RAVE models capture relationships between the channels through the shared latent space representation.

To be honest, I have never made a precise analysis of how RAVE (and, more generally, auto-encoders) correlates spatial information in the latent space. For stereo signals, I had the idea (maybe for the future) of feeding the model the M/S projection rather than L/R.
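
For reference, the M/S (mid/side) projection mentioned here could be sketched as follows. This is only an illustration in NumPy, not something RAVE is confirmed to do; the function names `lr_to_ms` and `ms_to_lr` are made up for the example, and the 0.5 scaling is just one common convention.

```python
import numpy as np

def lr_to_ms(stereo):
    """Convert a (2, n_samples) L/R array to mid/side (hypothetical helper)."""
    left, right = stereo
    mid = 0.5 * (left + right)    # what the two channels share
    side = 0.5 * (left - right)   # what differs between them
    return np.stack([mid, side])

def ms_to_lr(ms):
    """Inverse of lr_to_ms under the same 0.5 scaling convention."""
    mid, side = ms
    return np.stack([mid + side, mid - side])

rng = np.random.default_rng(0)
stereo = rng.standard_normal((2, 8))   # stand-in for a short stereo clip

ms = lr_to_ms(stereo)
restored = ms_to_lr(ms)                # round trip back to L/R

# A "dual mono" signal (identical L and R) has an all-zero side channel.
mono = rng.standard_normal(8)
dual_mono_ms = lr_to_ms(np.stack([mono, mono]))
```

The projection is lossless, so training on M/S instead of L/R would not discard any information; it only changes how the inter-channel relationship is presented to the network.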
What is certain, though, is that all channels (2, 4, or however many) are computationally bound together from the first layer of the encoder, and the same should hold in the generator (there is no separate generator for each channel). Computationally, it only comes down to giving different channels to the ConvNet, as with the RGB layers of an image.
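
To make the RGB analogy concrete, here is a minimal NumPy sketch of a multichannel 1D convolution. It is not RAVE's actual encoder code: `conv1d_multichannel`, the shapes, and the random weights are all assumptions for illustration. The point is only that each output feature sums over every input channel, so the channels are mixed from the very first layer.

```python
import numpy as np

def conv1d_multichannel(x, weight):
    """Valid (unpadded) 1D cross-correlation, as in typical conv layers.

    x:      (in_channels, n_samples)
    weight: (out_channels, in_channels, kernel_size)
    """
    out_channels, in_channels, kernel_size = weight.shape
    n_out = x.shape[1] - kernel_size + 1
    out = np.zeros((out_channels, n_out))
    for o in range(out_channels):
        for c in range(in_channels):          # sum over ALL input channels
            for k in range(kernel_size):
                out[o] += weight[o, c, k] * x[c, k:k + n_out]
    return out

rng = np.random.default_rng(0)
weight = rng.standard_normal((4, 2, 3))   # 4 feature maps from 2 audio channels
stereo = rng.standard_normal((2, 16))     # row 0 = L, row 1 = R

features = conv1d_multichannel(stereo, weight)

# Silence L only: every feature map changes, because each one
# combines contributions from both L and R.
right_only = stereo.copy()
right_only[0] = 0.0
features_right_only = conv1d_multichannel(right_only, weight)
```

With more channels only `in_channels` grows; the mixing principle stays the same. Whether the trained model then learns a useful correlation between channels depends on the data, which this sketch says nothing about.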
