
Support for multiple channels? #6

Open
ballerburg9005 opened this issue Mar 15, 2024 · 1 comment

ballerburg9005 commented Mar 15, 2024

I ran a test with a mono model, and it worked after commenting out the if clause that pertains to "rave.stereo". This attribute no longer seems to be present in current RAVE models; it has apparently been replaced by "n_channels" (or something similar) in order to support an arbitrary number of channels.
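Instead of deleting the check outright, a more defensive approach might be to probe for whichever attribute the exported model actually carries (a sketch; the stand-in classes below are hypothetical, only the attribute names "stereo" and "n_channels" come from the discussion):

```python
# Sketch: derive the channel count from whichever attribute the exported
# RAVE model actually has. Older exports carried a boolean "stereo" flag;
# newer ones expose "n_channels".
def get_n_channels(model) -> int:
    n = getattr(model, "n_channels", None)
    if n is not None:
        return int(n)
    # Fall back to the legacy boolean flag.
    return 2 if getattr(model, "stereo", False) else 1

# Quick check with stand-in objects instead of real RAVE exports.
class OldStereoExport:
    stereo = True

class NewMultiExport:
    n_channels = 4

print(get_n_channels(OldStereoExport()))  # 2
print(get_n_channels(NewMultiExport()))   # 4
print(get_n_channels(object()))           # 1
```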

The mention of "don't forget to set stereo=True" made me quite confident that RAVE-Latent-Diffusion would work with 2 channels. However, with a 2-channel model I got the typical error that occurs when only one channel is supported:

  File "/home/l0rd/.local/lib/python3.11/site-packages/cached_conv/convs.py", line 221, in forward
    def forward(self, x):
        x = nn.functional.pad(x, self._pad)
        return nn.functional.conv1d(
               ~~~~~~~~~~~~~~~~~~~~ <--- HERE
            x,
            self.weight,
RuntimeError: Given groups=1, weight of size [96, 32, 7], expected input[1, 16, 1048582] to have 32 channels, but got 16 channels instead

For lack of proper skills, I was only able to work around this with the following code, but that merely produced two identical mono channels. The same happens if similar code is used in a custom Python script unrelated to RLD.

1. Replace "reshape(1, 1, -1)" at line 43 in preprocess.py with:
x = torch.from_numpy(audio_data).reshape(1, rave.n_channels, -1)
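Whether that reshape is correct depends on the memory layout of audio_data, which the loader determines. A numpy sketch (standing in for torch) of the pitfall: if the array is planar (channel-first) the reshape is safe, but if it is interleaved (L R L R ...), reshape(1, n_channels, -1) silently scrambles the channels:

```python
import numpy as np

n_channels = 2
# Planar (channel-first) audio, shape (n_channels, n_samples), as
# librosa-style loaders typically provide it.
planar = np.array([[1., 2., 3., 4.],       # left
                   [10., 20., 30., 40.]])  # right

# If audio_data is already planar, the proposed reshape is safe:
x = planar.reshape(1, n_channels, -1)
print(x.shape)  # (1, 2, 4)

# But if audio_data is interleaved (L R L R ...), the same reshape
# mixes samples from both channels into each row; de-interleave first:
interleaved = planar.T.reshape(-1)  # [1, 10, 2, 20, 3, 30, 4, 40]
wrong = interleaved.reshape(1, n_channels, -1)
right = interleaved.reshape(-1, n_channels).T.reshape(1, n_channels, -1)
print(np.array_equal(right, x))  # True
print(np.array_equal(wrong, x))  # False
```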

It seems to me that RAVE now creates a tensor with an additional dimension per channel, whereas before it apparently just split the data into left/right along a single dimension.

Comparing the source of preprocess.py with RAVE's own preprocessing shows that load_audio_chunk uses ffmpeg to downmix to 1 channel, which means RLD trains on downmixed mono input, whereas RAVE's preprocess extracts the individual channels with an ffmpeg filter. So even if stereo worked before, it apparently worked in a way that was basically dual mono, and the downmixed audio would not be ideal to train on, since it no longer sounds the same as the original. That is still acceptable for normal music (which is mostly mono in practice), but not for some pure instruments or binaural recordings that convey spatial information (consider that there are now microphones with artificial earlobes that basically let you crudely "see" objects in a room through echolocation effects).
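The two ffmpeg strategies contrasted above can be sketched as command lines (the file names are placeholders; `-ac 1` and the channelsplit filter are standard ffmpeg options, but this is not the exact invocation either project uses):

```python
# Sketch of the two ffmpeg strategies; "input.wav", "mono.wav",
# "left.wav", and "right.wav" are placeholder names.

# Downmix everything to one channel (-ac 1): training then sees a mono
# sum that no longer matches either original channel.
downmix_cmd = [
    "ffmpeg", "-i", "input.wav",
    "-ac", "1",
    "mono.wav",
]

# Split the stereo file into its individual channels with the
# channelsplit filter, keeping each channel intact.
split_cmd = [
    "ffmpeg", "-i", "input.wav",
    "-filter_complex", "channelsplit=channel_layout=stereo[L][R]",
    "-map", "[L]", "left.wav",
    "-map", "[R]", "right.wav",
]

print(" ".join(downmix_cmd))
print(" ".join(split_cmd))
```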

So maybe it never worked with actual stereo to begin with: not only was the latent space dual mono, but the output was two identical mono channels as well?

I'm not good with all these ML APIs, but I would like to understand how to fix this properly, both here and in other RAVE tools with similar multi-channel problems. I would also like to know whether it is a huge hurdle to train a model on the latent space in a way that actually preserves true stereo and the spatial information it contains.

moiseshorta (Owner) commented Mar 18, 2024

Hey there,

Thanks for opening this thread. This is a fix I'm planning to implement, but I haven't found the time to yet.

> The mention of "don't forget to set stereo=True" made me quite confident that RAVE-Latent-Diffusion would work with 2 channels. However, with a 2 channel model I got the typical error that occurs when only one channel is supported:

Indeed, you're right that RLD never really supported true stereo RAVE training. It only worked with mono signals and then used the 'fake stereo' present in the early RAVE commits (hence the --stereo=True flag) to decode the mono RLD-generated latents into a faux-stereo image.
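The 'fake stereo' decode described here amounts to copying one mono signal onto both output channels, which is exactly why the output is double-sided identical mono (a numpy sketch of the idea, not the actual RAVE code):

```python
import numpy as np

# Mono signal of 5 samples, shape (1, 1, T) as a model might emit it.
mono = np.array([[[0.1, -0.2, 0.3, 0.0, 0.5]]])

# 'Fake stereo': duplicate the single channel. The result has a stereo
# shape but carries no inter-channel (spatial) information at all.
faux_stereo = np.repeat(mono, 2, axis=1)
print(faux_stereo.shape)                                     # (1, 2, 5)
print(np.array_equal(faux_stereo[0, 0], faux_stereo[0, 1]))  # True
```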

There should be a way to adapt the code so that it becomes multi-channel native. If you'd like to send a PR, please feel free. In the meantime, I'm expecting to have some time to update the RLD repo in the coming months.
