I ran a test with a mono model, and it worked after commenting out the if clause that checks "rave.stereo". This attribute no longer seems to be present in current RAVE models; it appears to have been replaced by n_channels (or something similar) in favor of supporting an arbitrary number of channels.
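For anyone hitting the same wall, a defensive guard along these lines might cover both model generations (a rough sketch; the fallback logic and model path are my assumptions, not RLD's actual code):

```python
import torch

# Hypothetical compatibility shim: old RAVE exports carried a `stereo` flag,
# newer ones expose `n_channels` instead. Fall back gracefully between them.
rave = torch.jit.load("model.ts")  # placeholder path
n_channels = getattr(rave, "n_channels",
                     2 if getattr(rave, "stereo", False) else 1)
```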
The mention of "don't forget to set stereo=True" made me quite confident that RAVE-Latent-Diffusion would work with 2 channels. However, with a 2-channel model I got the typical error that occurs when only one channel is supported:
File "/home/l0rd/.local/lib/python3.11/site-packages/cached_conv/convs.py", line 221, in forward
def forward(self, x):
x = nn.functional.pad(x, self._pad)
return nn.functional.conv1d(
~~~~~~~~~~~~~~~~~~~~ <--- HERE
x,
self.weight,
RuntimeError: Given groups=1, weight of size [96, 32, 7], expected input[1, 16, 1048582] to have 32 channels, but got 16 channels instead
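For context, the error is just a plain Conv1d channel mismatch between the weight's expected input channels and the tensor actually fed in; a minimal reproduction (with sizes shortened from the traceback) would be:

```python
import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=32, out_channels=96, kernel_size=7)  # weight: [96, 32, 7]
x = torch.randn(1, 16, 1024)  # only 16 channels where 32 are expected
conv(x)  # RuntimeError: ... expected input[1, 16, 1024] to have 32 channels, but got 16
```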
Despite my lack of proper skills, I was able to work around this with the change below, but it only produced dual mono output: two identical copies of the same channel. The same happens when similar code is used in a custom Python script unrelated to RLD.
1. Replace `reshape(1, 1, -1)` at line 43 in preprocess.py with:

```python
x = torch.from_numpy(audio_data).reshape(1, rave.n_channels, -1)
```
It seems to me that RAVE now creates the tensor with a dedicated dimension per channel, whereas before it apparently just split the data into left/right along a single dimension.
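To illustrate the layout difference (a toy sketch assuming n_channels=2, not RLD's actual data path):

```python
import torch

samples = torch.arange(8, dtype=torch.float32)  # stand-in for audio_data

old = samples.reshape(1, 1, -1)  # old layout: (1, 1, 8), one flat channel
new = samples.reshape(1, 2, -1)  # new layout: (1, 2, 4), explicit channel dim

# Caveat: reshape(1, 2, -1) only yields true stereo if the samples are stored
# channel-major (all left, then all right). Interleaved L/R data would need a
# transpose first, and already-downmixed mono can only ever give dual mono.
```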
Comparing RLD's preprocess.py against RAVE's own preprocessing shows that load_audio_chunk uses ffmpeg to downmix to 1 channel, which means RLD trains on downmixed mono input, whereas RAVE's preprocessing extracts the individual channels with an ffmpeg filter function. So even if stereo seemed to work before, it worked in a way that was basically dual mono, and the downmixed audio would not be ideal to train on, since it no longer sounds the same as the original. That is still acceptable for typical music (which is largely mono-compatible), but not for some pure instruments or binaural recordings that convey spatial information. (Consider that there are now microphones with artificial human earlobes that basically allow you to crudely "see" objects in a room through echolocation effects.)
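The contrast in ffmpeg terms looks roughly like this (a sketch using standard ffmpeg flags; the file names are placeholders, and this is not the literal code from either repo):

```python
import subprocess

# RLD-style: downmix everything to a single channel (-ac 1).
subprocess.run(["ffmpeg", "-i", "input.wav", "-ac", "1", "mono.wav"], check=True)

# RAVE-style: keep channels intact by splitting them out individually.
subprocess.run([
    "ffmpeg", "-i", "input.wav",
    "-filter_complex", "channelsplit=channel_layout=stereo[L][R]",
    "-map", "[L]", "left.wav",
    "-map", "[R]", "right.wav",
], check=True)
```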
So maybe it never worked with actual stereo to begin with, and not only was the latent space dual mono, the output was two identical mono channels as well?
I'm not well versed in these ML APIs, but I would like to understand how to fix this properly, both here and in other RAVE tools with similar multi-channel problems. I'd also like to know whether it is a huge hurdle to train a model on a latent space that actually keeps true stereo, and the spatial information it contains, alive.
Thanks for opening this thread; this is a fix I'm planning to implement but haven't found the time for.
> The mention of "don't forget to set stereo=True" made me quite confident that RAVE-Latent-Diffusion would work with 2 channels. However, with a 2-channel model I got the typical error that occurs when only one channel is supported.
Indeed, you're right: RLD never really supported true stereo RAVE training. It only worked with mono signals and then used the "fake stereo" present in early RAVE commits (hence the --stereo=True flag) to decode the mono RLD-generated latents into a faux-stereo image.
There should be a way to adapt the code so that it becomes multi-channel native. If you'd like to send a PR, please feel free. In the meantime, I'm expecting to have some time to update the RLD repo in the coming months.
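For what it's worth, a multi-channel-native flow might look roughly like this (a minimal sketch; the model path is a placeholder, and the encode/decode shapes are my assumptions based on RAVE's scripted exports, not an implemented RLD feature):

```python
import torch

rave = torch.jit.load("stereo_model.ts")  # placeholder path, n_channels == 2
audio = torch.randn(1, 2, 2**20)  # true stereo input: (batch, channels, samples)

z = rave.encode(audio)  # latents derived from both channels jointly
y = rave.decode(z)      # decodes back to 2 distinct channels, not dual mono
```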