torchaudio 11.0 yields different results than torchaudio 12.1 when loading MP3 #4889

Closed
patrickvonplaten opened this issue Aug 24, 2022 · 5 comments
Labels
bug Something isn't working

Comments

@patrickvonplaten
Contributor

Describe the bug

Loading Common Voice with torchaudio 0.11.0 gives different results than with 0.12.1, which causes problems in transformers; see huggingface/transformers#18749.

Steps to reproduce the bug

If you run the following code once with torchaudio==0.11.0+cu102 and once with torchaudio==0.12.1+cu102, you can see that the tensors differ. This is a significant breaking change and makes some integration tests fail in Transformers.

#!/usr/bin/env python3
from datasets import load_dataset
import datasets
import numpy as np
import torch
import torchaudio

print("torch version", torch.__version__)
print("torchaudio version", torchaudio.__version__)

# Step 1: run with save_audio = True in each environment (0.11.0 and 0.12.1).
# Step 2: set save_audio = False and load_audios = True to compare the saved arrays.
save_audio = True
load_audios = False

if save_audio:
    # Stream the first Common Voice sample and resample it to 16 kHz.
    ds = load_dataset("common_voice", "en", split="train", streaming=True)
    ds = ds.cast_column("audio", datasets.Audio(sampling_rate=16_000))
    ds_iter = iter(ds)
    sample = next(ds_iter)

    # The file name carries the torch version, which identifies the environment.
    np.save(f"audio_sample_{torch.__version__}", sample["audio"]["array"])
    print(sample["audio"]["array"])

if load_audios:
    # "Array 11" was saved under torch 1.11.0 (torchaudio 0.11.0).
    array_torch_11 = np.load("/home/patrick/audio_sample_1.11.0+cu102.npy")
    print("Array 11 Shape", array_torch_11.shape)
    print("Array 11 abs sum", np.sum(np.abs(array_torch_11)))
    # "Array 12" was saved under torch 1.12.1 (torchaudio 0.12.1).
    array_torch_12 = np.load("/home/patrick/audio_sample_1.12.1+cu102.npy")
    print("Array 12 Shape", array_torch_12.shape)
    print("Array 12 abs sum", np.sum(np.abs(array_torch_12)))

After saving the arrays in both environments, the comparison run prints:

torch version 1.12.1+cu102
torchaudio version 0.12.1+cu102
Array 11 Shape (122880,)
Array 11 abs sum 1396.4988
Array 12 Shape (123264,)
Array 12 abs sum 1396.5193
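
To put the shape difference above into numbers: 123264 - 122880 = 384 samples, which at the 16 kHz rate used in the script is 24 ms of audio. The sketch below computes that gap for any two saved arrays and also reports the largest sample-wise difference over the part they share. The helper name `describe_length_gap` is hypothetical, and comparing the arrays from index 0 assumes they are aligned at the start, which the thread does not confirm.

```python
import numpy as np


def describe_length_gap(a, b, sampling_rate=16_000):
    """Summarize how two decoded audio arrays differ in length and content.

    Hypothetical helper for illustration; assumes 1-D arrays that are
    aligned at index 0 (an assumption, not something the issue confirms).
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)

    extra_samples = b.shape[0] - a.shape[0]
    overlap = min(a.shape[0], b.shape[0])

    # Largest absolute difference on the region both arrays cover.
    if overlap > 0:
        max_abs_diff = float(np.max(np.abs(a[:overlap] - b[:overlap])))
    else:
        max_abs_diff = 0.0

    return {
        "extra_samples": int(extra_samples),
        "extra_ms": 1000.0 * extra_samples / sampling_rate,
        "max_abs_diff_on_overlap": max_abs_diff,
    }


if __name__ == "__main__":
    # Stand-in arrays with the shapes reported in the issue.
    old = np.zeros(122880, dtype=np.float32)
    new = np.zeros(123264, dtype=np.float32)
    print(describe_length_gap(old, new))
```

In practice the two `.npy` files from the reproduction script would be passed in place of the zero arrays. As an unverified observation: if the source clip were 48 kHz, 384 samples at 16 kHz would correspond to 1152 source samples, which is the size of one MPEG-1 Layer III frame.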

Expected results

torchaudio 0.11.0 and 0.12.1 should yield the same results.

Actual results

See above.

Environment info

  • datasets version: 2.1.1.dev0
  • Platform: Linux-5.18.10-76051810-generic-x86_64-with-glibc2.34
  • Python version: 3.9.7
  • PyArrow version: 6.0.1
  • Pandas version: 1.4.2
@patrickvonplaten
Contributor Author

Maybe we can just pass this along to torchaudio, @lhoestq @albertvillanova? It would be great if you could investigate whether the error lies in datasets or in torchaudio.

@lhoestq
Member

lhoestq commented Aug 24, 2022

torchaudio changed MP3 decoding in 0.12 (which affects Common Voice):

MP3 decoding is now handled by FFmpeg in sox_io backend. (pytorch/audio#2419, pytorch/audio#2428)

  • FFmpeg is now used as fallback in sox_io backend, and now MP3 decoding is handled by FFmpeg. To load MP3 audio with torchaudio.load, please install a compatible version of FFmpeg (Version 4 when using an official binary distribution).
  • Note that, whereas the previous MP3 decoding scheme pads the output audio, the new scheme does not. As a consequence, the new version returns shorter audio tensors.
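
Since the behavior changed at 0.12, a test that depends on decoded MP3 values could branch on the installed torchaudio version. The sketch below parses a version string such as `0.12.1+cu102` and reports whether it falls on the FFmpeg side of that change. The function names are hypothetical, and the 0.12 threshold is taken only from the release note quoted above; it says nothing about how later releases behave.

```python
import re


def parse_release(version: str) -> tuple:
    """Turn '0.12.1+cu102' into (0, 12, 1), ignoring the local '+cu102' tag."""
    public = version.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def mp3_decoded_by_ffmpeg(torchaudio_version: str) -> bool:
    """Hypothetical check: True for versions at or after the 0.12 change."""
    return parse_release(torchaudio_version) >= (0, 12)


if __name__ == "__main__":
    for v in ("0.11.0+cu102", "0.12.1+cu102"):
        print(v, mp3_decoded_by_ffmpeg(v))
```

A test suite could pass `torchaudio.__version__` to `mp3_decoded_by_ffmpeg` and select the matching set of expected values.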

@patrickvonplaten
Contributor Author

Do we have a solution for this now? Should we just upgrade to torchaudio 0.12.0 then?

@lhoestq
Member

lhoestq commented Oct 5, 2022

datasets supports torchaudio 0.12 if your environment supports reading MP3 with torchaudio, i.e. if you have ffmpeg>=4.
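
One rough way to check the ffmpeg requirement is to ask the `ffmpeg` command-line tool for its version. The sketch below does that with the standard library only; the helper names are hypothetical. It is only a proxy: torchaudio may load FFmpeg's shared libraries rather than the CLI, so finding (or not finding) the executable does not prove what torchaudio will see.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_ffmpeg_major(version_output: str) -> Optional[int]:
    """Extract the major version from `ffmpeg -version` output.

    Returns None when no numeric release is found (e.g. some git builds).
    """
    match = re.search(r"ffmpeg version n?(\d+)", version_output)
    return int(match.group(1)) if match else None


def ffmpeg_cli_major() -> Optional[int]:
    """Return the major version of the ffmpeg CLI on PATH, or None."""
    exe = shutil.which("ffmpeg")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "-version"], capture_output=True, text=True, check=False
    )
    return parse_ffmpeg_major(result.stdout)


if __name__ == "__main__":
    major = ffmpeg_cli_major()
    if major is None:
        print("ffmpeg CLI not found or version not recognized")
    else:
        print("ffmpeg major version:", major)
```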

@mariosasko
Collaborator

Closing as we no longer use torchaudio for decoding.
