Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Please provide an option to not depend on downloading model data #337

Open
bbhtt opened this issue Apr 9, 2024 · 10 comments
Open

Please provide an option to not depend on downloading model data #337

bbhtt opened this issue Apr 9, 2024 · 10 comments

Comments

@bbhtt
Copy link

bbhtt commented Apr 9, 2024

We build opus from git source and we'd like to not switch to tarballs or download the model tarball. Given the recent xz situation downloading tarballs with no trace on git is complicated and seems like a security issue.

Having an option to stop depending on that seems like it'd be great.

With the current meson configuration:

-Dfloat-api=true 
-Dasm=disabled 
-Dhardening=true 
-Dcustom-modes=true 
-Denable-deep-plc=true 
-Denable-dred=true 
-Denable-osce=true 
-Ddocs=enabled 
-Dextra-programs=disabled 
-Dtests=disabled 
-Dintrinsics=enabled 
-Drtcd=enabled

meson seems to fail on meson.build:636:24: ERROR: File dnn/fargan_data.h does not exist. and is not even aware of the download model script living in autogen.

Whereas the download model script is passing an unknown commit https://github.com/xiph/opus/blob/ab4e83598e7fc8b2ce82dc633a0fc0c452b629aa/autogen.sh#L12C24-L12C31

What is the source of that commit? Can this model data be found on git somewhere? Can this be implemented as a git submodule?

Looking through the git history seems at some point there was a submodule, which was reverted in favour of downloading the model tarball.

Thanks!

@bbhtt
Copy link
Author

bbhtt commented Apr 9, 2024

Hello, also why does the model tarball include ~130 MB of binary model files which the opus tarballs do not?

Also some of the headers are quite larger in model tarball than in opus tarballs

Model tarball:

Screenshot from 2024-04-09 23-46-38

Opus tarball:

Screenshot from 2024-04-09 23-47-55

@jmvalin
Copy link
Member

jmvalin commented Apr 9, 2024

In the model tarball downloaded from git, the models include all float weights (e.g. for debugging), whereas for the Opus release tarball, the float data was removed to make the release as small as possible.

@bbhtt
Copy link
Author

bbhtt commented Apr 9, 2024

What do you mean by downloading from git? The script is downloading from here https://media.xiph.org/opus/models/

Also it makes really hard to know which model tarball to update to, because the expected tarball commit is in the script only. If you are building in an environment without network access you need to know which filename to download beforehand.

the models include all float weights (e.g. for debugging), whereas for the Opus release tarball, the float data was removed to make the release as small as possible.

Is there some kind of docs on how to do this? Also knowing the source of the model data would be great too. How is the model tarball produced?

Thanks

@jmvalin
Copy link
Member

jmvalin commented Apr 9, 2024

You can run scripts/shrink_model.sh to remove the debug float data from the extracted models. From there, you can just run "make dist" to build a tarball that has just what's needed.

@heftig
Copy link

heftig commented Apr 18, 2024

I also ran into this issue. Could you add the shrunken model data to the opus git repository, please, so that they're covered by your signed tags and opus can be built without any further downloads? The model tarballs are not an option as they're entirely unverified (and also very large).

@Erick555
Copy link

Maybe you can also provide an option to disable this new feature.

@jmvalin
Copy link
Member

jmvalin commented Apr 23, 2024

The newer models are just too big for git. That being said, the main branch has been updated to verify them so they cannot be compromised on the download server.

@nanonyme
Copy link

Looks like the process is now more or less documented. Would it be possible to still have this checksum a separate file so it wouldn't need to be parsed from a shell script?

@jmvalin
Copy link
Member

jmvalin commented Apr 27, 2024

I'm not against, but I don't have the skills to update the Windows version of the script. @mklingb any thoughts on that?

@nanonyme
Copy link

The batch file would probably only turn simpler by having the checksum in a separate file rather than having it parse the shell script.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants