Web Assembly builds? #28

Closed
amit777 opened this issue Apr 28, 2020 · 27 comments

@amit777

amit777 commented Apr 28, 2020

I was wondering if there are any plans to have wasm builds as well?

@bblanchon
Owner

No, there is no such plan but I would accept a PR ;-)
I'm not familiar with WebAssembly; do you know what changes it would require?

@amit777
Author

amit777 commented Apr 28, 2020

I think there was a discussion about creating a build target for it. Basically, you use Emscripten instead of the built-in C++ compiler to generate the WebAssembly files.

Here is an example repo: https://github.com/urish/pdfium-wasm

However, its build process seems a bit outdated.

@jerbob92
Contributor

@bblanchon we're looking into using the WASM build for go-pdfium, see: klippa-app/go-pdfium#60
We would then embed the WASM binary into the repository, which would make the plugin much easier to use: it wouldn't need CGO anymore, and because pdfium would be embedded, users would no longer need to install pdfium globally.

Would it be possible to also compile WASM in standalone mode so that we can use it for that purpose?

@bblanchon
Owner

@jerbob92, I'm not so familiar with WASM...
Could I break any existing code by enabling the standalone mode?

@bblanchon
Owner

I found the answer to my own question:

Do we need non-standalone Wasm?

Why does the STANDALONE_WASM flag exist? In theory Emscripten could always set STANDALONE_WASM, which would be simpler. But standalone Wasm files can't depend on JS, and that has some downsides:

  • We can't minify the Wasm import and export names, as the minification only works if both sides agree, the Wasm and what loads it.
  • Normally we create the Wasm Memory in JS so that JS can start to use it during startup, which lets us do work in parallel. But in standalone Wasm we have to create the Memory in the Wasm.
  • Some APIs are just easy to do in JS. For example __assert_fail, which is called when a C assertion fails, is normally implemented in JS. It takes just a single line, and even if you include the JS functions it calls, the total code size is quite small. On the other hand, in a standalone build we can't depend on JS, so we use musl's assert.c. That uses fprintf, which means it ends up pulling in a bunch of C stdio support, including things with indirect calls that make it hard to remove unused functions. Overall, there are many such details that end up making a difference in total code size.

If you want to run both on the Web and elsewhere, and you want 100% optimal code size and startup times, you should make two separate builds, one with -s STANDALONE and one without. That's very easy as it's just flipping one flag!

Source: https://v8.dev/blog/emscripten-standalone-wasm#do-we-need-non-standalone-wasm%3F

@jerbob92
Contributor

jerbob92 commented Sep 19, 2022

Hi @bblanchon,

I wasn't completely clear in my question; my intention was indeed to have both a web build and a standalone build.

@bblanchon
Owner

I wonder if it's useful to provide both: it seems that the standalone build could work for everyone.
Sure it's not optimal, but I think in the case of PDFium the difference won't be noticeable, since the library is pretty big.
Moreover, what naming convention should I follow? I tried to look at existing projects that provide both builds but couldn't find any.

@jerbob92
Contributor

I wonder if it's useful to provide both: it seems that the standalone build could work for everyone.
Sure it's not optimal, but I think in the case of PDFium the difference won't be noticeable, since the library is pretty big.

Looking at this article: https://github.com/emscripten-core/emscripten/wiki/WebAssembly-Standalone
It looks like it still generates a JS file when you add -s STANDALONE_WASM. I didn't know that, so there is indeed no need to supply 2 versions. I agree that the optimizations are minimal in a library as big as PDFium.

@bblanchon
Owner

Here is a build that ran with -s STANDALONE=1:
https://github.com/bblanchon/pdfium-binaries/actions/runs/3081803559
You can download the artifact here:
https://github.com/bblanchon/pdfium-binaries/suites/8355824753/artifacts/368193028
Please give it a try!

@jerbob92
Contributor

Thanks a lot! Will do soon when I have some time, hopefully tonight!

I do see that this has been added to the JS file:

err('warning: running JS from STANDALONE_WASM without WASM_BIGINT will fail if a syscall with i64 is used (in standalone mode we cannot legalize syscalls)');

This might become an issue.

@jerbob92
Contributor

@bblanchon I'm making some progress in getting it to work, but I'm getting some odd internal issues. Would it be possible to create a debug build so I can see the stacktrace? I can only see things like this now:

        .$1557(i32,i32)
        .$1556(i32) i32
        .$1558(i32)
        .$1567(i32) i32
        .$1588(i32,i32)
        .$1589(i32) i32
        .$1591() i32
        .$1592() i32
        .$1596(i32,i32,i32,i32,i32,i32) i32
        .$1598(i32,i32,i32,i32,i32) i32
        .$1604(i32,i32,i32,i32) i32
        .$1623(i32,i32,i32,i32,i32,i32) i32

I believe the flag is called --profile.

@jerbob92
Contributor

@bblanchon never mind, the flag is called -g, and I managed to make my own build using this repo. It was super easy, so thanks for making your build system so easy to use 👏

@bblanchon
Owner

Awesome! Let me know if I can help you with anything else.

@jerbob92
Contributor

@bblanchon Thanks! Nothing for now, just some FYI's:

  • Pdfium compiles fine with the latest version of Emscripten (3.1.22)
  • Emscripten relies on the JS FS implementation, which of course isn't available in Standalone mode
  • The FS implementation can be disabled with -s NO_FILESYSTEM=1, but Emscripten then uses a very simple standalone fs implementation that is far from complete, and not complete enough to make pdfium work (we need urandom): https://github.com/emscripten-core/emscripten/blob/f5a1916484da9b2dfb4242237f8fb7b29d42c501/system/lib/standalone/standalone.c#L86
  • Emscripten is working on a new filesystem, WasmFS, see: https://emscripten.org/docs/api_reference/Filesystem-API.html
  • By default it has a virtual filesystem in WASM memory, instead of in JS, but different backends exist and more can be made
  • The default implementation has a urandom implementation that is needed for pdfium to work
  • For some reason it's not working yet, I'm talking with the Emscripten maintainers to find out why
  • It can be enabled with -s WASMFS=1
  • We're also talking about making a full WASI backend for WasmFS so that we can just let Wazero handle the filesystem and everything goes through the file system defined by Wazero.

Also:

  • I tried to compile with wasi-sdk, as it seems simpler and, compared to Emscripten, easier to use server-side
  • I got the build scripts adjusted to wasi-sdk to the point that compilation could start, but it fails at some point due to missing pthreads
  • It seems that they haven't implemented threading yet, and it also doesn't seem to be on the roadmap
  • Do you know if pdfium actually needs it, or whether we can stub it out? pthreads seems to be used quite often across different dependencies

@bblanchon
Owner

I don't know if PDFium works without threads.
However, you can probably use it without a complete filesystem: simply use FPDF_LoadMemDocument() instead of FPDF_LoadDocument().

@jerbob92
Contributor

Yeah, I know we don't need a full filesystem, but it does at least need a working open call to open /dev/urandom, and that's where I'm stuck right now.

@jerbob92
Contributor

@bblanchon I'm making some progress, I have gotten /dev/urandom working now. PDFium is now giving the following error:

[FATAL:partition_bucket.cc(618)] Check failed: adjusted_next_partition_page + slot_span_reservation_size <= root->next_partition_page_end.

Do you have any idea what could cause that? Looking at the code doesn't really give me a clue. I'll do more research tomorrow into what that actually means.

@bblanchon
Owner

The PDFium team updated the "partition_allocator" last week.
Maybe it has some issue with Wasm; I'll test it in the browser to see if it still works.

@jerbob92
Contributor

Great catch, I had no idea! I just compiled with 5254 and with the partition_allocator patch for wasm disabled, and I now have a build that works better! It now seems to run out of memory, ending up in FX_OutOfMemoryTerminate, but that should be fixable.

@jerbob92
Contributor

jerbob92 commented Sep 24, 2022

@bblanchon even with 5254, it looks like it works more, but that might just be the difference between the two allocators.
It looks like Emscripten does not support calling mmap with an address hint, which PDFium does:
.__syscall_mmap2(8388608,2097152,3,34,4294967295,0)
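For readers unfamiliar with the raw numbers in that call, here is a minimal sketch that decodes them. It assumes the six arguments are (addr, length, prot, flags, fd, offset) and that the constants have the usual Linux/musl values; decode_mmap is a hypothetical helper name, not part of Emscripten or PDFium.

```shell
#!/bin/sh
# Decode the arguments of the logged __syscall_mmap2 call.
# Assumption: argument order is (addr, length, prot, flags, fd, offset)
# and the bit values are the usual Linux/musl ones.
decode_mmap() {
  addr=$1; len=$2; prot=$3; flags=$4; fd=$5; off=$6

  # Reinterpret an unsigned 32-bit value as signed (4294967295 -> -1).
  if [ "$fd" -ge 2147483648 ]; then
    fd=$((fd - 4294967296))
  fi

  prot_names=""
  if [ $((prot & 1)) -ne 0 ]; then prot_names="${prot_names}PROT_READ|"; fi
  if [ $((prot & 2)) -ne 0 ]; then prot_names="${prot_names}PROT_WRITE|"; fi
  if [ $((prot & 4)) -ne 0 ]; then prot_names="${prot_names}PROT_EXEC|"; fi

  flag_names=""
  if [ $((flags & 1)) -ne 0 ]; then flag_names="${flag_names}MAP_SHARED|"; fi
  if [ $((flags & 2)) -ne 0 ]; then flag_names="${flag_names}MAP_PRIVATE|"; fi
  if [ $((flags & 16)) -ne 0 ]; then flag_names="${flag_names}MAP_FIXED|"; fi
  if [ $((flags & 32)) -ne 0 ]; then flag_names="${flag_names}MAP_ANONYMOUS|"; fi

  # Strip the trailing "|" from each name list before printing.
  printf 'addr=0x%x len=%d prot=%s flags=%s fd=%d off=%d\n' \
    "$addr" "$len" "${prot_names%|}" "${flag_names%|}" "$fd" "$off"
}

decode_mmap 8388608 2097152 3 34 4294967295 0
```

Under those assumptions, the call asks for a 2 MiB anonymous private read/write mapping at address 0x800000. MAP_FIXED is not set, so the address is only a hint, which matches the description above.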

When looking at the code, this has not been implemented: https://github.com/emscripten-core/emscripten/blob/1bbb5f8dea4422768bdc9a9a44ce8a9eb6dd39c9/system/lib/libc/emscripten_mmap.c#L119

This was changed a few months ago: emscripten-core/emscripten@d81e048

Any idea if we can remove the address space randomization? It seems to be causing all of the issues right now (needing /dev/urandom, not being able to call mmap).

@jerbob92
Contributor

@bblanchon I'm fairly certain that all the memory issues are caused by the same thing, and there is already a long thread about it at Emscripten: emscripten-core/emscripten#14459

So I think the current WASM version is not really usable in any form (Web or Standalone).

@jerbob92
Contributor

@bblanchon I have been testing some more, and pdfium recently added the build option pdf_use_partition_alloc. If you set it to false for WASM, it fixes a lot of these issues, because it doesn't use the partition allocator anymore.

@bblanchon
Owner

That's excellent news, @jerbob92!
Do you want to make a PR?
Here are the lines that set the configuration for WASM:

wasm):
echo 'pdf_is_complete_lib = true'
echo 'is_clang = false'
;;

@jerbob92
Contributor

@bblanchon I'm doing some more tests now to see what needs to be changed to make this work completely both server-side and in the browser, but the most important change is indeed adding that line there.
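As a rough sketch of that change, the WASM case quoted above could gain one more GN argument. The surrounding function (gn_args_for) is hypothetical and only there to make the fragment runnable on its own; the real build script may be structured differently.

```shell
#!/bin/sh
# Hypothetical wrapper around the case statement quoted in the thread.
# It prints the GN args for a given target; only "wasm" is sketched here.
gn_args_for() {
  case "$1" in
    wasm)
      echo 'pdf_is_complete_lib = true'
      echo 'is_clang = false'
      # Added line: opt out of the partition allocator for WASM,
      # as suggested in the thread.
      echo 'pdf_use_partition_alloc = false'
      ;;
  esac
}

gn_args_for wasm
```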

@jerbob92
Contributor

jerbob92 commented Nov 23, 2022

@bblanchon Might be good to give you some sort of update:

I have been working together with the Wazero team to get a working runtime for the compiled pdfium (we needed to implement some WASI calls to get font/file reading to work). I have also been patching Emscripten to proxy syscalls to WASI to make that actually end up at the application. However, once these syscalls are implemented in Emscripten, they do not show up anymore in the generated JS, meaning that these features then suddenly break in the JS build.

What I also experienced is that there are a lot of different WASM features that are not finalized yet so they are not implemented in every engine. Emscripten allows you to turn them on/off with a flag.

My conclusion is that we can't really get away with 1 WASM build for all environments. I think we need at least 2 builds:

  • Web
  • Standalone

Then there's a load of flags that you can pass into Emscripten to change its behaviour, like WASMFS and enabling WASM Exception support, which also change what is required of the runtime. I would not use them right now, until there's support for those in more engines.

Then there's also the thing that Emscripten emits a lot of methods that the runtime should implement before it works:

  • emscripten_notify_memory_growth: when the memory size changes
  • _emscripten_throw_longjmp: I'm not sure yet, possibly when an exception hasn't been caught?
  • invoke_***: when the program uses exceptions and/or long jumps, the type of function will depend on what is being called, so it could be like invoke_ii and invoke_viiii, I have seen examples with like 70 of them.
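To make the invoke_* naming a bit more concrete, here is a small sketch that expands such a name into the import signature a runtime would have to provide. It assumes Emscripten's usual signature-letter convention (first letter is the return type, the remaining letters are parameter types, with v = void, i = i32, j = i64, f = f32, d = f64) and that every invoke_* import takes the function table index as an extra first i32 argument. The helper names are hypothetical.

```shell
#!/bin/sh
# Map one signature letter to a WebAssembly type name.
# Assumption: this follows Emscripten's signature-letter convention.
wasm_type() {
  case "$1" in
    v) echo void ;;
    i) echo i32 ;;
    j) echo i64 ;;
    f) echo f32 ;;
    d) echo f64 ;;
    *) echo unknown ;;
  esac
}

# Expand e.g. "invoke_viiii" into "(params) -> return type".
describe_invoke() {
  sig=${1#invoke_}          # "viiii"
  rest=${sig#?}             # parameter letters: "iiii"
  ret=${sig%"$rest"}        # return letter: "v"
  params="i32 index"        # assumed extra first argument: table index
  while [ -n "$rest" ]; do
    tail=${rest#?}
    c=${rest%"$tail"}       # first remaining letter
    params="$params, $(wasm_type "$c")"
    rest=$tail
  done
  echo "$1: ($params) -> $(wasm_type "$ret")"
}

describe_invoke invoke_ii
describe_invoke invoke_viiii
```

A runtime author could run something like this over the import list of the compiled module to see how many distinct host functions need to be supplied.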

My suggestion for now would be to create two builds:

  • Web: emscripten build with the same settings as before
  • Standalone: emscripten build with the STANDALONE_WASM=1 flag and compile a list of runtimes that support the required imports of emscripten (emscripten_* and invoke_*) to include in the README.md.
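The two-variant proposal could be sketched as a small selector for the extra Emscripten flags. The function name and variant names are hypothetical; the only flag used is the STANDALONE_WASM=1 setting named in the list above, and the web variant deliberately adds nothing ("same settings as before").

```shell
#!/bin/sh
# Hypothetical helper: print the extra Emscripten flags for a build variant.
emcc_variant_flags() {
  case "$1" in
    web)
      # Same settings as before: no extra flags.
      : ;;
    standalone)
      echo '-s STANDALONE_WASM=1' ;;
    *)
      echo "unknown variant: $1" >&2
      return 1 ;;
  esac
}

for variant in web standalone; do
  printf '%s: [%s]\n' "$variant" "$(emcc_variant_flags "$variant")"
done
```

Keeping the difference down to one flag mirrors the point quoted earlier in the thread that the two builds differ only by that setting.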

I will let you know once I have my Emscripten patches completed enough to implement all the required syscalls/WASI calls.

Once we're able to compile pdfium with wasi-sdk, we should probably switch the standalone build to that, seems like a way better choice for server-side runtimes.

@bblanchon
Owner

Thanks for the update.
In summary, you want me to add pdf_use_partition_alloc and split the WASM build into two variants.
Can I do this right away, or is there anything else I should know?

@jerbob92
Contributor

jerbob92 commented Nov 24, 2022

I think for now, add pdf_use_partition_alloc and keep it at 1 build (without the standalone flag) to fix the memory issues in any runtime. I will open an MR when my Emscripten patches are ready and the standalone build is actually usable; we can create the second build variant after that :)
