Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch from Wasmer to Wasmtime #3349

Merged
merged 10 commits into from
Jun 28, 2024
Merged

Conversation

bjorn3
Copy link
Contributor

@bjorn3 bjorn3 commented May 14, 2024

Zellij is currently unable to update Wasmer as the currently used version is the last one without mandatory wasix support which Zellij doesn't want to adopt for various reasons. In the mean time a CVE has been found in the Cranelift version used by this Wasmer version. While it has been worked around in #2830, future CVE's may not be possible to workaround without updating Cranelift and thus Wasmer. This PR switches from Wasmer to Wasmtime to avoid all these problems.

Each individual commit builds. All commits except for the last two commits keep using Wasmer and thus could land without the actual switch to Wasmtime if preferred.

@imsnif imsnif self-assigned this May 14, 2024
.appender("logFile")
.build("wasmer_compiler_cranelift", LevelFilter::Warn),
.appender("logPlugin")
.build("wasmtime_wasi", LevelFilter::Warn),
Copy link
Contributor Author

@bjorn3 bjorn3 May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this go to logFile or logPlugin? On the one hand these are logs in response to plugin actions, on the other hand they are not logs directly produced by the plugin itself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't matter that much, they go into the same file in the end. But might indeed be nicer to change it to logPlugin since then we'll see the plugin id too.

@bjorn3
Copy link
Contributor Author

bjorn3 commented May 15, 2024

In Wasmer trying to create a file with a relative path creates it in the /host directory. In Wasmtime it will result in an error that there is no "pre-opened file descriptor" through which the file can be opened. As WASI doesn't have any such concept as a current working directory, the Wasmtime behavior is more natural. It does break the current released version of zjstatus however. dj95/zjstatus@25e3a56 works around this. I tried to work around the difference, but while it fixed this specific issue, it broke a lot more things. As such I would prefer documenting the difference rather than making Wasmtime behave the same way as Wasmer.

@bjorn3
Copy link
Contributor Author

bjorn3 commented May 15, 2024

CI should be fixed now.

@bjorn3
Copy link
Contributor Author

bjorn3 commented May 19, 2024

(Rebased. Will likely keep doing this as soon as conflicts arise to avoid accumulating conflicts.)

@syrusakbary
Copy link

Given that this PR can break many behaviors for plugins on Zellij, we can commit to update Cranelift in Wasmer very soon, so it doesn't become a problem for the project

@imsnif
Copy link
Member

imsnif commented May 20, 2024

Given that this PR can break many behaviors for plugins on Zellij, we can commit to update Cranelift in Wasmer very soon, so it doesn't become a problem for the project

Much appreciated @syrusakbary !! But since I don't want you to perform extra work, I will emphasize something I mentioned elsewhere: we will not consider any upgrade of wasmer that includes wasix or any sort of extra dependency that comes with wasix.

Given that I understood this is a big ask for your project (totally legit ofc, every project and application has different requirements), I gave @bjorn3 the go-ahead for this work. If all checks out with it (I haven't gone over it yet) and we manage to iron out the differences with plugins you rightfully mentioned (some of them have already been brought up as points of order by @bjorn3 ) - we will go through with this change.

@syrusakbary
Copy link

We will not consider any upgrade of wasmer that includes wasix or any sort of extra dependency that comes with wasix. Given that I understood this is a big ask for your project

Is not a big ask at all! Because WASIX is a superset of WASI, is actually feasible to only ship the base calls of WASI –for example– via a feature flag in Cargo. We can do that for you guys so you can upgrade Wasmer easily and without the burden of changing runtimes and plugins!

I've created this issue so we can follow up on that quickly: wasmerio/wasmer#4722

@imsnif
Copy link
Member

imsnif commented May 20, 2024

Thanks @syrusakbary ! As mentioned here and elsewhere, a feature flag will probably not suffice. We also need things like dependencies that are only used in wasix not to be included at all (for our packagers). This would likely have to be an entirely different crate (though I of course do not know about your internals).

But as I said - the work has already started here, so I intend to see it through. If we hit a road block that prevents us from migrating, or feel the migration is too risky - I'll be sure to check out the work in the linked issue to see if it would be a better path forward for us.

Thanks for your efforts and I am sorry things have to be this way, but we've sounded this alarm before a few times and we have to think of our application first.

@syrusakbary
Copy link

syrusakbary commented May 20, 2024

@imsnif of course, no worries! Do what you think is best for the project.

Just a minor clarification: on the Wasmer side, we can turn off the WASIX syscalls, dependencies and code paths incredibly easily via a feature flag, so a new crate shall not be needed if you would like to stick with Wasmer :)

Copy link
Member

@imsnif imsnif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @bjorn3 - first, thanks so much for your thorough work on this. This is a major infrastructure change for us and I do not want to take it lightly - so I hope you don't mind being patient with my review cycle.

I'm going on a short vacation tomorrow and didn't want to leave this hanging without a response while I'm away. So I added some comments about things that immediately jumped out at me to get us started. I will give this a more thorough shake-up and review when I get back around the beginning of June.

In Wasmer trying to create a file with a relative path creates it in the /host directory. In Wasmtime it will result in an error that there is no "pre-opened file descriptor" through which the file can be opened. As WASI doesn't have any such concept as a current working directory, the Wasmtime behavior is more natural. It does break the current released version of zjstatus however. dj95/zjstatus@25e3a56 works around this. I tried to work around the difference, but while it fixed this specific issue, it broke a lot more things. As such I would prefer documenting the difference rather than making Wasmtime behave the same way as Wasmer.

I've been thinking about this specific issue and while I hear you about the difficulty with the behavior change, I'm not enthusiastic about it. zjstatus is just one example that we happened to catch, I'm sure there are more. Backwards compatibility of already-compiled plugins is a strong value for our ecosystem.

I'm not saying this is a deal-breaker, but I'd like to understand the trade-offs before going ahead with this. What other stuff breaks if we try to work around this problem? How big of a hack is it? Will it affect upgrades? What else?

@@ -84,7 +84,7 @@ fn start_zellij(channel: &mut ssh2::Channel) {
)
.unwrap();
channel.flush().unwrap();
std::thread::sleep(std::time::Duration::from_secs(1)); // wait until Zellij stops parsing startup ANSI codes from the terminal STDIN
std::thread::sleep(std::time::Duration::from_secs(3)); // wait until Zellij stops parsing startup ANSI codes from the terminal STDIN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea why we need the increase here? I don't mind increasing it in in this case of the e2e tests if we need it, but I'd prefer to understand what happened.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without it, the first snapshot for several tests seem to be taken while still compiling wasm, which suggests that compiling is slower. This is surprising to me as locally it actually seemed to be faster. Maybe the singlepass compiler of Wasmtime is slower than the one of Wasmer and only the optimizing compiler is faster?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: Winch is about 2-6 times slower compiling than Wasmer's Singlepass for non-trivial applications (>2Mb), you can check more examples such as time to compile Spidermonkey or ffmpeg on each runtime on this article:

https://wasmi-labs.github.io/blog/posts/wasmi-v0.32/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All builtin plugins together add up to 2MB, but point taken. I agree Winch being a fair bit slower at compiling is likely the cause here. Caching compiled wasm modules between tests rather than recompiling every single time would likely help a lot. I haven't looked into how hard that would be, but Zellij already caches compiled wasm modules under normal operation. To be honest I'm surprised it isn't already cached between tests. Or maybe it is already getting cached, but would just need a warmup with a higher timeout before any tests run?

zellij-server/Cargo.toml Outdated Show resolved Hide resolved
@bjorn3
Copy link
Contributor Author

bjorn3 commented May 27, 2024

I've been thinking about this specific issue and while I hear you about the difficulty with the behavior change, I'm not enthusiastic about it. zjstatus is just one example that we happened to catch, I'm sure there are more. Backwards compatibility of already-compiled plugins is a strong value for our ecosystem.

I just reproduced this behavior in standalone Wasmer. Turns out it is completely disregarding the intention of the WASI specification:

fn main() {
    println!("{:?}", std::fs::write("foo", ""));
    println!("{:?}", std::fs::read("foo"));
    println!("{:?}", std::fs::read_dir("/").map(|dir| dir.collect::<Vec<_>>()));
    println!("{:?}", std::fs::read_dir(".").map(|dir| dir.collect::<Vec<_>>()));
    println!("{:?}", std::fs::read_dir("/tmp/foo/x").map(|dir| dir.collect::<Vec<_>>()));
}

when run in wasmtime with a /tmp/foo/x directory (containing my_file) passed through to the guest will print:

Err(Custom { kind: Uncategorized, error: "failed to find a pre-opened file descriptor through which \"foo\" could be opened" })
Err(Custom { kind: Uncategorized, error: "failed to find a pre-opened file descriptor through which \"foo\" could be opened" })
Err(Custom { kind: Uncategorized, error: "failed to find a pre-opened file descriptor through which \"/\" could be opened" })
Err(Custom { kind: Uncategorized, error: "failed to find a pre-opened file descriptor through which \".\" could be opened" })
Ok([Ok(DirEntry("/tmp/foo/x/my_file"))])

as I would have expected while wasmer prints

Ok(())
Ok([])
Ok([Ok(DirEntry("/.app")), Ok(DirEntry("/.private")), Ok(DirEntry("/bin")), Ok(DirEntry("/dev")), Ok(DirEntry("/etc")), Ok(DirEntry("/foo")), Ok(DirEntry("/tmp"))])
Ok([Ok(DirEntry("./.app")), Ok(DirEntry("./.private")), Ok(DirEntry("./bin")), Ok(DirEntry("./dev")), Ok(DirEntry("./etc")), Ok(DirEntry("./foo")), Ok(DirEntry("./tmp"))])
Ok([Ok(DirEntry("/tmp/foo/x/my_file"))])

In other words rather than passing through the directories the user specified as pre-opened directories as seems to be intended by the WASI specification, it creates an in-memory virtual filesystem whose contents are discarded upon exit and inside this VFS it mounts the user specified directories. The pre-opened directories it passes are / (fd 4), / (fd 5) and . (fd 6). )Yes, it passes / twice for whatever reason. I can imagine that some guest languages will get upset about that.) Wasmtime directly passes /tmp/foo/x as a pre-opened directory instead.

$ RUST_LOG=wasmer_wasix::syscalls::wasi::fd_prestat_dir_name=trace wasmer run target/wasm32-wasi/debug/foo.wasm
2024-05-27T17:00:01.380160Z TRACE ThreadId(01) fd_prestat_dir_name: wasmer_wasix::syscalls::wasi::fd_prestat_dir_name: return=Errno::success fd=3 path="/"
2024-05-27T17:00:01.380179Z TRACE ThreadId(01) fd_prestat_dir_name: wasmer_wasix::syscalls::wasi::fd_prestat_dir_name: close time.busy=31.2µs time.idle=8.81µs fd=3 path="/"
2024-05-27T17:00:01.380201Z TRACE ThreadId(01) fd_prestat_dir_name: wasmer_wasix::syscalls::wasi::fd_prestat_dir_name: return=Errno::success fd=4 path="/"
2024-05-27T17:00:01.380205Z TRACE ThreadId(01) fd_prestat_dir_name: wasmer_wasix::syscalls::wasi::fd_prestat_dir_name: close time.busy=4.83µs time.idle=448ns fd=4 path="/"
2024-05-27T17:00:01.380213Z TRACE ThreadId(01) fd_prestat_dir_name: wasmer_wasix::syscalls::wasi::fd_prestat_dir_name: return=Errno::success fd=5 path="."
2024-05-27T17:00:01.380217Z TRACE ThreadId(01) fd_prestat_dir_name: wasmer_wasix::syscalls::wasi::fd_prestat_dir_name: close time.busy=4.09µs time.idle=326ns fd=5 path="."
[...]

It would probably be possible to emulate this with wasmtime-wasi, but I think it would only hide bugs inside plugins. From what I can tell before dj95/zjstatus@25e3a56 the .zjstatus.log file would have been discarded rather than saved on the host as was intended, however based on https://discord.com/channels/771367133715628073/1240239515071942727/1240246058970517556 it seems like it somehow gets stored in the path mounted at /host anyway. If you do still want me to implement support for emulating the Wasmer behavior, I will do so. Maybe it could be a future compatibility warning when it seems like the plugin depends on the Wasmer behavior.

@imsnif
Copy link
Member

imsnif commented Jun 7, 2024

Hey @bjorn3 - I took this for a spin and am seeing some weird behavior. Could it be that a plugin's state is being reset when a new instance is loaded for some reason?

To reproduce:

  1. Clone this project: https://github.com/zellij-org/rust-plugin-example
  2. Start Zellij with the layout in this project (zellij -l plugin-dev-workspace.kdl)
  3. Once the plugin has rendered and you've given it permissions, change the input-mode a few times (eg. move to Tab then to Pane) - see that the mode counts have been logged in the plugin's UI
  4. Open a new tab (an identical tab will open, because we've loaded the plugin-dev-workspace.kdl layout globally, so another instance of the plugin will be loaded in the new tab)
  5. Go back to the first tab and see that the counts have been reset (in zellij 0.40.1 they are not reset).

@bjorn3
Copy link
Contributor Author

bjorn3 commented Jun 7, 2024

I can't reproduce this behavior locally. For me the counts are preserved as I switch between tabs.

Peek.2024-06-07.17-36.webm

@imsnif
Copy link
Member

imsnif commented Jun 7, 2024

Also when the second plugin instance is initially launched? That's when it gets erased for me.

@bjorn3
Copy link
Contributor Author

bjorn3 commented Jun 7, 2024

Yes, opening a new tab preserves the state too:

Peek.2024-06-07.17-36.webm

@imsnif
Copy link
Member

imsnif commented Jun 7, 2024

Alright, I'll do some troubleshooting next week and find out what's up.

@imsnif
Copy link
Member

imsnif commented Jun 14, 2024

Hey, so first - I did some testing and this was a configuration error on my part. My apologies!

Otherwise, I'm still testing this PR. I'm sorry for not getting to do a full review this week, it was on my schedule and got pushed off. I fully plan to do so next week. Thank you for your patience @bjorn3 !

bjorn3 added 5 commits June 14, 2024 19:31
This will enable PluginEnv to be the Store context when migrating to
Wasmtime.
This will allow removing the Clone impl from PluginEnv when migrating to
Wasmtime as required by the missing Clone impl on Wasmtime's WasiCtx.
Wasmtime requires storing the read/write end of the pipe outside of the
WasiCtx. Passing PluginEnv to these functions allows storing them in the
PluginEnv.
@imsnif
Copy link
Member

imsnif commented Jun 20, 2024

By the way since I opened this PR, Wasmtime 21.0 has been released. Do you want me to update to it before or after merging this PR?

Let's do it before if we're already here. Do you think there are significant changes that will affect us?

@bjorn3
Copy link
Contributor Author

bjorn3 commented Jun 20, 2024

Updated. No api changes that affects us. Only had to ensure the new std feature of wasmtime is enabled.

@bjorn3
Copy link
Contributor Author

bjorn3 commented Jun 20, 2024

Done and disabled a couple more features too.

By the way, I forgot to mention that I disabled the "pooling allocator" too. It is a performance optimization for when you quickly drop and recreate instances of a wasm module. For example creating an instance for each request in an http server. Are plugin instances meant to be quickly destroyed again after invoking a command inside them or do they live for a while in Zellij?

@imsnif
Copy link
Member

imsnif commented Jun 20, 2024

By the way, I forgot to mention that I disabled the "pooling allocator" too. It is a performance optimization for when you quickly drop and recreate instances of a wasm module. For example creating an instance for each request in an http server. Are plugin instances meant to be quickly destroyed again after invoking a command inside them or do they live for a while in Zellij?

They definitely live for a while. There can be cases where a plugin is spun up to deal with a request (eg. through a CLI pipe) and then kills itself, but I don't think there's a use-case for it then spinning itself up again multiple times.

@imsnif
Copy link
Member

imsnif commented Jun 20, 2024

So where things are now: I'm going to give stacab a few days to see if he can test this with some Go plugins, and if all is well or he doesn't have time, I'll give it one more shakeup with the new version and merge if all is well.

@imsnif
Copy link
Member

imsnif commented Jun 28, 2024

Alright. Things look good, multiple people have also tested this manually in various scenarios, so I'm giving this the go.

I would like to emphasize that this is not a criticism of wasmer in any way or form. Wasmer has been a great runtime for us over the years and has served us well. A heartfelt thank you to the maintainers.

Thank you very much for your work on this @bjorn3 !

@imsnif imsnif merged commit 7d7848c into zellij-org:main Jun 28, 2024
6 checks passed
@bjorn3 bjorn3 deleted the migrate_to_wasmtime branch June 28, 2024 15:08
@imsnif
Copy link
Member

imsnif commented Jul 30, 2024

@bjorn3 - I hope it's alright to consult you about this after we've already merged.

Working with main, I have been getting some of these hard crashes occasionally:

Error occurred in server:

  × Thread 'async-std/runtime' panicked.
  ├─▶ At /home/aram/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wasmtime-21.0.1/src/runtime/type_registry.rs:740:48
  ╰─▶ called `Option::unwrap()` on a `None` value
  help: If you are seeing this message, it means that something went wrong.

        -> To get additional information, check the log at: /tmp/zellij-1000/zellij-log/zellij.log
        -> To see a backtrace next time, reproduce the error with: RUST_BACKTRACE=1 zellij [...]
        -> To help us fix this, please open an issue: https://github.com/zellij-org/zellij/issues

I'm using --singlepass (winch?) compiler because I'm developing - so could this just be a development thing? I'm not too sure how to reproduce these unfortunately.

Any ideas?

@bjorn3
Copy link
Contributor Author

bjorn3 commented Jul 30, 2024

I hope it's alright to consult you about this after we've already merged.

Of course.

Working with main, I have been getting some of these hard crashes occasionally:

I will look into it. Do you have a backtrace for the crash?

@imsnif
Copy link
Member

imsnif commented Jul 30, 2024

Thank you! I unfortunately do not have a backtrace, but I'll try to keep an eye on this and produce one if it happens again.

@bjorn3
Copy link
Contributor Author

bjorn3 commented Jul 30, 2024

Looks like the assertion in question is a debug assertion, so it would only be hit during development: https://github.com/bytecodealliance/wasmtime/blob/cedf9aa0f9620391cab465361b474b149654e88d/crates/wasmtime/src/runtime/type_registry.rs#L740 I will ask someone familiar with this code if they know what is wrong.

@fitzgen
Copy link

fitzgen commented Jul 30, 2024

Hi @imsnif,

Wasmtime maintainer here. @bjorn3 pinged me about this bug. Do you have a test case and steps I can follow to reproduce the assertion failure?

Thanks!

@imsnif
Copy link
Member

imsnif commented Jul 30, 2024

Hi @fitzgen - thanks for the very quick response! I unfortunately do not have a test case or steps to reproduce. Application-wise, this seems to happen pretty randomly (happened to me when I opened a new tab in Zellij which in this context causes new plugins to be instantiated).

It happens to me occasionally while developing, I'll try to keep an eye out for the trace next time.

Since this only happens in development though - on my side at least it's an extremely minor inconvenience and nothing to worry about.

@imsnif
Copy link
Member

imsnif commented Oct 3, 2024

Hey @bjorn3 and @fitzgen - this now happened to me in release code:

  × Thread 'async-std/runtime' panicked.
  ├─▶ At /home/aram/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wasmtime-slab-21.0.1/src/lib.rs:434:35
  ╰─▶ attempt to deallocate an entry that is already vacant
  help: If you are seeing this message, it means that something went wrong.

        -> To get additional information, check the log at: /tmp/zellij-1000/zellij-log/zellij.log
        -> To see a backtrace next time, reproduce the error with: RUST_BACKTRACE=1 zellij [...]
        -> To help us fix this, please open an issue: https://github.com/zellij-org/zellij/issues

I don't customarily run with RUST_BACKTRACE and due to the nature of this application it's not trivial to reproduce this. I'm honestly starting to get a little worried... any thoughts?

@fitzgen
Copy link

fitzgen commented Oct 3, 2024

Unfortunately, it is hard to debug this issue with so little information.

Can you confirm whether you are using Wasm GC types (and specifically the wasmtime::ManuallyRooted<T> API type) and/or Wasm externrefs?

I will do another review of some potentially relevant code paths.

@fitzgen
Copy link

fitzgen commented Oct 3, 2024

I think I've figured out what this is actually. Will link the PR when I have it up.

@imsnif
Copy link
Member

imsnif commented Oct 4, 2024

Much appreciated @fitzgen ! I can totally understand this is not trivial to troubleshoot with so little information. I'm working with the Zellij main branch as a matter of course and have now made sure to also have RUST_BACKTRACE=1 up and will provide more information here as I see it.

As for specific questions, I did not do the integration myself - so I think @bjorn3 can answer them best. But hopefully as you say you figured it out and this will not be needed.

@bjorn3
Copy link
Contributor Author

bjorn3 commented Oct 4, 2024

No wasm GC types are used in the external interface as it is using wasip1 just like the Wasmer code. And as most (all) plugins are written in Rust, there is also no GC types usage expected internally within the plugins.

@fitzgen
Copy link

fitzgen commented Oct 4, 2024

Thanks for providing additional info.

Fix coming soon, I've managed to reproduce locally and verify the fix.

Aside: it might be neat to install a custom panic handler to Zellij that captures the stack trace and panic message and logs them to a file or something like that. Could make debugging this kind of bug (whether in Wasmtime, another dep, or Zellij itself) easier in the future.

Anyways, thanks for your patience while we figured this out!

@imsnif
Copy link
Member

imsnif commented Oct 4, 2024

Aside: it might be neat to install a custom panic handler to Zellij that captures the stack trace and panic message and logs them to a file or something like that. Could make debugging this kind of bug (whether in Wasmtime, another dep, or Zellij itself) easier in the future.

I totally agree. We actually do have a panic hook that is tied to some other internal app logic. I very much agree we should get it to do the right thing in this case. Like everything, it's a matter of developer capacity which is something we're unfortunately in short supply of.

Thanks for the quick turnaround on this!

@bjorn3
Copy link
Contributor Author

bjorn3 commented Oct 20, 2024

@fitzgen which PR fixed this bug?

@alexcrichton
Copy link

The bug was GHSA-7qmx-3fpx-r45m so it ended up not having an official PR for it but rather a direct merge

@bjorn3
Copy link
Contributor Author

bjorn3 commented Oct 20, 2024

Opened #3685 to update Wasmtime.

Tomcat-42 pushed a commit to Tomcat-42/zellij that referenced this pull request Nov 9, 2024
* Remove ForeignFunctionEnv wrapper around PluginEnv

This will enable PluginEnv to be the Store context when migrating to
Wasmtime.

* Pass PluginEnv by value to load_plugin_instance

This will allow removing the Clone impl from PluginEnv when migrating to
Wasmtime as required by the missing Clone impl on Wasmtime's WasiCtx.

* Avoid passing a Store around when an Engine is enough

* Pass PluginEnv to the wasi read/write functions

Wasmtime requires storing the read/write end of the pipe outside of the
WasiCtx. Passing PluginEnv to these functions allows storing them in the
PluginEnv.

* Migrate to Wasmtime

* Switch from wasi-common to wasmtime-wasi

* Reduce verbosity of wasmtime_wasi logs

* Increase startup delay

To wait for all plugins to be compiled.

* Disable some wasmtime features

* Update to Wasmtime 21.0.1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants