-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: Setup cgroupfs to run cells when running as PID1 #533
Conversation
Currently, cells aren't able to be freed and executables aren't found when attempted to be stopped. Its hard to tell if it can't find the process, or the process has exited too soon and isn't cleaned up. 🤔 |
let res = CellServiceClient::start( | ||
&remote_client, | ||
CellServiceStartRequestBuilder::new() | ||
.cell_name(cell_name.clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't appear to be starting an executable here.. you need the executable to be set on the StartRequest (i think?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CellServiceStartRequestBuilder
test helper looks like it sets a default executable for us with an ae-sleeper- prefix.
ah i didn't know about that.. i think that should probably be something that doesn't end in case we end up with flaky issues in tests. it means every test needs to clean up but that's probably the right thing to do. do you think that change can go in here? |
Sure, I don't think that should be too much overhead. - To be clear, the request is to remove the default randomized executable name and add names & cell cleanup steps to relevant tests? |
i think the randomized executable name is fine, but the executable that's run should probably not be |
did it help fix the issues? |
Sorry for the intermittent communication, but a small update: Still having issues where the PID is being killed but its child processes not stopped, while still receiving a "PID not found" error. I believe I had isolated it to here where perhaps the process being waited on has already exited, but I haven't had some time to reproduce on a fresh system. |
I haven't quite pinned down what is causing this, but it seems to be outside of this PR. I have been able to reproduce the same test failing on my (x86_64 Ubuntu 24.04) system on multiple past commits as well with the same "process not found" errors: #[test_helpers_macros::shared_runtime_test]
async fn cells_start_stop_delete() {
skip_if_not_root!("cells_start_stop_delete");
skip_if_seccomp!("cells_start_stop_delete");
let client = common::auraed_client().await;
// Allocate a cell
let cell_name = retry!(
client
.allocate(
common::cells::CellServiceAllocateRequestBuilder::new().build()
)
.await
)
.unwrap()
.into_inner()
.cell_name;
// Start the executable
let req = common::cells::CellServiceStartRequestBuilder::new()
.cell_name(cell_name.clone())
.executable_name("aurae-exe".to_string())
.build();
let _ = retry!(client.start(req.clone()).await).unwrap().into_inner();
// Stop the executable
let _ = retry!(
client
.stop(proto::cells::CellServiceStopRequest {
cell_name: Some(cell_name.clone()),
executable_name: "aurae-exe".to_string(),
})
.await
)
.unwrap();
// Delete the cell
let _ = retry!(
client
.free(proto::cells::CellServiceFreeRequest {
cell_name: cell_name.clone()
})
.await
)
.unwrap();
} Output:
|
thanks for digging in to the issue. |
This PR properly sets up the file mounts and directories, such as
cgroupfs
, for running Cells as the PID1 runtime.Testing
Rust tests, configuring new remote client for nested auraed
sudo -E cargo test -p auraed --test vms_start_must_start_vm_with_auraed -- --include-ignored
Manually with
aer
andcloud-hypervisor
auraed
pid1 imagetap0
(13 in my case):~/.aurae/config
: