Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Setup cgroupfs to run cells when running as PID1 #533

Merged
merged 4 commits into from
Nov 10, 2024

Conversation

mccormickt
Copy link
Contributor

@mccormickt mccormickt commented Oct 8, 2024

This PR properly sets up the file mounts and directories, such as cgroupfs, for running Cells as the PID1 runtime.

Testing

Rust tests, configuring new remote client for nested auraed

sudo -E cargo test -p auraed --test vms_start_must_start_vm_with_auraed -- --include-ignored

Manually with aer and cloud-hypervisor

  1. Install cloud-hypervisor and build guest image/kernel
sudo make /opt/aurae/cloud-hypervisor/cloud-hypervisor
sudo make build-guest-kernel
sudo make prepare-image
  1. Run cloud-hypervisor with the auraed pid1 image
sudo cloud-hypervisor --kernel /var/lib/aurae/vm/kernel/vmlinux.bin \                                 
--disk path=/var/lib/aurae/vm/image/disk.raw \                                                                                   
--cmdline "console=hvc0 root=/dev/vda1 rw" \                                                                                     
--cpus boot=4 \                                                                                                                  
--memory size=4096M \                                                                                                            
--net "tap=tap0,mac=aa:ae:00:00:00:01,id=eth0"
  1. Retrieve zone ID from tap0 (13 in my case):
ip link show tap0                                               
13: tap0: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 06:66:42:a8:3f:e1 brd ff:ff:ff:ff:ff:ff
  1. Configure aurae client config in ~/.aurae/config:
[system]
socket = "[fe80::2%13]:8080"
  1. Verify cells run:
aer cell allocate sleeper
aer cell start --executable-command "sleep 9000" sleeper sleep-forever
aer cell list
aer cell stop sleeper sleep-forever
aer cell free sleeper

@mccormickt
Copy link
Contributor Author

Currently, cells aren't able to be freed and executables aren't found when attempted to be stopped. Its hard to tell if it can't find the process, or the process has exited too soon and isn't cleaned up. 🤔

let res = CellServiceClient::start(
&remote_client,
CellServiceStartRequestBuilder::new()
.cell_name(cell_name.clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't appear to be starting an executable here.. you need the executable to be set on the StartRequest (i think?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CellServiceStartRequestBuilder test helper looks like it sets a default executable for us with an ae-sleeper- prefix.

@dmah42
Copy link
Contributor

dmah42 commented Oct 11, 2024

ah i didn't know about that.. i think that should probably be something that doesn't end in case we end up with flaky issues in tests. it means every test needs to clean up but that's probably the right thing to do. do you think that change can go in here?

@mccormickt mccormickt changed the title Fix: Setup cgroupfs to run cells as PID1 Fix: Setup cgroupfs to run cells when running as PID1 Oct 14, 2024
@mccormickt
Copy link
Contributor Author

ah i didn't know about that.. i think that should probably be something that doesn't end in case we end up with flaky issues in tests. it means every test needs to clean up but that's probably the right thing to do. do you think that change can go in here?

Sure, I don't think that should be too much overhead. - To be clear, the request is to remove the default randomized executable name and add names & cell cleanup steps to relevant tests?

@dmah42
Copy link
Contributor

dmah42 commented Oct 14, 2024

ah i didn't know about that.. i think that should probably be something that doesn't end in case we end up with flaky issues in tests. it means every test needs to clean up but that's probably the right thing to do. do you think that change can go in here?

Sure, I don't think that should be too much overhead. - To be clear, the request is to remove the default randomized executable name and add names & cell cleanup steps to relevant tests?

i think the randomized executable name is fine, but the executable that's run should probably not be sleep 400. we should run something that continues to run until it's stopped i think. wdyt?

@dmah42
Copy link
Contributor

dmah42 commented Oct 19, 2024

did it help fix the issues?

@mccormickt
Copy link
Contributor Author

did it help fix the issues?

Sorry for the intermittent communication, but a small update:

Still having issues where the PID is being killed but its child processes not stopped, while still receiving a "PID not found" error. I believe I had isolated it to here where perhaps the process being waited on has already exited, but I haven't had some time to reproduce on a fresh system.

@mccormickt
Copy link
Contributor Author

I haven't quite pinned down what is causing this, but it seems to be outside of this PR.

I have been able to reproduce the same test failing on my (x86_64 Ubuntu 24.04) system on multiple past commits as well with the same "process not found" errors:

#[test_helpers_macros::shared_runtime_test]
async fn cells_start_stop_delete() {
    skip_if_not_root!("cells_start_stop_delete");
    skip_if_seccomp!("cells_start_stop_delete");

    let client = common::auraed_client().await;

    // Allocate a cell
    let cell_name = retry!(
        client
            .allocate(
                common::cells::CellServiceAllocateRequestBuilder::new().build()
            )
            .await
    )
    .unwrap()
    .into_inner()
    .cell_name;

    // Start the executable
    let req = common::cells::CellServiceStartRequestBuilder::new()
        .cell_name(cell_name.clone())
        .executable_name("aurae-exe".to_string())
        .build();
    let _ = retry!(client.start(req.clone()).await).unwrap().into_inner();

    // Stop the executable
    let _ = retry!(
        client
            .stop(proto::cells::CellServiceStopRequest {
                cell_name: Some(cell_name.clone()),
                executable_name: "aurae-exe".to_string(),
            })
            .await
    )
    .unwrap();

    // Delete the cell
    let _ = retry!(
        client
            .free(proto::cells::CellServiceFreeRequest {
                cell_name: cell_name.clone()
            })
            .await
    )
    .unwrap();
}

Output:

2024-11-07T01:30:08.046239Z  INFO auraed::init::system_runtimes::cell_system_runtime: Running as a cell
2024-11-07T01:30:08.046621Z  INFO auraed::init::system_runtimes: User Access Socket Created: /var/run/aurae/aurae-d11742bb-c96f-4
8b9-95b8-afbe6747006b.sock                                                                                                       
2024-11-07T01:30:08.051181Z  INFO allocate: auraed::cells::cell_service::cells::cell: Attach nested Auraed pid 1668285 to cgroup 
ae-e2e-742727f8-938a-4cc1-8a33-d28e7e9465eb request=ValidatedCellServiceAllocateRequest { cell: ValidatedCell { name: CellName("a
e-e2e-742727f8-938a-4cc1-8a33-d28e7e9465eb"), cpu: None, cpuset: None, memory: None, isolate_process: false, isolate_network: fal
se } }                                                          
2024-11-07T01:30:08.068934Z  INFO start: auraed::cells::cell_service::cell_service: CellService: start() executable=ValidatedExec
utable { name: ExecutableName("aurae-exe"), command: "sleep 400", description: "description" } request=ValidatedCellServiceStartR
equest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "sleep 400", description:
 "description" }, uid: None, gid: None }                                                                                         
2024-11-07T01:30:08.069353Z  INFO start: auraed::observe::observe_service: Registering channel for pid 1668303 Stdout request=Val
idatedCellServiceStartRequest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "s
leep 400", description: "description" }, uid: None, gid: None }
2024-11-07T01:30:08.069445Z  INFO start: auraed::observe::observe_service: Registering channel for pid 1668303 Stderr request=Val
idatedCellServiceStartRequest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "s
leep 400", description: "description" }, uid: None, gid: None }
2024-11-07T01:30:08.103119Z  INFO stop: auraed::cells::cell_service::cell_service: CellService: stop() executable_name=Executable
Name("aurae-exe") request=ValidatedCellServiceStopRequest { cell_name: None, executable_name: ExecutableName("aurae-exe") }
2024-11-07T01:30:08.103377Z ERROR stop: auraed::cells::cell_service::error: executable 'aurae-exe' failed to stop: No child proce
sses (os error 10) request=ValidatedCellServiceStopRequest { cell_name: None, executable_name: ExecutableName("aurae-exe") }
thread 'cells_start_stop_delete' panicked at auraed/tests/cell_list_must_list_allocated_cells_recursively.rs:172:6:             
called `Result::unwrap()` on an `Err` value: Status { code: Internal, message: "executable 'aurae-exe' failed to stop: No child p
rocesses (os error 10)", metadata: MetadataMap { headers: {"content-type": "application/grpc", "content-length": "0", "date": "Th
u, 07 Nov 2024 01:30:08 GMT"} }, source: None }

@dmah42 dmah42 merged commit 1d62cb0 into aurae-runtime:main Nov 10, 2024
4 checks passed
@dmah42
Copy link
Contributor

dmah42 commented Nov 10, 2024

thanks for digging in to the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants