trafficserver: new module #1243

Merged
merged 6 commits into from
Jun 25, 2024


midchildan
Contributor

Creates a new module for Apache Traffic Server.

@sandydoo
Member

@midchildan
Contributor Author

I couldn't reproduce the failure locally on x86_64 Linux. The CI output has multiple lines with [trafficserver ] NOTE: using command line path as RUNROOT, which suggests that Traffic Server crashed and restarted. If that is the case, I don't know why curl didn't immediately error out.

To help with debugging, I added a timeout for curl and made the console logs more verbose.
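
For readers following along, here is a minimal sketch of what bounding a curl probe could look like. The thread does not show the exact flags that were added, so the use of `--max-time` and `--connect-timeout`, the helper names, and the default values are all assumptions for illustration.

```python
import subprocess


def curl_argv(url, max_time_s=10, connect_timeout_s=5):
    """Build a curl command line that cannot hang indefinitely.

    --fail makes curl exit non-zero on HTTP errors, --max-time caps the
    whole transfer, and --connect-timeout caps the connection phase.
    """
    return [
        "curl", "--fail", "--silent", "--show-error",
        "--connect-timeout", str(connect_timeout_s),
        "--max-time", str(max_time_s),
        url,
    ]


def probe(url, max_time_s=10):
    """Run the probe; return (ok, stderr_text). Hypothetical helper."""
    try:
        result = subprocess.run(
            curl_argv(url, max_time_s),
            capture_output=True,
            text=True,
            # Second guard in case curl itself gets stuck.
            timeout=max_time_s + 5,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError) as exc:
        return False, str(exc)
    return result.returncode == 0, result.stderr.strip()
```

With a cap like this, a server that crashes and restarts mid-request shows up as a failed probe rather than a stalled test.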

@midchildan
Contributor Author

Hmm, I still can't figure out the cause of the failure. I ran the test on GitHub Actions using my repository, and the Linux runner was able to complete the test successfully.

https://github.com/midchildan/devenv/actions/runs/9339577654/job/25704130159

To make things worse, the complete logs are still missing, likely because they are output too late. When I run it locally, the logs appear after the shell prompt. A different setting might be able to surface the logs; I'll look into it.

@domenkozar
Member

Maybe something similar to #1248 (comment)

@midchildan
Contributor Author

Thanks, I set the hostname for ATS and also tried to get more logs. If this doesn't go well, I might have to temporarily wrap the traffic_server command with strace to get more details.

@sandydoo
Member

Huh, maybe it doesn't like one of the systemd restrictions on our runner: https://github.com/NixOS/nixpkgs/blob/051f920625ab5aabe37c920346e3e69d7d34400e/nixos/modules/services/continuous-integration/github-runner/service.nix#L227-L265

[Jun 10 10:04:22.158] traffic_manager NOTE: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 31: Bad system call
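
As a reading aid for the log line above: on x86_64 Linux, signal 31 is SIGSYS, and the systemd documentation describes SIGSYS as the default result when a process makes a call blocked by `SystemCallFilter=`. The sketch below only decodes a signal number on whatever platform it runs on; the function name is made up for illustration, and numbering differs on other operating systems.

```python
import signal


def describe_signal(signum: int) -> str:
    """Return the symbolic name of a signal number on this platform."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        # The number is not a known signal on this platform.
        return f"unknown signal {signum}"


if __name__ == "__main__":
    # 31 is the number reported by traffic_manager in the CI log.
    print(31, describe_signal(31))
```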

@midchildan
Contributor Author

I tried applying the same restrictions on NixOS 23.11 with the default kernel, but the test still completed successfully.

sudo systemd-run -t \
  -p AmbientCapabilities= \
  -p CapabilityBoundingSet= \
  -p DeviceAllow= \
  -p NoNewPrivileges=true \
  -p PrivateDevices=true \
  -p PrivateMounts=true \
  -p PrivateTmp=true \
  -p ProtectClock=true \
  -p ProtectControlGroups=true \
  -p ProtectHome=true \
  -p ProtectHostname=true \
  -p ProtectKernelLogs=true \
  -p ProtectKernelModules=true \
  -p ProtectKernelTunables=true \
  -p ProtectSystem=strict \
  -p RemoveIPC=true \
  -p RestrictNamespaces=true \
  -p RestrictRealtime=true \
  -p RestrictSUIDSGID=true \
  -p UMask=0066 \
  -p ProtectProc=invisible \
  -p SystemCallFilter='~@clock ~@cpu-emulation ~@module ~@obsolete ~@raw-io ~@reboot ~capset ~setdomainname ~sethostname' \
  -p RestrictAddressFamilies='AF_INET AF_INET6 AF_UNIX AF_NETLINK' \
  -p DynamicUser=true \
  -p PrivateNetwork=false \
  -p MemoryDenyWriteExecute=false \
  -p ProcSubset=all \
  -p LockPersonality=false \
  -p StateDirectory=devenv \
  -p Environment=HOME=/var/lib/devenv \
  zsh
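
A command like the one above is easier to vary between experiments when the properties live in a mapping. The following is a rough sketch of that idea, not something from this PR: the helper name and the small `restrictions` subset are illustrative, and `sudo` is left out so the caller can decide how to elevate.

```python
import shlex


def systemd_run_argv(properties, command, tty=True):
    """Build a systemd-run argv from a mapping of unit properties.

    Booleans become true/false, lists are joined with spaces, and an
    empty string yields "Key=", which resets the property.
    """
    argv = ["systemd-run"]
    if tty:
        argv.append("-t")
    for key, value in properties.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            value = " ".join(value)
        argv += ["-p", f"{key}={value}"]
    return argv + list(command)


if __name__ == "__main__":
    # Illustrative subset of the restrictions quoted in the thread.
    restrictions = {
        "CapabilityBoundingSet": "",
        "NoNewPrivileges": True,
        "ProtectSystem": "strict",
        "SystemCallFilter": ["~@clock", "~capset"],
        "RestrictAddressFamilies": ["AF_INET", "AF_INET6", "AF_UNIX", "AF_NETLINK"],
    }
    # shlex.join quotes values that contain spaces for copy-pasting.
    print(shlex.join(systemd_run_argv(restrictions, ["zsh"])))
```

Dropping or adding one key at a time is then a quick way to bisect which restriction triggers a failure.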

@midchildan
Contributor Author

I pushed a new commit that wraps the traffic_server command with strace to see which system call is causing the crash.

@midchildan
Contributor Author

It appears Traffic Server is requesting CAP_NET_ADMIN and getting itself terminated by the kernel.

In a typical setup, Traffic Server is launched as root. During startup, the server process changes to an unprivileged user and further drops unnecessary capabilities. But in doing so, it also attempts to explicitly keep the required capabilities including CAP_NET_ADMIN.

https://github.com/apache/trafficserver/blob/90fbf13db0858cef0e0a094f445d846b60a4c1ef/src/tscore/ink_cap.cc#L259

On other machines, this appears to have no effect when Traffic Server is launched as an unprivileged user. So perhaps the CI host is configured differently?
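
One way to compare hosts is to look at the capability sets a process actually has. The sketch below reads the `Cap*` fields from `/proc/self/status` on Linux and tests the CAP_NET_ADMIN bit (bit 12 in `<linux/capability.h>`). The function names are made up for illustration, and nothing in this thread confirms that capability sets are what differs on the CI host.

```python
CAP_NET_ADMIN = 12  # bit index from <linux/capability.h>


def parse_cap_mask(status_text: str, field: str = "CapEff") -> int:
    """Extract a capability bitmask (hex) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith(field + ":"):
            return int(line.split(":", 1)[1].strip(), 16)
    raise KeyError(field)


def has_capability(mask: int, cap: int) -> bool:
    """Return True if bit number `cap` is set in `mask`."""
    return bool((mask >> cap) & 1)


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as f:
            text = f.read()
        for field in ("CapPrm", "CapEff", "CapBnd"):
            mask = parse_cap_mask(text, field)
            print(field, hex(mask), has_capability(mask, CAP_NET_ADMIN))
    except (OSError, KeyError) as exc:
        # Not Linux, or the field is missing from the status file.
        print("could not read capability sets:", exc)
```

Run inside and outside the restricted unit, this shows whether CAP_NET_ADMIN is present in the permitted, effective, or bounding set in each environment.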

In any case, I configured Traffic Server to not attempt to change the user during launch. This should prevent the problematic code from running.

https://github.com/apache/trafficserver/blob/90fbf13db0858cef0e0a094f445d846b60a4c1ef/src/traffic_server/traffic_server.cc#L1874
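
The comment does not name the setting that was changed. As an assumption based on the Traffic Server documentation, `proxy.config.admin.user_id` accepts `#-1` to mean "do not change the user during startup", so the sketch below renders that setting in the classic `records.config` line format. The helper is hypothetical and the module in this PR may express the setting differently.

```python
def records_config_line(name: str, value) -> str:
    """Render one records.config entry: CONFIG <name> <TYPE> <value>."""
    if isinstance(value, bool):
        raise TypeError("records.config has no boolean type; use 0 or 1")
    if isinstance(value, int):
        kind = "INT"
    elif isinstance(value, float):
        kind = "FLOAT"
    else:
        kind = "STRING"
    return f"CONFIG {name} {kind} {value}"


if __name__ == "__main__":
    # Assumed setting: "#-1" disables the user switch at startup.
    print(records_config_line("proxy.config.admin.user_id", "#-1"))
```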

@midchildan
Contributor Author

It appears the tests hung before the Traffic Server tests were able to run on the Linux runners. I pushed another commit to temporarily disable other examples so that we can see the results faster.

@sandydoo
Member

I brought down our CI trying to upgrade NixOS. Will re-run once I get it back up, but the fix looks promising!

@midchildan
Contributor Author

error: The option `services.trafficserver' does not exist. Definition values:

That's interesting, I don't see any change since last time that would cause the option to stop being defined.

@midchildan
Contributor Author

I tried rebasing. .github/workflows/buildtest.yml did have a conflict as a result of me temporarily disabling non-Traffic Server tests. I'm not sure if it has anything to do with the error though. The test still succeeds on my end.

@domenkozar
Member

Seems like the tests pass now?

@sandydoo
Member

Yay! Thanks for being so patient, @midchildan. If you could please drop the debug commits, we'll merge straight away.

@midchildan
Contributor Author

Thank you too! I dropped e8fbe42 and ec9b9fe.

@sandydoo sandydoo added the module A new or updated module label Jun 25, 2024
@sandydoo sandydoo merged commit afaf476 into cachix:main Jun 25, 2024
266 of 275 checks passed
@midchildan midchildan deleted the feat/trafficserver-module branch June 26, 2024 17:46
3 participants