-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Warn users that kernel headers are missing during the Pixie install process #2051
Comments
… to `GetAgentStatus` (#2052) Summary: Add UDTF that detects linux kernel header installation and add column to `GetAgentStatus` This is a prerequisite to accomplish #2051. The `px deploy` command uses the GetAgentStatus UDTF in its final [healthcheck step](https://github.com/pixie-io/pixie/blob/854062111cf4b91a40649a2e2647c88c0a68b0db/src/pixie_cli/pkg/cmd/deploy.go#L607-L613). With this kernel header detection in place, the `px` cli can use the results from the `px/agent_status` script to print a warning message if kernel headers aren't detected. The helm install flow needs to be covered as well. My hope is that this UDTF could be used for that use case as well, but I need to further investigate the details of that. Relevant Issues: #2051 Type of change: /kind feature Test Plan: Skaffolded to a Ubuntu GKE cluster and tested the following - [x] Kelvin always reports `false` as it doesn't bind mount `/` to `/host` - [x] PEM running on host without `linux-headers-$(uname -r)` package reports `false` - [x] PEM running on host with `linux-headers-$(uname -r)` package reports `true` ``` $ gcloud compute ssh gke-dev-cluster-ddelnano-default-pool-a27c1ac2-x5k2 --internal-ip -- 'ls -alh /lib/modules/$(uname -r)/build' lrwxrwxrwx 1 root root 38 Aug 9 15:25 /lib/modules/5.15.0-1065-gke/build -> /usr/src/linux-headers-5.15.0-1065-gke $ gcloud compute ssh gke-dev-cluster-ddelnano-default-pool-a27c1ac2-j6pg --internal-ip -- 'ls -alh /lib/modules/$(uname -r)/build' ls: cannot access '/lib/modules/5.15.0-1065-gke/build': No such file or directory ``` ![Screen Shot 2024-12-02 at 9 30 29 AM](https://github.com/user-attachments/assets/9fa862f8-5a6c-46d6-8899-bfaf2bdf3371) Changelog Message: Add `GetLinuxHeadersStatus` UDTF and add `kernel_headers_installed` column to `GetAgentStatus` --------- Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
…ing (#2061) Summary: Update `GetAgentStatus` and kernel header UDTF to allow kelvin filtering In order to leverage the `GetAgentStatus`'s `kernel_headers_installed` column for #2051, it would be convenient for the the UDTF to provide the ability to filter kelvins out -- they don't have access to kernel headers since they don't have the host filesystem volume mounted. This change introduces an `include_kelvin` init argument to the UDTFs with a default of `true` to preserve the existing behavior. This change also fixes a bug with UDTF's init arg default values, which didn't work prior to this change. Please review commit by commit to see the default arg bug fix followed by the UDTF changes. Relevant Issues: #2051 Type of change: /kind bug Test Plan: New logical planner test no longer fails with the following error ``` $ bazel test -c opt src/carnot/planner:logical_planner_test --test_output=all [ RUN ] LogicalPlannerTest.one_pems_one_kelvin src/carnot/planner/logical_planner_test.cc:64: Failure Value of: IsOK(::px::StatusAdapter(__status_or_value__64)) Actual: false (Invalid Argument : DATA_TYPE_UNKNOWN not handled as a default value) Expected: true ```
#1986 is another great case of the need for this warning/tooling. That was a bug that lasted from August until now and was due to the fact that Amazon linux headers needed to be installed since pixie's pre-packaged headers resulted in broken Go TLS tracing. |
Summary: Add px/agent_status_diagnostics pxl script This PR adds a new pxl script that helps identify if linux headers are missing on any PEMs. This can be extended in the future, but the first use cause will be to execute the script during `px deploy` and `px collect-logs`. Relevant Issues: #2051 Type of change: /kind feature Test Plan: Ran the script in the UI and tested it as part of the validation for #2065 ![Screen Shot 2024-12-18 at 10 51 15 AM](https://github.com/user-attachments/assets/bd4ab6de-ad92-4af3-8714-b044a7df7d65) Changelog Message: Add `px/agent_status_diagnostics` pxl script for checking common issues Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
…ing kernel headers (#2065) Summary: Use `px/agent_status_diagnostics` script within px cli to detect missing kernel headers This PR leverages the script added in #2064 to detect missing kernel headers during cli deploys and `px collect-logs` commands. This solves 2/3 of the use cases I was hoping to identify for #2051 (the last being helm installs). A recent example of this problem is #1986, where a Go TLS tracing bug went undiagnosed for months (August to December). Amazon Linux 2023's headers are different enough that it breaks Go TLS tracing when pixie's pre-packaged headers are used. The tooling in this PR would have provided a few opportunities for this to be caught. Relevant Issues: #2051 Type of change: /kind feature Test Plan: Verified the following scenarios <details><summary>Test cases</summary> - [x] `px collect-logs` works against a cloud that doesn't have a `px/agent_status_diagnostics` script ``` $ bazel run -c opt --stamp src/pixie_cli:px -- collect-logs WARN[0006] healthcheck script detected the following warnings: error="Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes." Logs written to pixie_logs_20241223165214.zip # zip file contains px/agent_status output $ cat px_agent_diagnostics.txt {"_tableName_":"output","agent_id":"07fb4d26-3b53-4ba7-9bb7-f2cb10a1e63d","asid":79,"hostname":"gke-dev-ddelnano1-default-pool-b099382d-30mu","ip_address":"","agent_state":"AGENT_STATE_HEALTHY","create_time":"2024-12-18T12:43:44.41952403Z","last_heartbeat_ns":4303060450,"kernel_headers_installed":true} ``` - [x] `px collect-logs` works against a cloud that does have a `px/agent_status_diagnostics` script ``` $ bazel run src/pixie_cli:px -- collect-logs INFO: Analyzed target //src/pixie_cli:px (0 packages loaded, 0 targets configured). INFO: Found 1 target... Target //src/pixie_cli:px up-to-date: bazel-bin/src/pixie_cli/px_/px INFO: Elapsed time: 4.240s, Critical Path: 3.89s INFO: 3 processes: 1 internal, 2 linux-sandbox. INFO: Build completed successfully, 3 total actions INFO: Running command line: bazel-bin/src/pixie_cli/px_/px collect-logs Pixie CLI ******************************* * ENV VARS * PX_CLOUD_ADDR=testing.getcosmic.ai:443 ******************************* Logs written to pixie_logs_20241218164734.zip $ cat px_agent_diagnostics.txt {"_tableName_":"output","headers_installed_percent":1} ``` - [x] `px collect-logs` identifies when kernel headers are missing when `px/agent_status_diagnostics` present ``` $ Logs written to pixie_logs_20241223165214.zip $ bazel run -c opt --stamp src/pixie_cli:px -- --bundle https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json collect-logs [ ... ] WARN[0012] healthcheck script detected the following warnings: error="Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes." $ cat px_agent_diagnostics.txt {"_tableName_":"output","headers_installed_percent":0.5} ``` - [x] Artificially forcing context deadline (timeout) results in an error ``` $ git diff diff --git a/src/pixie_cli/pkg/vizier/script.go b/src/pixie_cli/pkg/vizier/script.go index 7d3b7e008..c957b8943 100644 --- a/src/pixie_cli/pkg/vizier/script.go +++ b/src/pixie_cli/pkg/vizier/script.go @@ -317,7 +317,7 @@ func RunSimpleHealthCheckScript(br *script.BundleManager, cloudAddr string, clus execScript = br.MustGetScript(script.AgentStatusScript) } - ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) + ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second) $ bazel run src/pixie_cli:px -- collect-logs WARN[0012]src/pixie_cli/pkg/vizier/logs.go:135 px.dev/pixie/src/pixie_cli/pkg/vizier.(*LogCollector).CollectPixieLogs() failed to run health check script error="context deadline exceeded" Logs written to pixie_logs_20241218165033.zip ``` - [x] `px collect-logs` prompts auth flow when credentials don't match current cloud ``` $ PX_CLOUD_ADDR=new-cloud bazel run src/pixie_cli:px -- collect-logs ******************************* * ENV VARS * PX_CLOUD_ADDR=new-cloud ******************************* Failed to authenticate. Please retry `px auth login`. ``` - [x] `px deploy` on pre v0.14.14 (older) vizier with existing bundle warns that kernel headers should be installed ``` # Additional flags provided to speed up vizier bootstrapping $ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key='<deploy key>' --deploy_olm=false --olm_namespace=olm --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json ``` - [x] `px deploy` on pre v0.14.14 (older) vizier with latest bundle warns that kernel headers should be installed ``` # Additional flags provided to speed up vizier bootstrapping $ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key='<deploy key>' --deploy_olm=false --olm_namespace=olm --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json [ ... ] Waiting for Pixie to pass healthcheck ✔ Wait for PEMs/Kelvin ✔ Wait for PEMs/Kelvin ✕ Wait for healthcheck ERR: Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes. Pixie healthcheck detected the following warnings: error=Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes. [ ...] ``` - [x] `px deploy` on v0.14.14 vizier with latest bundle warns appropriate when kernel headers are missing ``` $ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key=<deploy key> --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json -v 0.14.14-pre-r1.0 [ ... ] Waiting for Pixie to pass healthcheck ✔ Wait for PEMs/Kelvin ✕ Wait for healthcheck ERR: Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes. Pixie healthcheck detected the following warnings: error=Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes. ``` </details> Changelog Message: Enhanced the `px` cli's `deploy` and `collect-logs` commands to surface when kernel headers aren't installed. This is a common source of bugs that can only be addressed by installing your distro's kernel headers. Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Summary: Update self-hosted cloud's cli, operator and vizier versions The latest vizier (v0.14.14) and cli (v0.8.5) include support for detect missing kernel headers (#2051). This detection is only enabled when the `px/agent_diagnostic_status` script is present. Since this script will become available in the next release, this change ensures that self hosted users can benefit from this additional diagnostic information as soon as possible. Relevant Issues: #2051 Type of change: /kind feature Test Plan: Verified the version numbers Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
…ect missing kernel headers (#2065)" (#2081) Summary: Revert "Use `px/agent_status_diagnostics` script within px cli to detect missing kernel headers (#2065)" This reverts commit 3c9c4bd. While testing the latest cloud release, I noticed that the functionality added in #2065 can cause `px deploy` to hang indefinitely at the "Wait for healthcheck" step. The deploy will finish successfully, but it's a poor user experience. There seems to be a goroutine blocking issue that is dependent on cluster size. On 1 and 2 node clusters that I tested #2065 on, the issue doesn't surface. However, the issue reproduces reliably on the larger clusters that have pixie deployed to. Let's revert and release a new cli version once this is tracked down. Relevant Issues: #2051 Type of change: /kind bugfix Test Plan: N/A Changelog Message: Reverted the recent advanced diagnostics added during `px deploy` as in some cases it can cause that caused `px deploy` to hang Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
…ing kernel headers (pixie-io#2065) Summary: Use `px/agent_status_diagnostics` script within px cli to detect missing kernel headers This PR leverages the script added in pixie-io#2064 to detect missing kernel headers during cli deploys and `px collect-logs` commands. This solves 2/3 of the use cases I was hoping to identify for pixie-io#2051 (the last being helm installs). A recent example of this problem is pixie-io#1986, where a Go TLS tracing bug went undiagnosed for months (August to December). Amazon Linux 2023's headers are different enough that it breaks Go TLS tracing when pixie's pre-packaged headers are used. The tooling in this PR would have provided a few opportunities for this to be caught. Relevant Issues: pixie-io#2051 Type of change: /kind feature Test Plan: Verified the following scenarios <details><summary>Test cases</summary> - [x] `px collect-logs` works against a cloud that doesn't have a `px/agent_status_diagnostics` script ``` $ bazel run -c opt --stamp src/pixie_cli:px -- collect-logs WARN[0006] healthcheck script detected the following warnings: error="Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes." Logs written to pixie_logs_20241223165214.zip # zip file contains px/agent_status output $ cat px_agent_diagnostics.txt {"_tableName_":"output","agent_id":"07fb4d26-3b53-4ba7-9bb7-f2cb10a1e63d","asid":79,"hostname":"gke-dev-ddelnano1-default-pool-b099382d-30mu","ip_address":"","agent_state":"AGENT_STATE_HEALTHY","create_time":"2024-12-18T12:43:44.41952403Z","last_heartbeat_ns":4303060450,"kernel_headers_installed":true} ``` - [x] `px collect-logs` works against a cloud that does have a `px/agent_status_diagnostics` script ``` $ bazel run src/pixie_cli:px -- collect-logs INFO: Analyzed target //src/pixie_cli:px (0 packages loaded, 0 targets configured). INFO: Found 1 target... Target //src/pixie_cli:px up-to-date: bazel-bin/src/pixie_cli/px_/px INFO: Elapsed time: 4.240s, Critical Path: 3.89s INFO: 3 processes: 1 internal, 2 linux-sandbox. INFO: Build completed successfully, 3 total actions INFO: Running command line: bazel-bin/src/pixie_cli/px_/px collect-logs Pixie CLI ******************************* * ENV VARS * PX_CLOUD_ADDR=testing.getcosmic.ai:443 ******************************* Logs written to pixie_logs_20241218164734.zip $ cat px_agent_diagnostics.txt {"_tableName_":"output","headers_installed_percent":1} ``` - [x] `px collect-logs` identifies when kernel headers are missing when `px/agent_status_diagnostics` present ``` $ Logs written to pixie_logs_20241223165214.zip $ bazel run -c opt --stamp src/pixie_cli:px -- --bundle https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json collect-logs [ ... ] WARN[0012] healthcheck script detected the following warnings: error="Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes." $ cat px_agent_diagnostics.txt {"_tableName_":"output","headers_installed_percent":0.5} ``` - [x] Artificially forcing context deadline (timeout) results in an error ``` $ git diff diff --git a/src/pixie_cli/pkg/vizier/script.go b/src/pixie_cli/pkg/vizier/script.go index 7d3b7e008..c957b8943 100644 --- a/src/pixie_cli/pkg/vizier/script.go +++ b/src/pixie_cli/pkg/vizier/script.go @@ -317,7 +317,7 @@ func RunSimpleHealthCheckScript(br *script.BundleManager, cloudAddr string, clus execScript = br.MustGetScript(script.AgentStatusScript) } - ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) + ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second) $ bazel run src/pixie_cli:px -- collect-logs WARN[0012]src/pixie_cli/pkg/vizier/logs.go:135 px.dev/pixie/src/pixie_cli/pkg/vizier.(*LogCollector).CollectPixieLogs() failed to run health check script error="context deadline exceeded" Logs written to pixie_logs_20241218165033.zip ``` - [x] `px collect-logs` prompts auth flow when credentials don't match current cloud ``` $ PX_CLOUD_ADDR=new-cloud bazel run src/pixie_cli:px -- collect-logs ******************************* * ENV VARS * PX_CLOUD_ADDR=new-cloud ******************************* Failed to authenticate. Please retry `px auth login`. ``` - [x] `px deploy` on pre v0.14.14 (older) vizier with existing bundle warns that kernel headers should be installed ``` # Additional flags provided to speed up vizier bootstrapping $ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key='<deploy key>' --deploy_olm=false --olm_namespace=olm --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json ``` - [x] `px deploy` on pre v0.14.14 (older) vizier with latest bundle warns that kernel headers should be installed ``` # Additional flags provided to speed up vizier bootstrapping $ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key='<deploy key>' --deploy_olm=false --olm_namespace=olm --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json [ ... ] Waiting for Pixie to pass healthcheck ✔ Wait for PEMs/Kelvin ✔ Wait for PEMs/Kelvin ✕ Wait for healthcheck ERR: Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes. Pixie healthcheck detected the following warnings: error=Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes. [ ...] ``` - [x] `px deploy` on v0.14.14 vizier with latest bundle warns appropriate when kernel headers are missing ``` $ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key=<deploy key> --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json -v 0.14.14-pre-r1.0 [ ... ] Waiting for Pixie to pass healthcheck ✔ Wait for PEMs/Kelvin ✕ Wait for healthcheck ERR: Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes. Pixie healthcheck detected the following warnings: error=Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes. ``` </details> Changelog Message: Enhanced the `px` cli's `deploy` and `collect-logs` commands to surface when kernel headers aren't installed. This is a common source of bugs that can only be addressed by installing your distro's kernel headers. Signed-off-by: Dom Del Nano <ddelnano@gmail.com> (cherry picked from commit 3c9c4bd)
… missing kernel headers (#2091) Summary: Re-introduce use of `px/agent_status_diagnostics` in px cli to detect missing kernel headers The original version of this was reverted since it resulted in hung `px deploy` and `px collect-logs` commands on larger clusters. This PR reintroduces the change with the fixes necessary to prevent the previous issue and is best reviewed commit by commit as outlined below: Commit 1: Cherry-pick of #2065 Commit 2: Fix for goroutine deadlock Commit 3: Add bundle flag to the `px collect-logs` command Commit 4: Introduce `PX_LOG_FILE` env var for redirecting `px` log output to a file -- useful for debugging the cli since its terminal spinners complicate logging to stdout Commits 1, 3 and 4 should be self explanatory. As for Commit 2, the goroutine deadlock occurred from the `streamCh` channel consumer. The previous version read a single value from the `streamCh` channel, parsed the result and [terminated](2ec63c8#diff-4da8f48b4c664d330cff34e70f907d6015289797c832587b0b14004875ef0831R363) its goroutine. Thus future sends to the `streamCh` channel could block and prevent the pipe receiving the pxl script results to be fully consumed. Since the stream adapter writes to the pipe, it couldn't flush all of its results and the deadlock occurred. The original testing was performed on clusters with 1 and 2 nodes -- max of 2 PEMs and 2 results from `px/agent_status`. This deadlock issue didn't surface in those situations because `streamCh` was a buffered channel with capacity of 1 and the consumer would read a single record before terminating. This meant that the pipe reader would hit EOF before it would initiate a channel send that would deadlock as outlined below: 2 Node cluster situation: 1. `px` cli executes `px/agent_status` as `px/agent_status_diagnostics` is not in the canonical bundle yet 2. streamCh producer sends 1st PEMs result -- streamCh at capacity 3. streamCh consumer reads the value and exits -- streamCh ready to accept 1 value 4. streamCh producer sends 2nd and final PEM result -- streamCh at capacity and future sends would block! 5. Program exits since pxl script is complete Relevant Issues: #2051 Type of change: /kind feature Test Plan: Verified that the deadlock no longer occurs on clusters with 3-6 nodes - [x] Used the [following](https://github.com/user-attachments/files/18457105/deadlocked-goroutines.txt) pprof goroutine stack dump to understand the deadlock described above -- see blocked goroutine on `streamCh` channel send on `script.go:337` - [x] Re-tested all of the scenarios from #2065 Changelog Message: Re-introduce enhanced diagnostics for `px deploy` and `px collect-logs` commands used to detect common sources of environment incompatibilities --------- Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
Users running Vizier v0.14.14 and Cloud v0.1.9 will now see these kernel warning checks when using the We need to add this warning check to Helm installs, but I'll probably be revisiting that at a later time. |
While Pixie has invested in its prepackaged linux headers and working without upstream headers, it is highly recommended to install the given distro's kernel header package. Upstream distros patch and backport many changes, which make the prepackaged option susceptible issues that are hard to anticipate and work around. Examples of these inconsistencies can be seen in #1863, #252 and the recent openSUSE (2037) and Amazon Linux 2023 (#1986) issues.
Some of these issues get reported, but my suspicion is that this poor experience causes people to fail to fully evaluate Pixie since their initial impression shows that the socket tracer isn't functional (the most common way this problem manifests). For example, the openSUSE case mentioned above was only determined through my outreach and was in a position where the end user had moved on from evaluating Pixie.
If Pixie had the ability to detect when kernel headers aren't installed, we could warn the user that it is recommended to do so and link to common problems caused by the lack of headers. This will provide the end user with quick feedback on an area that's currently arcane to debug and hopefully prevent people from having a poor experience in these cases.
The text was updated successfully, but these errors were encountered: