rpk: Improve k8s bundle errors + better admin API fallback #19473

r-vasquez · 2024-06-11T01:09:12Z

Debug bundles are often collected when things are not working properly, so it is normal for rpk debug bundle to hit some errors along the collection steps. This PR improves the error messages and provides better hints when errors occur, focusing on the Kubernetes experience.

Fixes #18057

Main Changes:

rpk uses the k8s API to find the admin API addresses and collect the logs. Many other steps depend on this first step, and it fails if the service account is not authorized to access the required resources. We now check for authorization before executing the steps, which reduces clutter and gives a better error message:

# Before:
	* unable to get pods in the "redpanda" namespace: pods is forbidden: User "system:serviceaccount:redpanda:default" cannot list resource "pods" in API group "" in the namespace "redpanda"
	* unable to get pods in the "redpanda" namespace: pods is forbidden: User "system:serviceaccount:redpanda:default" cannot list resource "pods" in API group "" in the namespace "redpanda"

# Now:

	* skipping log collection and collecting Kubernetes resources (such as pods, services, etc.) in the namespace "permission denied to list services". To enable this you may need to grant additional permissions to your service account; visit https://docs.redpanda.com/current/manage/kubernetes/troubleshooting/k-diagnostics-bundle/
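The up-front check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `canListFunc` and `precheck` are hypothetical names, and the checker is injected so the example runs without a cluster. Against a real cluster the checker could be backed by a Kubernetes SelfSubjectAccessReview request; whether rpk does exactly that is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// canListFunc reports whether the current identity may list a resource in a
// namespace. Hypothetical abstraction: a real implementation would ask the
// Kubernetes API server instead of using a local function.
type canListFunc func(namespace, resource string) (bool, error)

// precheck verifies every required resource up front and returns one error
// naming everything that is denied, or nil when all resources are allowed.
func precheck(namespace string, resources []string, canList canListFunc) error {
	var denied []string
	for _, r := range resources {
		ok, err := canList(namespace, r)
		if err != nil {
			return fmt.Errorf("unable to check permissions for %q: %w", r, err)
		}
		if !ok {
			denied = append(denied, r)
		}
	}
	if len(denied) > 0 {
		return fmt.Errorf("permission denied to list %s in the namespace %q",
			strings.Join(denied, ", "), namespace)
	}
	return nil
}

func main() {
	// Fake checker for the demo: everything except pods is allowed.
	fake := func(_, resource string) (bool, error) { return resource != "pods", nil }
	if err := precheck("redpanda", []string{"pods", "services"}, fake); err != nil {
		// One message replaces the repeated per-step "forbidden" errors.
		fmt.Println("skipping Kubernetes collection:", err)
	}
}
```

Collecting the denied resources first is what lets the caller skip the dependent steps and print a single error instead of one per step.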

When that discovery step failed, our fallback was to use localhost:9644 for the admin API addresses. We now use the loaded profile's addresses as the primary fallback, since the profile also includes the TLS configuration. This has a big impact on clusters created with our helm chart/operator, since we now populate the redpanda.yaml with the cluster admin API addresses:

# Before
	* unable to issue request for "admin/disk_stat_cache_127.0.0.1-9644.json": Get "https://127.0.0.1:9644/v1/debug/storage/disk_stat/cache": tls: failed to verify certificate: x509: cannot validate certificate for 127.0.0.1 because it doesn't contain any IP SANs
	* unable to issue request for "metrics/127.0.0.1-9644/t0_public_metrics.txt": Get "https://127.0.0.1:9644/public_metrics": tls: failed to verify certificate: x509: cannot validate certificate for 127.0.0.1 because it doesn't contain any IP SANs
	* unable to issue request for "admin/node_config_127.0.0.1-9644.json": Get "https://127.0.0.1:9644/v1/node_config": tls: failed to verify certificate: x509: cannot validate certificate for 127.0.0.1 because it doesn't contain any IP SANs
	* unable to issue request for "admin/raft_status_127.0.0.1-9644.json": Get "https://127.0.0.1:9644/v1/raft/recovery/status": tls: failed to verify certificate: x509: cannot validate certificate for 127.0.0.1 because it doesn't contain any IP SANs

# Now: use the profile, which would have the TLS configuration in place :smile:
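The fallback order can be illustrated with a small sketch. `pickAdminAddrs` and the example host name are hypothetical, and keeping localhost:9644 as a last resort is an assumption drawn from the phrase "primary fallback"; the actual selection logic in rpk may differ.

```go
package main

import "fmt"

// defaultAdminAddr is the localhost address that was the only fallback before.
const defaultAdminAddr = "localhost:9644"

// pickAdminAddrs returns the admin API addresses to query and a label that
// says where they came from.
func pickAdminAddrs(discovered, profile []string) ([]string, string) {
	switch {
	case len(discovered) > 0:
		// Addresses found through the k8s API always win.
		return discovered, "k8s discovery"
	case len(profile) > 0:
		// Primary fallback: the loaded profile, which also carries TLS settings.
		return profile, "profile"
	default:
		// Assumption: localhost stays as the last resort.
		return []string{defaultAdminAddr}, "default"
	}
}

func main() {
	// Discovery failed, but the profile lists an address (hypothetical host).
	addrs, source := pickAdminAddrs(nil, []string{"redpanda-0.example:9644"})
	fmt.Println(source, addrs)
}
```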

/proc/slabinfo collection often fails because rpk debug bundle is not being executed with root permissions:

# Before: 
open /proc/slabinfo: permission denied
# Now:
open /proc/slabinfo: permission denied; you may need to run the command as root to read this file
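One way to attach such a hint is to wrap the error only when it is a permission error. The sketch below uses the standard `errors.Is` and `fs.ErrPermission`; the helper name `withRootHint` and the sample error are illustrative and not taken from the rpk source.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
)

// errSample mimics the error returned when /proc/slabinfo is not readable.
var errSample error = &fs.PathError{Op: "open", Path: "/proc/slabinfo", Err: fs.ErrPermission}

// withRootHint appends a hint to permission errors and leaves every other
// error (including nil) untouched.
func withRootHint(err error) error {
	if err != nil && errors.Is(err, fs.ErrPermission) {
		return fmt.Errorf("%w; you may need to run the command as root to read this file", err)
	}
	return err
}

func main() {
	fmt.Println(withRootHint(errSample))
}
```

Wrapping with `%w` keeps the original error in the chain, so callers can still match it with `errors.Is`.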

Controller log collection requires redpanda.data_directory to be present in the configuration file (redpanda.yaml). Redpanda also needs it to start, so a missing value is often a sign of a corrupted or invalid config file. The error we were printing did not clearly indicate that:

# Before:
	* lstat redpanda/controller/0_0: no such file or directory
# Now: 
	* failed to save controller logs: 'redpanda.data_directory' is empty on the provided configuration file
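A guard of this kind might look like the sketch below. `controllerLogDir` is a hypothetical helper; the `redpanda/controller/0_0` layout is inferred from the old lstat error, where an empty data directory left only the relative part of the path.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
)

// errEmptyDataDir is returned when the config has no redpanda.data_directory.
var errEmptyDataDir = errors.New("'redpanda.data_directory' is empty on the provided configuration file")

// controllerLogDir builds the controller log path, failing early with a
// descriptive error instead of letting a later lstat fail on a relative path.
func controllerLogDir(dataDir string) (string, error) {
	if dataDir == "" {
		return "", fmt.Errorf("failed to save controller logs: %w", errEmptyDataDir)
	}
	return filepath.Join(dataDir, "redpanda", "controller", "0_0"), nil
}

func main() {
	if _, err := controllerLogDir(""); err != nil {
		fmt.Println("*", err)
	}
}
```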

If a command (du, top, etc.) failed, we would print that it exited with status 1 and save the error output (stderr) in the file. Our error message gave no hint of this behavior; it now does:

# Before:
* couldn't save 'utils/dmidecode.txt': exit status 1

# Now:
* couldn't save 'utils/dmidecode.txt': exit status 1; utils/dmidecode.txt contains the full error message

$ cat utils/dmidecode.txt
# dmidecode 3.3
/sys/firmware/dmi/tables/smbios_entry_point: Permission denied
Scanning /dev/mem for entry point.
/dev/mem: Permission denied
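The behavior above can be sketched as follows, assuming the command's stdout and stderr are written together to the output file. The function name `run` is hypothetical. The sketch adds the hint only when the command actually ran and exited non-zero; that distinction is a choice made for this example, not something the PR states.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// run executes prog and saves its combined stdout and stderr to dir/name.
// When the command exits non-zero, the output is still saved and the returned
// error points at the file that holds the full message.
func run(dir, name, prog string, args ...string) error {
	out, runErr := exec.Command(prog, args...).CombinedOutput()
	var exitErr *exec.ExitError
	if runErr != nil && !errors.As(runErr, &exitErr) {
		// The command never ran (for example, it is not installed), so there
		// is no output worth pointing to.
		return fmt.Errorf("couldn't save '%s': %w", name, runErr)
	}
	path := filepath.Join(dir, name)
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return fmt.Errorf("couldn't save '%s': %w", name, err)
	}
	if err := os.WriteFile(path, out, 0o644); err != nil {
		return fmt.Errorf("couldn't save '%s': %w", name, err)
	}
	if runErr != nil {
		return fmt.Errorf("couldn't save '%s': %w; %s contains the full error message", name, runErr, name)
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "bundle")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	// dmidecode typically needs root, so this usually reports an error.
	if err := run(dir, "utils/dmidecode.txt", "dmidecode"); err != nil {
		fmt.Println("*", err)
	}
}
```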

Backports Required

Release Notes

Improvements

rpk debug bundle now falls back to the loaded profile's admin API URLs if it fails to discover the cluster in the collection steps.

src/go/rpk/pkg/cli/debug/bundle/bundle_k8s_linux.go

andrewhsu · 2024-06-11T17:06:11Z

@r-vasquez when you get the chance, can you rebase this PR on top of tip of dev branch to get the changes that were merged in PR #19625 to address gha triage job failure?

vbotbuildovich · 2024-06-11T22:42:15Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/50122#01900911-f28e-4ad4-8a01-59c50aa2f669

Most of the time, slab info collection fails due to a permission error.

If a user provides a configuration file without redpanda.data_directory, rpk won't know where to find the controller log dirs. We now provide a better error message instead of:

	* lstat redpanda/controller/0_0: no such file or directory

Either way, a configuration file (redpanda.yaml) without a data_directory is an invalid config file.

When a command fails to run, rpk returns:

	- couldn't save 'foo.txt': exit status 1

and saves stderr in foo.txt for full debugging. This is not clear, so users may be lost about what happened and not know how to get past the error. We are adding a hint that points to where the rest of the error is (which might be multiple lines of text).

Clusters deployed with helm/operator will now have the rpk section of the redpanda.yaml filled with the Admin API addresses of the cluster. We fall back to these addresses in case rpk can't discover the API addresses using the k8s API.

We now check whether the authenticated user account is authorized to collect the k8s resources needed for the debug bundle process. If not, we skip those steps and instead provide a single, meaningful error message with a hint on how to solve this (a link to our docs).

r-vasquez · 2024-06-18T17:43:54Z

/backport v24.1.x

r-vasquez requested review from twmb, gene-redpanda and Deflaimun as code owners June 11, 2024 01:09

r-vasquez force-pushed the improve-k8s-bundle branch from e26c7af to 06b2eb1 Compare June 11, 2024 01:10

r-vasquez added the area/rpk label Jun 11, 2024

twmb previously approved these changes Jun 11, 2024

r-vasquez added the kind/enhance and area/k8s labels and removed the area/k8s label Jun 11, 2024

JakeSCahill reviewed Jun 11, 2024

src/go/rpk/pkg/cli/debug/bundle/bundle_k8s_linux.go (outdated)

r-vasquez dismissed twmb’s stale review via 2f62cd4 June 11, 2024 17:18

r-vasquez force-pushed the improve-k8s-bundle branch from 06b2eb1 to 2f62cd4 Compare June 11, 2024 17:18

r-vasquez requested a review from twmb June 11, 2024 17:26

twmb previously approved these changes Jun 11, 2024

r-vasquez added 5 commits June 12, 2024 09:23

rpk: add hint to debug slab info collection

87927b3

Most of the time this step fails due to a permission error.

r-vasquez dismissed twmb’s stale review via e779bf3 June 12, 2024 16:23

r-vasquez force-pushed the improve-k8s-bundle branch from 2f62cd4 to e779bf3 Compare June 12, 2024 16:23

twmb approved these changes Jun 12, 2024

r-vasquez merged commit 996183e into redpanda-data:dev Jun 12, 2024
22 checks passed

This was referenced Jun 18, 2024

[v24.1.x] rpk debug bundle: Improve errors.txt / error output #19877

Closed

[v24.1.x] rpk: Improve k8s bundle errors + better admin API fallback #19878

Merged

r-vasquez deleted the improve-k8s-bundle branch July 3, 2024 22:36
