rpk: Improve k8s bundle errors + better admin API fallback #19473

Merged
merged 5 commits into from
Jun 12, 2024

Conversation

r-vasquez
Contributor

Debug bundles are often collected when things are not working properly, so it is normal for rpk debug bundle to hit some errors during the collection steps. This PR improves the error messages and provides better hints when errors occur, with a focus on the Kubernetes experience.

Fixes #18057

Main Changes:

  1. rpk uses the k8s API to 'find' the admin API addresses and collect the logs. This is the first step and many other steps depend on it, so if the service account is not authorized to access some resources, those steps fail too. Now we check for authorization before executing the steps, which reduces the clutter and provides a better error message:
# Before:
	* unable to get pods in the "redpanda" namespace: pods is forbidden: User "system:serviceaccount:redpanda:default" cannot list resource "pods" in API group "" in the namespace "redpanda"
	* unable to get pods in the "redpanda" namespace: pods is forbidden: User "system:serviceaccount:redpanda:default" cannot list resource "pods" in API group "" in the namespace "redpanda"

# Now:

	* skipping log collection and collecting Kubernetes resources (such as pods, services, etc.) in the namespace "permission denied to list services". To enable this you may need to grant additional permissions to your service account; visit https://docs.redpanda.com/current/manage/kubernetes/troubleshooting/k-diagnostics-bundle/
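
A minimal sketch of the pre-flight idea described above: check every required permission once and return a single error instead of one error per step. The `accessChecker` type, the `checkBundlePermissions` function, and the resource list are hypothetical names for illustration, not rpk's actual code. In a real cluster the checker could be backed by a Kubernetes SelfSubjectAccessReview, but that is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// accessChecker is a hypothetical abstraction that answers: may the current
// service account perform this verb on this resource in this namespace?
type accessChecker func(namespace, verb, resource string) bool

// checkBundlePermissions asks for "list" access on every resource and
// returns one error naming all denied resources, or nil if none are denied.
func checkBundlePermissions(can accessChecker, namespace string, resources []string) error {
	var denied []string
	for _, r := range resources {
		if !can(namespace, "list", r) {
			denied = append(denied, r)
		}
	}
	if len(denied) == 0 {
		return nil
	}
	return fmt.Errorf(
		"skipping Kubernetes collection in namespace %q: permission denied to list %s; "+
			"you may need to grant additional permissions to your service account",
		namespace, strings.Join(denied, ", "))
}

func main() {
	// Fake checker for the example: only pods may be listed.
	can := func(_, _, resource string) bool { return resource == "pods" }
	if err := checkBundlePermissions(can, "redpanda", []string{"pods", "services"}); err != nil {
		fmt.Println(err)
	}
}
```

Running the check up front means the dependent steps can be skipped entirely when it fails, which is what removes the repeated "pods is forbidden" lines.
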
  1. Our fallback in the case of (1) was to use localhost:9644 for the admin API addresses. We now use the loaded profile's addresses as the primary fallback, since the profile also includes the TLS configuration. This has a big impact on clusters created using our helm chart/operator, since we now populate the redpanda.yaml with the cluster admin API addresses:
# Before
	* unable to issue request for "admin/disk_stat_cache_127.0.0.1-9644.json": Get "https://127.0.0.1:9644/v1/debug/storage/disk_stat/cache": tls: failed to verify certificate: x509: cannot validate certificate for 127.0.0.1 because it doesn't contain any IP SANs
	* unable to issue request for "metrics/127.0.0.1-9644/t0_public_metrics.txt": Get "https://127.0.0.1:9644/public_metrics": tls: failed to verify certificate: x509: cannot validate certificate for 127.0.0.1 because it doesn't contain any IP SANs
	* unable to issue request for "admin/node_config_127.0.0.1-9644.json": Get "https://127.0.0.1:9644/v1/node_config": tls: failed to verify certificate: x509: cannot validate certificate for 127.0.0.1 because it doesn't contain any IP SANs
	* unable to issue request for "admin/raft_status_127.0.0.1-9644.json": Get "https://127.0.0.1:9644/v1/raft/recovery/status": tls: failed to verify certificate: x509: cannot validate certificate for 127.0.0.1 because it doesn't contain any IP SANs

# Now: use the profile, which would have the TLS configuration in place :smile: 
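
The fallback order can be sketched roughly as below. `adminAddresses` is a hypothetical helper, the host name in `main` is made up, and keeping 127.0.0.1:9644 as a last resort is an assumption based on the phrase "primary fallback".

```go
package main

import "fmt"

// adminAddresses picks the admin API addresses to query, in priority order:
// addresses discovered through the k8s API, then the loaded profile's
// addresses, then the local default (assumed to remain as a last resort).
func adminAddresses(discovered, profile []string) []string {
	if len(discovered) > 0 {
		return discovered
	}
	if len(profile) > 0 {
		return profile
	}
	return []string{"127.0.0.1:9644"}
}

func main() {
	// Discovery failed (nil), so the profile's addresses are used.
	profile := []string{"redpanda-0.example.internal:9644"} // hypothetical host
	fmt.Println(adminAddresses(nil, profile))
}
```

Because the profile's addresses are host names that match the certificate, this avoids the "doesn't contain any IP SANs" failures shown in the Before output.
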
  1. /proc/slabinfo collection often fails because rpk debug bundle is not being executed with root permissions:
# Before: 
open /proc/slabinfo: permission denied
# Now:
open /proc/slabinfo: permission denied; you may need to run the command as root to read this file
  1. Controller log collection requires redpanda.data_directory to be set in the configuration file (redpanda.yaml). This setting is also necessary to start Redpanda, so a missing value is often a sign of a corrupted or invalid config file. The error we were printing did not clearly indicate that:
# Before:
	* lstat redpanda/controller/0_0: no such file or directory
# Now: 
	* failed to save controller logs: 'redpanda.data_directory' is empty on the provided configuration file
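
A sketch of the validation, assuming the controller log lives under redpanda/controller/0_0 inside the data directory (inferred from the old lstat error). `controllerLogDir` is a hypothetical helper and the path in `main` is only an example value.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
)

// controllerLogDir returns the controller log directory for a data
// directory, or a descriptive error when the data directory is empty.
// Without the check, joining an empty prefix yields the relative path
// "redpanda/controller/0_0", which is what the old lstat error showed.
func controllerLogDir(dataDirectory string) (string, error) {
	if dataDirectory == "" {
		return "", errors.New("failed to save controller logs: 'redpanda.data_directory' is empty on the provided configuration file")
	}
	return filepath.Join(dataDirectory, "redpanda", "controller", "0_0"), nil
}

func main() {
	if _, err := controllerLogDir(""); err != nil {
		fmt.Println(err)
	}
	dir, _ := controllerLogDir("/var/lib/redpanda/data") // example value
	fmt.Println(dir)
}
```
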
  1. If a command execution failed (du, top, etc.), we would print that the command exited with status 1 and save the error (stderr) in the file. Our error did not hint at this behavior; now it does:
# Before:
* couldn't save 'utils/dmidecode.txt': exit status 1

# Now:
* couldn't save 'utils/dmidecode.txt': exit status 1; utils/dmidecode.txt contains the full error message

$ cat utils/dmidecode.txt
# dmidecode 3.3
/sys/firmware/dmi/tables/smbios_entry_point: Permission denied
Scanning /dev/mem for entry point.
/dev/mem: Permission denied

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

Improvements

  • rpk debug bundle now falls back to the loaded profile's admin API URLs if it fails to discover the cluster in the collection steps.

twmb
twmb previously approved these changes Jun 11, 2024
@r-vasquez r-vasquez added kind/enhance New feature or request area/k8s and removed area/k8s labels Jun 11, 2024
@andrewhsu
Member

@r-vasquez when you get the chance, can you rebase this PR on top of the tip of the dev branch to get the changes that were merged in PR #19625 to address the gha triage job failure?

twmb
twmb previously approved these changes Jun 11, 2024
@vbotbuildovich
Collaborator

Most of the time this step fails due to a
permission error.

If a user provides a configuration file without
redpanda.data_directory, rpk won't know where to
find the controller log dirs. We now provide a
better error message instead of:

* lstat redpanda/controller/0_0: no such file or directory

Either way, a configuration file (redpanda.yaml)
without a data_directory is an invalid config
file.

When a command fails to run, rpk will return:

- couldn't save 'foo.txt': exit status 1

And will save stderr in foo.txt for full debugging.
This is not clear, so users may be lost about
what happened and won't know how to get past this
error. We are adding a hint about where the rest
of the error is (which might be multiple lines of text).

Clusters deployed with helm/operator will now
have the rpk section of the redpanda.yaml filled
with the Admin API addresses of the cluster. We
fall back to these addresses in case rpk can't
discover the API addresses using the k8s API.

Now we want to check if the authenticated user
account has authorization to collect the k8s
resources needed for the debug bundle process.

If not, we avoid running all the steps and instead
provide a single, meaningful error message
with a hint on how to solve this (link to our docs).
@r-vasquez r-vasquez merged commit 996183e into redpanda-data:dev Jun 12, 2024
22 checks passed
@r-vasquez
Contributor Author

/backport v24.1.x

Labels
area/rpk kind/enhance New feature or request
Development

Successfully merging this pull request may close these issues.

rpk debug bundle: Improve errors.txt / error output
5 participants