Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make flux logs more lenient #3945

Merged
merged 1 commit into from
Jun 5, 2023
Merged

Make flux logs more lenient #3945

merged 1 commit into from
Jun 5, 2023

Conversation

makkes
Copy link
Member

@makkes makkes commented Jun 1, 2023

UX changes:

  • Only print an error when a pod doesn't have a matching container
    instead of exiting early.
  • Return a non-zero status code when no pod is found at all.

Details:

In certain situations there might be 3rd-party pods running in the
Flux namespace that cause the command to fail streaming logs, e.g.
when they have multiple containers but none of them is called
manager (which all Flux-maintained pods do). An example of such a
situation is when Flux is installed with the 3rd-party Flux extension
on AKS.

The logs command is now more forgiving and merely logs an error in
these situations instead of completely bailing out. It still returns a
non-zero exit code.

For the parallel log streaming with -f the code is now a little more
complex so that errors are now written to stderr in parallel with all
other logs written to stdout. That's what asyncCopy is for.

Before:

$ flux logs
✗ container manager is not valid for pod nginx-797bf9d444-vrhdr
$ echo $?
1

After:

$ flux logs
[...]
2023-06-01T08:05:07.860Z info Kustomization/flux-system.flux-system - Discarding event, no alerts found for the involved object 
2023-06-01T08:15:08.581Z info Kustomization/flux-system.flux-system - Discarding event, no alerts found for the involved object
container manager is not valid for pod nginx-797bf9d444-tbc8m
container manager is not valid for pod nginx2-7b4b6d8d54-stdrk
2023-06-02T12:11:17.309Z info GitRepository/flux-system.flux-system - Discarding event, no alerts found for the involved object 
2023-06-02T12:11:18.811Z info Kustomization/flux-system.flux-system - Discarding event, no alerts found for the involved object
[...]
✗ failed to collect logs from all Flux pods
$ echo $?
1

Tests

I changed the testing of the logs command and split the tests into unit and e2e tests so that the tests running the actual command are now run as part of the e2e suite so that they run against an actual cluster and we can test more complex scenarios and also because the command now fails if no pod can be found to print logs from.

refs #3944

@hiddeco
Copy link
Member

hiddeco commented Jun 1, 2023

@makkes can you provide an example of the output? Mostly trying to see how the \n looks from a UX perspective.

@makkes
Copy link
Member Author

makkes commented Jun 1, 2023

@makkes can you provide an example of the output? Mostly trying to see how the \n looks from a UX perspective.

done

@stefanprodan
Copy link
Member

Can we please accommodate flux-aio. It’s a single pod, with a container per controller, the container name matches the controller name.

@makkes
Copy link
Member Author

makkes commented Jun 1, 2023

Can we please accommodate flux-aio. It’s a single pod, with a container per controller, the container name matches the controller name.

One way we could solve this is to print the logs of all containers of a matching pod. In #2844 @hiddeco deliberately chose to show only the logs from the manager container, though. I'm not sure if the option to list all containers' logs was on the table and if it was discarded for a reason I don't know.

@stefanprodan
Copy link
Member

stefanprodan commented Jun 1, 2023

We can’t print logs of random containers as we’ll end up with Envoy/Linkerd/ngnix sidecars if people run Flux in a service mesh for mTLS. What I suggest is to look beside manager for the controller names which are already declared in flux2 manifestgen

@hiddeco
Copy link
Member

hiddeco commented Jun 1, 2023

Shouldn't we first put the project under the Flux umbrella before facilitating changes for an experimental and semi-official project?

@stefanprodan
Copy link
Member

Shouldn't we first put the project under the Flux umbrella before facilitating changes for an experimental and semi-official project?

Ok fair point. I can document why flux logs is the only command that doesn’t work with aio.

Regarding Azure, I could say the same thing, they shouldn’t label something as part-of: flux that’s not under Flux org. They could use some other value like azure-flux and that would make the current log command work.

Copy link
Member

@hiddeco hiddeco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we fail to recognize any Deployment and/or Pod at all, I would expect this to still return an error (and error code). While at present on a freshly bootstrapped cluster (without a flux-system namespace, nor any components installed), it returns:

$ flux logs
$ echo $?
0

@makkes
Copy link
Member Author

makkes commented Jun 2, 2023

Ok, let's settle on a way forward. The options we have:

  1. Leave the code as it is and ask 3rd-party components such as the Azure Flux extension to remove the app.kubernetes.io/part-of label or at least change its value. This will leave flux logs unusable with flux-aio (which technically is not (yet) a part of Flux).
  2. Make the command more lenient as proposed in this PR, making it compatible with the Azure Flux extension.

If we opt for 2. then we still need to decide if we want to add extra code to handle flux-aio or not. I'm on the fence about this because I don't know how many users actually use flux-aio at this point. Given it's not part of the Flux project I'd personally prefer to only handle flux-aio when it has been made part of the Flux project (or at least fluxcd-community).

@hiddeco
Copy link
Member

hiddeco commented Jun 2, 2023

As I understand it, both Stefan and I agree that it should be option 2. But without taking flux-aio into account at this moment. Because we don't have full control over people and/or projects not doing this, it should still be lenient.

However, I would still in some way like to see #3945 (review) covered.

@makkes
Copy link
Member Author

makkes commented Jun 2, 2023

As I understand it, both Stefan and I agree that it should be option 2. But without taking flux-aio into account at this moment. Because we don't have full control over people and/or projects not doing this, it should still be lenient.

@stefanprodan said

Regarding Azure, I could say the same thing, they shouldn’t label something as part-of: flux that’s not under Flux org. They could use some other value like azure-flux and that would make the current log command work.

That's why I brought option 1 up to begin with.

@hiddeco
Copy link
Member

hiddeco commented Jun 2, 2023

Think failing as hard as we do for misconfigurations, and thereby breaking the primary feature of what it tries to offer -- is bad from a UX perspective in any case. Which makes this change unarguably better in comparison to what's offered now.

@makkes
Copy link
Member Author

makkes commented Jun 2, 2023

Ok good. I'll address the remaining comment and ask for another round of reviews.

@makkes makkes requested a review from hiddeco June 2, 2023 12:32
@makkes makkes force-pushed the lenient-logs-cmd branch 2 times, most recently from 7735e7a to 926a4ce Compare June 2, 2023 16:25
Copy link
Member

@hiddeco hiddeco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better overall UX, thanks a lot @makkes 💯

cmd/flux/logs.go Outdated Show resolved Hide resolved
UX changes:

- Only print an error when a pod doesn't have a matching container
  instead of exiting early.
- Return a non-zero status code when no pod is found at all.

Details:

In certain situations there might be 3rd-party pods running in the
Flux namespace that cause the command to fail streaming logs, e.g.
when they have multiple containers but none of them is called
`manager` (which all Flux-maintained pods do). An example of such a
situation is when Flux is installed with the 3rd-party Flux extension
on AKS.

The `logs` command is now more forgiving and merely logs an error in
these situations instead of completely bailing out. It still returns a
non-zero exit code.

For the parallel log streaming with `-f` the code is now a little more
complex so that errors are now written to stderr in parallel with all
other logs written to stdout. That's what `asyncCopy` is for.

refs #3944

Signed-off-by: Max Jonas Werner <mail@makk.es>
@makkes makkes force-pushed the lenient-logs-cmd branch 2 times, most recently from 6b8afc9 to cbdd71e Compare June 5, 2023 08:07
@makkes makkes merged commit a3f2b1d into main Jun 5, 2023
@makkes makkes deleted the lenient-logs-cmd branch June 5, 2023 08:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants