
[BUG] Operator doesn't manage the metric exporter sidecar #72

Open
eduardchernomaz opened this issue Dec 7, 2022 · 7 comments

@eduardchernomaz

Describe the bug
After the LocustTest custom resource has been applied and the test has run for the specified duration, the run does not finish cleanly. The worker job completes, while the master job continues to run.

To Reproduce
Steps to reproduce the behavior:

  1. Apply the LocustTest manifest to start the test
  2. Once the test has run for the specified duration, list the available jobs and pods
  3. You should see that the worker job and worker pods have completed. However, the master job has not completed and its pod is in a NotReady state.

Expected behavior
Once the test has completed, both the worker and the master pods should be in a Completed state and eventually removed.

Screenshots
Pods status after the test has completed.

Jobs status after the test has completed.

Additional context
I suspect the problem is that, on the master pod, the locust-metrics-exporter container never stops and continues to run, which prevents the job from being signalled as complete.

@AbdelrhmanHamouda
Owner

Hello @eduardchernomaz,
your assessment of the root cause is on point. Indeed, this behavior is caused by the metrics exporter. This is a known issue that I am aware of and intend to solve.

Problem explanation:
The exporter is a sidecar container whose existence the native Locust image is not aware of. Kubernetes also doesn't provide native support for defining sidecar container behavior, e.g. "shut down after container x exits". This is important to understand because even once I solve this for the metrics exporter container, the same issue will appear if your organisation's cluster configuration injects other sidecars, e.g. an Istio sidecar.

In any case, until I push the fix, here are 2 simple workarounds you can employ in the meantime.

Workaround:

  • Option 1: Use a custom image that calls the recently added /quitquitquit endpoint on the exporter to make it exit. The call would look something like this: curl -fsI -XPOST http://localhost:9646/quitquitquit
  • Option 2: Call the /quitquitquit endpoint at the end of your test. In Locust terms, this translates to making the call in the function annotated with @events.quitting.add_listener (see the sketch below).
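
For Option 2, a minimal sketch of what such a listener could look like when added to your existing locustfile. The exporter port is taken from the curl command above; the master-only guard and the use of requests are assumptions for illustration, not operator code:

```python
# Workaround Option 2 (sketch): when Locust begins shutting down, ask the
# metrics-exporter sidecar to exit as well so the master pod can complete.
# Assumes the exporter listens on localhost:9646, as in the curl example above.
import requests

from locust import events
from locust.runners import MasterRunner


@events.quitting.add_listener
def shut_down_metrics_exporter(environment, **kwargs):
    # Only the master pod carries the metrics-exporter sidecar (assumption).
    if not isinstance(environment.runner, MasterRunner):
        return
    try:
        requests.post("http://localhost:9646/quitquitquit", timeout=5)
    except requests.RequestException:
        # Best effort: if the exporter is already gone, there is nothing to do.
        pass
```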

@AbdelrhmanHamouda AbdelrhmanHamouda added the Bug Something isn't working label Dec 12, 2022
@AbdelrhmanHamouda AbdelrhmanHamouda changed the title from "[BUG]" to "[BUG] Operator doesn't manage the metric exporter sidecar" Dec 12, 2022
@AbdelrhmanHamouda
Owner

More info on the proposed fix (still under investigation): the idea is to extend the Operator's operation to include container management as a secondary resource of the custom resource. Meaning that after the Operator creates the main cluster resources, it needs to register a reconciler to manage the secondary resources created by Kubernetes itself. Doing so, the Operator can start reacting to events coming from specific containers within specific pods.
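
To illustrate the idea only (this is not the Operator's implementation; just a minimal sketch using the Python Kubernetes client, where the namespace, label selector, and container name are assumptions):

```python
# Sketch of the "secondary resource" idea: watch the pods that back a
# LocustTest and react when a specific container inside a specific pod
# terminates. The label selector and container name below are hypothetical.
from kubernetes import client, config, watch


def watch_locust_master_pods(namespace: str = "default") -> None:
    config.load_incluster_config()  # use config.load_kube_config() outside the cluster
    core = client.CoreV1Api()

    w = watch.Watch()
    for event in w.stream(
        core.list_namespaced_pod,
        namespace=namespace,
        label_selector="app=locust-master",  # hypothetical label for the master pod
    ):
        pod = event["object"]
        for status in pod.status.container_statuses or []:
            # "locust" is the assumed name of the main Locust container.
            if status.name == "locust" and status.state.terminated is not None:
                # At this point a reconciler could tell the metrics-exporter
                # sidecar to exit (e.g. POST /quitquitquit) or clean up the job.
                print(f"{pod.metadata.name}: locust container has terminated")
                w.stop()


if __name__ == "__main__":
    watch_locust_master_pods()
```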

From a security perspective, I also need to investigate what additional cluster privileges (if any) may be required by such a solution.

@eduardchernomaz
Author

I wonder if we can also just add a PreStop container hook to the master deployment which would call the /quitquitquit endpoint.

@AbdelrhmanHamouda
Owner

AbdelrhmanHamouda commented Dec 13, 2022

It is an interesting idea, and one that makes a lot of sense. I will dig into it and see what is needed to put this in place.

@AbdelrhmanHamouda
Owner

After some investigation, a PreStop hook won't fit this use case. According to the documentation, it is only invoked when container termination is triggered from the outside, not when a container exits gracefully because its internal process has finished.

I am moving the investigation on to assessing whether the liveness probe can be used to achieve the desired effect without marking the job as "error".

@AbdelrhmanHamouda
Owner

Possible implementation approach: change the metrics_exporter liveness probe to ping the locust container every 10 seconds and, on failure, send a curl to /quitquitquit.

@AbdelrhmanHamouda AbdelrhmanHamouda added the Duplicate This issue or pull request already exists label Feb 16, 2023
@AbdelrhmanHamouda
Owner

This will be solved with the fix for #50

@AbdelrhmanHamouda AbdelrhmanHamouda self-assigned this Feb 16, 2023