Destination pods disappear before exit code can be retrieved #10934
Comments
Initially thought this might be due to the sweeper, but it is not. We've observed this in GKE, and it was due to Kubernetes garbage collection of the container (verified with logs). Since we can't control GC (the Kube config is not exposed to users), this might need to be handled differently on the Airbyte side?
@lmossman I believe that it will be fixed by setting
It is, but only when you have access to the Kube config. Most cloud providers won't allow access to this (GCP and Azure for sure don't).
@Kopiczek are you able to post some logs here showing that the Kubernetes garbage collector swept the pod in your case? That would be helpful for us to verify the behavior here.
We were able to find a log message in our logs indicating that the pod was swept by the kube pod garbage collector. We still aren't certain why the garbage collector is sweeping these pods immediately after they complete, but we are exploring solutions to account for this case.
Passing acceptance tests. The main race condition related to initialization was addressed. I still haven't scale tested per #11083 yet, so this is going to bounce another week, unfortunately.
This should be fixed in 0.35.63-alpha. We've tested this for the case we were seeing on GKE. @Kopiczek please create another issue (and reference this one) if our fix in the new version doesn't address the problems you were seeing.
Environment
Current Behavior
It seems that for long-running syncs that run on GKE, pods are sometimes swept by the GKE pod-garbage-collector immediately after they complete, before the exit code can be retrieved from the pod at the end of the DefaultReplicationWorker process.
The resulting effect is that even though a sync fully completes and the source and destination pods both exit successfully, the following error is thrown when the DefaultReplicationWorker tries to retrieve the exit code of the pod that no longer exists:
This is affecting both cloud users (related issue) and OSS users (Slack thread).
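For illustration, the failure mode looks roughly like the sketch below. This is a hypothetical example assuming the fabric8 Kubernetes client, not Airbyte's actual code; the class and method names are made up. It shows why reading the exit code fails once the pod garbage collector has already deleted the completed pod: the lookup by name returns null, so no termination status is available.

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;

public class SweptPodExample {

  /** Tries to read the exit code of the pod's first container after it has terminated. */
  static int readExitCode(final KubernetesClient client, final String namespace, final String podName) {
    final Pod pod = client.pods().inNamespace(namespace).withName(podName).get();
    if (pod == null) {
      // This is the race described above: the container exited successfully,
      // but the pod object was already swept, so its exit code can no longer be read.
      throw new IllegalStateException(
          "Pod " + podName + " was garbage-collected before its exit code could be read.");
    }
    return pod.getStatus()
        .getContainerStatuses()
        .get(0)
        .getState()
        .getTerminated()
        .getExitCode();
  }
}
```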
Expected Behavior
The expected behavior is that if both the source and destination pods complete successfully, then the replication process should also finish successfully.
To solve this issue, we will need to add logic to account for the case where the pod is swept before the exit code is retrieved from the pod.
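One possible shape for that logic is sketched below, again assuming the fabric8 client; the `CachedExitCodeReader` class and its methods are hypothetical and not Airbyte's actual implementation. The idea is to cache the exit code as soon as a terminated container is observed, so that a later GC sweep of the pod no longer prevents the worker from reporting it.

```java
import io.fabric8.kubernetes.api.model.ContainerStatus;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;

public class CachedExitCodeReader {

  private final KubernetesClient client;
  private final String namespace;
  private final String podName;
  private Integer cachedExitCode; // set the first time a terminated container is observed

  public CachedExitCodeReader(final KubernetesClient client, final String namespace, final String podName) {
    this.client = client;
    this.namespace = namespace;
    this.podName = podName;
  }

  /** Polls the pod (if it still exists) and remembers the exit code of any terminated container. */
  public synchronized void refresh() {
    final Pod pod = client.pods().inNamespace(namespace).withName(podName).get();
    if (pod == null || pod.getStatus() == null) {
      return; // pod already swept or status not yet populated; keep whatever was cached
    }
    for (final ContainerStatus status : pod.getStatus().getContainerStatuses()) {
      if (status.getState() != null && status.getState().getTerminated() != null) {
        cachedExitCode = status.getState().getTerminated().getExitCode();
      }
    }
  }

  /** Returns the cached exit code, failing only if the pod vanished before a termination was ever seen. */
  public synchronized int getExitCode() {
    refresh();
    if (cachedExitCode != null) {
      return cachedExitCode;
    }
    throw new IllegalStateException(
        "Pod " + podName + " no longer exists and no exit code was observed before it was removed.");
  }
}
```

In this sketch, `refresh()` would be called from an existing status-polling loop while the pod is still present, so by the time the worker asks for the exit code at the end of replication, a missing pod is no longer fatal as long as a termination was observed at least once.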
Logs
I cut out the uninteresting middle part of the logs so that I could upload them:
logs-66220.txt
Steps to Reproduce
Are you willing to submit a PR?