
Destination pods disappear before exit code can be retrieved #10934

Closed
lmossman opened this issue Mar 8, 2022 · 7 comments
Assignees
Labels
area/platform (issues related to the platform) · team/compose · team/platform-move · type/bug (Something isn't working)

Comments

@lmossman
Contributor

lmossman commented Mar 8, 2022

Environment

  • Airbyte version:
  • OS Version / Instance:
  • Deployment:
  • Source Connector and version:
  • Destination Connector and version:
  • Severity:
  • Step where error happened: Sync Job

Current Behavior

It seems that for long-running syncs that run on GKE, pods are sometimes swept by the GKE pod-garbage-collector immediately after they complete, before the exit code can be retrieved from the pod at the end of the DefaultReplicationWorker process.
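For context, here is a minimal sketch of what retrieving an exit code through the fabric8 Kubernetes client looks like (this is not Airbyte's actual KubePodProcess code; the namespace and pod name below are placeholders). If the Pod object has already been garbage-collected, the lookup returns null and the exit code is unrecoverable, which is roughly the failure path shown in the stack trace below:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class PodExitCodeSketch {

  public static int getReturnCode(final KubernetesClient client, final String namespace, final String podName) {
    // Look the pod up by name; this is a live API-server call, not a cached view.
    final Pod pod = client.pods().inNamespace(namespace).withName(podName).get();
    if (pod == null) {
      // The pod-garbage-collector already deleted the Pod object, so the exit code is gone with it.
      throw new RuntimeException("Cannot find pod while trying to retrieve exit code.");
    }
    // Otherwise read the terminated state of the first terminated container.
    return pod.getStatus().getContainerStatuses().stream()
        .filter(status -> status.getState().getTerminated() != null)
        .findFirst()
        .map(status -> status.getState().getTerminated().getExitCode())
        .orElseThrow(() -> new IllegalStateException("Main container has not terminated yet."));
  }

  public static void main(final String[] args) {
    try (KubernetesClient client = new DefaultKubernetesClient()) {
      System.out.println(getReturnCode(client, "default", "destination-sync-pod")); // placeholder names
    }
  }
}
```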

The resulting effect is that even though a sync fully completes and the source and destination pods both exit successfully, the following error is thrown when the DefaultReplicationWorker tries to retrieve the exit code of the pod that no longer exists:

2022-03-07 21:37:10 replication-orchestrator > Sync worker failed.
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.RuntimeException: Cannot find pod while trying to retrieve exit code. This probably means the Pod was not correctly created.
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073) ~[?:?]
	at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:163) [io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
	at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:56) [io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
	at io.airbyte.container_orchestrator.ReplicationJobOrchestrator.runJob(ReplicationJobOrchestrator.java:102) [io.airbyte-airbyte-container-orchestrator-0.35.46-alpha.jar:?]
	at io.airbyte.container_orchestrator.ContainerOrchestratorApp.runInternal(ContainerOrchestratorApp.java:116) [io.airbyte-airbyte-container-orchestrator-0.35.46-alpha.jar:?]
	at io.airbyte.container_orchestrator.ContainerOrchestratorApp.run(ContainerOrchestratorApp.java:146) [io.airbyte-airbyte-container-orchestrator-0.35.46-alpha.jar:?]
	at io.airbyte.container_orchestrator.ContainerOrchestratorApp.main(ContainerOrchestratorApp.java:179) [io.airbyte-airbyte-container-orchestrator-0.35.46-alpha.jar:?]
	Suppressed: java.lang.RuntimeException: Cannot find pod while trying to retrieve exit code. This probably means the Pod was not correctly created.
		at io.airbyte.workers.process.KubePodProcess.getReturnCode(KubePodProcess.java:714) ~[io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
		at io.airbyte.workers.process.KubePodProcess.exitValue(KubePodProcess.java:741) ~[io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
		at java.lang.Process.hasExited(Process.java:584) ~[?:?]
		at java.lang.Process.isAlive(Process.java:574) ~[?:?]
		at io.airbyte.workers.WorkerUtils.gentleClose(WorkerUtils.java:36) ~[io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
		at io.airbyte.workers.protocols.airbyte.DefaultAirbyteDestination.close(DefaultAirbyteDestination.java:115) ~[io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:125) [io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:56) [io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
		at io.airbyte.container_orchestrator.ReplicationJobOrchestrator.runJob(ReplicationJobOrchestrator.java:102) [io.airbyte-airbyte-container-orchestrator-0.35.46-alpha.jar:?]
		at io.airbyte.container_orchestrator.ContainerOrchestratorApp.runInternal(ContainerOrchestratorApp.java:116) [io.airbyte-airbyte-container-orchestrator-0.35.46-alpha.jar:?]
		at io.airbyte.container_orchestrator.ContainerOrchestratorApp.run(ContainerOrchestratorApp.java:146) [io.airbyte-airbyte-container-orchestrator-0.35.46-alpha.jar:?]
		at io.airbyte.container_orchestrator.ContainerOrchestratorApp.main(ContainerOrchestratorApp.java:179) [io.airbyte-airbyte-container-orchestrator-0.35.46-alpha.jar:?]
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Cannot find pod while trying to retrieve exit code. This probably means the Pod was not correctly created.
	at io.airbyte.workers.DefaultReplicationWorker.lambda$getDestinationOutputRunnable$6(DefaultReplicationWorker.java:366) ~[io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
	at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]
Caused by: java.lang.RuntimeException: Cannot find pod while trying to retrieve exit code. This probably means the Pod was not correctly created.
	at io.airbyte.workers.process.KubePodProcess.getReturnCode(KubePodProcess.java:714) ~[io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
	at io.airbyte.workers.process.KubePodProcess.exitValue(KubePodProcess.java:741) ~[io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
	at java.lang.Process.hasExited(Process.java:584) ~[?:?]
	at java.lang.Process.isAlive(Process.java:574) ~[?:?]
	at io.airbyte.workers.protocols.airbyte.DefaultAirbyteDestination.isFinished(DefaultAirbyteDestination.java:146) ~[io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
	at io.airbyte.workers.DefaultReplicationWorker.lambda$getDestinationOutputRunnable$6(DefaultReplicationWorker.java:340) ~[io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
	at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]

This is affecting both Cloud users (related issue) and OSS users (Slack thread).

Expected Behavior

The expected behavior is that if both the source and destination pods complete successfully, then the replication process should also finish successfully.

To solve this issue, we will need to add logic that accounts for the case where the pod is swept before its exit code has been retrieved.
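One possible shape for that logic, sketched below, is purely hypothetical and not the fix that was actually implemented: cache the exit code the first time it is observed, so that later liveness/exit-value checks no longer require the Pod object to still exist in the API server.

```java
import java.util.OptionalInt;
import java.util.concurrent.atomic.AtomicReference;

import io.fabric8.kubernetes.api.model.ContainerStatus;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;

/**
 * Memoizes a pod's exit code the first time it is observed, so a later sweep of the
 * Pod object by the garbage collector does not make the exit code unrecoverable.
 */
class CachedPodExitCode {

  private final AtomicReference<Integer> cachedExitCode = new AtomicReference<>();

  OptionalInt tryGetExitCode(final KubernetesClient client, final String namespace, final String podName) {
    final Integer known = cachedExitCode.get();
    if (known != null) {
      return OptionalInt.of(known); // already observed; no API call needed
    }

    final Pod pod = client.pods().inNamespace(namespace).withName(podName).get();
    if (pod == null) {
      // The pod was swept before we ever saw it terminate; the exit code is genuinely lost.
      return OptionalInt.empty();
    }

    for (final ContainerStatus status : pod.getStatus().getContainerStatuses()) {
      if (status.getState().getTerminated() != null) {
        cachedExitCode.compareAndSet(null, status.getState().getTerminated().getExitCode());
        return OptionalInt.of(cachedExitCode.get());
      }
    }
    return OptionalInt.empty(); // still running
  }
}
```

With something like this, callers that poll isAlive/exitValue would hit the cached value even after the Pod object disappears, provided the terminated state was observed at least once before the sweep.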

Logs

I cut out the uninteresting middle part of the logs so that I could upload them:
logs-66220.txt

Steps to Reproduce

  1. Run a sync for this connection: https://cloud.airbyte.io/workspaces/a2472741-7f90-4ff3-98c9-5096fd63f9a5/connections/75081114-612c-403b-b426-f0d917376613/status

Are you willing to submit a PR?

Remove this with your answer.

lmossman added the type/bug and needs-triage labels on Mar 8, 2022
@Kopiczek

Kopiczek commented Mar 8, 2022

Initially I thought this might be due to the sweeper, but it is not. We've observed this in GKE, and it was due to Kubernetes garbage collection of the container (verified with logs). Since we can't control GC (the kube config is not exposed to users), this might need to be handled differently on the Airbyte side?

@benmoriceau
Contributor

@lmossman I believe that it will be fixed by setting ttlSecondsAfterFinished to a large enough value. I am trying to figure out how.
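For reference, ttlSecondsAfterFinished is a field on the Kubernetes Job spec rather than on a bare Pod, so using it would presumably mean wrapping the connector pod in a Job. A rough, purely illustrative sketch with the fabric8 builder (the name and image are placeholders, and this is not how Airbyte currently launches connector pods):

```java
import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder;

public class TtlJobSketch {

  public static Job buildSyncJob() {
    return new JobBuilder()
        .withNewMetadata()
          .withName("destination-sync") // placeholder name
        .endMetadata()
        .withNewSpec()
          // Keep the finished Job (and the pods it owns) around long enough to read exit codes.
          .withTtlSecondsAfterFinished(3600)
          .withNewTemplate()
            .withNewSpec()
              .addNewContainer()
                .withName("main")
                .withImage("<destination-image>") // placeholder image
              .endContainer()
              .withRestartPolicy("Never")
            .endSpec()
          .endTemplate()
        .endSpec()
        .build();
  }
}
```

This only helps if the connector workloads are created as Jobs; as noted below, the cluster-level pod GC settings themselves are generally not configurable on managed offerings.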

@Kopiczek

Kopiczek commented Mar 8, 2022

It is, but only when you have access to the kube config. Most cloud providers won't allow access to this (GCP and Azure for sure don't).

benmoriceau added the area/platform label and removed the needs-triage label on Mar 9, 2022
@lmossman
Contributor Author

lmossman commented Mar 9, 2022

@Kopiczek are you able to post some logs here showing that the Kubernetes garbage collector swept the pod in your case? That would help us verify the behavior here.

@lmossman
Contributor Author

lmossman commented Mar 9, 2022

We found a log message indicating that the pod was swept by the kube pod-garbage-collector: https://console.cloud.google.com/logs/query;query=%22destination-snowflake-sync-66220-1-vywlx%22%0Atimestamp%3D%222022-03-07T21:36:43.071914Z%22%0AinsertId%3D%2251dd1d26-0960-469f-8650-03bb7bcbd5cd%22;timeRange=2022-03-06T21:36:42.450Z%2F2022-03-08T21:36:42.450Z;cursorTimestamp=2022-03-07T21:36:43.071914Z?project=prod-ab-cloud-proj

We still aren't certain why the garbage collector is sweeping these pods immediately after they complete, but we are exploring solutions to account for this case.

jrhizor self-assigned this on Mar 10, 2022
@jrhizor
Contributor

jrhizor commented Mar 24, 2022

Acceptance tests are passing. The main race condition related to initialization was addressed. I still haven't scale tested as per #11083, so this is unfortunately going to bounce another week.

@jrhizor
Contributor

jrhizor commented Mar 31, 2022

This should be fixed in 0.35.63-alpha. We've tested this for the case we were seeing on GKE.

@Kopiczek please create another issue (and reference this one) if our fix in the new version doesn't address the problems you were seeing.
