
Destination pods disappear before exit code can be retrieved #10934

Closed
lmossman opened this issue Mar 8, 2022 · 7 comments
Assignees
Labels
area/platform (issues related to the platform) · team/compose · team/platform-move · type/bug (Something isn't working)

Comments

@lmossman
Contributor

lmossman commented Mar 8, 2022

Environment

  • Airbyte version:
  • OS Version / Instance:
  • Deployment:
  • Source Connector and version:
  • Destination Connector and version:
  • Severity:
  • Step where error happened: Sync Job

Current Behavior

It seems that for long-running syncs that run on GKE, pods are sometimes swept by the GKE pod-garbage-collector immediately after they complete, before the exit code can be retrieved from the pod at the end of the DefaultReplicationWorker process.
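For context, here is a minimal sketch of what retrieving an exit code through the fabric8 Kubernetes client looks like (this is not Airbyte's actual KubePodProcess code; the namespace and pod name below are placeholders). If the Pod object has already been garbage-collected, the lookup returns null and the exit code is unrecoverable, which is roughly the failure path shown in the stack trace below:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class PodExitCodeSketch {

  public static int getReturnCode(final KubernetesClient client, final String namespace, final String podName) {
    // Look the pod up by name; this is a live API-server call, not a cached view.
    final Pod pod = client.pods().inNamespace(namespace).withName(podName).get();
    if (pod == null) {
      // The pod-garbage-collector already deleted the Pod object, so the exit code is gone with it.
      throw new RuntimeException("Cannot find pod while trying to retrieve exit code.");
    }
    // Otherwise read the terminated state of the first terminated container.
    return pod.getStatus().getContainerStatuses().stream()
        .filter(status -> status.getState().getTerminated() != null)
        .findFirst()
        .map(status -> status.getState().getTerminated().getExitCode())
        .orElseThrow(() -> new IllegalStateException("Main container has not terminated yet."));
  }

  public static void main(final String[] args) {
    try (KubernetesClient client = new DefaultKubernetesClient()) {
      System.out.println(getReturnCode(client, "default", "destination-sync-pod")); // placeholder names
    }
  }
}
```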

The resulting effect is that even though a sync fully completes and the source and destination pods both exit successfully, the following error is thrown when the DefaultReplicationWorker tries to retrieve the exit code of the pod that no longer exists:

2022-03-07 21:37:10 replication-orchestrator > Sync worker failed.
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.RuntimeException: Cannot find pod while trying to retrieve exit code. This probably means the Pod was not correctly created.
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073) ~[?:?]
	at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:163) [io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
	at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:56) [io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
	at io.airbyte.container_orchestrator.ReplicationJobOrchestrator.runJob(ReplicationJobOrchestrator.java:102) [io.airbyte-airbyte-container-orchestrator-0.35.46-alpha.jar:?]
	at io.airbyte.container_orchestrator.ContainerOrchestratorApp.runInternal(ContainerOrchestratorApp.java:116) [io.airbyte-airbyte-container-orchestrator-0.35.46-alpha.jar:?]
	at io.airbyte.container_orchestrator.ContainerOrchestratorApp.run(ContainerOrchestratorApp.java:146) [io.airbyte-airbyte-container-orchestrator-0.35.46-alpha.jar:?]
	at io.airbyte.container_orchestrator.ContainerOrchestratorApp.main(ContainerOrchestratorApp.java:179) [io.airbyte-airbyte-container-orchestrator-0.35.46-alpha.jar:?]
	Suppressed: java.lang.RuntimeException: Cannot find pod while trying to retrieve exit code. This probably means the Pod was not correctly created.
		at io.airbyte.workers.process.KubePodProcess.getReturnCode(KubePodProcess.java:714) ~[io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
		at io.airbyte.workers.process.KubePodProcess.exitValue(KubePodProcess.java:741) ~[io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
		at java.lang.Process.hasExited(Process.java:584) ~[?:?]
		at java.lang.Process.isAlive(Process.java:574) ~[?:?]
		at io.airbyte.workers.WorkerUtils.gentleClose(WorkerUtils.java:36) ~[io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
		at io.airbyte.workers.protocols.airbyte.DefaultAirbyteDestination.close(DefaultAirbyteDestination.java:115) ~[io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:125) [io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
		at io.airbyte.workers.DefaultReplicationWorker.run(DefaultReplicationWorker.java:56) [io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
		at io.airbyte.container_orchestrator.ReplicationJobOrchestrator.runJob(ReplicationJobOrchestrator.java:102) [io.airbyte-airbyte-container-orchestrator-0.35.46-alpha.jar:?]
		at io.airbyte.container_orchestrator.ContainerOrchestratorApp.runInternal(ContainerOrchestratorApp.java:116) [io.airbyte-airbyte-container-orchestrator-0.35.46-alpha.jar:?]
		at io.airbyte.container_orchestrator.ContainerOrchestratorApp.run(ContainerOrchestratorApp.java:146) [io.airbyte-airbyte-container-orchestrator-0.35.46-alpha.jar:?]
		at io.airbyte.container_orchestrator.ContainerOrchestratorApp.main(ContainerOrchestratorApp.java:179) [io.airbyte-airbyte-container-orchestrator-0.35.46-alpha.jar:?]
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Cannot find pod while trying to retrieve exit code. This probably means the Pod was not correctly created.
	at io.airbyte.workers.DefaultReplicationWorker.lambda$getDestinationOutputRunnable$6(DefaultReplicationWorker.java:366) ~[io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
	at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]
Caused by: java.lang.RuntimeException: Cannot find pod while trying to retrieve exit code. This probably means the Pod was not correctly created.
	at io.airbyte.workers.process.KubePodProcess.getReturnCode(KubePodProcess.java:714) ~[io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
	at io.airbyte.workers.process.KubePodProcess.exitValue(KubePodProcess.java:741) ~[io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
	at java.lang.Process.hasExited(Process.java:584) ~[?:?]
	at java.lang.Process.isAlive(Process.java:574) ~[?:?]
	at io.airbyte.workers.protocols.airbyte.DefaultAirbyteDestination.isFinished(DefaultAirbyteDestination.java:146) ~[io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
	at io.airbyte.workers.DefaultReplicationWorker.lambda$getDestinationOutputRunnable$6(DefaultReplicationWorker.java:340) ~[io.airbyte-airbyte-workers-0.35.46-alpha.jar:?]
	at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]

This is affecting both Cloud users (related issue) and OSS users (Slack thread).

Expected Behavior

The expected behavior is that if both the source and destination pods complete successfully, then the replication process should also finish successfully.

To solve this issue, we will need to add logic that accounts for the case where the pod is swept before its exit code has been retrieved.
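One possible shape for that logic, sketched below, is purely hypothetical and not the fix that was actually implemented: cache the exit code the first time it is observed, so that later liveness/exit-value checks no longer require the Pod object to still exist in the API server.

```java
import java.util.OptionalInt;
import java.util.concurrent.atomic.AtomicReference;

import io.fabric8.kubernetes.api.model.ContainerStatus;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;

/**
 * Memoizes a pod's exit code the first time it is observed, so a later sweep of the
 * Pod object by the garbage collector does not make the exit code unrecoverable.
 */
class CachedPodExitCode {

  private final AtomicReference<Integer> cachedExitCode = new AtomicReference<>();

  OptionalInt tryGetExitCode(final KubernetesClient client, final String namespace, final String podName) {
    final Integer known = cachedExitCode.get();
    if (known != null) {
      return OptionalInt.of(known); // already observed; no API call needed
    }

    final Pod pod = client.pods().inNamespace(namespace).withName(podName).get();
    if (pod == null) {
      // The pod was swept before we ever saw it terminate; the exit code is genuinely lost.
      return OptionalInt.empty();
    }

    for (final ContainerStatus status : pod.getStatus().getContainerStatuses()) {
      if (status.getState().getTerminated() != null) {
        cachedExitCode.compareAndSet(null, status.getState().getTerminated().getExitCode());
        return OptionalInt.of(cachedExitCode.get());
      }
    }
    return OptionalInt.empty(); // still running
  }
}
```

With something like this, callers that poll isAlive/exitValue would hit the cached value even after the Pod object disappears, provided the terminated state was observed at least once before the sweep.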

Logs

I cut out the uninteresting middle part of the logs so that I could upload them:
logs-66220.txt

Steps to Reproduce

  1. Run a sync for this connection: https://cloud.airbyte.io/workspaces/a2472741-7f90-4ff3-98c9-5096fd63f9a5/connections/75081114-612c-403b-b426-f0d917376613/status

Are you willing to submit a PR?

Remove this with your answer.

lmossman added the type/bug and needs-triage labels on Mar 8, 2022
@Kopiczek

Kopiczek commented Mar 8, 2022

Initially I thought this might be due to the sweeper, but it is not. We've observed this in GKE, and it was due to Kubernetes garbage collection of the container (verified with logs). Since we can't control GC (the kube config is not exposed to users), this might need to be handled differently on the Airbyte side?

@benmoriceau
Contributor

@lmossman I believe that it will be fixed by setting ttlSecondsAfterFinished to a large enough value. I am trying to figure out how.
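For reference, ttlSecondsAfterFinished is a field on the Kubernetes Job spec rather than on a bare Pod, so using it would presumably mean wrapping the connector pod in a Job. A rough, purely illustrative sketch with the fabric8 builder (the name and image are placeholders, and this is not how Airbyte currently launches connector pods):

```java
import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder;

public class TtlJobSketch {

  public static Job buildSyncJob() {
    return new JobBuilder()
        .withNewMetadata()
          .withName("destination-sync") // placeholder name
        .endMetadata()
        .withNewSpec()
          // Keep the finished Job (and the pods it owns) around long enough to read exit codes.
          .withTtlSecondsAfterFinished(3600)
          .withNewTemplate()
            .withNewSpec()
              .addNewContainer()
                .withName("main")
                .withImage("<destination-image>") // placeholder image
              .endContainer()
              .withRestartPolicy("Never")
            .endSpec()
          .endTemplate()
        .endSpec()
        .build();
  }
}
```

This only helps if the connector workloads are created as Jobs; as noted below, the cluster-level pod GC settings themselves are generally not configurable on managed offerings.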

@Kopiczek

Kopiczek commented Mar 8, 2022

It is, but only when you have access to the kube config. Most cloud providers won't allow access to this (GCP and Azure for sure don't).

benmoriceau added the area/platform label and removed the needs-triage label on Mar 9, 2022
@lmossman
Contributor Author

lmossman commented Mar 9, 2022

@Kopiczek are you able to post some logs here showing that the Kubernetes garbage collector swept the pod in your case? That would help us verify the behavior here.

@lmossman
Contributor Author

lmossman commented Mar 9, 2022

We found a log message indicating that the pod was swept by the kube pod-garbage-collector: https://console.cloud.google.com/logs/query;query=%22destination-snowflake-sync-66220-1-vywlx%22%0Atimestamp%3D%222022-03-07T21:36:43.071914Z%22%0AinsertId%3D%2251dd1d26-0960-469f-8650-03bb7bcbd5cd%22;timeRange=2022-03-06T21:36:42.450Z%2F2022-03-08T21:36:42.450Z;cursorTimestamp=2022-03-07T21:36:43.071914Z?project=prod-ab-cloud-proj

We still aren't certain why the garbage collector is sweeping these pods immediately after they complete, but we are exploring solutions to account for this case.

jrhizor self-assigned this on Mar 10, 2022
@jrhizor
Contributor

jrhizor commented Mar 24, 2022

Acceptance tests are passing. The main race condition related to initialization was addressed. I still haven't scale tested as per #11083, so this is unfortunately going to bounce another week.

@jrhizor
Contributor

jrhizor commented Mar 31, 2022

This should be fixed in 0.35.63-alpha. We've tested this for the case we were seeing on GKE.

@Kopiczek please create another issue (and reference this one) if our fix in the new version doesn't address the problems you were seeing.
