Worker does not finish off job. Source and Destination containers hang #5754
Comments
Just tried the same connection with
Breaking change likely from this PR; could therefore also be affecting CDC on MSSQL & Postgres in the same way.
Weird! I am not sure how my PR could have caused this.
Related?
I spent some time trying to reproduce it and I don't think it's happening because of my changes from this PR #5600. Currently we don't handle the scenario where the worker dies. If the worker dies, our source and destination containers just hang. I could not reproduce why the worker dies, but perhaps it's because of resource exhaustion. As part of this issue we should handle the scenario where the worker dies and decide what should happen to the source and destination containers in that case. @cgardens I would like to have your thoughts on this. One thing we can do is have the source and destination containers check whether the parent worker still exists; if the parent worker doesn't exist, the source and destination containers should kill themselves.
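As a rough illustration of that self-kill idea (not Airbyte's actual implementation), a connector process could poll the parent worker and exit once the worker disappears. The heartbeat URL, interval, and failure threshold below are assumptions for the sketch:

```python
import sys
import time
import urllib.request

# Hypothetical heartbeat URL served by the parent worker; the real mechanism
# (endpoint, port, payload) would have to be defined by the worker itself.
HEARTBEAT_URL = "http://worker:9000/heartbeat"
CHECK_INTERVAL_SECONDS = 10
MAX_CONSECUTIVE_FAILURES = 3


def worker_is_alive() -> bool:
    """Return True if the parent worker's heartbeat endpoint responds."""
    try:
        with urllib.request.urlopen(HEARTBEAT_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


def main() -> None:
    failures = 0
    while True:
        if worker_is_alive():
            failures = 0
        else:
            failures += 1
            if failures >= MAX_CONSECUTIVE_FAILURES:
                # The parent worker appears to be gone: exit so the container
                # stops instead of hanging forever.
                sys.exit(1)
        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    main()
```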
We actually handle this explicitly for the Kubernetes case by having another container in the connector pod that kills it if it can't reach back to the worker pod (which is serving a heartbeat signal). It's a bit harder in docker-compose, since we'd need more of a "supervisor" container to manage killing other containers (it'd need to keep track of the relevant containers and handle the heartbeat checking).
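For the docker-compose case, here is a hedged sketch of what such a "supervisor" container could do, using the Python Docker SDK. The connector label and worker heartbeat URL are made up for illustration and are not Airbyte's real conventions:

```python
import time

import docker    # Python Docker SDK (docker-py)
import requests

# Assumed values for illustration only; Airbyte's actual container labels and
# worker heartbeat mechanism may differ.
CONNECTOR_LABEL = "io.airbyte.connector=true"
WORKER_HEARTBEAT_URL = "http://worker:9000/heartbeat"
POLL_INTERVAL_SECONDS = 30


def worker_alive() -> bool:
    """Return True if the worker's heartbeat endpoint responds successfully."""
    try:
        return requests.get(WORKER_HEARTBEAT_URL, timeout=5).ok
    except requests.RequestException:
        return False


def main() -> None:
    client = docker.from_env()
    while True:
        if not worker_alive():
            # The worker heartbeat is gone: kill any connector containers it
            # left behind so they don't hang indefinitely.
            for container in client.containers.list(filters={"label": CONNECTOR_LABEL}):
                container.kill()
        time.sleep(POLL_INTERVAL_SECONDS)


if __name__ == "__main__":
    main()
```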
To confirm, the solution here is to kill the source and destination containers? Because that doesn't seem like a viable fix if this happens every time, i.e. the sync never moves on to the normalisation phase.
@danieldiamond agreed! We will be taking this up and releasing the correct fix.
Airbyte version: 0.29.17-alpha @subodh1810
tl;dr confirming the Worker container is still alive but getting stuck before providing, and that this issue is occurring for both MySQL
More updates. This could be a resource issue, or I lucked out on a random try, but I just doubled the instance to
So I think I've ruled out it being a resource issue, as I've tried another connector with the
What is interesting though is that in this hanging state (where the source container is at ~100% CPU and the worker container is at ~0.04% CPU), I run
Update: I attempted to retry this connection (MySQL CDC for one table that is 60m rows and ~5gb data in source - although successful sync shows ~17gb in job sync status when trying it as STANDARD).
Updates: MySQL 0.4.8
Seeing the same thing for Postgres -> Snowflake with
Running with 51866fd seems to work and actually finish the job 👍.
I've created a clean k8s deployment of Airbyte with one connector, one table, and this still occurs. I'm not sure how/if any user is using Airbyte to migrate large tables with CDC. Airbyte version: 0.35.12-alpha
Here are four separate users with various sources/destinations that seem to be experiencing this issue:
Airbyte version: 0.35.15-alpha @danieldiamond I did some lengthy experiments in cloud using the same database schema as in the description. The number of rows is 300 million. In none of the cases did the connection lead to a hang, but at the same time I confirm the failed connection with errors
@VitaliiMaltsev are you suggesting that it doesn't work in the cloud either? Separately, rereading earlier comments: @jrhizor @subodh1810
I've tried this now with k8s and an unreasonable amount of resources to ensure this isn't a resource/memory issue. The job still hangs after reading all the records. Is the issue then still the situation where the worker dies? (If this is explicitly handled in k8s, that handling might not be working as expected.)
FYI @VitaliiMaltsev this seems to be the exact same issue, with a lot more context if you're interested
All syncs permanently failed with the latest master branch
Environment
Current Behavior
Sync job just hangs after
completed source
and whilst the source and destination containers exist, there appears to be no worker container. Additionally, the destination does not finish its inserts into Snowflake.
Expected Behavior
The worker should send a close message to the destination and ensure that the source and destination containers finish. Something like:
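The original snippet isn't shown here. As a stand-in, here is a minimal sketch of that shutdown sequence, assuming the worker drives the connectors as subprocesses over stdin/stdout (this is not Airbyte's actual worker code):

```python
import subprocess


def finish_sync(source: subprocess.Popen, destination: subprocess.Popen,
                timeout_seconds: int = 60) -> None:
    # Closing the destination's stdin is the "close" signal: it tells the
    # destination that no more records will arrive, so it can flush and exit.
    if destination.stdin:
        destination.stdin.close()

    for proc in (source, destination):
        try:
            proc.wait(timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            # Don't leave hanging containers behind: escalate to a kill.
            proc.kill()
            proc.wait()
```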
Logs
If applicable, please upload the logs from the failing operation.
For sync jobs, you can download the full logs from the UI by going to the sync attempt page and
clicking the download logs button at the top right of the logs display window.
LOG
The logs above are associated with a sync that fails. What should be expected after that last line is this (logs from a successful sync):
snippet from a successful sync with the same source + destination connectors
Steps to Reproduce
These aren't scientific, but:
Description of the schema of the table. It has 60m rows, amounting to about 1 GB of data on the DB server.
So presumably the repro steps would be
Implementation hints
We should: