
100% CPU utilization on CDC replication from Postgres CDC #6003

Closed
sherifnada opened this issue Sep 13, 2021 · 27 comments
Assignees
Labels
area/connectors Connector related issues connectors/destination/snowflake connectors/destinations-warehouse lang/java team/db-dw-sources Backlog for Database and Data Warehouse Sources team type/bug Something isn't working

Comments

@sherifnada
Contributor

sherifnada commented Sep 13, 2021

Environment

  • Airbyte version: 0.29.11-alpha
  • OS Version / Instance: Ubuntu
  • Deployment: EC2 instance
  • Source Connector and version: airbyte/source-mysql:0.4.3
  • Destination Connector and version: airbyte/destination-snowflake:0.3.12
  • Severity: High
  • Step where error happened: Sync job

Current Behavior

100% CPU utilization on the instance and the sync hangs. See the logs:

60m_row_table_success_but_hangs.txt

In addition, the user reports 100% CPU utilization on the node:

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
30008d45fded        pedantic_morse      100.76%             851.6MiB / 30.96GiB   2.69%               0B / 0B             226MB / 1GB         38
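
The table above looks like `docker stats` output. As a point of reference, here is a minimal sketch of capturing the same reading programmatically; the container name `pedantic_morse` and the 100% threshold are taken from the output above, and the snippet assumes Docker is available on the node:

```python
import json
import subprocess

def container_cpu_percent(name: str) -> float:
    """Take a single `docker stats` sample for one container and return its CPU %."""
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{json .}}", name],
        capture_output=True, text=True, check=True,
    ).stdout
    sample = json.loads(out.splitlines()[0])     # e.g. {"CPUPerc": "100.76%", ...}
    return float(sample["CPUPerc"].rstrip("%"))

if __name__ == "__main__":
    # Container name taken from the `docker stats` output above.
    if container_cpu_percent("pedantic_morse") >= 100.0:
        print("sync container is saturating a full CPU core")
```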

Slack thread

Expected Behavior

I expect the sync to complete and CPU utilization not to be quite so drastic.

Logs

See above


@sherifnada sherifnada added type/bug Something isn't working lang/java labels Sep 13, 2021
@sherifnada
Contributor Author

This is potentially a duplicate of #5754

@sashaNeshcheret sashaNeshcheret self-assigned this Oct 1, 2021
@sashaNeshcheret
Contributor

sashaNeshcheret commented Oct 12, 2021

The bug was reproduced with source connector version airbyte/source-mysql:0.4.3, but with the latest version, source-mysql:0.4.6, it is no longer reproducible even with more than 60 million rows. The changes from #5600 seem to fix it.

[65m_rows_success_finish.txt](https://github.com/airbytehq/airbyte/files/7330477/65m_rows_success_finish.txt)

[Screenshot: Screenshot from 2021-10-12 15-12-16]

@sherifnada, please take a look at the log file.

@tuliren
Contributor

tuliren commented Oct 13, 2021

The bug was reproduced with source connector version airbyte/source-mysql:0.4.3, but with the latest version, source-mysql:0.4.6, it is no longer reproducible even with more than 60 million rows.

This sounds pretty good.

I am asking Daniel, who reported this issue, to try the newer version:

https://airbytehq.slack.com/archives/C01MFR03D5W/p1634144629454300?thread_ts=1629636715.470600&cid=C01MFR03D5W

@danieldiamond
Contributor

Attempted to retry this after recent developments in Airbyte's source and destination connectors; however, no luck.
I didn't successfully reach the end of the full load (i.e. 60M rows), failing instead after 11M rows:
https://airbytehq.slack.com/archives/C01MFR03D5W/p1634262470051300

@sherifnada
Contributor Author

@danieldiamond thanks for giving it a shot. We'll have another go at this. For the record: DB stability is our #1 goal for this quarter. There'll be more details coming soon, but we're going to focus on creating robust benchmarks that reproduce the scenarios you've described as causing these issues and will apply a big push to get them resolved asap.

@sashaNeshcheret
Contributor

In the scope of this issue I worked with the standard insert loading method, as mentioned in the related Slack thread, but it seems Daniel used the S3 staging loading method in his last attempt. @sherifnada, @tuliren, can we ask Daniel to reproduce the initial issue with the standard insert loading method, to check whether CDC with the insert method works correctly in his environment and to understand whether the bug as described actually exists? Based on that we can update the issue description, because right now it doesn't correspond properly to the real issue.
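
For readers unfamiliar with the two loading methods being discussed: with standard inserts the destination writes records via plain INSERT statements, while with S3 staging it first writes files to a bucket and then bulk-loads them into Snowflake. The sketch below is illustrative only; the field names are assumptions for illustration, not copied from the actual destination-snowflake spec:

```python
# Illustrative sketch only: these dicts mimic the shape of the Snowflake
# destination's "loading method" setting; the field names here are
# assumptions, not the connector's real configuration schema.

# Standard inserts: records are written with plain INSERT statements.
standard_inserts = {
    "loading_method": {"method": "Standard"},
}

# S3 staging: records are first written as files to an S3 bucket and then
# bulk-loaded into Snowflake (the path Daniel appears to have used).
s3_staging = {
    "loading_method": {
        "method": "S3 Staging",
        "s3_bucket_name": "my-staging-bucket",        # hypothetical bucket
        "s3_bucket_region": "us-east-1",              # hypothetical region
        "access_key_id": "<aws-access-key-id>",
        "secret_access_key": "<aws-secret-access-key>",
    },
}
```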

@danieldiamond
Contributor

I can confirm that using the STANDARD loading method works. Here's some additional context: https://airbytehq.slack.com/archives/C01MFR03D5W/p1634376853105300?thread_ts=1634262470.051300&cid=C01MFR03D5W

You can probably find more in the Slack comments on other threads (I've used this table in various attempts for various issues).

@sashaNeshcheret
Contributor

Since the initial bug is not reproduced with the STANDARD loading method, I will close the issue if there are no objections.

@sherifnada
Contributor Author

Clarifying problem scope: I thought this issue was about the source connector utilizing 100% CPU and getting stuck. If that is the case, why would the destination connector's loading method make a difference? Was the issue actually with the destination connector?

@tuliren
Contributor

tuliren commented Oct 18, 2021

Since the initial bug is not reproduced with the STANDARD loading method, I will close the issue if there are no objections.

@sashkalife, even though the standard mode works, the issue still exists for the staging mode. We should not close the issue before the root cause is figured out and fixed.

@sashaNeshcheret
Contributor

sashaNeshcheret commented Oct 19, 2021

@sherifnada, I totally agree with you that we have moved away from the root issue. I reproduced the bug about getting stuck with the airbyte/source-mysql:0.4.3 connector, but with version airbyte/source-mysql:0.4.6 it works correctly. I assumed the changes from #5600 fixed the issue, but while checking this assumption another bug was found with Snowflake staging CDC that prevented verifying the root issue.
@tuliren, the bug with Snowflake staging CDC was fixed and merged in #7074.

@tuliren
Contributor

tuliren commented Oct 19, 2021

@sashaNeshcheret, got it. Thanks for the explanation.

To summarize all the related issues:

Based on Daniel's latest update, the 100% CPU usage issue still exists for the latest MySQL source (0.4.8). We can close #5277, but should leave this one open.

@danieldiamond
Contributor

Following up here: after the job was hanging, I stopped the source container, which triggered the next phase of normalization (loading the temp tables in Snowflake, etc.):
121.66 GB | 222,127,098 records | 10h 55m 57s

However, even though the job synced successfully in the end, it decided to retry (from the start, which was painful; I didn't realize it would retry and jumped in after a few hours to cancel it). I have since updated the environment variable for retries.

I then cancelled this retry and performed another sync, which appears to pick up from the last read position, which seems great!
So I think the current workaround is to wait until the job hangs, stop the source container, wait until it successfully syncs, and cancel the retry.
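
A minimal sketch of the manual workaround described above, in case anyone wants to script it. The `source-mysql` name filter is an assumption about how the hung source container is named, and nothing here is an officially recommended procedure:

```python
import subprocess

def stop_hung_source_container(name_filter: str = "source-mysql") -> None:
    """Find containers whose name matches the filter and stop them."""
    container_ids = subprocess.run(
        ["docker", "ps", "--filter", f"name={name_filter}", "-q"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    for container_id in container_ids:
        # `docker stop` sends SIGTERM first, then SIGKILL after the grace
        # period; per the report above, stopping the source lets the job
        # proceed to the normalization step.
        subprocess.run(["docker", "stop", container_id], check=True)

if __name__ == "__main__":
    stop_hung_source_container()
    print("Source container(s) stopped; remember to cancel the automatic retry in the Airbyte UI.")
```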

@danieldiamond
Contributor

[Screenshot: Screen Shot 2021-10-20 at 7:36:53 am]

@sashaNeshcheret
Contributor

Thank you @danieldiamond, I will use your scenario with a huge DB volume and hope to catch the job hanging.
@tuliren, thanks for summarizing. I will move the ticket to In Progress and play with it.

@sashaNeshcheret
Contributor

@danieldiamond, can you please provide logs for the last attempts?

@danieldiamond
Contributor

danieldiamond commented Oct 20, 2021

The files are too large; I will send them through Slack. Are you in the community?

@sashaNeshcheret
Contributor

sashaNeshcheret commented Oct 20, 2021

The files are too large; I will send them through Slack. Are you in the community?

Sure, please share the logs and tag me (oleksandrNeshcheret) in Slack.

@danieldiamond
Contributor

Confirming this is still occurring with:

  • Airbyte version: 0.30.23-alpha
  • OS Version / Instance: AWS m5.2xlarge
  • Deployment: Docker
  • Source Connector and version: airbyte/source-mysql:0.4.9
  • Destination Connector and version: airbyte/destination-snowflake:0.3.16
  • Severity: Critical
  • Step where error happened: Sync job

@sherifnada
Contributor Author

sherifnada commented Nov 8, 2021

@tuliren cc since you're managing the DB rehaul efforts

@tuliren
Contributor

tuliren commented Nov 8, 2021

Got it.

Please note that we may not have bandwidth to fix all the CDC issues in Q4. They are likely our major focus in Q1 2022.

@tuliren
Contributor

tuliren commented Jan 19, 2022

Although we originally planned to work on database CDC this quarter, our priorities have changed, and unfortunately we won't have the bandwidth to work on CDC this quarter any more. We recommend that anyone watching this issue use the non-CDC mode for incremental sync for now.

@danieldiamond
Contributor

I appreciate the limited resources and existing priorities, but suggesting to fall back to non-CDC mode for large tables isn't a great alternative, particularly if users already have existing internal batch migration jobs. Part of Airbyte's draw is "commoditising DB replication by open-sourcing CDC", and I see this issue as relating to core functionality and expectations of using Airbyte, so here's hoping for something next quarter 🤞

@grishick grishick added the team/db-dw-sources Backlog for Database and Data Warehouse Sources team label Sep 27, 2022
@bleonard
Contributor

It's been a while since this issue was last looked at. @danieldiamond, have you tried this recently? We've made many improvements in CDC.

@danieldiamond
Contributor

@bleonard appreciate you reaching out on this.
I'd say let's close this out. I don't think MySQL CDC is GA ready, but I also don't think the current issues are linked to this one.

@bleonard
Contributor

Ok @danieldiamond sounds good. Feel free to ping me on any current issues. We want to make it the best it can be.
