
100% CPU utilization on CDC replication from Postgres CDC #6003

Closed
sherifnada opened this issue Sep 13, 2021 · 27 comments
Assignees
Labels
area/connectors Connector related issues connectors/destination/snowflake connectors/destinations-warehouse lang/java team/db-dw-sources Backlog for Database and Data Warehouse Sources team type/bug Something isn't working

Comments

@sherifnada
Contributor

sherifnada commented Sep 13, 2021

Environment

  • Airbyte version: 0.29.11-alpha
  • OS Version / Instance: Ubuntu
  • Deployment: EC2 instance
  • Source Connector and version: airbyte/source-mysql:0.4.3
  • Destination Connector and version: airbyte/destination-snowflake:0.3.12
  • Severity: High
  • Step where error happened: Sync job

Current Behavior

100% CPU utilization on the instance and the sync hangs. See the logs:

60m_row_table_success_but_hangs.txt

In addition, the user reports 100% CPU utilization on the node:

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
30008d45fded        pedantic_morse      100.76%             851.6MiB / 30.96GiB   2.69%               0B / 0B             226MB / 1GB         38
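
The table above looks like `docker stats` output. As a point of reference, here is a minimal sketch of capturing the same reading programmatically; the container name `pedantic_morse` and the 100% threshold are taken from the output above, and the snippet assumes Docker is available on the node:

```python
import json
import subprocess

def container_cpu_percent(name: str) -> float:
    """Take a single `docker stats` sample for one container and return its CPU %."""
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{json .}}", name],
        capture_output=True, text=True, check=True,
    ).stdout
    sample = json.loads(out.splitlines()[0])     # e.g. {"CPUPerc": "100.76%", ...}
    return float(sample["CPUPerc"].rstrip("%"))

if __name__ == "__main__":
    # Container name taken from the `docker stats` output above.
    if container_cpu_percent("pedantic_morse") >= 100.0:
        print("sync container is saturating a full CPU core")
```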

Slack thread

Expected Behavior

I expect the sync to complete and CPU utilization not to be quite so drastic.

Logs

See above


@sherifnada sherifnada added type/bug Something isn't working lang/java labels Sep 13, 2021
@sherifnada
Contributor Author

This is potentially a duplicate of #5754

@sashaNeshcheret sashaNeshcheret self-assigned this Oct 1, 2021
@sashaNeshcheret
Contributor

sashaNeshcheret commented Oct 12, 2021

The bug was reproduced with source connector version airbyte/source-mysql:0.4.3, but with the latest version, source-mysql:0.4.6, it is no longer reproducible even with more than 60 million rows. The changes from #5600 seem to fix it.

[65m_rows_success_finish.txt](https://github.com/airbytehq/airbyte/files/7330477/65m_rows_success_finish.txt)

[Screenshot: Screenshot from 2021-10-12 15-12-16]

@sherifnada, please take a look at the log file.

@tuliren
Contributor

tuliren commented Oct 13, 2021

The bug was reproduced with source connector version airbyte/source-mysql:0.4.3, but with the latest version, source-mysql:0.4.6, it is no longer reproducible even with more than 60 million rows.

This sounds pretty good.

I am asking Daniel, who reported this issue, to try the newer version:

https://airbytehq.slack.com/archives/C01MFR03D5W/p1634144629454300?thread_ts=1629636715.470600&cid=C01MFR03D5W

@danieldiamond
Contributor

Attempted to retry this after recent developments in Airbyte's source and destination connectors; however, no luck.
I didn't successfully reach the end of the full load (i.e. 60M rows), failing instead after 11M rows:
https://airbytehq.slack.com/archives/C01MFR03D5W/p1634262470051300

@sherifnada
Contributor Author

@danieldiamond thanks for giving it a shot. We'll have another go at this. For the record: DB stability is our #1 goal for this quarter. There'll be more details coming soon, but we're going to focus on creating robust benchmarks that reproduce the scenarios you've described as causing these issues and will apply a big push to get them resolved asap.

@sashaNeshcheret
Contributor

In the scope of this issue I worked with the standard insert loading method, as mentioned in the related Slack thread, but it seems Daniel used the S3 staging loading method in his last attempt. @sherifnada, @tuliren, can we ask Daniel to reproduce the initial issue with the standard insert loading method, to check whether CDC with the insert method works correctly in his environment and to understand whether the bug as described actually exists? Based on that we can update the issue description, because right now it doesn't correspond properly to the real issue.
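
For readers unfamiliar with the two loading methods being discussed: with standard inserts the destination writes records via plain INSERT statements, while with S3 staging it first writes files to a bucket and then bulk-loads them into Snowflake. The sketch below is illustrative only; the field names are assumptions for illustration, not copied from the actual destination-snowflake spec:

```python
# Illustrative sketch only: these dicts mimic the shape of the Snowflake
# destination's "loading method" setting; the field names here are
# assumptions, not the connector's real configuration schema.

# Standard inserts: records are written with plain INSERT statements.
standard_inserts = {
    "loading_method": {"method": "Standard"},
}

# S3 staging: records are first written as files to an S3 bucket and then
# bulk-loaded into Snowflake (the path Daniel appears to have used).
s3_staging = {
    "loading_method": {
        "method": "S3 Staging",
        "s3_bucket_name": "my-staging-bucket",        # hypothetical bucket
        "s3_bucket_region": "us-east-1",              # hypothetical region
        "access_key_id": "<aws-access-key-id>",
        "secret_access_key": "<aws-secret-access-key>",
    },
}
```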

@danieldiamond
Contributor

I can confirm that using the STANDARD loading method works. Here's some additional context: https://airbytehq.slack.com/archives/C01MFR03D5W/p1634376853105300?thread_ts=1634262470.051300&cid=C01MFR03D5W

You can probably find more in the Slack comments on other threads (I've used this table in various attempts for various issues).

@sashaNeshcheret
Contributor

Since the initial bug is not reproduced with the STANDARD loading method, I will close the issue if there are no objections.

@sherifnada
Contributor Author

Clarifying problem scope: I thought this issue was about the source connector utilizing 100% CPU and getting stuck. If that is the case, why would the destination connector's loading method make a difference? Was the issue actually with the destination connector?

@tuliren
Contributor

tuliren commented Oct 18, 2021

Since the initial bug is not reproduced with the STANDARD loading method, I will close the issue if there are no objections.

@sashkalife, even though the standard mode works, the issue still exists for the staging mode. We should not close the issue before the root cause is figured out and fixed.

@sashaNeshcheret
Contributor

sashaNeshcheret commented Oct 19, 2021

@sherifnada, I totally agree with you that we have moved away from the root issue. I reproduced the bug about getting stuck with the airbyte/source-mysql:0.4.3 connector, but with version airbyte/source-mysql:0.4.6 it works correctly. I assumed the changes from #5600 fixed the issue, but while checking this assumption another bug was found with Snowflake staging CDC that prevented verifying the root issue.
@tuliren, the bug with Snowflake staging CDC was fixed and merged in #7074.

@tuliren
Contributor

tuliren commented Oct 19, 2021

@sashaNeshcheret, got it. Thanks for the explanation.

To summarize all the related issues:

Based on Daniel's latest update, the 100% CPU usage issue still exists for the latest MySQL source (0.4.8). We can close #5277, but should leave this one open.

@danieldiamond
Contributor

Following up here: after the job was hanging, I stopped the source container, which triggered the next phase of normalization (loading the temp tables in Snowflake, etc.):
121.66 GB | 222,127,098 records | 10h 55m 57s

However, even though the job synced successfully in the end, it decided to retry (from the start, which was painful; I didn't realize it would retry and jumped in after a few hours to cancel it). I have since updated the environment variable for retries.

I then cancelled this retry and performed another sync, which appears to pick up from the last read position, which seems great!
So I think the current workaround is to wait until the job hangs, stop the source container, wait until it successfully syncs, and cancel the retry.
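
A minimal sketch of the manual workaround described above, in case anyone wants to script it. The `source-mysql` name filter is an assumption about how the hung source container is named, and nothing here is an officially recommended procedure:

```python
import subprocess

def stop_hung_source_container(name_filter: str = "source-mysql") -> None:
    """Find containers whose name matches the filter and stop them."""
    container_ids = subprocess.run(
        ["docker", "ps", "--filter", f"name={name_filter}", "-q"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    for container_id in container_ids:
        # `docker stop` sends SIGTERM first, then SIGKILL after the grace
        # period; per the report above, stopping the source lets the job
        # proceed to the normalization step.
        subprocess.run(["docker", "stop", container_id], check=True)

if __name__ == "__main__":
    stop_hung_source_container()
    print("Source container(s) stopped; remember to cancel the automatic retry in the Airbyte UI.")
```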

@danieldiamond
Contributor

[Screenshot: Screen Shot 2021-10-20 at 7:36:53 am]

@sashaNeshcheret
Contributor

Thank you @danieldiamond, I will use your scenario with a huge DB volume and hope to catch the job hanging.
@tuliren, thanks for summarizing. I will move the ticket to In Progress and play with it.

@sashaNeshcheret
Contributor

@danieldiamond, can you please provide logs for the last attempts?

@danieldiamond
Contributor

danieldiamond commented Oct 20, 2021

The files are too large; I will send them through Slack. Are you in the community?

@sashaNeshcheret
Contributor

sashaNeshcheret commented Oct 20, 2021

The files are too large; I will send them through Slack. Are you in the community?

Sure, please share the logs and tag me (oleksandrNeshcheret) in Slack.

@danieldiamond
Contributor

Confirming this is still occurring with:

  • Airbyte version: 0.30.23-alpha
  • OS Version / Instance: AWS m5.2xlarge
  • Deployment: Docker
  • Source Connector and version: airbyte/source-mysql:0.4.9
  • Destination Connector and version: airbyte/destination-snowflake:0.3.16
  • Severity: Critical
  • Step where error happened: Sync job

@sherifnada
Contributor Author

sherifnada commented Nov 8, 2021

@tuliren cc since you're managing the DB rehaul efforts

@tuliren
Contributor

tuliren commented Nov 8, 2021

Got it.

Please note that we may not have bandwidth to fix all the CDC issues in Q4. They are likely our major focus in Q1 2022.

@tuliren
Contributor

tuliren commented Jan 19, 2022

Although we originally planned to work on database CDC this quarter, our priorities have changed, and unfortunately we won't have the bandwidth to work on CDC this quarter any more. We recommend that anyone watching this issue use the non-CDC mode for incremental sync for now.

@danieldiamond
Contributor

I appreciate the limited resources and existing priorities, but suggesting to fall back to non-CDC mode for large tables isn't a great alternative, particularly if users already have existing internal batch migration jobs. Part of Airbyte's draw is "commoditising DB replication by open-sourcing CDC", and I see this issue as relating to core functionality and expectations of using Airbyte, so here's hoping for something next quarter 🤞

@grishick grishick added the team/db-dw-sources Backlog for Database and Data Warehouse Sources team label Sep 27, 2022
@bleonard
Contributor

It's been a while since this issue was last looked at. @danieldiamond, have you tried this recently? We've made many improvements in CDC.

@danieldiamond
Contributor

@bleonard appreciate you reaching out on this.
I'd say let's close this out. I don't think MySQL CDC is GA ready, but I also don't think the current issues are linked to this one.

@bleonard
Contributor

Ok @danieldiamond sounds good. Feel free to ping me on any current issues. We want to make it the best it can be.
