
MySQL CDC stress testing #3969

Closed · subodh1810 opened this issue Jun 8, 2021 · 4 comments
Labels: area/connectors (Connector related issues), priority/high (High priority)


subodh1810 commented Jun 8, 2021

There has been a report of OOM for MySQL CDC. We should stress test the CDC connector to make sure it can handle millions of records. A user reported OOM while syncing a table with the following schema:

CREATE TABLE `structured_brief` (
  `pk` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `id` char(16) COLLATE utf8mb4_unicode_520_ci NOT NULL,
  `user_id` int(10) unsigned NOT NULL,
  `state` varchar(30) COLLATE utf8mb4_unicode_520_ci NOT NULL,
  `schema_id` varchar(32) COLLATE utf8mb4_unicode_520_ci DEFAULT NULL,
  `data` text COLLATE utf8mb4_unicode_520_ci NOT NULL,
  `created_at` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `launched_at` datetime DEFAULT NULL,
  `api_app` varchar(64) COLLATE utf8mb4_unicode_520_ci DEFAULT NULL,
  `external_order_item_id` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`pk`),
  UNIQUE KEY `id` (`id`),
  KEY `user_id` (`user_id`),
  KEY `schema_id` (`schema_id`)
) ENGINE=InnoDB AUTO_INCREMENT=2007265 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci

The table had 1M records. We should try to sync a similar table with a similar number of records, observe the memory consumption of our Docker instance, and find bottlenecks.

Airbyte was running on a t3.xlarge instance (4 vCPUs and 16 GB of memory).

[Screenshot attached: Screen Shot 2021-06-01 at 3 22 08 pm]

There are a few things that might be causing the OOM (this is a guess; we need to validate it):

  1. We maintain an in-memory queue here into which Debezium pushes records from one thread while Airbyte reads from the queue on another thread. If Debezium pushes records at a much faster pace than Airbyte consumes them, this in-memory data structure will keep growing (see the sketch after this list).
  2. The records coming from the database might be large (rows with lots of VARCHAR/TEXT/BLOB columns), which would again make the queue bigger.
  3. Debezium is doing something internally that is hogging memory.
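
To make hypothesis 1 concrete, here is a minimal, self-contained Java sketch of the producer/consumer pattern described above. The class and variable names are illustrative, not the actual Airbyte code; it only shows how an unbounded LinkedBlockingQueue can grow when the producer outpaces the consumer, and how a bounded capacity would apply back-pressure instead.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch only: names are hypothetical, not the actual Airbyte classes.
public class QueueBackPressureSketch {
  public static void main(String[] args) throws InterruptedException {
    // An unbounded queue (new LinkedBlockingQueue<>()) keeps growing if the producer
    // (Debezium) is faster than the consumer (Airbyte). Giving it a capacity makes
    // put() block when the queue is full, throttling the producer to the consumer's
    // pace and keeping memory flat.
    BlockingQueue<String> queue = new LinkedBlockingQueue<>(1_000);

    Thread producer = new Thread(() -> {
      try {
        for (int i = 0; i < 10_000; i++) {
          queue.put("record-" + i); // blocks once 1,000 records are buffered
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    Thread consumer = new Thread(() -> {
      try {
        while (true) {
          queue.take();
          Thread.sleep(1); // simulate a consumer that is slower than the producer
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    producer.start();
    consumer.start();
    producer.join();
    consumer.interrupt();
    System.out.println("Producer finished; the queue never held more than 1,000 records.");
  }
}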
subodh1810 added the area/connectors label on Jun 8, 2021
cgardens added the priority/high label on Jun 9, 2021
cgardens added this to the Core - 2021-06-16 milestone on Jun 9, 2021

marcosmarxm commented

@subodh1810 I haven't run a stress test yet, but I did some research into the errors reported in #4010.
From the author of issue #4010:

I scaled up a machine to 256GB to TRY to pull 89M rows but would only get to 11M rows before all of the errors began showing up. It gets to 11M rows very quickly but then slows to almost a halt once the first "ERROR i.d.e.EmbeddedEngine(commitOffsets):973 - {} - Timed out waiting to flush EmbeddedEngine" begins.

He ran two syncs. From the first one I collected two ERRORs from Debezium/Airbyte; the first is:

  • Can''t call rollback when autocommit=true

Searching around, I found these messages (1 / 2) in the Debezium Gitter chat:

[2019-11-12 02:41:06,570] ERROR Failed due to error: Aborting snapshot due to error when last running 'UNLOCK TABLES': Can''t call rollback when autocommit=true (io.debezium.connector.mysql.SnapshotReader) org.apache.kafka.connect.errors.ConnectException: Can''t call rollback when autocommit=true Error code: 0; SQLSTATE: 08003. at io.debezium.connector.mysql.AbstractReader.wrap(AbstractReader.java:230) at io.debezium.connector.mysql.AbstractReader.failed(AbstractReader.java:208)

we try to ingest table with 13mill records. We have set "snapshot.fetch.size" to 10000. Kafka Connector created fine. But we can see in the LadoopUI that the cluster is rebalancing indefinitely. We couldn't see any errors in the Kafka Connect logs,but after awhile it started throwing OOM.


The second one is:

  • Timed out waiting to flush EmbeddedEngine{id=<redacted>} offsets to storage

For this, the recommendation is to increase these Debezium engine properties (a configuration sketch follows the list):

  • offset.flush.interval.ms
  • offset.flush.timeout.ms

Currently the offset.flush.interval.ms value is set to 1000. Is there any reason to use such a value? Debezium's default is 60000.
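
For reference, a minimal sketch of how these offset-flush properties could be set is shown below. The property keys are standard Debezium embedded-engine settings; the values shown are examples for illustration, not Airbyte's actual configuration.

import java.util.Properties;

// Illustrative sketch: standard Debezium embedded-engine offset-flush properties.
// The values are examples, not Airbyte's actual settings.
public class OffsetFlushConfigSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    // How often the engine tries to flush processed offsets to the offset store.
    // The issue notes Airbyte sets this to 1000 ms, while Debezium's default is 60000 ms.
    props.setProperty("offset.flush.interval.ms", "60000");
    // How long the engine waits for a flush before logging
    // "Timed out waiting to flush EmbeddedEngine{...} offsets to storage".
    props.setProperty("offset.flush.timeout.ms", "5000");
    System.out.println(props);
  }
}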


subodh1810 commented

I was able to reproduce this problem.
Testing setup:
I set up a MySQL DB in Docker on my local machine and created a table with 2 columns. I populated the table with 10M records and then triggered a sync in CDC mode for this table. The sync was also running on my local machine.
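
For anyone reproducing this, a hypothetical JDBC helper like the one below can populate such a test table. The connection URL, credentials, and column names are assumptions for illustration; only the database and table names (test_database.id_and_name) come from the sync logs further down, and the MySQL Connector/J driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Hypothetical reproduction helper: URL, credentials, and columns are assumptions.
public class PopulateTestTable {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/test_database", "root", "password")) {
      conn.setAutoCommit(false);
      try (PreparedStatement ps = conn.prepareStatement(
          "INSERT INTO id_and_name (id, name) VALUES (?, ?)")) {
        for (int i = 1; i <= 10_000_000; i++) {
          ps.setInt(1, i);
          ps.setString(2, "name-" + i);
          ps.addBatch();
          if (i % 10_000 == 0) {  // flush in batches so the client itself stays small
            ps.executeBatch();
            conn.commit();
          }
        }
        ps.executeBatch();
        conn.commit();
      }
    }
  }
}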

Problems: there are mainly two problems.

  1. The throughput of the connector becomes very slow. If you refer to the logs I have attached, we were able to process 8,550,000 records in around 2 minutes, but towards the end of the sync it took around 18 minutes to process 250K records before the sync failed with OOM.
2021-06-18 08:38:28 INFO i.d.c.m.SnapshotReader(lambda$execute$16):677 - {dbz.connectorContext=snapshot, dbz.connectorName=test_database, dbz.connectorType=MySQL} - Step 7: - 8550000 of 9753196 rows scanned from table 'test_database.id_and_name' after 00:02:00.515
2021-06-18 08:58:25 INFO i.d.c.m.SnapshotReader(lambda$execute$16):677 - {dbz.connectorContext=snapshot, dbz.connectorName=test_database, dbz.connectorType=MySQL} - Step 7: - 8800000 of 9753196 rows scanned from table 'test_database.id_and_name' after 00:20:32.992

Also, the following error starts to pop up towards the end, indicating that Debezium is not able to write the cursor information to the offset file:
2021-06-18 08:49:20 ERROR i.d.e.EmbeddedEngine(commitOffsets):973 - {} - Timed out waiting to flush EmbeddedEngine{id=test_database} offsets to storage

  2. The second problem is the sync failing with OOM.

Analysis:
I did CPU profiling of the process to understand why the throughput decreases. I also enabled the -XX:+HeapDumpOnOutOfMemoryError JVM flag, which generated a heap dump when the OOM happened.

Both problems are related and are caused by the LinkedBlockingQueue that we use to store records in memory here. As you can see from the heap dump analysis screenshots below, the heap grew to 4GB in size and 99% of that space was occupied by the queue. This is the main cause of the OOM.

[Screenshots: heap dump analysis showing the heap occupied almost entirely by the LinkedBlockingQueue]

The decrease in speed is because of garbage collection. Take a look at the screenshot from the CPU profiling analysis below; it shows the CPU time spent in different parts of the code.
[Screenshot: CPU profiling analysis showing the CPU time spent in different parts of the code]

As you can see, we spent 99% of the time in garbage collection, which explains why the speed decreased substantially towards the end of the sync. Once we had processed tons of records and filled the queue very quickly, the heap became huge, so the GC had to pause the main execution thread more and more often to do cleanup. Since the main thread was being paused for longer and longer durations, Debezium was also not able to flush the offsets, which is why these errors started popping up:
2021-06-18 08:49:20 ERROR i.d.e.EmbeddedEngine(commitOffsets):973 - {} - Timed out waiting to flush EmbeddedEngine{id=test_database} offsets to storage

I am attaching the CPU profiling report and the logs from the sync in this zip: mysql_oom_analysis.zip. The heap dump is 6GB in size, so it's not possible to attach it.


cgardens commented

Very cool analysis. Nice work.

hrm13 commented Jun 21, 2021

Fantastic to see this work being done. I've been trying to replicate data between MySQL (Aurora) and Snowflake for the last couple of weeks to no avail, always running into the OOM problem. Most of our tables are small, but we have a handful of tables in the 10-30M row range, and trying to sync them one by one did not help. It's one of the few things holding me back from using Airbyte more widely and as a replacement for our cloud-based platform. I look forward to updates on this issue and am happy to provide what I can about our use case to help with development.
