
MySQL CDC stress testing #3969

Closed · subodh1810 opened this issue Jun 8, 2021 · 4 comments
Labels: area/connectors (Connector related issues), priority/high (High priority)


subodh1810 commented Jun 8, 2021

There has been a report of OOM for MySQL CDC. We should stress test the CDC connector to make sure it can handle millions of records. A user reported OOM while syncing a table with the following schema:

CREATE TABLE `structured_brief` (
  `pk` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `id` char(16) COLLATE utf8mb4_unicode_520_ci NOT NULL,
  `user_id` int(10) unsigned NOT NULL,
  `state` varchar(30) COLLATE utf8mb4_unicode_520_ci NOT NULL,
  `schema_id` varchar(32) COLLATE utf8mb4_unicode_520_ci DEFAULT NULL,
  `data` text COLLATE utf8mb4_unicode_520_ci NOT NULL,
  `created_at` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `launched_at` datetime DEFAULT NULL,
  `api_app` varchar(64) COLLATE utf8mb4_unicode_520_ci DEFAULT NULL,
  `external_order_item_id` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`pk`),
  UNIQUE KEY `id` (`id`),
  KEY `user_id` (`user_id`),
  KEY `schema_id` (`schema_id`)
) ENGINE=InnoDB AUTO_INCREMENT=2007265 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci

The table had 1M records. We should try to sync a similar table with a similar number of records, observe the memory consumption of our Docker instance, and find bottlenecks.

Airbyte was running on a t3.xlarge instance (4 vCPUs and 16 GB of memory).

[Screenshot attached: Screen Shot 2021-06-01 at 3 22 08 pm]

There are a few things that might be causing the OOM (this is a guess; we need to validate it):

  1. We maintain an in-memory queue here into which Debezium pushes records from one thread while Airbyte reads from the queue on another thread. If Debezium pushes records at a much faster pace than Airbyte consumes them, this in-memory data structure will keep growing (see the sketch after this list).
  2. The records coming from the database might be large (rows with lots of VARCHAR/TEXT/BLOB columns), which would again make the queue bigger.
  3. Debezium is doing something internally that is hogging memory.
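
To make hypothesis 1 concrete, here is a minimal, self-contained Java sketch of the producer/consumer pattern described above. The class and variable names are illustrative, not the actual Airbyte code; it only shows how an unbounded LinkedBlockingQueue can grow when the producer outpaces the consumer, and how a bounded capacity would apply back-pressure instead.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch only: names are hypothetical, not the actual Airbyte classes.
public class QueueBackPressureSketch {
  public static void main(String[] args) throws InterruptedException {
    // An unbounded queue (new LinkedBlockingQueue<>()) keeps growing if the producer
    // (Debezium) is faster than the consumer (Airbyte). Giving it a capacity makes
    // put() block when the queue is full, throttling the producer to the consumer's
    // pace and keeping memory flat.
    BlockingQueue<String> queue = new LinkedBlockingQueue<>(1_000);

    Thread producer = new Thread(() -> {
      try {
        for (int i = 0; i < 10_000; i++) {
          queue.put("record-" + i); // blocks once 1,000 records are buffered
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    Thread consumer = new Thread(() -> {
      try {
        while (true) {
          queue.take();
          Thread.sleep(1); // simulate a consumer that is slower than the producer
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    producer.start();
    consumer.start();
    producer.join();
    consumer.interrupt();
    System.out.println("Producer finished; the queue never held more than 1,000 records.");
  }
}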
subodh1810 added the area/connectors label on Jun 8, 2021
cgardens added the priority/high label on Jun 9, 2021
cgardens added this to the Core - 2021-06-16 milestone on Jun 9, 2021

marcosmarxm commented

@subodh1810 I haven't run a stress test yet, but I did some research into the errors reported in #4010.
From the author of issue #4010:

I scaled up a machine to 256GB to TRY to pull 89M rows but would only get to 11M rows before all of the errors began showing up. It gets to 11M rows very quickly but then slows to almost a halt once the first "ERROR i.d.e.EmbeddedEngine(commitOffsets):973 - {} - Timed out waiting to flush EmbeddedEngine" begins.

He ran two syncs. From the first one I collected two ERRORs from Debezium/Airbyte; the first is:

  • Can''t call rollback when autocommit=true

Searching around, I found these messages (1 / 2) in the Debezium Gitter chat:

[2019-11-12 02:41:06,570] ERROR Failed due to error: Aborting snapshot due to error when last running 'UNLOCK TABLES': Can''t call rollback when autocommit=true (io.debezium.connector.mysql.SnapshotReader) org.apache.kafka.connect.errors.ConnectException: Can''t call rollback when autocommit=true Error code: 0; SQLSTATE: 08003. at io.debezium.connector.mysql.AbstractReader.wrap(AbstractReader.java:230) at io.debezium.connector.mysql.AbstractReader.failed(AbstractReader.java:208)

we try to ingest table with 13mill records. We have set "snapshot.fetch.size" to 10000. Kafka Connector created fine. But we can see in the LadoopUI that the cluster is rebalancing indefinitely. We couldn't see any errors in the Kafka Connect logs,but after awhile it started throwing OOM.


The second one is:

  • Timed out waiting to flush EmbeddedEngine{id=<redacted>} offsets to storage

For this, the recommendation is to increase these Debezium engine properties (a configuration sketch follows the list):

  • offset.flush.interval.ms
  • offset.flush.timeout.ms

Currently the offset.flush.interval.ms value is set to 1000. Is there any reason to use such a value? Debezium's default is 60000.
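
For reference, a minimal sketch of how these offset-flush properties could be set is shown below. The property keys are standard Debezium embedded-engine settings; the values shown are examples for illustration, not Airbyte's actual configuration.

import java.util.Properties;

// Illustrative sketch: standard Debezium embedded-engine offset-flush properties.
// The values are examples, not Airbyte's actual settings.
public class OffsetFlushConfigSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    // How often the engine tries to flush processed offsets to the offset store.
    // The issue notes Airbyte sets this to 1000 ms, while Debezium's default is 60000 ms.
    props.setProperty("offset.flush.interval.ms", "60000");
    // How long the engine waits for a flush before logging
    // "Timed out waiting to flush EmbeddedEngine{...} offsets to storage".
    props.setProperty("offset.flush.timeout.ms", "5000");
    System.out.println(props);
  }
}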


subodh1810 commented

I was able to reproduce this problem.
Testing setup:
I set up a MySQL DB in Docker on my local machine and created a table with 2 columns. I populated the table with 10M records and then triggered a sync in CDC mode for this table. The sync was also running on my local machine.
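
For anyone reproducing this, a hypothetical JDBC helper like the one below can populate such a test table. The connection URL, credentials, and column names are assumptions for illustration; only the database and table names (test_database.id_and_name) come from the sync logs further down, and the MySQL Connector/J driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Hypothetical reproduction helper: URL, credentials, and columns are assumptions.
public class PopulateTestTable {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/test_database", "root", "password")) {
      conn.setAutoCommit(false);
      try (PreparedStatement ps = conn.prepareStatement(
          "INSERT INTO id_and_name (id, name) VALUES (?, ?)")) {
        for (int i = 1; i <= 10_000_000; i++) {
          ps.setInt(1, i);
          ps.setString(2, "name-" + i);
          ps.addBatch();
          if (i % 10_000 == 0) {  // flush in batches so the client itself stays small
            ps.executeBatch();
            conn.commit();
          }
        }
        ps.executeBatch();
        conn.commit();
      }
    }
  }
}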

Problems: there are mainly two problems.

  1. The throughput of the connector becomes very slow. If you refer to the logs I have attached, we were able to process 8,550,000 records in around 2 minutes, but towards the end of the sync it took around 18 minutes to process 250K records before the sync failed with OOM.
2021-06-18 08:38:28 INFO i.d.c.m.SnapshotReader(lambda$execute$16):677 - {dbz.connectorContext=snapshot, dbz.connectorName=test_database, dbz.connectorType=MySQL} - Step 7: - 8550000 of 9753196 rows scanned from table 'test_database.id_and_name' after 00:02:00.515
2021-06-18 08:58:25 INFO i.d.c.m.SnapshotReader(lambda$execute$16):677 - {dbz.connectorContext=snapshot, dbz.connectorName=test_database, dbz.connectorType=MySQL} - Step 7: - 8800000 of 9753196 rows scanned from table 'test_database.id_and_name' after 00:20:32.992

Also, the following error starts to pop up towards the end, indicating that Debezium is not able to write the cursor information to the offset file:
2021-06-18 08:49:20 ERROR i.d.e.EmbeddedEngine(commitOffsets):973 - {} - Timed out waiting to flush EmbeddedEngine{id=test_database} offsets to storage

  2. The second problem is the sync failing with OOM.

Analysis:
I did CPU profiling of the process to understand why the throughput decreases. I also enabled the -XX:+HeapDumpOnOutOfMemoryError JVM flag, which generated a heap dump when the OOM happened.

Both problems are related and are caused by the LinkedBlockingQueue that we use to store records in memory here. As you can see from the heap dump analysis screenshots below, the heap grew to 4GB in size and 99% of that space was occupied by the queue. This is the main cause of the OOM.

[Screenshots: heap dump analysis showing the heap occupied almost entirely by the LinkedBlockingQueue]

The decrease in speed is because of garbage collection. Take a look at the screenshot from the CPU profiling analysis below; it shows the CPU time spent in different parts of the code.
[Screenshot: CPU profiling analysis showing the CPU time spent in different parts of the code]

As you can see, we spent 99% of the time in garbage collection, which explains why the speed decreased substantially towards the end of the sync. Once we had processed tons of records and filled the queue very quickly, the heap became huge, so the GC had to pause the main execution thread more and more often to do cleanup. Since the main thread was being paused for longer and longer durations, Debezium was also not able to flush the offsets, which is why these errors started popping up:
2021-06-18 08:49:20 ERROR i.d.e.EmbeddedEngine(commitOffsets):973 - {} - Timed out waiting to flush EmbeddedEngine{id=test_database} offsets to storage

I am attaching the CPU profiling report and the logs from the sync in this zip: mysql_oom_analysis.zip. The heap dump is 6GB in size, so it's not possible to attach it.


cgardens commented

Very cool analysis. Nice work.

hrm13 commented Jun 21, 2021

Fantastic to see this work being done. I've been trying to replicate data between MySQL (Aurora) and Snowflake for the last couple of weeks to no avail, always running into the OOM problem. Most of our tables are small, but we have a handful of tables in the 10-30M row range, and trying to sync them one by one did not help. It's one of the few things holding me back from using Airbyte more widely and as a replacement for our cloud-based platform. I look forward to updates on this issue and am happy to provide what I can about our use case to help with development.
