Replies: 37 comments
-
Is it because of an OOM error? How much memory do you have in the container? Did you have a gigantic SQL transaction at the stopped binlog position?
-
That seems close to the default
-
@liulikun Thanks! The error log is a little bit different.
-
@liulikun, big transactions should just spool to disk; advising to skip them is not generally good practice. @olraxy, I think it's quite possible you're still encountering out-of-memory issues that bubble up in weird ways. I'd be interested to see a snapshot of
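For anyone wondering what "spool to disk" looks like in practice, here is a minimal sketch of the idea: keep the first N rows of a transaction on the heap and write the overflow to a temporary file. This is an illustration only, not Maxwell's actual buffer implementation; the class name and threshold are made up.

```java
import java.io.Closeable;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: keep the first maxInMemory rows on the heap, spill the
// rest to a temp file so a huge transaction doesn't blow up the heap.
public class SpillingBuffer implements Closeable {
    private final int maxInMemory;
    private final Deque<String> memory = new ArrayDeque<>();
    private File spillFile;
    private ObjectOutputStream spillOut;
    private long spilledCount = 0;

    public SpillingBuffer(int maxInMemory) {
        this.maxInMemory = maxInMemory;
    }

    public void add(String row) throws IOException {
        if (memory.size() < maxInMemory) {
            memory.addLast(row);
            return;
        }
        if (spillOut == null) {
            spillFile = File.createTempFile("txbuffer", ".spill");
            spillFile.deleteOnExit();
            spillOut = new ObjectOutputStream(new FileOutputStream(spillFile));
        }
        spillOut.writeObject(row);   // overflow goes to disk, not the heap
        spilledCount++;
    }

    public long size() {
        return memory.size() + spilledCount;
    }

    @Override
    public void close() throws IOException {
        if (spillOut != null) spillOut.close();
        if (spillFile != null) spillFile.delete();
    }
}
```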
-
@osheroff
-
wow, that's slammed. what kind of transaction volume are we talking about here? thinking https://github.com/patric-r/jvmtop/blob/master/doc/ConsoleProfiler.md might be a good line of investigation here...
-
mysql uses 4 bytes to store the event length. If the transaction is too big, it can overflow the 4-byte limit. If you can use
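To make the "4 bytes" point concrete: in the v4 binlog format every event starts with a fixed 19-byte header, and the event length is a single unsigned 32-bit little-endian field. Here's a rough sketch of decoding that header; the byte array is a stand-in for bytes actually read from a binlog file, and this is not code from Maxwell or mysql-binlog-connector.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of the fixed v4 binlog event header layout:
// timestamp(4) type(1) server_id(4) event_size(4) next_position(4) flags(2) = 19 bytes.
public class BinlogEventHeader {
    public static void main(String[] args) {
        byte[] header = new byte[19]; // stand-in for bytes read from the binlog
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);

        long timestamp = Integer.toUnsignedLong(buf.getInt()); // seconds since epoch
        int eventType  = Byte.toUnsignedInt(buf.get());
        long serverId  = Integer.toUnsignedLong(buf.getInt());
        long eventSize = Integer.toUnsignedLong(buf.getInt()); // the 4-byte length field
        long nextPos   = Integer.toUnsignedLong(buf.getInt());
        int flags      = Short.toUnsignedInt(buf.getShort());

        System.out.printf("type=%d size=%d nextPos=%d flags=%d ts=%d serverId=%d%n",
                eventType, eventSize, nextPos, flags, timestamp, serverId);
    }
}
```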
-
You can set environment variable
-
JvmTop 0.8.0 alpha - 10:02:38, amd64, 4 cpus, Linux 4.4.0-104, load avg 6.43
-
Hi @ypereirareis what version of maxwell are you on? Is this the

@surendra-outreach it depends on your workload (esp how big your transactions are, as well as how large your schemas are) but for a normal workload a 2GB heap should be fine.
-
@osheroff - our current configuration is around 200 databases on a mysql server, with each database having 175 tables. Our pods are configured with 6GB and two full cores. My question is more about good Java GC configuration settings, like: -server -Xms2G -Xmx6G -XX:PermSize=512m -XX:+UseG1GC -XX:MaxGCPauseMillis=? -XX:ParallelGCThreads=? -XX:ConcGCThreads=? -XX:InitiatingHeapOccupancyPercent=? What are your recommendations for GC settings? https://docs.oracle.com/cd/E55119_01/doc.71/e55122/cnf_jvmgc.htm#WSEAD420
-
@surendra-outreach I don't have direct experience running with such a large schema. G1 is a good choice of garbage collector, but I wouldn't twiddle any of the knobs (except of course the max heap size). Are you experiencing any issues or excessive CPU due to GC? I do also recommend that you collect some JVM metrics about GC'ing, especially if you start to tweak these knobs.
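If it helps, GC counts/times and heap usage can be read straight from the JVM's platform MXBeans; something like the sketch below (plain java.lang.management, nothing Maxwell-specific) is enough to tell whether GC is actually eating CPU before you start tuning flags.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Print per-collector GC counts/times and current heap usage.
public class GcStats {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d totalTimeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();
        System.out.printf("heap used=%dMB committed=%dMB max=%dMB%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);
    }
}
```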
-
@osheroff - thanks for the suggestions. One data point I forgot to mention is that some transactions are large, usually in the range of 250K - 500K. We do observe excessive CPU, but are not sure if it is correlated to GC. We are currently recycling PODs when memory usage is above 90%. Does this lead to any data loss? Also, does MD use a producer/consumer queue design pattern to process the binlog? And is there any chance of a memory leak in some lib?
-
No data loss should happen. If maxwell is hard-killed you may get some data duplication.
yes. the binlog replicator is on its own thread and produces rows into a queue. these are consumed by maxwell and fed to the producer. Some producers also have a queue (kafka, kinesis, others), whereas some just write the row directly to the producer.
Sure, why not? We fixed a memory leak in mysql-binlog-connector some releases back. Further leaks are possible, but it's hard to know for sure without proper JVM instrumentation; just because Maxwell is using 6GB of memory does not mean that it has a 6gb active heap. |
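A rough sketch of the thread/queue shape described above: one thread plays the binlog-replicator role and puts rows on a bounded queue, another drains the queue and hands rows to the producer. put() blocks when the queue is full, which is what keeps the replicator from racing ahead of a slow producer. The class names and queue size here are illustrative, not Maxwell's actual internals.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Two threads and one bounded queue: the replicator blocks on put() whenever
// the producer side falls behind, so the queue itself cannot grow without bound.
public class ReplicatorPipelineSketch {
    public static void main(String[] args) {
        BlockingQueue<String> rowQueue = new ArrayBlockingQueue<>(20); // illustrative bound

        Thread replicator = new Thread(() -> {
            try {
                for (long i = 0; ; i++) {
                    rowQueue.put("row-" + i);   // blocks when the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "binlog-replicator");

        Thread producer = new Thread(() -> {
            try {
                while (true) {
                    String row = rowQueue.take();
                    System.out.println("producing " + row); // stand-in for Kafka/Kinesis/etc.
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "producer");

        replicator.start();
        producer.start();
    }
}
```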
-
@osheroff - thanks a lot for the response.
-
@surendra-outreach what version of maxwell are you running?
-
no *known* memory leaks in 1.29.2. If you find a maxwell process that you believe to be leaking, go ahead and capture a heap dump using `jmap -dump:live PID`.
…On Wed, Feb 17, 2021 at 2:22 PM surendra-outreach wrote:
v1.29.2
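Side note: besides running `jmap` against the process from outside, a HotSpot JVM can dump its own heap through the HotSpotDiagnostic MXBean, which is sometimes easier from inside a container. A small sketch; the output path is just an example.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Trigger the same kind of live-objects heap dump that `jmap -dump:live` produces,
// but from inside the JVM (HotSpot only).
public class HeapDump {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean diag =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // true = dump only live (reachable) objects, analogous to -dump:live
        diag.dumpHeap("/tmp/maxwell-heap.hprof", true);
    }
}
```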
-
@osheroff - Here is what I observed in our case with the OOM issue. The binlog reader is fetching changes into the queue a lot faster than the Kafka producer (we use the most reliable config for at-least-once processing) can keep up with. As a result, the unbounded queue grows and leads to an OOM event. Is there a way to limit the queue size? Or any future plan to implement a backpressure strategy? Thanks for all the help and guidance.
-
The queues inside maxwell all have bounds iirc. I’ll double check, but I think they’re all limited to like 20-30 rows.
You might want to check your Kafka queue sizes. Also you might want to get a heap dump.
-
@osheroff - Are you referring to the buffer.memory setting for the Kafka producer? Our current MD config for the producer is aimed at max reliability, like
But I don't see anything configured relating to the Kafka producer client buffer sizes. Thanks
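For reference, a "max reliability" producer setup plus the client-side memory knobs typically looks something like the sketch below. These are standard Kafka producer properties, not Maxwell options (Maxwell passes `kafka.`-prefixed settings from config.properties through to the producer, if I remember the docs right); the broker address and values here are placeholders, not your actual config. `buffer.memory` is the setting that bounds how much unsent data the producer client holds.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

// Sketch: a reliability-oriented Kafka producer config plus its client-side
// memory limits. Values are illustrative placeholders.
public class KafkaProducerConfigSketch {
    public static Properties reliableProducerProps() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");      // placeholder
        p.put(ProducerConfig.ACKS_CONFIG, "all");                          // wait for all in-sync replicas
        p.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        p.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
        p.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1");  // strict ordering
        // buffer.memory bounds unsent data held in the client (default 32MB);
        // when it is full, send() blocks for up to max.block.ms before failing.
        p.put(ProducerConfig.BUFFER_MEMORY_CONFIG, Long.toString(32L * 1024 * 1024));
        p.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "60000");
        return p;
    }
}
```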
-
so there's three queues:
There's also a transaction buffer that can take up a lot of memory (up to 25% of configured max). Again, though, without a heap dump (captured via
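To illustrate the "25% of configured max" point: the cap scales with the JVM's max heap, so raising `-Xmx` also raises how much a single large transaction can hold in memory before it spills. Purely illustrative arithmetic, not Maxwell's code.

```java
// Show how a 25%-of-max-heap cap would be derived from the running JVM's -Xmx.
public class TxBufferCapSketch {
    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory(); // roughly the configured -Xmx
        long txBufferCap = maxHeap / 4;                  // 25% of configured max
        System.out.printf("max heap=%dMB, transaction buffer cap=%dMB%n",
                maxHeap >> 20, txBufferCap >> 20);
    }
}
```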
-
@osheroff - Thanks for the details. We only see this behavior during huge transaction surges. If we bump up the POD memory, it works and gradually drains the queue. We usually run with 4GB or 6GB. For example, last week someone ran a bulk update on a large table that resulted in 10+ million CDC log events. We increased the POD memory to 16GB, and then it was able to handle the load and clear the backlog gradually. Usually memory usage comes back to 2GB, which is the normal state. At this point I suspect that the Kafka producer (consumer) is not able to keep up with the binlog reader (producer) rate. "There's also a transaction buffer that can take up a lot of memory (up to 25% of configured max)." - Is this related to Kafka? Can you please clarify? Yes, next time I will post a heap dump from jmap. Thanks
-
idk how relevant this info all still is, but here it is anyway
-
Hi,
I run maxwell with docker. Logs in the maxwell container:
```
09:18:54,971 INFO AbstractSchemaStore - storing schema @position[BinlogPosition[mysql-bin.001323:363446621], lastHeartbeat=1529572731838] after applying "CREATE TABLE IF NOT EXISTS `block_site_custom_category` (`id` INTEGER (11) PRIMARY KEY AUTO_INCREMENT NOT NULL, `name` VARCHAR (255) NOT NULL, `forbidden` INTEGER (1) NOT NULL DEFAULT 1)" to test_db, new schema id is 263
09:18:54,982 INFO AbstractSchemaStore - storing schema @position[BinlogPosition[mysql-bin.001323:363450605], lastHeartbeat=1529572731838] after applying "ALTER TABLE `block_site_custom` ADD COLUMN `categoryID` INTEGER (11) NOT NULL DEFAULT 1" to test_db, new schema id is 264
09:18:54,990 INFO AbstractSchemaStore - storing schema @position[BinlogPosition[mysql-bin.001323:363453573], lastHeartbeat=1529572731838] after applying "ALTER TABLE `terminal` ADD COLUMN `pci_mac_list` VARCHAR (255) NOT NULL DEFAULT ''" to test_db, new schema id is 265
12:18:14,676 INFO BinaryLogClient - Trying to restore lost connection to 10.0.0.11:3306
12:18:16,453 WARN BinlogConnectorReplicator - replicator stopped at position: mysql-bin.001324:729514106 -- restarting
12:18:16,482 INFO BinlogConnectorLifecycleListener - Binlog disconnected.
Exception in thread "blc-10.0.0.11:3306" java.lang.IllegalStateException: BinaryLogClient is already connected
    at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:473)
    at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:793)
    at java.lang.Thread.run(Thread.java:748)
12:18:26,241 INFO BinaryLogClient - Connected to 10.0.0.11:3306 at mysql-bin.001324/729514106 (sid:6379, cid:647968501)
12:18:26,242 INFO BinlogConnectorLifecycleListener - Binlog connected.
01:44:18,897 INFO MaxwellContext - Sending final heartbeat: 1529631856281
```
The value `binlog_file` in maxwell.positions is still `mysql-bin.001324` before I restart the maxwell container. Now it is `mysql-bin.001328`.
Any idea?