Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

don't commit GTID to GtidSet until we hit COMMIT #250

Merged
merged 7 commits into from
Dec 10, 2018

Conversation

osheroff
Copy link
Collaborator

@osheroff osheroff commented Dec 6, 2018

Hey @shyiko,

Been working on zendesk/maxwell#1129, the tl;dr of which is that a maxwell user has a GTID-enabled connection to the mysql master that seems to drop fairly frequently. If this connection drops in the middle of processing a transaction, maxwell will attempt to reconnect the binlog-connector, but the binlog-connector's GTID position has already incremented, leading to data loss.

I tried to do this all maxwell-side, but it got a little ugly; maxwell had to maintain the current GTID string and then splice it into the GTID string if we crash in the middle of processing a transaction. It's possible, but then I had a think and figured that this approach probably makes more sense for 99% of binlog-connector use cases. I think. LMK what you think.

cheers,
osheroff

otherwise, clients that disconnect in the middle of a GTID transaction
will be pointing at an incorrect binlog position.
Copy link
Owner

@shyiko shyiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gtidSet.add(gtidEventData.getGtid());
GtidEventData gtidEventData = (GtidEventData) unwrapEventData(event.getData());
currentGtid = gtidEventData.getGtid();
} else if ( gtidSet != null ) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gtidSet is no longer protected by gtidSetAccessLock

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is, in addGtidToSet

@shyiko
Copy link
Owner

shyiko commented Dec 6, 2018

Hey Ben.
Just to make sure I understand this correctly.
If I try reconnecting in the middle of transaction when GTID is on I'm not going to get the remaining events in that transaction / only the next one? This sounds weird (inconsistent with the behavior when GTID turned off) but believable (MariaDB case - #53).

@osheroff
Copy link
Collaborator Author

osheroff commented Dec 6, 2018

Stanley,
yeah, you've got it - it's more or less the same as the mariadb issue in that w/ GTID you're not allowed to index straight into the middle of a transaction, but instead are constrained to the boundaries.

I'll make the fixes you ask for.

@osheroff
Copy link
Collaborator Author

osheroff commented Dec 8, 2018

bah. @shyiko I'm stuck.

have managed to get the vagrant VM into GTID mode, and even writing these tests has forced me to re-work the code, but now I'm stuck with the tests.

In the tests I'm calling

                eventListener.waitFor(XidEventData.class, 1, TimeUnit.SECONDS.toMillis(4));

but for some damn reason this always seems to return before the actual event gets logged, and before the action that I'm waiting for. My expectation is that this test-code should return after updateGtidSet is called, but it's not the truth.

Am I mis-understanding how to use the test suite? I think I must be.

@osheroff
Copy link
Collaborator Author

osheroff commented Dec 8, 2018

nevermind, I'm an idiot and didn't see there were two client instances. rubber duck debugging wins again.

@osheroff
Copy link
Collaborator Author

osheroff commented Dec 8, 2018

btw for my integration tests I use https://github.com/osheroff/onetimeserver, which can quickly bring up a mysql server in whatever configuration you want, and at the end of the test we throw the whole server away. It's been generally less headache for me than maintaining vagrant and trying to reset the server back into a good state before the next test. any interest in trying that?

was confused about client vs clientKeepAlive.
@shyiko
Copy link
Owner

shyiko commented Dec 10, 2018

Vagrant is being phased out by docker/docker-compose (specifically by #238). https://github.com/osheroff/onetimeserver does look interesting but I feel like switching to docker should be enough here.

Anyway, thank you for yet another PR, Ben!
Merging in!

@shyiko shyiko merged commit ddaabfd into shyiko:master Dec 10, 2018
@osheroff
Copy link
Collaborator Author

thanks for the merge! any chance of a release here?

@shyiko
Copy link
Owner

shyiko commented Dec 17, 2018

Absolutely. I'll try to publish a new release soon (over the next couple of days).

@osheroff
Copy link
Collaborator Author

osheroff commented Jan 2, 2019

hey @shyiko imma bugging you for a release in the new year

shyiko pushed a commit that referenced this pull request Jan 7, 2019
@shyiko
Copy link
Owner

shyiko commented Jan 7, 2019

0.17.0 is finally here 🐌

@osheroff
Copy link
Collaborator Author

that's a lovely snail you've got there, stanley. I've got my own snail-like PR going.

BTW, I ran into a weird gotcha while doing reconnect logic; check out
https://github.com/zendesk/maxwell/pull/1186/files#diff-6c0364725d8f72f2d70f68c6c6115747R343...

Basically in order to reconnect a GTID-enabled binlog connector using GTID positioning instead of file/offset positioning, I have to clear out the filename and position that's stored. I'm not utterly sure what the right thing to do when reconnecting a GTID client is... using file/offset could be more accurate, but if you're connecting to the master through a VIP and the reason your connection got dropped is that the master changed, using GTID positioning would be better.

In my case, a Maxwell user (going to the master via kubernetes networking) would get disconnects that happened right in the middle of a GTID transaction. When that happened and we tried to reconnect with file/offset positioning all hell would break loose.

Anyway, not sure if you want to do something about it but I thought I'd let ya know.

Happy new year!
-ben

@shyiko
Copy link
Owner

shyiko commented Jan 11, 2019

Heh :) Happy New Year to you too, Ben 🎄

Thanks for pointing that out. It kinda makes sense (if you have multiple MySQL pods accessible through the Service client might end up connecting to a different pod on "reconnect"). This needs to be fixed on mysql-binlog-connector-java side (issue - #254).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants