
Replace Batch operation in Cassandra Delete() #4054

Merged
kalafut merged 3 commits into master from cassandra-delete-no-batch on Mar 23, 2018

Conversation

kalafut (Contributor) commented Feb 28, 2018

This fixes failing Cassandra backend tests. It is also probably the
better route, since Batch operations aren't the preferred approach
unless necessary (https://docs.datastax.com/en/cql/3.3/cql/cql_using/useBatch.html).

@kalafut kalafut added this to the 0.9.6 milestone Feb 28, 2018

for _, bucket := range buckets {
	go func(bucket string) {
		results <- c.sess.Query(stmt, bucket, key).Exec()
Contributor:

Doing it this way means that the delete operation is not really atomic, but I see that Put() is also done this way.
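For reference, the pattern in the diff, filled out, looks roughly like the following. This is a sketch, assuming the plugin's c.sess session, its c.buckets(key) helper, and a prepared DELETE statement string stmt; the error handling mirrors the existing Put().

buckets := c.buckets(key)
results := make(chan error, len(buckets))
for _, bucket := range buckets {
	// Each bucket's DELETE runs in its own goroutine and reports on the channel.
	go func(bucket string) {
		results <- c.sess.Query(stmt, bucket, key).Exec()
	}(bucket)
}
// Collect one result per bucket, returning the first error encountered.
for i := 0; i < len(buckets); i++ {
	if err := <-results; err != nil {
		return err
	}
}
return nil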

Contributor Author:

Yes. The alternative approach, which also fixes the test, is:

batch := c.sess.NewBatch(gocql.LoggedBatch)
for _, bucket := range c.buckets(key) {
	batch.Query(stmt, bucket, key)
}

This is the Batch creation process shown in: https://github.com/gocql/gocql/blob/master/batch_test.go
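For completeness, the assembled batch would then be submitted through the session. A sketch, assuming the batch built above and gocql's Session.ExecuteBatch:

// ExecuteBatch sends every statement queued on the batch in a single request.
if err := c.sess.ExecuteBatch(batch); err != nil {
	return err
}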

Contributor:

I would prefer the above. I think batch.Query() appends to the list of queries that will be executed, but we should double-check.

Contributor Author:

It is. So that change just leverages the lib function, but I believe we were doing the same thing ourselves. Using c.sess.NewBatch() instead of gocql.NewBatch() does have some significant differences, though.

Member:

What are those differences, and why would we pick one over the other? It feels way nicer to use the lib's batch operations than goroutines, so if this works (one way or another, via c.sess.NewBatch or gocql.NewBatch), Put should probably be fixed too.

Contributor Author:

I now understand this more fully. c.sess.NewBatch() creates a batch that uses all of the session's default parameters, whereas gocql.NewBatch() returns just a zero-value Batch (and is deprecated). The critical member is "defaultTimestamp", which defaults to true in the session but is false if the Batch is created with gocql.NewBatch(). When true, the client specifies the query timestamp. All of the other queries (none of which use Batch) use client time, but the Batch before this PR would use server time:

If the client does not specify any timestamp, the server will assign one based on the time it receives the query. This can be a problem when the order of the writes matter
(https://docs.datastax.com/en/developer/java-driver/2.1/manual/query_timestamps/)
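To illustrate the difference, here is a sketch against the gocql API referenced above; stmt, bucket, and key are illustrative placeholders, and Batch.DefaultTimestamp is assumed to mirror Query.DefaultTimestamp:

// Session-derived batch: copies the session defaults, including
// defaultTimestamp=true, so the client assigns the write timestamp,
// matching the plugin's other (non-batch) queries.
batch := c.sess.NewBatch(gocql.LoggedBatch)
batch.Query(stmt, bucket, key)

// Deprecated package-level constructor: a zero-value Batch whose
// defaultTimestamp is false, so the server stamps the write on receipt.
// Client-side timestamps would have to be re-enabled explicitly.
legacy := gocql.NewBatch(gocql.LoggedBatch)
legacy.DefaultTimestamp(true)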

I agree Batch seems like the simpler approach, but both the DataStax docs (see the link in the commit message) and @obeattie have raised concerns about it. Specifically: if there are more than 65535 statements in the batch, it will simply fail. More generally, it places a heavier load on the cluster.

Contributor:

While the statement limit does exist as you pointed out, the more problematic limit I've run into is batch_size_fail_threshold_in_kb, which by default is set to 50KB.

obeattie (Contributor):

👋 Hi, I'm the original author of this plugin and we use it extensively in production. What is the specific test failure this fixes? I can't seem to get Travis to show me the output properly.

kalafut (Contributor Author) commented Feb 28, 2018

Hi @obeattie. The errors were varied, but things like:

=== RUN   TestCassandraBackend
--- FAIL: TestCassandraBackend (0.20s)
    cassandra_test.go:37: bad: <nil> expected: &{foo [116 101 115 116] false}
    cassandra_test.go:37: keys[0] did not equal foo: []
    cassandra_test.go:37: expected 2 keys [foo, foo/]: [foo/]
    cassandra_test.go:38: root expected [foo foo/]: [foo/]
    cassandra_test.go:38: level 1 expected [bar bar/]: [bar/]
=== RUN   TestCassandraBackendBuckets
--- PASS: TestCassandraBackendBuckets (0.00s)
FAIL
FAIL    github.com/hashicorp/vault/physical/cassandra    0.210s

They were seen when running make test, relying on Docker for Mac. Colleagues using Linux didn't see the errors, and I suspect this is due to timing differences.

I played a little with disabling certain parts of tests and it seemed like Deletes were always the contributor, and that they were effectively happening sometime after their Go invocation (and after other DB ops). For example, disabling a Delete command that should have been a no-op (there was no entry yet) would allow a subsequent Get test to pass. I noticed that both updating the Batch syntax and just switching it out for the syntax you used in Put() allowed tests to pass.

obeattie (Contributor) commented Feb 28, 2018

That is quite odd and I'm not sure I understand how removing batching would fix that specific problem. Cassandra uses wallclock timestamps to sequence operations, and deletes are effectively "just" writes – is it possible there was clock skew between the Docker VM and your host Mac?

Separately, I don't think removing the batching here is a bad thing – batching has some other slightly unexpected behaviours (for example, batch sizes are limited aggressively and will fail if the statements sum to larger than the limit) – but I am sceptical that it is a fix for the root cause of whatever behaviour you were seeing.

calvn (Contributor) commented Feb 28, 2018

Now that you mentioned it, it's very likely that the issue is caused by clock skew between the VM and the host.

jefferai (Member) commented Mar 1, 2018

I know @chrishoffman has been seeing massive skew problems on Mac OS lately.

It seems like batching isn't the best idea; however, @obeattie, thoughts on having Delete use the same goroutine-based approach as Put? Basically: is it a bad idea, even if it's not usually necessary? Also, I was going to make a comment about Cassandra relying on client-side timestamps, but apparently it defaults to server-side, which makes this problem a bit more mystifying to me; even if skew is getting worse, as long as the delete requests are sent in order, I would think from the server's perspective it'd still maintain that order.

kalafut (Contributor Author) commented Mar 2, 2018

@jefferai re: the timestamps, what I observed in the tests was effectively this: a test may do [DELETE, PUT, GET], but what actually happens is [PUT, DELETE, GET]. Only the DELETE gets a server-side timestamp; the PUT and GET go through the session, not the batch, so they get client timestamps. I'd suspect that a series of DELETEs retains its order, and a series of {PUT | GET} retains its order, but mixing DELETE in with the other ops skews the ordering. For example, if the Docker VM's clock ran ahead of the host's, the DELETE's server-assigned timestamp could exceed the client-assigned timestamp of a PUT issued moments later, so the delete would shadow it.

jefferai (Member) commented Mar 2, 2018

I read https://www.datastax.com/dev/blog/why-cassandra-doesnt-need-vector-clocks about this. I'm not super convinced. The fact that Cassandra does updates column-wise is great -- it's almost like RDBMSes were onto something -- but the fact that conflict resolution is dependent on a mix of server and client timestamps feels like asking for trouble. The article claims timestamps are nice to have but not critical, but it didn't take much to see unintended behavior. At a minimum you could imagine that a server holding open a session could figure out at the start of the session what the expected skew from that client is, and maybe heartbeat regularly to keep that skew value updated.

@jefferai jefferai modified the milestones: 0.9.6, 0.10 Mar 20, 2018
@kalafut kalafut merged commit d396328 into master Mar 23, 2018
@kalafut kalafut deleted the cassandra-delete-no-batch branch March 23, 2018 16:44