Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CompareAndSet overrides the value when passed counter is 0 #113

Closed
hnakamur opened this issue Jul 22, 2017 · 18 comments
Closed

CompareAndSet overrides the value when passed counter is 0 #113

hnakamur opened this issue Jul 22, 2017 · 18 comments

Comments

@hnakamur
Copy link
Contributor

First of all, thanks for creating and sharing a great software!

I noticed CompareAndSet overrides the value when passed counter is 0.
Is this the expected behavior?
If so, I think it should be documented.
And then whether the casCounter ever becomes zero or not by calling Set.

I created a test code to reproduce this problem.
https://github.com/hnakamur/badger/commit/9b4439446e999cada58ae17869e3ffdc056b5a3e

I got the following test result.

$ go test -v -run=TestCASNoBatch
=== RUN   TestCASNoBatch
=== RUN   TestCASNoBatch/same_counter_updates_the_value
=== RUN   TestCASNoBatch/different_counter_does_not_override_the_value
=== RUN   TestCASNoBatch/zero_counter_should_not_override_the_value_but_it_does
--- FAIL: TestCASNoBatch (0.92s)
    --- PASS: TestCASNoBatch/same_counter_updates_the_value (0.00s)
    --- PASS: TestCASNoBatch/different_counter_does_not_override_the_value (0.00s)
    --- FAIL: TestCASNoBatch/zero_counter_should_not_override_the_value_but_it_does (0.00s)
        kv_test.go:185: error expected but got no error
FAIL
exit status 1
FAIL    github.com/dgraph-io/badger     0.927s
$ go version
go version go1.9beta2 linux/amd64
@hnakamur
Copy link
Contributor Author

I looked at code and understand
Set and CompareAndSet with casCounter=0 passes the same Entry value to BatchSet.
So it seems an expected behavior.

And looking at newCASCounter, I understand cas counters never be 0.

As a side point, I noticed the difference in error handlings in Set and CompareAndSet.
Could you explain the reason for this difference?

@manishrjain
Copy link
Contributor

CAS can fail if the counter doesn't match. Set would only fail if there's some internal issue, like disk write failure etc. Does that clarify things?

@cardbot
Copy link

cardbot commented Jul 22, 2017

The current CAS implementation uses a random number between 0 and 65535. But what happens if there are multiple writers racing using the same CAS, and the random number generator picks the same number (1 in 65535 chance of that)? This would allow multiple writers to set their value, even though only one writer should win.

@hnakamur
Copy link
Contributor Author

Your comment clarify most of things except the difference in error handling in Set and CompareAndSet.

I looked at the code and learned the reason for the difference by rewriting the code and run the test.

I looked at the code of BatchSet

badger/kv.go

Lines 739 to 747 in 2fb2716

var err error
for _, req := range reqs {
req.Wg.Wait()
if req.Err != nil {
err = req.Err
}
requestPool.Put(req)
}
return err

and learned the error returned by BatchSet is the last error set on reqs.

I tried to rewrite

	if err := s.BatchSet([]*Entry{e}); err != nil {
		return err
	}
	return e.Error

in BatchSet to

	return s.BatchSet([]*Entry{e})

like code in Set.

Then I run the test TestCAS and I got the test error.

$ go test -v -run TestCAS
=== RUN   TestCAS
--- FAIL: TestCAS (1.02s)
        Error Trace:    kv_test.go:159
        Error:          An error is expected but got nil.
FAIL
exit status 1

And I learned the CasMismatch is set to Entry.Error and it is not returned by BatchSet.
Now I'm clear about the difference.

@hnakamur
Copy link
Contributor Author

@cardbot Just a nitpick, actually it is a random number between 1 and 65535.

badger/util.go

Lines 151 to 153 in 5ae0851

func newCASCounter() uint16 {
return uint16(1 + mod65535(rand.Uint32()))
}

func newCASCounter() uint16 {
	return uint16(1 + mod65535(rand.Uint32()))
}

I am also worried about an accidental casCounter matches.
It's possibility is 0.0015%.

>>> 1.0 / 65535 * 100
0.0015259021896696422

Is this low enough?
Maybe casCounter type should be changed to uint32?
Then the possibility becomes 2.32e-08 %.

>>> 1.0 / (2**32-1) * 100
2.3283064370807974e-08

@srh
Copy link

srh commented Jul 23, 2017

Using [1, 2**32-1] is also no good for the same reason. Using [1, 2**64-1] would be fine. Instead of random generation you could also use a single global 64-bit counter, though the choice around that is questionable.

Another option is to get rid of CompareAndSwap.

It looks like we store the CASCounter in the value log anyway. So if users want CAS they can do it at the application layer without performance loss, if Badger exposes something like MaybeSet(k []key, f func(oldVal []byte) (newVal []byte, ok bool) in its API (and that user-supplied func had better be relatively simple).

@hnakamur
Copy link
Contributor Author

Passing the old value does not cause writing the old value to log, which is good, however the application must pass the old value to badger, so if the old value size is large (say, 1GB) it causes performance loss, right?

Off the topic, what is the recommended max size for keys and values which does not cause serious performance loss?

How about using time.UnixNano() for the casCounter?
The limitation is that you cannot update multiple times in one nanosecond, however it doesn't matter since it will take more than one nanosecond for updating one value anyway.

@srh
Copy link

srh commented Jul 24, 2017

If you use time.UnixNano() to generate the casCounter, and if you have two threads writing the key once per millisecond, it will only take 1000 seconds for them to generate the same CAS value, with big potential for bugs because they happen at the same time.

If we could guarantee that two threads aren't generating a casCounter at the same time, you could instead just increment a counter. For all I know, that might be possible, it might be a super-easy fix.

@hnakamur
Copy link
Contributor Author

This is just an idea, how about using atomic.AddUint64 for casCounter?
If we have a lot of keys that do not fit in memory, we can save the casCounter, load it back and create the counter on memory with atomic.StoreUint64.
I'm not sure the last part can be done safely when multiple goroutines load the casCounter for the same key, though.

@manishrjain
Copy link
Contributor

The reason we didn't just go with an increasing counter per key, is because we want to avoid doing a read for every write. Note that for CAS to work, every Set must also have stored a counter. Hence, every write would then involve at least one read to find the last counter, so we can store a new one. This obviously gets expensive.

Using 64 bits for CAS is very expensive as well. Currently, we store the cas counter in the LSM tree; which we suggest folks to store in RAM for great performance. Instead of storing only 2 bytes, if we now store 8 bytes per key, that's definitely going to decrease the number of keys we can store per .sst file.

We could consider doing atomic.AddUint16, but that could still cause issues -- two writes done exactly at the right counter value apart, could still cause an unexpected overwrite. This approach might be slightly better than what we're doing, arguably.

I don't see much issue with the current approach -- we'd need to have multiple CAS operations for the same key; and the random generator must in this data race, produce the same number; to potentially cause an issue. Note that this random generator is being shared across all the keys, so just spewing out the same number twice won't really cut it.

@srh
Copy link

srh commented Jul 24, 2017

The current approach makes it a completely useless feature.

An incrementing counter can be done on a per-store basis, so there's no read cost.

@manishrjain
Copy link
Contributor

manishrjain commented Jul 24, 2017

The current approach makes it a completely useless feature.

That's an overstatement.

An incrementing counter can be done on a per-store basis, so there's no read cost.

Please explain. Do you mean atomic.AddUint16?

@cardbot
Copy link

cardbot commented Jul 24, 2017

I agree with @srh. Imagine if atomic.CompareAndSwapInt64() did not work correctly every 1 in 65535 times? It'd be a major bug that the golang developers would patch immediately. To address another part of your statement, races can happen in other cases as well, not just when the rng produces the same number twice in a row:

  1. Caller Compaction logic #1 invokes Get and fetches record with CAS counter = 1001.
  2. Caller Various small tweaks #2 invokes Get and fetches record with CAS counter = 1001.
  3. Caller Various small tweaks #2 invokes Set and updates record, which gets assigned counter = 12345.
  4. Caller Implement bloom filters #3 invokes Get and fetches record, now with CAS counter = 12345.
  5. Caller Implement bloom filters #3 invokes Set and updates record, which gets assigned counter = 1001 (because rng produced 12345 followed by 1001).
  6. Caller Compaction logic #1 invokes Set and updates it, because the counters erroneously match.

This is just one scenario; many others are possible. In scenarios with a lot of concurrent access, this will fail from time to time and result in data corruption. I believe you need to fix.

In the current code, you're already "getting" the value in order to check CAS:

		if entry.CASCounterCheck != 0 {
			oldValue, err := s.get(entry.Key)
			if err != nil {
				return errors.Wrap(err, "writeToLSM")
			}
			// No need to decode existing value. Just need old CAS counter.
			if oldValue.CASCounter != entry.CASCounterCheck {
				entry.Error = CasMismatch
				continue
			}
		}

Why can't you move that code up into writeRequest and merge it with this:

	for _, req := range reqs {
		for _, e := range req.Entries {
			e.casCounter = newCASCounter()
		}
	}

That way, you could use an incrementing counter, which would be better (though still not perfect since it can roll over).

In the case where CASCounterCheck = 0, I'm not sure that you have to worry about setting a new value of CASCounter. The caller doesn't care about CAS in that case, so I don't see a need to fetch the old value in order to set the new. I'm struggling to think of a scenario where a caller would want to mix CAS calls with non-CAS calls on the same value.

@manishrjain
Copy link
Contributor

manishrjain commented Jul 24, 2017

I think if we want to fix it with no chance of failure, then the only solution I can think of to do this right, would involve doing a read to get the cas counter for every write. Even when you're doing a Set, without CAS, that Set needs to have a counter update. Otherwise, if one uses Set and CAS on the same key, then the counter has to represent the latest value, not some older value.

I'm struggling to think of a scenario where a caller would want to mix CAS calls with non-CAS calls on the same value.

We can't assume that one would never mix these calls. Even if we avoid this mixing, to know that this particular key should not have a CAS, we'll still have to do a get.

I'm still not sure if @srh had a better solution here. But, I think if LSM tree is already in memory, then doing a read before every write might be ok. If not, then this would severely reduce our write throughput.

@srh
Copy link

srh commented Jul 24, 2017

That's an overstatement.

Maybe I could imagine a scenario where some sort of lossy CAS operation was acceptable, where corrupting data once every 65535 CAS races was OK and avoiding the corruption was also generally useful. If such a use case exists, it's a very slim portion of them, and even then such users would probably much prefer 32-bit or 64-bit random CAS values, or 32-bit incrementing values. It would be difficult to write a test suite for such a program.

Please explain. Do you mean atomic.AddUint16?

I meant a 32- or 64-bit global incrementing counter (or several partitioned ones). You don't have to read to generate CAS values. A 64-bit rng would have the same effect, but a 64-bit incrementing counter could have ancillary uses (like efficient snapshot retrieval for Dgraph).

@cardbot
Copy link

cardbot commented Jul 24, 2017

Maybe the best way to resolve this is by enabling your callers to CAS themselves:

// CompareAndSet conditionally writes a key/value record only if its current CAS
// value equals oldCas. If the record does not yet exist, then it will be written only
// if oldCas = 0.
func (s *KV) CompareAndSet(key []byte, val []byte, oldCas, newCas uint16) error

Doing it this way also means you don't need a separate SetIfNotExists method.

The existing Set method would just always set CAS equal to 0. It's up to users to use CompareAndSet if they want to control the CAS values.

@manishrjain
Copy link
Contributor

@srh makes a great point that given that writes are happening serially in Badger, we can use uint64 counter, which even on a 4 GHz server incrementing at the rate of 1 per nanosecond, would take 146 years to reach uint64 limit.

So, using a global incrementing counter should be a great solution here, which can avoid our read overhead, while giving us all the wins expected from CAS.

@manishrjain
Copy link
Contributor

Closing this issue, as @srh is tracking the changes required in another.

spongedu pushed a commit to spongedu/badger that referenced this issue Jan 14, 2021
We don’t need to wait for the buffer on the storage device to flush on macOS.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

4 participants