CompareAndSet overrides the value when passed counter is 0 #113

hnakamur · 2017-07-22T09:05:27Z

First of all, thanks for creating and sharing a great software!

I noticed CompareAndSet overrides the value when passed counter is 0.
Is this the expected behavior?
If so, I think it should be documented.
And then whether the casCounter ever becomes zero or not by calling Set.

I created a test code to reproduce this problem.
https://github.com/hnakamur/badger/commit/9b4439446e999cada58ae17869e3ffdc056b5a3e

I got the following test result.

$ go test -v -run=TestCASNoBatch
=== RUN   TestCASNoBatch
=== RUN   TestCASNoBatch/same_counter_updates_the_value
=== RUN   TestCASNoBatch/different_counter_does_not_override_the_value
=== RUN   TestCASNoBatch/zero_counter_should_not_override_the_value_but_it_does
--- FAIL: TestCASNoBatch (0.92s)
    --- PASS: TestCASNoBatch/same_counter_updates_the_value (0.00s)
    --- PASS: TestCASNoBatch/different_counter_does_not_override_the_value (0.00s)
    --- FAIL: TestCASNoBatch/zero_counter_should_not_override_the_value_but_it_does (0.00s)
        kv_test.go:185: error expected but got no error
FAIL
exit status 1
FAIL    github.com/dgraph-io/badger     0.927s
$ go version
go version go1.9beta2 linux/amd64

The text was updated successfully, but these errors were encountered:

hnakamur · 2017-07-22T09:31:45Z

I looked at code and understand
Set and CompareAndSet with casCounter=0 passes the same Entry value to BatchSet.
So it seems an expected behavior.

And looking at newCASCounter, I understand cas counters never be 0.

As a side point, I noticed the difference in error handlings in Set and CompareAndSet.
Could you explain the reason for this difference?

manishrjain · 2017-07-22T18:23:18Z

CAS can fail if the counter doesn't match. Set would only fail if there's some internal issue, like disk write failure etc. Does that clarify things?

cardbot · 2017-07-22T23:42:02Z

The current CAS implementation uses a random number between 0 and 65535. But what happens if there are multiple writers racing using the same CAS, and the random number generator picks the same number (1 in 65535 chance of that)? This would allow multiple writers to set their value, even though only one writer should win.

hnakamur · 2017-07-22T23:42:55Z

Your comment clarify most of things except the difference in error handling in Set and CompareAndSet.

I looked at the code and learned the reason for the difference by rewriting the code and run the test.

I looked at the code of BatchSet

badger/kv.go

Lines 739 to 747 in 2fb2716

    
           var err error 
        
           for _, req := range reqs { 
        
           	req.Wg.Wait() 
        
           	if req.Err != nil { 
        
           		err = req.Err 
        
           	} 
        
           	requestPool.Put(req) 
        
           } 
        
           return err

and learned the error returned by BatchSet is the last error set on reqs.

I tried to rewrite

	if err := s.BatchSet([]*Entry{e}); err != nil {
		return err
	}
	return e.Error

in BatchSet to

	return s.BatchSet([]*Entry{e})

like code in Set.

Then I run the test TestCAS and I got the test error.

$ go test -v -run TestCAS
=== RUN   TestCAS
--- FAIL: TestCAS (1.02s)
        Error Trace:    kv_test.go:159
        Error:          An error is expected but got nil.
FAIL
exit status 1

And I learned the CasMismatch is set to Entry.Error and it is not returned by BatchSet.
Now I'm clear about the difference.

hnakamur · 2017-07-22T23:49:22Z

@cardbot Just a nitpick, actually it is a random number between 1 and 65535.

badger/util.go

Lines 151 to 153 in 5ae0851

    
           func newCASCounter() uint16 { 
        
           	return uint16(1 + mod65535(rand.Uint32())) 
        
           }

func newCASCounter() uint16 {
	return uint16(1 + mod65535(rand.Uint32()))
}

I am also worried about an accidental casCounter matches.
It's possibility is 0.0015%.

>>> 1.0 / 65535 * 100
0.0015259021896696422

Is this low enough?
Maybe casCounter type should be changed to uint32?
Then the possibility becomes 2.32e-08 %.

>>> 1.0 / (2**32-1) * 100
2.3283064370807974e-08

srh · 2017-07-23T21:55:47Z

Using [1, 2**32-1] is also no good for the same reason. Using [1, 2**64-1] would be fine. Instead of random generation you could also use a single global 64-bit counter, though the choice around that is questionable.

Another option is to get rid of CompareAndSwap.

It looks like we store the CASCounter in the value log anyway. So if users want CAS they can do it at the application layer without performance loss, if Badger exposes something like MaybeSet(k []key, f func(oldVal []byte) (newVal []byte, ok bool) in its API (and that user-supplied func had better be relatively simple).

hnakamur · 2017-07-23T22:59:17Z

Passing the old value does not cause writing the old value to log, which is good, however the application must pass the old value to badger, so if the old value size is large (say, 1GB) it causes performance loss, right?

Off the topic, what is the recommended max size for keys and values which does not cause serious performance loss?

How about using time.UnixNano() for the casCounter?
The limitation is that you cannot update multiple times in one nanosecond, however it doesn't matter since it will take more than one nanosecond for updating one value anyway.

srh · 2017-07-24T00:59:44Z

If you use time.UnixNano() to generate the casCounter, and if you have two threads writing the key once per millisecond, it will only take 1000 seconds for them to generate the same CAS value, with big potential for bugs because they happen at the same time.

If we could guarantee that two threads aren't generating a casCounter at the same time, you could instead just increment a counter. For all I know, that might be possible, it might be a super-easy fix.

hnakamur · 2017-07-24T01:11:43Z

This is just an idea, how about using atomic.AddUint64 for casCounter?
If we have a lot of keys that do not fit in memory, we can save the casCounter, load it back and create the counter on memory with atomic.StoreUint64.
I'm not sure the last part can be done safely when multiple goroutines load the casCounter for the same key, though.

manishrjain · 2017-07-24T06:50:21Z

The reason we didn't just go with an increasing counter per key, is because we want to avoid doing a read for every write. Note that for CAS to work, every Set must also have stored a counter. Hence, every write would then involve at least one read to find the last counter, so we can store a new one. This obviously gets expensive.

Using 64 bits for CAS is very expensive as well. Currently, we store the cas counter in the LSM tree; which we suggest folks to store in RAM for great performance. Instead of storing only 2 bytes, if we now store 8 bytes per key, that's definitely going to decrease the number of keys we can store per .sst file.

We could consider doing atomic.AddUint16, but that could still cause issues -- two writes done exactly at the right counter value apart, could still cause an unexpected overwrite. This approach might be slightly better than what we're doing, arguably.

I don't see much issue with the current approach -- we'd need to have multiple CAS operations for the same key; and the random generator must in this data race, produce the same number; to potentially cause an issue. Note that this random generator is being shared across all the keys, so just spewing out the same number twice won't really cut it.

srh · 2017-07-24T07:11:40Z

The current approach makes it a completely useless feature.

An incrementing counter can be done on a per-store basis, so there's no read cost.

manishrjain · 2017-07-24T07:20:20Z

The current approach makes it a completely useless feature.

That's an overstatement.

An incrementing counter can be done on a per-store basis, so there's no read cost.

Please explain. Do you mean atomic.AddUint16?

cardbot · 2017-07-24T15:08:23Z

I agree with @srh. Imagine if atomic.CompareAndSwapInt64() did not work correctly every 1 in 65535 times? It'd be a major bug that the golang developers would patch immediately. To address another part of your statement, races can happen in other cases as well, not just when the rng produces the same number twice in a row:

Caller Compaction logic #1 invokes Get and fetches record with CAS counter = 1001.
Caller Various small tweaks #2 invokes Get and fetches record with CAS counter = 1001.
Caller Various small tweaks #2 invokes Set and updates record, which gets assigned counter = 12345.
Caller Implement bloom filters #3 invokes Get and fetches record, now with CAS counter = 12345.
Caller Implement bloom filters #3 invokes Set and updates record, which gets assigned counter = 1001 (because rng produced 12345 followed by 1001).
Caller Compaction logic #1 invokes Set and updates it, because the counters erroneously match.

This is just one scenario; many others are possible. In scenarios with a lot of concurrent access, this will fail from time to time and result in data corruption. I believe you need to fix.

In the current code, you're already "getting" the value in order to check CAS:

		if entry.CASCounterCheck != 0 {
			oldValue, err := s.get(entry.Key)
			if err != nil {
				return errors.Wrap(err, "writeToLSM")
			}
			// No need to decode existing value. Just need old CAS counter.
			if oldValue.CASCounter != entry.CASCounterCheck {
				entry.Error = CasMismatch
				continue
			}
		}

Why can't you move that code up into writeRequest and merge it with this:

	for _, req := range reqs {
		for _, e := range req.Entries {
			e.casCounter = newCASCounter()
		}
	}

That way, you could use an incrementing counter, which would be better (though still not perfect since it can roll over).

In the case where CASCounterCheck = 0, I'm not sure that you have to worry about setting a new value of CASCounter. The caller doesn't care about CAS in that case, so I don't see a need to fetch the old value in order to set the new. I'm struggling to think of a scenario where a caller would want to mix CAS calls with non-CAS calls on the same value.

manishrjain · 2017-07-24T21:07:59Z

I think if we want to fix it with no chance of failure, then the only solution I can think of to do this right, would involve doing a read to get the cas counter for every write. Even when you're doing a Set, without CAS, that Set needs to have a counter update. Otherwise, if one uses Set and CAS on the same key, then the counter has to represent the latest value, not some older value.

I'm struggling to think of a scenario where a caller would want to mix CAS calls with non-CAS calls on the same value.

We can't assume that one would never mix these calls. Even if we avoid this mixing, to know that this particular key should not have a CAS, we'll still have to do a get.

I'm still not sure if @srh had a better solution here. But, I think if LSM tree is already in memory, then doing a read before every write might be ok. If not, then this would severely reduce our write throughput.

srh · 2017-07-24T21:27:53Z

That's an overstatement.

Maybe I could imagine a scenario where some sort of lossy CAS operation was acceptable, where corrupting data once every 65535 CAS races was OK and avoiding the corruption was also generally useful. If such a use case exists, it's a very slim portion of them, and even then such users would probably much prefer 32-bit or 64-bit random CAS values, or 32-bit incrementing values. It would be difficult to write a test suite for such a program.

Please explain. Do you mean atomic.AddUint16?

I meant a 32- or 64-bit global incrementing counter (or several partitioned ones). You don't have to read to generate CAS values. A 64-bit rng would have the same effect, but a 64-bit incrementing counter could have ancillary uses (like efficient snapshot retrieval for Dgraph).

cardbot · 2017-07-24T23:06:25Z

Maybe the best way to resolve this is by enabling your callers to CAS themselves:

// CompareAndSet conditionally writes a key/value record only if its current CAS
// value equals oldCas. If the record does not yet exist, then it will be written only
// if oldCas = 0.
func (s *KV) CompareAndSet(key []byte, val []byte, oldCas, newCas uint16) error

Doing it this way also means you don't need a separate SetIfNotExists method.

The existing Set method would just always set CAS equal to 0. It's up to users to use CompareAndSet if they want to control the CAS values.

manishrjain · 2017-07-25T03:48:50Z

@srh makes a great point that given that writes are happening serially in Badger, we can use uint64 counter, which even on a 4 GHz server incrementing at the rate of 1 per nanosecond, would take 146 years to reach uint64 limit.

So, using a global incrementing counter should be a great solution here, which can avoid our read overhead, while giving us all the wins expected from CAS.

manishrjain · 2017-07-26T04:15:52Z

Closing this issue, as @srh is tracking the changes required in another.

We don’t need to wait for the buffer on the storage device to flush on macOS.

srh mentioned this issue Jul 25, 2017

Make CompareAndSet values use a global 64-bit incrementing counter. #126

Closed

manishrjain closed this as completed Jul 26, 2017

spongedu pushed a commit to spongedu/badger that referenced this issue Jan 14, 2021

use syscall.Fsync for darwin platform. (dgraph-io#113)

1aecb98

We don’t need to wait for the buffer on the storage device to flush on macOS.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CompareAndSet overrides the value when passed counter is 0 #113

CompareAndSet overrides the value when passed counter is 0 #113

hnakamur commented Jul 22, 2017

hnakamur commented Jul 22, 2017

manishrjain commented Jul 22, 2017

cardbot commented Jul 22, 2017

hnakamur commented Jul 22, 2017

hnakamur commented Jul 22, 2017

srh commented Jul 23, 2017 •

edited

Loading

hnakamur commented Jul 23, 2017

srh commented Jul 24, 2017

hnakamur commented Jul 24, 2017

manishrjain commented Jul 24, 2017

srh commented Jul 24, 2017

manishrjain commented Jul 24, 2017 •

edited

Loading

cardbot commented Jul 24, 2017 •

edited

Loading

manishrjain commented Jul 24, 2017 •

edited

Loading

srh commented Jul 24, 2017

cardbot commented Jul 24, 2017 •

edited

Loading

manishrjain commented Jul 25, 2017

manishrjain commented Jul 26, 2017

CompareAndSet overrides the value when passed counter is 0 #113

CompareAndSet overrides the value when passed counter is 0 #113

Comments

hnakamur commented Jul 22, 2017

hnakamur commented Jul 22, 2017

manishrjain commented Jul 22, 2017

cardbot commented Jul 22, 2017

hnakamur commented Jul 22, 2017

hnakamur commented Jul 22, 2017

srh commented Jul 23, 2017 • edited Loading

hnakamur commented Jul 23, 2017

srh commented Jul 24, 2017

hnakamur commented Jul 24, 2017

manishrjain commented Jul 24, 2017

srh commented Jul 24, 2017

manishrjain commented Jul 24, 2017 • edited Loading

cardbot commented Jul 24, 2017 • edited Loading

manishrjain commented Jul 24, 2017 • edited Loading

srh commented Jul 24, 2017

cardbot commented Jul 24, 2017 • edited Loading

manishrjain commented Jul 25, 2017

manishrjain commented Jul 26, 2017

srh commented Jul 23, 2017 •

edited

Loading

manishrjain commented Jul 24, 2017 •

edited

Loading

cardbot commented Jul 24, 2017 •

edited

Loading

manishrjain commented Jul 24, 2017 •

edited

Loading

cardbot commented Jul 24, 2017 •

edited

Loading