
GC doesn't work? (not cleaning up SST files properly) #1228

Closed
adwinsky opened this issue Feb 25, 2020 · 25 comments
Labels
area/gc Garbage Collection issues. kind/bug Something is broken. priority/P1 Serious issue that requires eventual attention (can wait a bit) status/accepted We accept to investigate or work on it.

Comments

@adwinsky

What version of Go are you using (go version)?

$ go version
1.13.8

What version of Badger are you using?

v1.6.0

opts := badger.DefaultOptions(fmt.Sprintf("%s/%s", dir, name))
opts.SyncWrites = false
opts.ValueLogLoadingMode = options.FileIO

Does this issue reproduce with the latest master?

With the latest master, GC becomes much slower.

What are the hardware specifications of the machine (RAM, OS, Disk)?

2TB NVME drive, 128 GB RAM

What did you do?

I have a Kafka topic with 12 partitions, and for every partition I create a separate database. Each database grows quite quickly (about 12×30 GB per hour) and the TTL for most of the events is 1h, so the size should stay at a constant level. For every partition I create a separate transaction and process the read and write operations sequentially; there is no concurrency. When the transaction gets too big I commit it, and in a separate goroutine I start RunValueLogGC(0.5). Most GC runs end with ErrNoRewrite. I even tried repeating RunValueLogGC until I got 5 errors in a row, but I was still running out of disk space quite quickly. My current fix is to patch the Badger GC so that it runs on every fid before the head. This works fine, but eventually becomes slow once there are too many log files.
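A minimal sketch of such a GC retry loop (assuming badger v1.x, with the DB path and the helper name purely illustrative; the real service also handles shutdown and metrics):

package main

import (
	"log"

	badger "github.com/dgraph-io/badger"
)

// runValueGC keeps calling RunValueLogGC(0.5) until it gets ErrNoRewrite
// five times in a row, mirroring the retry strategy described above.
func runValueGC(db *badger.DB) {
	failures := 0
	for failures < 5 {
		switch err := db.RunValueLogGC(0.5); err {
		case nil:
			failures = 0 // a vlog file was rewritten; keep going
		case badger.ErrNoRewrite:
			failures++ // nothing worth rewriting this round
		default:
			log.Printf("value log GC stopped: %v", err)
			return
		}
	}
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/example-db"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	runValueGC(db)
}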

What did you expect to see?

The size of each of the twelve databases I created should stay at a constant level, under 20 GB.

What did you see instead?

After running it for a day, if I look at one of the twelve databases, I see 210 sst files and 68 vlog files, and the db size is 84 GB (and these numbers keep growing).

If I run badger histogram it shows me these stats:

Histogram of key sizes (in bytes)
Total count: 4499955
Min value: 13
Max value: 108
Mean: 22.92
Range Count
[ 8, 16) 2
[ 16, 32) 4499939
[ 64, 128) 14

Histogram of value sizes (in bytes)
Total count: 4499955
Min value: 82
Max value: 3603
Mean: 2428.16
Range Count
[ 64, 128) 1
[ 256, 512) 19301
[ 512, 1024) 459
[ 1024, 2048) 569
[ 2048, 4096) 4479625

2428 × 4479625 ≈ 10 GB, so the live values account for only about 10 GB of those 84 GB.

@adwinsky
Author

adwinsky commented Feb 25, 2020

I should also add that I was hitting the same issue before I patched the GC.

@jarifibrahim
Contributor

Hey @adwinsky, the GC definitely works :) The GC doesn't instantly clean up space (which we should definitely improve), but the disk space should eventually decrease. Here's a related issue which shows that GC works: #1006

I've replied to your original discuss post https://discuss.dgraph.io/t/how-to-clean-up-values-for-old-versions-of-items/6020/4

@jarifibrahim jarifibrahim added the kind/question Something requiring a response label Feb 26, 2020
@adwinsky
Author

adwinsky commented Feb 26, 2020

Hello @jarifibrahim, I opened a second ticket because I am pretty sure I am seeing a bug here.

I made a backup of the database for which I added the histogram in my previous post. In the meantime most of the entries have expired. I changed the GC call to RunValueLogGC(0.001) and started a script running GC in a loop (without any modifications to the badger code). Initially it cleaned up some segments (as some of the records had expired since the last run), but after 30 minutes it only kept returning:

Value log GC attempt didn't result in any cleanup

My database size is 48 GB, with 34 vlog files and 213 sst files, and this is the histogram output:

Histogram of key sizes (in bytes)
Total count: 16811
Min value: 20
Max value: 26
Mean: 22.92
Range Count
[ 16, 32) 16811

Histogram of value sizes (in bytes)
Total count: 16811
Min value: 2093
Max value: 3301
Mean: 2706.77
Range Count
[ 2048, 4096) 16811

Badger needs 50 GB to store 16k records? Clearly there is a problem with GC :(

@adwinsky
Author

adwinsky commented Mar 3, 2020

I figured out that the version I am using (1.6.0) does not include the #1006 fix you linked to. I cherry-picked that, and now the vlog files seem to be cleaned up correctly. I still see an issue with the sst files, though. In my database I have:

15 vlog files, 124 sst files

Histogram shows me following stats:

[Summary]
Level 0 size: 0 B
Level 1 size: 17 MB
Level 2 size: 0 B
Level 3 size: 7.7 GB
Total index size: 7.7 GB
Value log size: 15 GB

Histogram of key sizes (in bytes)
Total count: 5872848
Min value: 13
Max value: 26
Mean: 22.91
Range Count
[ 8, 16) 1
[ 16, 32) 5872847

So the size of the index is 7.7 GB, while it holds only 5872847 × 22.91 bytes of key data, which is about 130 MB.

Is there any other commit I need to pick up to my 1.6.0 version in order to fix that? I tried using the most recent version but it was way slower for me.

@jarifibrahim
Contributor

Is there any other commit I need to pick up to my 1.6.0 version in order to fix that? I tried using the most recent version but it was way slower for me.

What was slow? The GC? It would be very useful if you can file another issue so that I can look into this slowness issue.

So the size of the index is 7.7 GB while it has only 5872847*22.91 bytes of data, which is about 130 MB.

The total index size is the total amount of data stored in the SSTs. You have 7.7 GB of data in the SSTs, which includes:

  1. Internal keys (move keys, if GC was running)
  2. Older versions of keys (badger stores a version along with each key, and these older versions are eventually removed by compaction)

The histogram shows only valid keys. At any given time, badger would have some valid keys, some internal keys, and some invalid keys.
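One way to see those extra entries (which the histogram hides) is to iterate with AllVersions enabled. A rough sketch of a fragment, assuming the badger package is imported and db is an already-opened *badger.DB:

// countAllVersions counts every entry the LSM still holds, including older
// versions of keys, for comparison with the histogram's count of valid keys.
// Internal !badger! keys may still be filtered out by the iterator.
func countAllVersions(db *badger.DB) (int, error) {
	total := 0
	err := db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		opts.AllVersions = true     // also yield older versions of each key
		opts.PrefetchValues = false // keys only; much cheaper
		it := txn.NewIterator(opts)
		defer it.Close()
		for it.Rewind(); it.Valid(); it.Next() {
			total++
		}
		return nil
	})
	return total, err
}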

@adwinsky
Author

adwinsky commented Mar 4, 2020

My application reads events from Kafka streams. If an event is new, it writes it into the Badger DB with a 1h TTL; if the event already exists, it bumps its TTL to 24h. Only about 1% of events have a duplicate, so 99% of the events in the DB expire after an hour.
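Roughly, the write path looks like this (a sketch only, assuming the v1.6 Entry API (NewEntry/WithTTL/SetEntry), the badger and time packages imported, and an open transaction txn; the helper name is illustrative):

// upsertEvent writes a new event with a 1h TTL, or bumps an existing
// event's TTL to 24h, matching the flow described above.
func upsertEvent(txn *badger.Txn, key, value []byte) error {
	_, err := txn.Get(key)
	switch err {
	case badger.ErrKeyNotFound:
		// First time this event is seen: keep it for an hour.
		return txn.SetEntry(badger.NewEntry(key, value).WithTTL(time.Hour))
	case nil:
		// Duplicate: extend its lifetime to 24 hours.
		return txn.SetEntry(badger.NewEntry(key, value).WithTTL(24 * time.Hour))
	default:
		return err
	}
}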

What was slow? The GC? It would be very useful if you can file another issue so that I can look into this slowness issue.

Both. GC is way slower: with 1.6.0 it takes about 4s to run GC over segments that don't need rewriting (returning ErrNoRewrite), while with 2.0.1 it sometimes takes a minute to return the same error.

In 1.6.0 my service has great performance and is able to keep up with the stream; in 2.0.1 my service is lagging.

I haven't looked into it much yet, but I can collect some pprof output and create a ticket for that.

The total index size is the total amount of data stored in the SSTs. You have 7.7 GB of data in the SST which includes

In my app SST compaction seems to be very inefficient. To demonstrate it, I plotted some data on a chart:

[Screenshot: chart of the database size over time, taken 2020-03-04 at 10:32]

I started the app and kept it running until 16:10. Then I stopped the service and ran a simple script that opens the database and runs RunValueLogGC and Flatten in a loop until there is nothing left that can be compacted (a sketch of such a loop follows the histogram below). You can see that after stopping the app the database reached a constant size within an hour, but the SST files were never compacted. The histogram shows this:

[Summary]
Level 0 size: 0 B
Level 1 size: 3.7 MB
Level 2 size: 0 B
Level 3 size: 674 MB
Total index size: 678 MB
Value log size: 6.4 GB

Histogram of key sizes (in bytes)
Total count: 2337252
Min value: 13
Max value: 26
Mean: 22.89
Range Count
[ 8, 16) 1
[ 16, 32) 2337251

Histogram of value sizes (in bytes)
Total count: 2337252
Min value: 80
Max value: 3263
Mean: 2438.90
Range Count
[ 64, 128) 1
[ 512, 1024) 1
[ 1024, 2048) 113
[ 2048, 4096) 2337137
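The compact-and-GC loop mentioned above is essentially the following sketch (a fragment, assuming the badger package is imported and db is an already-opened *badger.DB; the worker count passed to Flatten is arbitrary):

// compactAndGC flattens the LSM tree and then runs value log GC with a very
// low discard ratio until there is nothing left that GC can rewrite.
func compactAndGC(db *badger.DB) error {
	// Flatten compacts all the tables down into a single level.
	if err := db.Flatten(8); err != nil { // 8 concurrent compaction workers
		return err
	}
	for {
		err := db.RunValueLogGC(0.001)
		if err == badger.ErrNoRewrite {
			return nil // nothing left for GC to reclaim
		}
		if err != nil {
			return err
		}
	}
}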

@adwinsky
Author

#1257 fixed my performance problems and I was able to switch to Badger v2. Unfortunately, the problem with growing SST files also occurs in this version. I am creating events with a short TTL; after the TTL has passed, the number of vlog files stays constant but the SST files keep growing linearly.

@adwinsky
Author

I created a small script that reproduces the problem:

https://gist.github.com/adwinsky/058a9972db502899b9fa097d76d04c70

After running it for 35 mins, it shows me these stats:

Adams-MBP:sandbox3 adwinsky$ go run main.go 2> /tmp/errs  | grep ==
==== sst files: 0 vlog files: 1, 250.288µs
==== sst files: 8 vlog files: 5, 5m0.000977494s
==== sst files: 12 vlog files: 3, 10m0.003889869s
==== sst files: 20 vlog files: 3, 15m0.00808114s
==== sst files: 24 vlog files: 3, 20m0.010168021s
==== sst files: 28 vlog files: 3, 25m0.011417726s
==== sst files: 35 vlog files: 6, 30m0.017283408s
==== sst files: 39 vlog files: 2, 35m0.020704423s

@adwinsky adwinsky changed the title GC doesn't work? GC doesn't work? (not cleaning up SST files properly) Mar 16, 2020
@adwinsky
Author

@jarifibrahim I narrowed down the issue and I hotfixed it for myself.

The problem is that during compaction none of the records are marked as 'to skip'.

I am using NumVersionsToKeep = 1, and because of that, in this line:

numVersions == s.kv.opt.NumVersionsToKeep

lastValidVersion is always true. So instead of skipping the record here:

https://github.com/dgraph-io/badger/blob/master/levels.go#L601

It is expected to be skipped here:

https://github.com/dgraph-io/badger/blob/master/levels.go#L550

But this will never happen, since I have only one version of each item, so y.SameKey(it.Key(), skipKey) is always false.

Unfortunately I couldn't figure out how to fix this properly. I have already invested a lot of time into this; maybe I will be able to look into it again later. But this issue is quite important, as it affects the performance of GC: the more SST files are created, the more time is spent in GC.

@jarifibrahim jarifibrahim added area/gc Garbage Collection issues. kind/bug Something is broken. priority/P1 Serious issue that requires eventual attention (can wait a bit) status/accepted We accept to investigate or work on it. and removed kind/question Something requiring a response labels Mar 18, 2020
@jarifibrahim
Contributor

@adwinsky Thank you so much for debugging the issue and nailing down the bug. I will try to find some time to look at it.

This definitely looks like something that should be fixed. The SST size should drop after compaction.

@sana-jawad
Contributor

@adwinsky can you please share your fix?

I am running into a similar issue and am contemplating the right fix. In my scenario we don't use TTL. We have a space limit, and whenever the db gets bigger than it, we collect all the keys that need deletion in a batch using
keyToDel := it.Item().KeyCopy(nil)

and then, in a separate transaction, call delete on all these items:
err := txn.Delete(key)

And I see the same issue you have pointed out: the SST table size doesn't drop and the vlog file count continues to rise. Since the size condition is never satisfied, the deletion keeps running. Our deletion logic runs every 30 minutes.
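For context, the deletion pass is roughly the following sketch (a fragment, assuming the badger package is imported, db is an already-opened *badger.DB, the WriteBatch API is available, and shouldDelete stands in for whatever selection logic applies):

// deleteSelectedKeys collects the keys to drop in a read-only view, then
// deletes them through a write batch, mirroring the two-step flow above.
func deleteSelectedKeys(db *badger.DB, shouldDelete func(item *badger.Item) bool) error {
	var toDelete [][]byte
	err := db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		opts.PrefetchValues = false // we only need keys
		it := txn.NewIterator(opts)
		defer it.Close()
		for it.Rewind(); it.Valid(); it.Next() {
			if shouldDelete(it.Item()) {
				toDelete = append(toDelete, it.Item().KeyCopy(nil))
			}
		}
		return nil
	})
	if err != nil {
		return err
	}

	wb := db.NewWriteBatch() // splits the deletes into transactions internally
	for _, key := range toDelete {
		if err := wb.Delete(key); err != nil {
			wb.Cancel()
			return err
		}
	}
	return wb.Flush()
}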

@ou05020
Contributor

ou05020 commented Mar 20, 2020

Should lastValidVersion be set to true if the value has been deleted or expired?

After updating the code to set lastValidVersion only if the value is not deleted, the number of SST files did not grow during testing.

The test performed the following steps:

  1. Add 250K entries with 8 byte keys and 1K values.
  2. Delete previous 250K entries.
  3. Flatten
  4. ValueLog GC w/0.001 Discard Ratio
  5. Scan all entries
  6. Repeat 30 times

Here is the change made to https://github.com/dgraph-io/badger/blob/master/levels.go#L577-L584:

deleted := isDeletedOrExpired(vs.Meta, vs.ExpiresAt)

lastValidVersion := false
if !deleted {
	// Keep the current version and discard all the next versions if:
	// - the `discardEarlierVersions` bit is set, OR
	// - we've already processed `NumVersionsToKeep` number of versions
	//   (including the current item being processed)
	lastValidVersion = vs.Meta&bitDiscardEarlierVersions > 0 ||
		numVersions == s.kv.opt.NumVersionsToKeep
}
if deleted || lastValidVersion {

@jarifibrahim
Contributor

jarifibrahim commented Mar 30, 2020

Edit - Do not set the number of versions to keep to zero. If you do so, only the deleted keys will be removed and badger will retain all the other keys. This means badger would store unnecessary data. There were multiple bugs that have now been fixed. See #1228 (comment)

Hey @adwinsky, the keys are not being removed because you've set the number of versions to keep to 1. If you set it to zero, the keys will be dropped instantly. I changed your script to set the number of versions to keep to 0 (https://gist.github.com/jarifibrahim/076afb3a9c054f8b72e093a65cc14828#file-main-go) and here's what I see:

==== sst files: 4 vlog files: 6, 259.849µs
==== sst files: 4 vlog files: 6, 1m19.617627623s
...
==== sst files: 4 vlog files: 6, 14m13.249432996s
==== sst files: 4 vlog files: 6, 15m18.989894738s
==== sst files: 4 vlog files: 6, 16m30.082259208s
...
==== sst files: 4 vlog files: 6, 55m2.905999518s
==== sst files: 4 vlog files: 6, 57m21.64958008s
==== sst files: 4 vlog files: 6, 58m33.623188144s
...
==== sst files: 4 vlog files: 6, 1h13m4.00361087s
==== sst files: 4 vlog files: 6, 1h14m12.102449542s
==== sst files: 4 vlog files: 6, 1h15m18.292868386s

@jarifibrahim
Contributor

Should lastValidVersion be set to true if the value has been deleted or expired?

@ou05020 For expired keys we can drop them, but for deleted keys it might be possible that there's an older version in the lower levels, and dropping the delete marker would mean your lookups will see the older version stored in the tree.

For instance, let's say you inserted foo, which is stored as foo1 on level 2. Now you delete foo, so a delete marker is stored as foo2 on level 0. If we drop foo2 while compacting it to level 1, we'll lose the delete marker, and your LSM tree would still have foo1. This means a deleted key would show up again when you do a lookup.
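In user-visible terms, the delete marker is what keeps the following check passing; if compaction dropped it while an older version still existed in a lower level, the final Get would start returning the old value again. A sketch of a fragment, assuming the badger and fmt packages are imported and db is an already-opened *badger.DB:

// deleteThenCheck writes a key, deletes it, and verifies that lookups keep
// returning ErrKeyNotFound. The tombstone written by Delete is what hides
// any older version of the key that may still live in a lower LSM level.
func deleteThenCheck(db *badger.DB) error {
	key := []byte("foo")
	if err := db.Update(func(txn *badger.Txn) error {
		return txn.Set(key, []byte("v1"))
	}); err != nil {
		return err
	}
	if err := db.Update(func(txn *badger.Txn) error {
		return txn.Delete(key) // writes a tombstone for "foo"
	}); err != nil {
		return err
	}
	return db.View(func(txn *badger.Txn) error {
		if _, err := txn.Get(key); err != badger.ErrKeyNotFound {
			return fmt.Errorf("expected ErrKeyNotFound after delete, got %v", err)
		}
		return nil // the tombstone correctly hides the older version
	})
}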

@jarifibrahim
Contributor

jarifibrahim commented Mar 30, 2020

Edit - Do not set the number of versions to keep to zero. If you do so, only the deleted keys will be removed and badger will retain all the other keys. This means badger would store unnecessary data. There were multiple bugs that have now been fixed. See #1228 (comment)

@sana-jawad, the default value of the number of versions to keep is one, so badger will keep one version of each key. If you set the number of versions to keep to zero, badger will drop the deleted key when compaction runs.

@adwinsky
Author

If you set the number of versions to keep to zero, badger will drop the deleted key when compaction runs.

Thank you very much @jarifibrahim for your reply. It seems that NumVersionsToKeep=0 does work, but only if it's set from the beginning. If I run a program with NumVersionsToKeep=1, keep it running for a while until some duplicates have been created, and then restart the program with NumVersionsToKeep=0, it never removes the old SST files.

But I must say this behaviour seems counter-intuitive to me. NumVersionsToKeep=1 is the default value, and once an entry has expired it will never be accessed again. For users who use TTL, produce a lot of events, and expect the database to stay at a constant size, this will be problematic. Maybe it is worth considering changing the default to 0?

@jarifibrahim
Contributor

jarifibrahim commented Mar 31, 2020

Edit - Do not set the number of versions to keep to zero. If you do so, only the deleted keys will be removed and badger will retain all the other keys. This means badger would store unnecessary data. There were multiple bugs that have now been fixed. See #1228 (comment)

If I run a program with NumVersionsToKeep=1 keep it for a while until I have some duplicates created, and then restart the program with NumVersionsToKeep=0, it will never remove old SST files.

@adwinsky No, the keys will be dropped when compaction runs. It might take some time for a specific range to be compacted, but eventually all the duplicate keys will be dropped.

For users who use TTL, produce a lot of events and expect to have constant size of the database it will be problematic. Maybe it is worth considering to change the default to 0?

Yes, I also feel the default should be set to 0 but there has to be a reason why we have it at 1. I will try to find out why it is at 1 and then post an update here.

sana-jawad added a commit to salesforce/sloop that referenced this issue Apr 24, 2020
…o plays a big role in memory consumption. (#114)

Upgraded Badger version.
Added flattening at start up time.
Fixed the event count spreading issue which resulted in uneven data distribution across partitions.
Moved to drop prefix as it yields better space claim.
Added feature flag for switching to delete prefix. Also changed the numberfversions to 0 so that delete prefix would reclaim space. dgraph-io/badger#1228
Fixed the issue of unclaimed !badger!move prefixes which are never cleaned up. Details: dgraph-io/badger#1288
Added support in debugging pages to see internal keys.
@luca-moser

@jarifibrahim can you point us to the correct setting for letting BadgerDB clean up deleted keys correctly? My understanding is that #1300 was closed because setting NumVersionsToKeep to zero actually results in an infinite number of versions being kept around. So what does one have to set in order to let BadgerDB clean up deleted keys?

@jarifibrahim
Contributor

Hey @luca-moser, we had a bunch of bugs that were preventing SSTs/vlogs from being cleaned up. We've merged all the fixes and the disk-space usage should be much better.
The following are the bugs found and fixed by the community. My sincere thanks to all of them:

  1. Move keys were not being removed from the system
  2. Bug in compaction due to which deleted/expired keys were not being removed - fixed via fd89894. Thanks to @damz and @ou05020

@adwinsky I ran your script on master and here's what I see

==== sst files: 2 vlog files: 2, 5m0.000397361s
==== sst files: 3 vlog files: 2, 10m0.000730725s
==== sst files: 4 vlog files: 2, 15m0.001013494s
==== sst files: 2 vlog files: 2, 20m0.00130372s
==== sst files: 4 vlog files: 2, 25m0.001578543s
==== sst files: 5 vlog files: 1, 30m0.00201562s
==== sst files: 2 vlog files: 1, 35m0.002256777s
==== sst files: 4 vlog files: 2, 40m0.002542917s
==== sst files: 1 vlog files: 2, 45m0.0029382s
==== sst files: 2 vlog files: 2, 50m0.003103286s

It looks like SST cleanup and GC are working correctly. Please do try out the master version of Badger and let me know how it goes.

@jarifibrahim
Contributor

This issue has been fixed and I'm going to close it. Please feel free to re-open.

@crphang

crphang commented Jun 27, 2020

@jarifibrahim, really great work on the fixes. Are there any plans for another release soon? I help maintain TalariaDB and have checked that the latest master resolves many of the issues our users have been facing with value log GC. Great work on that 🎉.

@jarifibrahim
Contributor

@crphang Actually, the community fixed it. The main PRs that fixed the GC issues were from community users 🎉

@crphang

crphang commented Jun 27, 2020

@jarifibrahim, sorry to ask again 😅. Are there plans to create a new release with the fixes? I think it'll help a lot for other Badger users 😄

@jarifibrahim
Contributor

The release is blocked by #1350. I haven't been able to reproduce the failure easily so far.

@jarifibrahim
Contributor

@crphang FYI, we found three more crashes on master and the release is blocked by them:
#1387
#1388
#1389

The cause of #1350 has been identified and it will be fixed once dgraph-io/ristretto#168 is merged.
