Feat: Implement a PersistentDatastore by adding DiskUsage method #27
Conversation
That's a bummer for the gateways -- can't have a full repo scan on every restart there. Can we make the bookkeeping persistent so that it survives restart?
@lgierth is it so bad? In my gateway tests a while ago it took a few seconds, but it's once during boot, while ipfs is not hammering the disk at the same time. That said, I can leave a note of the size in a file, sure, and only re-scan when it's not present.
@lgierth there you go
There are two downsides to caching disk usage:
An alternative is to calculate the initial size lazily. The first call to DiskUsage will be slow but it won't halt the world. @lgierth does this work for you? (I don't feel that strongly about this issue, just thought it should be considered).
Without looking at So it also has some downsides... (and the fact that it may return really low sizes at the beginning, which may be misleading to whoever is looking at that metric). I lean towards caching even though it's not perfect...
Hm. You're right. We'd have to track which parts we've walked and which parts we haven't. That would be horrible. Yeah, this sounds like the best solution. Note: If we want a more accurate solution, we could cache the size of the files and recompute the sizes of the directories but it's probably not worth it. |
Agreed. Will you be so kind as to greenlight the code or leave feedback, @Stebalien?
flatfs.go
Outdated
@@ -33,6 +37,9 @@ type Datastore struct {

	// sychronize all writes and directory changes for added safety
	sync bool

	diskUsage   uint64
	diskUsageCh chan int64
Why not just use atomic.AddInt64 (and store it as an int64)?
yes, indeed..
flatfs.go
Outdated
	}
}

func fileSize(path string) int64 {
We could actually be slightly more accurate by adding in the size of the filename.
You think? In my tests du on the disk matched the values we produce here. Aren't filenames in ext* part of the directory entry, and doesn't the directory entry count as a full 4096-byte block even if empty? So I'm not sure if adding filenames is double-counting some bytes..
I was trying to account for the case where we add a bunch of files without re-stat()ing the directories (in which case they'll be much larger than 4096). However, that case may not really be worth considering. If it comes to it, we can always consider recomputing once in a while.
Ok, leaving this as it is. For my use case it doesn't really matter. We can improve it when we need to.
Fair enough.
flatfs.go
Outdated
}

duB := []byte(fmt.Sprintf("%d", du))
ioutil.WriteFile(filepath.Join(fs.path, diskUsageFile), duB, 0644)
Should really use a tempfile with a rename.
can do, can you explain the logic behind it though?
By the way, my /tmp folder lives on a partition different from my ipfs storage. On a system with an SSD and a spinning disk, it is not unusual for /tmp, /var etc. to be mounted on the spinning disk (to reduce the number of writes to the SSD) and everything else on the SSD, so imagine the performance hit of the current implementation with tmp files and renames..
Ah, no. I mean an ioutil.TempFile as we do in doPut. That is, create a temporary file, write to it, close it, and then atomically rename it into place (so we don't end up with a half-written or corrupted file if we crash while recording the size). Probably not much of an issue in this case but still good practice.
Yes, I just realized that temp files are created in the same folder. Yeah, makes sense. Thanks!
So, reasoning through this carefully, there are a few annoying race conditions:
For the delete case, we should:
For the write case... I think we'll have to use There are a few OS-specific ways to make this faster, but cross-platform low-level filesystem stuff always kind of sucks.
@Stebalien you're right. I have followed your indications. I'm not super happy though:
My feelings overall:
Thus, if we can live with the non-overwrite limitation I'd leave things like this; otherwise, I'd return to the buggy behaviour from before (fixing the delete part, which is easy). What do you think?
Overall, I agree. Performance is already poor enough as it is.
My worry is that simultaneous writes/deletes are probably quite common and, if they're not perfectly balanced, the estimated disk usage will skew over time. We might, actually, just want to use a synchronization mechanism. I wouldn't have a global lock, but we can track in-progress puts/deletes on a per-file basis. That is, I'd have a map (either a That is, given a timeline like:
If op1 succeeds, we can serialize these operations as This should actually be faster than the existing implementation given concurrent operations and should be just as "correct". Does this sound too complicated? Any better ideas?
This implements the approach suggested by @Stebalien in #27 Write operations (delete/put) to the same key are tracked in a map which provides a shared lock. Concurrent operations to that key will share that lock. If one operation succeeds, it will remove the lock from the map and the others using it will automatically succeed. If one operation fails, it will let the others waiting for the lock try. New operations to that key will request a new lock. A new test for putMany (batching) has been added. Worth noting: a concurrent Put+Delete on a non-existing key always yields Put as the winner (delete will fail if it comes first, or will be skipped if it comes second).
@Stebalien ok, I have implemented the approach you suggested. It works well for my tests (added an extra one for putMany() too). Hopefully we can settle here :) The tricky code should be sufficiently commented, but let me know if something's not clear. Note, Mac tests fail because
So, in addition to my comments, there's a bit of a problem concerning deletes. If we get a concurrent delete and put, the delete will succeed even if there was nothing to delete. If we have multiple deletes and a put, all the deletes will succeed (even though at most one should).
However, as far as I can tell, badger-db doesn't return errors on deletes of non-existent keys (idempotent deletes). Personally, I prefer those.
So, we need to decide: do we have idempotent deletes or do we return an error if there is nothing to delete.
flatfs.go
Outdated
// on the same key. We use opLock
// for that.
opLock.Lock()
defer func() {
Can just call defer opLock.Unlock().
Oops, left over from some debugging.
// Destination exists, we need to discount it from diskUsage
if fs != nil && err == nil {
	atomic.AddInt64(&fs.diskUsage, -fi.Size())
Needs to happen after the rename (could fail).
This will still need to be addressed (the remove could fail).
flatfs.go
Outdated
	return fs.renameAndUpdateDiskUsage(oper.tmp, oper.path)
default:
	panic("bad operation, this is a bug")
	return errors.New("bad operation, this is a bug")
Don't need this (Go understands that panic aborts).
flatfs.go
Outdated
// operations if one of them does. In such case,
// we assume that the first suceeding operation
// on that key was the last one to happen after
// two successful others.
"after all successful others"
flatfs.go
Outdated
	return err
}

//time.Sleep(2 * time.Second)
Commented out code.
flatfs.go
Outdated
// failed, even if they did not compete with another
// one. If datastore operations fail all the time
// opMap will grow as it retains all keys of failed operations
// and only deletes them on success.
Hm. This is a bit unfortunate. Especially with respect to deletes. We can easily have delete(x), sleep(), delete(x).
The other way to do this is to use one-off locks. That is,

type OpMap struct {
	ops map[string]*OpResult
	mu  sync.RWMutex
}

type OpResult struct {
	mu      sync.RWMutex
	success bool
	opMap   *OpMap
	name    string
}

// Returns nil if there's nothing to do.
func (m *OpMap) Begin(name string) *OpResult {
	for {
		m.mu.Lock()
		op, ok := m.ops[name]
		if !ok {
			op := &OpResult{opMap: m, name: name}
			op.mu.Lock()
			m.ops[name] = op
			m.mu.Unlock()
			return op
		}
		m.mu.Unlock()
		op.mu.RLock()
		if op.success {
			return nil
		}
	}
}

func (o *OpResult) Finish(ok bool) {
	o.success = ok
	delete(o.opMap.ops, o.name)
	o.mu.Unlock()
}
Ok, let me try this
@Stebalien the delete in Finish causes panics with the map (fixed by locking/unlocking around it). I have implemented the whole thing with a sync.Map instead.
Running something that might be very expensive (after an ipfs crash, power loss, or out of memory) during the bootstrap sequence of go-ipfs doesn't sit right with me.
In the past I've worked on servers where a scan like that would almost bring the whole system down (a few TB of data on the same almost-dying HDD as the system; they were able to pull 10 IOPS on a lucky day). go-ipfs would be unusable on those servers, as it would take a very long time to start up while looking like it has hung.
In my opinion it is too high a price to pay for a feature that might not be used. Doing it lazily, on request, and reporting an error when the scan has not yet finished would be far better.
flatfs.go
Outdated
)

const (
	putOp = iota
IMO use a named type of opT.
flatfs.go
Outdated
// opMap handles concurrent write operations (put/delete)
// to the same key
opMap map[string]*sync.Mutex
It might be better to use a concurrent map. It should be essentially free when we are doing concurrent operations on different keys.
👍 although I think the OS will be the bottleneck here.
We've discussed this, but it's prohibitively complicated as we need a snapshot size. One solution would be to disable writes until the scan is complete, but that complicates matters and isn't really a solution.
We could write the DiskUsage file regularly (say, when disk usage differs from the last checkpoint by >1%) so that we can read it on start. It may not be 100% accurate in the case of a crash, but it might be close enough for our needs.
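A rough sketch of that checkpoint trigger; the 1% threshold comes from the comment above, while the type and method names are invented for illustration (the actual write to the DiskUsage file is elided):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// checkpointer decides when the in-memory disk-usage counter has
// drifted far enough from the last persisted value to warrant a
// write. Illustrative only, not flatfs code.
type checkpointer struct {
	diskUsage      int64 // updated atomically by put/delete
	lastCheckpoint int64 // last value persisted to disk
}

func (c *checkpointer) update(delta int64) (checkpointed bool) {
	du := atomic.AddInt64(&c.diskUsage, delta)
	last := atomic.LoadInt64(&c.lastCheckpoint)
	diff := du - last
	if diff < 0 {
		diff = -diff
	}
	// Persist when we have drifted more than 1% from the last value
	// written to disk (or on the very first update).
	if last == 0 || diff*100 > last {
		atomic.StoreInt64(&c.lastCheckpoint, du)
		return true
	}
	return false
}

func main() {
	c := &checkpointer{}
	fmt.Println(c.update(1000)) // true: first update always checkpoints
	fmt.Println(c.update(5))    // false: 0.5% drift, below threshold
	fmt.Println(c.update(10))   // true: 1.5% cumulative drift
}
```

The integer comparison diff*100 > last avoids floating point while expressing "more than 1% drift".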
@hsanjuan SGWM, it also seems to me that triggering a full re-scan on Any idea how to nicely wire it in?
@Kubuxu that will need interface support in go-datastore. Should we have an
I am thinking whether PersistentDatastore is the only one needing this, and it seems so. I also recall Badger needing a hook after GC was performed so it could perform its own internal GC. I don't know if it was solved (@magik6k).
The GC interface is wired into go-ipfs master already (merged in ipfs/kubo#4578). I created 2 more interfaces in go-datastore when making the GC one: https://github.com/ipfs/go-datastore/blob/master/datastore.go#L88-L103.
Could we get some benchmarks on how this will affect the overall performance of the datastore?
Additionally, we still need to address the issue of idempotent deletes. Currently, this will sometimes return an error when deleting something that isn't there and sometimes won't. Personally, I'd rather just always return a success if either the delete succeeded or there wasn't something there.
flatfs.go
Outdated
}

type opMap struct {
	ops *sync.Map
nit: I wouldn't use a pointer here (no point).
	}
}

func (fs *Datastore) checkpointDiskUsage(newDuInt int64) {
There is a serious drawback to this. We could write a bunch of data but crash before checkpointing. The old way was much safer because we didn't leave the file on disk while it could be invalid.
You would need to crash precisely in the write operation that would do a checkpoint, while having written a really large entry (with respect to the current size), for the checkpoint to be really skewed from the actual size. I think it's not much of a big deal as long as the size of the datastore is sufficiently big relative to the size of the entries written to it.
Fair enough.
// Destination exists, we need to discount it from diskUsage
if fs != nil && err == nil {
	atomic.AddInt64(&fs.diskUsage, -fi.Size())
This will still need to be addressed (the remove could fail).
From my tests,
I can't comment directly there. There is no
Ideally this only happens once, if the checkpoint file is not there (it should be there on subsequent starts). It might be that stat-walking large repos is too expensive even to do once. We could be smart and list entries in folders and multiply by 256K. This would give an approximate ballpark in a hopefully much faster way. I'm happy sacrificing accuracy, as in a large-file context most blocks will probably be full.
Progress and tests so far:
Are we happy with this?
Still need to fix #30 but we can do that in a subsequent PR.
Okay, there seems to be a lot going on here. Please give me a day or two to look over this. Note that I don't think identity hashes will cause any problems with sampling, because if we implement them as I envisioned, the datastores won't have to worry about them: they won't even be stored in the datastore (as there is no need).
Thanks @kevina, will wait.
If we are going to estimate the size, I think instead of taking a fixed sample size we should iterate until we get an acceptable level of accuracy. The accuracy can be estimated using the formula for the standard error for a finite population (the finite population correction factor is the important part; otherwise the error won't shrink as we take larger samples). We can continue to take samples until this number is within an acceptable margin (I would say 1%). Since we are not even reading all the directories, we have two estimated numbers we are working with: (1) the estimated number of files, and (2) the estimated mean file size. The standard error needs to be computed independently for each and then properly combined to produce a final estimate. The population size for (1) is the number of directories. The population size for (2) is the number of files. Since the population size for (2) is only an estimate, this might need to be taken into account when calculating the correction factor. I would prefer that DiskUsage() be exact. If it is not, I think it is important we have high confidence in the accuracy of the estimate; we can't get this confidence by simply taking a fixed sample size and calling it good.
Note that I am not a stats expert. If anyone is, then please feel free to correct me if I got something wrong, as the advice above is based on only a few hours of research.
I should also note that, as far as I know, standard error assumes a normal distribution, which may not be the case for block sizes; in particular, there are likely to be a lot of very small blocks and possibly an even larger number of 256K blocks. I still think we should iterate; I am just not 100% sure how to know when to stop.
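For reference, the textbook formula for the standard error of a sample mean with the finite population correction mentioned above (this is standard statistics, not something from the PR) is:

```latex
\mathrm{SE} = \frac{s}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}
```

where \(s\) is the sample standard deviation, \(n\) the sample size, and \(N\) the population size. The second factor is the finite population correction: it tends to zero as \(n \to N\), which is why the estimated error keeps shrinking as more of the population is sampled, rather than plateauing at \(s/\sqrt{n}\).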
We should limit sampling to huge datastores where computing the size would take hours. For most datastores (of maybe 100 GB or less) this calculation should only take a few minutes, and we should provide an exact number.
If we do sample we should consider iterating until we get an acceptable accuracy if that is at all possible, I left comments outside of this review.
flatfs.go
Outdated
// folders are also considered files with respect to
// DiskUsageFilesAverage (therefore this value only makes sense
// if it's lower).
DiskUsageFoldersAverage = 50
This is not a good threshold; even modest datastores will have a large number of directories. We should only consider skipping folders once a directory contains enough files that reading them truly becomes a problem.
@kevina the directories contain a random sample of the whole datastore (due to how the prefix is selected), so it isn't necessary to sample many of them. In theory, if the datastore is big enough, we could sample just random files from one directory.
@Kubuxu that is true, but if the datastore is small, this will still be an approximation and not an exact value.
I can already tell you that, based on my tests, getting within 1% error, and being sure about it during sampling, is going to take too long. What I can do instead is set a time limit rather than a folder limit, and read all the random samples possible (I'll play with a larger number of files per folder) until the acceptable time limit is hit (say, 5 minutes). This should give accuracy with small datastores (or those equipped with fast disks) and do our best on large/slow ones. Also, simplicity. Would this make you happy enough @kevina?
@Kubuxu @hsanjuan, I have a problem with
Perhaps a better way forward, if it is not possible to exactly calculate the DiskUsage() in a reasonable time, is to leave it undefined and provide a tool to calculate the DiskUsage, or to optionally make an estimate and update the value based on a command-line flag. This way the user is at least aware that it is an approximation. They can also schedule the command to run at a convenient time.
To estimate whether we can calculate the DiskUsage() in a reasonable time, I suggest we sample a few random directories and count the number of files; if the count is above a threshold, we give up and somehow mark the datastore as having an undefined DiskUsage(). We can also use a timeout.
Sorry, but I just can't in good conscience approve this P.R. as written. However, if people can show me other examples where DiskUsage calculation for a database is an approximation, I would likely be more okay with it. We would still need to update the documentation to make it clear that this is not an exact number. @whyrusleeping (and others): thoughts?
However, if people can show me other examples where DiskUsage calculation for a Database is an approximation, I would likely be more okay with it. We would still need to update the documentation to make it clear that this is not an exact number.
As far as I am concerned, the badger implementation of Size doesn't return the exact size on disk (at least not for every file-system), as doing so with its many specifics would be quite hard. The problem the DiskUsage method is trying to solve is to provide a fast way to get the size of data stored within the ipfs datastore. One use case is described in ipfs/kubo#4550; the other is to display repository size in the desktop app, where this was a problem too. For both of those cases being within .1% (or even 1%) should be perfectly acceptable imo. Having this method take more than a few seconds is not, as it makes it unusable for those use cases. Yes, we should document that the result may be an approximation, but returning 'undefined' when it would take too long doesn't solve either of the use cases above and just adds a feature which can't be used in most realistic scenarios (most datastores where you'd care about size will be too big to calculate the exact size in a reasonable time).
(1) Based on the latest estimates the accuracy will likely be more like 5%; if it were 1% (and we knew with reasonable confidence it's 1%) I would object less. (2) If I understand things correctly, this method will only take long when it has not been called before. On new datastores (after this lands), and after this method is called the first time, the DiskUsage will be kept in sync. Thus my proposal: if it takes too long in its initial calculation, don't do it, and leave it up to the user to update manually, after which DiskUsage will remain usable.
I think:
Please let's remember how badly broken this is right now and take this as a step towards improving it. The vast majority of daemons with small storages attached won't be affected by the accuracy problems. Any new ipfs daemons won't be affected. Only nodes with huge existing storages will be affected. In these nodes it's impossible to run
Then in that case I think a better solution is to return an undefined value for the DiskUsage, possibly informing the user how to rectify it for large repos, rather than return an estimate that can be 5-10% off without informing the user of the inaccuracy.
I will inform the user
I think it would be better to keep track of whether the disk usage is an estimate or an accurate count, and inform the user which it is. I am okay if we delay this to another P.R. as long as it gets done sometime soonish. Also please note that my opinion is just that, an opinion. I will not block this P.R. if it is impeding important progress, and will not object if someone dismisses my review.
This provides a disk estimation deadline. We will stat() as many files as possible until we run out of time. If that happens, the rest will be calculated as an average. The user is informed of the slow operation and, if we ran out of time, about how to obtain better accuracy.
I have implemented the time deadline in the last commit (with a default of 5 minutes). It will look like this when starting the datastore and no cache file exists:
The actual reference size in biham is: 7536515831332 (so it's pretty close).
It is possible to miss this warning. This information should at least be stored in the datastore somehow. Providing an interface to determine whether the stored value is an approximation or an exact value can be a separate P.R.
Comments noted. Follow up in ipfs/go-datastore#85
I'm sorry, but what is the standard policy regarding an author merging their own pull request? Especially when it is non-trivial. Also see #31.
This adds DiskUsage().
This datastore would take a big performance hit if we walked the
filesystem to calculate disk usage every time.
Therefore I have opted to keep tabs on the current disk usage by
walking the filesystem once during "Open" and then adding/subtracting
file sizes on Put/Delete operations.
On the plus:
On the minus:
(at least on ext4 systems). We don't track directory-size changes,
only use the creation size.