Feat: Implement a PersistentDatastore by adding DiskUsage method #27
Conversation
That's a bummer for the gateways -- can't have a full repo scan on every restart there. Can we make the bookkeeping persistent so that it survives restart?
@lgierth is it so bad? In my gateway tests a while ago it took a few seconds, but it's once during boot, while ipfs is not hammering the disk at the same time. That said, I can leave a note of the size in a file, sure, and only re-scan when it's not present.
@lgierth there you go
There are two downsides to caching disk usage:
An alternative is to calculate the initial size lazily. The first call to DiskUsage will be slow but it won't halt the world. @lgierth does this work for you? (I don't feel that strongly about this issue, just thought it should be considered).
Without looking at So it also has some downsides... (and the fact that it may return really low sizes at the beginning, which may be misleading to whoever is looking at that metric). I lean towards caching even though it's not perfect...
Hm. You're right. We'd have to track which parts we've walked and which parts we haven't. That would be horrible. Yeah, this sounds like the best solution. Note: If we want a more accurate solution, we could cache the size of the files and recompute the sizes of the directories but it's probably not worth it. |
Agreed. Will you be so kind as to greenlight the code or leave feedback, @Stebalien?
flatfs.go
Outdated
@@ -33,6 +37,9 @@ type Datastore struct {

	// sychronize all writes and directory changes for added safety
	sync bool

	diskUsage   uint64
	diskUsageCh chan int64
Why not just use atomic.AddInt64 (and store it as an int64)?
yes, indeed..
flatfs.go
Outdated
	}
}

func fileSize(path string) int64 {
We could actually be slightly more accurate by adding in the size of the filename.
You think? In my tests du on the disk matched the values we produce here. Aren't filenames in ext* part of the directory entry, and doesn't the directory entry count as a full 4096-byte block even if empty? So I'm not sure if adding filenames is double-counting some bytes..
I was trying to account for the case where we add a bunch of files without re-stat()ing the directories (in which case they'll be much larger than 4096). However, that case may not really be worth considering. If it comes to it, we can always consider recomputing once in a while.
Ok, leaving this as it is. For my use case it doesn't really matter. We can improve it when we need to.
Fair enough.
flatfs.go
Outdated
}

duB := []byte(fmt.Sprintf("%d", du))
ioutil.WriteFile(filepath.Join(fs.path, diskUsageFile), duB, 0644)
Should really use a tempfile with a rename.
can do, can you explain the logic behind it though?
By the way, my /tmp folder lives on a partition different from my ipfs storage. On a system with an SSD and a spinning disk, it is not unusual for /tmp, /var etc. to be mounted on the spinning disk (to reduce the number of writes to the SSD) and everything else on the SSD, so imagine the performance hit of the current implementation with tmp files and renames..
Ah, no. I mean an ioutil.TempFile as we do in doPut. That is, create a temporary file, write to it, close it, and then atomically rename it into place (so we don't end up with a half-written or corrupted file if we crash while recording the size). Probably not much of an issue in this case but still good practice.
Yes, I just realized that temp files are created in the same folder. Yeah, makes sense. Thanks!
So, reasoning through this carefully, there are a few annoying race conditions:
For the delete case, we should:
For the write case... I think we'll have to use There are a few OS-specific ways to make this faster, but cross-platform low-level filesystem stuff always kind of sucks.
@Stebalien you're right. I have followed your indications. I'm not super happy though:
My feelings overall:
Thus, if we can live with the non-overwrite limitation I'd leave things like this; otherwise, I'd return to the buggy behaviour from before (fixing the delete part, which is easy). What do you think?
Overall, I agree. Performance is already poor enough as it is.
My worry is that simultaneous writes/deletes are probably quite common and, if they're not perfectly balanced, the estimated disk usage will skew over time. We might, actually, just want to use a synchronization mechanism. I wouldn't have a global lock, but we can track in-progress puts/deletes on a per-file basis. That is, I'd have a map (either a That is, given a timeline like:
If op1 succeeds, we can serialize these operations as This should actually be faster than the existing implementation given concurrent operations and should be just as "correct". Does this sound too complicated? Any better ideas?
This implements the approach suggested by @Stebalien in #27 Write operations (delete/put) to the same key are tracked in a map which provides a shared lock. Concurrent operations to that key will share that lock. If one operation succeeds, it will remove the lock from the map and the others using it will automatically succeed. If one operation fails, it will let the others waiting for the lock try. New operations to that key will request a new lock. A new test for putMany (batching) has been added. Worth noting: a concurrent Put+Delete on a non-existing key always yields Put as the winner (delete will fail if it comes first, or will be skipped if it comes second).
@Stebalien ok, I have implemented the approach you suggested. It works well for my tests (added an extra one for putMany() too). Hopefully we can settle here :) The tricky code should be sufficiently commented, but let me know if something's not clear. Note, Mac tests fail because
So, in addition to my comments, there's a bit of a problem concerning deletes. If we get a concurrent delete and put, the delete will succeed even if there was nothing to delete. If we have multiple deletes and a put, all the deletes will succeed (even though at most one should).
However, as far as I can tell, badger-db doesn't return errors on deletes of non-existent keys (idempotent deletes). Personally, I prefer those.
So, we need to decide: do we have idempotent deletes or do we return an error if there is nothing to delete.
flatfs.go
Outdated
// on the same key. We use opLock
// for that.
opLock.Lock()
defer func() {
Can just call defer opLock.Unlock().
Oops, left over from some debugging.
// Destination exists, we need to discount it from diskUsage
if fs != nil && err == nil {
	atomic.AddInt64(&fs.diskUsage, -fi.Size())
Needs to happen after the rename (could fail).
This will still need to be addressed (the remove could fail).
flatfs.go
Outdated
	return fs.renameAndUpdateDiskUsage(oper.tmp, oper.path)
default:
	panic("bad operation, this is a bug")
	return errors.New("bad operation, this is a bug")
Don't need this (Go understands that panic aborts).
flatfs.go
Outdated
// operations if one of them does. In such case,
// we assume that the first suceeding operation
// on that key was the last one to happen after
// two successful others.
"after all successful others"
flatfs.go
Outdated
	return err
}

//time.Sleep(2 * time.Second)
Commented out code.
flatfs.go
Outdated
// failed, even if they did not compete with another
// one. If datastore operations fail all the time
// opMap will grow as it retains all keys of failed operations
// and only deletes them on success.
Hm. This is a bit unfortunate. Especially with respect to deletes. We can easily have delete(x), sleep(), delete(x).
The other way to do this is to use one-off locks. That is,

type OpMap struct {
	ops map[string]*OpResult
	mu  sync.RWMutex
}

type OpResult struct {
	mu      sync.RWMutex
	success bool
	opMap   *OpMap
	name    string
}

// Returns nil if there's nothing to do.
func (m *OpMap) Begin(name string) *OpResult {
	for {
		m.mu.Lock()
		op, ok := m.ops[name]
		if !ok {
			op := &OpResult{opMap: m, name: name}
			op.mu.Lock()
			m.ops[name] = op
			m.mu.Unlock()
			return op
		}
		m.mu.Unlock()
		op.mu.RLock()
		if op.success {
			return nil
		}
	}
}

func (o *OpResult) Finish(ok bool) {
	o.success = ok
	delete(o.opMap.ops, o.name)
	o.mu.Unlock()
}
Ok, let me try this
@Stebalien the delete in Finish causes panics with the map (fixed by locking/unlocking around it). I have implemented the whole thing with a sync.Map instead.
Running something that might be very expensive (after an ipfs crash, power loss, or out of memory) during the bootstrap sequence of go-ipfs doesn't sit right with me.
In the past I've worked on servers where a scan like that would almost bring the whole system down (a few TB of data on the same almost-dying HDD as the system; they were able to pull 10 IOPS on a lucky day). go-ipfs would be unusable on those servers, as it would take a very long time to start up while looking like it has hung.
In my opinion it is too high a price to pay for a feature that might not be used. Doing it lazily, on request, and reporting an error when the scan has not yet finished would be far better.
flatfs.go
Outdated
)

const (
	putOp = iota
IMO use a named type of opT.
flatfs.go
Outdated
// opMap handles concurrent write operations (put/delete)
// to the same key
opMap map[string]*sync.Mutex
It might be better to use a concurrent map. It should be essentially free when we are doing concurrent operations on different keys.
👍 although I think the OS will be the bottleneck here.
We've discussed this, but it's prohibitively complicated as we need a snapshot size. One solution would be to disable writes until the scan is complete, but that complicates matters and isn't really a solution.
We could write the DiskUsage file regularly (say, when disk usage differs from the last checkpoint by >1%) so that we can read it on start. It may not be 100% accurate in the case of a crash, but it might be close enough for our needs.
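A rough sketch of that checkpoint trigger; the 1% threshold comes from the comment above, while the type and method names are invented for illustration (the actual write to the DiskUsage file is elided):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// checkpointer decides when the in-memory disk-usage counter has
// drifted far enough from the last persisted value to warrant a
// write. Illustrative only, not flatfs code.
type checkpointer struct {
	diskUsage      int64 // updated atomically by put/delete
	lastCheckpoint int64 // last value persisted to disk
}

func (c *checkpointer) update(delta int64) (checkpointed bool) {
	du := atomic.AddInt64(&c.diskUsage, delta)
	last := atomic.LoadInt64(&c.lastCheckpoint)
	diff := du - last
	if diff < 0 {
		diff = -diff
	}
	// Persist when we have drifted more than 1% from the last value
	// written to disk (or on the very first update).
	if last == 0 || diff*100 > last {
		atomic.StoreInt64(&c.lastCheckpoint, du)
		return true
	}
	return false
}

func main() {
	c := &checkpointer{}
	fmt.Println(c.update(1000)) // true: first update always checkpoints
	fmt.Println(c.update(5))    // false: 0.5% drift, below threshold
	fmt.Println(c.update(10))   // true: 1.5% cumulative drift
}
```

The integer comparison diff*100 > last avoids floating point while expressing "more than 1% drift".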
@hsanjuan SGWM, it also seems to me that triggering a full re-scan on Any idea how to nicely wire it in?
@Kubuxu that will need interface support in go-datastore. Should we have an
I am thinking whether PersistentDatastore is the only one needing this, and it seems so. I also recall Badger needing a hook after GC was performed so it could perform its own internal GC. I don't know if it was solved (@magik6k).
The GC interface is wired into go-ipfs master already (merged in ipfs/kubo#4578). I created 2 more interfaces in go-datastore when making the GC one: https://github.com/ipfs/go-datastore/blob/master/datastore.go#L88-L103.
Could we get some benchmarks on how this will affect the overall performance of the datastore?
Additionally, we still need to address the issue of idempotent deletes. Currently, this will sometimes return an error when deleting something that isn't there and sometimes won't. Personally, I'd rather just always return a success if either the delete succeeded or there wasn't something there.
flatfs.go
Outdated
}

type opMap struct {
	ops *sync.Map
nit: I wouldn't use a pointer here (no point).
	}
}

func (fs *Datastore) checkpointDiskUsage(newDuInt int64) {
There is a serious drawback to this. We could write a bunch of data but crash before checkpointing. The old way was much safer because we didn't leave the file on disk while it could be invalid.
You would need to crash precisely in the write operation that would do a checkpoint, while having written a really large entry (with respect to the current size), for the checkpoint to be really skewed from the actual size. I think it's not much of a big deal as long as the size of the datastore is sufficiently big relative to the size of the entries written to it.
Fair enough.
// Destination exists, we need to discount it from diskUsage
if fs != nil && err == nil {
	atomic.AddInt64(&fs.diskUsage, -fi.Size())
This will still need to be addressed (the remove could fail).
From my tests,
I can't comment directly there. There is no
Ideally this only happens once, if the checkpoint file is not there (it should be there on subsequent starts). It might be that stat-walking large repos is too expensive even to do once. We could be smart and list entries in folders and multiply by 256K. This would give an approximate ballpark in a hopefully much faster way. I'm happy sacrificing accuracy, as in a large-file context most blocks will probably be full.
Progress and tests so far:
Are we happy with this?
Still need to fix #30 but we can do that in a subsequent PR.
Okay, there seems to be a lot going on here. Please give me a day or two to look over this. Note that I don't think identity hashes will cause any problems with sampling, because if we implement them as I envisioned, the datastores won't have to worry about them: they won't even be stored in the datastore (as there is no need).
Thanks @kevina, will wait.
If we are going to estimate the size, I think instead of taking a fixed sample size we should iterate until we get an acceptable level of accuracy. The accuracy can be estimated using the formula for the standard error for a finite population (the finite population correction factor is the important part; otherwise the error won't shrink as we take larger samples). We can continue to take samples until this number is within an acceptable margin (I would say 1%). Since we are not even reading all the directories, we have two estimated numbers we are working with: (1) the estimated number of files, and (2) the estimated mean file size. The standard error needs to be computed independently for each and then properly combined to produce a final estimate. The population size for (1) is the number of directories. The population size for (2) is the number of files. Since the population size for (2) is only an estimate, this might need to be taken into account when calculating the correction factor. I would prefer that DiskUsage() be exact. If it is not, I think it is important we have high confidence in the accuracy of the estimate; we can't get this confidence by simply taking a fixed sample size and calling it good.
Note that I am not a stats expert. If anyone is, then please feel free to correct me if I got something wrong, as the advice above is based on only a few hours of research.
I should also note that, as far as I know, standard error assumes a normal distribution, which may not be the case for block sizes; in particular, there are likely to be a lot of very small blocks and possibly an even larger number of 256K blocks. I still think we should iterate; I am just not 100% sure how to know when to stop.
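For reference, the textbook formula for the standard error of a sample mean with the finite population correction mentioned above (this is standard statistics, not something from the PR) is:

```latex
\mathrm{SE} = \frac{s}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}
```

where \(s\) is the sample standard deviation, \(n\) the sample size, and \(N\) the population size. The second factor is the finite population correction: it tends to zero as \(n \to N\), which is why the estimated error keeps shrinking as more of the population is sampled, rather than plateauing at \(s/\sqrt{n}\).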
We should limit sampling to huge datastores where computing the size would take hours. For most datastores (of maybe 100 GB or less) this calculation should only take a few minutes, and we should provide an exact number.
If we do sample we should consider iterating until we get an acceptable accuracy if that is at all possible, I left comments outside of this review.
flatfs.go
Outdated
// folders are also considered files with respect to
// DiskUsageFilesAverage (therefore this value only makes sense
// if it's lower).
DiskUsageFoldersAverage = 50
This is not a good threshold; even modest datastores will have a large number of directories. We should only consider skipping folders once a directory contains enough files that reading them truly becomes a problem.
@kevina the directories contain a random sample of the whole datastore (due to how the prefix is selected), so it isn't necessary to sample many of them. In theory, if the datastore is big enough, we could sample just random files from one directory.
@Kubuxu that is true, but if the datastore is small, this will still be an approximation and not an exact value.
I can already tell you that, based on my tests, getting within 1% error, and being sure about it during sampling, is going to take too long. What I can do instead is set a time limit rather than a folder limit, and read all the random samples possible (I'll play with a larger number of files per folder) until the acceptable time limit is hit (say, 5 minutes). This should give accuracy with small datastores (or those equipped with fast disks) and do our best on large/slow ones. Also, simplicity. Would this make you happy enough @kevina?
@Kubuxu @hsanjuan, I have a problem with
Perhaps a better way forward, if it is not possible to exactly calculate the DiskUsage() in a reasonable time, is to leave it undefined and provide a tool to calculate the DiskUsage, or to optionally make an estimate and update the value based on a command-line flag. This way the user is at least aware that it is an approximation. They can also schedule the command to run at a convenient time.
To estimate whether we can calculate the DiskUsage() in a reasonable time, I suggest we sample a few random directories and count the number of files; if the count is above a threshold, we give up and somehow mark the datastore as having an undefined DiskUsage(). We can also use a timeout.
Sorry, but I just can't in good conscience approve this P.R. as written. However, if people can show me other examples where DiskUsage calculation for a database is an approximation, I would likely be more okay with it. We would still need to update the documentation to make it clear that this is not an exact number. @whyrusleeping (and others): thoughts?
However, if people can show me other examples where DiskUsage calculation for a Database is an approximation, I would likely be more okay with it. We would still need to update the documentation to make it clear that this is not an exact number.
As far as I am concerned, the badger implementation of Size doesn't return the exact size on disk (at least not for every file-system), as doing so with its many specifics would be quite hard. The problem the DiskUsage method is trying to solve is to provide a fast way to get the size of data stored within the ipfs datastore. One use case is described in ipfs/kubo#4550; the other is to display repository size in the desktop app, where this was a problem too. For both of those cases being within .1% (or even 1%) should be perfectly acceptable imo. Having this method take more than a few seconds is not, as it makes it unusable for those use cases. Yes, we should document that the result may be an approximation, but returning 'undefined' when it would take too long doesn't solve either of the use cases above and just adds a feature which can't be used in most realistic scenarios (most datastores where you'd care about size will be too big to calculate the exact size in a reasonable time).
(1) Based on the latest estimates the accuracy will likely be more like 5%; if it were 1% (and we knew with reasonable confidence it's 1%) I would object less. (2) If I understand things correctly, this method will only take long when it has not been called before. On new datastores (after this lands), and after this method is called the first time, the DiskUsage will be kept in sync. Thus my proposal: if it takes too long in its initial calculation, don't do it, and leave it up to the user to update manually, after which DiskUsage will remain usable.
I think:
Please let's remember how badly broken this is right now and take this as a step towards improving it. The vast majority of daemons with small storages attached won't be affected by the accuracy problems. Any new ipfs daemons won't be affected. Only nodes with huge existing storages will be affected. In these nodes it's impossible to run
Then in that case I think a better solution is to return an undefined value for the DiskUsage, possibly informing the user how to rectify it for large repos, rather than return an estimate that can be 5-10% off without informing the user of the inaccuracy.
I will inform the user
I think it would be better to keep track of whether the disk usage is an estimate or an accurate count, and inform the user which it is. I am okay if we delay this to another P.R. as long as it gets done sometime soonish. Also please note that my opinion is just that, an opinion. I will not block this P.R. if it is impeding important progress, and will not object if someone dismisses my review.
This provides a disk estimation deadline. We will stat() as many files as possible until we run out of time. If that happens, the rest will be calculated as an average. The user is informed of the slow operation and, if we ran out of time, about how to obtain better accuracy.
I have implemented the time deadline in the last commit (with a default of 5 minutes). It will look like this when starting the datastore and no cache file exists:
The actual reference size in biham is: 7536515831332 (so it's pretty close).
It is possible to miss this warning. This information should at least be stored in the datastore somehow. Providing an interface to determine whether the stored value is an approximation or an exact value can be a separate P.R.
Comments noted. Follow up in ipfs/go-datastore#85
I'm sorry, but what is the standard policy regarding an author merging their own pull request? Especially when it is non-trivial. Also see #31.
This adds DiskUsage().
This datastore would take a big performance hit if we walked the
filesystem to calculate disk usage every time.
Therefore I have opted to keep tabs on the current disk usage by
walking the filesystem once during "Open" and then adding/subtracting
file sizes on Put/Delete operations.
On the plus:
On the minus:
(at least on ext4 systems). We don't track directory-size changes,
only use the creation size.