
Feat: Implement a PersistentDatastore by adding DiskUsage method #27

Merged
merged 24 commits
Mar 9, 2018

Conversation

hsanjuan
Contributor

This adds DiskUsage().

This datastore would take a big performance hit if we walked the
filesystem to calculate disk usage every time.

Therefore I have opted to keep track of the current disk usage by
walking the filesystem once during Open and then adding/subtracting
file sizes on Put/Delete operations (see the sketch below).

On the plus:

  • Small perf impact
  • Always up to date values
  • No chance that race conditions will leave DiskUsage with wrong values

On the minus:

  • Slower Open() - it runs Stat() on all files in the datastore
  • Size does not match the real size if a directory grows large
    (at least on ext4 systems). We don't track directory-size changes;
    we only use the size at creation.
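
For reference, a minimal sketch of the scan-once-then-track idea described above (the type and field names here are illustrative, not the exact ones in flatfs.go):

package flatfs

import (
	"os"
	"path/filepath"
	"sync/atomic"
)

// Minimal stand-in for the real Datastore; field names are assumptions.
type Datastore struct {
	path      string
	diskUsage int64 // running total, updated atomically
}

// calculateDiskUsage walks the datastore root once, summing the size
// of every entry. It runs during Open; afterwards Put/Delete only
// adjust the counter instead of re-walking.
func (fs *Datastore) calculateDiskUsage() error {
	var du int64
	err := filepath.Walk(fs.path, func(p string, fi os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		du += fi.Size() // directories count with their creation size
		return nil
	})
	if err != nil {
		return err
	}
	atomic.StoreInt64(&fs.diskUsage, du)
	return nil
}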

@hsanjuan hsanjuan requested a review from Stebalien February 16, 2018 15:12
@ghost ghost assigned hsanjuan Feb 16, 2018
@ghost ghost added the status/in-progress In progress label Feb 16, 2018
@ghost

ghost commented Feb 16, 2018

Slower Open() - it runs Stat() on all files in the datastore

That's a bummer for the gateways -- can't have a full repo scan on every restart there.

Can we make the bookkeeping persistent so that it survives restart?

@hsanjuan
Contributor Author

@lgierth is it so bad? In my gateway tests a while ago it did take a few seconds, but it happens once during boot, while ipfs is not yet hammering the disk.

That said, I can record the size in a file, sure, and only re-scan when it's not present.

@hsanjuan
Contributor Author

@lgierth there you go

@Stebalien
Member

Stebalien commented Feb 16, 2018

There are two downsides to caching disk usage:

  1. Downgrade, do something, upgrade (as we haven't bumped the repo version), leaving the cached value stale.
  2. Directory sizes will be even more out of date (meh?).

An alternative is to calculate the initial size lazily. The first call to DiskUsage will be slow but it won't halt the world. @lgierth does this work for you?

(I don't feel that strongly about this issue, just thought it should be considered).

@hsanjuan
Contributor Author

Without looking at Walk internals, walking lazily probably suffers from race conditions where a file may be counted twice (once through Put as soon as we Open the datastore, and again when the lazy walk finally gets to its folder). The same applies in reverse for deletes.

So it also has some downsides... (plus the fact that it may return really low sizes at the beginning, which may be misleading to whoever is looking at that metric).

I lean towards caching even though it's not perfect...

@Stebalien
Member

walking lazily probably suffers from race conditions where a file may be counted twice

Hm. You're right. We'd have to track which parts we've walked and which parts we haven't. That would be horrible.

Yeah, this sounds like the best solution. Note: If we want a more accurate solution, we could cache the size of the files and recompute the sizes of the directories but it's probably not worth it.

@hsanjuan
Contributor Author

probably not worth it.

Agreed. Would you be so kind as to greenlight the code or leave feedback, @Stebalien?

flatfs.go Outdated
@@ -33,6 +37,9 @@ type Datastore struct {

// synchronize all writes and directory changes for added safety
sync bool

diskUsage uint64
diskUsageCh chan int64
Member

Why not just use atomic.AddInt64 (and store it as an int64)?

Contributor Author

yes, indeed..
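
For illustration, the lock-free pattern the suggestion points at could look like this (field and helper names are assumptions, not the PR's exact code):

package flatfs

import "sync/atomic"

type Datastore struct{ diskUsage int64 }

// updateDiskUsage adjusts the running total after a successful Put
// (positive delta) or Delete (negative delta); no channel needed.
func (fs *Datastore) updateDiskUsage(delta int64) {
	atomic.AddInt64(&fs.diskUsage, delta)
}

// DiskUsage reads the counter without taking any lock.
func (fs *Datastore) DiskUsage() (uint64, error) {
	return uint64(atomic.LoadInt64(&fs.diskUsage)), nil
}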

flatfs.go Outdated
}
}

func fileSize(path string) int64 {
Member

We could actually be slightly more accurate by adding in the size of the filename.

Contributor Author

You think? In my tests, du on the disk matched the values we produce here. Aren't filenames in ext* part of the directory entry? And the directory entry counts as a full 4096-byte block even if empty, so I'm not sure adding filenames wouldn't double-count some bytes.

Member

I was trying to account for the case where we add a bunch of files without re-stat()ing the directories (in which case, they'll be much larger than 4096). However, that case may not really be worth considering. If it comes to it, we can always consider recomputing once in a while.

Contributor Author

OK, leaving this as it is. For my use case, it doesn't really matter. We can improve it when we need to.

Member

Fair enough.

flatfs.go Outdated
}

duB := []byte(fmt.Sprintf("%d", du))
ioutil.WriteFile(filepath.Join(fs.path, diskUsageFile), duB, 0644)
Member

Should really use a tempfile with a rename.

Contributor Author

Can do. Can you explain the logic behind it, though?

Contributor Author

By the way, my /tmp folder lives on a partition different from my ipfs storage. On a system with an SSD and a spinning disk, it is not unusual for /tmp, /var, etc. to be mounted on the spinning disk (to reduce the number of writes to the SSD) and everything else on the SSD, so imagine the performance hit of the current implementation with temp files and renames.

Member

Ah, no. I mean an ioutil.TempFile as we do in doPut. That is, create a temporary file, write to it, close it, and then atomically rename it into place (so we don't end up with a half-written or corrupted file if we crash while recording the size). Probably not much of an issue in this case but still good practice.
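
A sketch of that write-to-temp-then-rename pattern, with illustrative names (the thread later settles on a diskUsage.cache file):

package flatfs

import (
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
)

// writeDiskUsageFile persists the usage value atomically: write to a
// temp file in the same directory, then rename it into place. Rename
// within a filesystem is atomic on POSIX, so a crash can never leave
// a half-written usage file behind.
func writeDiskUsageFile(dir string, du int64) error {
	tmp, err := ioutil.TempFile(dir, "diskUsage-")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op once the rename succeeds
	if _, err := fmt.Fprintf(tmp, "%d", du); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), filepath.Join(dir, "diskUsage.cache"))
}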

Contributor Author

Yes. I just realized that the temp files are created in the same folder. That makes sense, thanks!

@Stebalien
Member

So, reasoning through this carefully, there are a few annoying race conditions:

  • If two users write the same block at the same time, we'll double-count the file.
  • Same for two simultaneous deletes.
  • A simultaneous write + delete could end up not recording the fact that we've freed disk space.

For the delete case, we should:

  1. Compute the disk space.
  2. Remove the file.
  3. Update the usage iff we removed the file (the call to Remove succeeded).

For the write case... I think we'll have to use os.Link(old, new) and os.Remove(old). That should work on all operating systems and should avoid overwriting the file. Note: this assumes that we never want to replace values in datastores. If we do, we could do a remove-and-replace dance...

There are a few OS-specific ways to make this faster, but cross-platform low-level filesystem stuff always kind of sucks.
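
To make the ordering concrete, a minimal sketch under those rules (names are assumptions, not the PR's code); the counter only moves after the filesystem operation has actually succeeded:

package flatfs

import (
	"os"
	"sync/atomic"
)

type Datastore struct{ diskUsage int64 }

// doDelete stats first, removes, and only subtracts the size if the
// Remove succeeded, so two racing deletes can't both be counted.
func (fs *Datastore) doDelete(path string) error {
	fi, err := os.Stat(path)
	if err != nil {
		return err
	}
	if err := os.Remove(path); err != nil {
		return err
	}
	atomic.AddInt64(&fs.diskUsage, -fi.Size())
	return nil
}

// doPut links the temp file into place; os.Link fails if the
// destination already exists, so two racing writers can't both
// count the same block (this is the no-overwrite trade-off).
func (fs *Datastore) doPut(tmp, dst string, size int64) error {
	if err := os.Link(tmp, dst); err != nil {
		return err
	}
	if err := os.Remove(tmp); err != nil {
		return err
	}
	atomic.AddInt64(&fs.diskUsage, size)
	return nil
}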

@hsanjuan
Contributor Author

@Stebalien you're right.

I have followed your suggestions. I'm not super happy, though:

  • Not allowing overwrites of values seems like a high price to pay and changes datastore behaviour.
  • I cannot get my test of concurrent mixed put+delete on the same key to pass with sync = false. Sure, the deviation is small, but it seems some sort of race still exists at a lower level. I'm happy ignoring this.
  • I have tried to do the remove-and-replace dance. It introduces further race conditions (we cannot remove and replace atomically), and I'm not sure it's possible to do that dance without making things horrible or resorting to a sync.Mutex.

My feelings overall:

  • DiskUsage bookkeeping should not deeply impact datastore performance.
  • DiskUsage should be more or less accurate for the common usage of this datastore, but it doesn't need to be exact.

Thus, if we can live with the non-overwrite limitation I'd leave things like this; otherwise, I'd return to the earlier buggy behaviour (fixing the delete part, which is easy). What do you think?

@Stebalien
Member

My feelings overall:

Overall, I agree. Performance is already poor enough as it is.

Thus, if we can live with the non-overwrite limitation I'd leave things like this; otherwise, I'd return to the earlier buggy behaviour (fixing the delete part, which is easy). What do you think?

My worry is that simultaneous writes/deletes are probably quite common and, if they're not perfectly balanced, the estimated disk usage will skew over time. We might, actually, just want to use a synchronization mechanism. I wouldn't have a global lock, but we can track in-progress puts/deletes on a per-file basis. That is, I'd have a map (either a sync.Map or a map protected by a mutex) mapping file names to some form of synchronization object. Every operation would keep retrying itself until it either (a) observes itself fail or (b) observes some concurrent operation (or itself) succeed.

That is, given a timeline like:

<--op 1---> // takes the "lock"
  <--op 2-> // waits
   <-op 3-> // waits

If op1 succeeds, we can serialize these operations as <op2><op3><op1> (and say that op2 and op3 are valid for no observable time and therefore optimize them out 😄). If op1 fails, we'll need to retry at least one of the other operations (in practice, they'll both retry and start at the top).

This should actually be faster than the existing implementation given concurrent operations and should be just as "correct".

Does this sound too complicated? Any better ideas?

hsanjuan added a commit that referenced this pull request Feb 23, 2018
This implements the approach suggested by @Stebalien in
#27

Write operations (delete/put) to the same key are tracked in a map
which provides a shared lock. Concurrent operations to that key
will share that lock. If one operation succeeds, it will remove
the lock from the map and the others using it will automatically
succeed. If one operation fails, it will let the others waiting
for the lock try.

New operations to that key will request a new lock.

A new test for putMany (batching) has been added.

Worth noting: a concurrent Put+Delete on a non-existing key
always yields Put as the winner (the delete will fail if it comes first,
or will be skipped if it comes second).
@hsanjuan
Contributor Author

@Stebalien ok, I have implemented the approach you suggested. Works well for my tests (added an extra one too for putMany()). Hopefully we can settle here :)

The tricky code should be sufficiently commented, but let me know if something's not clear.

Note: Mac tests fail because of too many open files. I'll tune that later.

Member

@Stebalien Stebalien left a comment

So, in addition to my comments, there's a bit of a problem concerning deletes. If we get a concurrent delete and put, the delete will succeed even if there was nothing to delete. If we have multiple deletes and a put, all the deletes will succeed (even though at most one should).

However, as far as I can tell, badger-db doesn't return errors on deletes of non-existent keys (idempotent deletes). Personally, I prefer those.

So, we need to decide: do we have idempotent deletes or do we return an error if there is nothing to delete.

@magik6k?

flatfs.go Outdated
// on the same key. We use opLock
// for that.
opLock.Lock()
defer func() {
Member

Can just call defer opLock.Unlock().

Contributor Author

Oops, left over from some debugging.


// Destination exists, we need to discount it from diskUsage
if fs != nil && err == nil {
atomic.AddInt64(&fs.diskUsage, -fi.Size())
Member

Needs to happen after the rename (could fail).

Member

This will still need to be addressed (the remove could fail).

flatfs.go Outdated
return fs.renameAndUpdateDiskUsage(oper.tmp, oper.path)
default:
panic("bad operation, this is a bug")
return errors.New("bad operation, this is a bug")
Member

Don't need this (Go understands that panic aborts).

flatfs.go Outdated
// operations if one of them does. In such case,
// we assume that the first succeeding operation
// on that key was the last one to happen after
// two successful others.
Member

"after all successful others"

flatfs.go Outdated
return err
}

//time.Sleep(2 * time.Second)
Member

Commented out code.

flatfs.go Outdated
// failed, even if they did not compete with another
// one. If datastore operations fail all the time
// opMap will grow as it retains all keys of failed operations
// and only deletes them on success.
Member

@Stebalien Stebalien Feb 23, 2018

Hm. This is a bit unfortunate. Especially with respect to deletes. We can easily have delete(x), sleep(), delete(x).

The other way to do this is to use one-off locks. That is,

type OpMap struct {
	ops map[string]*OpResult
	mu  sync.RWMutex
}

type OpResult struct {
	mu      sync.RWMutex
	success bool

	opMap *OpMap
	name  string
}

// Begin returns nil if there's nothing to do (a concurrent
// operation on the same key already succeeded).
func (m *OpMap) Begin(name string) *OpResult {
	for {
		m.mu.Lock()
		op, ok := m.ops[name]
		if !ok {
			op := &OpResult{opMap: m, name: name}
			op.mu.Lock() // held until Finish
			m.ops[name] = op
			m.mu.Unlock()
			return op
		}
		m.mu.Unlock()

		// Block until the in-flight operation finishes.
		op.mu.RLock()
		done := op.success
		op.mu.RUnlock()
		if done {
			return nil
		}
		// The operation failed; retry from the top.
	}
}

func (o *OpResult) Finish(ok bool) {
	o.success = ok
	delete(o.opMap.ops, o.name) // note: the map delete needs the map lock held
	o.mu.Unlock()
}
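
A hypothetical caller, assuming the OpMap above; all waiting happens inside Begin, so the callback runs only while holding the per-key lock:

// putOrDelete retries implicitly: if a concurrent operation on this
// key succeeds while we wait, Begin returns nil and we're done.
func putOrDelete(m *OpMap, key string, work func() error) error {
	op := m.Begin(key)
	if op == nil {
		return nil // a concurrent op on this key already succeeded
	}
	err := work()
	op.Finish(err == nil) // on failure, any waiters will retry
	return err
}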

Contributor Author

OK, let me try this.

Contributor Author

@Stebalien the delete in Finish causes panics with the map (fixed by locking/unlocking around it). I have implemented the whole thing with a sync.Map instead.

Member

@Kubuxu Kubuxu left a comment

Running something that might be very expensive (after an ipfs crash, power loss, or out-of-memory kill) during the bootstrap sequence of go-ipfs doesn't sit right with me.

In the past I've worked on servers where a scan like that would almost bring the whole system down (a few TB of data on the same almost-dying HDD as the system; they could pull 10 IOPS on a lucky day). go-ipfs would be unusable on those servers, as it would take a very long time to start up while looking like it had hung.

In my opinion it is too high a price to pay for a feature that might not be used. Doing it lazily, on request, and reporting an error when the scan has not yet finished would be far better.

flatfs.go Outdated
)

const (
putOp = iota
Member

IMO use a named type like opT.

flatfs.go Outdated

// opMap handles concurrent write operations (put/delete)
// to the same key
opMap map[string]*sync.Mutex
Member

It might be better to use a concurrent map. It should be essentially free when we are doing concurrent operations on different keys.

Member

👍 although I think the OS will be the bottleneck here.

@Stebalien
Member

In my opinion it is too high a price to pay for a feature that might not be used. Doing it lazily, on request, and reporting an error when the scan has not yet finished would be far better.

We've discussed this but it's prohibitively complicated as we need a snapshot size. One solution would be to disable writes until the scan is complete but that complicates matters and isn't really a solution.

@hsanjuan
Contributor Author

We've discussed this but it's prohibitively complicated as we need a snapshot size. One solution would be to disable writes until the scan is complete but that complicates matters and isn't really a solution.

We could write the DiskUsage file regularly (say, when the disk usage differs from the last checkpoint by >1%) so that we can read it on start. It may not be 100% accurate in the case of a crash, but it might be close enough for our needs.
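
A sketch of that checkpoint rule under assumed names (persistDiskUsage stands in for the temp-file-plus-rename write sketched earlier):

package flatfs

import "sync/atomic"

type Datastore struct {
	diskUsage      int64
	lastCheckpoint int64
}

// checkpointDiskUsage persists the tracked value only when it has
// drifted more than 1% from the last persisted checkpoint, bounding
// both the write traffic and the post-crash error.
func (fs *Datastore) checkpointDiskUsage() {
	du := atomic.LoadInt64(&fs.diskUsage)
	last := atomic.LoadInt64(&fs.lastCheckpoint)
	diff := du - last
	if diff < 0 {
		diff = -diff
	}
	// diff*100 > last  <=>  drift greater than 1% of the checkpoint.
	if last == 0 || diff*100 > last {
		if fs.persistDiskUsage(du) == nil {
			atomic.StoreInt64(&fs.lastCheckpoint, du)
		}
	}
}

// persistDiskUsage would do the atomic temp-file write; stubbed here.
func (fs *Datastore) persistDiskUsage(du int64) error { return nil }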

@Kubuxu
Member

Kubuxu commented Feb 27, 2018

@hsanjuan SGWM. It also seems to me that triggering a full re-scan on ipfs repo fsck would be a good feature to have. That way, if the DiskUsage file diverged too far, fsck would fix it.

Any idea how to nicely wire it in?

@hsanjuan
Contributor Author

@Kubuxu that will need interface support in go-datastore. Should we have an Fsck() method for PersistentDatastore that does this (and potentially other things if needed)?

@Kubuxu
Member

Kubuxu commented Feb 27, 2018

I am wondering if PersistentDatastore is the only one needing this, and it seems so. I also recall Badger needing a hook after GC was performed so it could run its own internal GC. I don't know if that was solved (@magik6k).

@magik6k
Member

magik6k commented Feb 27, 2018

The GC interface is wired into go-ipfs master already (merged in ipfs/kubo#4578)

I created 2 more interfaces in go-datastore when making the GC one - https://github.com/ipfs/go-datastore/blob/master/datastore.go#L88-L103. Check from CheckedDatastore is meant to be triggered on ipfs repo fsck. I'm not sure where Scrub should get invoked, probably a separate command like ipfs repo scrub.

ScrubbedDatastore.Scrub is probably where we want to wire this in. (Check / repo fsck should be fast.)

@hsanjuan hsanjuan mentioned this pull request Feb 27, 2018
@whyrusleeping
Member

Could we get some benchmarks on how this will affect the overall performance of the datastore?

Member

@Stebalien Stebalien left a comment

Additionally, we still need to address the issue of idempotent deletes. Currently, this will sometimes return an error when deleting something that isn't there and sometimes won't. Personally, I'd rather just always return success if either the delete succeeded or there was nothing there.

flatfs.go Outdated
}

type opMap struct {
ops *sync.Map
Member

nit: I wouldn't use a pointer here (no point).

}
}

func (fs *Datastore) checkpointDiskUsage(newDuInt int64) {
Member

There is a serious drawback to this. We could write a bunch of data but crash before checkpointing. The old way was much safer because we didn't leave the file on disk while it could be invalid.

Contributor Author

You would need to crash precisely during a write that would have triggered a checkpoint, after having written a really large entry (with respect to the current size), for the checkpoint to be badly skewed from the actual size. I don't think it's a big deal as long as the datastore is sufficiently large relative to the entries written to it.

Member

Fair enough.



@Kubuxu
Member

Kubuxu commented Feb 27, 2018

From my tests, ipfs repo stat on Biham (9 TB of data in go-ipfs) didn't finish after 13h; I've cancelled it. I will leave du on the blockstore running overnight. du is probably as close to optimal size-scanning as we will get; I expect it to also take hours at least. I don't think we can allow this kind of "semi-random" (after unexpected circumstances) delay in starting an ipfs node.

@hsanjuan
Contributor Author

@Stebalien

This will still need to be addressed (the remove could fail).

I can't comment directly there. There is no remove. I have updated it so it does updateDiskUsage when the rename fails too. In that case it should re-add an existing file size, or do nothing if it didn't exist in the first place.

@hsanjuan
Contributor Author

I don't think we can allow this kind of "semi-random" (after unexpected circumstances) delay in starting an ipfs node.

Ideally this only happens once, when the checkpoint file is not there (it should be there on subsequent starts).

It might be that stat-walking large repos is too expensive even to do once. We could be smart and list the entries in folders and multiply by 256K. That would give an approximate ballpark in a hopefully much faster way. I'm happy sacrificing accuracy, as in a large-file context most blocks will probably be full.

@hsanjuan
Contributor Author

hsanjuan commented Mar 5, 2018

Progress and tests so far:

  • It selects random files/folders when the limit is less than the number of them.
  • It has a folder limit because the slowest operation is readdirnames, which happens on folders, so I figure that if we read fewer folders we can compensate by stat-ing more files.
  • Readdirnames is very slow on biham when the folders are not cached (earlier it was fast only because they had been accessed recently). With the current implementation it will take 3 minutes to start the first time after this change comes in. It reads 50 random folders (of around 1000) and uses 100 files in each of them to get the average (see the sketch below).
  • This gives a 5% error in the calculated size. More accuracy will mean more time (what's acceptable?). But this only happens on the first start, when there is no diskUsage.cache file.

Are we happy with this?
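
A sketch of the sampling estimate described above, reusing the 50-folder / 100-file figures; all names are illustrative and this is not the PR's exact code:

package flatfs

import (
	"math/rand"
	"os"
	"path/filepath"
)

const (
	sampleFolders = 50  // random prefix folders to read
	sampleFiles   = 100 // files to stat per sampled folder
)

// estimateDiskUsage reads a random subset of prefix folders, stats a
// bounded number of files in each, and extrapolates: average sampled
// file size times the estimated total number of entries.
func estimateDiskUsage(root string) (uint64, error) {
	d, err := os.Open(root)
	if err != nil {
		return 0, err
	}
	folders, err := d.Readdirnames(-1)
	d.Close()
	if err != nil {
		return 0, err
	}
	rand.Shuffle(len(folders), func(i, j int) {
		folders[i], folders[j] = folders[j], folders[i]
	})
	n := sampleFolders
	if n > len(folders) {
		n = len(folders)
	}
	var sampledBytes, sampledFiles, seenEntries int64
	for _, f := range folders[:n] {
		dir := filepath.Join(root, f)
		fd, err := os.Open(dir)
		if err != nil {
			continue // skip stray non-directories
		}
		names, _ := fd.Readdirnames(-1)
		fd.Close()
		seenEntries += int64(len(names))
		m := sampleFiles
		if m > len(names) {
			m = len(names)
		}
		for _, name := range names[:m] {
			if fi, err := os.Stat(filepath.Join(dir, name)); err == nil {
				sampledBytes += fi.Size()
				sampledFiles++
			}
		}
	}
	if n == 0 || sampledFiles == 0 {
		return 0, nil
	}
	avg := float64(sampledBytes) / float64(sampledFiles)
	// Scale the entries seen in the sample up to all folders.
	estEntries := float64(seenEntries) / float64(n) * float64(len(folders))
	return uint64(avg * estEntries), nil
}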

@Stebalien Stebalien mentioned this pull request Mar 6, 2018
@Stebalien
Member

Still need to fix #30 but we can do that in a subsequent PR.

@kevina kevina self-requested a review March 6, 2018 23:31
@kevina
Contributor

kevina commented Mar 6, 2018

Okay, there seems to be a lot going on here. Please give me a day or two to look over this.

Note that I don't think identity hashes will cause any problems with sampling: if we implement them as I envisioned, datastores won't have to worry about them, because they won't even be stored in the datastore (there is no need).

@hsanjuan
Contributor Author

hsanjuan commented Mar 6, 2018

Thanks @kevina, will wait.

@kevina
Contributor

kevina commented Mar 7, 2018

If we are going to estimate the size, I think that instead of taking a fixed sample size we should iterate until we get an acceptable level of accuracy. The accuracy can be estimated using the formula for the standard error for a finite population (the finite population correction factor is the important part; otherwise the error won't shrink as we take larger samples). We can continue to take samples until this number is within an acceptable margin (I would say 1%), as in the formula below.
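
For reference, the standard error with the finite population correction, for a sample of size n with sample standard deviation s drawn from a population of N files, is:

\mathrm{SE} = \frac{s}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}

Sampling would continue until SE divided by the sample mean falls below the target relative error (e.g. 1%).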

Since we are not even reading all the directories, we have two estimates we are working with: (1) the estimated number of files and (2) the estimated mean file size. The standard error needs to be computed independently for each and then properly combined to produce a final estimate.

The population size for (1) is the number of directories. The population size for (2) is the number of files. Since the population size for (2) is only an estimate, this might need to be taken into account when calculating the correction factor.

I would prefer that DiskUsage() be exact. If it is not, I think it is important that we have high confidence in the accuracy of the estimate; we can't get that confidence by simply taking a fixed sample size and calling it good.

@kevina
Contributor

kevina commented Mar 7, 2018

Note that I am not a stats expert. If anyone is, please feel free to correct me if I got something wrong, as the advice above is based on only a few hours of research.

@kevina
Contributor

kevina commented Mar 7, 2018

I should also note that, as far as I know, standard error assumes a normal distribution, which may not be the case for block sizes; in particular, there are likely to be a lot of very small blocks and possibly an even larger number of 256K blocks.

I still think we should iterate; I am just not 100% sure how to know when to stop.

kevina
kevina previously requested changes Mar 7, 2018
Contributor

@kevina kevina left a comment

We should limit sampling to huge datastores where computing the size will take hours. For most datastores (of, say, 100 GB or less) this calculation should only take a few minutes, and we should provide an exact number.

If we do sample, we should consider iterating until we get an acceptable accuracy, if that is at all possible; I left comments outside of this review.

flatfs.go Outdated
// folders are also considered files with respect to
// DiskUsageFilesAverage (therefore this value only makes sense
// if it's lower).
DiskUsageFoldersAverage = 50
Contributor

This is not a good threshold; even modest datastores will have a large number of directories. We should only consider skipping folders once a directory contains so many files that reading them truly becomes a problem.

Member

@kevina the directories contain a random sample of the whole datastore (due to how the prefix is selected), so it isn't necessary to sample many of them. In theory, if the datastore is big enough we could sample just random files from one directory.

Contributor

@Kubuxu that is true, but if the datastore is small, this will still be an approximation and not an exact value.

@hsanjuan
Contributor Author

hsanjuan commented Mar 7, 2018

We should limit sampling to huge datastores where computing the size will take hours. For most datastores (of, say, 100 GB or less) this calculation should only take a few minutes, and we should provide an exact number.

I can already tell you that, based on my tests, getting within 1% of error and being sure about it during sampling is going to take too long.

What I can do instead is set a time limit rather than a folder limit and read all the random samples possible (I'll play with a larger number of files per folder) until the time limit is hit (say, 5 minutes). This should give accurate results on small datastores (or those with fast disks) and do our best on large/slow ones. Also, simplicity.

Would this make you happy enough @kevina ?
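
A sketch of such a time-boxed scan (names illustrative): stat files until the deadline passes, abort the walk, and report whether the scan completed so the remainder can be estimated from the running average.

package flatfs

import (
	"errors"
	"os"
	"path/filepath"
	"time"
)

var errDeadline = errors.New("disk usage scan deadline reached")

// scanWithDeadline walks the datastore until the deadline passes.
// complete reports whether every file was statted; if false, the
// caller extrapolates the remaining files from the running average.
func scanWithDeadline(root string, limit time.Duration) (files, bytes int64, complete bool, err error) {
	stop := time.Now().Add(limit)
	err = filepath.Walk(root, func(p string, fi os.FileInfo, werr error) error {
		if werr != nil {
			return werr
		}
		if time.Now().After(stop) {
			return errDeadline // abort the whole walk
		}
		if !fi.IsDir() {
			files++
			bytes += fi.Size()
		}
		return nil
	})
	if err == errDeadline {
		return files, bytes, false, nil
	}
	return files, bytes, err == nil, err
}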

@kevina
Contributor

kevina commented Mar 7, 2018

@Kubuxu @hsanjuan, I have a problem with DiskUsage() being approximate. The documentation for DiskUsage says "DiskUsage returns the space used by a datastore, in bytes." At the very least it is bad UX and could cause problems. Worse is the fact that the user has no way to know that it is an approximation.

Perhaps a better way forward: if it is not possible to calculate DiskUsage() exactly in a reasonable time, we should leave it undefined and provide a tool to calculate the DiskUsage, or to optionally make an estimate and update the value via a command-line flag. That way the user is at least aware that it is an approximation. They can also schedule the command to run at a convenient time.

To estimate whether we can calculate DiskUsage() in a reasonable time, I suggest we sample a few random directories and count the number of files; if the count is above a threshold, we give up and somehow mark the datastore as having an undefined DiskUsage(). We could also use a timeout.

Sorry, but I just can't in good conscience approve this P.R. as written.

However, if people can show me other examples where the DiskUsage calculation for a database is an approximation, I would likely be more okay with it. We would still need to update the documentation to make it clear that this is not an exact number.

@whyrusleeping (and others): thoughts?

@magik6k
Member

magik6k commented Mar 7, 2018

As far as I can tell, the badger implementation of Size doesn't return the exact size on disk (at least not for every filesystem), as doing so with all its many specifics would be quite hard.

The problem the DiskUsage method is trying to solve is providing a fast way to get the size of the data stored in the ipfs datastore.

One use case is described in ipfs/kubo#4550; the other is displaying repository size in the desktop app, where this was a problem too. For both of those cases, being within 0.1% (or even 1%) should be perfectly acceptable IMO. Having this method take more than a few seconds is not, as that makes it unusable for those use cases.

Yes, we should document that the result may be an approximation, but making it return 'undefined' when it would take too long doesn't solve either of the use cases above and just adds a feature that can't be used in most realistic scenarios (most datastores where you'd care about size will be too big to calculate an exact size for in a reasonable time).

@kevina
Contributor

kevina commented Mar 7, 2018

One use case is described in ipfs/kubo#4550; the other is displaying repository size in the desktop app, where this was a problem too. For both of those cases, being within 0.1% (or even 1%) should be perfectly acceptable IMO. Having this method take more than a few seconds is not, as that makes it unusable for those use cases.

(1) Based on the latest estimates, the accuracy will likely be more like 5%. If it were 1% (and we knew with reasonable confidence that it is 1%), I would object less.

(2) If I understand things correctly, this method will only take long when it has not been called before. On new datastores (after this lands), and after this method is called the first time, the DiskUsage will be kept in sync. Thus my proposal: if the initial calculation takes too long, don't do it, and leave it up to the user to update the value manually, after which DiskUsage will remain usable.

@hsanjuan
Contributor Author

hsanjuan commented Mar 7, 2018

I think:

  • It's OK to provide estimates that are as good as possible within a time frame of 3-5 minutes.
  • We already provide a way for the user to fix the estimate if they are willing: writing the diskUsage.cache file with the right value. Maybe we can make this more explicit with a special readme or by printing a message.

Please let's remember how badly broken this is right now and take this as a step toward improving it.

The vast majority of daemons with small storage attached won't be affected by the accuracy problems. Any new ipfs daemon won't be affected. Only nodes with huge existing storages will be affected. On these nodes it's impossible to run repo stat right now.

@kevina
Contributor

kevina commented Mar 7, 2018

Only nodes with huge existing storages will be affected. On these nodes it's impossible to run repo stat right now.

Then in that case I think a better solution is to return an undefined value for the DiskUsage, possibly informing the user how to rectify it for large repos, rather than return an estimate that can be 5-10% off without informing the user of the inaccuracy.

@hsanjuan
Contributor Author

hsanjuan commented Mar 7, 2018

without informing the user of the inaccuracy.

I will inform the user

@kevina
Contributor

kevina commented Mar 7, 2018

I think it would be better to keep track of whether the disk usage is an estimate or an accurate count and inform the user which it is. I am okay if we delay this to another P.R. as long as it gets done sometime soonish.

Also please note that my opinion is just that. I will not block this P.R. if it is impeding important progress, and will not object if someone dismisses my review.

This provides a disk estimation deadline. We will stat() as many
files as possible until we run out of time. If that happens,
the rest will be estimated from the average.

The user is informed of the slow operation and, if we ran out of time,
about how to obtain better accuracy.
@hsanjuan
Contributor Author

hsanjuan commented Mar 7, 2018

I have implemented the time deadline in the last commit (with a default of 5 minutes).

It will look like this when starting the datastore and no cache file exists:

root@biham /ipfs/ipfs_master/repo/blocks # ./test 
Calculating datastore size. This might take 5m0s at most and will happen only once
WARN: It took too long to calculate the datastore size
WARN: The total size (7589482442898) is an estimation. You can fix errors by
WARN: replacing the diskUsage.cache file with the right disk usage in bytes and
WARN: re-opening the datastore
7589482442898 <nil>

The actual reference size in biham is: 7536515831332 (so it's pretty close).

@kevina
Contributor

kevina commented Mar 7, 2018

It will look like this when starting the datastore and no cache file exists:

It is possible to miss this warning. This information should at least be stored in the datastore somehow. Providing an interface to determine whether the stored value is an approximation or an exact value can be a separate P.R.

@kevina
Contributor

kevina commented Mar 9, 2018

I'm sorry, but what is the standard policy regarding an author merging their own pull request? Especially when it is non-trivial.

Also see #31.

@whyrusleeping whyrusleeping deleted the feat/diskusage branch March 9, 2018 17:32