LevelDB is using a lot of FDs during --nocopy add #3763

jefft0 · 2017-03-07T19:51:44Z

Version information:

go-ipfs version: 0.4.6-
Repo version: 5
System version: amd64/linux
Golang version: go1.8

Type: bug

Priority: P1

Description:

With a script, I used ipfs add --nocopy --raw-leaves to individually add 9110 webcam video files, each about 200 MB. (I did not use recursive add.) This is a fresh installation of the latest code in master. I did not start the daemon. Now, even after restarting my computer, when I try to add --nocopy another file, I get the error:

open /home/jeff/.ipfs/datastore/206401.ldb: too many open files

The .ipfs/blocks folder has 73689 .data files, and the .ipfs/datastore folder has 4314 .ldb files.

Maybe ipfs add --nocopy is trying to open all the .ldb files at once?
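
For anyone who wants to check the same counts on their own repo, here is a minimal sketch in Go that tallies the files in a directory by extension. The helper names (countByExt, listFiles) are made up for illustration and are not part of go-ipfs; the directory to inspect (for example ~/.ipfs/datastore) is passed as an argument.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// countByExt tallies file names by extension (for example ".ldb" or ".data").
// Hypothetical helper, not part of go-ipfs.
func countByExt(names []string) map[string]int {
	counts := make(map[string]int)
	for _, name := range names {
		counts[filepath.Ext(name)]++
	}
	return counts
}

// listFiles returns the names of the regular files directly inside dir.
func listFiles(dir string) ([]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, e := range entries {
		if e.Type().IsRegular() {
			names = append(names, e.Name())
		}
	}
	return names, nil
}

func main() {
	// The directory is given on the command line, e.g. ~/.ipfs/datastore.
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: countfiles <dir>")
		os.Exit(2)
	}
	names, err := listFiles(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for ext, n := range countByExt(names) {
		fmt.Printf("%-8s %d\n", ext, n)
	}
}
```

Comparing the .ldb count with the output of ulimit -n gives a rough idea of whether opening every table file at once could exceed the per-process limit.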

whyrusleeping · 2017-03-07T20:00:01Z

Hey @jefft0 could you try out #3756 and see if it solved the problem?

jefft0 · 2017-03-08T00:18:57Z

I see you merged the pull request. I pulled and re-built from master, but still get the same error.

whyrusleeping · 2017-03-08T00:38:53Z

@jefft0 interesting... can you help debug this for us?

There's a program I use for this called lsof (it should be in most Linux package repos).

Start up your daemon and get its pid. Then do whatever it is that you do to reproduce this problem, and when it's about to happen, grab the output of lsof -p $IPFS_PID.

It might help to grab it a few times: before, during (as close as you can get to the error) and after. With this info we should be able to figure out where those file descriptors are coming from and fix the bug.

jefft0 · 2017-03-08T08:49:59Z

See the attached output files from lsof.
lsof-before.txt
lsof-during.txt
lsof-after.txt

Kubuxu · 2017-03-08T10:34:25Z

LevelDB is using 500 FDs, this shouldn't happen.

whyrusleeping · 2017-03-08T11:05:18Z

what the hell...

jefft0 · 2017-03-10T20:18:10Z

Any ideas on why so many file descriptors?

whyrusleeping · 2017-03-13T05:29:12Z

@jefft0 I've tried to reproduce the issue several times now with no luck. What sort of disks do you have? and is there anything weird about your setup? (like, different weird filesystems, mountpoints, OS tweaks that might affect this)

jefft0 · 2017-03-13T16:08:42Z

My files are in an 8-drive JBOD housing over USB. I'm using the default mount point for Ubuntu, but through symbolic links in the .ipfs folder (as I described elsewhere to meet the security requirement).

whyrusleeping · 2017-03-13T16:29:38Z

@jefft0 and your .ipfs directory (the datastore and blocks directories) are stored on a normal disk? (ssd or spinner?)

jefft0 · 2017-03-13T16:31:04Z

Yes, they're on my SSD system drive.

jefft0 · 2017-03-13T16:31:26Z

(My default home directory)

whyrusleeping · 2017-03-13T16:34:34Z

@jefft0 Okay, i'll run some experiments with the data being on a mounted drive.

In the meantime, if you don't mind, could you grab a stack dump from your daemon while the file descriptor usage is really high?

curl localhost:5001/debug/pprof/goroutine\?debug=2 > ipfs.stacks

I'm curious to see if there are tons of concurrent processes hitting leveldb for some reason
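
A stack dump taken with debug=2 can be long, so here is a small sketch in Go for summarizing it. It assumes the usual dump layout, where each goroutine starts with a "goroutine N [state]:" line and goroutines are separated by blank lines, and it reports how many stacks mention a given substring such as "leveldb". The function name countGoroutines is made up for this example.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// countGoroutines splits a debug=2 goroutine dump into per-goroutine blocks
// and returns the total number of goroutines and how many of their stacks
// contain needle.
func countGoroutines(dump, needle string) (total, matching int) {
	for _, block := range strings.Split(dump, "\n\n") {
		block = strings.TrimSpace(block)
		if !strings.HasPrefix(block, "goroutine ") {
			continue // skip empty trailing pieces
		}
		total++
		if strings.Contains(block, needle) {
			matching++
		}
	}
	return total, matching
}

func main() {
	// Usage: go run main.go < ipfs.stacks
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	total, n := countGoroutines(string(data), "leveldb")
	fmt.Printf("%d of %d goroutines mention leveldb\n", n, total)
}
```

A large matching count would point toward many concurrent callers; a small one suggests the descriptors are held by the database itself rather than by in-flight requests.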

jefft0 · 2017-03-14T18:06:03Z

Attached.
ipfs.stacks.txt

jefft0 · 2017-03-19T09:40:06Z

@whyrusleeping Here's another data point. I tried the same setup on macOS 10.12. I'm not running the daemon as I add --nocopy each file. It's easy to get the PID and run lsof because it's getting slower and slower as I add each file. It now takes over a minute to add one 200MB file.

Anyway, attached is the lsof output when adding a file. Why is it necessary to open 363 LDB files at the same time? Is each one opened in a goroutine that maybe doesn't close and terminate?
lsof-no-daemon.txt

jefft0 · 2017-03-31T21:36:13Z

Is add --nocopy supported in js-ipfs on Ubuntu? Do you think it's worth trying to add lots of files with js-ipfs to see if its use of LevelDB doesn't run out of file descriptors?

Kubuxu · 2017-03-31T22:54:04Z

I don't think it is.

whyrusleeping · 2017-03-31T23:19:32Z

@jefft0 I have no idea why leveldb is behaving that way; it doesn't do so on my machines.

Just seeing your lsof output now

jefft0 · 2017-04-01T10:16:50Z

@whyrusleeping On the Ubuntu machine you tested, what version of LevelDB? How was it installed?

jefft0 · 2017-04-07T16:35:21Z

I am able to reproduce the error in a fresh Ubuntu virtual machine (both VirtualBox and Parallels) with a fresh installation of the latest go-ipfs. I transfer a 700MB gzip file and expand it as the new installation's ~/.ipfs . Then, using ipfs add --nocopy --raw-leaves gives the error "too many open files".

So, if someone wants to reproduce and debug this issue, I can get you the 700MB gzip file. (You don't need the original files that were added.) Would that be useful?

jefft0 · 2017-04-07T16:41:42Z

... or better yet, make a virtual machine for you to log into?

whyrusleeping · 2017-04-07T16:43:11Z

@jefft0 Yeah, getting that gzip file would be great!

jefft0 · 2017-04-07T17:20:18Z

See ipfs-too-many-fds.tar.gz .

whyrusleeping · 2017-04-09T17:46:57Z

@jefft0 the archive you sent was an ipfs repo. Was your issue with adding an ipfs repo into ipfs? Or did you mean to zip up something else?

jefft0 · 2017-04-10T15:14:57Z

That's what I meant to send. (Don't worry about the private key. It was generated just for this test.)

Steps to reproduce:

1. An installation of Ubuntu 16.04. (Haven't tested with other Ubuntu versions.)
2. Install the latest IPFS, but don't do ipfs init.
3. Extract the gzip file as your ~/.ipfs
4. Put a big file in ~ (for example, the ipfs executable).
5. Do ipfs add --nocopy --raw-leaves <file>

For me, it gives the error "too many open files".

kevina · 2017-05-16T00:00:04Z

@whyrusleeping, okay I will look into upgrading the leveldb process and see how low I can get the ulimit.

whyrusleeping · 2017-05-16T00:22:56Z

@kevina just wanted to test my theory about batching, so I disabled it by making the batch code just write directly through to the db. It made the add take slightly longer to fail, but it still got a "too many open files" error nonetheless.

It's all tons of leveldb files. This feels like a leveldb bug to me...

kevina · 2017-05-16T02:49:45Z

Okay, I found the problem. The leveldb was not being properly compacted and there were too many "*.ldb" files (I counted over 4000), each of them open. I was able to fix the problem by calling:

db.CompactRange(util.Range{nil,nil})

after the database is opened in go-ds-leveldb.NewDatastore. This took a while when first called but then greatly sped up the add operation (I would guess by at least an order of magnitude, but I am not sure).

There are a bunch of parameters related to compaction, and I'm not sure why it is not being called automatically in our case.

On a possibly related note, we disable compression. I am not sure how much speed benefit this gives us, and it may also relate to the fact that the leveldb is not automatically compacting (but this is just a wild guess).

whyrusleeping · 2017-05-16T05:51:05Z

I remember we disabled compression back when we were writing all the block data into leveldb. We can probably try turning it back on and seeing how that works.

Though that has nothing to do with compaction as far as I can tell. It's worrying that compaction was never run... I think that happens as a background process. @jefft0 Do you run the ipfs daemon for long periods? Or do you mainly do ipfs add without the daemon running? It could be that the leveldb process just never gets time to run a compaction if the daemon is never on for long.

@kevina do you think we should just run compaction at daemon startup all the time?

jefft0 · 2017-05-16T08:13:34Z

> Do you run the ipfs daemon for long periods? Or do you mainly do ipfs add without the daemon running?

In my use case, I am doing the initial add --nocopy of all my files without running the daemon.

kevina · 2017-05-16T16:28:03Z

@whyrusleeping we could run compaction when we open the database (as opposed to just when starting the daemon). But I see that as a bit of a hack. Note also that the first compaction could take a very long time (over 10 minutes in this case).

whyrusleeping · 2017-05-16T17:00:22Z

Hrm... How about we find a nice way to add it to 'ipfs repo fsck' ?

whyrusleeping · 2017-05-16T18:30:58Z

@jefft0 Ah, that makes sense then. In your use case, leveldb was never able to successfully complete a compaction run.

jefft0 · 2017-05-16T18:35:21Z

@whyrusleeping Are you saying that if I start the daemon, it will do a compaction?

jefft0 · 2017-05-16T18:39:23Z

... I'll try it. But if memory serves, I was also getting the error when I was running the daemon, trying to serve the files that I had added.

whyrusleeping · 2017-05-16T18:40:17Z

@jefft0 well, in your rather special case, it will likely start a compaction after a few minutes, then fail because of too many open file descriptors. What I would try is raising your ulimit to something above 5000, then starting the daemon with --offline (so it doesn't use up any file descriptors making network connections) and waiting a bit.

It will likely take half an hour or so, since the compaction call @kevina used says "do nothing else, just compact", whereas the default background compaction tries not to get in the way too much.
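
For reference, the limit being discussed is the per-process RLIMIT_NOFILE. The sketch below, in Go, shows how a Unix process can inspect its own soft limit and raise it up to the hard limit using the standard syscall package. It is only an illustration: the function name ensureNoFile and the value 8192 are assumptions, and it changes the limit of the process that runs it, not of an already running ipfs. To launch ipfs itself with a higher limit, set it in the shell first (ulimit -n) as suggested above.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// ensureNoFile raises the soft RLIMIT_NOFILE of the current process to at
// least want, capped at the hard limit, and returns the resulting soft limit.
// Hypothetical helper for illustration; Unix-only.
func ensureNoFile(want uint64) (uint64, error) {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, err
	}
	if rl.Cur >= want {
		return rl.Cur, nil // already high enough, nothing to change
	}
	if want > rl.Max {
		want = rl.Max // an unprivileged process cannot exceed the hard limit
	}
	rl.Cur = want
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, err
	}
	return rl.Cur, nil
}

func main() {
	// 8192 is an arbitrary example value above the 5000 mentioned in the thread.
	got, err := ensureNoFile(8192)
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not adjust RLIMIT_NOFILE:", err)
		os.Exit(1)
	}
	fmt.Println("soft open-file limit is now", got)
}
```

If the hard limit is below the requested value, the returned number shows what was actually granted.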

jefft0 · 2017-05-16T18:46:06Z

Do I need to add this line of code?

db.CompactRange(util.Range{nil,nil})

whyrusleeping · 2017-05-16T18:52:24Z

@jefft0 no, you shouldn't have to. leveldb will do the compaction automatically in the background. Just make sure you run your daemon with a very high ulimit.

kevina · 2017-05-16T23:51:00Z

@whyrusleeping we could add it to ipfs repo fsck for when the compaction does not have a chance to run normally.

whyrusleeping · 2017-05-16T23:57:10Z

@kevina Yeah, i like that idea.

whyrusleeping · 2017-05-17T00:00:44Z

really, we just need to move away from leveldb. It has many issues. I have high hopes for badger, but i'm waiting on a few issues to be resolved: dgraph-io/badger#28

manishrjain · 2017-05-19T07:53:23Z

@whyrusleeping : We should be able to resolve them fairly quickly. ETA: week or so.

whyrusleeping · 2017-05-19T18:40:00Z

@manishrjain That's great news, thanks!

manishrjain · 2017-05-30T13:53:10Z

dgraph-io/badger#28 is now resolved.

whyrusleeping · 2017-09-02T19:01:00Z

@jefft0 has this been resolved? I'm fairly certain the compaction issue can be resolved by running ipfs daemon --offline with a very high (or unlimited) file descriptor limit for some time so that leveldb can compact successfully.

jefft0 · 2017-09-02T22:10:06Z

I would say yes, it's been resolved. Maybe I passed a threshold in having so many files that it always compacts. I dunno. But the simple fact is that it keeps adding files without error. Also, when I run ipfs daemon, I always increase the ulimit.

schomatis · 2018-05-03T17:41:26Z

I'm adding the badger label to close this issue when the transition to Badger is completed and this error won't happen for new repositories (unless this can be already closed).

kevina · 2018-05-03T17:59:08Z

@schomatis so I thought the move is to make badger the default, not the only option. Has this changed? Once badger is working, will we no longer support the original configuration with leveldb/flatfs? Or is the intent to move to badger/flatfs? I can see many advantages to still supporting flatfs and hope we won't abandon support for it completely.

schomatis · 2018-05-03T18:09:54Z

@kevina Sorry for the confusion, leveldb/flatfs will still be supported after the transition (AFAIK). What I meant to say is that, as this issue was raised for the scenario of a default installation,

> This is a fresh installation of the latest code in master.

this error won't happen after Badger is the default datastore (yes, it will happen if flatfs is specifically chosen for the repo profile). You're right that this issue is related to LevelDB and not Badger, but as it has been stale for several months already (and the original author already reported that the proposed solution is working), I thought it would be sensible to close it once we know this won't happen for new users.

jefft0 mentioned this issue Mar 7, 2017

Implement basic filestore 'no-copy' functionality #3629

Merged

whyrusleeping mentioned this issue Mar 8, 2017

go-reuseport needs to start using singlepoll on OSX #3762

Closed

Kubuxu added the kind/bug and need/analysis labels Mar 15, 2017

Kubuxu changed the title from "add --nocopy error: too many open files" to "LevelDB is using a lot of FDs during --nocopy add" Mar 15, 2017

kevina mentioned this issue May 17, 2017

Add option to compact leveldb to repo fsck #3927

Open

schomatis added the topic/badger label May 3, 2018

schomatis self-assigned this May 3, 2018

Stebalien unassigned schomatis and kevina Apr 22, 2021
