LevelDB is using a lot of FDs duing --nocopy add #3763

Open
jefft0 opened this issue Mar 7, 2017 · 58 comments
Labels
kind/bug A bug in existing code (including security flaws) need/analysis Needs further analysis before proceeding topic/badger Topic badger

Comments

@jefft0
Contributor

jefft0 commented Mar 7, 2017

Version information:

go-ipfs version: 0.4.6-
Repo version: 5
System version: amd64/linux
Golang version: go1.8

Type: bug

Priority: P1

Description:

With a script, I used ipfs add --nocopy --raw-leaves to individually add 9110 webcam video files, each about 200 MB. (I did not use recursive add.) This is a fresh installation of the latest code in master. I did not start the daemon. Now, even after restarting my computer, when I try to add --nocopy another file, I get the error:

open /home/jeff/.ipfs/datastore/206401.ldb: too many open files

The .ipfs/blocks folder has 73689 .data files, and the .ipfs/datastore folder has 4314 .ldb files.

Maybe ipfs add --nocopy is trying to open all the .ldb files at once?

@whyrusleeping
Member

Hey @jefft0 could you try out #3756 and see if it solved the problem?

@jefft0
Contributor Author

jefft0 commented Mar 8, 2017

I see you merged the pull request. I pulled and re-built from master, but still get the same error.

@whyrusleeping
Member

@jefft0 interesting... can you help debug this for us?

There's a program I use for this called lsof (it should be in most Linux package repos).

Start up your daemon and get its PID. Then do whatever it is that you do to reproduce this problem, and when it's about to happen, grab the output of lsof -p $IPFS_PID.

It might help to grab it a few times: before, during (as close as you can get to the error) and after. With this info we should be able to figure out where those file descriptors are coming from and fix the bug.
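As a rough aid for reading such lsof captures, the following Go sketch tallies open files by extension, so a pile of .ldb entries stands out. It assumes the file name is the last whitespace-separated field of each line and that only absolute paths are of interest. The function name tallyByExt and the file name tally.go are made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// tallyByExt counts open files per extension in lsof-style output lines.
// Assumption: the file name is the last whitespace-separated field, so
// paths containing spaces are skipped or misattributed.
func tallyByExt(lines []string) map[string]int {
	counts := make(map[string]int)
	for _, line := range lines {
		fields := strings.Fields(line)
		if len(fields) == 0 || fields[0] == "COMMAND" {
			continue // blank line or the lsof header row
		}
		name := fields[len(fields)-1]
		if !strings.HasPrefix(name, "/") {
			continue // sockets, pipes and other non-path entries
		}
		ext := filepath.Ext(name)
		if ext == "" {
			ext = "(no extension)"
		}
		counts[ext]++
	}
	return counts
}

func main() {
	var lines []string
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}

	counts := tallyByExt(lines)
	exts := make([]string, 0, len(counts))
	for ext := range counts {
		exts = append(exts, ext)
	}
	// Largest counts first, ties broken alphabetically.
	sort.Slice(exts, func(i, j int) bool {
		if counts[exts[i]] != counts[exts[j]] {
			return counts[exts[i]] > counts[exts[j]]
		}
		return exts[i] < exts[j]
	})
	for _, ext := range exts {
		fmt.Printf("%6d  %s\n", counts[ext], ext)
	}
}
```

One possible way to use it: lsof -p $IPFS_PID | go run tally.go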

@jefft0
Contributor Author

jefft0 commented Mar 8, 2017

See the attached output files from lsof.
lsof-before.txt
lsof-during.txt
lsof-after.txt

@Kubuxu
Member

Kubuxu commented Mar 8, 2017

LevelDB is using 500 FDs, this shouldn't happen.

@whyrusleeping
Member

what the hell...

@jefft0
Contributor Author

jefft0 commented Mar 10, 2017

Any ideas on why so many file descriptors?

@whyrusleeping
Member

@jefft0 I've tried to reproduce the issue several times now with no luck. What sort of disks do you have? And is there anything unusual about your setup (different filesystems, mountpoints, or OS tweaks that might affect this)?

@jefft0
Contributor Author

jefft0 commented Mar 13, 2017

My files are in an 8-drive JBOD housing over USB. I'm using the default mount point for Ubuntu, but through symbolic links in the .ipfs folder (as I described elsewhere to meet the security requirement).

@whyrusleeping
Member

@jefft0 and your .ipfs directory (the datastore and blocks directories) is stored on a normal disk? (SSD or spinner?)

@jefft0
Contributor Author

jefft0 commented Mar 13, 2017

Yes, they're on my SSD system drive.

@jefft0
Contributor Author

jefft0 commented Mar 13, 2017

(My default home directory)

@whyrusleeping
Member

@jefft0 Okay, I'll run some experiments with the data being on a mounted drive.

In the meantime, if you don't mind, could you grab a stack dump from your daemon while the file descriptor usage is really high?

curl localhost:5001/debug/pprof/goroutine\?debug=2 > ipfs.stacks

I'm curious to see if there are tons of concurrent processes hitting leveldb for some reason.
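To get a quick number out of such a stack dump, a small Go sketch like the one below could count how many goroutines have a given substring (for example "leveldb") anywhere in their stack. It assumes the debug=2 layout in which goroutine records are separated by blank lines and each begins with "goroutine ". The function name countGoroutines is made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// countGoroutines returns the number of goroutine records in a stack dump
// and how many of them contain needle somewhere in their stack.
// Assumption: records are separated by blank lines and each one starts
// with "goroutine ".
func countGoroutines(dump, needle string) (total, matching int) {
	dump = strings.ReplaceAll(dump, "\r\n", "\n")
	for _, block := range strings.Split(dump, "\n\n") {
		block = strings.TrimSpace(block)
		if !strings.HasPrefix(block, "goroutine ") {
			continue // not a goroutine record
		}
		total++
		if strings.Contains(block, needle) {
			matching++
		}
	}
	return total, matching
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: <program> <dump-file> <substring>")
		os.Exit(2)
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}
	total, matching := countGoroutines(string(data), os.Args[2])
	fmt.Printf("%d of %d goroutines mention %q\n", matching, total, os.Args[2])
}
```

Run against the ipfs.stacks file with "leveldb" as the substring, a high matching count would support the concurrency theory, while a low one would point elsewhere.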

@jefft0
Contributor Author

jefft0 commented Mar 14, 2017

Attached.
ipfs.stacks.txt

@Kubuxu Kubuxu added kind/bug A bug in existing code (including security flaws) need/analysis Needs further analysis before proceeding labels Mar 15, 2017
@Kubuxu Kubuxu changed the title add --nocopy error: too many open files LevelDB is using a lot of FDs duing --nocopy add Mar 15, 2017
@jefft0
Contributor Author

jefft0 commented Mar 19, 2017

@whyrusleeping Here's another data point. I tried the same setup on macOS 10.12. I'm not running the daemon as I add --nocopy each file. It's easy to get the PID and run lsof because it's getting slower and slower as I add each file. It now takes over a minute to add one 200MB file.

Anyway, attached is the lsof output when adding a file. Why is it necessary to open 363 LDB files at the same time? Is each one opened in a goroutine that maybe doesn't close and terminate?
lsof-no-daemon.txt

@jefft0
Contributor Author

jefft0 commented Mar 31, 2017

Is add --nocopy supported in js-ipfs on Ubuntu? Do you think it's worth trying to add lots of files with js-ipfs to see if its use of LevelDB doesn't run out of file descriptors?

@Kubuxu
Member

Kubuxu commented Mar 31, 2017

I don't think it is.

@whyrusleeping
Member

@jefft0 I have no idea why leveldb is behaving that way; it doesn't do so on my machines.

Just seeing your lsof output now

@jefft0
Contributor Author

jefft0 commented Apr 1, 2017

@whyrusleeping On the Ubuntu machine you tested, what version of LevelDB? How was it installed?

@jefft0
Contributor Author

jefft0 commented Apr 7, 2017

I am able to reproduce the error in a fresh Ubuntu virtual machine (both VirtualBox and Parallels) with a fresh installation of the latest go-ipfs. I transfer a 700MB gzip file and expand it as the new installation's ~/.ipfs . Then, using ipfs add --nocopy --raw-leaves gives the error "too many open files".

So, if someone wants to reproduce and debug this issue, I can get you the 700MB gzip file. (You don't need the original files that were added.) Would that be useful?

@jefft0
Contributor Author

jefft0 commented Apr 7, 2017

... or better yet, make a virtual machine for you to log into?

@whyrusleeping
Member

@jefft0 Yeah, getting that gzip file would be great!

@jefft0
Contributor Author

jefft0 commented Apr 7, 2017

See ipfs-too-many-fds.tar.gz.

@whyrusleeping
Member

@jefft0 the archive you sent was an ipfs repo. Was your issue with adding an ipfs repo into ipfs? Or did you mean to zip up something else?

@jefft0
Contributor Author

jefft0 commented Apr 10, 2017

That's what I meant to send. (Don't worry about the private key. It was generated just for this test.)

Steps to reproduce:

  • An installation of Ubuntu 16.04. (Haven't tested with other Ubuntu versions.)
  • Install the latest IPFS, but don't do ipfs init.
  • Extract the gzip file as your ~/.ipfs
  • Put a big file in ~ (for example, the ipfs executable).
  • Do ipfs add --nocopy --raw-leaves <file>

For me, it gives the error "too many open files".

@kevina
Contributor

kevina commented May 16, 2017

@whyrusleeping, okay I will look into upgrading the leveldb process and see how low I can get the ulimit.

@whyrusleeping
Member

@kevina just wanted to test my theory about batching, so I disabled it by making the batch code just write directly through to the db. It made the add take slightly longer to fail, but it still hit the "too many open files" error nonetheless.

It's all tons of leveldb files. This feels like a leveldb bug to me...

@kevina
Contributor

kevina commented May 16, 2017

Okay, I found the problem. The leveldb was not being properly compacted and there were too many "*.ldb" files (I counted over 4000), each of them open. I was able to fix the problem by calling:

db.CompactRange(util.Range{nil,nil})

after the database is opened in go-ds-leveldb.NewDatastore. This took a while when first called but then greatly sped up the add operation (I would guess by at least an order of magnitude, but I am not sure).

There are a bunch of parameters related to compaction, and I'm not sure why it is not being triggered automatically in our case.

On a possibly related note, we disable compression. I am not sure how much speed benefit this gives us, and it may also relate to the fact that leveldb is not automatically compacting (but this is just a wild guess).

@whyrusleeping
Member

I remember we disabled compression back when we were writing all the block data into leveldb. We can probably try turning it back on and seeing how that works.

Though that has nothing to do with compaction as far as I can tell. It's worrying that compaction was never run... I think that happens as a background process. @jefft0 Do you run the ipfs daemon for long periods? Or do you mainly do ipfs add without the daemon running? It could be that the leveldb process just never gets time to run a compaction if the daemon is never on for long.

@kevina do you think we should just run compaction at daemon startup all the time?

@jefft0
Contributor Author

jefft0 commented May 16, 2017

Do you run the ipfs daemon for long periods? Or do you mainly do ipfs add without the daemon running?

In my use case, I am doing the initial add --nocopy of all my files without running the daemon.

@kevina
Contributor

kevina commented May 16, 2017

@whyrusleeping we could run compaction when we open the database (as opposed to just starting the daemon). But I see that as a bit of a hack. Note also that the first compaction could take a very long time (over 10 minutes in this case).
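One way to avoid paying the compaction cost on every open would be to compact only when the datastore directory holds an unusually large number of .ldb files. The Go sketch below shows just that decision and leaves the actual compaction call to the caller. It is an illustration, not go-ipfs or goleveldb behaviour: the names countLDB and shouldCompact, the -dir and -threshold flags, and the default threshold of 1000 are all assumptions.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"path/filepath"
)

// countLDB returns how many of the given file names end in ".ldb".
func countLDB(names []string) int {
	n := 0
	for _, name := range names {
		if filepath.Ext(name) == ".ldb" {
			n++
		}
	}
	return n
}

// shouldCompact reports whether the number of table files exceeds the
// (hypothetical) threshold above which a manual compaction seems worthwhile.
func shouldCompact(ldbFiles, threshold int) bool {
	return ldbFiles > threshold
}

func main() {
	dir := flag.String("dir", "", "path to the LevelDB datastore directory")
	threshold := flag.Int("threshold", 1000, "arbitrary cut-off for suggesting compaction")
	flag.Parse()
	if *dir == "" {
		fmt.Fprintln(os.Stderr, "missing -dir")
		os.Exit(2)
	}

	entries, err := os.ReadDir(*dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot list directory:", err)
		os.Exit(1)
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		if !e.IsDir() {
			names = append(names, e.Name())
		}
	}

	n := countLDB(names)
	fmt.Printf("%d .ldb files, compaction suggested: %v\n", n, shouldCompact(n, *threshold))
}
```

With the 4314 .ldb files reported at the top of this issue, the check would suggest compaction. A freshly compacted datastore would normally fall below the cut-off and skip the slow path.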

@whyrusleeping
Member

whyrusleeping commented May 16, 2017 via email

@whyrusleeping
Member

@jefft0 Ah, that makes sense then. In your use case, leveldb was never able to successfully complete a compaction run.

@jefft0
Contributor Author

jefft0 commented May 16, 2017

@whyrusleeping Are you saying that if I start the daemon, it will do a compaction?

@jefft0
Contributor Author

jefft0 commented May 16, 2017

... I'll try it. But if memory serves, I was also getting the error when I was running the daemon, trying to serve the files that I had added.

@whyrusleeping
Member

@jefft0 well, in your rather special case, it will likely start a compaction after a few minutes, then fail because of too many open file descriptors. What I would try is raising your ulimit to something above 5000, and then starting the daemon with --offline (so it doesn't use up any file descriptors making network connections) and waiting a bit.

It will likely take half an hour or so, since the compaction call @kevina used says "do nothing else, just compact", whereas the default background compaction tries not to get in the way too much.
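The advice above is to raise the limit in the shell with ulimit before starting the daemon. For reference, a Go program can also raise its own soft file-descriptor limit, as in this minimal sketch. It assumes Linux or macOS, where syscall.Rlimit uses unsigned 64-bit fields. The function name raiseFDLimit is made up, and nothing here is taken from the go-ipfs code. The new limit applies only to the calling process and its children, and on macOS the kernel may reject values above its per-process maximum.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// raiseFDLimit raises the soft RLIMIT_NOFILE of the current process to want,
// capped at the hard limit, and returns the resulting soft limit.
// It never lowers an already higher soft limit.
func raiseFDLimit(want uint64) (uint64, error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, err
	}
	if lim.Cur >= want {
		return lim.Cur, nil // already high enough
	}
	target := want
	if target > lim.Max {
		target = lim.Max // an unprivileged process cannot exceed the hard limit
	}
	lim.Cur = target
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, err
	}
	return lim.Cur, nil
}

func main() {
	got, err := raiseFDLimit(5000)
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not raise limit:", err)
		os.Exit(1)
	}
	fmt.Println("soft open-file limit is now", got)
}
```

If the hard limit is below the requested value, the sketch settles for the hard limit instead of failing, so the caller should check the returned number.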

@jefft0
Contributor Author

jefft0 commented May 16, 2017

Do I need to add this line of code?

db.CompactRange(util.Range{nil,nil})

@whyrusleeping
Member

@jefft0 no, you shouldn't have to. leveldb will do the compaction automatically in the background. Just make sure you run your daemon with a very high ulimit.

@kevina
Contributor

kevina commented May 16, 2017

@whyrusleeping we could add it to ipfs fsck for when the compaction does not have a chance to run normally.

@whyrusleeping
Member

@kevina Yeah, I like that idea.

@whyrusleeping
Member

Really, we just need to move away from leveldb. It has many issues. I have high hopes for badger, but I'm waiting on a few issues to be resolved: dgraph-io/badger#28

@manishrjain

@whyrusleeping : We should be able to resolve them fairly quickly. ETA: week or so.

@whyrusleeping
Member

@manishrjain That's great news, thanks!

@manishrjain

dgraph-io/badger#28 is now resolved.

@whyrusleeping
Member

@jefft0 has this been resolved? I'm fairly certain the compaction issue can be resolved by running ipfs daemon --offline with a very high (or unlimited) file descriptor limit for some time so that leveldb can compact successfully.

@jefft0
Contributor Author

jefft0 commented Sep 2, 2017

I would say yes, it's been resolved. Maybe I passed a threshold in having so many files that it always compacts. I dunno. But the simple fact is that it keeps adding files without error. Also, when I run ipfs daemon, I always increase the ulimit.

@schomatis
Contributor

I'm adding the badger label to close this issue when the transition to Badger is completed and this error won't happen for new repositories (unless this can be already closed).

@schomatis schomatis added the topic/badger Topic badger label May 3, 2018
@schomatis schomatis self-assigned this May 3, 2018
@kevina
Contributor

kevina commented May 3, 2018

@schomatis so I thought the move is to make badger the default, not the only option. Has this changed? Once badger is working, will we no longer support the original configuration with leveldb/flatfs? Or is the intent to move to badger/flatfs? I can see many advantages to still supporting flatfs and hope we won't abandon support for it completely.

@schomatis
Contributor

@kevina Sorry for the confusion, leveldb/flatfs will still be supported after the transition (AFAIK), what I meant to say is that, as this issue was raised for the scenario of a default installation,

This is a fresh installation of the latest code in master.

this error won't happen after Badger is the default datastore (yes, it will happen if flatfs is specifically chosen for the repo profile). You're right that this issue is related to LevelDB and not Badger, but as this issue has been stale for several months already (and the original author already reported that the proposed solution is working) I thought it would be sensible to close it once we know this won't happen for new users.
