More efficient file storage #84

Open
btrask opened this issue Aug 9, 2015 · 2 comments

btrask commented Aug 9, 2015

Surprisingly, our submission bottleneck is the file system, not the database.

Process for adding a file (see the sketch after the list):

  1. Write it to a temporary location
  2. fsync the file
  3. Atomic rename (actually link(2)) to final location
  4. Open the parent directory
  5. fsync the directory
  6. Close the directory
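
For concreteness, a minimal sketch of those six steps in C, assuming POSIX; `add_file` and its path arguments are hypothetical names, not the actual implementation:

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

int add_file(const char *tmppath, const char *dstpath, const char *dstdir,
             const void *buf, size_t len)
{
	/* 1. Write the data to a temporary location. */
	int fd = open(tmppath, O_WRONLY | O_CREAT | O_EXCL, 0644);
	if (fd < 0) return -1;
	if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }

	/* 2. fsync the file so its contents are durable before it
	   becomes visible at the final path. */
	if (fsync(fd) < 0) { close(fd); return -1; }
	close(fd);

	/* 3. "Atomic rename": link(2) the temp name to the final
	   location, then drop the temp name. Unlike rename(2), link(2)
	   fails if dstpath already exists, which suits
	   content-addressed storage. */
	if (link(tmppath, dstpath) < 0) return -1;
	unlink(tmppath);

	/* 4-6. Open the parent directory, fsync it so the new directory
	   entry itself is durable, then close it. */
	int dirfd = open(dstdir, O_RDONLY | O_DIRECTORY);
	if (dirfd < 0) return -1;
	if (fsync(dirfd) < 0) { close(dirfd); return -1; }
	close(dirfd);
	return 0;
}
```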

We don't do any batching, and can't really, because the directories accessed are effectively random (they're chosen by the first byte of the content hash).

Surprisingly again, there's no reason for our on-disk file representation to actually use content addresses. Once we look up the file info in the database, which we always have to do anyway, we might as well use the file ID or some other sequential ID to access the file.

I'm thinking we could write files to a spread of ~100 directories in batches of 1000. So files 1-1000 go in directory 1, 1001-2000 in directory 2, etc. Then it wraps back around, so files 100,001-101,000 land in directory 1 again.
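
A sketch of that mapping, assuming 1-based IDs and the batch size and directory count above (`dir_for_id` is a hypothetical name):

```c
#define BATCH_SIZE 1000 /* files per directory batch */
#define NUM_DIRS   100  /* directories in the spread */

/* Map a sequential file ID to its directory: IDs 1-1000 -> dir 1,
   1001-2000 -> dir 2, ..., wrapping after NUM_DIRS * BATCH_SIZE files,
   so ID 100001 maps back to dir 1. */
static unsigned dir_for_id(unsigned long long id)
{
	return (unsigned)(((id - 1) / BATCH_SIZE) % NUM_DIRS) + 1;
}
```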

For batch submissions (hopefully most submissions, depending on #1), this should cut the per-file syscall overhead from roughly five down to two (the file fsync and the link), plus three for the whole batch (directory open, fsync, and close), since every file in a batch lands in the same directory. And the number of fsyncs would be cut from two per file to one.
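
A sketch of the batched tail end, assuming the temp files were already written and their descriptors kept open; `add_batch` and its arguments are hypothetical:

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Finish a batch of already-written temp files that all map to the same
   destination directory. Per file: fsync + link (plus close/unlink
   housekeeping). Per batch: one directory open/fsync/close. */
int add_batch(const char *dir, int fds[], const char *tmppaths[],
              unsigned long long first_id, size_t count)
{
	char dst[4096];

	for (size_t i = 0; i < count; i++) {
		if (fsync(fds[i]) < 0) return -1; /* the one per-file fsync */
		close(fds[i]);
		snprintf(dst, sizeof(dst), "%s/%llu", dir, first_id + i);
		if (link(tmppaths[i], dst) < 0) return -1;
		unlink(tmppaths[i]);
	}

	/* One directory fsync covers every new entry in the batch. */
	int dirfd = open(dir, O_RDONLY | O_DIRECTORY);
	if (dirfd < 0) return -1;
	if (fsync(dirfd) < 0) { close(dirfd); return -1; }
	close(dirfd);
	return 0;
}
```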


btrask commented Aug 23, 2015

A problem with the above idea of using sequential IDs instead of hashes for internal storage is that we can end up with junk in the file system when a transaction rolls back. Coordinating transactions with the file system so they both get rolled back atomically is hard. (With content addresses a leftover file is mostly harmless, since a retried submission of the same content lands at the same path and can reuse it; an orphaned sequential-ID file is pure junk.)


btrask commented Aug 23, 2015

BTW there is also the idea of storing small files directly in the database. I don't know the exact tipping point where it becomes worth it, but most file systems use 4 KB blocks, and LSM-trees suffer with large blobs during compaction. A reasonable threshold might be anywhere between 128 B and 2 KB.
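
As a sketch, the choice could be a simple size cutoff; the 1 KB value here is just a guess inside that range, and `choose_storage` is a hypothetical name:

```c
#include <stddef.h>

#define INLINE_THRESHOLD 1024 /* hypothetical cutoff in the 128 B - 2 KB range */

enum blob_storage { BLOB_INLINE, BLOB_EXTERNAL };

/* Small blobs go straight into the database; anything above the
   threshold is written to the file system as described above. */
static enum blob_storage choose_storage(size_t len)
{
	return len <= INLINE_THRESHOLD ? BLOB_INLINE : BLOB_EXTERNAL;
}
```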
