
Move account data to persistent storage #2279

Merged: 11 commits merged into solana-labs:master on Feb 27, 2019

Conversation

@sambley (Contributor) commented on Dec 26, 2018

Problem

RAM is much more expensive than SSD storage and adds significantly to the cost of operating a full node (16 GB of RAM costs about the same as 500 GB of high-speed NVMe SSD). Look into ways to reduce RAM usage by moving some of the account data onto SSDs and loading / storing it on demand.

Summary of Changes

Implements #2769

To help reduce the nodes' RAM usage, persist accounts on NVMe SSDs and load / store them from the SSDs on an as-needed basis.

Store account information across two files: Index and Data

  • Index: contains the offset into the data file
  • Data: contains the length followed by the account data

The accounts are split across NVMe SSDs using the pubkey as the key.
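
A minimal sketch of the two-file layout and pubkey partitioning described above (illustrative only; function names, the little-endian length prefix, and the first-byte partitioning rule are assumptions, not the PR's actual code):

```rust
use std::fs::File;
use std::io::{Result, Seek, SeekFrom, Write};

/// Append one account record to the data file as `len || bytes` and
/// return the offset that the index file records for this account.
fn store_account(index: &mut File, data: &mut File, account: &[u8]) -> Result<u64> {
    let offset = data.seek(SeekFrom::End(0))?;
    data.write_all(&(account.len() as u64).to_le_bytes())?;
    data.write_all(account)?;
    // The index file holds the data-file offset for this account.
    index.write_all(&offset.to_le_bytes())?;
    Ok(offset)
}

/// Pick a storage directory/drive from the pubkey, e.g. first byte mod drive count.
fn drive_for_pubkey(pubkey: &[u8; 32], num_drives: usize) -> usize {
    pubkey[0] as usize % num_drives
}
```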

TODOs:

  • The account data is currently stored across 2 directories and needs to be changed to use the NVMe drives
  • Look into the performance bottleneck from using persistent storage, and see whether the accounts could be partitioned across multiple directories to parallelize load and store operations
  • Error handling and removing stale files across runs

Snapshots and version numbering are not planned for this release.

Fixes #2499

@sambley added the "work in progress (This isn't quite right yet)" label on Dec 26, 2018
@aeyakovenko (Member) left a comment

Rad!!! Can @sakridge and @garious take a look as well?

@aeyakovenko (Member)

I think the thread count in the replay stage or the process-transactions stage would need to be equal to or higher than the queue-depth ('q') setting on the NVMe drives.

@sambley requested a review from garious on December 26, 2018 at 02:29
@sambley added the "noCI (Suppress CI on this Pull Request)" label on Dec 26, 2018
@sambley force-pushed the accountsio branch 3 times, most recently from a38dca6 to a37ab7d on December 27, 2018 at 06:59
@aeyakovenko (Member)

@sambley did you get a chance to test it on our 2-NVMe machine?

@garious (Contributor) commented on Jan 2, 2019

@sambley, the text you have under "Problem" in the PR description doesn't describe a problem. It's a summary of a solution. What problem are you solving precisely? At what point does the RAM usage of accounts affect a metric? Or what metric will improve if we merge this PR?

@aeyakovenko (Member)

> @sambley, the text you have under "Problem" in the PR description doesn't describe a problem. It's a summary of a solution. What problem are you solving precisely? At what point does the RAM usage of accounts affect a metric? Or what metric will improve if we merge this PR?

@garious the problem is that the cost of RAM is higher than SSDs. 16 GB of RAM is the same cost as 500 GB of high-speed NVMe, so about a 30x improvement in cost per full node. Multiply that by 16k fullnodes for an Ethereum-sized network.

@garious (Contributor) commented on Jan 3, 2019

@aeyakovenko, our cost per fullnode goal is 5k USD. What's the current cost? With a 30x improvement for this particular component, what does that drop the cost to?

@aeyakovenko (Member) commented on Jan 3, 2019

@garious cost per allocated byte would be roughly 0.000121875 U.S. dollars per byte for 15,000 nodes at $130 per 16 GB. But you can't really build systems with more than 128 GB per machine cheaply. Motherboards that support more RAM are either more expensive or not as flexible for GPUs and other components. It is about $15k for 1 TB of DDR per system, and that board doesn't support GPUs. So the price per byte is likely to be higher, since the maximum size the entire account space can take up is bounded by the smallest node in the supermajority (the desired finality set).
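
For reference, that figure works out as follows (assuming 16 GB is counted as 16 × 10⁹ bytes and the state is replicated on every node):

```latex
\frac{\$130}{16 \times 10^{9}\ \text{bytes}} \times 15{,}000\ \text{nodes}
  \approx 8.125 \times 10^{-9} \times 15{,}000
  = \$0.000121875\ \text{per byte, network-wide}
```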

@aeyakovenko (Member)

@garious at 128 GB per system, we end up with a maximum of about 1B accounts, and only if we can optimize the Account instance allocation to fit entirely into 128 bytes. It might be doable, but it is going to be hard.
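
The 1B figure is just the per-system capacity divided by the assumed per-account budget:

```latex
\frac{128 \times 10^{9}\ \text{bytes}}{128\ \text{bytes/account}} = 10^{9}\ \text{accounts}
```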

@sambley (Contributor, Author) commented on Jan 3, 2019

@anatoly, I tried it out on the 2-NVMe machine today and am seeing the average TPS be about twice as slow when the number of accounts is close to 100,000, and it seems to degrade quite drastically for a larger number of accounts. Still looking into what is causing the degradation.

@aeyakovenko (Member)

@sambley, that’s good data! Can you play around with the qd and IO settings?

@sakridge is journaling disabled?

@sambley (Contributor, Author) commented on Jan 3, 2019

@anatoly, yes I will play around with the settings

@aeyakovenko (Member)

@sambley, my GitHub handle is @aeyakovenko. I wish I was @anatoly :)

@garious (Contributor) commented on Jan 3, 2019

cc #1884

@aeyakovenko (Member)

@sambley, my guess is performance is better while the file system is in the Linux ram cache.

@garious (Contributor) commented on Jan 3, 2019

@sambley, does this PR assume SSDs are available? Or is there some way to get the original behavior when there are no SSDs available (like on a developer machine)?

@aeyakovenko (Member)

@garious we can figure out how to factor out the load/store to persistent storage. In theory, the Linux kernel caches the file system in RAM, so if the on-disk accounts fit in RAM it shouldn't be noticeable for developers.

@sakridge (Member) commented on Jan 4, 2019

> @sambley, that’s good data! Can you play around with the qd and IO settings?
>
> @sakridge is journaling disabled?

@aeyakovenko I haven't formatted the drives, @sambley should have access to the machine now though.

*edit: saw that he already ran the experiment, so it's a question for @sambley

@sambley (Contributor, Author) commented on Jan 4, 2019

@aeyakovenko, I formatted the drives for ext4, so journaling should be enabled. I have rewritten the implementation to use memory-mapped I/O, and that seems to perform on par for at least 100,000 accounts as expected. Will try it out for a larger number of accounts and see how it behaves.
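
For context, a minimal sketch of memory-mapped account writes using the memmap crate (whether this PR uses that crate, and the function name, are assumptions; this is not the PR's code):

```rust
// Cargo.toml (assumption): memmap = "0.7"
use memmap::MmapMut;
use std::fs::OpenOptions;
use std::io::Result;

/// Map the data file into memory and write an account record directly
/// through the mapping, avoiding a read/write syscall per access.
fn write_account_mmap(path: &str, offset: usize, account: &[u8]) -> Result<()> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(path)?;
    // Ensure the file is large enough to hold the record before mapping.
    file.set_len((offset + account.len()) as u64)?;
    let mut map = unsafe { MmapMut::map_mut(&file)? };
    map[offset..offset + account.len()].copy_from_slice(account);
    map.flush()?; // persist dirty pages to the SSD
    Ok(())
}
```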

@aeyakovenko (Member)

@sambley Awesome! What TPS are you seeing? Journaling might be significantly worse for writes in some cases. We would need to profile with both. I think there might be a bunch of parameters to tune there too.

@sambley (Contributor, Author) commented on Jan 5, 2019

@aeyakovenko, it's hitting only a mean TPS of 30K; I'll experiment with tuning the different parameters to see which provides better results.

@aeyakovenko (Member) commented on Jan 5, 2019

@sambley the spec for those drives lists 500,000 random writes per second at qd32.

How many reads and writes are we doing per tx? Can you profile a mmap file on those devices as well?

@aeyakovenko (Member)

@sambley, another thing to try would be to append any new accounts, like one file per bank thread, and store the file+index offset in the tree. Appending is well optimized by all the drivers and the hardware.
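
A minimal sketch of that append-only idea, assuming one append file per bank thread and an in-memory index mapping each pubkey to a (file id, offset) pair (type and field names are illustrative, not the PR's actual interface):

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{Result, Seek, SeekFrom, Write};

/// Where an account's latest version lives: which append file, and at what byte offset.
type Location = (usize, u64);

struct AppendStore {
    files: Vec<File>,                    // one append-only file per bank thread
    index: HashMap<[u8; 32], Location>,  // pubkey -> latest location
}

impl AppendStore {
    /// Append the serialized account to this thread's file and record where it landed.
    fn store(&mut self, thread: usize, pubkey: [u8; 32], account: &[u8]) -> Result<()> {
        let file = &mut self.files[thread];
        let offset = file.seek(SeekFrom::End(0))?;
        file.write_all(&(account.len() as u64).to_le_bytes())?;
        file.write_all(account)?;
        self.index.insert(pubkey, (thread, offset));
        Ok(())
    }
}
```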

@sambley (Contributor, Author) commented on Feb 26, 2019

@garious, updated the patch with your other review comments.
Just to clarify: you would like to fall back to the older accounts interface if paths is set to None, and only use the persistent-store changes if paths is set?
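
To make the question concrete, a sketch of the fallback selection being discussed, assuming a hypothetical backend enum (these names are not the PR's actual types): if `paths` is `None`, keep the in-memory hashmap behavior; otherwise use the persistent store.

```rust
use std::collections::HashMap;

/// Hypothetical account backend selected at startup.
enum AccountsBackend {
    /// Original behavior: everything lives in an in-memory map.
    InMemory(HashMap<[u8; 32], Vec<u8>>),
    /// New behavior: accounts are persisted under the given storage directories.
    Persistent { paths: Vec<String> },
}

fn new_backend(paths: Option<Vec<String>>) -> AccountsBackend {
    match paths {
        None => AccountsBackend::InMemory(HashMap::new()),
        Some(paths) => AccountsBackend::Persistent { paths },
    }
}
```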

@garious (Contributor) left a comment

This is an amazing contribution. Thanks so much!

@garious (Contributor) commented on Feb 26, 2019

@sakridge, still waiting for changes?

@sambley force-pushed the accountsio branch 2 times, most recently from beb73fd to 560d8a0 on February 26, 2019 at 17:44
@sakridge (Member)

@garious it doesn't do the fallback to the hashmap-only implementation. It also takes over Bank::id with another value; I'm not sure if that's safe, because I don't know exactly what we were using that for before.

@sakridge (Member)

@sambley I was trying to rebase this change and found it doesn't pass this test, which seems to cause it to fail one of the integration tests: #2961

@sambley (Contributor, Author) commented on Feb 27, 2019

> @sambley I was trying to rebase this change and found it doesn't pass this test, which seems to cause it to fail one of the integration tests: #2961

I am working on adding the fallback mechanism to the hashmap implementation and will fix the test failure as well.

@sakridge (Member)

@sambley sweet, thanks!

@sakridge (Member)

@sambley we also decided the fallback to the hashmap-only implementation is not necessary today, so prioritize that last.

@sambley force-pushed the accountsio branch 2 times, most recently from 218a25b to 5ef7c3a on February 27, 2019 at 06:33
@sambley (Contributor, Author) commented on Feb 27, 2019

> @sambley we also decided the fallback to the hashmap-only implementation is not necessary today, so prioritize that last.

@sakridge, fixed the test failures. Let me know if we want to merge this and work on the fallback mechanism or any pending issues in a separate PR.

@garious (Contributor) commented on Feb 27, 2019

This looks good to me. I'm okay with this being merged if @sakridge approves. The quantity of unsafe calls definitely makes me uncomfortable about not having a fallback, so I hope we can get a follow-up PR shortly after. That PR doesn't necessarily have to add a fallback, though; finding a way to eliminate all (or at least most) of the unsafe calls might be sufficient. What's important to me is that if some component in Solana causes a seemingly random crash, there's some build variation where the compiler can guarantee that's not possible. That means all Rust, only safe uses of unsafe (each instance needs to be manually reasoned through), and no C/C++ libs under the hood. We're not there yet, and this PR moves us in the opposite direction because of those unsafe, unaligned memory accesses.

@sakridge (Member) left a comment

Seems good to me.

@sakridge sakridge merged commit ca0f16c into solana-labs:master Feb 27, 2019
brooksprumo added a commit to brooksprumo/solana that referenced this pull request Jul 25, 2024