
Move account data to persistent storage #2279

Merged: 11 commits merged into solana-labs:master on Feb 27, 2019

Conversation

@sambley (Contributor) commented on Dec 26, 2018

Problem

RAM is much more expensive than SSD storage and adds significantly to the cost of operating a full node (16 GB of RAM costs about the same as 500 GB of high-speed NVMe SSD). Look into ways to reduce RAM usage by moving some of the account data onto SSDs and loading / storing it on demand.

Summary of Changes

Implements #2769

To help reduce the nodes' RAM usage, persist accounts on NVMe SSDs and load / store them from the SSDs on an as-needed basis.

Store account information across two files: Index and Data

  • Index: contains the offset into the data file
  • Data: contains the length followed by the account data

The accounts are split across NVMe SSDs using the pubkey as the key.
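
A minimal sketch of the two-file layout and pubkey partitioning described above (illustrative only; function names, the little-endian length prefix, and the first-byte partitioning rule are assumptions, not the PR's actual code):

```rust
use std::fs::File;
use std::io::{Result, Seek, SeekFrom, Write};

/// Append one account record to the data file as `len || bytes` and
/// return the offset that the index file records for this account.
fn store_account(index: &mut File, data: &mut File, account: &[u8]) -> Result<u64> {
    let offset = data.seek(SeekFrom::End(0))?;
    data.write_all(&(account.len() as u64).to_le_bytes())?;
    data.write_all(account)?;
    // The index file holds the data-file offset for this account.
    index.write_all(&offset.to_le_bytes())?;
    Ok(offset)
}

/// Pick a storage directory/drive from the pubkey, e.g. first byte mod drive count.
fn drive_for_pubkey(pubkey: &[u8; 32], num_drives: usize) -> usize {
    pubkey[0] as usize % num_drives
}
```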

TODOs:

  • The account data is currently stored across 2 directories and needs to be changed to use the NVMe drives
  • Look into the performance bottleneck from using persistent storage, and see whether the accounts could be partitioned across multiple directories to parallelize load and store operations
  • Error handling and removing stale files across runs

Snapshots and version numbering are not planned for this release.

Fixes #2499

@sambley added the "work in progress (This isn't quite right yet)" label on Dec 26, 2018
@aeyakovenko (Member) left a comment

Rad!!! Can @sakridge and @garious take a look as well?

@aeyakovenko (Member)

I think the thread count in the replay stage or the process-transactions stage would need to be equal to or higher than the queue-depth ('q') setting on the NVMe drives.

@sambley requested a review from garious on December 26, 2018 at 02:29
@sambley added the "noCI (Suppress CI on this Pull Request)" label on Dec 26, 2018
@sambley force-pushed the accountsio branch 3 times, most recently from a38dca6 to a37ab7d on December 27, 2018 at 06:59
@aeyakovenko (Member)

@sambley did you get a chance to test it on our 2-NVMe machine?

@garious (Contributor) commented on Jan 2, 2019

@sambley, the text you have under "Problem" in the PR description doesn't describe a problem. It's a summary of a solution. What problem are you solving precisely? At what point does the RAM usage of accounts affect a metric? Or what metric will improve if we merge this PR?

@aeyakovenko (Member)

> @sambley, the text you have under "Problem" in the PR description doesn't describe a problem. It's a summary of a solution. What problem are you solving precisely? At what point does the RAM usage of accounts affect a metric? Or what metric will improve if we merge this PR?

@garious the problem is that the cost of RAM is higher than SSDs. 16 GB of RAM is the same cost as 500 GB of high-speed NVMe, so about a 30x improvement in cost per full node. Multiply that by 16k fullnodes for an Ethereum-sized network.

@garious (Contributor) commented on Jan 3, 2019

@aeyakovenko, our cost per fullnode goal is 5k USD. What's the current cost? With a 30x improvement for this particular component, what does that drop the cost to?

@aeyakovenko (Member) commented on Jan 3, 2019

@garious cost per allocated byte would be roughly 0.000121875 U.S. dollars per byte for 15,000 nodes at $130 per 16 GB. But you can't really build systems with more than 128 GB per machine cheaply. Motherboards that support more RAM are either more expensive or not as flexible for GPUs and other components. It is about $15k for 1 TB of DDR per system, and that board doesn't support GPUs. So the price per byte is likely to be higher, since the maximum size the entire account space can take up is bounded by the smallest node in the supermajority (the desired finality set).
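
For reference, that figure works out as follows (assuming 16 GB is counted as 16 × 10⁹ bytes and the state is replicated on every node):

```latex
\frac{\$130}{16 \times 10^{9}\ \text{bytes}} \times 15{,}000\ \text{nodes}
  \approx 8.125 \times 10^{-9} \times 15{,}000
  = \$0.000121875\ \text{per byte, network-wide}
```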

@aeyakovenko (Member)

@garious at 128 GB per system, we end up with a maximum of about 1B accounts, and only if we can optimize the Account instance allocation to fit entirely into 128 bytes. It might be doable, but it is going to be hard.
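
The 1B figure is just the per-system capacity divided by the assumed per-account budget:

```latex
\frac{128 \times 10^{9}\ \text{bytes}}{128\ \text{bytes/account}} = 10^{9}\ \text{accounts}
```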

@sambley (Contributor, Author) commented on Jan 3, 2019

@anatoly, I tried it out on the 2-NVMe machine today and am seeing the average TPS be about twice as slow when the number of accounts is close to 100,000, and it seems to degrade quite drastically for a larger number of accounts. Still looking into what is causing the degradation.

@aeyakovenko (Member)

@sambley, that’s good data! Can you play around with the qd and IO settings?

@sakridge is journaling disabled?

@sambley (Contributor, Author) commented on Jan 3, 2019

@anatoly, yes I will play around with the settings

@aeyakovenko (Member)

@sambley, my GitHub handle is @aeyakovenko. I wish I was @anatoly :)

@garious (Contributor) commented on Jan 3, 2019

cc #1884

@aeyakovenko (Member)

@sambley, my guess is performance is better while the file system is in the Linux ram cache.

@garious (Contributor) commented on Jan 3, 2019

@sambley, does this PR assume SSDs are available? Or is there some way to get the original behavior when there are no SSDs available (like on a developer machine)?

@aeyakovenko (Member)

@garious we can figure out how to factor out the load/store to persistent storage. In theory, the Linux kernel caches the file system in RAM, so if the on-disk accounts fit in RAM it shouldn't be noticeable for developers.

@sakridge (Member) commented on Jan 4, 2019

> @sambley, that’s good data! Can you play around with the qd and IO settings?
>
> @sakridge is journaling disabled?

@aeyakovenko I haven't formatted the drives, @sambley should have access to the machine now though.

*edit: saw that he already ran the experiment, so it's a question for @sambley

@sambley (Contributor, Author) commented on Jan 4, 2019

@aeyakovenko, I formatted the drives for ext4, so journaling should be enabled. I have rewritten the implementation to use memory-mapped I/O, and that seems to perform on par for at least 100,000 accounts as expected. Will try it out for a larger number of accounts and see how it behaves.
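
For context, a minimal sketch of memory-mapped account writes using the memmap crate (whether this PR uses that crate, and the function name, are assumptions; this is not the PR's code):

```rust
// Cargo.toml (assumption): memmap = "0.7"
use memmap::MmapMut;
use std::fs::OpenOptions;
use std::io::Result;

/// Map the data file into memory and write an account record directly
/// through the mapping, avoiding a read/write syscall per access.
fn write_account_mmap(path: &str, offset: usize, account: &[u8]) -> Result<()> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(path)?;
    // Ensure the file is large enough to hold the record before mapping.
    file.set_len((offset + account.len()) as u64)?;
    let mut map = unsafe { MmapMut::map_mut(&file)? };
    map[offset..offset + account.len()].copy_from_slice(account);
    map.flush()?; // persist dirty pages to the SSD
    Ok(())
}
```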

@aeyakovenko (Member)

@sambley Awesome! What TPS are you seeing? Journaling might be significantly worse for writes in some cases. We would need to profile with both. I think there might be a bunch of parameters to tune there too.

@sambley (Contributor, Author) commented on Jan 5, 2019

@aeyakovenko, it's hitting only a mean TPS of 30K; I'll experiment with tuning the different parameters to see which provides better results.

@aeyakovenko (Member) commented on Jan 5, 2019

@sambley the spec for those drives lists 500,000 random writes per second at qd32.

How many reads and writes are we doing per tx? Can you profile a mmap file on those devices as well?

@aeyakovenko (Member)

@sambley, another thing to try would be to append any new accounts, like one file per bank thread, and store the file+index offset in the tree. Appending is well optimized by all the drivers and the hardware.
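
A minimal sketch of that append-only idea, assuming one append file per bank thread and an in-memory index mapping each pubkey to a (file id, offset) pair (type and field names are illustrative, not the PR's actual interface):

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{Result, Seek, SeekFrom, Write};

/// Where an account's latest version lives: which append file, and at what byte offset.
type Location = (usize, u64);

struct AppendStore {
    files: Vec<File>,                    // one append-only file per bank thread
    index: HashMap<[u8; 32], Location>,  // pubkey -> latest location
}

impl AppendStore {
    /// Append the serialized account to this thread's file and record where it landed.
    fn store(&mut self, thread: usize, pubkey: [u8; 32], account: &[u8]) -> Result<()> {
        let file = &mut self.files[thread];
        let offset = file.seek(SeekFrom::End(0))?;
        file.write_all(&(account.len() as u64).to_le_bytes())?;
        file.write_all(account)?;
        self.index.insert(pubkey, (thread, offset));
        Ok(())
    }
}
```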

@sambley (Contributor, Author) commented on Feb 26, 2019

@garious, updated the patch with your other review comments.
Just to clarify: you would like to fall back to the older accounts interface if paths is set to None, and only use the persistent-store changes if paths is set?
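
To make the question concrete, a sketch of the fallback selection being discussed, assuming a hypothetical backend enum (these names are not the PR's actual types): if `paths` is `None`, keep the in-memory hashmap behavior; otherwise use the persistent store.

```rust
use std::collections::HashMap;

/// Hypothetical account backend selected at startup.
enum AccountsBackend {
    /// Original behavior: everything lives in an in-memory map.
    InMemory(HashMap<[u8; 32], Vec<u8>>),
    /// New behavior: accounts are persisted under the given storage directories.
    Persistent { paths: Vec<String> },
}

fn new_backend(paths: Option<Vec<String>>) -> AccountsBackend {
    match paths {
        None => AccountsBackend::InMemory(HashMap::new()),
        Some(paths) => AccountsBackend::Persistent { paths },
    }
}
```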

@garious (Contributor) left a comment

This is an amazing contribution. Thanks so much!

@garious (Contributor) commented on Feb 26, 2019

@sakridge, still waiting for changes?

@sambley force-pushed the accountsio branch 2 times, most recently from beb73fd to 560d8a0 on February 26, 2019 at 17:44
@sakridge (Member)

@garious it doesn't do the fallback to the hashmap-only implementation. It also takes over Bank::id with another value; I'm not sure if that's safe, because I don't know exactly what we were using that for before.

@sakridge (Member)

@sambley I was trying to rebase this change and found it doesn't pass this test, which seems to cause it to fail one of the integration tests: #2961

@sambley (Contributor, Author) commented on Feb 27, 2019

> @sambley I was trying to rebase this change and found it doesn't pass this test, which seems to cause it to fail one of the integration tests: #2961

I am working on adding the fallback mechanism to the hashmap implementation and will fix the test failure as well.

@sakridge (Member)

@sambley sweet, thanks!

@sakridge (Member)

@sambley we also decided the fallback to the hashmap-only implementation is not necessary today, so prioritize that last.

@sambley force-pushed the accountsio branch 2 times, most recently from 218a25b to 5ef7c3a on February 27, 2019 at 06:33
@sambley (Contributor, Author) commented on Feb 27, 2019

> @sambley we also decided the fallback to the hashmap-only implementation is not necessary today, so prioritize that last.

@sakridge, fixed the test failures. Let me know if we want to merge this and work on the fallback mechanism or any pending issues in a separate PR.

@garious (Contributor) commented on Feb 27, 2019

This looks good to me. I'm okay with this being merged if @sakridge approves. The quantity of unsafe calls definitely makes me uncomfortable about not having a fallback, so I hope we can get a follow-up PR shortly after. That PR doesn't necessarily have to add a fallback, though; finding a way to eliminate all (or at least most) of the unsafe calls might be sufficient. What's important to me is that if some component in Solana causes a seemingly random crash, there's some build variation where the compiler can guarantee that's not possible. That means all Rust, only safe uses of unsafe (each instance needs to be manually reasoned through), and no C/C++ libs under the hood. We're not there yet, and this PR moves us in the opposite direction because of those unsafe, unaligned memory accesses.

@sakridge (Member) left a comment

Seems good to me.

@sakridge sakridge merged commit ca0f16c into solana-labs:master Feb 27, 2019
brooksprumo added a commit to brooksprumo/solana that referenced this pull request Jul 25, 2024