
Lighthouse space usage increasing dramatically from 87GB to 430GB in 7 days #4773

Closed
luarx opened this issue Sep 22, 2023 · 40 comments
Labels: bug (Something isn't working), database, v4.6.0 ETA Q1 2024

@luarx

luarx commented Sep 22, 2023

Description

We are running several Lighthouse clients and I have noticed that in about 10% of them there are occasions where the volume usage increases 5x in 7 days!

My temporary solution is to delete the volume data and resync; after that the Lighthouse client is stable again. But some weeks later a different Lighthouse client repeats this weird behaviour...

[screenshot: disk usage graph]

Version

Lighthouse v4.1.0 (docker image)

Present Behaviour

The volume usage increases 5x in 7 days

Params:

          args: [
              "--debug-level=info",
              "--datadir=/beacondata",
              "--network=mainnet",
              "beacon_node",
              "--disable-enr-auto-update",
              "--enr-address=127.0.0.1",
              "--enr-tcp-port=9000",
              "--enr-udp-port=9000",
              "--port=9000" ,
              "--discovery-port=9000",
              "--eth1",
              "--http",
              "--http-address=0.0.0.0",
              "--http-port=5052",
              "--metrics",
              "--metrics-address=0.0.0.0",
              "--metrics-port=5054",
              "--listen-address=0.0.0.0",
              "--target-peers=100",
              "--http-allow-sync-stalled",
              "--disable-packet-filter",
              "--execution-endpoint=http://localhost:9545",
              "--jwt-secrets=/tmp/jwtsecret",
              "--disable-deposit-contract-sync",
              "--checkpoint-sync-url=https://beaconstate-mainnet.chainsafe.io"
              ]

Expected Behaviour

The volume usage should not increase so heavily in a few days; it should stay at roughly stable values

Steps to resolve

Prevent the volume usage from increasing so heavily in a few days

@luarx
Author

luarx commented Sep 22, 2023

Another screenshot of a different Lighthouse client that is using the same version and same params.
Look at how the space was stable again after deleting the volume data and resyncing

[screenshot: disk usage graph]

@michaelsproul
Member

michaelsproul commented Sep 22, 2023

If you still have any oversized nodes (or if it happens again) you can try:

Running du -sh freezer_db chain_db in the $datadir/beacon. I suspect it's the chain_db that's bloated.

If it is, you can try running lighthouse db inspect --column ste --datadir /your/datadir to see if the beacon state storage is the culprit (I suspect it is).

We haven't identified the bug that causes this, but have had a few reports of it. The database schema is getting completely overhauled in a few versions' time, with a new system that prunes states more aggressively and so far has not shown any signs of blowing up in the same way (see the tree states releases if you're curious).

Still, if we find the bug in the current DB it would be nice to fix it. Debug logs from an affected node would be helpful (DM me on Discord @sproul, or email me: michael at sigmaprime dot io)

@luarx
Author

luarx commented Sep 22, 2023

Sure! I still have that oversized node to help debug this issue

Here are the results you asked for:

du -sh freezer_db chain_db
5.3G	freezer_db
432G	chain_db

I do not know if I am executing this second one correctly:

lighthouse db inspect --column ste --datadir beacondata/
Sep 22 14:08:26.602 INFO Running database manager for mainnet network
Fatal error: DBError { message: "Error { message: \"IO error: lock beacondata/beacon/freezer_db/LOCK: Resource temporarily unavailable\" }" }

Perfect, I will contact you on Discord

@michaelsproul
Member

Looks like you've got the command right, but you need to stop the BN while you run it. It will produce quite a lot of output, so you might also want to pipe it to a file.
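
For example, something like this (the container name and datadir path are placeholders, adjust for your setup):

# stop the beacon node first so the database lock is released
docker stop <lighthouse-container>
# run the inspection and capture the (large) output in a file
lighthouse db inspect --column ste --datadir /path/to/beacondata > state-column.txt
# start the beacon node again afterwards
docker start <lighthouse-container>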

I'm afk from Discord for the weekend but will get in touch Monday :)

@michaelsproul
Member

Thanks, I can see what the issue is now:

  • Your node stopped being able to delete finalized states from the database due to an error:

WARN Block pruning failed error: BeaconStateError(MissingBeaconBlock(SignedBeaconBlockHash(0x69eaa1973c1e57ae88d4d85d46c4a7e30684f0318cb3b3411a1339597aadbecf))), service: beacon

  • Due to Lighthouse's current database schema, these states consume a lot of disk space (around 30GB/day).

The tree states release I mentioned should alleviate the latter point, because it results in vastly less data being stored on disk. However, it would be good to understand why the database is erroring, as I haven't seen this error before. Strangely the block that's missing from your DB isn't one that I can find: it's not part of the canonical chain and it wasn't seen by the couple of nodes I spot-checked (I'll check more thoroughly tomorrow).

It would be great if you could provide logs from around the time the database started growing, as this might lead us to the mystery block and the root cause. For the best chance of success it would be great to have debug-level logs, which should have been written by default to your datadir at /beacondata/beacon/logs/. If you could find the file that covers Sept 15 (or whenever the issue started on this node) and DM it to me compressed, that would be great.

Thanks!

@luarx
Author

luarx commented Sep 25, 2023

Good to know that! Thanks!

Unfortunately, I only have logs from September 20th onwards, because of the internal log rotation I would say...
Let me know if I can provide any other info

@michaelsproul
Member

Any info logs from earlier that might have been retained by Docker or the OS?

@luarx
Author

luarx commented Sep 25, 2023

I have checked and it is not possible either, they have been rotated :(

@michaelsproul
Member

That's ok, thanks for your efforts! I'll see if I can dig up that block from somewhere and see if it provides a clue about what went wrong

jimmygchen added the bug (Something isn't working) and database labels on Sep 26, 2023
@luarx
Author

luarx commented Sep 28, 2023

In case the same issue shows up on any of my Lighthouse nodes again, I will copy those logs before they are rotated @michaelsproul
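
(Roughly something like this, with whatever the debug log files are actually named on that node:)

# copy the debug logs out of the datadir before rotation discards them, then compress
mkdir -p /tmp/lighthouse-logs
cp /beacondata/beacon/logs/* /tmp/lighthouse-logs/
tar -czf lighthouse-debug-logs.tar.gz -C /tmp lighthouse-logs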

@michaelsproul
Member

Hey @luarx I did some investigating and found a few interesting things:

  • That block which was missing from your node (0x69eaa1973c1e57ae88d4d85d46c4a7e30684f0318cb3b3411a1339597aadbecf) was published at slot 7323584 building on a parent from 12 slots prior (!) at 7323572. The block graffiti was Avado and the proposer was 781777. The parent block has root 0xe6ae158ddfd054042b135fdf3691cc5b60f6b4bd2f01308911f77e72d2466ad3 which is the canonical block from that slot.
  • The block was probably published by a Lighthouse validator (according to blockprint). This isn't a bad thing per se, as Lighthouse will publish blocks even when it's slightly out of sync in order to keep the network live. It can't always tell whether there are legitimately no blocks, or it's just out of sync.

Anyway, the details of the block's publishing probably aren't too important, because the issue is how it was processed by your node.

To get that error, the database pruning process must have hit the Ok(None) branch here:

for (head_hash, head_slot) in heads {
    // Load head block. If it fails with a decode error, it's likely a reverted block,
    // so delete it from the head tracker but leave it and its states in the database
    // This is suboptimal as it wastes disk space, but it's difficult to fix. A re-sync
    // can be used to reclaim the space.
    let head_state_root = match store.get_blinded_block(&head_hash) {
        Ok(Some(block)) => block.state_root(),
        Ok(None) => {
            return Err(BeaconStateError::MissingBeaconBlock(head_hash.into()).into())
        }

The pruning is trying to fetch a block which exists in the head_tracker from disk, which should always succeed if the head_tracker is in sync with what's on disk. The question then becomes:

How did we end up with a block in the head_tracker which isn't stored on disk?

There are only two places where we add a block to the head tracker. The main one is here, after importing the block:

self.head_tracker
    .register_block(block_root, parent_root, slot);

Notably this happens after the block is written to disk, here:

ops.push(StoreOp::PutBlock(block_root, signed_block.clone()));
ops.push(StoreOp::PutState(block.state_root(), &state));
let txn_lock = self.store.hot_db.begin_rw_transaction();
if let Err(e) = self.store.do_atomically(ops) {

So on this code path it seems like it would be impossible to end up in a situation where the block is in the head tracker, but not written to disk. Even if the node crashed or was SIGKILLed in between those two calls, the result would be that the block was on disk and not in the head tracker.

So where's the other place we call register_block? Here, on node startup when reverting a missed hard fork:

error!(
    log,
    "Error decoding head block";
    "message" => "This node has likely missed a hard fork. \
                  It will try to revert the invalid blocks and keep running, \
                  but any stray blocks and states will not be deleted. \
                  Long-term you should consider re-syncing this node."
);
let (block_root, block) = revert_to_fork_boundary(
    current_slot,
    initial_head_block_root,
    store.clone(),
    &self.spec,
    &log,
)?;
// Update head tracker.
head_tracker.register_block(block_root, block.parent_root(), block.slot());

I doubt that your node could have hit this, because it hadn't missed the most recent fork (Capella, which happened back in April). It could have tried to run that revert if the head block was corrupt on disk, but even if it did, it wouldn't have selected 7323584 because of the condition here:

if block.slot() < fork_epoch.start_slot(E::slots_per_epoch()) {
    Some((block_root, block))

So, in terms of entries being added to the head tracker without the block existing on disk, I think we're in the clear.

The other option is that a block was deleted from disk, but not deleted from the head_tracker. There's only one place where we delete from the head tracker, so let's look for a race there:

for head_hash in abandoned_heads {
    head_tracker_lock.remove(&head_hash);
}

Those head tracker deletions get added to an atomic I/O batch here, together with the block deletions:

// Persist the head in case the process is killed or crashes here. This prevents
// the head tracker reverting after our mutation above.
let persisted_head = PersistedBeaconChain {
    _canonical_head_block_root: DUMMY_CANONICAL_HEAD_BLOCK_ROOT,
    genesis_block_root,
    ssz_head_tracker: SszHeadTracker::from_map(&head_tracker_lock),
};
drop(head_tracker_lock);
kv_batch.push(persisted_head.as_kv_store_op(BEACON_CHAIN_DB_KEY));

If LevelDB (and the filesystem) uphold their ACID guarantees, then this batch should happen atomically: either the blocks & head tracker entries are both deleted, or neither are. Even though we release the lock on the head tracker early (drop(head_tracker_lock)), the worst that can happen is that another thread writes the updated tracker to disk before the blocks are deleted, leading to the block not appearing in the head tracker while still existing on disk (this is safe; we're looking for the opposite: block in head tracker but not on disk).

The other possibility is that blocks are deleted elsewhere, but this is not the case. The only usage of StoreOp::DeleteBlock is in the migrator/pruner, and the delete_block function is not used outside of tests.

Therefore, I'm a bit stumped. For the bug to occur, one of the assumptions I've made must have been violated, e.g.

  • LevelDB is not able to do atomic I/O on your disk due to a filesystem issue. I think this is possible if you're using an NFS drive, as NFS drives don't generally provide strong atomicity guarantees. It does look like you're using Docker and/or Kubernetes, so it would be interesting to know if you're using an NFS mount for the datadir (a quick check is sketched below the list).
  • LevelDB isn't able to do atomic I/O properly due to us not calling fsync. We call fsync every so often, but not after every write batch. My understanding of LevelDB's guarantees is that it should still uphold atomicity of batches in this case, but I could be wrong. I think the NFS explanation is more likely.
  • The hardware (e.g. RAM/disk) is faulty (probably not, as you've seen it on multiple machines).
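
For what it's worth, something like this would show which filesystem the datadir actually sits on (the path is just an example):

# print the filesystem type backing the Lighthouse datadir
findmnt -T /beacondata        # or: df -T /beacondata
# an FSTYPE of nfs/nfs4 would make the atomicity of LevelDB batches suspect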

Even if we don't understand exactly how it happened, we could still try to mitigate the impact of this occurring by making the pruning process ignore this kind of error. We could simply log a warning whenever this is detected, rather than stalling pruning indefinitely (which is what leads to the database growth). The outcome would be a little bit of database bloat, due to states/blocks that weren't fully cleaned up, but I think it would be better than the alternative. I think this is probably a good harm-minimisation approach, particularly if we discover that this is quite likely on NFS.

Long-term we could also mitigate it by:

  • Being more aggressive with calling fsync (if the fsync hunch turns out to be correct)
  • Using a different backend from LevelDB to see if it suffers the same issues (this is in progress here: Modularize beacon node backend #4718).

TL;DR: I haven't been able to find an obvious Lighthouse bug, and suspect foul play from the filesystem, possibly NFS.

@luarx
Author

luarx commented Sep 29, 2023

Amazing, thank you for taking the time to debug this and for sharing what you found in such a detailed explanation! 🙌 🙌 🙌

  • About the filesystem
    • We use AWS EBS volumes, so they are not NFS, because NFS presents a file-based interface while EBS is a block/device-based interface
  • About "The hardware (e.g. RAM/disk) is faulty (probably not, as you've seen it on multiple machines)"
    • That's right, it has happened on multiple different machines

Unfortunately I cannot help much more on this, as it is internal Lighthouse logic that escapes my control... but I do agree that, at the very least, making the pruning process ignore this kind of error to prevent the volume usage increase would be a good solution 👀

@luarx
Author

luarx commented Sep 29, 2023

Some more information in case it is useful:

  • We have been running Lighthouse v4.1.0 since May (previously we had v4.0.1) for mainnet nodes
  • During the last 12 months, we saw the first volume usage spike on July 12 at 23h UTC (attached image)
    [screenshot]
  • We run Lighthouse nodes for the mainnet and gnosis-chain networks, and only mainnet nodes had problems.
    • It is true that Gnosis-chain nodes ran v4.1.0 from May, were then upgraded to v4.2.0 on Jun 26, later to v4.3.0 on July 17, and finally to v4.4.1 on September 20
  • In September, we saw 2 usage spikes in 2 different nodes, running 2 different execution nodes, on 2 different machines. Almost the same day: one on September 13 and the other on September 14
    [screenshot]
    [screenshot]

@luarx
Author

luarx commented Sep 29, 2023

Just curious, do you think that it could be a bug introduced in v4.1.0 and fixed in more recent versions @michaelsproul?
We will upgrade soon to the latest version

@michaelsproul
Member

Just curious, do you think that it could be a bug introduced in v4.1.0 and fixed in more recent versions @michaelsproul?

No, I don't think that's likely, as most of this code hasn't changed since v4.1.0. I would still recommend upgrading though, as there's a small chance it will help.

We use AWS EBS volumes, so they are not NFS, because NFS presents a file-based interface while EBS is a block/device-based interface

Are you using Docker or Kubernetes? I'm trying to understand whether there could be some extra layer between Lighthouse and AWS EBS that is buggy
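
If it's Kubernetes, something like this (names are placeholders) would show what's actually backing the datadir volume:

# which storage class / provisioner backs the beacon data PVC?
kubectl get pvc -n <namespace>
kubectl get pv <pv-name> -o yaml
# and from inside the pod (if the image has the tools): what is the datadir mounted on?
kubectl exec -n <namespace> <pod> -- df -T /beacondata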

@luarx
Author

luarx commented Nov 2, 2023

I am using Kubernetes 🔥

@winksaville
Contributor

I had a similar problem on my prater machine and I reported the issue on Discord. @jimmygchen thought I might be running into this same issue and suggested I remove prater/beacon/*; I also added the --checkpoint-sync-url "https://prater.checkpoint.sigp.io" parameter when executing the beacon node.
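
A rough sketch of what the resync amounted to (paths are guessed from my du output below and the usual execution-endpoint/JWT flags still apply; adjust to your setup):

# stop the beacon node first, then wipe the corrupt beacon database
rm -rf eth2-internal-data/lighthouse/prater/beacon/*
# restart with checkpoint sync so it doesn't have to sync from genesis
lighthouse bn --network prater --datadir eth2-internal-data/lighthouse/prater --checkpoint-sync-url "https://prater.checkpoint.sigp.io"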

Before resyncing du reported:

$ du -sh eth2-internal-data/lighthouse/prater/beacon/*
293G    lighthouse/prater/beacon/chain_db
159G    lighthouse/prater/beacon/freezer_db
1.2G    lighthouse/prater/beacon/logs
16K    lighthouse/prater/beacon/network

About 24 hours after the resync started and the loading of the historical data had completed, du reported:

$ du -sh eth2-internal-data/lighthouse/prater/beacon/*
29G    eth2-internal-data/lighthouse/prater/beacon/chain_db
61M    eth2-internal-data/lighthouse/prater/beacon/freezer_db
94M    eth2-internal-data/lighthouse/prater/beacon/logs
16K    eth2-internal-data/lighthouse/prater/beacon/network

So it would appear a "regular" x86 Ubuntu machine can run into this problem:

$ uname -a
Linux ethel 6.2.0-36-generic #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct  9 15:34:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Sadly, I didn't save any logs so I don't have any other information, but let me know if there is any testing/debugging you'd like me to do.

I vote this bug be fixed ASAP :)

@michaelsproul
Member

Had another report of this on GCP (Google Cloud Platform) using ext4 as the filesystem.

Were you also using GCP @luarx?

@winksaville
Contributor

winksaville commented Jan 12, 2024 via email

@michaelsproul
Member

I'll add this to my TODO list for next week. Just getting stuck in after the holidays, but can probably squeeze a fix into v4.6.0 proper.

michaelsproul added the v4.6.0 ETA Q1 2024 label on Jan 12, 2024
michaelsproul self-assigned this on Jan 12, 2024
@winksaville
Contributor

I just saw the announcement for the Experimental 4.6.111 release; is your proposed fix in that release?

@luarx
Author

luarx commented Jan 12, 2024

Were you also using GCP @luarx?

No, a different cloud provider 👀

I can confirm that this is still happening in v4.4.1 😟
[screenshot: disk usage graph]

@michaelsproul
Member

I just saw the announcement for the Experimental 4.6.111 release; is your proposed fix in that release?

No, it isn't. The database is substantially different in that release but I think it could still be affected by this issue, seeing as we don't actually know the root cause. It doesn't contain the fix that I'm proposing: to ignore the database inconsistency when it happens, in order to avoid the space blow-up

@winksaville
Contributor

@michaelsproul, maybe you should add it to 4.6.111 and I'll test it on my prater machine that has failed.

Actually, has anyone else besides @luarx and me seen this problem?

@michaelsproul
Member

michaelsproul commented Jan 13, 2024

@winksaville 4.6.111 requires a resync at the moment, so I'd rather get you to test a PR off stable

I think a few other people on Discord have hit this issue as well

@winksaville
Contributor

@winksaville 4.6.111 requires a resync at the moment, so I'd rather get you to test a PR off stable

OK

@michaelsproul
Member

michaelsproul commented Jan 16, 2024

Found the root cause with @dapplion! See:

We'll prepare a more polished version of the fix and merge it to unstable shortly. DO NOT RUN THE CODE FROM #5073, IT WON'T WORK.

@paulhauner
Member

Resolved via #5084

@adaszko
Contributor

adaszko commented Jan 22, 2024

So it seems this issue is a non-trivial manifestation #1557 has been waiting for? ;)

@michaelsproul
Member

michaelsproul commented Jan 22, 2024

@adaszko Exactly, yes! I'd forgotten about that issue, thanks for making the link!

We think there are still issues remaining in the head tracker impl, so my resolution now is to delete it entirely.

@winksaville
Contributor

Is there another issue tracking the remaining issues?

What's the expected time frame for #5084 to be incorporated into a release?

@michaelsproul
Member

Is there another issue tracking the remaining issues?

@winksaville There's one here:

Another (milder) atomicity issue that we decided not to fix is described in the fix PR:

There's also more discussion on the issue about deleting the head tracker:

@luarx
Author

luarx commented Feb 6, 2024

Looking forward to seeing the fix in a new version, as we are still suffering from this bug in v4.6.0 😞

Could a hotfix version be created at least with #5084? 🙏

@michaelsproul
Member

@luarx The fix is already in v4.6

@michaelsproul
Member

Did you have a node that wasn't already corrupt become corrupt while running 4.6? If so, can you please share logs?

The fix will not un-corrupt an already corrupt database. For that you need to re-sync.

@luarx
Author

luarx commented Feb 6, 2024

Regarding your questions:

  • Network: Sepolia
  • Lighthouse version: v4.6.0
    • We upgraded from v4.4.1 -> v4.6.0 on Jan 29th at 12h UTC
  • More context:
    • It seems that on Jan 30th at 23h UTC our 2 Sepolia nodes started to use more volume space (green line). This is a big coincidence 😮
    • The green line went down when we erased the volume data
    • Unfortunately we only have logs from Feb 6th

Lighthouse_1 (zoom in and zoom out)
[screenshots: disk usage graphs]

Lighthouse 2 (zoom in and zoom out)
[screenshots: disk usage graphs]

@luarx
Author

luarx commented Feb 6, 2024

@luarx The fix is already in v4.6

Are you sure @michaelsproul? I see that it is only mentioned in https://github.com/sigp/lighthouse/releases/tag/v4.6.222-exp

[screenshot]

@michaelsproul
Member

Are you sure @michaelsproul? I see that it is only mentioned in https://github.com/sigp/lighthouse/releases/tag/v4.6.222-exp

Yeah it just got left out of the release notes for 4.6, I'll add it now.

You can see all the tags that a commit is part of by clicking on it: 585124f. That commit is in v4.6.0 and v4.6.222-exp.
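
Or from a local checkout, something along these lines:

# list every tag that contains that commit
git fetch --tags
git tag --contains 585124f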

@michaelsproul
Member

@luarx I wonder if something about the Deneb fork triggered the disk space increase, rather than the old bug that was fixed in #5084. We would need logs to confirm. If it happens again, can you save debug logs from the datadir before wiping?

@luarx
Author

luarx commented Feb 6, 2024

@luarx I wonder if something about the Deneb fork triggered the disk space increase, rather than the old bug that was fixed in #5084. We would need logs to confirm. If it happens again, can you save debug logs from the datadir before wiping?

It could be... next time we will try to have logs to debug!
