
v5.2.0 built from source runs out of memory #5970

Closed
alecalve opened this issue Jun 20, 2024 · 22 comments

@alecalve

Description

We build a Docker image from the Lighthouse v5.2.0 Dockerfile with very minor changes:

  • adding curl inside the image directly
  • some changes relating to how the binary is run

The binary itself is built the same way as in this repo's Dockerfile, using make and the default values for FEATURES, PROFILE, etc.

diff --git a/Dockerfile b/Dockerfile
index e0c48699b..da033a4bf 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -13,6 +13,21 @@ FROM ubuntu:22.04
 RUN apt-get update && apt-get -y upgrade && apt-get install -y --no-install-recommends \
   libssl-dev \
   ca-certificates \
-  && apt-get clean \
-  && rm -rf /var/lib/apt/lists/*
+  curl && \
+  apt-get clean && \
+  rm -rf /var/lib/apt/lists/*
+
 COPY --from=builder /usr/local/cargo/bin/lighthouse /usr/local/bin/lighthouse
+RUN useradd -m -u 1000 -s /bin/bash lighthouse
+RUN chown -R lighthouse:lighthouse /opt
+USER lighthouse
+RUN mkdir /opt/data
+WORKDIR /opt/lighthouse
+
+# Firewall configurations
+# P2P TCP, UDP
+EXPOSE 9001/TCP
+EXPOSE 9001/UDP
+# RPC
+EXPOSE 5053
+ENTRYPOINT [ "lighthouse" ]

The node is run using these arguments:

      --datadir=/opt/data
      --http
      --listen-address=0.0.0.0
      --listen-address=::
      --http-address=::
      --http-allow-origin=*
      --http-port=5052
      --target-peers=50
      --slots-per-restore-point=32
      --prune-blobs=false
      --execution-jwt=/secrets/jwtsecret
      --execution-endpoint=http://....:8551
      --port=9000
      --port6=9000
      --network=mainnet
      --disable-deposit-contract-sync
      --state-cache-size=2
      --metrics
      --metrics-address=::
      --metrics-port=8080
      --metrics-allow-origin=*
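
For context, a rough sketch of how these flags combine with the image's lighthouse entrypoint; the image tag, volume mounts, published ports and the beacon_node subcommand are assumptions for illustration, not the reporter's exact setup:

# Illustrative invocation only: image tag, mounts and ports are placeholders,
# and the beacon_node subcommand is assumed to precede the flags listed above.
docker run -d --name lighthouse-bn \
  -v /srv/lighthouse-data:/opt/data \
  -v /srv/secrets:/secrets:ro \
  -p 9000:9000/tcp -p 9000:9000/udp -p 5052:5052 \
  lighthouse-custom:v5.2.0 \
  beacon_node --datadir=/opt/data --http --network=mainnet   # ...plus the remaining flags above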

Version

The Docker image is built from v5.2.0

Present Behaviour

Once started, the application runs for a while but eventually runs out of memory after hitting the 64 GB limit we assigned to it.

This did not happen with the same modifications applied to previous versions of Lighthouse. The last one we had tested this with was v5.0.0.

Using the image you provide (sigp/lighthouse:v5.2.0) with the same arguments on the same datadir results in reasonable, stable memory usage.

Expected Behaviour

The application should have a stable memory footprint and not run out of memory.

@alecalve
Author

Compiling with --features "portable,gnosis,slasher-lmdb,slasher-mdbx,jemalloc" --profile "maxperf" makes the application not run out of memory
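
For reference, a sketch of passing that feature set and profile through the repo's make-based build; this assumes the Makefile forwards FEATURES and PROFILE to cargo, and the direct cargo line is illustrative only:

# Same feature set and profile via the Makefile variables (assumed to be
# forwarded to cargo by the Lighthouse build).
make FEATURES="portable,gnosis,slasher-lmdb,slasher-mdbx,jemalloc" PROFILE="maxperf"

# Roughly equivalent direct cargo invocation (illustrative):
# cargo install --path lighthouse --locked --force \
#   --features "portable,gnosis,slasher-lmdb,slasher-mdbx,jemalloc" --profile maxperf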

@jimmygchen
Member

Thanks for raising this! We can take a look into this.

Could you try removing or increasing the value for --state-cache-size=2?

I haven't tried testing with a low state cache size since v5.2.0, but with the introduction of tree-states, state caches are now 32x cheaper and we recommend adjusting this to a higher number (64 if you want to allocate the equivalent of 2 state caches in 5.1.3), or just using the default of 128. Using a low state cache size may result in more cache misses and copying, and potentially more memory usage elsewhere, although I'm not very sure and need to take a deeper look.
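
As a quick sketch (the flag name comes from the discussion above; 64 and 128 are the values suggested here), that would look like:

# Either bump the cache to the suggested equivalent of the old setting...
lighthouse beacon_node --state-cache-size 64   # ...other flags unchanged
# ...or drop --state-cache-size entirely to get the default of 128.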

@alecalve
Author

Ah yes, the low value for --state-cache-size was my initial attempt at getting memory under control, until I realized it came from a much deeper issue. I'll remove it, but I can say that the issue was occurring with the default value for that setting.

@dapplion
Collaborator

Do you have metrics tracking the memory usage of this instance? It would be helpful to see whether the increase is linear or spiky, and at what rate. Also, please send us debug logs if possible so we can dive deeper.

@alecalve
Author

Here's what memory usage looked like; the process was OOM-killed each time it exceeded 64 GB:

[image: memory usage graph]

I don't have debug logs available.

@michaelsproul
Member

I wonder if maxperf is causing the compiler to find some optimisation that prevents an OOM. I will try to repro and profile the mem usage of a release build.

@michaelsproul
Member

I couldn't repro the OOM with LH v5.2.0 compiled from source with release profile. I ran for about 2 hours and mem didn't bump above 5GB. I didn't run under Docker though.

@michaelsproul
Member

I'm trying a long-range sync now, as one of the other OOM reports we had was a node syncing a month of history.

@michaelsproul
Member

No luck with the long-range sync either. Mem usage didn't bump above 3GB.

@chong-he
Member

Compiling with --features "portable,gnosis,slasher-lmdb,slasher-mdbx,jemalloc" --profile "maxperf" makes the application not run out of memory

We have another report by @rz4884 on Discord of the same issue, and the user confirmed that it was because he did not include jemalloc in --features.

@michaelsproul mentions that the Dockerfile appears not to enable jemalloc by default. We will fix this, thanks for reporting.
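
Until that fix lands, a workaround sketch is to include jemalloc explicitly when building the image; this assumes FEATURES and PROFILE are build args the upstream Dockerfile forwards to make:

# Explicitly compile jemalloc into a source-built image (build-arg names and
# tag are assumptions based on the Dockerfile's FEATURES/PROFILE variables).
docker build \
  --build-arg "FEATURES=jemalloc" \
  --build-arg "PROFILE=maxperf" \
  -t lighthouse:v5.2.0-jemalloc .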

centromere pushed a commit to coinmetrics/fullnode that referenced this issue Jun 26, 2024
The original patch was missing balance updates and the default features/profiles led to increased memory usage (sigp/lighthouse#5970)
@michaelsproul
Member

Closed by #5995

@alecalve
Author

It is happening again using 5.3.0 when building from source.

The image is built from commit d6ba8c3 using

docker build --build-arg="PROFILE=maxperf" ...

@michaelsproul
Member

@alecalve does the resulting binary show jemalloc as the allocator in ./lighthouse --version?

@alecalve
Author

alecalve commented Aug 23, 2024

It does:

$ lighthouse --version
Lighthouse v5.3.0-b11261e
BLS library: blst
BLS hardware acceleration: true
SHA256 hardware acceleration: true
Allocator: jemalloc
Profile: maxperf
Specs: mainnet (true), minimal (false), gnosis (false)

@michaelsproul
Member

What's the memory usage getting to now on 5.3.0 when the OOM occurs?

It must be something other than the lack of jemalloc. Things to check:

  • Are you running a slasher?
  • If you have metrics enabled, please share a dump of curl http://localhost:5054/metrics here or via DM (@sproul on Discord). In particular it would be good to check beacon_fork_choice_process_attestation_seconds_* (a quick way to pull just that series is sketched after this list) to rule out the issue described here: Expensive fork-choice queued attestation mutation #6206.
  • Are you running Lighthouse with any other flags, e.g. --state-cache-size, --reconstruct-historic-states, etc.?
  • What's the hardware?
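
A quick sketch for pulling just that series (5054 is the default metrics port mentioned above; the grep pattern is illustrative):

# Grab only the fork-choice attestation processing histogram from the
# Prometheus metrics endpoint (5054 is the default metrics port).
curl -s http://localhost:5054/metrics \
  | grep '^beacon_fork_choice_process_attestation_seconds'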

@alecalve
Author

alecalve commented Aug 26, 2024

  • We are not running a slasher
  • I enabled metrics; the beacon_fork_choice_process_attestation_seconds_* metrics look healthy. Metrics are here.
  • We run the following arguments:
      beacon_node
      --datadir=/opt/data
      --http
      --listen-address=0.0.0.0
      --listen-address=::
      --http-address=::
      --http-allow-origin=*
      --http-port=5052
      --checkpoint-sync-url=https://mainnet-checkpoint-sync.attestant.io/
      --reconstruct-historic-states
      --target-peers=50
      --slots-per-restore-point=256
      --historic-state-cache-size=1
      --prune-blobs=false
      --execution-jwt=/secrets/jwtsecret
      --execution-endpoint=http://geth-1-cluster:8551
      --port=9000
      --port6=9000
      --network=mainnet
      --metrics
      --disable-deposit-contract-sync

The node has finished reconstructing states, but we do see a lot of State cache missed logs.

  • Hardware-wise, it's in a k8s environment running on Linux. We allocate 32 GB of RAM to that container.

@michaelsproul
Member

michaelsproul commented Aug 26, 2024

we do see a lot of State cache missed logs though.

This sounds like the issue.

The state cache referred to in this log is for the unfinalized portion of the chain. It shouldn't frequently miss with the default --state-cache-size of 128. It doesn't have anything to do with state reconstruction.

Can you post the output of curl http://localhost:5052/lighthouse/database/info | jq? I want to check if the split.slot is advancing normally and keeping the unfinalized part of the DB a manageable size.

Can you also provide some info on what's making state queries? How many requests per second, are they made concurrently, etc?
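
For convenience, a sketch of pulling just the split point and store config out of that endpoint (port 5052 matches the --http-port above; the jq selectors are illustrative):

# Show only the split point and store config from the database info endpoint.
curl -s http://localhost:5052/lighthouse/database/info | jq '{split: .split, config: .config}'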

@alecalve
Author

  "split": {
    "slot": "9760096",
    "state_root": "0x19325b996b812c1c1d11728a0481f1c333c224e0fa6b10ebd9aefeddb34d9f44",
    "block_root": "0x52ea319a5ff08c1ca9914952690dff649c59808028cee0e450c50274faad04dc"
  }

But the node is way beyond that slot:

Aug 26 08:04:17.000 INFO Synced                                  slot: 9819619, block:    …  empty, epoch: 306863, finalized_epoch: 306861, finalized_root: 0x732f…d236, exec_hash: 0x3837…f93b (verified), peers: 13, service: slot_notifier

Full output:

{
  "schema_version": 21,
  "config": {
    "slots_per_restore_point": 256,
    "slots_per_restore_point_set_explicitly": true,
    "block_cache_size": 5,
    "state_cache_size": 128,
    "historic_state_cache_size": 1,
    "compact_on_init": false,
    "compact_on_prune": true,
    "prune_payloads": true,
    "prune_blobs": false,
    "epochs_per_blob_prune": 1,
    "blob_prune_margin_epochs": 0
  },
  "split": {
    "slot": "9760096",
    "state_root": "0x19325b996b812c1c1d11728a0481f1c333c224e0fa6b10ebd9aefeddb34d9f44",
    "block_root": "0x52ea319a5ff08c1ca9914952690dff649c59808028cee0e450c50274faad04dc"
  },
  "anchor": null,
  "blob_info": {
    "oldest_blob_slot": "9483873",
    "blobs_db": true
  }
}

@alecalve
Author

RPC-wise, the only users are some L2 nodes; I don't have insight into how frequently they query the node.

@michaelsproul
Member

But the node is way beyond that slot

This is the issue. The state migration must be failing.
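
For a sense of scale, a quick back-of-the-envelope on the numbers above (assuming 12-second mainnet slots):

# Gap between the DB split slot and the synced head slot reported above.
echo $(( 9819619 - 9760096 ))                  # 59523 slots not yet migrated
echo $(( (9819619 - 9760096) * 12 / 86400 ))   # roughly 8 days of unfinalized history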

Do you see an error log like:

WARN Block pruning failed

There was an old issue prior to v4.6.0 that could cause DB corruption similar to this:

But I'm guessing that seeing as it happened in