Bound history size #4444

Merged
merged 5 commits into from
Nov 11, 2021
101 changes: 101 additions & 0 deletions EIPS/eip-4444.md
@@ -0,0 +1,101 @@
---
eip: 4444
title: Bound Historical Data in Execution Clients
description: Prune historical data in clients older than one year
author: George Kadianakis (@asn_d6), lightclient (@lightclient), Alex Stokes (@ralexstokes)
discussions-to: https://ethereum-magicians.org/t/eip-4444-bound-historical-data-in-execution-clients/7450
status: Draft
type: Standards Track
category: Core
created: 2021-11-02
---

## Abstract

Clients must stop serving historical blocks/receipts that are older than a year on the p2p layer. Clients may prune such historical data.

## Motivation

Historical blocks, states, and receipts currently occupy more than 400 GB of disk space, and that figure keeps growing. As a result, users typically need a 1 TB disk to validate the chain.

Historical data is not necessary for validating new blocks, so once a client has synced to the tip of the chain, historical data is only retrieved when requested explicitly over JSON-RPC or when a peer attempts to sync the chain. By pruning the history, this proposal reduces the disk requirements for users. Pruning history also allows clients to remove code that processes historical blocks, so execution clients no longer need to maintain code paths that deal with each upgrade's compound changes.

This proposal also concretizes the availability guarantees of the p2p layer as clients won't be able to fetch historical data anymore. This results in less strain on the network as clients adopt more lightweight sync strategies based on the PoS weak subjectivity assumption.

## Specification

| Parameter | Value | Description |
| - | - | - |
| `HISTORY_PRUNE_EPOCHS` | 82125 | A year in beacon chain epochs |

Clients MUST NOT serve blocks/states/receipts that are older than `HISTORY_PRUNE_EPOCHS` epochs on the p2p network.
Review comment (Contributor):

Not everyone will agree on what time it is, should we have a bit of a no-mans-land between the time clients are expected to stop requesting ancient data and when clients are expected to stop providing ancient data? This could be minutes/hours/days, just enough to deal with time synchronization issues.

Presumably, under such a regime any peer that requests data older than the longer of the two SHOULD be kicked. Meanwhile, clients SHOULD be programmed to never ask for anything older than the shorter of the two.

Review comment (Contributor):

So I can/should locally prune after these epochs, and thus I can be in a situation where our clocks have a disparity (within reason) and you make what you think is an honest request that I do not have. Are you suggesting in such a situation that although I cannot respond to the request, I should not consider your request as nefarious? Whereas if you are outside of the window, I can consider this as "bad" behavior and descore/kick you?

> under such a regime any peer that requests data older than the longer of the two SHOULD be kicked

I prefer MAY be downscored and/or kicked.

Review comment:

> Not everyone will agree on what time it is, should we have a bit of a no-mans-land between the time clients are expected to stop requesting ancient data and when clients are expected to stop providing ancient data? This could be minutes/hours/days, just enough to deal with time synchronization issues.
>
> Presumably, under such a regime any peer that requests data older than the longer of the two SHOULD be kicked. Meanwhile, clients SHOULD be programmed to never ask for anything older than the shorter of the two.

@MicahZoltu In my country, MasterCard and Visa, while not blockchain-related, are required to keep the last 10 years of history for law enforcement, and they handle a higher transaction throughput.


Clients MAY locally prune block/state/receipt history that is older than `HISTORY_PRUNE_EPOCHS` epochs.
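
The two rules above can be sketched as a pair of predicates. This is a minimal illustration, not part of the proposal: the function names `may_serve` and `may_prune` are hypothetical, and measuring age as wall-clock time minus the block timestamp is an assumption, since the text does not say how age is determined. The slot and epoch lengths are the mainnet beacon chain values.

```python
# Mainnet beacon chain timing constants.
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32
HISTORY_PRUNE_EPOCHS = 82125

# Length of the serving window in seconds (one 365-day year).
HISTORY_WINDOW_SECONDS = HISTORY_PRUNE_EPOCHS * SLOTS_PER_EPOCH * SECONDS_PER_SLOT


def may_serve(block_timestamp: int, now: int) -> bool:
    """Return True if the block is not older than the history window.

    Assumption: age is wall-clock time minus the block timestamp (hypothetical rule).
    """
    return now - block_timestamp <= HISTORY_WINDOW_SECONDS


def may_prune(block_timestamp: int, now: int) -> bool:
    """Data that must no longer be served MAY also be pruned locally."""
    return not may_serve(block_timestamp, now)
```

Because the check depends on the local clock, two honest peers can disagree near the boundary, which is the concern raised in the review comments on this section.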
Review comment (Contributor) on lines +25 to +33:

Specification should indicate what behavior should occur if a peer asks for data that is too old (which is why I think this should be a Networking spec).

Review comment (Member, Author):

Why should anything specific occur? There are a small number of clients, we should work together to ensure that they follow the behavior of the EIP.

Review comment (Contributor):

It's been argued that requesting and serving data beyond the limit should be explicitly incorrect. In such a case, you'd need to have devp2p return errors on such requests.


#### Bootstrapping and syncing

This EIP impacts the way clients bootstrap and sync. Clients won't be able to full sync or snapshot sync since historical data won't be available on the p2p network.

Clients MUST use a valid Weak Subjectivity Checkpoint to bootstrap from a more recent view of the chain. For the purpose of syncing, clients treat weak subjectivity checkpoints as the genesis block. We call this method "checkpoint sync".

For the purposes of this proposal, we assume clients always start with a configured and valid weak subjectivity checkpoint. The way this is achieved is out-of-scope for this proposal.

## Rationale

This proposal forces clients to stop serving old historical data over p2p. We make this explicit to force clients to seek historical data from other sources, instead of relying on the optional behavior of some clients which would result in quality degradation.

### Why a year?

This proposal sets `HISTORY_PRUNE_EPOCHS` to 82125 epochs (one Earth year). This constant is large enough to leave room for the Weak Subjectivity Period to grow, and small enough not to occupy too much disk space.
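
For reference, the value 82125 can be checked with a few lines of arithmetic. The sketch assumes the mainnet beacon chain timing (12-second slots, 32 slots per epoch) and a 365-day year. The variable names are illustrative only.

```python
# Mainnet beacon chain timing constants.
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32

SECONDS_PER_EPOCH = SECONDS_PER_SLOT * SLOTS_PER_EPOCH  # 384 seconds
SECONDS_PER_YEAR = 365 * 24 * 60 * 60                   # 31,536,000 seconds (non-leap year)

# The division is exact, so no rounding is involved.
HISTORY_PRUNE_EPOCHS = SECONDS_PER_YEAR // SECONDS_PER_EPOCH
print(HISTORY_PRUNE_EPOCHS)  # 82125
```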

## Backwards Compatibility

### Preserving historical data

This proposal impacts nodes that make use of historical data (e.g. web3 applications that display history of blocks, transactions or accounts). Preserving the history of Ethereum is fundamental and we believe there are various out-of-band ways to achieve this.

Historical data can be packaged and shared via torrent magnet links or over networks like IPFS. Furthermore, systems like the Portal Network or The Graph can be used to acquire historical data.

Clients should allow importing and exporting of historical data. Clients can provide scripts that fetch/verify data and automatically import them.
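
One way an import script might verify data fetched out-of-band is to check that the headers link together by parent hash and end at a hash the node already trusts. The sketch below is illustrative only: the `Header` type and `verify_import` function are hypothetical, and SHA-256 over a simplified byte layout stands in for the real header hash (Keccak-256 over the RLP-encoded header).

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Header:
    number: int
    parent_hash: bytes
    payload: bytes  # stand-in for the remaining header fields

    def hash(self) -> bytes:
        # Stand-in hash: real headers are RLP-encoded and hashed with Keccak-256.
        data = self.number.to_bytes(8, "big") + self.parent_hash + self.payload
        return hashlib.sha256(data).digest()


def verify_import(headers: List[Header], trusted_tip_hash: bytes) -> bool:
    """Check that headers form an unbroken chain ending at a trusted hash."""
    if not headers:
        return False
    for parent, child in zip(headers, headers[1:]):
        # Each header must reference its predecessor and follow it in number.
        if child.parent_hash != parent.hash() or child.number != parent.number + 1:
            return False
    return headers[-1].hash() == trusted_tip_hash
```

Anchoring the check at a trusted tip hash means the source of the archive (torrent, IPFS, or anything else) does not itself need to be trusted.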

### Full syncing from genesis

Full syncing will no longer be possible over the p2p network. However, we do want to allow interested parties to do so on their own.

We suggest that a specialized "full sync" client is built. The client is a shim that pieces together different releases of execution engines and can import historical blocks to validate the entire Ethereum chain from genesis and generate all other historical data.

It's important to also note that although archive nodes with "state sync" functionality are in development, full sync is currently the only reliable way to bootstrap them. Clients that wish to continue providing archive support would need the ability to import historical blocks retrieved out-of-band and retain support for each historical upgrade of the network.

### User experience

This proposal impacts the UX for setting up applications that use historical data. Hence we suggest that clients introduce this change in two phases:

In the first phase, clients don't prune historical data by default. They introduce a command line option similar to geth's `--txlookuplimit` that users can use if they want to prune historical data.

In the second phase, history is pruned by default and the command line option is removed. Clients assume that users will find and import data in an out-of-band way.

### JSON-RPC changes

After this proposal is implemented, certain JSON-RPC endpoints (e.g. `getBlockByHash`) won't be able to tell whether a given hash is invalid or just too old. Other endpoints like `getLogs` will simply no longer have the data the user is requesting. How applications or clients should handle this regression is out-of-scope for this proposal.
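
The ambiguity can be illustrated with a toy lookup. This is a sketch with a hypothetical in-memory store and made-up hashes, not any client's actual API: once old blocks are pruned, a lookup for a pruned block and a lookup for a hash that never existed return the same result.

```python
from typing import Optional

# Hypothetical local store: only blocks inside the retention window remain.
local_blocks = {"0xaaa": {"number": 100}}


def get_block_by_hash(block_hash: str) -> Optional[dict]:
    """Return the block if it is stored locally, else None."""
    return local_blocks.get(block_hash)


pruned = get_block_by_hash("0xbbb")  # imagine a real block that was pruned
bogus = get_block_by_hash("0xzzz")   # a hash that never existed

# Both lookups are indistinguishable to the caller.
print(pruned is None and bogus is None)  # True
```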

## Security Considerations

### Relying on weak subjectivity

With the move to PoS, it's essential for security to use valid weak subjectivity checkpoints because of long-range attacks.

This proposal relies on the weak subjectivity assumption and assumes that clients will not bootstrap with an invalid or old WS checkpoint.

This proposal also assumes that the weak subjectivity period will never be longer than `HISTORY_PRUNE_EPOCHS`. If that were to happen, clients with an old weak subjectivity checkpoint would never be able to "checkpoint sync" since the p2p network would not be able to provide the required data.
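
The constraint can be expressed as a simple age check on the configured checkpoint. This is a rough sketch with a hypothetical helper name, `can_checkpoint_sync`. It assumes checkpoint age is measured in beacon chain epochs and that peers serve data back to exactly `HISTORY_PRUNE_EPOCHS` epochs.

```python
HISTORY_PRUNE_EPOCHS = 82125


def can_checkpoint_sync(checkpoint_epoch: int, current_epoch: int) -> bool:
    """Return True if peers are still expected to serve data back to the checkpoint."""
    age = current_epoch - checkpoint_epoch
    # A checkpoint in the future is invalid; one older than the window
    # points at data the p2p network no longer provides.
    return 0 <= age <= HISTORY_PRUNE_EPOCHS
```

A checkpoint that fails this check would have to be replaced with a more recent one, or the missing history would have to be obtained out-of-band.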

### Centralization/censorship risk

There are censorship/availability risks if there is a lack of incentives to keep historical data.

It's important that Ethereum historical data are preserved and seeded by independent organizations, and their availability should be checked frequently. We consider these mechanisms as out-of-scope for this proposal.

Furthermore, there is a risk that more dapps will rely on centralized services for acquiring historical data because it will be harder to set up a full node.

## Copyright
Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).