-
Notifications
You must be signed in to change notification settings - Fork 987
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Ethereum network indexer (phase 1: blocks only) #1383
Conversation
e1de997
to
5a69e25
Compare
ec19ad2
to
02e93a5
Compare
This is probably quite a diff, would it be done in this PR or a follow up? |
@leoyvens I'd be happy to make that a follow up, given the size of this PR. |
One thing I'll still do is put metrics back in. I added the utilities ( |
25fd80e
to
b2232a9
Compare
ef11471
to
ae8b81b
Compare
Tests generally pass, just sometimes they hang on Travis. I'm not sure yet what causes it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't yet fully understood the reorg algorithm, but it seems complicated due to the need of finding the common ancestor. I wonder if we could do a simpler algorithm which is to revert a single block if the next block is not a child of the current one, and go back to the starting state. This would naturally find the common ancestor by reverting one block at a time until it is found. It would in theory be less efficient for large reorgs, but looking at Etherscan statistics, it seems that 95% of reorgs are 1 block deep, and I couldn't even find a 3 block deep reorg, those must happen only once every full moon. So the simpler algorithm could maybe be more efficient because it needs to do less work on 1 block reorgs which are the common case. My point being that the performance difference would not matter if there is any, so we should favor simplicity.
|
||
Box::new( | ||
// Add the block entity | ||
self.set_entity(block.as_ref(), Some(vec![("isOmmer", false.into())])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems hacky that isOmmer
is set here, probably reflecting the fact that this field was the last thing added. It would be nicer if we set that in a data structure when the block is fetched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I've changed this so ommer blocks are wrapped in an Ommer
newtype and the isOmmer
flag is now set in the TryIntoEntity
implementations for Ommer
and BlockWithUncles
.
#[derive(Clone, Debug, Default, PartialEq)] | ||
pub struct BlockWithUncles { | ||
pub block: EthereumBlock, | ||
pub uncles: Vec<Option<Block<H256>>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A Vec<Option<_>>
is weird, what does None
mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs are not super clear; I think https://github.com/ethereum/wiki/wiki/JSON-RPC#eth_getUncleByBlockHashAndIndex and it's link to https://github.com/ethereum/wiki/wiki/JSON-RPC#eth_getblockbyhash suggests that None
means the uncle couldn't be found.
We can't allow different nodes to return different uncles though – we need them all for computing block rewards. So I think if an uncle (ommer) is not found, that's a serious reason to fail the network indexer. I'll drop the Option
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
// Check whether we have a reorg (parent of the new block != our local head). | ||
if block.inner().parent_ptr() != state.local_head { | ||
let depth = block.inner().number.unwrap().as_u64() | ||
- state.local_head.map_or(0u64, |ptr| ptr.number); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the purpose of this variable, it's only used in logs and metrics, and afaict it's either 0
if block
is genesis or 1
otherwise, so not very meaningful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh, you're right, this will never report the real depth of the reorg. I do want that information, but I think I can only log it once we have found the common ancestor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
let state = state.take(); | ||
|
||
transition!(PollChainHead { | ||
local_head: state.old_local_head, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we went back to the starting state here, and then we wouldn't need keep old_local_head
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, I like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
graph/src/ext/futures.rs
Outdated
@@ -224,6 +228,10 @@ impl<F: Future> FutureExtension for F { | |||
on_cancel, | |||
} | |||
} | |||
|
|||
fn measure<C: FnOnce(&Self::Item, Duration)>(self, callback: C) -> Measure<Self, C> { | |||
Measure::new(self, callback) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having this helper is unecessary, it only has one caller from what I can tell, and with async/.await it won't be idiomatic because it's a variation of and_then
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can move it into the indexer code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this is done also.
3d6eacb
to
2f3437e
Compare
@leoyvens I think I've addressed all comments with the appropriate changes. Could you take another look? |
127bdba
to
c072697
Compare
Rebased on top of master. |
Reorg handling was simplified as per @leoyvens's suggestion. Reduced the implementation by about 600 lines and made it a lot easier to follow. |
Left to do: Count consecutive reverts to capture and log reorg depths. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code is looking good, I only have few minor comments. Once this is ready for merging I'll also give it a run locally.
difficulty: block.difficulty, | ||
total_difficulty: block.total_difficulty, | ||
seal_fields: block.seal_fields, | ||
uncles: block.uncles, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is correct and worth it to assert that this is empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. I wouldn't think uncles ever report more uncles (that's not how it works). Asserting this might cause failures though. I'd rather have references to uncles of uncles in the resulting data that resolve to null
.
@@ -0,0 +1,83 @@ | |||
""" Block is an Ethereum block.""" | |||
type Block @entity { | |||
id: ID! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before we start rolling this out, we should brush up the work I did on making it possible to make ID
equivalent to Bytes
for a subgraph so that these id's take only 20 instead of 40 (or 42) bytes to store. It will be transparent to the rest of the code, but requires that we pass a flag to create_subgraph
that indicates whether ID
should be a String
or Bytes
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code for this is in store::postgres::relational::Layout::new
, but it's not exposed to callers, and instead capped at IdType::String
- we should expose this up in the callstack as arguments so it can be set in create_subgraph
and is stored in the database (maybe as a field on deployment_schemas
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How much is there to do? And how much, if any, risk does that support introduce? Does it affect clients, filters, anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides passing IdType::Bytes
or IdType::String
through when creating a subgraph, the following needs to be done in the Store
:
- fix up a handful of places in
relational_queries.rs
where we assume that the id is aString
- Look at the type of the id column in
information_schema.columns
when starting a subgraph and decide whether it usesBytes
orString
as the id
At the Entity
layer, the block explorer would have to make sure to pass id
as a Value::Bytes
rather than a Value::String
.
At the GraphQL layer, users have to pass the id
as something that can be converted to Value::Bytes
, and we'd have to do that conversion when coercing values. (I could make it so that we convert a Value::String
into a Value::Bytes
in relational_queries.rs
, but that seems a bit hacky)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked a little more, and to avoid changing too much in the code base, I think the best course of action is to keep that distinction within the relational mapping code. That means that code that deals with entity ID's continues to use strings, and the conversion from string to bytea
and vice versa all happens in relational_queries
. For users of the storage layer, the main change is that they might get a new error when the id is not a string in the form 0xdeadbeef
.
Those changes should be possible to do in a couple of days. The only change for the block explorer would be to pass IdType::Bytes
when creating the schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do this separately? We're not going to activate this feature right away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, totally; we just need to do this before the feature goes live. We can migrate the database after the fact, but it's likely to take long (maybe hours) and during that time, the block explorer would be unavailable.
When this goes into a release, we need to make sure it's behind a feature flag so that users who install that release don't wind up with this data in their database which we would have to migrate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #1414 to track this properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's already behind a --network-subgraphs
CLI flag.
99f3dcb
to
d085330
Compare
Set it to `false` for regular blocks and to `true` for ommers.
If they all start with `Failed ...`, they are easier to grep for.
When writing blocks, set the `isOmmer` entity field based on whether the block being written is an `Ommer` (true) or a `BlockWithOmmers` (false).
This is more idiomatic, apart from `LightEthereumBlock`, where a new `format()` method is added because `LightEthereumBlock` is a foreign type that we can't implement `Display` for without a wrapper.
Move these into `graph` so they can be used in other places as well (like other chain integrations in the future).
This avoids dealing with `Option` blocks in the rest of the indexer and therefore simplifies things a bit.
Since it's only used in one place right now (`track_future!` in the network indexer), we can get away with something as simple as ```rust let start_time = Instant::now(); ... .inspect(move |_| { let duration = start_time.elapsed(); ... }) ``` Squashme: remove measure
Instead of collecting all old and new blocks to find the common ancestor and revert old blocks, we simply revert the local head block one block at a time and re-evaluate the situation (by polling the chain head block again and deciding which blocks to look up next). Eventually, this procedure will revert the local head back to the common ancestor. For deep reorgs, this will be slow, but about 99% of Ethereum reorgs have a depth of one, so this is something we can live with easily.
f3363e6
to
f92ced3
Compare
I consider this generally ready for review!
This PR implements phase 1 of #297, including the following features:
A network indexer that indexes blocks from the selected network. (Transactions, logs, receipts, accounts, balances come in the next two phases.) This network indexer handles reorgs to an unlimited depth (bounded only by the memory of the machine the node runs on).
A
--network-subgraphs
CLI flag to enable network subgraph indexing per network, e.g. withA built-in GraphQL schema that allows Ethereum network subgraphs to be queried (or subscribed to) at
/subgraphs/ethereum/mainnet
. Apart from slightly different relationship fields, this schema is heavily based on Geth's GraphQL schema.Tests for basic indexing and reorg handling. These should be extended further by also verifying that the blocks are actually written to the store. Right now they only test the
Revert
andAddBlock
events emitted by the indexer (which are only emitted after reverting/writing blocks successfully; however that doesn't mean the written data is correct).Database size
Regarding the size of the indexed data: The most recent 4000 mainnet blocks resulted in a Postgres database size of 20MB. That makes ~65GB for all 9,000,000 blocks, assuming they are the same size on average. They are probably smaller on average (older blocks were less busy), so we're looking at maybe 50GB just for the blocks (or their headers, rather).
Review guide
I recommend reviewing commit by commit first. I've consolidated the PR into commits that mostly change only one thing at a time, so it's easier to follow what was added when.
It may help to read up on the terminology used across the network indexer here:
graph-node/chain/ethereum/src/network_indexer/network_indexer.rs
Lines 16 to 54 in ec19ad2
The state machine for the network indexer is documented here:
graph-node/chain/ethereum/src/network_indexer/network_indexer.rs
Lines 509 to 673 in ec19ad2
While this is being reviewed, I'll work on improving the tests to test data correctness and potentially generalizing the network indexer across chains. I've done some initial thinking to identify how the indexer depends on Ethereum right now and I think we can abstract that away.