Using Substrate ETL, users can query Polkadot and Kusama networks for large scale analysis data of blocks, extrinsics, calls, events, balances, and transfers. Substrate ETL relies on Colorful Notion's Polkaholic.io indexing of Polkadot + Kusama networks into public Google BigQuery datasets (one dataset for each relay chain).
As of September 2023, substrate-etl datasets are now in Google's BigQuery Public Datasets: bigquery-public-data.crypto_polkadot
and bigquery-public-data.crypto_kusama
! See announcements from Google Cloud Web3 + Parity
NEW: As part of OpenGov #366, Colorful Notion is integrating Polkadot data (relay chain, system chains and all known parachains as of January 2024) with Dune. If you are interested in long-term support for your parachain, please email michael@colorfulnotion.com.
You can add these datasets to your Google Cloud project with the following Analytics Hub links:
When you add your project, you will get a Google Cloud assigned project ID (e.g. bumbleholt_14234) and will be able to get linked datasets to the crypto_polkadot
and crypto_kusama
datasets -- use this ID in place of the bigquery-public-data
below.
bigquery-public-data
is a public project within BigQuery which anyone can access. You see hundreds of BigQuery tables in the public crypto_polkadot
and crypto_kusama
datasets. Just open the query editor and try some of the queries below:
See all available tables/views via AAA_tableschema (Schema)
select * from `bigquery-public-data.crypto_polkadot.AAA_tableschema`
See all available tables/views of paraid 2000
select * from `bigquery-public-data.crypto_polkadot.AAA_tableschema` where table_id like '%2000'
Get blocks of paraid 2000 (Schema):
select * from `bigquery-public-data.crypto_polkadot.blocks2000` where DATE(block_time) >= "2024-03-01" and DATE(block_time) <= "2024-03-31"
Get extrinsics of paraid 2000 (Schema):
select * from `bigquery-public-data.crypto_polkadot.extrinsics2000` where DATE(block_time) >= "2024-03-01" and DATE(block_time) <= "2024-03-31"
Get calls of paraid 2000 (Schema):
select * from `bigquery-public-data.crypto_polkadot.calls2000` where DATE(block_time) >= "2024-03-01" and DATE(block_time) <= "2024-03-31"
Schemas for several of the most common source tables are listed below -- a full list of schemas can be found here with further details below.
Substrate data for each chain is held in 9 tables in one of 2 public datasets, with one dataset for each relay chain and all its parachains. By convention, relaychain data is considered "paraid=0".
Project: (Location: US)
bigquery-public-data
Datasets:
crypto_polkadot
crypto_kusama
Tables: (replace {paraID}
with a specific para ID, e.g. 2000
for acala
)
- Blocks:
bigquery-public-data.crypto_${relayChain}.blocks${paraID}
(date-partitioned byblock_time
) - Schema - Extrinsics:
bigquery-public-data.crypto_${relayChain}.extrinsics${paraID}
(date-partitioned byblock_time
) - Schema - Calls:
bigquery-public-data.crypto_${relayChain}.calls${paraID}
(date-partitioned byblock_time
) - Schema - Events:
bigquery-public-data.crypto_${relayChain}.events${paraID}
(date-partitioned byblock_time
) - Schema - Transfers:
bigquery-public-data.crypto_${relayChain}.transfers${paraID}
(date-partitioned byblock_time
) - Schema - Balances:
bigquery-public-data.crypto_${relayChain}.balances${paraID}
(date-partitioned byts
) - Schema
Thus polkadot relay chain blocks are held in bigquery-public-data.crypto_polkadot.blocks0
, acala blocks are stored in bigquery-public-data.crypto_polkadot.blocks2000
, and similarly for any chain / table name.
See Definitions for how the tables are constructed and tentative definitions.
Every chain has a auto generated README with the chains tables explicitly enumerated, and includes sample queries.
Notes:
- All tables are date-partitioned to support low cost, high speed scans.
- If a parachain has a renewal, the first paraid assigned is used for subsequent renewals.
See:
Field | BigQuery Type |
---|---|
table_id | STRING |
time_partitioning_field | STRING |
table_cols | STRING REPEATED |
table_schema | INTEGER |
For a quick overview of the available tables/views within the dataset, please query AAA_tableschema view
See:
Field | Type | BigQuery Type |
---|---|---|
hash | hex_string | STRING |
parent_hash | hex_string | STRING |
number | bigint | INTEGER |
state_root | hex_string | STRING |
extrinsics_root | hex_string | STRING |
block_time | bigint | TIMESTAMP |
author_ss58 | string | STRING |
author_pub_key | string | STRING |
spec_version | bigint | INTEGER |
relay_block_number | bigint | INTEGER |
relay_state_root | hex_string | STRING |
extrinsic_count | bigint | INTEGER |
event_count | bigint | INTEGER |
transfer_count | bigint | INTEGER |
See:
Field | Type | BigQuery Type |
---|---|---|
hash | hex_string | STRING |
extrinsic_id | string | STRING |
block_time | bigint | TIMESTAMP |
block_number | bigint | INTEGER |
block_hash | hex_string | STRING |
lifetime | JSON | JSON |
section | string | STRING |
method | string | STRING |
params | JSON | JSON |
fee | bigint | INTEGER |
weight | bigint | INTEGER |
signed | boolean | BOOLEAN |
signer_ss58 | string | STRING |
signer_pub_key | hex_string | STRING |
Field | Type | BigQuery Type |
---|---|---|
event_id | string | STRING |
section | string | STRING |
method | string | STRING |
data | JSON | JSON |
extrinsic_id | string | STRING |
extrinsic_hash | hex_string | STRING |
block_time | bigint | TIMESTAMP |
block_number | bigint | INTEGER |
block_hash | hex_string | STRING |
Field | Type | BigQuery Type |
---|---|---|
event_id | string | STRING |
section | string | STRING |
method | string | STRING |
data | JSON | JSON |
extrinsic_id | string | STRING |
extrinsic_hash | hex_string | STRING |
block_time | bigint | TIMESTAMP |
block_number | bigint | INTEGER |
block_hash | hex_string | STRING |
from_ss58 | string | STRING |
to_ss58 | string | STRING |
from_pub_key | string | STRING |
to_pub_key | string | STRING |
asset | string | STRING |
price_usd | float | FLOAT64 |
amount_usd | float | FLOAT64 |
symbol | string | STRING |
decimals | int8 | INTEGER |
amount | float | FLOAT64 |
raw_amount | bigint | INTEGER |
Field | Type | BigQuery Type |
---|---|---|
symbol | string | STRING |
address_ss58 | string | STRING |
address_pubkey | string | STRING |
ts | bigint | TIMESTAMP |
id | string | STRING |
chain_name | string | STRING |
para_id | int | INTEGER |
free | bigint | FLOAT |
free_usd | float | FLOAT |
reserved | float | FLOAT |
reserved_usd | float | FLOAT |
misc_frozen | float | FLOAT |
misc_frozen_usd | float | FLOAT |
frozen | float | FLOAT |
frozen_usd | float | FLOAT |
Generally the data is complete as can be, but the sole data source is the Polkaholic.io indexer.
From this single source, the primary causes of missing data stem from:
- Chains that do not provide a public RPC node. Most of the time, these chains are new with very little actual activity.
- Chains that have a public RPC node, but no RPC Endpoint is an archive nodes
- Chains that are being onboarded
- Some blocks fail decoding due to chain halting, or are missing an up-to-date node.js API package for type definitions.
A daily/hourly github actions process summarizes the state of the index for:
and for every single chain that is being indexed. See the report Issues column for chains with systemic issues or blocks that are missing. Generally the last 24 hrs have blocks that are missing that are filled in by the end of the day.
Included in each summary are sample queries and a complete breakdown. Chain data is appended daily.
- All temporal BigQuery datasets are date-partitioned and split into multiple tables by {paraId} to enable low-cost low-latency BigQuery scans for specific date, parachain combinations. Timestamped data use BigQuery TIMESTAMP date types.
- Addresses are provided in “public key” (signer_pub_key) and SS58 Address (signer_ss58) form to support multi-chain queries with wild card table selection eg
select * from polkadot.extrinsics* where signer_pub_key='<pubkey>'
aggregates multi-chain transactional history for a given account. - When assets are mentioned (eg transfers), we "decimalize" the output and include basic USD price valuation if possible. Many assets are not valued with USD values in this way.
Spring/Summer 2023
- Initial table designs {blocks, extrinsics, events, trasnfers, balances, xcmTransfers}
- Daily/hourly dump via Github workflow
- Hourly summary report for all reachable parachains
- GKE systematization, Reliability Improvements
- Integration with XCM Global-Asset Registry repo
Fall/Winter 2023
- Basic WASM contract support (psp22, events, bytecode) using ChainIDE integration (ink!ubator)
- Deep Account Analytics: Staking + Democracy + Relay Chain Trace Support (supported by Web3F)
- Dune PoC
Winter 2024
- Dune Integration of Relay Chain
Spring 2024
- Dune Integration of All Polkadot Parachains
- EVM Support within Dune
Your feedback and your ideas are important -- please submit an issue or reach out to us on Telegram (@sourabhniyogi) or Matrix.
Contributions are welcome. Contributors will be invited to a dedicated Telegram group and are held to the Polkadot communities' Code of Conduct.