
Add a way to move all subgraphs out of a shard #4371

Closed
lutter opened this issue Feb 15, 2023 · 13 comments · Fixed by #4374

@lutter
Collaborator

lutter commented Feb 15, 2023

There are situations where it is desirable to move all subgraphs from one shard into another shard (e.g., to consolidate shards). It's currently possible to do that, but it's pretty laborious: for each subgraph in the shard that should be emptied, the following commands have to be run manually:

graphman copy create sgd<src> <dst shard> <index node>
# wait until copying has finished and the copy is synced
graphman copy activate <IPFS hash> <dst shard>
graphman unassign sgd<src>

It would be much better if graph-node could perform these steps without much user interaction. We should change graphman copy create to accept two new flags:

  • --activate: activate the copy when it has caught up with the chain head. For that, SyncStore::deployment_synced will call primary::Connection::activate
  • --replace: when the copy has synced, do what --activate does but also unassign the source of the copy so that it will eventually be deleted

With that, a shard can be emptied by running graphman copy create --replace sgd<src> <dst shard> <index node> for each subgraph in that shard (see the sketch below). It should be possible to issue those commands all at once; graph-node should work through that list on its own and appropriately limit load in case there is a huge number of deployments to move, though that needs to be tried out and tested carefully.
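
For illustration, a minimal shell sketch of what emptying a shard could look like once the proposed --replace flag exists; the shard name, index node, and the deployments.txt list are placeholders:

# Hypothetical sketch: one copy-and-replace per deployment in the shard being emptied.
# deployments.txt holds one sgd<N> identifier per line, e.g. sgd42.
DST_SHARD=shard_b
INDEX_NODE=index_node_1
while read -r SGD; do
  graphman copy create --replace "$SGD" "$DST_SHARD" "$INDEX_NODE"
done < deployments.txt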

@paymog

paymog commented Feb 21, 2023

We're outgrowing our database and would like to start sharding. Is there a way to add a new shard to the graph node and also configure it to create all new subgraphs in the new shard?

@lutter
Collaborator Author

lutter commented Feb 23, 2023

> We're outgrowing our database and would like to start sharding. Is there a way to add a new shard to the graph node and also configure it to create all new subgraphs in the new shard?

I actually talked about sharding in the indexer office hours a few weeks ago (starts at ~12:00 in the recording). Be warned, though, that if you copy deployments across shards, the indexer agent will get stuck because of this issue.
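
For reference, a minimal sketch of the kind of multi-shard setup discussed there, using graph-node's TOML store configuration; the connection strings, shard name, and index node are placeholders, and the catch-all deployment rule is what sends all newly created subgraphs to the new shard (check the config documentation for your version before copying this):

# Sketch: two store shards plus a catch-all deployment rule (placeholder names)
[store]
[store.primary]
connection = "postgresql://USER:PASS@primary-db/graph"
pool_size = 10

[store.shard_b]
connection = "postgresql://USER:PASS@shard-b-db/graph"
pool_size = 10

[deployment]
# Last (catch-all) rule: place all newly deployed subgraphs in the new shard
[[deployment.rule]]
shard = "shard_b"
indexers = [ "index_node_1" ]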

@paymog

paymog commented Apr 5, 2023

@lutter just did my first deployment migration between shards and it went well! However, it's not clear how to move a deployment back to the original shard. Any chance you have some tips you can share? I'd assume that trying to copy again would result in issues if the subgraph on the original shard still hasn't been cleaned up after being unassigned. It's also not clear how to drop the subgraph on the secondary shard when there are two copies.

EDIT: maybe the following is correct?

graphman copy activate <IPFS hash> <src shard>
graphman unassign sgd<dst>
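
One way to double-check which copy is currently active before unassigning anything: graphman info reads the primary shard's public.deployment_schemas table, so a query like the following sketch (assuming that table's usual name/shard/active columns) should list both copies and flag the active one:

-- Sketch: list every copy of a deployment and which one is active
select name, shard, active
from public.deployment_schemas
where subgraph = '<IPFS hash>';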

@paymog

paymog commented Apr 5, 2023

I just tried copying another subgraph several times and I keep getting errors like the following (each line is a different attempt):

Error: could not find a block with number 31352 in our cache
Error: could not find a block with number 32916 in our cache
Error: could not find a block with number 33064 in our cache

However, when I query the database, I can see these blocks in the cache:

client__goldskyraw=> select number,hash,parent_hash from chain1.blocks where number=31352 or number=32916 or number=33064;
 number |                                hash                                |                            parent_hash
--------+--------------------------------------------------------------------+--------------------------------------------------------------------
  31352 | \x62398b9dffdf345fb3e8540fb40dad00fc1ed6dd23fe8d1636b730ccfefb2acd | \xf6e8cb587b2d29323ef96d31cf842bae37bff07b5b4fc5bb6981f6fada4ae0ac
  33064 | \x22e518e2c4b1de6d79022f2439f6a9bda526fa4ba53d2fb101ff40f7fd4aad95 | \xedc9a8c71c68aacc9026a7eb0dde98f89dfb0361104a400ac1b687b4703e356b
  32916 | \xd4db0ce6f8a24969d76ff606b7cfa9a50c74d131df07059e694aa410de9c1f0e | \x14f290c8ef94c9f9be97e5160abbda7616d740d8a97dc34bd51eefa6965eebfa
(3 rows)

The command that produced the above errors is:

graphman copy create QmYPBRHzsv6TEtTryA7BtKcRtkszgeBA8wPMx14aKRWnSt secondary indexer_4

I also tried the following, with the same effect:

graphman copy create sgd12006 secondary indexer_4

Not sure if this is relevant, but this subgraph is running against mainnet, while the subgraph from the comment above (which I moved successfully) was running against matic.

@paymog

paymog commented Apr 5, 2023

Interestingly, it seems the copy command calls this function which uses the public.ethereum_blocks table. When I check that table in my primary shard, I see that it's empty:

client__goldskyraw=> select * from public.ethereum_blocks;
 hash | number | parent_hash | network_name | data
------+--------+-------------+--------------+------
(0 rows)

Now I'm wondering how the first subgraph was successfully copied...

EDIT: maybe the mainnet vs matic difference does matter? It seems that there are two ways to get the hash: the shared route linked above and the private route.

@paymog

paymog commented Apr 5, 2023

If I run graphman info against this subgraph I see the following:

name      | clg3mkfdj5jr73sy6gbw41jbv
status    | current
id        | QmYPBRHzsv6TEtTryA7BtKcRtkszgeBA8wPMx14aKRWnSt
namespace | sgd12006
shard     | primary
active    | true
chain     | mainnet
node_id   | indexer_4

and if I check the chains table for the mainnet chain, I see:

client__goldskyraw=> select * from public.chains where name = 'mainnet';
 id |  name   | net_version |                        genesis_block_hash                        |  shard  | namespace
----+---------+-------------+------------------------------------------------------------------+---------+-----------
  3 | mainnet | 1           | d4e56740f876aef8c010b86a40d5f56745a118d0906a34e69aec8c0db1cb8fa3 | primary | chain3
(1 row)

...oh, mainnet is chain3, not chain1 as in the previous comments, and it does seem that I'm missing those blocks in chain3:

client__goldskyraw=> select number,hash,parent_hash from chain3.blocks where number=31352 or number=32916 or number=33064;
 number | hash | parent_hash
--------+------+-------------
(0 rows)

@paymog

paymog commented Apr 5, 2023

What could be causing the graph node to not populate the block cache for mainnet? It seems the cache stops at block ~1500:

client__goldskyraw=> select number from chain3.blocks where number < 34000 order by number desc limit 5;
 number
--------
   1577
   1576
   1575
   1574
   1573
(5 rows)

@paymog

paymog commented Apr 5, 2023

cc @mangas @azf20

@paymog

paymog commented Apr 5, 2023

It seems there are large gaps in my block cache:

client__goldskyraw=> select number from chain3.blocks where number < 4000000 order by number desc limit 5;
 number
---------
 3743295
 3698589
    1577
    1576
    1575
(5 rows)
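
A query like the following sketch enumerates the gaps directly (plain window-function SQL against the same chain3.blocks table):

-- Sketch: list contiguous gaps in the chain3 block cache
select number + 1 as gap_start,
       next_number - 1 as gap_end
from (
  select number,
         lead(number) over (order by number) as next_number
  from chain3.blocks
) b
where next_number > number + 1
order by gap_start;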

@paymog

paymog commented Apr 5, 2023

I decided to try another subgraph, one that was caught up to the tip of mainnet, where I have a complete(-ish) block cache. I ran the command to create a copy successfully:

root@goldskyraw-indexer-0-graph-node-5fc7f86cb4-pl7d5:/# graphman copy create QmVfC9FSBLgpkMbmAR81CmkbGc8SpHgHmUSHps5u8U8mEm secondary indexer_7
created deployment QmVfC9FSBLgpkMbmAR81CmkbGc8SpHgHmUSHps5u8U8mEm[12040] as copy of QmVfC9FSBLgpkMbmAR81CmkbGc8SpHgHmUSHps5u8U8mEm[2315]

I then checked the secondary shard to see if the subgraph is there, and indeed it is:

client__goldskyraw=> select id,deployment,latest_ethereum_block_number from subgraphs.subgraph_deployment;
  id   |                   deployment                   | latest_ethereum_block_number
-------+------------------------------------------------+------------------------------
 12040 | QmVfC9FSBLgpkMbmAR81CmkbGc8SpHgHmUSHps5u8U8mEm |

However, it seems it's not syncing. That's when I realized this copy operation is effectively a graft, based on the following:

client__goldskyraw=> select id,deployment,graft_base from subgraphs.subgraph_deployment;
  id   |                   deployment                   |                   graft_base
-------+------------------------------------------------+------------------------------------------------
 12040 | QmVfC9FSBLgpkMbmAR81CmkbGc8SpHgHmUSHps5u8U8mEm | QmVfC9FSBLgpkMbmAR81CmkbGc8SpHgHmUSHps5u8U8mEm
(1 row)

I've been waiting a while for this graft to complete and I'm not seeing any progress in the graph-node logs. I'm also not seeing any data in the new subgraph:

client__goldskyraw=> select * from sgd12040.vote limit 1;
 vid | block$ | id | choice | weight | reason | voter | proposal | block | block_time | txn_hash
-----+--------+----+--------+--------+--------+-------+----------+-------+------------+----------
(0 rows)

And when I check the running queries in Postgres, I don't see anything in the secondary shard:

client__goldskyraw=> SELECT pid, usename, datname, query_start, state, query
FROM pg_stat_activity
WHERE state = 'active';
  pid  | usename |      datname       |         query_start          | state  |                          query
-------+---------+--------------------+------------------------------+--------+---------------------------------------------------------
 27996 | root    | client__goldskyraw | 2023-04-05 14:19:26.15164+00 | active | SELECT pid, usename, datname, query_start, state, query+
       |         |                    |                              |        | FROM pg_stat_activity                                  +
       |         |                    |                              |        | WHERE state = 'active';
(1 row)

Nor do I see any relevant queries in the primary shard, where the subgraph is being copied from:

client__goldskyraw=> SELECT pid, usename, datname, query_start, state, query
FROM pg_stat_activity
WHERE state = 'active' and query ilike '%sgd2315%';
  pid  | usename |      datname       |          query_start          | state  |                          query
-------+---------+--------------------+-------------------------------+--------+---------------------------------------------------------
 58077 | root    | client__goldskyraw | 2023-04-05 14:20:16.978556+00 | active | SELECT pid, usename, datname, query_start, state, query+
       |         |                    |                               |        | FROM pg_stat_activity                                  +
       |         |                    |                               |        | WHERE state = 'active' and query ilike '%sgd2315%';
(1 row)

@paymog

paymog commented Apr 5, 2023

Just found the following:

root@goldskyraw-indexer-0-graph-node-5fc7f86cb4-pl7d5:/# graphman copy list
------------------------------------------------------------------------------
deployment           | QmVfC9FSBLgpkMbmAR81CmkbGc8SpHgHmUSHps5u8U8mEm
action               | sgd2315 -> sgd12040 (secondary)
started              | 2023-04-05T13:56:53+00:00
progress             | 0.00% done, 0/9967119
root@goldskyraw-indexer-0-graph-node-5fc7f86cb4-pl7d5:/# date
Wed Apr  5 14:27:39 UTC 2023

The subgraph is ~8 GB according to the subgraph_sizes table:

client__goldskyraw=> select total from info.subgraph_sizes where subgraph='QmVfC9FSBLgpkMbmAR81CmkbGc8SpHgHmUSHps5u8U8mEm';
  total
---------
 8376 MB
(1 row)

Is it expected that the graft would still be at 0% after ~30 minutes?

EDIT: it's now been over an hour and the graft is still at 0%

EDIT 2: it's now the next day and still at 0%. I wonder if this is because I assigned the copy to the same indexer as the source?
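
For anyone debugging a stalled copy: the per-table progress that graphman copy list summarizes lives in the destination shard. If your version has the subgraphs.copy_table_state table (an assumption on my part; table and column names may differ between releases), a sketch like the following shows whether any table has advanced at all:

-- Sketch: per-table copy progress in the destination shard
-- (assumes subgraphs.copy_table_state; dst is the destination deployment id)
select entity_type, next_vid, target_vid, started_at, finished_at
from subgraphs.copy_table_state
where dst = 12040
order by entity_type;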

@paymog

paymog commented Apr 6, 2023

I tried to cancel the copy process by running the following:

root@goldskyraw-indexer-0-graph-node-5fc7f86cb4-pl7d5:/# graphman drop sgd12040
Found 1 deployment(s) to remove:
name       | cl8vzjh1g0ohg0hubf4sk0bj6
deployment | QmVfC9FSBLgpkMbmAR81CmkbGc8SpHgHmUSHps5u8U8mEm

Continue? [y/N] y
unassigning QmVfC9FSBLgpkMbmAR81CmkbGc8SpHgHmUSHps5u8U8mEm[12040]
Removing subgraph cl8vzjh1g0ohg0hubf4sk0bj6
Recording unused deployments. This might take a while.
id        | 12040
shard     | secondary
namespace | sgd12040
subgraphs |
entities  | 0
Recorded 1 unused deployments
====================================    1 ====================================
removing sgd12040 from secondary
   deployment id: QmVfC9FSBLgpkMbmAR81CmkbGc8SpHgHmUSHps5u8U8mEm
        entities: 0
done removing sgd12040 from secondary in 0.1s

and it seems to have removed the original deployment, since I can't create a new copy:

root@goldskyraw-indexer-0-graph-node-5fc7f86cb4-pl7d5:/# graphman copy create sgd2315 secondary indexer_2
Error: Found no deployment for `sgd2315`

nor find info on the deployment:

root@goldskyraw-indexer-0-graph-node-5fc7f86cb4-pl7d5:/# graphman info sgd2315
No matches

EDIT: I manually redeployed the same subgraph and things are good now. I was a bit surprised that dropping sgd12040 resulted in this issue.

@paymog

paymog commented Apr 6, 2023

Maybe this issue is because of #4394 and the fact that I'm still running 0.29.0? Does the copy operation invoke reassign_subgraph, and is that relevant to the grafting process?
