Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

db-sync crashes when MIR cert transaction to unregistered stake address is submitted #845

Closed
mkoura opened this issue Sep 23, 2021 · 22 comments
Labels
bug Something isn't working

Comments

@mkoura
Copy link
Contributor

mkoura commented Sep 23, 2021

OS
Linux - Fedora 34

Versions
The db-sync version:

cardano-db-sync-extended 11.0.0 - linux-x86_64 - ghc-8.10
git revision bada2e65b06e898b326ab2cfc80d6f496ef41a7a

(yesterday's master)
PostgreSQL version: 11.11

Build/Install Method
The method you use to build or install cardano-db-sync: nix-build

Run method
The method you used to run cardano-db-sync (eg Nix/Docker/systemd/none): supervisord

Additional context
Latest master node running with latest master db-sync, Alonzo era, Protocol Version 6.

Problem Report
I have a test that tries to send funds from reserves or treasury pot to an unregistered stake address.

The test:

  • generates an MIR certificate for unregistered stake address
cardano-cli governance create-mir-certificate stake-addresses --reserves --stake-address stake_test1uzdf8kfcpxnspkk34s6xkrksdxaqr9h28zgztrpghvxg3js85h7k9 --reward 50000000000000 --out-file test_pay_unregistered_stake_addr_from_reserves_mir_stake.cert
  • submits a TX with the MIR certificate
cardano-cli transaction build-raw --fee 199571 --out-file test_pay_unregistered_stake_addr_from_reserves_tx.body --tx-in 2aa7224609633b7a7550048cef89259fb54e4f3f026765c066e4d521a2b34939#0 --tx-out addr_test1qpqs60harrg0tj9xgkm7mccm4lt7c8uzpkpvkkwqwd2p9au6j0vnszd8qrddrtp5dv8dq6d6qxtw5wysykxz3wcv3r9qu6548e+999800429 --certificate-file test_pay_unregistered_stake_addr_from_reserves_mir_stake.cert --alonzo-era
cardano-cli transaction sign --tx-body-file test_pay_unregistered_stake_addr_from_reserves_tx.body --out-file test_pay_unregistered_stake_addr_from_reserves_tx.signed --testnet-magic 42 --signing-key-file test_mir_certs_ci0_addr0.skey --signing-key-file delegate-keys/delegate1.skey
cardano-cli transaction submit --testnet-magic 42 --tx-file test_pay_unregistered_stake_addr_from_reserves_tx.signed --cardano-mode

After the transaction is submitted, db-sync crashes at the epoch boundary without any error in the log files, there's just warning about rewards earned:

[db-sync-node:Info:1092] [2021-09-22 22:20:36.41 UTC] Took a ledger snapshot at ./state-cluster0/db-sync/19995-f2f328a051-19.lstate
[db-sync-node:Info:1092] [2021-09-22 22:20:36.41 UTC] Removing epoch boundary files ["./state-cluster0/db-sync/17986-ffcf561ae8-17.lstate"]
[db-sync-node:Info:1092] [2021-09-22 22:20:36.42 UTC] Starting epoch 20
[db-sync-node:Info:1092] [2021-09-22 22:20:36.42 UTC] Handling 3 stakes for epoch 20 slot 20007, hash 899e72ba883a99b97dc3566ce0632f4f29f9523bc0eec00f3043f11abd150e96
[db-sync-node:Warning:1092] [2021-09-22 22:20:36.42 UTC] validateEpochRewards: rewards earned in epoch 18 expected total of 1202.844154 ADA but got 50001202.844154 ADA

After restarting it keeps crashing with the same warning.

@mkoura mkoura added the bug Something isn't working label Sep 23, 2021
@erikd
Copy link
Contributor

erikd commented Sep 23, 2021

This crash is rewards related. Rewards are currently known to be broken.

@erikd
Copy link
Contributor

erikd commented Sep 23, 2021

This is a logWrarning . It should not cause db-sync to terminate.

@erikd
Copy link
Contributor

erikd commented Sep 24, 2021

What network is this?

@mkoura
Copy link
Contributor Author

mkoura commented Sep 24, 2021

It's on local cluster, no testnet.

@erikd
Copy link
Contributor

erikd commented Oct 14, 2021

Supposedly a transaction that was supposed to trigger this issue was submitted to the shelley_qa network.

When running on that network I ran into some other problems (#882) and after figuring out that problem was relatively benign and adding code to skip that issue I find that db-sync does indeed hang. The last messages are:

[2021-10-14 01:31:43.52 UTC] epochPluginInsertBlock: epoch 5057
[2021-10-14 01:31:43.53 UTC] epochPluginInsertBlock: epoch 5058
[2021-10-14 01:31:43.60 UTC] Handling 31 rewards for epoch 5058 slot 36416171,hash 8b322627e5a...45776dc400ab
[2021-10-14 01:31:43.61 UTC] insertEpochInterleaved: Epoch 5058, 31 rewards

Stopping db-sync and restarting it does not help (good).

@erikd
Copy link
Contributor

erikd commented Oct 14, 2021

I traced this hang back to where it reads blocks from queue. Blocking (ie hanging) it the correct behaviour when the queue is and remains empty.

Looking at the other end of the queue, it turns out that nothing was being fed into the queue, because no blocks were being received from the node. Checking on the node showed that it had stopped getting blocks from the network.

This is all correct and expected behavior as far as db-sync is concerned.

Rerunning both the node and db-sync from genesis to make sure.

@erikd
Copy link
Contributor

erikd commented Oct 14, 2021

On resync from geneisis, db-sync halted, but only because the node it was connected to was no longer supplying it with blocks.

Someone from devops has confirmed that the shelley_qa network stopped creating blocks at 2021-10-13 04:20 UTC.

@erikd
Copy link
Contributor

erikd commented Oct 14, 2021

@mkoura your comment:

After restarting it keeps crashing with the same warning.

does not seem to be correct. That "warning" is not a crash. Its a log message. If db-sync is still running after you see that message, then it has not crashed.

WHile it does not seem to have "crashed" it does seem to have halted, but its doing that because its waiting for more input (ie more blocks from the node). However, since it does not get more input it doesn't do anything more.

@erikd
Copy link
Contributor

erikd commented Oct 14, 2021

shelley_qa network is currently down, but IOHK devops tells me it should come back up soon. I have left both my node and my db-sync running. What I expect to happen is that they come back up, the node starts feeding blocks to db-sync and db-sync continues as normal.

@mkoura
Copy link
Contributor Author

mkoura commented Oct 14, 2021

That doesn't seem like the problem I'm observing. My node is running with no issues and keeps creating blocks even after db-sync crashes. Db-sync process exits with exit code 1 (i.e. not hanging), every single time, 100% reproducible. Hence it is a crash.

My comment

After restarting it keeps crashing with the same warning.

means the warning is the last message before the crash (there's no error in the log file etc.)

@erikd
Copy link
Contributor

erikd commented Oct 14, 2021

Its reproducible by you, but I have not yet been able to, despite sowing several days of effort into the attempt. I cannot fix this, if I cannot reproduce it.

@erikd
Copy link
Contributor

erikd commented Oct 15, 2021

db-sync should always provide some indication or error message why it exited.

I can think of two ways db-sync could exit without any message:

  • Killed by the OM killer (but there would be a log message in the system logs).
  • It does produce an error message that is swallowed/hidden by whatever tool is being used to run it.

@erikd
Copy link
Contributor

erikd commented Oct 15, 2021

With assistance of QA I tried to reproduce this with master of node and master of db-sync with some tiny tweaks that avoid panicAborts on the shelley_qa network. db-sync passed the

pytest -k "test_pay_unregistered_stake_addr_from" cardano_node_tests

test with flying colours. No hang, no crash.

@mkoura
Copy link
Contributor Author

mkoura commented Oct 15, 2021

I'm still able to reproduce it with up-to-date master (cb1187494a75ae886c00a271d609e3ff7ab0e76b). However I needed to run the test several times before it crashed. Once it crashed, db-sync cannot be restarted, even when the database and db-sync dir is cleaned and db-sync is forced to sync from scratch.

I ran db-sync in the foreground so it dumps it's stdout and stderr to terminal:

...
[db-sync-node:Info:55] [2021-10-15 11:52:43.68 UTC] Starting epoch 39
[db-sync-node:Info:55] [2021-10-15 11:52:43.68 UTC] Handling 3 stakes for epoch 39 slot 39010, hash be32e945223ce01da36c8fc6be5665e2e023abf1e9ce9268509e865f96760eaa
[db-sync-node:Warning:55] [2021-10-15 11:52:43.68 UTC] validateEpochRewards: rewards spendable in epoch 39 expected total of 1199.775185 ADA but got 1501199.775185 ADA
Epoch No: 39
dbMap length: 4
ledgerMap length: 3
diffMap length: 1
e008b0fb5c159cb4d4c77e91e7d08d2391a392e22808cea7896a7766c4 (1500000000000, 1): [(RwdTreasury,DbLovelace 1500000000000)], Coin (-1)
Rewards differ between ledger and db-sync.
Please report at https://github.com/input-output-hk/cardano-db-sync/issues.

$ echo $?
1

@erikd
Copy link
Contributor

erikd commented Oct 19, 2021

@mkoura The original bug report claimed that db-sync was crashing without providing any information about why. I have yet to be provided with a way to reproduce that.

The above obviously does have information about why it is exiting (ie the lines starting at "Epoch No: 39") . However, this is almost certainly due to #882 which is a known issue that needs to be fixed in ledger-specs (IntersectMBO/cardano-ledger#2521).

If the original claim is correct, that db-sync crashes without providing any information about why, then that is a different issue.

@mkoura
Copy link
Contributor Author

mkoura commented Oct 19, 2021

If #882 doesn't crash, it's different issue. You can reproduce this issue either by following the steps in description, or by using the node DB and configuration where the issue was already reproduced that I have provided to you.
Also we see the crash regularly on CI nightly job when running with db-sync master.

@erikd
Copy link
Contributor

erikd commented Oct 19, 2021

#882 prints an error message and then exits. Exiting is a deliberate action, not a crash

@mkoura
Copy link
Contributor Author

mkoura commented Oct 19, 2021

#882 doesn't mention anything about exit or crash (October 19, 2021, 9:10 UTC)

@erikd
Copy link
Contributor

erikd commented Oct 19, 2021

Doesn't mention it, but it does exit.

@erikd
Copy link
Contributor

erikd commented Oct 21, 2021

This was never a "crash without any error message".

After updating the dependencies to fix the issue in #882, I still see:

Epoch No: 9
dbMap length: 4
ledgerMap length: 3
diffMap length: 1
e0b1d7a6db1859b8d24def65f87554e4549546186e4cb65b55a5e51399 (1500000000000, 1):
      [(RwdTreasury,DbLovelace 1500000000000)], Coin (-1)
Rewards differ between ledger and db-sync.
Please report at https://github.com/input-output-hk/cardano-db-sync/issues.

Looking at that address:

select stake_registration.* from stake_registration inner join stake_address
     on stake_address.id = stake_registration.addr_id
     where hash_raw = '\xe0b1d7a6db1859b8d24def65f87554e4549546186e4cb65b55a5e51399' ;
 id | addr_id | cert_index | epoch_no | tx_id 
----+---------+------------+----------+-------
  8 |      21 |          0 |        7 |    14
(1 row)

and almost immediately deregistered:

select stake_deregistration.* from stake_deregistration inner join stake_address
    on stake_address.id = stake_deregistration.addr_id
    where hash_raw = '\xe0b1d7a6db1859b8d24def65f87554e4549546186e4cb65b55a5e51399' ;
 id | addr_id | cert_index | epoch_no | tx_id | redeemer_id 
----+---------+------------+----------+-------+-------------
  2 |      21 |          0 |        7 |    16 |            
(1 row)

but then does receive a payment from the treasury:

select reward.* from reward
    inner join stake_address on reward.addr_id = stake_address.id
    where stake_address.hash_raw = '\xe0b1d7a6db1859b8d24def65f87554e4549546186e4cb65b55a5e51399' ;
 id | addr_id |   type   |    amount     | earned_epoch | spendable_epoch | pool_id 
----+---------+----------+---------------+--------------+-----------------+---------
 17 |      21 | treasury | 1500000000000 |            8 |               9 |        
(1 row)

If the address has been de-registered it should not be receiving rewards. The important factor here is not that has been registered and is currently unregistered, rather than it never having been registered at all.

Looks like this is yet another indication that trying to have two implementations of the same logc (ie in db-sync and in ledger-specs) is an incredibly bad idea.

@erikd
Copy link
Contributor

erikd commented Oct 22, 2021

Fixing this in db-sync is a HUGE pain in the neck. Running (a slightly modified version of) the stake history tool:

../cardano-node/stake-credential-history-exe --conf ../mir_cert_cluster/state-cluster0/config-bft1.json \
    --socket ../mir_cert_cluster/state-cluster0/bft1.socket \
    --stake-address-hex e0b1d7a6db1859b8d24def65f87554e4549546186e4cb65b55a5e51399 \
    --checkpoints-off
NEW-ERA ----------- EpochNo 1, SlotNo 1000, shelley
NEW-ERA ----------- EpochNo 2, SlotNo 2000, allegra
NEW-ERA ----------- EpochNo 3, SlotNo 3019, mary
NEW-ERA ----------- EpochNo 4, SlotNo 4000, alonzo
REGISTRATION ------ EpochNo 7, SlotNo 7783
WDRL -------------- EpochNo 7, SlotNo 7789, Lovelace 0
DE-REGISTRATION --- EpochNo 7, SlotNo 7803
MIR --------------- EpochNo 8, SlotNo 8016, TreasuryMIR, Lovelace 1500000000000

which shows the stake address receiving a MIR payment after it is deregistered. This points to an issue in ledger-specs.

The stake-credential-history tool mods have been PRed in IntersectMBO/cardano-node#3311 .

@erikd
Copy link
Contributor

erikd commented Oct 27, 2021

Finally fixed this in db-sync #918 .

Have not been able to recreate any crash or hang without an error message.

Closing this.

@erikd erikd closed this as completed Oct 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants