Logbook 2022 H1
-
We have reviewed and merged one of the latest stages of the GCP Aggregator stabilization for the issue
Stabilize GCP Aggregator
#273 -
We have paired on the
Retrieve SD from cardano-cli in Aggregator/Signer
#275 issue and we have merged the real epoch
retrieval from the Cardano node. There is still a problem with the Docker images that crash because of a version issue of glibc
(version 2.29
is expected but not available on the debian:buster-slim
image used). We will fix the problem by using a different base image for the Docker files. -
Also we have paired on preparing the demo path:
- Road map to Open Sourcing
- Workshop summary presentation (see
miro board
) - End to end demo on
devnet
with real
Epoch Retrieval from Cardano
node:
# Setup demo
## Checkout correct commit
cd mithril/
cd mithril-client && make build && cp mithril-client ../../ && cd ..
cd ..
rm -rf mithril
---
# Demo: Bootstrap and start a Mithril/Cardano devnet
## Change directory
cd mithril-test-lab/mithril-devnet
## Run devnet with 1 BFT and 2 SPO Cardano nodes
NUM_BFT_NODES=1 NUM_POOL_NODES=2 ./devnet-run.sh
## Watch devnet logs
watch -n 1 LINES=5 ./devnet-log.sh
## Watch devnet queries
watch -n 1 ./devnet-query.sh
## Visualize devnet topology
./devnet-visualize.sh
## Stop devnet
./devnet-stop.sh
# Client
## Get Latest Snapshot Digest
LATEST_DIGEST=$(curl -s http://localhost:8080/aggregator/snapshots | jq -r '.[0].digest')
echo $LATEST_DIGEST
## List Snapshots
NETWORK=devnet AGGREGATOR_ENDPOINT=http://localhost:8080/aggregator ./mithril-client list -vvv
## Show Latest Snapshot
NETWORK=devnet AGGREGATOR_ENDPOINT=http://localhost:8080/aggregator ./mithril-client show $LATEST_DIGEST -vvv
## Download Latest Snapshot (Optional)
NETWORK=devnet AGGREGATOR_ENDPOINT=http://localhost:8080/aggregator ./mithril-client download $LATEST_DIGEST -vvv
## Restore Latest Snapshot
NETWORK=devnet AGGREGATOR_ENDPOINT=http://localhost:8080/aggregator ./mithril-client restore $LATEST_DIGEST -vvv
-
We have re-synchronized ourselves in order to progressively roll out the new features related to retrieving the real epoch and stake distribution from the Cardano node:
-
First: real epoch
- Wiring the new chain observer to the Aggregator and the Signer (only for the epoch retrieval code)
- Merging the new test lab that works directly with the devnet
- Fixing the rights issues between Cardano and Mithril nodes on GCP
-
Second: real stake distribution
- Activating the real stake distribution retrieval in the chain observer
- Activating the
PoolId
handling in the devnet and the test lab - Updating the configuration of the Mithril nodes on GCP
-
We have also talked about the enhancements to be done on the documentation website as described in
improve UI/UX of documentation site
#245. These will be done shortly. We also discussed including live
information gathered from the Aggregator
-
We have reviewed and merged the issue
Add Single Signatures store to aggregator
#282. It has raised some questions regarding some optimizations in the single signature registration process. We have opened a new issue to follow their implementation: Optimize single signature in Mithril Aggregator/Signer
#296 -
We have talked about PRs related to the switch of the
epoch
retrieval from the Cardano node: -
Devnet retrieves 'PoolId' artifacts from Cardano node
#292 -
Greg/250/chain observer
#294 - A new PR is in progress for updating the Docker images so that they embed the
cardano-cli
- A new PR is in progress for wiring the
cardano_cli_observer
in the aggregator and the signer instead of the fake_observer
- The merge of these PRs will unlock the creation of the snapshots on GCP, which is currently blocked
-
-
We also had talks about the finalization of the end to end test runner working on top of the devnet. We will merge it shortly
-
We have reviewed and paired on the almost finished work on the
Use the devnet in rust e2e tests
#274: - The end to end test runner is now able to interact with the devnet
- It raised some issues with single signature registrations that take too long and make the tests fail with more than 1 pool node on the Cardano network. Some optimizations are being done in the issue
Add Single Signatures store to aggregator
#282 - This should drastically reduce the average time to run the tests on the CI (from 9 minutes to less than 3 minutes) 🥳
-
We have noticed some issues with the configuration cascade: it is not working properly because the default values of
clap
always override any value passed through environment variables. We will need to work specifically on this problem shortly.
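A minimal sketch of the pitfall, with a hypothetical `run_interval` parameter (clap 3 derive API assumed):

```rust
use clap::Parser;

#[derive(Parser)]
struct Args {
    // Because of this default, the field always carries a value after
    // parsing, even when the user never typed --run-interval.
    #[clap(long, default_value_t = 60000)]
    run_interval: u64,
}

fn main() {
    let args = Args::parse();
    // Naive cascade "CLI first, then env var": the env var can never win,
    // since args.run_interval is populated even without a CLI flag.
    let run_interval = args.run_interval;
    // A possible fix: declare the field as Option<u64> with no default and
    // resolve explicitly: CLI value, else MITHRIL_RUN_INTERVAL, else 60000.
    println!("run_interval = {}", run_interval);
}
```
-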
In the meantime, some optimizations have been done in the PR
Enhance runtime configuration of Mithril Aggregator
#291, which has been merged in order to Stabilize GCP aggregator
#273. There are still issues related to epoch transitions (fake at the moment) that are occurring too often and that trigger unwanted transitions in the runtime of the Aggregator. We will add a quick fix for this in order to resume the snapshot production on GCP, and at the same time we are implementing the real epoch retrieval from the cardano-cli
(see Retrieve SD from cardano-cli in Aggregator/Signer
#275) -
We have also planned our developments for the switch to the real stake distribution of the Cardano network in
Retrieve SD from cardano-cli in Aggregator/Signer
#275: - We will deploy in two phases:
- Use real
epoch
but fake stake distribution
- Use real
epoch
and real stake distribution
- A difficulty is to use dynamic
PoolId
from the Cardano network as party_id
in the Mithril network. This implies some modifications to the Docker image production (so that they embed a cardano-cli
binary), the end to end test runner, as well as the devnet
-
The incoming PRs related to the stake distribution interface have been reviewed:
-
We had discussions about:
- The second step of the end to end tests runner which should use the new devnet in order to launch Cardano nodes (related to
Use the devnet in rust e2e tests
#274). We must take great care of the execution time of the tests as they are run in the CI. This will imply fine-tuning of the devnet Cardano nodes (epoch length for example) - The need to implement and use the
cardano-cli
in the Mithril nodes (related to Retrieve SD from cardano-cli in Aggregator/Signer
#275). We will pair on this tomorrow - The need to clearly define the scenarios to implement in the integration tests vs those run in the end to end tests (related to
Add integration tests in Mithril Aggregator
#284) - The evolution that we should do on the Aggregator runtime in order to better manage the
epoch
transition vs the immutable file number
transition in the beacon
- The runtime of the Signer that should be implemented with a state machine (as done in the Aggregator)
- We have reviewed the incoming issues and their respective PRs:
-
Provide interface for retrieving SD from external source in Aggregator/Signer
#250: should be merged shortly -
Retrieve SD from cardano-cli in Aggregator/Signer
#275: the parsing of the data from the stake distribution is almost done, and the call to the cardano-cli
is in progress -
Stabilize GCP aggregator
#273: the developments are in progress -
Use the devnet in rust e2e tests
#274:- The first PR has been merged:
Add devnet choose nodes to start
#278 - The adaptation of the end to end test runner is in progress
-
Remove Mithril error codes in aggregator
#283 has been merged 🥳 -
Handle the poolId along with party_id in Core/Aggregator/Signer
#276: work is blocked by #250 and will resume once it is merged
-
-
We had talks about the GCP aggregator stabilization #273. Among other issues, there seems to be some trouble with user rights on the file system. We will investigate them shortly. We now have multiple snapshots that are properly produced at each new beacon and that can be restored with the Mithril Client. We will continue focusing on this issue in the next few days. Also, we need to determine the target topology of the Cardano nodes we want to run in that environment (SPO or not).
-
We have reviewed the draft
Add chain observer interface
PR #281 that will be completed shortly. We have also talked about the first implementation of this interface using the Cardano cli related toRetrieve SD from cardano-cli in Aggregator/Signer
#275. This will require some modifications of the Mithril nodes (Docker development containers, CI-generated Docker images) and it will also require some adaptation work on the end to end test runner and on the devnet. -
A few tickets have been added to the board:
-
We have talked about the need for integration tests, particularly to increase test coverage of the Aggregator runtime (which is very well unit tested and correctly end to end tested). We will start to work on that point shortly.
-
We had a meeting with the Light Wallets team in order to start understanding:
- How we can use the Mithril protocol for light wallets
- What technical difficulties we might encounter
-
We could use Mithril on two sides:
- On the backend side by enabling fast bootstrapping of full nodes used by light wallets (aligned with the current use case we are working on)
- On the frontend side, which could be a very interesting use case:
- Provide an SPV mechanism for light wallets (as described in this documentation from the Mithril website)
- It may be possible to use Mithril certificates and embed more information about the Cardano chain than what we currently do, such as the UTxO set (which should be possible with the current Mithril network architecture). A light wallet would be able to certify the chain (including the stake distribution) up to the previous epoch by following the Mithril certificate chain. From this point, as the stake distribution is fixed until the next epoch, a 'classical' SPV mechanism à la Bitcoin could be possible. However, this assumption should be validated first
- Maybe we could run a full node on the client side?
-
A light wallet would need to run the
mithril-core
library to verify the multi signatures:- An iOS or Android app, that would use it as a static library
- A browser plugin, that would also embed it as a static library
- A browser web page, with WASM compiled from Rust (however, we don't know at this stage if it will be complicated due to the underlying cryptographic backend)
- In any case, there should be a single library (audited and trusted) on top of which to build applications
-
We will carry on investigating these use cases with regular meetings and workshops
-
We have prepared the tickets for the next iteration
-
We have also discussed the best way to handle the
Handle the poolId along with party_id in Core/Aggregator/Signer
issue #276. Here is the strategy we agreed on (a sketch follows this list): - Remove the
party_id
in mithril-core
interface (and recompute it inside the lib with custom sort based on stake and verification_key) - Switch the
party_id(u64)
with pool_id(string)
in mithril-aggregator
- Switch the
party_id(u64)
with pool_id(string)
in mithril-signer
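A minimal sketch (with simplified, assumed types) of the identifier switch and the in-library recomputation:

```rust
// party_id becomes a Cardano PoolId string instead of a u64
type PartyId = String;
type Stake = u64;

struct SignerWithStake {
    party_id: PartyId,        // e.g. "pool1..." on the Cardano network
    stake: Stake,
    verification_key: String, // hex-encoded, simplified here
}

// Inside mithril-core, parties can be ordered deterministically with a
// custom sort based on (stake, verification_key), as agreed above.
fn sort_parties(signers: &mut Vec<SignerWithStake>) {
    signers.sort_by(|a, b| {
        (a.stake, &a.verification_key).cmp(&(b.stake, &b.verification_key))
    });
}
```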
-
We have paired on fixing the last issues with
Implement state machine runtime in Mithril Aggregator
#221. The runtime state machine
PR #261 has been merged 🥳 -
We have also prepared the demo path for this iteration:
- Mithril Multi Signatures End To End: On new devnet with Real Evolving Snapshots / new Aggregator runtime state machine
# Setup demo
## Download source (if needed)
git clone https://github.com/input-output-hk/mithril.git
# or
git clone git@github.com:input-output-hk/mithril.git
## Checkout correct commit
cd mithril/
git checkout 5fef1c7427cc7f11fad0e9fcc7f550e259768185
---
# Demo: Bootstrap and start a Mithril/Cardano devnet
## Change directory
cd mithril-test-lab/mithril-devnet
## Run devnet with 1 BFT and 2 SPO Cardano nodes
MITHRIL_IMAGE_ID=main-f9a51d8 NUM_BFT_NODES=1 NUM_POOL_NODES=2 ./devnet-run.sh
## Watch devnet logs
watch -n 1 LINES=5 ./devnet-log.sh
## Watch devnet queries
watch -n 1 ./devnet-query.sh
## Visualize devnet topology
./devnet-visualize.sh
## Stop devnet
./devnet-stop.sh
- Mithril Test Lab: Rust End to end tests
# Setup demo
## Download source (if needed)
git clone https://github.com/input-output-hk/mithril.git
# or
git clone git@github.com:input-output-hk/mithril.git
## Make build (if needed)
mkdir bin
cd mithril/
git checkout 5fef1c7427cc7f11fad0e9fcc7f550e259768185
cargo build --release
cp target/release/mithril-{aggregator,client,signer,end-to-end} ../bin/
#cp target/release/mithril-{aggregator,client,signer,end-to-end} ~/.cabal/bin
cd ..
rm -rf mithril
---
# Demo: Bootstrap a Cardano node from a testnet Mithril snapshot
# Launch test end to end (error timeout)
./bin/mithril-end-to-end --db-directory ./db.timeout/ --bin-directory ./bin
# Launch test end to end (success)
./bin/mithril-end-to-end --db-directory ./db/ --bin-directory ./bin
- Mithril Restore from GCP: On testnet snapshots (if it works 🤞)
# Setup demo
## Download source (if needed)
git clone https://github.com/input-output-hk/mithril.git
# or
git clone git@github.com:input-output-hk/mithril.git
## Make build (if needed)
cd mithril/
git checkout 5fef1c7427cc7f11fad0e9fcc7f550e259768185
cd mithril-client && make build && cp mithril-client ../../ && cd ..
cd ..
rm -rf mithril
---
# Demo: Bootstrap a Cardano node from a testnet Mithril snapshot
# Aggregator
## GCP logs
ssh curry@aggregator.api.mithril.network -- docker-compose logs -f mithril-aggregator
## Show pending certificate
watch -n 1 "curl -s -X 'GET' 'http://aggregator.api.mithril.network/aggregator/certificate-pending' -H 'accept: application/json' | jq ."
## Show snapshots
watch -n 1 "curl -s -X 'GET' 'http://aggregator.api.mithril.network/aggregator/snapshots' -H 'accept: application/json' | jq ."
# Client
## Get Latest Snapshot Digest
LATEST_DIGEST=$(curl -s http://aggregator.api.mithril.network/aggregator/snapshots | jq -r '.[0].digest')
echo $LATEST_DIGEST
## List Snapshots
NETWORK=testnet AGGREGATOR_ENDPOINT=http://aggregator.api.mithril.network/aggregator ./mithril-client list -vvv
## Show Latest Snapshot
NETWORK=testnet AGGREGATOR_ENDPOINT=http://aggregator.api.mithril.network/aggregator ./mithril-client show $LATEST_DIGEST -vvv
## Download Latest Snapshot (Optional)
NETWORK=testnet AGGREGATOR_ENDPOINT=http://aggregator.api.mithril.network/aggregator ./mithril-client download $LATEST_DIGEST -vvv
## Restore Latest Snapshot
NETWORK=testnet AGGREGATOR_ENDPOINT=http://aggregator.api.mithril.network/aggregator ./mithril-client restore $LATEST_DIGEST -vvv
## Launch a Cardano Node
docker run -v cardano-node-ipc:/ipc -v cardano-node-data:/data --mount type=bind,source="$(pwd)/data/testnet/$LATEST_DIGEST/db",target=/data/db/ -e NETWORK=testnet inputoutput/cardano-node
-
We have reviewed and merged the following PR:
-
We have talked about the demo path for this iteration
-
We have paired on the
Mithril Aggregator runtime state machine
#221: - Implementation of the
/certificate-pending
route with the certificate pending store - End to end tests to make sure that the new runtime is working properly
- We have noticed a bug with
digest
encoding in the certificate that will be solved in the PR - We will merge it shortly 💪
-
We have reviewed the
Implement state machine runtime in Mithril Aggregator
issue #221. The issue is almost completed and should be merged shortly 😄 -
We have also reviewed and paired on the
Deploy test network w/ nodes in SPO mode
issue #249. It is still impossible to launch a working Cardano private devnet in Docker Compose 🤔 Although we fixed networking issues that triggered warnings regarding the IP Subscriptions, we still receive TraceNoLedgerView
errors and we can see that the nodes don't produce any blocks. We will carry on investigating this issue. However, the devnet behaves properly when launched with the shell -
Finally, we have reviewed and paired on the
Migrate test-lab to Rust
issue #248. We will merge the code shortly 🚀. Once this issue and the previous one are done, we will follow up by decommissioning the "legacy" version of the end to end test, and we will also work on improving the coverage of the tests
-
The bug
Fix digester hash is not deterministic
#241 has been closed: the bug does not reproduce anymore on GCP 🥳 -
We have reviewed and merged the PRs related to issue
Use "local" SD in Mithril Aggregator
#252: -
We have also reviewed the PR (in progress) related to issue
Implement state machine runtime in Mithril Aggregator
#221 -
Also, we have talked about and reviewed work on the rewritten end to end test:
-
Migrate test-lab to Rust
issue #248 - Handling of return codes in
mithril-client
- Macro implementation for retries/delays
-
Deploy test network w/ nodes in SPO mode
issue #249 - The devnet is now working with SPO nodes
- The code is being cleaned-up
- There are still issues with running Cardano nodes when launched with Docker Compose
-
- We have reviewed the following PRs:
- Setup of the Devnet with nodes in SPO mode launched with Docker, related to issue #249. There is still some instability on the nodes and investigations are in progress to solve the issue
- Rust adaptation of the end to end test of the issue #248
- Implementation of the aggregator state machine runtime related to #221
-
The
Flaky Tests CI
bug #207 has apparently been fixed by switching the crypto backend of the mithril-core
library to the one used byzcash
🥳. There is a configuration available to switch back toblst
. In the mean time a bug has been created on the repository of theblst
backend. -
We have reviewed the development (in progress) of the Devnet with nodes in SPO mode launched with Docker, that is related to issue #249
-
We have reviewed the Rust adaptation of the end to end test of the issue #248
-
We have also reviewed the implementation of the aggregator state machine runtime linked to #221. We have done some pairing on this issue in order to refine the state machine transitions
-
The PR
Use verification key/beacon stores in Mithril Aggregator multi signer
#247 has been reviewed and merged -
We have prepared the tickets for the sprint
-
We have talked about the
Flaky Tests CI
issue #207 and tried to find ways to fix it (cargo-valgrind
,signal_hook
, using other backend crypto libraries using the same curve). Some tests are in progress 🤞 -
We have paired on the Mithril Aggregator runtime state machine (linked with
Use store for Pending Certificates
#221) -
We have also talked about the end to end tests migration to Rust and we have stated that:
- The first development will be iso-functional with the current version in Haskell (checking that the current scenario works:
snapshots/certificates are produced by querying the aggregator REST API
) - It will be integrated as a new dependency in the Cargo Workspace of the repository (and will not embed any other dependency from it)
- It should be easy to add new scenarios to it when required
- It will bootstrap itself from a previously created snapshot (if possible; this would require keeping previous state across runs)
-
We have concentrated our efforts on pairing to fix the bug #244 that froze the API service of the Mithril Aggregator during the snapshot digest computation and the snapshot archive creation. The problem has been fixed in the PR #246 and is deployed on the GCP environment 🥳
-
We also discussed how to implement the end to end test in Rust
-
We have discussed the
Flaky Tests CI
issue #207 and the tests that we have been running yesterday. Here are our findings:- We have not succeeded in outputting a stack trace: we are completely blind and have no clue of what causes the crash
- The issue started at job run #801, just after the release of the new
blst
crypto backend - We have tried to reproduce the flaky behavior before this release by relaunching multiple times the job run #766 but it never failed
- The
SIGILL
error is happening multiple times in the same job run with the same test executable as in #1022 - We have downloaded the test binary file computed as an artifact by the CI in #1024:
- It causes the
SIGILL
in the CI - It does not cause the
SIGILL
on 2 computers outside the CI
- It causes the
- It appears that the problem could be located between:
- The machines allocated by GitHub Actions, which are not consistent
- The
blst
backend or itsunsafe
implementation inmithril-core
-
We have prepared the demo path that will showcase a full end to end snapshot creation/certification/verification/restoration:
# Mithril Multi Signatures End To End
# With Real Snapshot
# Resources
## Github
google-chrome https://github.com/input-output-hk/mithril
## Architecture
google-chrome mithril-mvp-architecture.jpg
## Interact with the aggregator through the OpenAPI UI
google-chrome http://mithril.network/openapi-ui/
---
# Setup demo
## Download source (if needed)
git clone https://github.com/input-output-hk/mithril.git
# or
git clone git@github.com:input-output-hk/mithril.git
## Make build (if needed)
cd mithril/
git checkout 4f387911f7747b810758b3a4783134856307c13a
make build
cp ./target/release/{mithril-aggregator,mithril-client,mithril-signer} ../
cd ..
rm -rf mithril/
---
# Demo Step 1: Create a real certificate/multisignature from a real snapshot
## Prepare store
rm -rf ./stores
## Launch Aggregator
NETWORK=testnet SNAPSHOT_STORE_TYPE=local SNAPSHOT_UPLOADER_TYPE=local PENDING_CERTIFICATE_STORE_DIRECTORY=./stores/aggregator/pending-cert_db CERTIFICATE_STORE_DIRECTORY=./stores/aggregator/cert_db URL_SNAPSHOT_MANIFEST= ./mithril-aggregator -vvvv --db-directory=./db --runtime-interval=30
## Launch Signer #0
PARTY_ID=0 RUN_INTERVAL=30000 NETWORK=testnet DB_DIRECTORY=./db AGGREGATOR_ENDPOINT=http://localhost:8080/aggregator ./mithril-signer -vvvv
## Launch Signer #1
PARTY_ID=1 RUN_INTERVAL=30000 NETWORK=testnet DB_DIRECTORY=./db AGGREGATOR_ENDPOINT=http://localhost:8080/aggregator ./mithril-signer -vvvv
## Display pending certificate
curl -s "http://localhost:8080/aggregator/certificate-pending" | jq .
---
# Demo Step 2: Restore a real snapshot with a real certificate/multisignature validation
## Get Latest Snapshot Digest
LATEST_DIGEST=$(curl -s http://localhost:8080/aggregator/snapshots | jq -r '.[0].digest')
echo $LATEST_DIGEST
## List Snapshots
NETWORK=testnet AGGREGATOR_ENDPOINT=http://localhost:8080/aggregator ./mithril-client list -vvv
## Show Latest Snapshot
NETWORK=testnet AGGREGATOR_ENDPOINT=http://localhost:8080/aggregator ./mithril-client show $LATEST_DIGEST -vvv
## Download Latest Snapshot (Optional)
NETWORK=testnet AGGREGATOR_ENDPOINT=http://localhost:8080/aggregator ./mithril-client download $LATEST_DIGEST -vvv
## Restore Latest Snapshot
NETWORK=testnet AGGREGATOR_ENDPOINT=http://localhost:8080/aggregator ./mithril-client restore $LATEST_DIGEST -vvv
## Launch a Cardano Node
docker run -v cardano-node-ipc:/ipc -v cardano-node-data:/data --mount type=bind,source="$(pwd)/data/testnet/$LATEST_DIGEST/db",target=/data/db/ -e NETWORK=testnet inputoutput/cardano-node
-
We have reviewed and closed multiple PRs:
-
We have opened a complementary issue to the bug #223 in order to handle
Resolvable snapshots url in Mithril Aggregator
#232 -
We have discussed the
Flaky Tests CI
issue #207 in order to find ways to understand and solve the issue 🤔 -
Also we have paired on the creation of a
Verification Store
that will help smoothly implement the runtime of the aggregator. A PR should be created shortly for the issue Use verification key store
#233
-
We have talked about the bug #223 and we have reviewed the proposed solution
-
We have paired on the state machine specification of the aggregator runtime and we have produced the following diagram that summarizes the way it works:
-
We have reviewed and merged the PR #219 related to signing real snapshot digest in the client
-
We have also reviewed the PR #227 that closes the work on creating/using real certificates in the aggregator and the client. We have paired on fixing the end to end test that was not working anymore due to the underlying changes. This PR will be merged shortly.
-
We have talked about the fix on the CI warnings #213:
- One fix is temporary (with an associated TODO in the code)
- The long term fix is to create a config for the
AggregatorRuntime
that implements theFrom
trait with the generalConfig
struct of the node
-
The aggregator runtime needs more tests than are currently available. After the current work on the data stores is merged into it, we will start working on its integration tests 💪
-
We talked about the cryptographic library recent updates:
- Everything works fine with the latest merged implementation so far
- An update has been added regarding the
serde
configuration for the Aggregate Verification Keys
- We agreed that the panics should be replaced by errors for better error handling
- Some investigations will be led regarding the flaky tests #207, specifically in the
unsafe
parts of the code to check if they are responsible for this behavior
-
We have also paired on implementing a more testable version of the snapshot digester:
-
The data stores PR #211 has been merged 🥳 We have also paired on implementing a variant of the store for the Certificate Store #222 that will be merged shortly
-
Now that the data stores are available, we will pair on using them:
- In the runtime
- In the http handlers
- In the multi signer
-
We have reviewed and merged the
Cargo Workspace
PR #210 🥳 -
We have reviewed the latest version of the
Real Certificate Production in Mithril Aggregator
PR #209:- It has been merged
- A more robust hash computation is done in a new PR #212
-
We have also reviewed the
add generic adapters
PR #211 that is being finalized and will be merged shortly. Once it is merged, we will pair on the wiring implementation of the stores in the aggregator
-
We have reviewed the PR #209 that takes care of the Real Certificate Production in the Mithril Aggregator. We have paired on implementing the
fixed
crate in order to handle the Hash
and PartialEq
traits implementation for the ProtocolParameters
struct -
We have also reviewed the PR #211 in relation to the data stores implementation in the store
-
These two PRs will be merged shortly, and then the stores will be wired in the Aggregator
Runtime
and MultiSigner
-
We have also discussed in detail the implementation of the data stores and the link between the
Beacons
and the CertificatePendings
(with some Miro charts) -
Also, we have reviewed the PR #210 that enables a
Cargo Workspace
along with an enhanced CI workflow
-
We have reviewed the PR #203 related to the Local Snapshot Store in the Mithril aggregator and we have merged it 🥳
-
A difficulty met during this PR was defining which parameters should be set in the
clap
Args
struct and which in the Config
struct. It appears that we should (see the sketch after this list): - Use the
Args
for setting the run mode
and the verbosity level
- Use the following order when setting a parameter in the config:
- Value in config file first
- Then the env var, if it exists
- Then in cli args
- An ADR will be written describing this rule
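A minimal sketch of that ordering with the `config` crate (file and env var names are hypothetical):

```rust
use config::{Config, ConfigError, Environment, File};

fn build_config(cli_run_interval: Option<u64>) -> Result<Config, ConfigError> {
    let mut builder = Config::builder()
        // 1. Values from the config file first
        .add_source(File::with_name("config/dev").required(false))
        // 2. Then environment variables (e.g. MITHRIL_RUN_INTERVAL), if set
        .add_source(Environment::with_prefix("MITHRIL"));
    // 3. Then CLI args last, applied only when actually provided
    if let Some(interval) = cli_run_interval {
        builder = builder.set_override("run_interval", interval)?;
    }
    builder.build()
}
```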
-
We have paired on improving the end to end test so that it waits for the aggregator to be up and ready before running the tests, in PR #208, which has also been merged
-
We have discussed the data stores of the aggregator and we have decided to use a simple approach. We will save the different certificate pending versions and embed the verification keys directly in them. A PR following this approach is in progress
-
Also, we have talked about a possible headless design in which the aggregator would only be responsible for producing static files (snapshot list, snapshot details, certificate details: the only information needed to restore a snapshot, without any direct access to the aggregator). This would allow these materials to be used to self-bootstrap an end to end test.
- We have done our first technical retrospective and have decided the following actions:
- Action 1: Reschedule the Mithril Sessions in the morning and focus them on pairing
- Action 2: Start working on the Rust version of the Test Lab
- Action 3: Fix flaky tests in CI #207
-
We have reviewed the doc optimization PR #205. Minor modifications will be made before merging it. We will work on it in an iterative manner until we have a satisfactory result.
-
We have paired on the Aggregator local snapshot #206 and fixed some issues regarding the serving of static files with
warp
. -
We have prepared the next iteration with the goal:
Produce/verify multi-signature & certificates for evolving snapshot
(with fake stake distribution) -
An important point that we discussed is related to the data storage of the aggregator node:
- The data stores can be described as
beaconized key value
stores - They must provide access to data depending on a
beacon
, as thevalue
associated to akey
may evolve from one beacon to another (e.g. the signer verification key of a party that is registered during an epoch and available to use in the next one) - They must expose a function to list
all values
for onebeacon
- They must expose a function to list the
n latest values inserted
for one beacon - They must handle pruning of the data with a retention parameter on the
beacon
- The data stores can be described as
-
We have merged the following PRs:
-
We have talked about some difficulties due to the fact that we are moving from
fake
to real
data. We will add corresponding tasks to the next iteration in order to be able to produce/verify real certificates from real Cardano node data -
We have taken some time to review the documentation PR #205 and to make some modifications. It will shortly be ready for merging.
-
We have prepared the demo path for this iteration:
- Showcase of:
- The Mithril Aggregator node producing an evolving beacon following the database of the Cardano node
- The Mithril Signers proceeding to their registration with the Mithril Aggregator
- The production of the associated multi signatures by the nodes according to the pending certificate broadcast
- Presentation of the updated doc related to the open sourcing of the repository
# Mithril Multi Signatures End To End
# With Real Beacon / Real Signer Registration
# Resources
## Github
google-chrome https://github.com/input-output-hk/mithril
## Architecture
google-chrome mithril-mvp-architecture.jpg
## Interact with the aggregator through the OpenAPI UI
google-chrome http://mithril.network/openapi-ui/
---
# Setup demo
## Download source (if needed)
git clone https://github.com/input-output-hk/mithril.git
#or
git clone git@github.com:input-output-hk/mithril.git
## Make build (if needed)
cd mithril/
git checkout d896b557b69b3db120dcbece784611671141a635
cd mithril-aggregator && make build && cp target/release/mithril-aggregator ../../mithril-aggregator && cd ..
cd mithril-signer && make build && cp target/release/mithril-signer ../../mithril-signer && cd ..
cd mithril-client && make build && cp target/release/mithril-client ../../mithril-client && cd ..
cd ..
rm -rf mithril/
---
# Demo Step 1: Display real beacon
## Prepare immutables
rm -f ./db/immutable/{00011,00012}.{chunk,primary,secondary} && tree -h ./db
## Launch Aggregator
NETWORK=testnet URL_SNAPSHOT_MANIFEST=https://storage.googleapis.com/cardano-testnet/snapshots.json ./mithril-aggregator -vvvv --db-directory=./db --snapshot-interval=30
## Display pending certificate
curl -s "http://localhost:8080/aggregator/certificate-pending" | jq .
---
# Demo Step 2: Display real beacon with new `immutable_file_number`
## Copy next immutables
cp -f ./db/immutable.next/00011.{chunk,primary,secondary} ./db/immutable/ && tree -h ./db
## Display pending certificate
curl -s "http://localhost:8080/aggregator/certificate-pending" | jq .
---
# Demo Step 3: Register signers and show signers field updated in aggregator pending certificate route
# Then signers sends single signatures that are aggregated in a multi signature by the aggregator
# At this point, there is no more pending certificate
## Launch Signer #0
PARTY_ID=0 RUN_INTERVAL=30000 NETWORK=testnet AGGREGATOR_ENDPOINT=http://localhost:8080/aggregator ./mithril-signer -vvvv
## Launch Signer #1
PARTY_ID=1 RUN_INTERVAL=30000 NETWORK=testnet AGGREGATOR_ENDPOINT=http://localhost:8080/aggregator ./mithril-signer -vvvv
## Display pending certificate
curl -s "http://localhost:8080/aggregator/certificate-pending" | jq .
## Display pending certificate (204)
curl -v -s "http://localhost:8080/aggregator/certificate-pending" | jq .
---
# Demo Step 4: Display real beacon with another `immutable_file_number`
# Then signers run another round of signatures for the pending certificate
## Copy next immutables
cp -f ./db/immutable.next/00012.{chunk,primary,secondary} ./db/immutable/ && tree -h ./db
## Display pending certificate
curl -s "http://localhost:8080/aggregator/certificate-pending" | jq .
-
We have fixed the very long end to end test execution time, which is now back to nominal 🚀
-
We have reviewed the following PRs, that will be merged shortly:
-
We have decided to gather in a
Technical Retrospective
every Friday at the end of the sprint: - Manage the technical debt and code quality (review of
TODOs
- Discuss/write ADRs covering at least:
- Expose internal libs
- Errors handling and propagating
- Containerization and CI integration
- Configuration with env vars
-
We have continued pairing on fixing the end to end test lab for the Realish Beacon issuance PR:
- The issue is more difficult to fix than expected
- In order to move forward, we have temporarily removed the multi signature in the Mithril Client (and commented out the code accordingly)
- We will put it back, when the digest/certificate hash computation is final
- The end to end test takes very long to run in the CI. This issue will be fixed shortly: Cardano node db files will be embedded as artifacts
- The PR has been merged 🥳
-
We have also reviewed the Real Signers registration PR:
-
We have tried to test the struct mocking with the
mockall
crate, but we didn't achieve a good result. We will try again during another session -
We have reviewed the Real Beacon Producer PR:
- The code is ok and is ready to be merged
- This PR will be merged first as the digest computation is needed for remaining tickets of the iteration
- The end to end tests are red in the CI
-
We have paired on fixing the end to end test lab that is red, but we have not found a solution yet. We suspect a desynchronization of the digests between the nodes. We will carry on our investigation tomorrow in order to fix the test and proceed with the merge.
-
We talked about the penultimate
immutable number
and its consequences regarding security. Apparently, the ledger state is computed with the latest immutable. We will continue our work by using a digest computed from the penultimate immutable number
and by including in the snapshot the latest immutable number
associated files, as well as the ledger state. In parallel, we will investigate this issue and assess the security risks -
We still have some trouble with the CI, which is very flaky from time to time. As there is no clear pattern, we will carry on our investigation and re-trigger the failed jobs manually in the meantime
-
Adding a cargo workspace looks like a good idea, but will require modifying the CI pipeline. As it is not a priority, we will work on it later
-
We have reviewed the matrix build PR of the CI and it will be merged shortly, after some minor fixes. It looks very good 👍
-
We have also reviewed the PRs in progress:
-
During these reviews, we talked about some difficulties in synchronizing our developments and avoiding going in different directions. One of the issues was to better understand/define the role of the stores in the current implementation of the nodes. We have stated that the business components (
Snapshotter
, MultiSigner
, ...): - should embed the stores in their local implementations and determine where and how to use them
- the store should be injected at startup time in the
main.rs
dependencies init sequence
-
An idea would be to implement alphabetical order of the dependencies in the
Cargo.toml
files for better readability. The cargo-sort
tool does this, and we could add it to the make check
calls and as a new step in the CI build jobs
-
We have reviewed and talked about the implementation of the upcoming
VerificationKeyStore
-
We have also reviewed the in-progress work on the
Snapshotter
and its digest/fingerprint
computation feature -
⚠️ Question: What data do we need to certify and to embed in the snapshot? - If we work with an
immutable number
that is the penultimate, it means that we can't certify the latest immutable number
associated files - But maybe these files are used by the Cardano node to compute the ledger states
- Is it a security hole to embed the latest
immutable number
files even if they are not certified? - If we can't embed them, can we still use the latest ledger state (which could considered as tampered by the Cardano node and/or could be a security hole if tampered and used jointly with tampered latest
immutable number
files)?
-
We have noticed that some tests are flaky and fail from time to time. For example this one. We need to investigate further to understand if it is due to the recent changes of the CI made to analyze the Rust test results or if this is linked to the modifications done in the
mithril-core
library.
-
We have reviewed and merged the
crypto_helper
module in mithril-common
and talked about issues when importing #[cfg(test)]
modules from the common library (and how to bypass them) -
Some modifications on the
mithril-core
library are in progress and will need to be adapted in the crypto_helper
(mainly types renaming at this time) -
We have paired on the computation of the real beacon based on the immutable files of the Cardano node:
- We have noticed that the latest immutable number associated files are updated until the following immutable number files are created
- Thus we have decided to rely on the penultimate immutable number to compute the beacon (see the sketch after this list)
- The Open API specifications and corresponding
entities
type will be modified accordingly by removing the now deprecated block
field and replacing it with an immutable_number
field
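A minimal sketch (with assumed field names) of the updated beacon entity and the penultimate-number rule:

```rust
struct Beacon {
    network: String,
    epoch: u64,
    immutable_file_number: u64, // replaces the deprecated `block` field
}

/// The latest immutable file is still being written by the Cardano node,
/// so the beacon is computed from the penultimate immutable file number.
fn beacon_immutable_file_number(sorted_numbers: &[u64]) -> Option<u64> {
    sorted_numbers.iter().rev().nth(1).copied()
}
```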
-
We have created the tickets for the next sprint on the Kanban board and we have talked about what tasks they require
-
Apart from these tickets, we have included some optimization/cleanup tasks that we will work on during the sprint in order to lower the technical debt
-
We have talked about the freshly redesigned
mithril-core
library and the remaining questions/issues: -
StmSig
and StmMultiSig
are not yet serde-compliant because of the blake2 hash
. A solution is in progress - We will rely on the
StmInitializer
serialization instead of the signer secret key (and thus the secret_key
accessor will be removed in the core library) - Two tests from the protocol demonstrator tool are not working anymore as is (a protocol parameters modification was necessary, by changing the
phi_f
from 0.50
to 0.65
). An investigation is under progress in order to understand what happened (https://github.com/input-output-hk/mithril/blob/79f0c5f48d7f30ef1821782a8777c9302ab7a612/demo/protocol-demo/src/demonstrator.rs#L604 and https://github.com/input-output-hk/mithril/blob/79f0c5f48d7f30ef1821782a8777c9302ab7a612/demo/protocol-demo/src/demonstrator.rs#L634)
-
-
We have paired on including an internal lib with a
lib.rs
that exposes (and re-exposes) the internal modules. A PR and an ADR are in progress. -
We have also reviewed and merged the PR of the new
mithril-common
library that will be used to implement shared features among the nodes. In order to simplify the CI, it may be a good idea to implement matrix builds with parameters inside the jobs definition, instead of simply duplicating the code.
-
We talked about the
mithril-core
library:- How to replace
ark_ff
FromBytes
and ToBytes
traits implementation. The preferred option is to use serde
as a backbone for these tasks as it is very convenient to use and will offer possibility to import/export from different formats (json, yaml, ...) - Next step to handle the backend update to
blst
: once all the work has been done on the core library, we will rebase the main
branch on it and fix the code that uses it - We should use these RNG out of the tests (that can still use a seeded one):
OsRng
orThreadRng
- We should not need to use re-exported modules from the core lib as it should fully wrap what's under the hood. If this need arises, maybe we will have to modify the core library accordingly.
- How to replace
-
In order to facilitate E2E tests, we should implement a verify-only or dry-run mode in the Mithril Client
-
An idea would be to add a UI to the client by attaching an http server to it
-
The newly bootstrapped Mithril Signer node should be included in the
Terraform
deployment of the CI: this will allow producing live
multi signature certificates. We will do it during the next iteration. -
There is a question about how to enforce that values passed to the aggregator are correctly formatted (e.g. base64 certificate hashes):
- Add explicit
400
http errors in the OpenAPI specification and in the http handlers implementation - Investigate further into how this issue has been elegantly addressed by the Hydra team
-
The security alerts thrown by
dependabot
have been screened: - We will run a
cargo update
command at the end of every sprint in order to use the latest versions of the crates
[cargo-audit](https://lib.rs/crates/cargo-audit) tool, which is also available as a GitHub action
- We will run a
-
We have made a test showcase of the sprint demo path and everything worked as expected 😅
- We have created the following demo path for the sprint demo:
# Mithril Multi Signatures End To End
## Github
google-chrome https://github.com/input-output-hk/mithril
## Architecture
google-chrome mithril-mvp-architecture.jpg
## Interact with the aggregator through the OpenAPI UI
google-chrome http://mithril.network/openapi-ui/
## Download source (if needed)
git clone https://github.com/input-output-hk/mithril.git
# or
git clone git@github.com:input-output-hk/mithril.git
# optional
git switch ensemble/fix-client-certificate-hash-encoding
## Make build (if needed)
cd mithril/
cd mithril-aggregator && cargo build && cp target/debug/mithril-aggregator ../../mithril-aggregator && cd ..
cd mithril-client && cargo build && cp target/debug/mithril-client ../../mithril-client && cd ..
cd mithril-signer && cargo build && cp target/debug/mithril-signer ../../mithril-signer && cd ..
cd ..
# Signer #0
PARTY_ID=0 RUN_INTERVAL=120000 NETWORK=testnet AGGREGATOR_ENDPOINT=http://localhost:8080/aggregator ./mithril-signer -vvvv
# Signer #1
PARTY_ID=1 RUN_INTERVAL=120000 NETWORK=testnet AGGREGATOR_ENDPOINT=http://localhost:8080/aggregator ./mithril-signer -vvvv
# Aggregator
NETWORK=testnet URL_SNAPSHOT_MANIFEST=https://storage.googleapis.com/cardano-testnet/snapshots.json ./mithril-aggregator -vvv
# Client
## Get Latest Snapshot Digest
LATEST_DIGEST=$(curl -s http://aggregator.api.mithril.network/aggregator/snapshots | jq -r '.[0].digest')
echo $LATEST_DIGEST
## List Snapshots
NETWORK=testnet AGGREGATOR_ENDPOINT=http://localhost:8080/aggregator ./mithril-client list -vvv
## Show Latest Snapshot
NETWORK=testnet AGGREGATOR_ENDPOINT=http://localhost:8080/aggregator ./mithril-client show $LATEST_DIGEST -vvv
## Download Latest Snapshot (Optional)
NETWORK=testnet AGGREGATOR_ENDPOINT=http://localhost:8080/aggregator ./mithril-client download $LATEST_DIGEST -vvv
## Restore Latest Snapshot
NETWORK=testnet AGGREGATOR_ENDPOINT=http://localhost:8080/aggregator ./mithril-client restore $LATEST_DIGEST -vvv
## Launch a Cardano Node
docker run -v cardano-node-ipc:/ipc -v cardano-node-data:/data --mount type=bind,source="$(pwd)/data/testnet/$LATEST_DIGEST/db",target=/data/db/ -e NETWORK=testnet inputoutput/cardano-node
-
We have paired on the Mithril Signer single signatures and finished the work; the PR should be merged shortly #167
-
We have also reviewed the Mithril Client multi signature verification in the PR #166. Some minor fixes are in progress before merging.
-
We will have to prepare a demo path tomorrow morning for tomorrow afternoon sprint demo (once all PRs are merged)
-
The
mithril-network
folder has been merged into the root of the repository 🥳 -
Here are some points that we should address shortly:
- Implement typed error with
thiserror
crate in all nodes where it is not done yet, and write an ADR for error handling (a sketch follows this list)
mithril-common
folder - Use the new version of the
mithril-core
library and handle the breaking changes - Move the deterministic RNGs used in the code and use non deterministic ones
- Reexport libraries from the
mithril-core
library in order to simplify the dependencies in the Cargo.toml
files
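A minimal sketch (with assumed error variants) of the `thiserror` approach:

```rust
use thiserror::Error;

#[derive(Error, Debug)]
enum SignerError {
    #[error("could not reach aggregator: {0}")]
    AggregatorUnreachable(String),

    #[error("signature registration failed for party {party_id}")]
    RegistrationFailed { party_id: String },

    // `#[from]` lets `?` convert an io::Error into SignerError::Io
    #[error(transparent)]
    Io(#[from] std::io::Error),
}

fn load_pending_certificate() -> Result<Vec<u8>, SignerError> {
    // the io::Error is automatically wrapped by the From implementation
    Ok(std::fs::read("stores/pending-cert")?)
}
```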
-
We have paired on the Mithril Signer and followed up the work started yesterday
-
We have talked about the simplification of the Mithril Verifier without using a
Clerk
. It could work if we embed this information in the Stake Distribution
certificate as the Merkle Tree root of the stake distribution (party_id
, stake
, verification_key
). There are still many open questions. -
We should open source the repository in mid-June in order to be ready for the Cardano Summit
- Intros
- Explanation of Mithril
- Explanation of the POC and MVP architecture and scope
- Discovery of the Github repository
- Demonstration of the first version of the Mithril Network that has been showcased last sprint
- Q&A session
- Plan next days/weeks
-
We will set up a Rust workspace for
mithril-core
and all the mithril-network
nodes as a first step -
We will also move the folders from
mithril-network
to the root of the repository
-
We have reviewed and merged the PR #163
-
Some issues with dependencies appeared with incompatible versions between the core library and the aggregator. To fix the issue, the
jsonschema
was rolled back to an earlier compatible version. We think it is a good idea to import the core library as a crate instead of a path, but this will be possible only once the repository is open sourced. - It's possible to have a file-based registry to/from which crates can be retrieved, but this seems not worth the hassle. It's unfortunate that GitHub does not support a Cargo registry out of the box
-
The code should make better use of errors and avoid the use of
unwrap
. There is an ADR in progress regarding a standardization of error handling and propagation that will help achieve this target. - Need to properly type all errors, instead of using
String
- Use
foo?
construct to automatically propagate and wrap errors to pass to the caller
- Business layer = "functional" core where errors are part of return type, either using
Result
or dome more specific type
- Need to properly type all errors, instead of using
-
The
certificate_certificate_hash
handler should make a call on a CertificateStorer
in order to not need to modify the handler code when updating its implementation. -
The
create_multi_signature
function should be called on the fly when a certificate is requested, instead of each time a single signature is registered. This function should return an enum type (for clarity, instead of a Result<Option<...>>
). Also, the function should be pure. - Example of a specific return type incorporating the error:
create_multi_signatures
should return enum MultiSig { QuorumNotReached, MultiSig(Multisignature), MultiSigError(..) } (sketched below)
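A minimal sketch of that return type (types are placeholders):

```rust
struct Multisignature; // placeholder for the real protocol type

enum MultiSig {
    /// Not enough single signatures registered yet
    QuorumNotReached,
    /// Aggregation succeeded
    MultiSig(Multisignature),
    /// Aggregation was attempted but failed
    MultiSigError(String),
}

// A pure function: the outcome depends only on its inputs.
fn create_multi_signature(single_signatures: &[Vec<u8>], quorum: usize) -> MultiSig {
    if single_signatures.len() < quorum {
        return MultiSig::QuorumNotReached;
    }
    MultiSig::MultiSig(Multisignature)
}
```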
- We paired on implementing the retrieve pending certificate and register signatures features and started to write tests. We will continue the pairing during tomorrow session.
-
We have paired on the structured logs and found a way to filter the log level with a cli argument. We also investigated further multiple drains implementation
-
We noticed a bug on the CI that showed green status when a Rust test failed. The bug has been fixed and deployed to the main branch
-
We have investigated the security alerts that show up in the repository and we will talk about how to fix/ignore them in a next Mithril Session. As the security fixes are generally delivered in more recent versions of the libraries, we think that we should run dependency updates regularly with
cargo update
(once a sprint/month) -
We noticed a flaky test in the Mithril Client and a fix is in progress
-
We have paired on implementing the
slog
in the Mithril Signer -
We implemented a drain that will asynchronously (with (
slog-async
)) produce structured JSON logs (with slog-bunyan
) and implemented custom vars inside them -
We found a way to use a global logger (which would not require referencing a
log
object everywhere in the code). However, this approach is not recommended in the case of a library. This means that we should shortly create a mithril-network/mithril-common
library that would be used by the nodes instead of using the aggregator both as a binary and a lib -
We still need to find a way to use the
VerbosityLevel
cli flag to be able to control the log level that is displayed -
Question: Do we need to add a terminal-specific drain? In that case, do we need to log JSON to a different place than
stdout
?
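A minimal sketch of the drain described in this entry (crate APIs of `slog-async`/`slog-bunyan` at the time, assumed):

```rust
use slog::{info, o, Drain, Logger};

fn main() {
    // bunyan-style structured JSON records written to stdout
    let drain = slog_bunyan::default(std::io::stdout()).fuse();
    // wrap the drain so records are emitted asynchronously
    let drain = slog_async::Async::new(drain).build().fuse();
    // root logger with custom vars attached to every record
    let log = Logger::root(drain, o!("node" => "mithril-signer"));
    info!(log, "node started"; "party_id" => 0);
}
```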
-
We have reviewed the PR in progress for creating multi-signatures in the aggregator #163
-
A first live demonstration of the REST API of the aggregator
MultiSigner
in a local environment was done: - creating a multi-signature from two single signatures
- retrieving it and its certificate
-
We noticed that there is no
Verifier
entity available in the cryptographic library: it needs to use a Clerk
to verify signatures and thus uses the full stake distribution, whereas a verifier would only need the root of the associated Merkle tree. An issue has been created for this: #162. This will have consequences on the structure of the certificate that we will need to address. -
The current PR #155 that implements a production ready backend library will not be merged during this sprint in order to avoid conflicts with the current developments.
-
This PR will be completed with some optimizations in the meantime:
- Make the implementation thread-safe
- Add native import/export functions for keys and signatures
-
When the cryptographic code is done, we will have to make some adaptations to the code of the Mithril Network nodes that use it before merging.
-
We have also talked about separating the Mithril Core library from the Mithril Network (by using a versioned crate of the core library). This will be possible as soon as the repository is open sourced.
-
We have reviewed the PR that updated the CI workflows in order to retrieve/save the Rust test results. The PR has been merged #142
-
We have also paired on bootstrapping the Mithril Signer node #143
-
Some dependencies conflicts occurred while importing the core library in the aggregator code base:
- We will update the dependencies of the core library and make sure they don't introduce any regression
- It could be a good idea to use a Rust Workspace to handle the dependencies smoothly. We'll investigate that option shortly
-
Some crates used in the core library are not production ready and need to be replaced:
-
Nice features to have in the core library are:
- type aliasing for readability and ease of use
- import/export of hex values of verification keys, single and multi signatures
-
We have reviewed an implementation attempt of
anyhow
in the SnapshotStore
of the aggregator for clean error propagation (see #152) -
We decided to use
thiserror
instead, as it is better to get typed errors. Anyhow could be used to take advantage of its Context feature, which helps track the context in which the error occurred
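A minimal sketch of combining the two (names are illustrative):

```rust
use anyhow::Context;
use thiserror::Error;

#[derive(Error, Debug)]
enum SnapshotStoreError {
    #[error("manifest is not valid JSON")]
    InvalidManifest(#[from] serde_json::Error),
}

fn list_snapshots(manifest: &str) -> anyhow::Result<Vec<String>> {
    let digests: Vec<String> = serde_json::from_str(manifest)
        .map_err(SnapshotStoreError::InvalidManifest)
        // anyhow's Context records where the failure happened
        .context("while listing snapshots from the remote manifest")?;
    Ok(digests)
}
```
-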
The structured logging ADR has been adopted unanimously and we will implement it during the Mithril Signer Node bootstrapping #143
-
A
mithril-common
library will be created in the mithril-network
folder in order to provide a shared source for components that are needed in multiple nodes (such as Snapshotter
)
-
The end to end tests that are currently coded in Haskell could be migrated to Rust for easier maintenance. We need to find a way to bootstrap the cryptographic materials needed by the local cardano nodes used during the test.
-
The current tests rely on external data hosted on Google Cloud. In order to get clean tests, we will need to decouple the snapshotter (a sketch follows this list):
- separate the
SnapshotProducer
(produces the snapshot archive, computes the digest) - from one/multiples
SnapshotReplicator
(replicates the archive file to CDN, IPFS and/or BitTorrent) - implement a configurable version of the aggregator that would init specific dependencies for these e2e tests
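A minimal sketch (hypothetical traits) of that decoupling:

```rust
use std::path::PathBuf;

struct Snapshot {
    digest: String,
    archive_path: PathBuf,
}

trait SnapshotProducer {
    /// Produce the snapshot archive and compute its digest
    fn produce(&self) -> std::io::Result<Snapshot>;
}

trait SnapshotReplicator {
    /// Replicate the archive (to a CDN, IPFS, BitTorrent, ...) and return
    /// the publicly reachable location
    fn replicate(&self, snapshot: &Snapshot) -> std::io::Result<String>;
}
```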
-
Some tests have been conducted in order to get a first verification that the Cardano node only needs valid
immutables
to bootstrap a node #137 -
This will allow us to create snapshots from an unmodified Cardano node while providing the fastest bootstrap at the same time
-
Further investigations need to be done with the Consensus team to get a final validation (scheduled for next sprint)
We had a thorough discussion about what the REST API should look like, as the current resources and verbs are not very consistent, and not very RESTish. The discussion devolved into investigating what the structure of certificates would look like and how chaining would work.
Stakes:
-
GET /stakes/0
-> signed by genesis key -
GET /stakes/{epoch number}?limit=&from=&to
-> Retrieve the (signed) stake distribution for some epoch -> link to previous epoch's stakes distribution
Snapshots:
-
GET /snapshots?limit=1&from=123&to=345
-> retrieve a list of snapshots, latest first? -
GET /snapshots/{mithril round}
-> return snapshots for given round -> contains:- beacon information (what are we signing, at what epoch/block/slot)
- pparams
- list of signers -> with a link
- digest of message for some beacon (mithril round)
- multisignature -> link to signatures?
- stake distribution hash + link to /stakes/{epoch number}
Signers:
-
PUT /signers/{mithril round}/{party id}
= register some signer's key for next "round" -> needs to be signed! by the skey associated with this party's identification vkey -> body: verification key (mithril specific key) + signature
Signatures:
-
(later)
GET /signatures/{mithril round}
- 200: list signatures for given round
-
POST /signatures/{mithril round}
- contain signature(message digest || stake distribution hash)
- 400 -> hash exists but you're not allowed to sign now
- 404 -> no message with that hash to sign
scenario from signer perspective:
Assumption: We have already registered for signing epoch 40000
-
PUT /signers/40000/123456789
-> register signer 123456789 to sign certificate at round 40001 -
GET /snapshots/40000
-> 404 while epoch has not changed - threshold => mithril round changes, list of signers is closed
-
GET /certificates/40000
-> gives me parameters for signing the certificate POST /signatures/40000
- epoch change -> sign new stake distribution
-
GET /stakes/{epoch}
-> gives me parameters for signing the stake distribution POST /stakes/40000
scenario from client perspective:
GET /snapshots
-
GET /snapshots/40000
-> verify multisignature
Question:
- What if signature fails? Esp. problematic for signing the stake distribution
- The previous model assumed we would sign a certificate at epoch n + 2, using stake at epoch n, and containing stake at epoch n + 1
-
We have created the tickets for the next sprint on the Kanban board and we have sliced them into multiple tasks
-
On top of the new increment tickets, we have included some optimization/cleanup tasks that we will work on today
-
We have paired on improving the dependency manager in the aggregator #131:
- We added a
SnapshotStoreHTTPClient
(an implementation of a SnapshotStorer
) as a new dependency - We improved the API Spec in order to better handle
default
status code - We have implemented a
with_snapshot_storer
used by the handlers of the http server (sketched below) - The configuration implemented relies only on a config file, which is a problem for the test lab, for example
- In order to fix this issue, we will implement the config crate so that we can easily substitute file configuration with env var configuration
- We added a
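A minimal sketch of what this layering could look like with the `config` crate's builder API; the file name, env prefix and setting key are hypothetical:

```rust
use config::{Config, ConfigError, Environment, File};

// Hypothetical layering: values from an optional config file can be
// overridden by environment variables such as MITHRIL_SNAPSHOT_STORE_URL.
fn load_settings() -> Result<Config, ConfigError> {
    Config::builder()
        .add_source(File::with_name("config/testnet").required(false))
        .add_source(Environment::with_prefix("MITHRIL"))
        .build()
}

fn main() -> Result<(), ConfigError> {
    let settings = load_settings()?;
    println!("snapshot store: {}", settings.get_string("snapshot_store_url")?);
    Ok(())
}
```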
- The legacy POC Node has been removed from the CI (the associated Docker registry should be erased)
- The aggregator snapshotter has been slightly modified in order to produce snapshots at startup (so that CD redeployments do not cancel them)
- A test will be conducted during next sprint in order to understand if the digest could be computed only from the immutables:
  - What happens if the ledger state distributed with the snapshot is tampered with (but the immutables are genuine)?
  - Try to restore a Cardano node with a past or future ledger state
  - Issue created for this test #137
- Next week's Mithril sessions will be used to talk about:
  - Optimization of the aggregator REST API
  - ADRs definition #128
  - Error typology to put in place (codes, meaning, ...)
- All the PRs have been merged and we are now able to make a full end-to-end demo, from snapshot creation to restoration of a Cardano Node with a cloud-hosted aggregator 🤓
- Here is the path for the sprint demo:
## Download source
git clone https://github.com/input-output-hk/mithril.git
## Go to Mithril Client directory
cd mithril/mithril-network/mithril-client
## Build Mithril Client
make build
## Get Latest Snapshot Digest
LATEST_DIGEST=$(curl -s http://aggregator.api.mithril.network/aggregator/snapshots | jq -r '.[0].digest')
## List Snapshots
./mithril-client -c ./config/testnet.json list
## Show Latest Snapshot
./mithril-client -c ./config/testnet.json show $LATEST_DIGEST
## Download Latest Snapshot
./mithril-client -c ./config/testnet.json download $LATEST_DIGEST
## Explore Data Folder
tree -h ./data
## Restore Latest Snapshot
./mithril-client -c ./config/testnet.json restore $LATEST_DIGEST
## Explore Data Folder
tree -h ./data
## Launch a Cardano Node
docker run -v cardano-node-ipc:/ipc -v cardano-node-data:/data --mount type=bind,source="$(pwd)/data/testnet/$LATEST_DIGEST/db",target=/data/db/ -e NETWORK=testnet inputoutput/cardano-node
- We have a few cache issues with the snapshot manifest file stored on Google Cloud, which we will try to fix asap
- We have talked about the aggregator REST API optimizations that we could make with a more resource-oriented interface
- We have paired on implementing a dependency manager in the aggregator #131
  - CI takes way too much time, e.g. > 20 minutes
  - Reduced full execution time to 10-12 minutes by:
    - Moving mithril-core test execution to a different job so that `docker-mithril-aggregator` and `docker-mithril-client` do not depend on it
    - Reusing executables produced by the `build-mithril-aggregator` and `-client` jobs in the `docker-xxx` steps, as they are binary compatible and it's pointless to do a rebuild that takes several minutes
  - Initially made the `terraform` step only depend on the `docker-xx` jobs, but then added a dependency on tests, which might or might not be a good idea
- We have paired on many PRs in order to be ready for an end-to-end demo at the end of the sprint:
  - Reviewing/finalizing the `SnapshotStorer` implementation of the aggregator
  - Reviewing the `Snapshotter` implementation of the aggregator and the Terraform deployment to Google Cloud
  - Reviewing/finalizing the `Unpack` snapshot archive implementation of the client
- We have also talked about how to showcase the e2e demo (given the delays needed to download & restore for the `testnet`)
- During the reviews we talked about architectural questions that arose during the latest developments:
  - Which pattern(s) to adopt to best handle errors?
  - Which implementation(s) to use for a clean architecture?
  - How to keep code readable, concise and not too complex?
- In order to keep track of these decisions, we will add an ADR topic to the documentation website #128
- As it is not necessary at the moment, and as it adds complexity to the CI pipelines, we will remove the triggering of the CI for PRs #121
- We have also talked about separating the `Aggregator` component from the `Snapshotter` component, which would only be responsible for creating the archive of the snapshot and serving/uploading it
- Provided a Cloud DNS configuration for the aggregator service. It can now be accessed at http://aggregator.api.mithril.network/
  - This required setting up a zone managed by Cloud DNS and updating the NS entries in the `mithril.network` DNS zone definition
  - Had a look at configuring an HTTPS load balancer to "guard" access to the mithril aggregator service
The goal is to automate the snapshotting process, emulating what the aggregator should be doing when it receives enough aggregated signatures, and to upload the corresponding archives to a publicly available Gcloud storage bucket that will allow mithril-client to retrieve, verify and use them to bootstrap a cardano-node.
- Initially thought I would use a simple `cron`-based process running the `mithril-snapshotter-poc` scripts. Tried writing some simple code in the `aggregator` that invokes the `mithril-snapshot.sh` script to build an archive and then uploads the archive to gcloud storage. This won't work though, as the container running the `mithril-aggregator` does not have much installed and thus cannot simply run the script. The simplest solution right now seems to rely on an external (cron driven) script that will do the necessary magic, perhaps based on some magic file that, when present, triggers the snapshot build and upload?
- Then decided to rewrite the scripts in Rust and run a dedicated thread alongside the `mithril-aggregator` server
  - We will need to be able to do it anyway, both for signing and aggregating, so better learn early how this works in Rust
- `snapshotter.rs` runs in a separate thread alongside the main server; needed to create a `Stopper` structure that can be used to stop the thread by passing it a poison pill (see the sketch below). Assembling the various pieces in Rust:
  - There is a crate for interacting with gcloud storage: https://docs.rs/cloud-storage/latest/cloud_storage/. There's a way to stream the file content: https://docs.rs/cloud-storage/latest/cloud_storage/client/struct.ObjectClient.html#method.create_streamed
  - Rust for compressing a directory to an archive: https://rust-lang-nursery.github.io/rust-cookbook/compression/tar.html#compress-a-directory-into-tarball
  - This SO answer explains how to compute the sha256 of a file in rust: https://stackoverflow.com/a/69790544/137871
  - Then listing all files in a directory: https://natclark.com/tutorials/rust-list-all-files/#listing-files-but-not-folders
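A rough sketch of the poison-pill idea using a plain std channel; the actual `Stopper` in `snapshotter.rs` may well differ:

```rust
use std::sync::mpsc::{channel, Sender, TryRecvError};
use std::thread::{self, JoinHandle};
use std::time::Duration;

// The main thread keeps the Stopper; sending () is the poison pill.
struct Stopper(Sender<()>);

impl Stopper {
    fn stop(self) {
        let _ = self.0.send(()); // ignore the error if the thread already exited
    }
}

fn spawn_snapshotter() -> (Stopper, JoinHandle<()>) {
    let (tx, rx) = channel();
    let handle = thread::spawn(move || loop {
        match rx.try_recv() {
            // Stop on the poison pill, or if the Stopper was dropped.
            Ok(()) | Err(TryRecvError::Disconnected) => break,
            Err(TryRecvError::Empty) => {
                // ... build and upload a snapshot here ...
                thread::sleep(Duration::from_secs(60));
            }
        }
    });
    (Stopper(tx), handle)
}

fn main() {
    let (stopper, handle) = spawn_snapshotter();
    stopper.stop();
    handle.join().unwrap();
}
```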
- Archive generation is somewhat tricky: the example above does not highlight the fact that one needs to call `finish()` or `into_inner()` on the archive to ensure everything's flushed, which took me quite a while to get right. Also took me a while to correctly compute the size of the file...
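The gotcha in code, as a sketch based on the cookbook example above (crates `tar` and `flate2`; paths and entry name are illustrative):

```rust
use flate2::{write::GzEncoder, Compression};
use std::fs::File;
use tar::Builder;

fn create_archive(db_dir: &str, out_path: &str) -> std::io::Result<u64> {
    let out = File::create(out_path)?;
    let enc = GzEncoder::new(out, Compression::default());
    let mut tar = Builder::new(enc);
    tar.append_dir_all("db", db_dir)?;
    // Without these two calls the archive is silently truncated:
    // `into_inner` finishes the tar stream and returns the encoder,
    // which must itself be finished to flush the gzip trailer.
    let enc = tar.into_inner()?;
    enc.finish()?;
    // Only now does the file size on disk make sense.
    Ok(std::fs::metadata(out_path)?.len())
}
```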
- Uploading the file to gcloud storage was also somewhat involved:
  - Need to pass an environment variable containing the credentials, correctly formatted
  - Then struggled to find how to stream the file so that we don't load the full 10+ GB of the archive into RAM before sending it to storage; it turns out this relies on the `tokio::fs` module and the `tokio-util` crate, but it's only two lines
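Roughly those two lines, in a hedged sketch built around the `create_streamed` method linked above; bucket and object names are placeholders, and `Client::default()` assumes the crate's usual credentials env var is set:

```rust
use tokio_util::io::ReaderStream;

async fn upload_archive(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let file = tokio::fs::File::open(path).await?;
    let size = file.metadata().await?.len();
    // Line 1: turn the file into a stream of chunks instead of one big buffer.
    let stream = ReaderStream::new(file);
    let client = cloud_storage::Client::default();
    // Line 2: hand the stream to the storage client, chunk by chunk.
    client
        .object()
        .create_streamed("snapshots-bucket", stream, size, "testnet.tar.gz", "application/gzip")
        .await?;
    Ok(())
}
```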
- The whole upload process is fully `async`-based but the toplevel thread isn't: I had to pass down a `tokio::Runtime` to be able to `block_on` the result of the async tasks, which does not seem quite right. Should probably use tokio spawn to fork the snapshotter thread instead of fiddling with `Runtime::new` and the like (see the sketch below)…
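What that alternative could look like, hypothetically: make the snapshotter a task on the shared runtime so uploads are awaited directly:

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    // The snapshotter becomes a tokio task instead of an OS thread that
    // needs its own Runtime handle just to block_on the async uploads.
    let snapshotter = tokio::spawn(async {
        loop {
            // ... build the archive and `await` the upload directly here ...
            tokio::time::sleep(Duration::from_secs(3600)).await;
        }
    });
    // The HTTP server future would run here; aborting the task plays the
    // role of the poison pill.
    snapshotter.abort();
}
```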
- Struggling to get the snapshotter to work within docker; turnaround time is a PITA as every docker build takes ages. Found this trick to speed things up by caching dependencies.
- The snapshotter works within the container but fails to write the archive because access is denied. The problem is that the executable runs as a local user `appuser` which does not have rights to write to its current directory `/app`. Changing the directory's ownership within the build should be fine though:
  WORKDIR /app/
  RUN chown -R appuser /app/
  Tried to use volumes but it did not work either, as volumes are created as the user running the docker container, which by default is root
- Archive creation is now working in the container but it seems the container crashes when it tries to upload the file. Some testing inside the container shows that the executable segfaults when it tries to connect to gcloud, which points at issues with system-level dependencies, e.g. crypto or tls libraries in alpine.
- Replaced the base docker image with `debian:buster-slim`, and same for the builder image (`rust:1.60`, which is debian-based)
- Created the `mithril-network/mithril-infra` terraform project to host the configuration of the mithril aggregator stack:
  - A virtual machine with a static address
  - Running a docker-compose stack with a cardano-node (testnet) and a mithril-aggregator process
- Added a `terraform` step in CI to automate deployment: `terraform apply` is run when pushing to the `main` branch
- Configured secrets in github: `GOOGLE_APPLICATION_CREDENTIALS_JSON` containing the service account's key that will be passed to the mithril-aggregator for uploading snapshots, and `GCLOUD_SECRET_KEY` to be used by the GitHub action running terraform
  - Struggled to get escaping right, both when passing the arguments to the github action's steps and to the terraform process
  - In github Yaml, enclose the reference to the variable's content in single quotes: the variables are interpolated before being passed to the shell and their content is thus not interpreted
    terraform apply -auto-approve -var "image_id=${{ env.BRANCH_NAME }}-${{ steps.slug.outputs.sha8 }}" \
      -var 'private_key=${{ env.GCLOUD_PRIVATE_KEY }}' \
      -var 'google_application_credentials_json=${{ env.GOOGLE_CREDENTIALS }}'
  - Same goes for the terraform configuration file:
    provisioner "remote-exec" {
      inline = [
        "IMAGE_ID=${var.image_id} GOOGLE_APPLICATION_CREDENTIALS_JSON='${var.google_application_credentials_json}' docker-compose -f /home/curry/docker-compose.yaml up -d"
      ]
    }
  - The `GCLOUD_SECRET_KEY` contains the secret key associated with a public key stored in the `ssh_keys` file that can be used by the terraform process to ssh into the VM; it's an RSA private key in ascii armor format, including the newlines. I tried to remove or replace the newlines and it did not work; it seems the single quote escaping is just fine with newlines
  - The `GOOGLE_APPLICATION_CREDENTIALS_JSON` contains the JSON key for the service account, with newlines removed
- There is an issue with the Docker image build pipelines, which are not very reliable. After investigation, it appears that they should be fixed/enhanced. We have created an issue for this #121. The problem was apparently due to not using the latest version of the github actions: we have released a fix and will monitor whether this happens again
- As the snapshot creation and storage on Google Cloud is now alive, we have created a task to implement a `SnapshotStore` in the Mithril Aggregator #120
- A pairing session was done to work on this task and we managed to retrieve the list of remote snapshots from the aggregator 🎉. The corresponding PR is #123
- We have paired in order to fix multiple issues:
  - The branch of the Init Mithril Client was corrupted after a conflict resolution #110
  - After merging the PR, we found out that the documentation workflow of the CI did not work. It was due to publishing build artifacts from the `cargo doc` processes to the `gh-pages` branch, which reached its limit of 50MB. The fix was published in #118
- We had talks about the best way to deploy the Mithril Aggregator server to a public cloud:
  - Several solutions were investigated (Kubernetes, custom Terraform deployment, Google Cloud Run, ...)
  - As we need to embed a Cardano Node with the Aggregator that has access to its local storage, we decided to start with a Terraform deployment of a Docker Compose setup
  - This hosting setup will also allow us to easily and regularly produce snapshot archives (scheduled by a cron)
  - On each merge to the `main` branch, a workflow will trigger a deployment of the latest Docker image to the cloud
- We talked about the questions raised by Inigo about yesterday's meeting with Galois. A Q&A meeting session will be organized on that matter
- The documentation job needs to be changed. This will be done in #104
  - The documentation should be produced in the build jobs (for more efficiency)
  - The generated files (e.g. as a compressed archive) should be uploaded as build artifacts
  - The documentation job should use the previously produced artifacts to prepare the new release of the documentation website
- We decided not to use job templates for now in the CI jobs definition
- We worked on a Feature Map for the increments of the components of the Mithril network
- We have reviewed the graceful shutdown fix for the Mithril Aggregator #116
- We have also reviewed the code of the Mithril Client #110
- We had some thoughts on how to produce snapshots, use the remote storage as a database at first, and have the Mithril Aggregator make this information available through its API
- The aggregator has a few issues and fixes/optimizations that will be done in #109
- We have worked on the CI pipeline architecture in order to get a clear view of it
- We should publish the executable files and make them available on Github (Linux only, except for the Mithril Client, which needs to be available on macOS, Linux and Windows)
- Question: Do we need to do the end-to-end testing on the executable files or the Docker images?
- We have decided to work with the executable files at first, and then with Docker. We may need to change the way the Docker images are built and embed the previously built executable file.
- We have successfully created a first ETE test scenario where we check that a freshly started aggregator effectively produces a snapshot and makes it available at its `/snapshots` route in `assertClientCanVerifySnapshot`. It will not work with the current aggregator implementation until the number of fake snapshots displayed is reduced to 1
- We had a fruitful discussion detailing the state machine view of the signer's registration process that will be formalised in the monitor framework.
- The signer should be essentially stateless regarding the certificates, and retrieve all data from the aggregator, e.g. pending certificate and certificates chain. If there are multiple aggregators it becomes possible to have multiple chains, which is probably not a big deal but could become quite messy
- Aggregators are essentially infrastructure providers: they set up some infrastructure to efficiently distribute snapshots and they have an incentive to provide as up-to-date as possible snapshots to their "customers"
- In the long run, both signers and aggregators could use the blockchain itself as a communication medium in order to synchronise and ensure consensus over the certificates chain: the tip of the certificate chain could be part of a UTxO locked by a script enforcing single chain production, possibly with time bounds in order to prevent arbitrary locking of the certificate chain and thus DoS?
- The amount of transactions involved is not huge: 1000-2000 signatures + 1 aggregation per epoch (5 days) would be enough, but this implies signers and aggregators spend money for posting the txs
- The problem of rewarding the mithril network is still open
- The previous certificate hash is missing from the `/certificate-pending` route. An issue has been created for this task: #112
- We have set up Live Sharing on VS Code for pairing sessions (using the Live Share extension)
- We have worked on some refinements of the `openapi.yaml` file so that the documentation generated by Docusaurus does not truncate the description sentences
- In order to keep the generation process not too cumbersome, we have decided to keep the source `.md` files in the `docs/root` folder, mixed with the website files. When we need to refer to an external `README.md` file (for example `mithril-network/mithril-aggregator/README.md`), the easiest way to do it is to provide a direct Github link such as https://github.com/input-output-hk/mithril/blob/main/mithril-network/mithril-aggregator/README.md
The documentation is now live and accessible at (https://mithril.network/) and the CI updates it along the way
-
As releasing production Github actions can be tricky (especially when specific operations are launched only on the *main branch), it looks like a good idea to do it in pairing
- We have reviewed the first code base of the Mithril Client available at jpraynaud/mithril-client
- We have created the tickets/tasks on the project Kanban board following the sprint planning that took place yesterday
- In order to facilitate the pairing sessions, we will set up Live Sharing on VS Code next week
- Question: Do we move the `openapi.yaml` file from the root to the `mithril-network/mithril-aggregator` folder?
- Question: In order to have the fastest possible CI workflow completion, what are the best parallelization/caching strategies for the jobs?
- We have talked about the sprint demo and reviewed the demo path
- Questions asked during the demo:
  - Can we cache the digest computation on the signer side? (Yes, good idea. It will lower the load on the signer side)
  - Can we pipeline the download and the decompression of the snapshot? (Yes, good idea. We can do that as an optimization later)
  - What happens with the Genesis certificate when the Genesis keys of the node are updated? (We need to investigate this)
  - How long does it take to verify the certificate chain? (This could be done in parallel with the snapshot download/decompression to avoid further delays)
  - How do we handle the discovery of the aggregators? (This could be done on-chain. We will need to investigate this later)
- It appears that the db-analyser tool provides a good way to create deterministic snapshots of the ledger state:
  - It will be much easier to use than modifying the Cardano Node snapshot policy
  - We will investigate further in that direction
  - We need to check whether we can rely on the slot number or the block number to produce consistent snapshots
- We have worked on the documentation and reviewed the first version of the docusaurus website
- We still need to define precisely the final documentation structure
- A session with the communication department will be set up to review the documentation and prepare open sourcing
- A new CI workflow has been released that better separates the steps of the process
- We still struggle to get access to the Docker registry. The IT department is working on it
- Question: How can we delete previously created packages? (`mithril` package now and `mithril-node-poc` later)
- The OpenAPI specification validation unit tests in Rust are almost done
- They allow us to validate that the requests and responses sent to the aggregator server are compliant with the specification (for the routes we decide to test)
- There are a few modifications/upgrades under development that should be finalized shortly:
  - Create a separate module for the `APISpec`
  - Create a Docker image of the aggregator
  - Create a workflow for the aggregator server in the CI
  - Make some refinements to the specification (include `400` errors for the `POST` method, and handle more precisely the size of the string params)
- Next steps are:
  - create an integration test for the server
  - implement the server in the ETE test lab
  - optimize the tests code if necessary
- Question: do we need to move the `openapi.yaml` file to the `mithril-network/mithril-aggregator` folder?
- We now have a better understanding of how Docusaurus works
- As a start, we will try to reproduce the Hydra documentation website (where it applies)
- We will update the docs directory structure and create "Under construction" pages where it makes sense
- Here is a new breakdown of the timings to produce/certify/restore snapshots on the mainnet:
- We have finally found a way to test that the actual implementation of the Mithril Aggregator conforms to the OpenAPI specification
- We have worked on making the testing work for a `response` object, with explicit error messages when the Rust tests fail
- We are currently trying to implement the validation of the `requestBody` according to the specification
- We have also discussed the several layers of tests and how to implement them:
  - At an upper level, we'd like to have an integration test (or several) that would:
    - test the global conformity of the aggregator server implementation to the specification
    - use the specification as the source to generate tests: fuzzy tests and/or stress tests
    - we still need to investigate and find the best tooling for this and work on a CI implementation
  - At a lower level, in the Rust server, we will need to test that each route is correctly implemented in unit tests:
    - use the actual implementation and test it to validate that it matches the specification
    - each route should be tested for every possible response status code (this will be easier when mocks are implemented)
    - each of these tests must also validate the conformity of the request/response to the OpenAPI specification
    - another unit test should also check that all the routes implemented in the server match exactly all the routes defined in the specification (we still need to find a way to list all these routes on the server - we will investigate the warp documentation further)
Some more tools that could be useful for testing/checking REST APIs:
- https://github.com/s-panferov/valico
- There's a Rust implementation of Pact CDCT tool: https://github.com/pact-foundation/pact-reference/tree/master/rust
- https://github.com/apiaryio/dredd/ also provides a mechanism for "hooks" implemented in Rust
- Pact general approach seems to make sense in the long run but might be cumbersome to setup
- We have started to review the way the documentation is implemented on Hydra and the integration that is made with Docusaurus and the CI workflows
- We will continue this investigation tomorrow
- We will design a tree structure for the website that will be hosted at https://mithril.network and work on restructuring the `docs` folder of the repository
Working on snapshot generation in the cardano-node, adding a parameter to be able to generate a snapshot every X blocks. Not sure how to test this though, as the test infrastructure of consensus is quite hard to grok.
I was unable to build cardano-node using `nix-shell -A dev` or `nix develop` as instructed in the build instructions, but managed to build it by installing the needed OS dependencies (forked libsodium and libsecp256k1, essentially). Adding the ouroboros-network packages as local packages in cardano-node's `cabal.project` enables incremental builds, whereby changes to the former entail a rebuild of the latter.
- Intros
- Explanation of Mithril
- Explanation of the POC and MVP architecture and scope
- Demonstration of the cryptographic library
- Plan next days/weeks
- Dig into repository codebase and documentation
- Development environment setup
- Setup of daily stand-up and added Denis to other meetings
- First version of the server is pending review (in its fake version)
- We need to enhance the tests and make sure the server correctly implements the OpenAPI specification
- For these tests we are scouting some libraries and utilities: the OpenAPI Fuzzer crate and/or the OpenAPI crate
- We cannot use `openapi-fuzzer` as a lib, so we try to get inspiration from it instead. The openapiv3 package seems mature enough to be used to read the specification and use that to write simple tests checking the output of various routes; we can later generate arbitrary data as input.
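A tiny sketch of that direction, assuming the `openapiv3` and `serde_yaml` crates (the file name is illustrative):

```rust
use openapiv3::OpenAPI;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse the spec once; tests can then look up routes, parameters and
    // response schemas from the typed structure.
    let content = std::fs::read_to_string("openapi.yaml")?;
    let spec: OpenAPI = serde_yaml::from_str(&content)?;
    println!("OpenAPI version: {}", spec.openapi);
    Ok(())
}
```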
Created an `APISpec` structure that will help us define what we test and what to expect
- We are trying to use jsonschema to check conformance of a document against a schema
- https://tarquin-the-brave.github.io/blog/posts/rust-serde/ fiddling with serde between Yaml and Json, as our spec is written in Yaml but `JSONSchema` requires a JSON `Value` (see the sketch below)
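One way the round-trip could look, sketched with the `jsonschema` crate; `conforms` is a hypothetical helper, and the schema fragment would be extracted from the spec:

```rust
use jsonschema::JSONSchema;
use serde_json::Value;

fn conforms(schema_yaml: &str, response_body: &str) -> Result<bool, Box<dyn std::error::Error>> {
    // The spec is YAML but JSONSchema wants a serde_json::Value, so we
    // deserialize to serde_yaml::Value and re-serialize to serde_json::Value.
    let yaml: serde_yaml::Value = serde_yaml::from_str(schema_yaml)?;
    let schema: Value = serde_json::to_value(yaml)?;
    let compiled = JSONSchema::compile(&schema).map_err(|e| e.to_string())?;
    let instance: Value = serde_json::from_str(response_body)?;
    Ok(compiled.is_valid(&instance))
}
```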
Struggling with validating our dummy aggregator server against the OpenApi specification using rust tooling; it seems like jsonschema-rs does not work as easily as expected. Turns out things are more complicated than we would want: the OpenAPI specification is not really a proper JSONSchema, only some fragments of it are (e.g. types in requests/responses), so we would need to find the correct part of the schema to check our code against.
Possible options from there:
- Stick to using openapi-fuzzer as an external binary, running basic integration tests against a running server. Should be fine at first but will quickly break as we flesh out the internals of the server and need to tighten what's valid input
- Understand how the jsonschema-rs library works and make sure we check requests against it
- Use another tool like https://github.com/Rhosys/openapi-data-validator.js/blob/main/README.md. This one is in node so this would require us to run it as an integration test. Hydra uses a similar approach based on another `jsonschema` executable in node.
- Question: Should we develop the ETE tester of the test lab in Rust instead of Haskell (for simplicity and maintainability)?
- New tests have been run and it turns out that they are ~2x faster than the previous ones:
Mainnet

| Data | Node | Full | Archive | Snapshot | Upload | Download | Restore | Startup |
|---|---|---|---|---|---|---|---|---|
| Immutable Only | standard | 43GB | 24GB | ~28m | ~45m | ~25m | ~12m | ~420m |
| With Ledger State | modified | 45GB | 25GB | ~28m | ~45m | ~25m | ~12m | ~65m |

Testnet

| Data | Node | Full | Archive | Snapshot | Upload | Download | Restore | Startup |
|---|---|---|---|---|---|---|---|---|
| Immutable Only | standard | 9.5 GB | 3.5 GB | ~7m | ~5m | ~3m | ~2m | ~130m |
| With Ledger State | modified | 10 GB | 3.5 GB | ~7m | ~5m | ~3m | ~2m | ~6m |

Host: x86 / +2 cores / +8GB RAM / +100GB HDD. Network: Download 150Mbps / Upload 75Mbps. Compression: gzip
- The network is now live on the mainnet 🚀
- The Project Charter page is now available here and open to comments
Added a basic ETE test to spin up a cardano-node cluster of 3 nodes; now adding the test to CI, probably in another parallel job as it will take a while to build...
Having trouble building on CI:
trace: WARNING: No sha256 found for source-repository-package https://github.com/input-output-hk/hydra-poc 60d1e3217a9f2ae557c5abf7c8c3c2001f7c2887 download may fail in restricted mode (hydra)
fatal: couldn't find remote ref refs/heads/60d1e3217a9f2ae557c5abf7c8c3c2001f7c2887
Probably should try adding the `sha256` field:
- the nar executable in https://github.com/input-output-hk/nix-archive/ will do it: `nar git-nar-sha --git-dir ../whatever --hash XXXXXX`
- Alternative command to compute the sha256 for nix: `nix-prefetch-git <repo> <git-hash>`
- When run, nix just tells you the missing sha256's value 🤦
Got a weird error when building: Cabal cannot find commit 60d1e3217a9f2ae557c5abf7c8c3c2001f7c2887 for hydra-poc, which is weird as it's definitely available on github. Removing the dependency and rerunning cabal again works 🤔 🤷
Created an archive for testnet and uploaded it to a google storage bucket
- Not sure how to handle authentication properly: I had to upload a service account key file to impersonate the hydra-poc-builder service account; I should have been able to do it using the credentials/auths attached to the VM, which is a resource
- A first version of the OpenAPI specification has been created
- An OpenAPI UI allowing us to interact with the routes is accessible and updated by the CI
- We will work on a basic REST server in Rust that implements this specification
- The `rust` folder containing the cryptographic library has been moved to `mithril-core`
- We don't need to copy the files somewhere when we create a snapshot (or when a signer needs to compute the digest of the snapshot)
- We will make some tests with an uncompressed archive (and/or slice the archive into smaller chunks) in order to see if this can make the restoration faster
- We will run a test to see what happens when we restore a snapshot with only the ledger state (last time it worked 1 out of 2 times)
- The following diagram shows a breakdown of the timing for the snapshots creation/restoration on the mainnet:
- We need to conduct an external audit of the crypto library (it should take ~3 months)
- The library will be published on crates.io once it is open sourced
- How to sign the genesis Mithril certificate:
  - we will use the Cardano genesis private key to sign it for the mainnet (manually)
  - we will use test genesis keys published on the repository for the testnet
- The repository is now getting ready to be open sourced. A few verifications/updates must be completed before going public #92
- We can start working on a fake aggregator server that follows the OpenAPI specification
- We have talked about the Project Charter page content (that will be completed soon)
- The CI should be optimized/reorganized in order to better reflect the new structure of the repository and maybe operate faster
Starting to build an ETE test infrastructure based on the work we did for the Hydra project
- Moving `mithril-test-lab` to the `mithril-monitor` package and adding a new `mithril-end-to-end` package to host the code needed for managing the ETE test infra. Compilation of the `zlib` dependency is failing due to missing `zlib-dev` deps -> adding it to `shell.nix`
- Toplevel `mithril-test-lab` is now a `cabal.project` containing 2 packages: one for the monitor and one for the end-to-end test that will depend on cardano nodes et al.
- Trying to get the dependencies right by reusing Hydra's nix files so that we can benefit from iohk's nix caching too, which implies using `haskell.nix` 🤷
- Tried to remove the use of hydra-cardano-api but getting the types right became way too complicated, so I just added hydra-poc as a dependency for the moment
- Finished tests to create a snapshot from a mainnet Cardano Node and restore it. Currently running the same tests on testnet:

Mainnet

| Data | Node | Full | Archive | Snapshot | Upload | Download | Restore | Startup |
|---|---|---|---|---|---|---|---|---|
| Immutable Only | standard | 41GB | 25GB | ~47m | ~45m | ~25m | ~18m | ~420m |
| With Ledger State | modified | 43GB | 26GB | ~47m | ~45m | ~25m | ~18m | ~65m |

Host: x86 / +2 cores / +8GB RAM / +100GB HD. Network: Download 150Mbps / Upload 75Mbps. Compression: gzip
- A simple cli has been developed in order to conduct the tests
- Here is the information needed to create a working snapshot:
  - the whole `immutable` folder of the database (required)
  - the `protocolMagicId` file (required)
  - the latest ledger state snapshot file in the `ledger` folder (optional)
- Question: Do we need to stop the node when the snapshot is taken? (the test cli currently makes a copy of the files to snapshot in a separate folder)
- In order to create a deterministic digest, it appears that we will need to rely on the binary content of the snapshotted files (digest calculation from the snapshot archive file is not enough); see the sketch below
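A hedged sketch of such a content-based digest: hash each file's bytes in sorted path order so every signer derives the same value (the `sha2` and `hex` crates and the function name are assumptions):

```rust
use sha2::{Digest, Sha256};
use std::{fs, io, path::Path};

fn digest_immutables(dir: &Path) -> io::Result<String> {
    // Sort paths so the digest does not depend on directory read order.
    let mut paths: Vec<_> = fs::read_dir(dir)?
        .collect::<Result<Vec<_>, _>>()?
        .into_iter()
        .map(|entry| entry.path())
        .collect();
    paths.sort();
    let mut hasher = Sha256::new();
    for path in paths.iter().filter(|p| p.is_file()) {
        // Feed the raw binary content of each immutable file to the hasher.
        let mut file = fs::File::open(path)?;
        io::copy(&mut file, &mut hasher)?;
    }
    Ok(hex::encode(hasher.finalize()))
}
```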
- The `rust-node` folder has been moved to `mithril-prototype/test-node`
- The `monitor` folder has been renamed `mithril-test-lab`
Worked on CEO update presentation: https://drive.google.com/file/d/19_Lrr5sYAhVatxdiMws6OGwvsVRn8p0Q/view?usp=sharing
Mainnet machine is still synchronizing the network, currently at slot 53M while tip's slot is 57M, hopefully should be done by noon.
- Mainnet node crashed when it ran out of disk space...
Current disk space usage:
3.2M  cardano-configurations
4.0K  configure-mainnet.sh
4.0K  docker-compose-mainnet.yaml
4.0K  install-nix-2.3.10
4.0K  install-nix-2.3.10.asc
4.0K  ipc
4.0K  ledger.gpg
30G   mainnet
8.4G  mainnet.bak
4.8G  mainnet.tar.bz2
- Compression is about 60% with bzip2, which is on par with pkzip, and the mainnet directory is now at 30G with nearly completed sync up. Going to remove the backup to regain some space
cardano-node_1 | [d77879a4:cardano.node.ChainDB:Error:5] [2022-03-29 08:21:59.82 UTC] Invalid snapshot DiskSnapshot {dsNumber = 54107040, dsSuffix = Nothing}InitFailureRead (ReadFailed (DeserialiseFailure 400482271 "end of input"))
This error is annoying: it means the snapshot needs to be reconstructed :( The error does not say which file failed to be deserialised though
It's reconstructing the ledger, but from the latest snapshot file:
cardano-node_1 | [d77879a4:cardano.node.ChainDB:Info:5] [2022-03-29 08:25:44.52 UTC] Replaying ledger from snapshot at 2e90104a43cd8ecfbd8d16f03ce17ac3e46ffdff0546f93079e4b3a9e298f8ed at slot 53558101
cardano-node_1 | [d77879a4:cardano.node.ChainDB:Info:5] [2022-03-29 08:25:44.72 UTC] Replayed block: slot 53558151 out of 54107040. Progress: 0.01%
cardano-node_1 | [d77879a4:cardano.node.ChainDB:Info:5] [2022-03-29 08:26:15.85 UTC] Replayed block: slot 53567929 out of 54107040. Progress: 1.79%
cardano-node_1 | [d77879a4:cardano.node.ChainDB:Info:5] [2022-03-29 08:27:27.14 UTC] Replayed block: slot 53589561 out of 54107040. Progress: 5.73%
Seems like this should happen relatively fast as the gap is not huge?
Interestingly, there's no concurrency in the ledger replay logic: everything happens sequentially, hence CPU is at 100%, i.e. one core is occupied
So the node starts up by:
- checking the immutable DB and extracting the last known blocks from there
- opening the volatile DB at the point where the immutable DB is
- opening the ledger DB and restoring it from a snapshot
- in case there's a Δ between the ledger state and the immutable DB, replaying the blocks to update the ledger's state -> that's the part that takes a while (about 29 minutes for 548889 slots)
- updating the ledger DB with "new" blocks
  - when the node crashed it was at slot 54162434, so the ledger is also updated from the volatile DB's content past the immutable one
- then it connects to relays and starts pulling new blocks and extending its state
The current rate of slot validation is 95 slots/second; would it be interesting to provide that information in the logs? => That could be extracted from parsing the logs obviously, or as a dashboard
- Seems I will still need about 7 hours to catch up with tip, which is not exactly true because tip keeps growing 😬
Closer to the tip here are the current disk size:
~$ du -sh mainnet/node.db/*
33G mainnet/node.db/immutable
2.8G mainnet/node.db/ledger
0 mainnet/node.db/lock
4.0K mainnet/node.db/protocolMagicId
175M mainnet/node.db/volatile
Options:
- Work on the "restore" side using the testnet archive
- Work on the consensus code to be able to generate snapshots reliably at some block X
  - how about epoch boundaries? Can we know when the epoch changes in the consensus DB?
- Modify cardano-node to emit snapshots at fixed intervals
- Run cardano-node on a private testnet
Acceptance test:
- run a cardano-node on a private testnet, or a cluster of cardano-nodes using `withCardanoCluster` from Hydra
  - the cardano-node is forked and depends on a forked ouroboros-consensus containing the snapshot policy
  - we need to add some CLI arguments there: https://github.com/input-output-hk/cardano-node/blob/master/cardano-node/src/Cardano/Node/Parsers.hs#L300
- we feed the private testnet a bunch of transactions
  - generate n arbitrary transactions in a sequence, possibly not depending on each other
- we should end up with identical snapshots on every node of the cluster
- What properties can we express and test?
- We could start w/ a high-level property stating that: given signatures of a snapshot, aggregation produces a valid certificate of the expected snapshot
  - Once an aggregator receives valid signatures, it should ultimately produce a valid certificate
Decentralised vision:
- several aggregators
- signers produce signatures in a decentralised manner
Morning session:
- We updated the Story Mapping (added details for the 1st, 2nd and 6th increments)
- First investigation of the Cardano node database snapshots structure: snapshots are located at `mainnet/node.db/ledger/{block_no}`. Need to review the code to be able to produce snapshots at epoch boundaries or at regular block numbers (this afternoon)
- We need to get access to a project on a public cloud (GCP preferred, or AWS). Arnaud asks Charles for credentials
  - We need 2 separate credentials on the public repository: Read/Write for producing snapshots and Read for restoring them
  - Later, we will also host a Mithril Aggregator on the project (compute resources will be needed). Containers could be orchestrated by Nomad/Hashicorp (preferred) or Kubernetes. We will also need DevOps resources at this time
- In order to get a clear project organization, a specific "Project Charter" page will be created on the Wiki. It will list all the resources (rituals, links, ...)
- In order to get the genesis state verification (first certificate in the Mithril chain), we will need to get it signed by someone who has access to the private key (the public key is packaged with the Cardano Node)
- First step in repository reorganization has been completed: the `go-node` has been moved to `mithril-proto/mithril-node-poc`
- Target repository structure is:
  - .github
  - demo
    - protocol-demo
  - mithril-core < rust (TBD by JP)
  - mithril-network (to be created during MVP)
    - mithril-aggregator
    - mithril-client
    - mithril-signer
  - mithril-proto
    - mithril-poc-node
    - test-node < rust-node (TBD by James)
  - mithril-test-lab < monitor (TBD by James)
Afternoon pairing session, dedicated to investigating the production of Mithril compatible snapshots from a modified Cardano Node:
- LedgerDB Disk Policy
- Committed Code: Start hacking on disk snapshot
- When we produce the digest of the snapshot, we need to make sure that the files have deterministic names, or rely on the binary content
- The minimum information required to bootstrap a Cardano Node is the `immutable` folder
  - We have tried to bootstrap with the ledger snapshot only (no immutable data) and it did not work. The node started from the first block
  - A faster bootstrap is possible with the ledger snapshot in the `ledger` folder (if not, the node needs to recompute the ledger state first)
  - A rapid test showed that it took ~20 minutes to recompute ~12,000,000 slots
  - The blocks in the `volatile` folder are not yet committed to the immutable state (commit occurs after 2,160 blocks, the security parameter)
- Currently, the snapshot policy of the node is to take snapshots at regular time intervals (taking `block_number < chain_tip - security_parameter`)
- The new snapshot policy selected is to take snapshots at regular block intervals (e.g. every 10,000 blocks): `(tip - security_parameter) % block_interval == 0` (see the sketch below)
- We will need some support from the consensus team to validate the new snapshot policy
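For illustration only (the real policy lives in the Haskell consensus code), the proposed predicate boils down to something like:

```rust
/// Snapshot whenever the stable tip (tip minus the security parameter k)
/// lands on a block-interval boundary, e.g. every 10,000 blocks.
fn should_snapshot(tip: u64, security_parameter: u64, block_interval: u64) -> bool {
    tip >= security_parameter && (tip - security_parameter) % block_interval == 0
}

fn main() {
    assert!(should_snapshot(12_160, 2_160, 10_000)); // (12160 - 2160) % 10000 == 0
    assert!(!should_snapshot(12_161, 2_160, 10_000));
}
```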
The in-memory LedgerDB works like a sliding window of k ledger states, so that it can be rolled back at most k blocks in the past.
- The snapshot written is the oldest (aka anchor) state of a given in-memory ledger state
- Snapshotting on an epoch boundary might prove difficult because it's not an easy piece of information to compute in the snapshot policy
- We might want to produce snapshots at some fixed block number
- Talked about the new API for monitors, which allows expressing properties in a more succinct and legible way
  - the `everywhere` combinator is reminiscent of the cooked-validators approach, but that's perhaps only by analogy?
  - All in all, everything that improves the expressiveness of properties goes in the right direction
- We want to define a first property, or a handful of interesting properties, and test them
- We need to make sure the formalism makes sense even in the context of a drastically simplified network layer whereby mithril signers would communicate with a single "certifier" or through on-chain transaction posting
  - at an abstract level, the information a node sends and receives from the certifier or the chain could be modelled as a broadcast channel, because we expect all nodes to see the same information. Then we can model and test interesting behaviour where signers have different world views
Discussions/thoughts about the MVP and beyond:
- In order to simplify the registration process, each signer must (re)register its keys at epoch n-1 in order to be able to sign at epoch n
- The aggregator server could later keep track of the operations taking place and thus be able to produce reports & statistics
- How can we verify the genesis state? The genesis state (first certificate in the chain) must be signed by a genesis key. It will be verified with the associated public key delivered with the Cardano node (TBV)
- Signers could store their single signatures on-chain
- The aggregator could store the protocol parameters on-chain
- Certificates (or hashes of them) could be stored on-chain
- Verification could use the on-chain storage (after restoring it on a node) to proceed to full verification (the whole previous certificates chain)
- The drawback of using on-chain transactions is that it has a cost. We could maybe think of a "cashback" mechanism where the aggregator could refund (+ reward) these transactions later by using the treasury
- Question: should we hard-code the protocol parameters or allow them to be modified, and thus include them in the certificate?
- Question: do we need to prepare different snapshots for each targeted OS?
We talked about the project organization with Roy:
- We need to work on an estimated roadmap for the Mithril project:
  - This will help check if we are on track with the goals
  - For now let's focus on the MVP, starting from the end goal: allow people who use Cardano to bootstrap a node fast and securely
  - The MVP is expected at +6 months on the real testnet (+3 months on the development testnet)
  - Let's focus on priorities instead of deadlines
  - We can work as for Hydra with a matrix of Features vs Priorities (must have, should have, could have, nice to have)
  - Short term goal: have something working for people to start using (even though it is not fully featured yet)
  - We will focus on the layout of the plan during the next session on Thursday
- Agile organization:
  - Let's work with agile sprints and ceremonies (duration: 2 weeks)
  - Ceremonies: 30 min for the demo (together with Galois for the Test Lab), 1h for the backlog grooming and sprint planning (on Thursday)
  - First sprint would start on Thursday, March 24
  - First demo would be on Thursday, April 07
- Target features for the MVP would be:
  - Create a snapshot from the Cardano node
  - Create a certificate for a snapshot
  - Store a certificate and a snapshot (centralized on a server)
  - Verify a certificate and download a snapshot
  - Bootstrap a Cardano node from a downloaded & verified snapshot
Questions:
- What is the thing we want to certify? => it's the thing a node would need to bootstrap
  - a point on the chain + the corresponding ledger state at some epoch boundary
  - node.db does contain stuff we don't need -> the volatile db
  - it contains the blocks up to some point + the ledger state
- Right now the node assumes it has all blocks => it can't run without previous blocks
  - In principle, it could "forget" the past of the chain
- Need to select parts of the node.db:
  - The node can already take a snapshot whenever we like -> the current policy is a snapshot every 10K blocks, could be adapted
  - It takes the in-memory state in a non-blocking way and serialises it to disk: this contains only the ledger state, in the `ledger-state` directory in the node.db
  - the ledger state is an internal format but the block format is not: the block format is guaranteed to be portable
- SPOs could post their signatures on the chain -> follow the chain to produce the
TODO:
- look at the consensus code (LedgerDB) for the snapshot policy; need to understand what data structures are stored on disk
- get details on what exactly the node needs in order to start up faster
- use a Mithril certificate to sign just the segment from the previous certificate -> could download/verify segments in parallel
- Sign point = (slot nr, hash) -> enough to ensure we have the valid chain
- include a list of hashes for all epoch files?
- Index files can be reconstructed from the epoch files -> needed for fast bootstrap
- tradeoff between speed and robustness
- sign db immutable files independently from the ledger state
Q: epoch files?
- Praos => large chunks (X nr of slots) per file
- Need to check alignment on the epoch boundary -> https://hydra.iohk.io/build/13272534/download/1/report.pdf
  - We could use a 10K or whatever limit as long as it includes the epoch boundary -> it's ok to have more blocks; worst case = epoch boundary + 10k?
First session with JP, some Q&A:
- What's the role of the TestLab?
- What about the repository's organisation? It feels a bit messy right now, with lots of inconsistencies here and there
- JP should feel free to reorganise the code to be cleaner
Work from past week:
- Understanding how it works by looking at the go node; it's interesting to see a first stab at the node, even though it's definitely not suited for production and there are quite a few problems
- Developed a simple CLI utility in rust-demo to be an easy-to-use executable for mithril primitives
  - currently, it can "run" the mithril protocol with several parties, replicating what's done in the integration tests but more configurable
  - Could be the basis for a more elaborate tool to be used by SPOs without needing a full blown network
Big question: What's a "signable state" from a cardano-node?
- The node's DB might not be identical between nodes even at the same block no.
- How do we ensure all potential signers sign the same "thing"?
- We need to add the stake distribution to the signed "thing", as this is part of the validity of the certificate?
How would bootstrapping work:
- Sign a certificate with genesis keys at epoch `X` => then parties can produce signatures and certificates starting from epoch `X + 1`
- Issuing a new genesis certificate could be a "version boundary". A new genesis certificate is issued with metadata for versions, which makes it clearer when the protocol changes and nodes get upgraded...
What if nodes come and go?
- Mithril nodes need Stake + Keys to produce signatures => nodes that come within an epoch boundary need to wait for the next epoch
- At each epoch boundary the parameters are recomputed and redistributed to all participants
- Had a first demo integrating the monitor framework with a primitive rust-node: The test lab spins up a number of nodes (actual processes), connects them and intercepts all messages between them, and checks basic properties.
- Tried verifying a simple fault injection during registration process
- We noted it's perfectly fine to "extend" the node's API in order to make testing easier, esp. as there isn't any real node right now, as long as we take care of not breaking the "black box" abstraction: the test lab should talk to the node through an API that regular clients could make sense of
- We decided to postpone next meeting till 2022-03-28 in order for everyone to have time to regroup, and think about what the next steps would look like
- Intros
- Explanation of Mithril
- details about the certificate validation -> validation chain
- We should be able to run some program demonstrating the protocol at work -> a Rust CLI program?
- local registration -> register all stake owners, all possible lottery winners
- signatures need to be broadcast or sent to aggregator -> could be anyone
- Q.: Certificates can be different?
- Test Lab
- Q&A
- Incentives? Paying for snapshots is only worthwhile for a large amount of sync
- Plan next days/weeks
- Codebase
- Goal: open source the repo by end of March
- Goal: write a CLI simulating Mithril
- Have a recurring meeting -> 2-hour block, 3 times a day
- Get an @iohk.io address -> Roy Nakakawa
- Talking to Exchanges about Mithril
  - Vitor talking to them about Scientia
  - Discussing problems about node bootstrap (running db-sync, lots of issues), exploring solutions
  - Hard to find "friendly" exchanges
- How about wallet providers?
- Trying to talk to ops to be able to deploy a Mithril node on testnet/mainnet?
- DApp developers could make use of a snapshot-enabled wallet
- We could have a progressive strategy to increase the percentage of trust
  - Maybe be cautious for mainnet?
- Start product development from the consumption side: how would people use the snapshots/certificates? Generate the certificates "manually"
  - We don't need a full fledged mithril node and network to be able to produce signatures and certificates
- Certificates are signed using specific keys
- We need to link the stake keys with the mithril keys
- Cold keys -> sign -> KES keys
  - Cold keys validate a new KES key every 2^6 epochs
  - KES blocks are put on-chain
Should not be too hard to certify Mithril signing keys: https://docs.cardano.org/core-concepts/cardano-keys