Asynchronous tx confirmations #1278
Conversation
Cherry-picked from commit b63335b
Author: Adi Seredinschi <adi@informal.systems>
Date:   Sat Jul 17 10:36:38 2021 +0200

    From IbcEvent to IbcEventWithHash
Wow, a lot of changes here. I am still not sure if this is the best approach. The original issue is only fixed for packet workers. There might still be other workers that can block them with single transactions using the same runtimes.
The code is overly complex, touching a lot of files, and I am not sure it is necessary. My thought for an initial solution, which I shared with @adizere, was to move the loop that confirms tx hashes from `cosmos.rs` into `chain/handle/prod.rs`. All workers would make use of it.

It is true that a worker would need to wait for a set of messages to be included in a block before being able to submit new messages, but I don't think this would be an issue.
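For illustration only, a blocking confirmation loop of roughly this shape could live in the shared handle so every worker reuses it. This is a minimal sketch with made-up names (`TxHash`, `TimeoutError`, the `is_committed` callback standing in for the actual chain query), not the real Hermes API:

```rust
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-ins; not the actual Hermes types.
#[derive(Debug, Clone)]
struct TxHash(String);

#[derive(Debug)]
struct TimeoutError(TxHash);

/// Block until every submitted tx hash is confirmed, polling with a fixed
/// delay. `is_committed` stands in for a chain query (e.g. a tx_search RPC).
fn wait_for_confirmations<F>(
    hashes: &[TxHash],
    is_committed: F,
    timeout: Duration,
) -> Result<(), TimeoutError>
where
    F: Fn(&TxHash) -> bool,
{
    let start = Instant::now();
    for hash in hashes {
        while !is_committed(hash) {
            if start.elapsed() > timeout {
                return Err(TimeoutError(hash.clone()));
            }
            thread::sleep(Duration::from_millis(500));
        }
    }
    Ok(())
}

fn main() {
    let hashes = vec![TxHash("ABC123".into()), TxHash("DEF456".into())];
    // Trivial query stub: pretend everything is already committed.
    let result = wait_for_confirmations(&hashes, |_| true, Duration::from_secs(30));
    println!("confirmation result: {:?}", result);
}
```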
I am also concerned about ordered channels, where we may end up with packets not being sent in sequence; the simpler solution would avoid that.

Then, regardless of the solution, we need a setup to reproduce #1265. Document the testing steps in the PR, and then show that this branch fixes the issue.

One last thought is about debuggability: let's make sure we can make sense of the logs.
```rust
// Before:
fn filter_events(&self, events: &[IbcEvent]) -> Vec<IbcEvent> {
    let mut result = vec![];

// After:
fn filter_relaying_events(&self, events: Vec<IbcEvent>) -> Vec<IbcEvent> {
```
Curious why the name change here?
Not sure about that. I can revert if necessary.
There is no more `&mut self` reference in `RelayPath`, so there is no way `self.channel` will be updated to contain the channel ID later on.
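The invariant can be captured in the constructor. A minimal sketch, assuming simplified stand-in types rather than the real `RelayPath`/`ChannelEnd` definitions:

```rust
// Simplified illustration of the invariant, not the actual ibc-relayer types.
#[allow(dead_code)]
struct ChannelEnd {
    channel_id: Option<String>,
}

#[allow(dead_code)]
struct RelayPath {
    channel: ChannelEnd,
}

impl RelayPath {
    // With no `&mut self` methods on RelayPath, a missing channel ID can
    // never be filled in later, so it must be rejected at construction time.
    fn new(channel: ChannelEnd) -> Result<Self, String> {
        if channel.channel_id.is_none() {
            return Err("channel end has no channel ID".to_string());
        }
        Ok(RelayPath { channel })
    }
}

fn main() {
    let ok = RelayPath::new(ChannelEnd { channel_id: Some("channel-0".to_string()) });
    assert!(ok.is_ok());

    let err = RelayPath::new(ChannelEnd { channel_id: None });
    assert!(err.is_err());
}
```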
I agree that there is a lot of unnecessary complexity in the code, but I think the complexity ultimately arises from choosing the wrong concurrency abstraction, which makes it challenging to scale the relayer. If we were to use async tasks here, the whole process could be much simpler, with one async task spawned for each message sent or packet relayed. There is little reason to have a single-threaded worker for each channel, as compared to many async tasks running concurrently, each focused on a single operation. Right now this PR looks like a patchwork attempt to manually multiplex multiple tasks on a single-threaded worker, so it is quite awkward that we need a queue that is processed slowly, one item at a time, in a loop doing multiple unrelated things. The other reason I think the original code in this PR used multiple indirections is the use of …

Since the PR is high priority, I suggest that we stick with the current design and think about a better design at a later time. Ideally, in the weekly meeting I would like to resurface the discussion on migrating to an async task runtime to improve the scalability of the relayer. I will finish testing this PR by tomorrow so that we can release on time by Friday.
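To illustrate the alternative concurrency model (a sketch only, using tokio with made-up `submit_batch`/`wait_for_confirmation` stubs, not concrete Hermes APIs): each batch gets its own task, so one slow confirmation never blocks the others:

```rust
use std::time::Duration;
use tokio::time::sleep;

// Illustrative stand-ins for message submission and confirmation.
async fn submit_batch(batch_id: u32) -> u32 {
    // Would broadcast the tx and return its hash; here the id doubles as one.
    batch_id
}

async fn wait_for_confirmation(tx_hash: u32) {
    // Would poll the chain; here we just simulate a variable delay.
    sleep(Duration::from_millis(100 * u64::from(tx_hash))).await;
}

#[tokio::main]
async fn main() {
    // One task per batch: confirmations proceed concurrently instead of
    // serializing the worker behind the slowest transaction.
    let handles: Vec<_> = (1..=4)
        .map(|batch_id| {
            tokio::spawn(async move {
                let hash = submit_batch(batch_id).await;
                wait_for_confirmation(hash).await;
                println!("batch {batch_id} confirmed");
            })
        })
        .collect();

    for handle in handles {
        handle.await.expect("task panicked");
    }
}
```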
I wouldn't say that the concurrency model we chose back then was wrong per se, but rather that we've outgrown it. This has little to do with async/await, as we could potentially spawn thread-based "async" tasks for the same purpose. But I agree that async/await would provide a better and more natural way of dealing with async tasks in the process of scaling the relayer. Happy to discuss this at the next IBC meeting, as you suggest :)
@ancazamfir I am not too sure how exactly to reproduce #1265, but here is what I tried. I assume that the key issue to observe is the long time gap between the logs, which indicates that the relayer paused for a long time to wait for the confirmation before continuing. I have the following reproduction showing that the relayer continues producing logs and performing operations while waiting for the unconfirmed messages.

To create a bottleneck in the test chain environment, I have to edit the …

It is actually pretty tricky to generate more than one unconfirmed message in a test scenario. I have to submit multiple raw ibc-transfers simultaneously multiple times to be able to saturate the queue with more than 2 unconfirmed/pending transactions. This is done by creating multiple wallets and running the following commands a few times:

```
$ hermes tx raw ft-transfer ibc-1 ibc-0 transfer channel-0 9999 -o 1000 -n 1 -k user2 &
$ hermes tx raw ft-transfer ibc-1 ibc-0 transfer channel-0 9999 -o 1000 -n 1 -k user3 &
$ hermes tx raw ft-transfer ibc-1 ibc-0 transfer channel-0 9999 -o 1000 -n 1 -k user4 &
$ hermes tx raw ft-transfer ibc-1 ibc-0 transfer channel-0 9999 -o 1000 -n 1 -k user5 &
```

After the pending transaction queue is filled, we should be able to see something like the following in the log:
I think the key observation is in the timestamps. @romac, it would be great if you could help confirm that what I am testing makes sense.
I think you might be able to create a bottleneck more easily by setting …
Good to know! Unfortunately it does not work in this case, as the pending queue is batched by …
Regarding the testing, one thought: with many tries you should catch a scenario where the …
To clarify, I think that within the current architecture, we could have designed this to be much simpler, and all the workers would have benefited from that. As it is, the packet worker has a lot of changes (could have been 0) and all the other workers use the old sync approach.
I think we can merge this as is and think of improvements later.
Here is the way I verify that the PR clears the congestion:
We should be able to find something like the following in the logs:
Note that each call to `send_msgs` is timed via `crate::time!`. In the `send_msgs` implementation:

```rust
fn send_msgs(&mut self, proto_msgs: Vec<Any>) -> Result<Vec<IbcEvent>, Error> {
    crate::time!("send_msgs");
    debug!("send_msgs with {} messages", proto_msgs.len());
    ...
}
```

We should see the logs showing something like:
Notice that in the original version, it takes roughly 20 seconds between each call to `send_msgs`.
* Bump all 0.6.2 crates to 0.7.0
* Bump `ibc-proto` to v0.10.0
* Bump version in guide
* Add `ibc-relayer-rest` crate to README table
* fixup version
* Update Cargo.lock
* Update dependencies
* Update supported Cosmos SDK version range, cf. #1266
* Improve structure of JSON sent back by the REST server
* Remove prefix from info message in start command
* Document the REST server API
* Include error source in `ConnectionError::Relayer`
* Add .changelog entry for #1278
* Fix ibc-relayer-rest integration tests
* Release changelog for 0.7.0
* Reformat the changelog
* Added non-blocking interface to submit msgs (stub)
* Basic impl for submit_msgs
* Minor improvements. Cherry-picked from commit b63335b (Author: Adi Seredinschi <adi@informal.systems>, Date: Sat Jul 17 10:36:38 2021 +0200): From IbcEvent to IbcEventWithHash
* Avoid cloning in assemble_msgs
* relay_from_operational_data is now generic
* TxHash wrapper
* unconfirmed module and mediator stub
* Implemented unconfirmed::Mediator corner-cases
* Moved from TxHash to TxHashes for better output
* More comments & ideas
* Added minimum backoff
* Fixed ordering bug
* Undo var renaming for easier review
* Fix type errors
* WIP refactoring
* Refactor mediator code
* Add some comments
* Refactor relay_path methods to not require &mut self
* Use CPS to retry submitting unconfirmed transactions
* Fix clippy
* Check that channel has valid channel ID in RelayPath::new(). There is no more &mut self reference in RelayPath, so there is no way self.channel will be updated to contain channel ID later on
* Display more information in logs
* Rename send_msgs and submit_msgs with send_messages_and_wait_commit/check_tx
* Remove min backoff parameter
* Fix E2E test
* Handle error response in pending tx separately
* Log RelaySummary in PacketWorker
* Revert change to backoff duration in PacketWorker
* Minor cleanup
* Add logging message for when send_messages_* methods are called

Co-authored-by: Soares Chen <soares.chen@gmail.com>
Co-authored-by: Soares Chen <soares.chen@maybevoid.com>
Closes: #1124
Closes: #1265

Description

For contributor use:

* Added a changelog entry via `unclog`
* Updated relevant documentation (`docs/`) and code comments
* Re-reviewed `Files changed` in the Github PR explorer