
Enable logging in tests and fix error log flooding from mock chain runtime #1017

Merged: 1 commit merged into master from soares/enable-test-logging on Jun 3, 2021

Conversation

@soareschen (Contributor) commented May 31, 2021:

Closes: #1018

Description

This PR enables logging in tests. As first opened, it caused a large number of error logs to be printed when running cargo test on foreign_client::tests; the PR was then extended to also fix that flooding (see the title change and discussion below).


For contributor use:

  • Updated the Unreleased section of CHANGELOG.md with the issue.
  • If applicable: Unit tests written, added test to CI.
  • Linked to GitHub issue with discussion and accepted design OR link to spec that describes this work.
  • Updated relevant documentation (docs/) and code comments.
  • Re-reviewed Files changed in the GitHub PR explorer.

@soareschen changed the title from "Enable logging in foreign_client tests" to "Enable logging in tests and fix error log flooding from mock chain runtime" on May 31, 2021
@soareschen (Contributor, Author) commented:

The error log flooding has now been fixed. There were two causes:

  • The MockChain returns an EventReceiver whose sender end is dropped immediately.
  • The ChainRuntime's run() method loops, continuously polling the receivers. Because the sender end is dropped, the receiver end errors out immediately; however, the chain runtime merely prints the error message and continues looping. So the number of error messages printed grows with how long the test runs (see the sketch after this list).
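
A minimal sketch of that failure pattern, assuming crossbeam-channel (which the relayer uses for these channels); the names here are illustrative stand-ins, not the actual MockChain/ChainRuntime code:

```rust
use crossbeam_channel::unbounded;

fn main() {
    let (event_tx, event_receiver) = unbounded::<String>();
    drop(event_tx); // the sender end is dropped immediately, as in MockChain

    // Polling loop analogous to ChainRuntime::run(): recv() on a
    // disconnected channel fails immediately instead of blocking, so this
    // prints one error per iteration for as long as the loop keeps running.
    for _ in 0..5 {
        match event_receiver.recv() {
            Ok(event) => println!("event: {}", event),
            Err(e) => eprintln!("received error via event bus: {}", e),
        }
    }
}
```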

The fix is as follows:

  • Add the event sender end to the MockChain struct, so that it never gets dropped.
  • Make ChainRuntime return from run() if event_receiver is disconnected. This can potentially break the runtime, as subsequent messages on request_receiver are no longer handled. However, the match arm for request_receiver also returns from the loop when any error happens, so the loop is already not fault-tolerant. The error should instead be propagated to the caller through the JoinHandle, with higher-level logic handling the failure, e.g. by respawning the runtime. A sketch of both fixes follows this list.
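
A sketch of both fixes under the same assumptions (illustrative names, with String standing in for the real event type; not the exact ibc-relayer code):

```rust
use crossbeam_channel::{unbounded, Receiver, Sender};

struct MockChain {
    // Fix 1: keep the sender end inside the struct so it lives as long as
    // the chain itself, keeping the receiver handed out connected.
    event_sender: Sender<String>,
    event_receiver: Receiver<String>,
}

impl MockChain {
    fn new() -> Self {
        let (event_sender, event_receiver) = unbounded();
        Self { event_sender, event_receiver }
    }
}

// Fix 2: treat a disconnected event receiver as fatal and return from the
// loop instead of logging and spinning.
fn run(event_receiver: Receiver<String>) -> Result<(), String> {
    loop {
        match event_receiver.recv() {
            Ok(event) => println!("event: {}", event),
            Err(e) => {
                eprintln!("received error via event bus: {}", e);
                return Err("event channel disconnected".to_string());
            }
        }
    }
}

fn main() {
    let chain = MockChain::new();
    let rx = chain.event_receiver.clone();
    // While `chain` (and its event_sender) is alive, recv() blocks rather
    // than erroring; dropping the chain disconnects rx and run() returns.
    drop(chain);
    assert!(run(rx).is_err());
}
```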

@soareschen soareschen marked this pull request as ready for review May 31, 2021 20:06
```diff
@@ -209,6 +209,7 @@ impl From<IdentifiedAnyClientState> for IdentifiedClientState {
 #[cfg(test)]
 mod tests {
     use std::convert::TryFrom;
+    use test_env_log::test;
```
Member:

I'm not sure we should keep this around. Do you think it would be valuable to have logging enabled in tests going forward?

@soareschen (Contributor, Author):

Yes, I think they help, especially when tests are failing. Not having logging enabled would only make it more inconvenient for contributors to write new tests and fix failing ones. The logging setup in Rust is not very straightforward, and just figuring out how to enable it can waste minutes to hours of a contributor's time. (A sketch of how the import works is below.)
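
For context, a sketch of what the `use test_env_log::test;` line in the diff above buys us, assuming the crate's default env_logger integration: its `test` attribute shadows the built-in `#[test]` and initializes the logger before each test body runs, so no per-test setup code is needed.

```rust
use test_env_log::test;

#[test]
fn logging_is_initialized() {
    // With the logger initialized per test, output like this is shown on
    // failure (or always, with `cargo test -- --nocapture`); verbosity is
    // controlled through the standard RUST_LOG environment variable.
    log::debug!("running with logging enabled");
    assert_eq!(1 + 1, 2);
}
```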

```diff
-            Err(e) => error!("received error via event bus: {}", e),
+            Err(e) => {
+                error!("received error via event bus: {}", e);
+                return Err(Kind::Channel.into());
+            }
```
Member:

Hmm, I think it is by design that this match arm does not return Err(..) here but instead permits the loop to continue. For context, please see #895 (and follow-up #903). The idea is that the chain runtime would continue polling via select! here even if the monitor's connection to the full node is down (or some other error appears in event_receiver), and it is the monitor's job to retry and eventually heal once the connection to the node is re-established.

It would be good if you can verify that your changes do not introduce a regression. Could you please just double-check that? The recipe for testing is in the description of #903 (see "Tested with").
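
To make the intended behavior concrete, here is a sketch of that design with hypothetical channels (not the actual runtime code): the select! loop keeps serving requests even while the event arm errors, leaving recovery to the event monitor.

```rust
use crossbeam_channel::{select, unbounded};

fn main() {
    let (event_tx, event_rx) = unbounded::<String>();
    let (req_tx, request_rx) = unbounded::<String>();

    event_tx.send("NewBlock".into()).unwrap();
    req_tx.send("query_client_state".into()).unwrap();
    drop(req_tx); // simulate the request side shutting down

    loop {
        select! {
            recv(event_rx) -> ev => match ev {
                Ok(ev) => println!("event: {}", ev),
                // By design: log and keep looping; the event monitor is
                // expected to reconnect and resume delivering events.
                Err(e) => eprintln!("received error via event bus: {}", e),
            },
            recv(request_rx) -> req => match req {
                Ok(req) => println!("request: {}", req),
                // The request arm, by contrast, ends the loop on error
                // (the real code propagates it with `?`).
                Err(_) => break,
            },
        }
    }
}
```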

Member:

Also, I was trying earlier to test this but was not sure how to exercise the Err path here. Any tips?

@soareschen (Contributor, Author):

I checked the source code of crossbeam, and the only possible error is when the sender end of the channel is dropped, so it is unrecoverable. We have probably never encountered this in production, because a disconnected channel would also flood the relayer log with error messages.

Also, if you check the other branches, each of them is called with ? at the end, which means any error happening in those places will break the loop and return the error. We should open another issue to address this by adding some way to recover from errors inside the loop. (The snippet below demonstrates crossbeam's error semantics.)
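
A small demonstration of that claim about crossbeam's semantics, as a self-contained snippet: RecvError arises only once all senders are dropped, and it is permanent.

```rust
use crossbeam_channel::{unbounded, RecvError};

fn main() {
    let (tx, rx) = unbounded::<u32>();
    tx.send(1).unwrap();
    drop(tx); // disconnecting the sender is the only way to make recv() fail

    assert_eq!(rx.recv(), Ok(1));          // buffered messages still drain
    assert_eq!(rx.recv(), Err(RecvError)); // then: disconnected, permanently
    assert_eq!(rx.recv(), Err(RecvError)); // every later call fails the same way
}
```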

@adizere (Member) commented Jun 1, 2021:

Thank you for the findings & explanation!

I agree with you that the loop is not fault-tolerant anyway. I originally thought that returning in the Err path here could make the runtime (and the Hermes process in general) exit early (instead of waiting on the monitor to retry). But it seems there is a regression in master (#1026) and that has nothing to do with the changes in this branch.

> We should have another issue to address this by having some way to recover from any error inside the loop.

I agree. If you know how to capture this in an issue, could you please handle that? We haven't noticed these problems in practice (yet), but it would be good to track them and add a general method for handling recovery.

@soareschen (Contributor, Author):

Sure, filed #1025 for that.

@soareschen soareschen merged commit 2647051 into master Jun 3, 2021
@soareschen soareschen deleted the soares/enable-test-logging branch June 3, 2021 11:44
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
Successfully merging this pull request may close these issues.

Error logs are shown in tests when logging is enabled