
Critical dependency logging #4988

Merged
merged 36 commits on Jan 16, 2024

Conversation

@eserilev (Collaborator) commented Dec 6, 2023

Issue Addressed

#4954

Proposed Changes

This PR proposes adding logs for the following dependencies:

  • libp2p (note: there is discussion below on which libp2p dependencies we want to log)
  • discv5

Each dependency has its own separate log file. Using the tracing-appender library, a non-blocking file stream is created for each dependency. On each tracing event, a timestamp, log level and log message are written to the respective log file. The files are rotated daily.
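For illustration, a minimal sketch of that setup using tracing-appender's rolling and non_blocking helpers (the base directory and file names here are assumptions, not the PR's exact paths):

use std::path::Path;

use tracing_appender::non_blocking::{NonBlocking, WorkerGuard};
use tracing_appender::rolling;

// Build a daily-rotated, non-blocking writer for one dependency, e.g.
// "discv5" ends up in files like discv5.log.2023-12-30.
fn dependency_writer(base_dir: &Path, name: &str) -> (NonBlocking, WorkerGuard) {
    let appender = rolling::daily(base_dir, format!("{name}.log"));
    // Writes are handed off to a background worker thread; the WorkerGuard
    // flushes buffered lines when dropped, so keep it alive while logging.
    tracing_appender::non_blocking(appender)
}

fn main() {
    let base = Path::new("/tmp/lighthouse-dep-logs");
    std::fs::create_dir_all(base).expect("create log directory");
    let (_discv5_writer, _discv5_guard) = dependency_writer(base, "discv5");
    let (_gossipsub_writer, _gossipsub_guard) = dependency_writer(base, "libp2p_gossipsub");
    // Each writer then backs a tracing layer filtered to its target, so each
    // dependency's events land in its own file.
}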

Example discv5 logs

2023-12-30 06:19:53 INFO  Discv5 Service started
2023-12-30 06:19:53 INFO  Ip4
2023-12-30 06:19:53 INFO  16Uiu2HAmEgUR5kmcKF8z3VQKwEKxbBgDoToWHp3DdAefJybbwBoV
2023-12-30 06:19:54 INFO  Local UDP socket updated to: 77.127.179.45:9000
2023-12-30 06:20:14 WARN  NODES Response failed, but was partially processed from: Node: 0x2971..de4d, addr: 174.61.110.113:9100
2023-12-30 06:20:15 WARN  NODES Response failed, but was partially processed from: Node: 0x296a..2228, addr: 104.58.212.29:9002
2023-12-30 06:20:23 WARN  Query id: QueryId(2) timed out
2023-12-30 06:20:41 WARN  NODES Response failed, but was partially processed from: Node: 0xbbfc..d98e, addr: 165.154.224.42:9000
2023-12-30 06:20:50 WARN  NODES Response failed, but was partially processed from: Node: 0x7e81..ef2c, addr: 216.212.216.227:9005
2023-12-30 06:20:50 WARN  Query id: QueryId(7) timed out
2023-12-30 06:20:52 WARN  NODES Response failed, but was partially processed from: Node: 0x6b0a..7d2e, addr: 176.127.215.52:12103
2023-12-30 06:20:53 WARN  Query id: QueryId(8) timed out
2023-12-30 06:21:18 WARN  NODES Response failed, but was partially processed from: Node: 0x0a0a..2412, addr: 45.50.92.150:9000
2023-12-30 06:21:20 WARN  Query id: QueryId(12) timed out
2023-12-30 06:21:23 WARN  Query id: QueryId(13) timed out
2023-12-30 06:21:34 WARN  NODES Response failed, but was partially processed from: Node: 0x00d2..309e, addr: 90.197.108.62:12103
2023-12-30 06:21:35 WARN  NODES Response failed, but was partially processed from: Node: 0x003f..82ad, addr: 120.226.39.101:12000
2023-12-30 06:21:51 WARN  Query id: QueryId(18) timed out
2023-12-30 06:21:53 WARN  Query id: QueryId(19) timed out

Example libp2p-gossipsub logs

2023-12-30 06:34:41 WARN  Rejected message not in cache 879b74cbcbd330d52d1872b6cbd4a55c148628b0
2023-12-30 06:34:41 WARN  Rejected message not in cache fb859e0b1525fb96ab4ef1173e0e15b12231c9d2
2023-12-30 06:34:41 WARN  Rejected message not in cache 8739ca4f524019c786ade9c1082205a1c213ef83
2023-12-30 06:34:41 WARN  Rejected message not in cache 769a533ce08502ace31f7eda9d07037e3ec9e907
2023-12-30 06:34:41 WARN  Rejected message not in cache 22397a3372d2ecc518f68434a5d8ab1a0f65031c
2023-12-30 06:34:41 WARN  Rejected message not in cache a0eba3cdc2b94a888cab3fc79e641b762c4d257d
2023-12-30 06:34:41 WARN  Rejected message not in cache 1adca446acb3abe928f8b0d5ee47d93a9fe7374d
2023-12-30 06:34:41 WARN  Rejected message not in cache ad0dc640889db2209a6b012360a04ef3ec827210
2023-12-30 06:34:41 WARN  Rejected message not in cache d817cbe7d3095c31c2a17fe23696d0b881330d16
2023-12-30 06:34:41 WARN  Rejected message not in cache 6de7d7fb40600cf73a4e6764d3eefe040c18ec7e
2023-12-30 06:34:41 WARN  Rejected message not in cache 2fad3b164441bc50e27177f736b84b6cc0306a9b

Additional Info

This builds on top of changes made in #4979. It may be a bit difficult to review until that PR is merged.

Most of my changes can be found in the following files:

  • lighthouse/src/main.rs
  • lighthouse/common/logging/src/lib.rs
  • lighthouse/common/logging/src/tracing_logging_layer.rs

@eserilev changed the title from [WIP] Critical dependency logging to Critical dependency logging Dec 11, 2023
@eserilev marked this pull request as ready for review December 11, 2023 22:57
@divagant-martian (Collaborator) left a comment

This needs some changes to correctly obtain at least the gossipsub messages; there are also some open questions about other deps.

Comment on lines 42 to 53
match *meta.level() {
    tracing_core::Level::INFO => {
        let _ = file_writer.write(visitor.message);
    }
    tracing_core::Level::WARN => {
        let _ = file_writer.write(visitor.message);
    }
    tracing_core::Level::ERROR => {
        let _ = file_writer.write(visitor.message);
    }
    _ => {}
}
Collaborator

why are you filtering out debug messages?

Collaborator Author

no real reason, I'll add debug messages

Collaborator

I don't think you should be filtering by level here, since that can be controlled with the RUST_LOG environment variable.
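For reference, a minimal sketch of that approach (an assumed setup, not the PR's code; it relies on tracing-subscriber's env-filter feature): the EnvFilter built from RUST_LOG does the level filtering, so a custom layer can simply forward every event it receives.

use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt, EnvFilter};

fn init_tracing() {
    // Respect RUST_LOG if set, e.g. RUST_LOG=libp2p_gossipsub=debug,discv5=debug;
    // the "warn" fallback here is just an illustrative default.
    let filter = EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("warn"));

    tracing_subscriber::registry()
        .with(filter)
        .with(tracing_subscriber::fmt::layer())
        .init();
}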


lazy_static! {
    pub static ref TRACING_LOGGING_DEPENDENCIES: Vec<String> =
        vec!["libp2p".to_string(), "discv5".to_string()];
@divagant-martian (Collaborator) Dec 11, 2023

Note that none of the libp2p dependencies has a target/module path named libp2p. I hacked my changes to get metrics on debug to showcase some of the names we will actually get:

dep_debug_total{target="libp2p_core"} 11
dep_debug_total{target="libp2p_dns"} 8
dep_debug_total{target="libp2p_gossipsub"} 273
dep_debug_total{target="libp2p_mplex"} 31
dep_debug_total{target="libp2p_quic"} 2
dep_debug_total{target="libp2p_swarm"} 15
dep_debug_total{target="libp2p_tcp"} 13

We probably need to decide which of these we want, and what to do with enabled logs from deps not on the list. @AgeManning

Collaborator Author

I'm now capturing libp2p_gossipsub, but can make changes to include some, or all, of the libp2p dependencies based on what we decide

Member

Yeah I think just gossipsub is fine initially. We can maybe add some others in later if things get really spicy.

@divagant-martian added the waiting-on-author (The reviewer has suggested changes and awaits their implementation.) label Dec 12, 2023
@AgeManning (Member) left a comment

Awesome! This looks great! Thanks a lot!

Had a few comments, some are just questions :)

    let _ = NonBlockingFileWriter::clear_file(&path);
}

if let Err(e) = writeln!(file, "{}", message) {
Member

I think it might be useful to tag on the time here.

Typically when things go wrong, they break at some specific time. It would be very useful to have a timestamp so we can cross-reference with metrics/lighthouse logs.
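A rough sketch of that suggestion, assuming the chrono crate is available for formatting (the write_line helper and the format string are illustrative, not the PR's code):

use std::fs::File;
use std::io::Write;

use chrono::Local;

// Prefix each line with a local timestamp so the dependency logs can be
// cross-referenced with metrics and the main Lighthouse logs.
fn write_line(file: &mut File, message: &str) -> std::io::Result<()> {
    let timestamp = Local::now().format("%Y-%m-%d %H:%M:%S");
    writeln!(file, "{timestamp} {message}")
}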

}

fn clear_file(path: &PathBuf) -> std::io::Result<()> {
    File::create(path)?;
Member

Is it possible for us to maybe just nuke half the file?

Just worried about the case that we get a bug, but we have just recently removed all our logs and have no history.
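For illustration, a rough sketch of that idea, keeping only the newer half of the file (an assumption about how it could look; the PR ultimately moved to daily rolling log files instead):

use std::fs::OpenOptions;
use std::io::{Read, Seek, SeekFrom, Write};
use std::path::Path;

// Drop the older half of the log file but keep the recent half, so some
// history survives. The first kept line may be cut mid-way, which is
// acceptable for a log file.
fn keep_recent_half(path: &Path) -> std::io::Result<()> {
    let mut file = OpenOptions::new().read(true).write(true).open(path)?;
    let len = file.metadata()?.len();

    // Read the second half of the file into memory.
    let mut tail = Vec::with_capacity((len / 2) as usize);
    file.seek(SeekFrom::Start(len / 2))?;
    file.read_to_end(&mut tail)?;

    // Truncate and rewrite with just that tail.
    file.set_len(0)?;
    file.seek(SeekFrom::Start(0))?;
    file.write_all(&tail)?;
    Ok(())
}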

if matches.is_present("env_log") {
    Builder::from_env(Env::default()).init();
    eprintln!("The -l flag is DEPRECATED. Dependency logging will be on by default.");
Member

We can probably just remove this conditional altogether.

I don't think we ever use logging to the terminal outside of slog, so it's probably not worth having this exception here. I'm not sure anyone is really using this flag anyway, except for the devs.

Collaborator

OK, talked to Paul and he tells me you discussed this. We are then NOT removing the flag, simply the warning. Leaving this here for the record.

Collaborator

done in c222b36

@@ -214,6 +224,56 @@ impl TimeLatch {
    }
}

pub fn create_tracing_layer(logfile_max_size: u64, base_tracing_log_path: PathBuf) {
    let filter_layer = match tracing_subscriber::EnvFilter::try_from_default_env()
        .or_else(|_| tracing_subscriber::EnvFilter::try_new("warn"))
Member

Maybe we use debug by default?

Collaborator

this one's for me, will change to debug

Collaborator

done in c222b36

}
};

for message in receiver {
Member

Maybe I'm missing something here.

Shouldn't we have a loop here somewhere? It seems like this thread just reads all the messages, writes them, then ends?

It also looks like there is one of these threads per dependency/file. I was expecting an await, or perhaps this task continually gets created, but I only saw one initialisation. I'm likely missing something.

Collaborator Author

> Shouldn't we have a loop here somewhere? It seems like this thread just reads all the messages, writes them, then ends?

As far as I understand it, the mpsc channel creates a sender and a receiver. The receiver blocks until a message is available (as long as a sender still exists).

https://doc.rust-lang.org/std/sync/mpsc/fn.channel.html

So the for message in receiver loop blocks until a message is received, and it should never end as long as the sender exists.
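A minimal illustration of that pattern (hypothetical messages, not the PR's actual writer code):

use std::sync::mpsc;
use std::thread;

fn main() {
    let (sender, receiver) = mpsc::channel::<String>();

    // The writer thread blocks inside the for loop waiting for messages and
    // only exits once every sender has been dropped.
    let writer = thread::spawn(move || {
        for message in receiver {
            println!("would write to the log file: {message}");
        }
    });

    sender.send("Discv5 Service started".to_string()).unwrap();
    drop(sender); // closing the channel lets the writer thread finish
    writer.join().unwrap();
}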

> It also looks like there is one of these threads per dependency/file. I was expecting an await, or perhaps this task continually gets created, but I only saw one initialisation. I'm likely missing something.

There is one thread per dependency; the stream initialization per dependency happens here.

It might be a bit overkill to spawn a new stream per dependency; I can change it to one stream for all dependencies if that makes more sense.

Member

ah yeah all good. Sorry have been in tokio async land too long and forgot about std channels :p

{
    Ok(filter) => filter,
    Err(e) => {
        eprintln!("Failed to initialize dependency tracing {e}");
Member

I think at this point we don't have a usable slogger, so we are forced into println!, right?

Collaborator

yep that's the reason

@AgeManning (Member)

@divagant-martian has suggested some traits that might make this a bit nicer.

There is this trait: https://docs.rs/tracing-subscriber/0.3.18/tracing_subscriber/fmt/writer/trait.MakeWriter.html, which might help with writing fancier logs and timestamps, etc.

There is this one: https://docs.rs/tracing-appender/latest/tracing_appender/non_blocking/struct.NonBlocking.html, which may help with the non-blocking writing of logs.

And this one: https://docs.rs/tracing-appender/latest/tracing_appender/rolling/struct.RollingFileAppender.html, which might help with writing to a file in a rolling fashion, so we can maintain the size rather than clearing half of it periodically, etc.
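A rough sketch of how those pieces could fit together for a single dependency (the paths, file name, and the Targets filter are illustrative assumptions, not the PR's exact code):

use tracing_appender::{non_blocking, rolling};
use tracing_subscriber::{
    filter::{LevelFilter, Targets},
    fmt,
    layer::SubscriberExt,
    util::SubscriberInitExt,
    Layer,
};

fn main() {
    let log_dir = "/tmp/lighthouse-dep-logs";
    std::fs::create_dir_all(log_dir).expect("create log directory");

    // Daily-rotated file for gossipsub, wrapped in a non-blocking writer.
    let appender = rolling::daily(log_dir, "libp2p_gossipsub.log");
    let (writer, _guard) = non_blocking(appender);

    // A fmt layer that writes to that file and only receives events whose
    // target is libp2p_gossipsub, at DEBUG and above.
    let gossipsub_layer = fmt::layer()
        .with_writer(writer)
        .with_ansi(false)
        .with_filter(Targets::new().with_target("libp2p_gossipsub", LevelFilter::DEBUG));

    tracing_subscriber::registry().with(gossipsub_layer).init();

    // Keep _guard alive for as long as logging is needed so buffered lines
    // are flushed on shutdown.
}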

@eserilev (Collaborator Author) commented Dec 30, 2023

Those suggestions seem to have simplified some things. I think this should be ready for another review.

@AgeManning (Member) left a comment

Nice. This looks much cleaner.

We are doing some releases at the moment. So will get this guy in once we're accepting more code.

@AgeManning added the ready-for-merge (This PR is ready to merge.) label and removed the waiting-on-author (The reviewer has suggested changes and awaits their implementation.) label Jan 9, 2024
@AgeManning merged commit f22e5b0 into sigp:unstable Jan 16, 2024
28 checks passed
@pawanjay176 mentioned this pull request Mar 11, 2024