Bitswap: better diagnostics and observability on client #209

hannahhoward · 2023-03-16T15:26:49Z

When I call GetBlocks on the Bitswap client, I get back a channel and a synchronous error that is almost always nil. The primary way bitswap "errors" is by the context timing out. I don't know if we count find peers with the block, or we couldn't connect to them, or when we did we found they responded with don't have. So I have no way to track "what went wrong". I also don't have much of a way to track what went right -- if I get a block back, I don't know whom I got it from, the latency, the throughput, anything really.

This makes it very hard for consumers of the Bitswap library to measure performance or debug errors. Right now I am just turning on a bunch of debug logs, which is fine but means I can't do anything with those errors in code, or report them for metrics.

I realize as asynchronous code that ultimately talks to a pubsub protocol, returning errors or even channels of errors on GetBlocks may be hard. However, the client should at least be provided hooks so they can see more about why their requests are failing.

This hook seems like a good start: https://github.com/ipfs/go-libipfs/blob/main/bitswap/client/client.go#LL78C8-L78C8

For starters, we could have a hook that offered the same information, but also exposed Haves and Don't Haves.

Additionally though, it would be nice to be able to observe:

FindProvider calls to the DHT and what they returned
Message sending errors at the network level

Maybe such hooks exist somewhere? At the top level bitswap there is https://github.com/ipfs/go-libipfs/blob/main/bitswap/tracer/tracer.go#L10 but that seems not to exist for just the client? Even the tracer appears to only track successfully sent messages, not errors.

Jorropo · 2023-03-16T16:03:49Z

Maybe such hooks exist somewhere? At the top level bitswap there is https://github.com/ipfs/go-libipfs/blob/main/bitswap/tracer/tracer.go#L10 but that seems not to exist for just the client? Even the tracer appears to only track successfully sent messages, not errors.

I don't know if that what you are asking for, but on the message tracking you can register for just the client: https://github.com/ipfs/go-libipfs/blob/e45eb146ba119528076276b4efb4f6123786eb29/bitswap/client/client.go#L72-L76

BigLep · 2023-03-20T16:24:14Z

@hannahhoward : ACK on the request for better observability.

If I understand correctly, message tracking on the client above keeps you unblocked for Rhea. Please share if that isn't the case. Also, I'm happy to get a synchronous meeting setup to figure out how to solve specific blocking usecases,

For things like "FindProvider calls to the DHT and what they returned", I wonder if we should be developing the strategy of external tooling were can do deeper diagnostics (along the lines of IPFS check, vole, etc.).

Rewriting the Bitswap client is in the plans for 2023 in light of the operational events and continued pain with this code. Traceability will be an important element. The specifics or timeline haven't been flushed out, but cross-linking to #73

willscott · 2023-03-27T19:04:17Z

I chatted briefly about this recently.
Being able at the end of a session to learn

has any / how many peers have been discovered that might have this cid?
how many of those peers are / have been connected to?

would be enough for useful triage / returned error messages

BigLep · 2023-03-31T20:39:34Z

@Jorropo @guseggert : I'm not committing to when we will do this work yet (as a lot happening right now before IPFS Thing!) but it will be important/useful to get a sense of:

Is this on the path of work we were planning to do anyways, or is this "throwaway" work?
How much effort (ideal developer days) would it take to do this right?

I'm happy to talk in Slack or sync if we need some more back and forth.

I'd like to have a sense of this before 2023-04-03 Rhea meeting. Thank you!

willscott · 2023-04-12T18:30:29Z

This came up again this week as something pretty important for rhea: we need visibility into what blocks are coming from a specific peer.

BigLep · 2023-04-12T22:12:53Z

Ack. Two things here:

Seems worthwhile to get the specific interfaces defined in while all together in Brussels. I'll put that on my list of things to schedule.
This will be in the 0.21 release cycle, which will be started on after IPFS Thing.

Jorropo · 2023-04-13T10:09:58Z

@willscott @hannahhoward

I have a few questions:

Do you want a programmatic API or could this be done with Otel Tracing ? (Otel is easier to setup and view.)

FindProvider calls to the DHT and what they returned

Assuming contextes are correctly passed (I need to test #172 with tracing), this could be done easily by viewing events in the traces. Would that be fine ?

Message sending errors at the network level

You can wrap the bitswap network object to do this. I don't think it make sense for this to be exposed in the API because GetBlock{,s} don't care if a message failed to send, it is eventually consistent and will retry later or with other peers.

If you want thoses errors at the GetBlock{,s} level, then I think the best solution is to errors.Join network errors with ctx.Err() when ctx.Done() expires.
In other words, keep the eventually consistent behaviour, but also decorates ctx.Err() with reasons why the request might have not been executed in time when the context.Context expires.

What of the propositions above would solve your problem ?

has any / how many peers have been discovered that might have this cid?

That is easy to do, this already exists as SessionPeerManager but is not exposed.
I guess this wont be high quality tho, because not all downloads from a sessions goes through the session but this should help at finding bugs.

how many of those peers are / have been connected to?

This is not a metric bitswap internally keeps track of nor use in the algorithm, it wouldn't be very hard to expose this.
But I think that exposing metrics which are not used to make decisions about the download would lead to a missrepresentation of how the bitswap client works.

willscott · 2023-04-13T14:27:03Z

Do you want a programmatic API or could this be done with Otel Tracing ?

Hannah's original post mentions "Right now I am just turning on a bunch of debug logs, which is fine but means I can't do anything with those errors in code, or report them for metrics."

At the core we'd like to be able to correlate the peer-block-fetch activity back to sessions / requests.

I think if instrumentation was through tracing we'd need to set up a custom exporter on each call to retrieve the spans? I don't know if that ends up being practical/efficient - my sense is that the tracing is meant to be sampled when run at scale for efficiency, but i might be wrong.

it is eventually consistent and will retry later or with other peers.

I think i agree that a decorated context error is a reasonable way to surface what happened in timeout cases.
My main question is how that works?
isn't the caller that will cancel? how does bitswap decorate that error? - per https://pkg.go.dev/context#CancelCauseFunc it looks like errors set in a child context cancellation will not propagate back up to the parent.

would lead to a missrepresentation of how the bitswap client works.

some visibility into connectability is the main thing here: separating the cases into:

we couldn't find peers
we found a peer, but couldn't connect
we connected, but couldn't get the block

Jorropo · 2023-04-13T16:12:40Z

My main question is how that works?
isn't the caller that will cancel? how does bitswap decorate that error? - per https://pkg.go.dev/context#CancelCauseFunc it looks like errors set in a child context cancellation will not propagate back up to the parent.

We would use errors.Join or write a custom wrapping error type and return some errors along side ctx.Err() and return this decorated in the error from GetBlock{,s}.
https://pkg.go.dev/github.com/ipfs/boxo@v0.8.0/exchange#Fetcher.GetBlock

This isn't propagated using the context, this is returned, but we wrap ctx.Err() in the decorator error type before returning it.

This will correctly supports https://pkg.go.dev/context#CancelCauseFunc assuming consumers errors.Is and errors.As as they should.

hannahhoward · 2023-04-17T13:55:54Z

I don't wrapping a context error is the right approach, cause we need metrics information on all requests, not just failures.

For example, in the current cost challenges we had around web3.storage, the metric we needed to track is total downloaded from Web3.storage. But there's no public way to know "when I called GetBlock or GetBlocks, and got data successfully, who actually sent me the the block?" -- having in the bitswap code myself, I know this is knowable, it's just not exposed. It would also be great to know-- who did I ask for the block? How many did I send a want-have vs want-block to?

Ultimately, Lassie wants to aggregate these values into metrics to surface things like:

"How much data did we download from this peer in this time period"?
"How often did this peer respond successfully to a request for data it claimed to have?"
"How quickly does this peer respond (when already connected) ? -- i.e. TTFB"

If we don't want to modify GetBlock signature, then we could certainly return information through a listener of some sort.

I agree that these are slightly different requests than in the original ticket, cause our understanding has evolved over time. One thing that might be useful to look at is the SP Impact section of the Lassie metrics dashboard -- this is what we're able to assemble from GraphSync requests -- what would it take to assemble the same precision of data from Bitswap? But yes, I agree with @BigLep the best thing would be to discuss this synchronously at IPFS Thing.

BigLep · 2023-04-17T22:39:19Z

I've asked @guseggert and @Jorropo to get time scheduled per https://filecoinproject.slack.com/archives/C03FFEVK30F/p1681771135478849

Jorropo · 2023-04-18T09:49:01Z

I'm gonna submit a pull request that creates a custom block type which indicates from which peer we downloaded a block and how long did it took from first attempt to successful download because it's not that hard and can get some valuable information.
For long term solution solving most problems we are looking at a rewrite, either with something like the selfish bitswap client, rapide or a new lower level network API which helps building new clients like rapide or selfish bitswap client. Technically it's also possible to hack content routing awareness in the current client but it will make the code much worst given it's broadcast and timing oriented right now.

BigLep · 2023-05-17T16:45:37Z

For visibility, kubo/boxo maintainers are planning to do the short term item @Jorropo discussed here as part of this current iteration cycle (0.21). This issue is on the board.

Fixes #209

BigLep · 2023-06-02T22:13:14Z

2023-06-02 update on where we're at. This issue hasn't been forgotten. Week of 2023-06-05, @Jorropo will be

submit you a PR on lassie implementing the required metrics as he understands them (using what already exists in boxo/bitswap).
bitswap/client: add basic traceable blocks #308 will also be completed, but isn't critical for the short term ask.

@Jorropo will do this once he completes the Kubo 0.21 release blocker items on his plate (the first 3 "required" items in ipfs/kubo#9814).

Screenshot:

BigLep · 2023-06-08T18:22:09Z

Per 2023-06-08 maintainer conversation, @Jorropo still expects to complete this by end of week (2023-06-09).

BigLep · 2023-06-20T19:32:32Z

Notes from 2023-06-20 maintainer conversation:

There was a communication gap to to acknowledge and update on the date slippage.

What will be delivered: given Lassie having a custom metric setup, a PR won't be made in Lassie. Instead @Jorropo will produce a Prometheus example of bitswap metrics in Kubo.

This has been delayed due to fixing goroutine leaks and deadlocks discovered in the latest Kubo 0.21 RC. @Jorropo is doing the Thunderdome testing to complete the 0.21 release (tracking issue).

The only other items ahead of doing the bitswap work above is to complete PL EngRes performance review activities.

We expect this to be completed by the end of the week (2023-06-23). @Jorropo will update otherwise.

Fixes #209

BigLep · 2023-07-10T18:49:08Z

@Jorropo: please update if I have this incorrect, but my understanding from last week is that this will get done this week (week of 2023-07-10), and will happen after ipfs/kubo#9927.

For #209 Need examples.

BigLep · 2023-07-20T23:44:14Z

@Jorropo has posted a PR: #413

willscott · 2023-07-24T18:31:21Z

is there some background / motivation behind the new approach in #413 vs the previous iteration in #308 ?
it seemed like we already had what we needed in #308 so i wonder if we are just adding extra work in iterating again.

cc @davidd8

BigLep · 2023-07-25T01:05:20Z

is there some background / motivation behind the new approach in #413 vs the previous iteration in #308 ? it seemed like we already had what we needed in #308 so i wonder if we are just adding extra work in iterating again.

Per slack, "boxo 308 and Boxo 413 solves completely different problems". My understanding is that #413 is doing what we said we'd do in this issue here.

Anyways, #308 will happen this week given the feedback that that is higher priority.

Fixes #209

hannahhoward added the need/triage Needs initial labeling and prioritization label Mar 16, 2023

BigLep added this to IPFS Shipyard Team Apr 12, 2023

BigLep moved this to 🥞 Todo in IPFS Shipyard Team Apr 24, 2023

BigLep assigned Jorropo Apr 24, 2023

BigLep mentioned this issue Apr 26, 2023

[ipfs/go-bitswap] go-bitswap overhaul 2023 #73

Open

BigLep moved this from 🥞 Todo to 🏃‍♀️ In Progress in IPFS Shipyard Team May 18, 2023

Jorropo added a commit that referenced this issue May 18, 2023

bitswap/client: add basic traceable blocks

4537dcf

Fixes #209

Jorropo mentioned this issue May 18, 2023

bitswap/client: add basic traceable blocks #308

Merged

Jorropo added a commit that referenced this issue May 18, 2023

bitswap/client: add basic traceable blocks

a6bfd5e

Fixes #209

hannahhoward pushed a commit that referenced this issue Jun 30, 2023

bitswap/client: add basic traceable blocks

545cdf1

Fixes #209

BigLep mentioned this issue Jul 10, 2023

bitswap: reenable unit tests #327

Open

BigLep moved this from 🏃‍♀️ In Progress to 🥞 Todo in IPFS Shipyard Team Jul 12, 2023

Jorropo added a commit that referenced this issue Jul 20, 2023

bitswap: add peer-prom-tracker

d9abb63

For #209 Need examples.

Jorropo mentioned this issue Jul 20, 2023

bitswap: peer prom tacker #413

Closed

3 tasks

BigLep moved this from 🥞 Todo to 🏃‍♀️ In Progress in IPFS Shipyard Team Jul 20, 2023

Jorropo moved this from 🏃‍♀️ In Progress to 🥞 Todo in IPFS Shipyard Team Jul 24, 2023

hacdias pushed a commit that referenced this issue Jul 26, 2023

bitswap/client: add basic traceable blocks

05ffe06

Fixes #209

hacdias closed this as completed in #308 Jul 26, 2023

github-project-automation bot moved this from 🥞 Todo to 🎉 Done in IPFS Shipyard Team Jul 26, 2023

BigLep mentioned this issue Aug 3, 2023

Release 0.22 ipfs/kubo#9911

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bitswap: better diagnostics and observability on client #209

Bitswap: better diagnostics and observability on client #209

hannahhoward commented Mar 16, 2023

Jorropo commented Mar 16, 2023

BigLep commented Mar 20, 2023 •

edited

Loading

willscott commented Mar 27, 2023

BigLep commented Mar 31, 2023

willscott commented Apr 12, 2023

BigLep commented Apr 12, 2023

Jorropo commented Apr 13, 2023 •

edited

Loading

willscott commented Apr 13, 2023

Jorropo commented Apr 13, 2023 •

edited

Loading

hannahhoward commented Apr 17, 2023

BigLep commented Apr 17, 2023

Jorropo commented Apr 18, 2023

BigLep commented May 17, 2023

BigLep commented Jun 2, 2023

BigLep commented Jun 8, 2023 •

edited

Loading

BigLep commented Jun 20, 2023

BigLep commented Jul 10, 2023

BigLep commented Jul 20, 2023

willscott commented Jul 24, 2023

BigLep commented Jul 25, 2023

Bitswap: better diagnostics and observability on client #209

Bitswap: better diagnostics and observability on client #209

Comments

hannahhoward commented Mar 16, 2023

Jorropo commented Mar 16, 2023

BigLep commented Mar 20, 2023 • edited Loading

willscott commented Mar 27, 2023

BigLep commented Mar 31, 2023

willscott commented Apr 12, 2023

BigLep commented Apr 12, 2023

Jorropo commented Apr 13, 2023 • edited Loading

willscott commented Apr 13, 2023

Jorropo commented Apr 13, 2023 • edited Loading

hannahhoward commented Apr 17, 2023

BigLep commented Apr 17, 2023

Jorropo commented Apr 18, 2023

BigLep commented May 17, 2023

BigLep commented Jun 2, 2023

BigLep commented Jun 8, 2023 • edited Loading

BigLep commented Jun 20, 2023

BigLep commented Jul 10, 2023

BigLep commented Jul 20, 2023

willscott commented Jul 24, 2023

BigLep commented Jul 25, 2023

BigLep commented Mar 20, 2023 •

edited

Loading

Jorropo commented Apr 13, 2023 •

edited

Loading

Jorropo commented Apr 13, 2023 •

edited

Loading

BigLep commented Jun 8, 2023 •

edited

Loading