Use find providers async context #172

Closed · wants to merge 1 commit

Conversation

hannahhoward (Contributor)

Goals

In Bitswap, when I call FindProvidersAsync on the ProviderQueryManager, the eventual call to the DHT should retain the context I passed to the ProviderQueryManager.

Background

Since I imagine the reviewer here may or may not have ever worked with the ProviderQueryManager in Bitswap, here's some background on it.

Essentially, I wrote this piece of code almost three years ago, and it has two purposes:

  • Limit the number of concurrent DHT queries running at the same time
  • Reuse queries for the same CID across sessions (and/or the global context if, for some reason, you're not using sessions)

It is a lot of code to manage all that, and looking at it now, I'm like, "wow, new-Go-programmer me, you were really goroutine-happy back then". It all works (because, you know, this runs in every bitswap) but could be simplified a lot.

But, today is not that day.

Implementation

I was surprised to find that the context I passed to initialize a bitswap session in Lassie was not being retained here. Instead, I was getting the global context. Digging into this, I found what I can only assume is a bug in the code I wrote. While the session context is passed across the goroutine barrier, it isn't even used; instead, the global context is. Perhaps because of the wonderfully similarly named variables npqm and pqm, which represent the structs holding the local and global context respectively.

I am pretty sure this is a bug, because otherwise why pass the session context across the goroutine boundary at all? So I corrected it.
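To make the distinction concrete, here is a small standalone sketch (illustrative only; this is not the providerquerymanager code, the variables just mirror the global vs. per-request contexts discussed above) of what the query goroutine ends up seeing in each case:

```go
package main

import (
	"context"
	"fmt"
)

type ctxKey string

// startQuery stands in for the goroutine that eventually calls the DHT;
// it only sees whatever context the manager derived for it.
func startQuery(queryCtx context.Context) {
	fmt.Println("session value visible to the DHT call:", queryCtx.Value(ctxKey("session")))
}

func main() {
	globalCtx := context.Background() // plays the role of the manager's global context
	sessionCtx := context.WithValue(context.Background(), ctxKey("session"), "A")

	// Before this PR: the session context crosses the goroutine boundary
	// but the query is derived from the manager's global context.
	before, cancelBefore := context.WithCancel(globalCtx)
	defer cancelBefore()
	startQuery(before) // prints: <nil> -- session values (and spans) are lost

	// With this PR: derive the query context from the request's context.
	after, cancelAfter := context.WithCancel(sessionCtx)
	defer cancelAfter()
	startQuery(after) // prints: A -- session values reach the DHT call
}
```

Running it prints `<nil>` for the first call and `A` for the second, which is the behavioral difference this PR is about.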

For Discussion

Looking at my code... well, I'm not actually sure of the intent here. There are implications to using the session context -- if two queries are made for the same CID in two different sessions, the first session's context is used, so if that session gets cancelled, the other's query is also cancelled. Maybe that's why I used the global context? But then why pass the context across the goroutine at all? (Alternate solution: don't, for clarity.)

The use case we're dealing with is trying to extract values from the context. Maybe that's too much of an edge case to optimize for.

But this PR aims to identify the bug (a context is passed across a goroutine but never used) and let you all decide the right outcome.

@hannahhoward hannahhoward requested a review from a team as a code owner February 15, 2023 21:13
@ipfs ipfs deleted a comment from welcome bot Feb 15, 2023

codecov bot commented Feb 15, 2023

Codecov Report

Merging #172 (e9a7c12) into main (fe55533) will decrease coverage by 39.58%.
Report is 2 commits behind head on main.
The diff coverage is 0.00%.

❗ Current head e9a7c12 differs from pull request most recent head add7775. Consider uploading reports for the commit add7775 to get more accurate results

Impacted file tree graph

@@             Coverage Diff             @@
##             main     #172       +/-   ##
===========================================
- Coverage   65.58%   26.01%   -39.58%     
===========================================
  Files         207      100      -107     
  Lines       25540    11061    -14479     
===========================================
- Hits        16750     2877    -13873     
- Misses       7323     7842      +519     
+ Partials     1467      342     -1125     
Files Coverage Δ
...ernal/providerquerymanager/providerquerymanager.go 0.00% <0.00%> (-85.48%) ⬇️

... and 193 files with indirect coverage changes

@Jorropo Jorropo self-requested a review February 15, 2023 21:23
@aschmahmann aschmahmann added P2 Medium: Good to have, but can wait until someone steps up kind/bug A bug in existing code (including security flaws) labels Aug 21, 2023
aschmahmann (Contributor)

Assuming this is low priority for now; if not, reach out.

Jorropo (Contributor) commented Nov 28, 2023

@hannahhoward I would like to merge your pull request because it fixes an issue with bitswap where OpenTelemetry traces between bitswap inputs and the content router are not linked. However, your fix brings another issue: the providerquerymanager deduplicates FindProvidersAsync calls.

In other words, if this sequence happens:

  • Session A starts to download Qmfoo.
  • Session B starts to download Qmfoo.
  • Session A starts a FindProvidersAsync for Qmfoo.
  • Session B starts a FindProvidersAsync for Qmfoo.
  • Session A's context is canceled.
  • Session B's FindProvidersAsync call fails because it was deduplicated into Session A's call.

This is captured by the failing TestCancelOneRequestDoesNotTerminateAnother test.
The previous version of the code you changed is correct here: by scoping the query to the broader global context, it is a nuclear way to ensure this won't happen. (Something smarter could do refcounting, but that only solves cancellations, not Values, and thus doesn't help with my tracing.)

I'm not sure why bitswap needs to deduplicate content routing queries like that. I would suggest we remove all of this if possible; how likely is it that two sessions request the same blocks anyway?
And if this is an issue, I think this should be extracted into a cachingContentRouter in its own package, so users can DI the whole flow.
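A rough sketch of what such an extracted wrapper could look like (illustrative only; a real version would wrap libp2p's ContentRouting and its channel-based FindProvidersAsync rather than the simplified findProviders function used here):

```go
package caching

import (
	"context"
	"sync"
)

// findProviders stands in for the underlying router call (e.g. libp2p's
// ContentRouting.FindProvidersAsync); simplified here to keep the sketch small.
type findProviders func(ctx context.Context, key string) ([]string, error)

// Dedup collapses concurrent lookups for the same key into one upstream call.
type Dedup struct {
	inner findProviders

	mu       sync.Mutex
	inflight map[string]*call
}

type call struct {
	done  chan struct{}
	peers []string
	err   error
}

func New(inner findProviders) *Dedup {
	return &Dedup{inner: inner, inflight: make(map[string]*call)}
}

func (d *Dedup) FindProviders(ctx context.Context, key string) ([]string, error) {
	d.mu.Lock()
	if c, ok := d.inflight[key]; ok {
		d.mu.Unlock()
		select { // join the in-flight lookup, but honour our own cancellation
		case <-c.done:
			return c.peers, c.err
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	c := &call{done: make(chan struct{})}
	d.inflight[key] = c
	d.mu.Unlock()

	// Run the upstream query detached from this caller's cancellation so that
	// cancelling one session does not fail the others (context.WithoutCancel
	// needs Go 1.21+; it keeps Values, e.g. tracing, but drops cancellation).
	c.peers, c.err = d.inner(context.WithoutCancel(ctx), key)
	close(c.done)

	d.mu.Lock()
	delete(d.inflight, key)
	d.mu.Unlock()
	return c.peers, c.err
}
```

(For the blocking form this is essentially what golang.org/x/sync/singleflight already provides; a real wrapper would also need a global concurrency limit if we still want one.)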


Is there something else I am missing? Can you think of anything bad that will happen when I remove the providerquerymanager?

Jorropo added a commit that referenced this pull request Dec 29, 2023
Jorropo added a commit that referenced this pull request Dec 29, 2023
Jorropo added a commit that referenced this pull request Dec 29, 2023
Closes: #172
See #172 (comment) too.

providerQueryManager took care of:
- Deduping multiple sessions doing find providers for the same CID.
- Limiting global find-providers concurrency.

None of which we care about:
- This is rare; if it happens it's fine to run the same query twice.
  If we cared, we should make a deduping content router so we can inject it anywhere a content router is needed.
- It's fine to allow one concurrent find-peer per session. There's no need to cap this at 6 globally on top of that; doing so is a great way to stall nodes doing many queries.
Jorropo added a commit that referenced this pull request Jan 3, 2024
Jorropo added a commit that referenced this pull request Jan 5, 2024
Jorropo added a commit that referenced this pull request Jan 11, 2024
Jorropo added a commit that referenced this pull request Jan 16, 2024
Jorropo added a commit that referenced this pull request Jan 16, 2024
Jorropo added a commit that referenced this pull request Feb 14, 2024
Jorropo added a commit that referenced this pull request Feb 16, 2024
Jorropo added a commit that referenced this pull request Feb 16, 2024
Jorropo added a commit that referenced this pull request Feb 16, 2024
@Jorropo Jorropo removed their request for review March 4, 2024 07:59
@Jorropo Jorropo removed their assignment Mar 4, 2024
gammazero (Contributor) left a comment
I recommend that we do not take this PR, because when creating a new inProgressRequestStatus it is necessary to use a context derived from pqm.ctx, not the context from the request (npqm.ctx). This is because the inProgressRequestStatus applies to all in-progress requests for the CID.

Using the request context would be a problem because if the first request were canceled, then all other requests would be cancelled too. Also, if any request other than the first were cancelled, it would have no effect.

A "problem" with not accepting the PR is that for tracing this means that only the span from the first request-in-progress for a CID is used, even if there are multiple requests for the same CID. This seems acceptable.

This PR should be rejected, and a new PR created to change FindProvidersAsync to not take a context, since that is not useful to the caller. A session context can still be created within FindProvidersAsync to carry the tracing span.

aschmahmann (Contributor)

There are implications to using the session context -- if two queries are made for the same CID in two different sessions, the first session's context is used, so if that session gets cancelled, the other's query is also cancelled. Maybe that's why I used the global context?

@gammazero I agree that we shouldn't just swap the context here to use the per-request context. Given that one of the functions (and sources of complexity) of the providerquerymanager is to deduplicate provider lookups, this would defeat the purpose.

I'm not sure why bitswap needs to deduplicate content routing queries like that. I would suggest we remove all of this if possible; how likely is it that two sessions request the same blocks anyway?
And if this is an issue, I think this should be extracted into a cachingContentRouter in its own package, so users can DI the whole flow.

If we think that functionality is unnecessary, we could remove it or extract it as recommended here, although that might require setting aside some time to evaluate performance under the situations likely to strain it (e.g. running this code in high-request environments like public gateways, and trying with both the regular and accelerated DHT clients to see if it makes a difference). To some extent, running this test likely means extracting a wrapper anyway, which might be reasonable.

Closing this seems reasonable to me and we can use a new issue to discuss the better options here should we need to.


Synthesizing my understanding from above, there are two main issues with the current setup:

  1. For tracing, only the first request's context is used -> IIUC there are some options here around using OpenTelemetry Links, but I'm not sure they'll end up being usable here (e.g. if the second context was being recorded but not the first, how does this end up working?). However, this shouldn't be a problem if only a single request at a time happens in practice.
  2. If all the callers cancel their contexts, the query still continues -> probably not the biggest deal given we have the hard-coded max timeout, but there may be some options if we wanted to deal with this while keeping the deduplication behavior. For example, we could create a new context for the query and cancel it if all outstanding requests get cancelled (a rough sketch of this follows below).
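A minimal sketch of that refcounting idea (hypothetical code, not from this repository): each caller joins the shared query, and the shared query context is cancelled only once every caller's context is done.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// sharedQuery owns the context that the actual DHT query runs under.
type sharedQuery struct {
	ctx    context.Context
	cancel context.CancelFunc

	mu   sync.Mutex
	refs int
}

// join registers one caller and cancels the shared query only when the
// last caller's context is done.
func (q *sharedQuery) join(callerCtx context.Context) {
	q.mu.Lock()
	q.refs++
	q.mu.Unlock()

	go func() {
		<-callerCtx.Done()
		q.mu.Lock()
		q.refs--
		last := q.refs == 0
		q.mu.Unlock()
		if last {
			q.cancel() // every caller cancelled -> stop the query
		}
	}()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	q := &sharedQuery{ctx: ctx, cancel: cancel}

	a, cancelA := context.WithCancel(context.Background())
	b, cancelB := context.WithCancel(context.Background())
	q.join(a)
	q.join(b)

	cancelA()
	time.Sleep(10 * time.Millisecond)
	fmt.Println("after A cancels, query done?", q.ctx.Err() != nil) // false

	cancelB()
	time.Sleep(10 * time.Millisecond)
	fmt.Println("after B cancels, query done?", q.ctx.Err() != nil) // true
}
```

The same bookkeeping would also need to handle the query finishing on its own (so the watcher goroutines don't linger), which is omitted here for brevity.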

Note: in the time since this PR we've added some tracing and at least copied the span information from the first request into the context here so that information propagates through.
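For reference, the two tracing mechanisms discussed here (copying the first request's span context, as in the note above, or linking later callers' spans, as in point 1) look roughly like this with the OpenTelemetry Go SDK; this is an illustrative sketch, not code from this repository:

```go
package tracingsketch

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

// adoptFirstRequestSpan copies the first request's span context into the
// context the shared query runs under, so downstream DHT spans become
// children of that first request.
func adoptFirstRequestSpan(queryCtx, firstReqCtx context.Context) context.Context {
	return trace.ContextWithSpanContext(queryCtx, trace.SpanContextFromContext(firstReqCtx))
}

// startLinkedQuerySpan instead records every caller as a link on the shared
// query span, so deduplicated requests still show up in their own traces.
func startLinkedQuerySpan(queryCtx context.Context, callerCtxs ...context.Context) (context.Context, trace.Span) {
	links := make([]trace.Link, 0, len(callerCtxs))
	for _, c := range callerCtxs {
		links = append(links, trace.LinkFromContext(c))
	}
	return otel.Tracer("providerquerymanager").Start(queryCtx, "FindProviders", trace.WithLinks(links...))
}
```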

IMO, investing time in fixing either of these isn't worthwhile until we've taken the time to re-evaluate whether the query deduplication is actually required.
