
Help review and get DHT pipelining merged (go-libp2p-kad-dht) #433

Closed
laser opened this issue May 14, 2018 · 20 comments
Labels
P3 Low - not important right now

Comments

@laser (Contributor) commented May 14, 2018

Split out from @whyrusleeping's original catch-all issue.

Description

vyzo's DHT Request Pipelining story appears to be code-complete, but requires additional tests in order for @whyrusleeping to feel good about merging it:

@vyzo what would make me feel better about merging this is having a test that exercises delayed responses. We could do this by using a datastore with delays (https://github.com/ipfs/go-datastore/blob/master/delayed/delayed.go) and asserting that the requests in aggregate took below a certain amount of time.
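(For context, here is a minimal sketch of the kind of test being suggested, assuming the delayed wrapper from go-datastore and go-ipfs-delay. `setupTestDHTPair` is a hypothetical helper, and the timing threshold is illustrative, not taken from the PR.)

```go
// Sketch only: back the remote peer's datastore with a fixed per-operation
// delay, fire several PutValue calls concurrently, and assert that the
// aggregate wall-clock time is well below what fully serialized requests
// (roughly n * latency) would cost.
package dht_test

import (
	"context"
	"fmt"
	"sync"
	"testing"
	"time"

	ds "github.com/ipfs/go-datastore"
	delayed "github.com/ipfs/go-datastore/delayed"
	dssync "github.com/ipfs/go-datastore/sync"
	delay "github.com/ipfs/go-ipfs-delay"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

// setupTestDHTPair is a hypothetical helper (not part of the PR): it should
// construct two connected DHT nodes, back the remote one with the given
// datastore, and return the local node.
func setupTestDHTPair(t *testing.T, ctx context.Context, remoteDS ds.Datastore) *dht.IpfsDHT {
	t.Helper()
	panic("sketch only: wire up two DHT nodes here")
}

func TestPipelinedPutsWithSlowDatastore(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	latency := 100 * time.Millisecond
	slow := delayed.New(dssync.MutexWrap(ds.NewMapDatastore()), delay.Fixed(latency))
	local := setupTestDHTPair(t, ctx, slow)

	const n = 10
	start := time.Now()
	var wg sync.WaitGroup
	errs := make(chan error, n)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			errs <- local.PutValue(ctx, fmt.Sprintf("/v/pipeline-%d", i), []byte("value"))
		}(i)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			t.Fatal(err)
		}
	}
	elapsed := time.Since(start)

	// With pipelining, concurrent requests to the same peer overlap; without
	// it, they queue behind one another and cost roughly n * latency in total.
	if elapsed > time.Duration(n)*latency/2 {
		t.Fatalf("concurrent puts took %s; expected pipelining to keep this well under %s",
			elapsed, time.Duration(n)*latency)
	}
}
```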

@laser laser assigned acruikshank and laser and unassigned acruikshank and laser May 14, 2018
@laser laser added the dep:ipfs label May 15, 2018
@laser laser changed the title from "Write tests for DHT Request Pipelining (go-libp2p-kad-dht)" to "DHT Request Pipelining (go-libp2p-kad-dht)" May 15, 2018
@laser laser changed the title from "DHT Request Pipelining (go-libp2p-kad-dht)" to "Help review and get DHT pipelining merged (go-libp2p-kad-dht)" May 15, 2018
@laser laser added this to the Sprint 10 milestone May 15, 2018
@laser (Contributor, Author) commented May 17, 2018

PR which adds tests to go-libp2p-kad-dht here: libp2p/go-libp2p-kad-dht#149

@mishmosh mishmosh modified the milestones: Sprint 10, Sprint 11, Sprint 12 May 24, 2018
@laser (Contributor, Author) commented May 31, 2018

I've rebased vyzo's branch onto master and fixed some conflicts. Per my conversation with @whyrusleeping, the changes introduced in commit 97741ed have impacted the performance of the request-pipelining branch: all PutValue requests were finishing in ~19 seconds pre-rebase and are now taking ~37 seconds (on my machine).

I will spend the remainder of the day looking through the contents of 97741ed to understand why the changeset slowed down PutValue calls.

@laser (Contributor, Author) commented Jun 1, 2018

I've run some tests and found some interesting data:

| Branch                                | PutValue mean (sec) | PutValue stdev | GetValue mean (sec) | GetValue stdev | Total mean (sec) | Total stdev |
|---------------------------------------|---------------------|----------------|---------------------|----------------|------------------|-------------|
| master                                | 64.51               | 10.65          | 19.69               | 11.30          | 87.42            | 10.65       |
| feat/request-pipeline (rebased)       | 63.20               | 11.45          | 21.18               | 17.55          | 86.20            | 11.45       |
| feat/request-pipeline (before rebase) | 22.43               | 2.43           | 6.44                | 0.55           | 29.38            | 2.43        |

https://docs.google.com/spreadsheets/d/1f4fWeepjNOAOh_-eEGREkxcyRCjxfFxCQGHvPvATL3o/edit?usp=sharing

I ran my perf test 10 times with requests=60 against the master branch (no pipelining), the pipelining branch before the rebase, and the pipelining branch after the rebase. The data show that, post-rebase, the pipelining branch doesn't perform much better than master.
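(For reference, a rough sketch of how per-call timings like these can be collected; the connected `*dht.IpfsDHT` instance and the key scheme are placeholders, not the actual harness behind the numbers above.)

```go
// Sketch: time a batch of PutValue calls against an already-constructed and
// connected DHT node and report the mean and standard deviation, roughly
// mirroring the requests=60 runs summarized in the table above.
package bench

import (
	"context"
	"fmt"
	"math"
	"time"

	dht "github.com/libp2p/go-libp2p-kad-dht"
)

func timePuts(ctx context.Context, d *dht.IpfsDHT, requests int) (mean, stdev time.Duration, err error) {
	samples := make([]time.Duration, 0, requests)
	for i := 0; i < requests; i++ {
		start := time.Now()
		if err = d.PutValue(ctx, fmt.Sprintf("/v/bench-%d", i), []byte("value")); err != nil {
			return 0, 0, err
		}
		samples = append(samples, time.Since(start))
	}

	var sum time.Duration
	for _, s := range samples {
		sum += s
	}
	mean = sum / time.Duration(len(samples))

	var sq float64
	for _, s := range samples {
		diff := float64(s - mean)
		sq += diff * diff
	}
	stdev = time.Duration(math.Sqrt(sq / float64(len(samples))))
	return mean, stdev, nil
}
```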

@whyrusleeping (Member)

Good find. I think you talked to @Stebalien and are moving forward with a different method of simulating latency, right?

@Stebalien (Member)

Yes. The intrusive method I tried (introducing ~100ms of latency in the handler loop), which isn't suitable for test cases, showed a ~5x performance increase on writes and ~6x on reads.

@mishmosh mishmosh modified the milestones: Sprint 12, Sprint 13 👻 Jun 5, 2018
@laser (Contributor, Author) commented Jun 5, 2018

I just pushed an update to the PR's branch which replaces the delayed datastore with a Mocknet configured with 100ms of latency.

libp2p/go-libp2p-kad-dht#92
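(For reference, a minimal sketch of a mocknet with per-link latency, using go-libp2p's mock network package; note that the `mocknet.New` signature has varied across go-libp2p versions, and the DHT construction on top of the hosts is omitted here.)

```go
// Sketch: a mock network whose links all carry 100ms of latency, so
// request/response round-trips have a realistic cost in tests without
// relying on a delayed datastore.
package dht_test

import (
	"testing"
	"time"

	mocknet "github.com/libp2p/go-libp2p/p2p/net/mock"
)

func newLatentMocknet(t *testing.T, n int) mocknet.Mocknet {
	t.Helper()

	mn := mocknet.New()
	mn.SetLinkDefaults(mocknet.LinkOptions{Latency: 100 * time.Millisecond})

	for i := 0; i < n; i++ {
		if _, err := mn.GenPeer(); err != nil {
			t.Fatal(err)
		}
	}
	if err := mn.LinkAll(); err != nil {
		t.Fatal(err)
	}
	if err := mn.ConnectAllButSelf(); err != nil {
		t.Fatal(err)
	}

	// DHT instances would then be constructed on top of mn.Hosts(); that part
	// depends on the branch under test and is omitted here.
	return mn
}
```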

@mishmosh (Contributor)

Cole will review on Wed 6/13.

@mishmosh mishmosh modified the milestones: Sprint 13 👻, Sprint 14 Jun 13, 2018
@dignifiedquire dignifiedquire removed this from the Sprint 14 milestone Jul 3, 2018
@dignifiedquire (Contributor)

@laser @acruikshank is this still blocked?

@laser (Contributor, Author) commented Aug 13, 2018

@dignifiedquire - Just spoke with Cole about this. He said:

was actually just talking about this. essentially no one picked up the ball on this one... it's quite close so i'll make the stylistic changes and we should be good

@phritz phritz added the P3 Low - not important right now label Sep 9, 2018
@laser (Contributor, Author) commented Oct 25, 2018

@mishmosh - Can you help me figure out how to get this thing merged? I could use the help.

@phritz (Contributor) commented Oct 25, 2018

@laser I'll get @mgoelzer on it

@mishmosh (Contributor) commented Oct 25, 2018

Almost-Halloween is a great time for PRs to come back from the dead 👻. Got 75% of the way through this recap before @mgoelzer was summoned, so here it is anyway. I'll watch for pings, but otherwise over to you @mgoelzer.

IIUC here is the state of things:

Current blockers for libp2p/go-libp2p-kad-dht#92:

  1. Failing tests
  2. bufferedDelimitedWriter disappearance (details here)

Does it make sense for @laser to investigate? Or someone from libp2p land?

@whyrusleeping (Member)

Let's try to get someone from libp2p-land to investigate, but it's not a super high priority, so not too much pressure on them.

@ghost commented Oct 26, 2018

Hi @phritz @mishmosh @laser, I did some digging into this and I think that @vyzo probably needs to be the one to rebase this. It changes code in places that have since changed in master, so it's going to be a complex rebase.

One question: is there a particular reason why this bubbled up again recently? We're definitely aware of sporadic instances of extremely slow DHT lookups (on the order of minutes), but our solution to that is to instrument the code and run it in a test harness to understand what's happening before making any changes. Right now, we don't have any data to suggest it's related to this particular lock. By the way, if you all have a reproducible case where the DHT is either slow or loses values, we would love to study that.

@laser (Contributor, Author) commented Oct 29, 2018

Hi @mgoelzer

One question: is there a particular reason why this bubbled up again recently?

Not a particularly good one. I noticed that the issue was still assigned to me, and had been for several weeks, despite my having no plan to move the work forward.

Thanks again for following up.

@laser laser assigned ghost and phritz and unassigned acruikshank and laser Oct 29, 2018
@ghost commented Jan 18, 2019

@vyzo What's the status on this?

@vyzo commented Jan 20, 2019

@anacrolix is taking the torch on merging; we will remove the timing test and fix the regression before merging.

@phritz phritz assigned ZenGround0 and unassigned phritz Mar 27, 2019
@phritz (Contributor) commented Mar 27, 2019

@ZenGround0 assigning this over to you as the person with the relationship on the gfc side.

@anacrolix (Contributor)

I'm actively working on this now. I'm hoping to get it wrapped up soon.

@raulk (Member) commented Apr 15, 2019

Update: @anacrolix has been working on an alternative, pool-based approach to achieve the same effect (enabling parallel requests to the same peer). Our tests show no worthwhile improvement below the 99th percentile under test loads. See the discussion in the linked issue for more details.

We should validate these results with more realistic loads (e.g. within an IPFS node or a DHT booster).
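(To make the trade-off concrete, here is a loose sketch of the pool idea, with names and structure invented here rather than taken from @anacrolix's PR: a bounded set of workers per peer, so several requests to that peer can be in flight at once instead of queueing behind a single sender.)

```go
// Loose illustrative sketch (not the PR's code): a per-peer pool of sender
// workers, so up to `size` requests to the same peer run in parallel.
package pool

import (
	"context"
	"sync"

	"github.com/libp2p/go-libp2p/core/peer"
)

type request struct {
	ctx  context.Context
	msg  []byte
	resp chan result
}

type result struct {
	msg []byte
	err error
}

type senderPool struct {
	mu    sync.Mutex
	peers map[peer.ID]chan request
	size  int // max requests in flight per peer
	// send performs one request/response exchange on the wire.
	send func(ctx context.Context, p peer.ID, msg []byte) ([]byte, error)
}

// Request dispatches msg to peer p through that peer's worker pool and waits
// for the response.
func (sp *senderPool) Request(ctx context.Context, p peer.ID, msg []byte) ([]byte, error) {
	resp := make(chan result, 1)
	sp.queueFor(p) <- request{ctx: ctx, msg: msg, resp: resp}
	select {
	case r := <-resp:
		return r.msg, r.err
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

func (sp *senderPool) queueFor(p peer.ID) chan request {
	sp.mu.Lock()
	defer sp.mu.Unlock()
	if ch, ok := sp.peers[p]; ok {
		return ch
	}
	ch := make(chan request)
	sp.peers[p] = ch
	// Start a fixed number of workers for this peer; each handles one
	// outstanding request at a time, so `size` bounds the parallelism.
	for i := 0; i < sp.size; i++ {
		go func() {
			for req := range ch {
				out, err := sp.send(req.ctx, p, req.msg)
				req.resp <- result{msg: out, err: err}
			}
		}()
	}
	return ch
}
```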

@ZenGround0 ZenGround0 removed their assignment May 28, 2019
@anorth anorth closed this as completed Mar 1, 2020