Skip to content
This repository has been archived by the owner on Mar 10, 2020. It is now read-only.

fix: revert to serialized pubsub operations #319

Merged
merged 3 commits into from
Jul 3, 2018
Merged

Conversation

alanshaw
Copy link
Contributor

@alanshaw alanshaw commented Jul 2, 2018

During the refactor (#290) I took the opportunity to parallelise some pubsub operations that didn't explicitly depend on each other. This worked perfectly while testing locally, but on CI it is a different story. I found that tests for js-ipfs-api (which are run against go-ipfs) failed for seemingly random reasons.

After much investigation I finally tried re-serialising the operations I had refactored to be parallel and the tests started to pass. It seems that the pubsub implementation in go-ipfs has some issues with concurrency.

I also found two intermittent issues with swarm.connect in go-ipfs (seen significantly more often on CI):

  1. Issuing two calls to this function from the same node might end up in the second not actually creating a connection and no error message reported to the user
  2. Even after the response to the user it takes a few milliseconds for a connection to actually be connected

I intend to open issues on go-ipfs and write examples demonstrating these problems.

I created a utility function connect to temporarily mitigate these issues in one place. The utility serialises calls from a single node to another and pauses after each to allow the connection to properly establish.

Merging this PR will bring us significantly closer to merging ipfs-inactive/js-ipfs-http-client#785

During the refactor I took the opportunity to parallelise some pubsub operations that didn't explicitly depend on each other. This worked perfectly while testing locally, but on CI it is a different story. I found that tests for js-ipfs-api (which are run against go-ipfs) failed for seemingly random reasons.

After much investigation I finally tried re-serialising the operations I had refactored to be parallel and the tests started to pass. It seems that the pubsub implementation in go-ipfs has some issues with concurrency.

I also found two intermittent issues with `swarm.connect` in go-ipfs (seen significantly more often on CI):

1. Issuing two calls to this function from the same node might end up in the second not actually creating a connection and no error message reported to the user
2. Even after the response to the user it takes a few milliseconds for a connection to actually be connected

I intend to open issues on go-ipfs and write examples demonstrating these problems.

I created a utility function `connect` to temporarily mitigate these issues in one place. The utility serialises calls from a single node to another and pauses after each to allow the connection to properly establish.

License: MIT
Signed-off-by: Alan Shaw <alan@tableflip.io>
License: MIT
Signed-off-by: Alan Shaw <alan@tableflip.io>
@daviddias
Copy link
Contributor

It seems that the pubsub implementation in go-ipfs has some issues with concurrency.

@richardschneider hit this many times in the past and got documented somewhere. Can you make that there is a issue on go-ipfs? //cc @Stebalien

@alanshaw
Copy link
Contributor Author

alanshaw commented Jul 2, 2018

Can you make that there is a issue on go-ipfs?

I will triple check but all signs seem to be indicating that this is the case. I've seen these tests pass on js-ipfs on CI with the concurrency i.e. without this PR.

Like I said in the PR description:

I intend to open issues on go-ipfs and write examples demonstrating these problems.

😉

@ghost ghost assigned daviddias Jul 3, 2018
@daviddias daviddias merged commit 4b5534e into master Jul 3, 2018
@daviddias daviddias deleted the fix/re-serialise branch July 3, 2018 15:02
@ghost ghost removed the in progress label Jul 3, 2018
@Stebalien
Copy link
Contributor

Which version of go-ipfs were you using?

I also found two intermittent issues with swarm.connect in go-ipfs (seen significantly more often on CI):

Issuing two calls to this function from the same node might end up in the second not actually creating a connection and no error message reported to the user

If you call ipfs swarm connect on the same node twice, we explicitly try to create only one connection. Was no connection created?

Even after the response to the user it takes a few milliseconds for a connection to actually be connected.

One side will generally learn about the connection before the other. Is that what's happening here?

@alanshaw
Copy link
Contributor Author

alanshaw commented Jul 9, 2018

If you call ipfs swarm connect on the same node twice, we explicitly try to create only one connection. Was no connection created?

Sorry I should have been more explicit - two calls to connect but with different addresses

One side will generally learn about the connection before the other. Is that what's happening here?

hmm that's an interesting one, does the response to connect come before both nodes have been connected together?

I'd have to double check the code but in think in general when we need node A to know about B we'll connect A to B rather than B to A before issuing commands.

@Stebalien
Copy link
Contributor

hmm that's an interesting one, does the response to connect come before both nodes have been connected together?

No, it's just that one node learns that the connection is ready before the other. We could delay until the other learns but that'd cost us a round trip. For example, take TCP. In a TCP connection, the client always learns that the connection is ready before the server and can start sending data immediately after sending it's ACK (but the server needs to wait for the ACK before sending data).

I'd have to double check the code but in think in general when we need node A to know about B we'll connect A to B rather than B to A before issuing commands.

That's the right way to do this.

Sorry I should have been more explicit - two calls to connect but with different addresses

For two different nodes? That's definitely not supposed to happen. I'm trying to reproduce but I can't (connecting and disconnecting from two nodes in a loop).

Is the failing connect call returning no response or just no error? That is, is it returning a success response (the peer ID of the target peer)?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants