JSON-RPC messages arrive out of order #2992

josepot · 2024-01-19T00:36:16Z

Is there an existing issue?

I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

This is not a support question.

Description of bug

Very often, when using the chainHead group of functions of new JSON-RPC API against a Polkadot-SDK node, there are "subscription" messages that come out of order. E.g:

What happens is that the message that announces the "opeartionId" that the client should be paying attention to, comes after the notifications for that "operationId". Since the client does not recognize those messages, then the messages get ignored.

I first experienced this against different public endpoints, and then I tried it against a local development node and I experienced the same issue. In order to further confirm that this was a problem with the implementation on the node, I tried running the same requests against smoldot while using a chainspec that had my local development node as the only bootnode, trying the same requests through smoldot (with the chainspec of a local development node) works flawlessly.

Steps to reproduce

I have setup the following nodejs repo to reproduce the issue. I recommend cloning the repo and trying it out in your local environment. However, you can also try it out using stackblitz.

In order to reproduce/debug what's happening:

clone the repo
run pnpm i
run pnpm start

Every time that the script detects a message out of order it will print these kinds of messages on the terminal:

The script also generates a file named wire-logs.txt, where you will find all the send/received messages that were seen over the WebSocket connection in the order that they happened:

If you want to try against a different endpoint, then just edit the following line.

I hope that this repo helps to investigate the issue, because it's taken me a while to set it up. 🤞

The text was updated successfully, but these errors were encountered:

Polkadot-Forum · 2024-01-19T08:44:06Z

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/papi-is-coming-showcasing-the-validator-selector-tool-dapp/5756/1

niklasad1 · 2024-01-19T08:50:54Z

All right, these seems to be caused by chainHead_unstable_storage and from the logs sometimes operationStorageEvent are emitted before chainHead_unstable_storage call is finished.

To give an example what is happening the following is extracted from the logs:

# tx storage call
>> {"jsonrpc":"2.0","id":675,"method":"chainHead_unstable_storage","params":["yImvwio6XDT9EiEF","0x885b921e752c2eb3627e14188c73b9285474da5f6c933024c7371320ee099772",[{"key":"0x2aeddc77fe58c98d50bd37f1b90840f9cd7f37317cd20b61e9bd46fab8704714b4a5da87e533f5dff58927b296a23cbf25794b7cbb8fe31c5e68f2bcd2c2483e9c9cd0712216f232","type":"value"}],null]}

# rx subscription event 1
<< {"jsonrpc":"2.0","method":"chainHead_unstable_followEvent","params":{"subscription":"yImvwio6XDT9EiEF","result":{"event":"operationStorageItems","operationId":"373","items":[{"key":"0x2aeddc77fe58c98d50bd37f1b90840f9cd7f37317cd20b61e9bd46fab8704714b4a5da87e533f5dff58927b296a23cbf25794b7cbb8fe31c5e68f2bcd2c2483e9c9cd0712216f232","value":"0x040100000002007db52a2f0000000000000000000000000f3125202d2047545354414b494e47001768747470733a2f2f67747374616b696e672e636f6d2f14406761757468387a3a6d61747269782e6f7267166761757468387a4067747374616b696e672e636f6d00000a40475374616b696e67"}]}}}

# rx subscription event 2
<< {"jsonrpc":"2.0","method":"chainHead_unstable_followEvent","params":{"subscription":"yImvwio6XDT9EiEF","result":{"event":"operationStorageDone","operationId":"373"}}}

# rx unstable call
<< {"jsonrpc":"2.0","result":{"result":"started","operationId":"373","discardedItems":0},"id":675}

Thus, if I understand this correctly we should really wait to emit events on the subscription with operationID 373 until the unstable call has been answered?!

Relevant code: https://github.com/paritytech/polkadot-sdk/blob/master/substrate/client/rpc-spec-v2/src/chain_head/chain_head.rs#L361-#L370

lexnv · 2024-01-19T09:50:00Z

Thanks Josep for raising this and providing the repro case 🙏

As Niklas pointed out, this is the culprit of the behavior:

polkadot-sdk/substrate/client/rpc-spec-v2/src/chain_head/chain_head.rs

Lines 361 to 371 in 9db9211

    
           	let fut = async move { 
        
           		storage_client.generate_events(block_guard, hash, items, child_trie).await; 
        
           	}; 
        
           	self.executor 
        
           		.spawn_blocking("substrate-rpc-subscription", Some("rpc"), fut.boxed()); 
        
           	Ok(MethodResponse::Started(MethodResponseStarted { 
        
           		operation_id, 
        
           		discarded_items: Some(discarded), 
        
           	})) 
        
           }

The race is happening because:

we are spawning a future that provides the responses of the chainHead_storage method with the associated operation ID on the chainHead_follow
then, we return the chainHead_storage response, which contains the operation ID (Ok(operationId))

It is possible that before the user receives the operationID of the storage query; the chainHead_follow already produced a response for that operationID.

To mitigate this:

We could implement an ACK send from the client side. The ACK is a response that tells the server that the operationID has been received. Without this mechanism, we could end-up with a race where the jsonrpsee thinks the message was propagated to the user; when the message (operationID) is only propagated to the kernel buffers of the server.
Specify in the spec, that because of the async nature of the client - server communication; the chainHead_follow could produce a response with an operationID that has not been provided to the user yet. In those cases, users are encouraged to catch the response until they can correlate it with a query.

I would opt for option 2, since we already have a precedent in the rpc-v2 spec for the stopOperation.

Would love to get your thoughts on this!

jsdw · 2024-01-19T10:19:56Z

My gut feeling is that 2 is the right approach; we probably can't guarantee that nothing will return on the subscription until after the user has received a response on the separate call. Even if we make sure that messages are sent in the right order, they may not be received in the right order.

So the right approach to working with this API is probably to start buffering messages from the subscription before making the call, and then once you get a response from the call with the operation ID, you can look for it in all of the subscription messages inc those that you buffered. (I checked what we do in Subxt here and that's essentially it).

josepot · 2024-01-19T10:59:33Z

Specify in the spec, that because of the async nature of the client - server communication; the chainHead_follow could produce a response with an operationID that has not been provided to the user yet.

This is IMO absolutely unacceptable.

Even if we make sure that messages are sent in the right order, they may not be received in the right order.

WebSocket connections work over TCP, so this statement is false. Messages sent in the right order are guaranteed to be received in the right order over a WebSocket connection.

The whole point of having a JSON-RPC API is to be able to guarantee things that can't be guaranteed over lower level protocols like UDP. One of those things is the order in which the messages arrive. It is unreasonable to force the client to keep around (for how long?) some notifications that the client doesn't know to which subscription they belong yet.

bkchr · 2024-01-19T11:26:58Z

Yeah, I'm here on line with @josepot.

From your problem description it reads like the problem is in the server implementation that doesn't wait for the chain_head_unstable_storage to return before pushing data on the stream from somewhere else.

josepot · 2024-01-19T11:27:15Z

So the right approach to working with this API is probably to start buffering messages from the subscription before making the call, and then once you get a response from the call with the operation ID, you can look for it in all of the subscription messages inc those that you buffered. (I checked what we do in Subxt here and that's essentially it).

The JSON-RPC spec defines how notifications work. It is IMO completely wrong to keep around all the non-identified notifications, because there is no guarantee that a "correlation id" will be make public at a later stage.

jsdw · 2024-01-19T11:41:04Z

I suppose that since all of this traffic goes down a single websocket connection now, my concern about a race in what message the client receives first doesn't make so much sense.

My concern in any case was that if you have (pseudo) code like:

// Perform some operation and get back ID
let operationId = await chainHeadCall(..);

// Then, get a handle to your follow subscription events and filter the ones you want
let operationEvents = chainHeadFollowSub.events().filterById(operationId);

Then it's possible for an event to arrive between getting the operation ID and starting to watch the follow subscription (because when you receive the operationId, it may already be in flight).

I suspect though that in JS this isn't an issue since that task wouldn't yield control to anything else from the moment the operation ID comes back until after it's monitoring the subscription though. With my head in Rust-land, where it's possible for events with that operation to be received on a separate thread between those two lines, I still need to care about potential races here I think.

In any case, let's see what we can do to enforce that the messages are sent down the WS connection in the right order!

jsdw · 2024-01-19T11:58:40Z

Also FWIW the spec doesn't say anything about which order one can expect things in, only saying about the response containing an operationId:

This return value indicates that the request has successfully started.

operationId is a string containing an opaque value representing the operation.

If we want to guarantee that this retuns before any corresponding events on the follow subscription, let's make it clear in the spec too.

josepot · 2024-01-19T12:32:58Z

My concern in any case was that if you have (pseudo) code like:
// Perform some operation and get back ID
let operationId = await chainHeadCall(..);

// Then, get a handle to your follow subscription events and filter the ones you want
let operationEvents = chainHeadFollowSub.events().filterById(operationId);
Then it's possible for an event to arrive between getting the operation ID and starting to watch the follow subscription (because when you receive the operationId, it may already be in flight).

This makes absolutely no sense... Not only because that pseudocode makes no sense (what even is the type of the variable operationEvents?). Anyways, the argument is absurd because of the idea that you may miss an event because it happened "in between".

In JS every piece of code that runs, runs as a consequence of a callback triggered by an I/O event-queue. If an event arrives "in between", then the event still gets added to the I/O event-queue. So, as long as there is a listener interested, then the event that came "in between" will trigger the callback of the listener. If your argument is that it's technically possible to attach the the listener "too late", well then: sure, it's possible to write buggy code. However, in JS you will have to try very hard to shoot yourself in the foot in order to attach the listener too late. I mean, sure, it's well stablished that developers can mess up and write buggy code, but the whole idea that in JS you may miss events because you add their listener while the event is coming is absolutely nonsensical, especially in this case where the listener is a "child" listener of an already attached event-listener.

Also FWIW the spec doesn't say anything about which order one can expect things in, only saying about the response containing an operationId:

This return value indicates that the request has successfully started.
operationId is a string containing an opaque value representing the operation.

If we want to guarantee that this retuns before any corresponding events on the follow subscription, let's make it clear in the spec too.

I mean, the spec does say things like:

When an operation needs to be performed, it is put in a FIFO queue

and

In order to provide a good quality of service, the time between the moment a request is received by the server and the moment the response is sent back must be short.

So, sure... Perhaps it could be even more explicit, but IMO it's already mentioning that the order must be preserved.

On top of that, the documentation of the event clearly states that:

operationId is a string returned by chainHead_unstable_storage.

Ergo, the operationId must have been returned by the time that the event happens.

lexnv · 2024-01-22T09:06:51Z

Is there anything stopping the user from having separate WebSocket connections?

one socket connection to chainHead_follow
a separate socket connection to chainHead_call

In this case, we can only guarantee the chainHead_call response arrived in the kernel buffer of the server; and not to the application layer of the user. That is, without introducing an ACK method chainHead_userHasReceivedThisOperationID.

@josepot @bkchr would love to hear your thoughts on this 🙏

josepot · 2024-01-22T09:09:30Z

Is there anything stopping the user from having separate WebSocket connections?

Different WebSocket connections implies different connections, which implies different subscriptions, which means: how is this even relevant to this discussion?

one socket connection to chainHead_follow

a separate socket connection to chainHead_call

This makes absolutely no sense. You can not mix the chainHead_follow subscription of one connection with the subscription of the other connection. They are different subscriptions. Even if the server returned the same ID, that ID is only relevant within the context of the connection that you are in.

niklasad1 · 2024-01-22T09:20:48Z

This makes absolutely no sense. You can not mix the chainHead_follow subscription of one connection with the subscription of the other connection. They are different subscriptions. Even if the server returned the same ID, that ID is only relevant within the context of the connection that you are in.

It's technically possible to create a subscription on different connection than the chainHead_call but that's really an edge-case and nothing I personally care about. Totally possible to fetch the subscription ID from a separate connection

Let's get back to main point, for a single connection we should enforce the order of the messages period.
No need to discuss it further here, we are on the same page.

We would need a more low-level API in jsonrpsee because the response is sent about when the RPC method returns and we emit these "events at same time" and race-conditions can occur.

lexnv · 2024-01-22T09:24:13Z

The chainHead_follow gives you a subscription ID at the application layer from one socket.
The subscription ID is the identifier of the subscription, not the originated socket connection.

Users then can make a chainHead_call from a different socket connection using the above subscription ID identifier.

This makes absolutely no sense... Even if the server returned the same ID, that ID is only relevant within the context of the connection that you are in.

There is nothing in the server implementation or the spec; that would stop or restrict this behavior.

You can not mix the chainHead_follow subscription of one connection with the subscription of the other connection.

To make the call to the chainHead_call (or any other method); you'd need two things:

the follow_subscription subscription ID that identifies the chainHead_follow subscription that receives events
the pinned block hash reported by chainHead_follow

The server doesn't look at the socket connection or the IP of the user etc:

polkadot-sdk/substrate/client/rpc-spec-v2/src/chain_head/chain_head.rs

Line 382 in e2caa81

    
           let mut block_guard = match self.subscriptions.lock_block(&follow_subscription, hash, 1) {

Any change to the spec should take into account this information. Or restrict / clarify user connections.

And for the time being, we should be looking at fixing the "single connection" use-case since we have more control there 👍

josepot · 2024-01-22T09:24:17Z

It's technically possible to create a subscription on different connection than the chainHead_call but that's really an edge-case and nothing I personally care about.

Is it possible to create different subscriptions on different connections? Sure!

Can a user use the subscriptionId returned in the context of one connection within the context of another connection?

Obviously, the answer is: no, it can not, because WebSocket connections -unlike REST APIs- are stateful. So, this is a moot point.

jsdw · 2024-02-05T11:35:06Z

Just to echo Niklas's reply on paritytech/json-rpc-interface-spec#125:

The following PR in jsonrpsee should be merged soon, which will give jsonrpsee server RPC method impls the ability to wait until some message is queued for delivery from one RPC method before performing some action:

paritytech/jsonrpsee#1281

When that is merged, @niklasad1 will cherry pick that into a small jsonrpsee release and open a PR here to address this issue! We are aiming to have this PR ready to merge this week.

lexnv · 2024-02-05T11:41:06Z

Kudos to @niklasad1 for adding a nice API to handle the single connection race in: paritytech/jsonrpsee#1281.

From the substrate perspective, I have created this issue to allow chainHead operations to be made only from a single connection:

[RPC-Spec-V2] Allow chainHead operations to be made from a single connection #3207

At the time being, we can use the subscriptionId of one connection in another connection to make calls (see chainhead-cli tool for more info)

Close #2992 Breaking changes: - rpc server grafana metric `substrate_rpc_requests_started` is removed (not possible to implement anymore) - rpc server grafana metric `substrate_rpc_requests_finished` is removed (not possible to implement anymore) - rpc server ws ping/pong not ACK:ed within 30 seconds more than three times then the connection will be closed Added - rpc server grafana metric `substrate_rpc_sessions_time` is added to get the duration for each websocket session

…tech#3230) Close paritytech#2992 Breaking changes: - rpc server grafana metric `substrate_rpc_requests_started` is removed (not possible to implement anymore) - rpc server grafana metric `substrate_rpc_requests_finished` is removed (not possible to implement anymore) - rpc server ws ping/pong not ACK:ed within 30 seconds more than three times then the connection will be closed Added - rpc server grafana metric `substrate_rpc_sessions_time` is added to get the duration for each websocket session

Polkadot-Forum · 2024-04-16T17:19:47Z

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/polkadot-api-2024-q1-update/7462/1

josepot added I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known. labels Jan 19, 2024

josepot assigned jsdw, niklasad1 and lexnv Jan 19, 2024

jsdw mentioned this issue Jan 19, 2024

chainHead_call (et al): clarify that response from this comes back before any events on the follow subscription. paritytech/json-rpc-interface-spec#125

Closed

niklasad1 mentioned this issue Feb 6, 2024

rpc: bump jsonrpsee v0.22 and fix race in rpc v2 chain_head #3230

Merged

niklasad1 closed this as completed in #3230 Feb 14, 2024

josepot mentioned this issue Feb 21, 2024

orphan subscriptions temporary work around polkadot-api/polkadot-api#281

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JSON-RPC messages arrive out of order #2992

JSON-RPC messages arrive out of order #2992

josepot commented Jan 19, 2024 •

edited

Loading

Polkadot-Forum commented Jan 19, 2024

niklasad1 commented Jan 19, 2024 •

edited

Loading

lexnv commented Jan 19, 2024

jsdw commented Jan 19, 2024

josepot commented Jan 19, 2024 •

edited

Loading

bkchr commented Jan 19, 2024

josepot commented Jan 19, 2024 •

edited

Loading

jsdw commented Jan 19, 2024 •

edited

Loading

jsdw commented Jan 19, 2024

josepot commented Jan 19, 2024 •

edited

Loading

lexnv commented Jan 22, 2024

josepot commented Jan 22, 2024 •

edited

Loading

niklasad1 commented Jan 22, 2024 •

edited

Loading

lexnv commented Jan 22, 2024

josepot commented Jan 22, 2024 •

edited

Loading

jsdw commented Feb 5, 2024

lexnv commented Feb 5, 2024

Polkadot-Forum commented Apr 16, 2024

JSON-RPC messages arrive out of order #2992

JSON-RPC messages arrive out of order #2992

Comments

josepot commented Jan 19, 2024 • edited Loading

Is there an existing issue?

Experiencing problems? Have you tried our Stack Exchange first?

Description of bug

Steps to reproduce

Polkadot-Forum commented Jan 19, 2024

niklasad1 commented Jan 19, 2024 • edited Loading

lexnv commented Jan 19, 2024

jsdw commented Jan 19, 2024

josepot commented Jan 19, 2024 • edited Loading

bkchr commented Jan 19, 2024

josepot commented Jan 19, 2024 • edited Loading

jsdw commented Jan 19, 2024 • edited Loading

jsdw commented Jan 19, 2024

josepot commented Jan 19, 2024 • edited Loading

lexnv commented Jan 22, 2024

josepot commented Jan 22, 2024 • edited Loading

niklasad1 commented Jan 22, 2024 • edited Loading

lexnv commented Jan 22, 2024

josepot commented Jan 22, 2024 • edited Loading

jsdw commented Feb 5, 2024

lexnv commented Feb 5, 2024

Polkadot-Forum commented Apr 16, 2024

josepot commented Jan 19, 2024 •

edited

Loading

niklasad1 commented Jan 19, 2024 •

edited

Loading

josepot commented Jan 19, 2024 •

edited

Loading

josepot commented Jan 19, 2024 •

edited

Loading

jsdw commented Jan 19, 2024 •

edited

Loading

josepot commented Jan 19, 2024 •

edited

Loading

josepot commented Jan 22, 2024 •

edited

Loading

niklasad1 commented Jan 22, 2024 •

edited

Loading

josepot commented Jan 22, 2024 •

edited

Loading