
http requests fail with timeout on large block (json.hpp issue?) #677

Closed
greg7mdp opened this issue Jan 31, 2023 · 27 comments · Fixed by #696

@greg7mdp
Contributor

$ curl -X POST -d '{"block_num_or_id": 291953778}' http://localhost:8888/v1/chain/get_block
{"code":500,"message":"Internal Service Error","error":{"code":2,"name":"timeout_exception","what":"deadline 2023-01-31T22:14:07.564 exceeded by 914656us ","details":[{"message":"deadline 2023-01-31T22:14:07.564 exceeded by 914656us ","file":"json.hpp","line_number":63,"method":"operator()"}]}}#

Also seen in the eosq block explorer.

The message mentions json.hpp:63, which is inside a to_string() function returning a string. Maybe we have an issue where creating the json output takes excessively long for large trees?
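For what it's worth, the client-side timing can be captured with curl's standard -w option (same request as above; port and block number as in this report):

$ curl -s -o /dev/null -w 'total: %{time_total}s\n' -X POST -d '{"block_num_or_id": 291953778}' http://localhost:8888/v1/chain/get_block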

@matthewdarwin

This works fine in nodeos 3.1, so this is a regression.

@matthewdarwin

matthewdarwin commented Feb 1, 2023

http-max-response-time-ms = 2500 works in 3.2

@matthewdarwin

Seems to be a deliberate but poorly documented change in #96

@greg7mdp
Contributor Author

greg7mdp commented Feb 1, 2023

Seems to be a deliberate but poorly documented change in #96

Probably. Although the PR #96 doesn't list get_block as one of the affected APIs?

@matthewdarwin

I don't know why we would want to have limits on get_block. If the block is stored and requested we want to send it.

heifner added the discussion label and removed the triage, actionable, and bug labels Feb 1, 2023
@greg7mdp
Contributor Author

greg7mdp commented Feb 1, 2023

My (limited) understanding is that if the node is used for block production, we may not want it failing to do so because its main thread is busy serving long http requests. If the node is used mainly for serving data requests, yes probably these limits should be relaxed.

@matthewdarwin

matthewdarwin commented Feb 1, 2023

Block production nodes should never be serving general http requests anyway.

@heifner
Member

heifner commented Feb 1, 2023

I don't know why we would want to have limits on get_block. If the block is stored and requested we want to send it.

So just set a very high limit on the API node. Or do you want to avoid long-running get_table calls?

@bhazzard

bhazzard commented Feb 3, 2023

In an ENF backlog refinement session, we discussed this issue at length, but didn’t come to a clear consensus on the best path forward.

We did all agree that the current behavior is undesirable and that more comprehensive thought is required around API architecture and where serialization should be handled. For now, we will simply communicate a workaround.

The workaround:
Ensure that max-http-response-time is configured to be greater than max-abi-serialization-time to prevent timeouts on GetBlock. To keep a lower maximum allowed http response time for endpoints other than GetBlock, run multiple nodeos instances with the required configurations, and use a proxy to direct to them as desired.

A more complete statement with context:
API endpoints often require ABI serialization. For large blocks requested with the GetBlock endpoint, this serialization can take quite a long time. To minimize the chance of a request that requires long-running serialization failing because it exceeds the maximum allowed http response time, our suggested workaround is to ensure that max-http-response-time is configured to be greater than max-abi-serialization-time.

If a node operator would like to keep a lower maximum allowed http response time for endpoints that don't have this problem, one option is to use a proxy to direct GetBlock requests to a dedicated nodeos instance with a higher max-http-response-time, and all other requests to a nodeos instance with your preferred max-http-response-time.
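As a sketch of that setup (values and ports are illustrative only; the option names are the nodeos settings referred to in this thread, i.e. http-max-response-time-ms and abi-serializer-max-time-ms):

# config.ini for the dedicated get_block instance
# (http-max-response-time-ms kept above the serializer limit)
http-server-address = 127.0.0.1:8889
abi-serializer-max-time-ms = 2000
http-max-response-time-ms = 5000

# config.ini for the general-purpose API instance
http-server-address = 127.0.0.1:8888
http-max-response-time-ms = 500

# A reverse proxy (nginx, haproxy, ...) then routes /v1/chain/get_block to :8889
# and every other request to :8888.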

@heifner
Member

heifner commented Feb 6, 2023

Time for the abi serializer to_variant vs the conversion to json for EOS mainnet block 291953778 on my i9-12900K

to_variant calls: 1, total: 884855us, avg: 884855us
to json    calls: 1, total:  48551us, avg:  48551us

@greg7mdp
Contributor Author

greg7mdp commented Feb 6, 2023

Well, I guess it shows that the issue, if there is one, is in to_variant!

@heifner
Member

heifner commented Feb 6, 2023

Well, I guess it shows that the issue, if there is one, is in to_variant!

Yeah, that is as expected. Although that to json is rather terrible as well.

@heifner
Member

heifner commented Feb 7, 2023

For reference the same block without exploding it via abi_serializer (what a /v1/chain/get_raw_block would be):

info  2023-02-07T00:40:21.876 nodeos    stopwatch.hpp:41              report               ] get_block calls: 1, total: 1460us, avg: 1460us
info  2023-02-07T00:40:21.878 http-0    stopwatch.hpp:41              report               ] convert to variant calls: 1, total: 1692us, avg: 1692us
info  2023-02-07T00:40:21.883 http-1    stopwatch.hpp:41              report               ] to json calls: 1, total: 4546us, avg: 4546us
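(A request to such an endpoint would look just like the get_block call above, e.g. the following; illustrative only, since the endpoint did not exist in nodeos at the time of this measurement:)

$ curl -X POST -d '{"block_num_or_id": 291953778}' http://localhost:8888/v1/chain/get_raw_block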

@heifner
Member

heifner commented Feb 7, 2023

Move most of the abi_serialization off the main thread:

info  2023-02-07T01:12:34.801 nodeos    stopwatch.hpp:41              report               ] get_block calls: 1, total: 1817us, avg: 1817us
info  2023-02-07T01:12:34.802 nodeos    stopwatch.hpp:41              report               ] gather abis calls: 1, total: 742us, avg: 742us
info  2023-02-07T01:12:34.987 http-0    stopwatch.hpp:41              report               ] abi_serializer calls: 1, total: 184769us, avg: 184769us
info  2023-02-07T01:12:35.036 http-1    stopwatch.hpp:41              report               ] to json calls: 1, total: 46043us, avg: 46043us
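For readers following the thread, a minimal sketch of the general pattern being described, not the actual PR code: the main thread only fetches the block and gathers the ABIs, and the expensive serialization plus JSON conversion is posted to the http plugin's thread pool. All type and function names below are hypothetical placeholders.

#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>
#include <cstdint>
#include <functional>
#include <iostream>
#include <memory>
#include <string>

// Hypothetical stand-ins for the real nodeos types; only the shape of the flow matters here.
struct signed_block { uint32_t block_num = 0; };
struct abi_cache {};   // per-request account -> abi_serializer map gathered up front

std::shared_ptr<signed_block> fetch_block(uint32_t n) { return std::make_shared<signed_block>(signed_block{n}); }
std::shared_ptr<abi_cache> gather_abis(const signed_block&) { return std::make_shared<abi_cache>(); }  // ~0.7ms in the numbers above
std::string explode_to_json(const signed_block& b, const abi_cache&) {                                 // ~185ms + ~46ms in the numbers above
   return "{\"block_num\":" + std::to_string(b.block_num) + "}";
}

// Runs on the main thread; only the cheap lookups stay here, the heavy serialization is
// posted to the http thread pool and the response is sent from there.
void handle_get_block(uint32_t block_num, boost::asio::thread_pool& http_pool,
                      std::function<void(std::string)> send_response) {
   auto block = fetch_block(block_num);
   auto abis  = gather_abis(*block);
   boost::asio::post(http_pool, [block, abis, send = std::move(send_response)]() {
      send(explode_to_json(*block, *abis));
   });
}

int main() {
   boost::asio::thread_pool http_pool(2);
   handle_get_block(291953778, http_pool, [](std::string json) { std::cout << json << "\n"; });
   http_pool.join();   // wait for the posted work to finish
}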

@greg7mdp
Contributor Author

greg7mdp commented Feb 7, 2023

Wow, did that speed up the abi serialization a lot? Above you reported that to_variant took 884855us.

@heifner
Member

heifner commented Feb 7, 2023

Wow, did that speed up the abi serialization a lot? Above you reported that to_variant took 884855us.

Yes. This uses an abi cache so that an abi is only pulled out of chainbase and an abi_serializer created once per account. In the existing implementation that is done over and over again each time the account is referenced in the block.

@greg7mdp
Contributor Author

greg7mdp commented Feb 7, 2023

OK, that makes more sense, looks like a great speedup, over 4x faster. So the speedup is because of the cache, not because it is done on a different thread.

I'll have to check out the cache you implemented. Is it an LRU cache? And does it need to support concurrent access from multiple threads?

@heifner
Member

heifner commented Feb 7, 2023

OK, that makes more sense, looks like a great speedup, over 4x faster. So the speedup is because of the cache, not because it is done on a different thread.

I believe so, yes.

I'll have to check out the cache you implemented. Is it an LRU cache? And does it need to support concurrent access from multiple threads?

Nothing that complicated is used. The cache only lives for the life of the get_block call. Since the gather abis step which creates the cache takes less than 1ms in this example, I think that is fine as a final solution.
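A minimal sketch of that kind of per-call cache (hypothetical names; the real implementation builds it from the ABIs stored in chainbase and discards it once the get_block response is done):

#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins for eosio::chain::account_name and abi_serializer.
using account_name = std::string;
struct abi_serializer {};   // expensive to construct from an account's ABI

// Built once per get_block call ("gather abis" / "create cache" in the timings above),
// then consulted during to_variant instead of rebuilding a serializer every time the
// same account shows up in the block.
class per_call_abi_cache {
public:
   void add(const account_name& account, std::shared_ptr<abi_serializer> ser) {
      cache_[account] = std::move(ser);
   }
   std::shared_ptr<abi_serializer> find(const account_name& account) const {
      auto it = cache_.find(account);
      return it == cache_.end() ? nullptr : it->second;   // nullptr: no ABI gathered for this account
   }
private:
   std::map<account_name, std::shared_ptr<abi_serializer>> cache_;
};

Because the map is thrown away with the request, there is no invalidation problem when an account later changes its ABI, and presumably no locking is needed even though the serialization itself runs on an http thread: the cache is built on the main thread and only read afterwards.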

heifner added a commit that referenced this issue Feb 7, 2023
… issue with invalid abi should not throw exception.
heifner added several more commits that referenced this issue Feb 7–8, 2023
@heifner
Member

heifner commented Feb 8, 2023

Should we consider this closed by #696?

@greg7mdp
Contributor Author

greg7mdp commented Feb 8, 2023

Should we consider this closed by #696?

Yes, I think so, but maybe @matthewdarwin thinks otherwise? With your cache the request is significantly faster, so that should lessen the problem.
Also, I was wondering if you checked whether avoiding the extra string copies did speed up the abi_serializer part at all?

@heifner
Member

heifner commented Feb 8, 2023

Also, I was wondering if you checked whether avoiding the extra string copies did speed up the abi_serializer part at all?

Yes, on my machine that cut it in half for the test block.

@greg7mdp
Contributor Author

greg7mdp commented Feb 8, 2023

Wow, that's great. So for that block, the total request processing time is about 8x faster than before your changes?

@heifner
Member

heifner commented Feb 8, 2023

The abi_serializer creation was only a small part of that.

info  2023-02-07T01:12:34.802 nodeos    stopwatch.hpp:41              report               ] gather abis calls: 1, total: 742us, avg: 742us

Now that is ~350us on my machine.

@greg7mdp
Contributor Author

greg7mdp commented Feb 8, 2023

To get a good idea of the progress made in the PR, I think it would be very helpful if you provided the updated times for the following two measurements:

to_variant calls: 1, total: 884855us, avg: 884855us
to json    calls: 1, total:  48551us, avg:  48551us

@greg7mdp
Contributor Author

greg7mdp commented Feb 8, 2023

The abi_serializer creation was only a small part of that.

This must mean that the cache saves us a lot of duplicate abi_serializer creations. It also means that having a persistent cache, as we discussed, probably does not provide significant benefits, since creating the cache on demand, even for this big block, costs little (~350us).

@heifner
Member

heifner commented Feb 8, 2023

Current PR (34baed9) perf:

info  2023-02-08T16:33:11.344 nodeos    stopwatch.hpp:50              report               ] get_block calls: 1, total: 1351us, avg: 1351us
info  2023-02-08T16:33:11.345 nodeos    stopwatch.hpp:50              report               ] create cache calls: 1, total: 340us, avg: 340us
info  2023-02-08T16:33:11.533 http-1    stopwatch.hpp:50              report               ] to variant calls: 1, total: 188137us, avg: 188137us
info  2023-02-08T16:33:11.577 http-0    stopwatch.hpp:50              report               ] to json calls: 1, total: 40403us, avg: 40403us

The earlier to_variant measurement would have included both create cache and to variant.

Nothing in the PR should make any difference in to json, so any time diff there is not real.

The exciting thing about this change is how much less time is spent on the main thread.
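Putting those numbers next to the Feb 6 baseline gives a rough sense of the gain (figures from this thread; the 340us create cache is negligible either way):

to variant: 884855us -> 188137us   (~4.7x)
to json:     48551us ->  40403us   (~1.2x, noise per the note above)
total:      933406us -> 228540us   (~4.1x)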

heifner self-assigned this Feb 8, 2023
heifner moved this from Todo to In Progress in Team Backlog Feb 8, 2023
heifner moved this from In Progress to Awaiting Review in Team Backlog Feb 8, 2023
@greg7mdp
Contributor Author

greg7mdp commented Feb 8, 2023

Great, so we are 5x faster on to_variant, 4x faster overall (including to_json). And almost all off the main thread. That is a great PR!
