Conseil API: sudden "request timeout" #941
Comments
What is the CPU utilization of the Tezos node when you experience these Conseil issues? What about CPU utilization overall? |
Hello @vishakh ,
When I restart conseil-api it gets busier - but only for a minute or two. |
OK, I did some more testing. What I find really interesting is that when Conseil API freezes, tezos-node (running on the same machine) freezes too. Do you know why this could happen? |
Maybe the Postgres container is being hit hard by the Conseil queries. This reduces the available IO for the Tezos node. Can you provide us with the queries you are running so we can profile them? |
Hi, here are the calls:
|
@g574 In both cases the query is optimized since it hits a database index. Now I wonder whether your Postgres container or server is overwhelmed. Could you show us the CPU and memory utilization of the Postgres process when the problem queries are running? |
@vishakh sorry, I found a bug in my systems, the tezos-node is not affected, only Conseil API. Is the Postgres CPU/memory utilization still relevant? |
Can you clarify what you mean by finding a bug?
|
The way I ran queries against tezos-node was wrong so I believed it stopped responding at the same time as Conseil. |
But you are still having issues with Conseil, yes? If yes, please check the
Postgres performance when you are having Conseil problems.
|
mpstat:
03:57:54 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
03:57:54 PM all 1.44 0.00 0.69 0.51 0.00 0.03 0.10 0.00 0.00 97.24
ps -eo %cpu,%mem,cmd | grep postgres:
0.0 0.6 /usr/bin/postgres -D ...
0.0 0.0 postgres: logger process
0.0 22.8 postgres: checkpointer process
0.0 0.5 postgres: writer process
0.0 0.2 postgres: wal writer process
0.0 0.0 postgres: autovacuum launcher process
0.0 0.0 postgres: stats collector process
0.0 0.0 postgres: bgworker: logical replication launcher
2.4 3.8 postgres: ... ... ...(42560) idle
1.4 3.7 postgres: ... ... ...(42566) idle
0.0 0.6 postgres: ... ... ...(42568) idle
0.0 0.0 postgres: ... ... ...(42570) idle
0.0 0.0 postgres: ... ... ...(42588) idle
0.0 0.0 postgres: ... ... ...(42592) idle
0.0 0.0 postgres: ... ... ...(42596) idle
I don't find it too heavy. Can it be related to the high number of "idle" processes? |
Very puzzling. So none of the processes seems to be under load while you experience Conseil issues? The host overall is not under any special load?
BTW, would using the Conseil node on https://nautilus.cloud/ work for you?
|
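To put a number on the "idle" question, the ps listing can be tallied by backend state. The following is only a minimal sketch: the helper name `count_backend_states` is hypothetical, and it assumes the `ps -eo %cpu,%mem,cmd` layout shown above, where each client backend line ends with its state. An idle backend is a connection waiting for a client command, so a handful of them usually points to a connection pool held open by the client rather than to load.

```python
from collections import Counter


def count_backend_states(ps_output: str) -> Counter:
    """Count PostgreSQL client backends by the state shown at the end of each line.

    Assumes lines shaped like '%cpu %mem postgres: ... <state>'.
    Auxiliary processes (logger, checkpointer, launchers, ...) are skipped.
    """
    states = Counter()
    for line in ps_output.splitlines():
        parts = line.split()
        # Third column must be the "postgres:" process title prefix.
        if len(parts) < 4 or parts[2] != "postgres:":
            continue
        # Background processes end in "process" or mention a launcher.
        if parts[-1] == "process" or "launcher" in parts:
            continue
        tail = " ".join(parts[-3:])
        state = "idle in transaction" if tail == "idle in transaction" else parts[-1]
        states[state] += 1
    return states
```

A result dominated by `idle` with near-zero CPU, as in the listing above, would suggest the open connections themselves are not the bottleneck; many `idle in transaction` entries would be more suspicious.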
Hi @vishakh ,
sorry for the delay. I found that the tezos-node causes extreme disk utilization and I/O. Among other problems, it may also be the cause of these blackouts.
Is it possible to do anything about it on the Conseil end? Does lowering the number of PostgreSQL threads make any sense?
Thanks for the question but unfortunately nautilus.cloud is not an option for me right now. |
It sounds like you are just maxing out your IO on the host. The simplest solution is to scale up your host. Is that an option for you?
|
Hi @vishakh ,
My current config looks like this:
Does it look correct? |
hi @vishakh , I suppose there is some misconfiguration. Do you see any red flag in my config? |
@ivanopagano @piotrkosecki Could you please review @g574's configuration above? |
I think you could try changing this. Other lines look fine to me. |
@g574 Please try what @piotrkosecki mentioned and let us know if you still have issues. |
Thank you. I have updated the configuration. I will observe my Conseil and let you know the result. |
Hello @piotrkosecki , @vishakh
It would be OK to have a timeout now and then but I wonder why it does not recover without a restart. |
I have switched to delphinet and the problem persists. I have not experienced it on mainnet yet. |
A new release with enhanced logging is almost ready.
|
Hi, the problem persists on delphinet - currently on 2021-january-release-35:
If I try |
Can you describe the hardware you are running on? What are the load averages when this issue happens? |
As advised in #932, I built Conseil from master (commit-hash: 9f4eb4c). Since this change I see configuration-related warnings like:
And also sudden "blackouts" where the API stops answering, i.e. no response, no 503, nothing to a GET /v2/data/tezos/carthagenet/blocks/head call. Logs around the blackout:
It has happened several times since the update and the pattern is always the same:
(I have a process that checks the blocks/head periodically.)
I get the metadata config warnings when I restart Conseil and sometimes before the blackout, too - but not always. I am not sure if the two problems are related.
If I restart the API it works again for some time.
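For anyone trying to reproduce this, a periodic check of blocks/head could look roughly like the sketch below. It is not the script used in this report. The function names, base URL and API key are placeholders, and sending the key in an `apiKey` header is an assumption about the deployment; adjust it if yours differs. The explicit timeout matters here because a blackout produces no response at all, so without one the checker itself would hang.

```python
import time
import urllib.error
import urllib.request


def check_head(base_url, api_key, timeout_s=10.0, opener=urllib.request.urlopen):
    """Perform one GET of blocks/head and return (ok, detail)."""
    url = base_url.rstrip("/") + "/v2/data/tezos/carthagenet/blocks/head"
    # Assumption: the API key is passed in an "apiKey" header.
    req = urllib.request.Request(url, headers={"apiKey": api_key})
    started = time.monotonic()
    try:
        with opener(req, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # URLError and socket timeouts are OSError subclasses: no usable answer.
        elapsed = time.monotonic() - started
        return False, f"no answer after {elapsed:.1f}s: {exc!r}"
    return status == 200, f"HTTP {status}"


def watch(base_url, api_key, interval_s=30.0):
    """Log one timestamped result per interval until interrupted."""
    while True:
        ok, detail = check_head(base_url, api_key)
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        print(stamp, "OK" if ok else "FAIL", detail, flush=True)
        time.sleep(interval_s)
```

Logging the timestamp of the first FAIL line makes it easier to line up a blackout with the Conseil and Postgres logs from the same moment.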