outbound: constant reconnect to unreachable servers #3100
Comments
A way to mitigate the issue is to set Not sure, what else we can do here since upstream module lacks abort feature. |
First, great job debugging the problem. I think the bits missing from outbound is an |
@msimerson the |
It seems this is creating quite a few 100% CPU on many of our servers. @msimerson Do you have any suggestions for a workaround/fix ? For the moment, I can submit a PR to set |
Yes, anything between 10 and 60 seconds seems eminently reasonable to me. |
…nreachable servers The generic-pool module is built on the assumption that acquire always succeeds. It is implemented as a busy loop of calling create() non-stop and there is no way to make acquire() return an error. As a mitigation, we make acquire() fail after 10 seconds. Note that it will still busy loop for 10 seconds. We have to fix the upstream module or replace generic-pool to really fix the problem. see haraka#3100
…nreachable servers (#3104) The generic-pool module is built on the assumption that acquire always succeeds. It is implemented as a busy loop of calling create() non-stop and there is no way to make acquire() return an error. As a mitigation, we make acquire() fail after 10 seconds. Note that it will still busy loop for 10 seconds. We have to fix the upstream module or replace generic-pool to really fix the problem. see #3100
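To illustrate the idea behind the mitigation, here is a minimal standalone sketch of bounding a wait that might otherwise never settle. It does not use generic-pool; `withTimeout` is a hypothetical helper name, and the 10-second figure in the usage note is simply the value mentioned in the commit message.

```javascript
'use strict';

// Hypothetical wrapper (not part of Haraka or generic-pool):
// rejects if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, message = 'acquire timed out') {
    let timer;
    const deadline = new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error(message)), ms);
    });
    // Whichever settles first wins; the timer is always cleaned up.
    return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

A caller could write something like `withTimeout(pool.acquire(), 10000)`, where `pool` is assumed to expose a promise-returning `acquire()`. A wrapper like this only bounds how long the caller waits; it does not by itself stop any retrying that happens underneath.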
In looking into this further, I'm sorely tempted to just jettison all the pool code. It has been a persistent source of headaches and difficult to track down bugs for ages. And the documentation for it stinks. While debugging this and a couple other issues related to |
I would agree with the change. From my (casual) reading of the code when I was debugging this, that module offers a lot of options and features that we don't quite need. And also the usage in Haraka is quite different from what the module is designed for. |
I agree with this (with the caveat that it sounds like a lot of work).
Thankfully I can say I didn't write the original :)
…On Wed, Nov 30, 2022 at 4:12 AM Girish Ramakrishnan < ***@***.***> wrote:
I would agree with the change. From my (casual) reading of the code when I
was debugging this, that module offers a lot of options and features that
we don't quite need. And also the usage in Haraka is quite different from
what the module is designed for. generic-pool wants to pool database
connections. A database is expected to be always available. In Haraka, we
try to pool connections to servers which may or may not be there. The
pooling logic has to be aware of this, which it currently isn't. We can
either propose a fix upstream, but IMO it's easier to just write our own
simple pool (with an API similar to upstream's, to help migration initially).
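As a rough sketch of what such a simple pool might look like (an assumption for illustration, not the replacement that was actually written), the class below keeps idle resources and lets a failed `create()` reject `acquire()` instead of retrying. `SimplePool`, its option names, and the choice to reject rather than queue when the pool is full are all hypothetical.

```javascript
'use strict';

// Hypothetical minimal pool: reuses idle resources and creates new ones
// on demand. A failed create() rejects acquire() instead of being retried.
class SimplePool {
    constructor({ create, destroy, max = 10 }) {
        this.create = create;   // async factory, may reject
        this.destroy = destroy; // async cleanup for one resource
        this.max = max;         // upper bound on live resources
        this.idle = [];         // released resources awaiting reuse
        this.size = 0;          // resources created and not yet destroyed
    }

    async acquire() {
        if (this.idle.length > 0) return this.idle.pop();
        if (this.size >= this.max) throw new Error('pool exhausted');
        this.size++;
        try {
            return await this.create();
        } catch (err) {
            this.size--; // the slot was never filled
            throw err;   // surface the failure to the caller
        }
    }

    // Return a healthy resource to the pool for reuse.
    release(resource) {
        this.idle.push(resource);
    }

    // Drop a resource that is currently held (not idle), e.g. a broken socket.
    async remove(resource) {
        this.size--;
        await this.destroy(resource);
    }
}
```

With this shape, a server that refuses connections produces one rejected `acquire()` per delivery attempt, and the caller decides whether and when to try again.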
Describe the bug
Haraka tries non-stop to connect to unreachable servers or servers that refuse a connection.
Expected behavior
Haraka should not try to connect non-stop.
Additional context
This issue is the same as #2694 and possibly #2507.
I have debugged this problem and found that the core problem is this:
Outbound uses the generic-pool module to manage socket connections per IP. When a server is unreachable (i.e., the MX points to an IP address that has no mail server) or the server refuses the connection (perhaps due to IP-based greylisting), socket creation fails at https://github.com/haraka/Haraka/blob/master/outbound/client_pool.js#L50 . This results in the create factory method returning an error to the generic-pool module.
The issue is that the upstream generic-pool module has no code to make acquire return an error. So, https://github.com/haraka/Haraka/blob/master/outbound/client_pool.js#L110 never returns! The relevant upstream issues are ".acquire() doesn't reject when resource creation fails" (coopernurse/node-pool#175) and "CPU Usage to 100%" (coopernurse/node-pool#197). The upstream code for acquire is (in essence) a while-true loop calling create non-stop until it succeeds. There is no backoff logic or anything. This causes Haraka to connect non-stop to unreachable servers.
The issue is easy to reproduce: just enable debug logs and send mail to a random IP.
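For contrast with the unbounded retry loop described above, here is a minimal sketch of a bounded retry with exponential backoff that eventually reports failure. It is not taken from Haraka or generic-pool; `backoffDelay`, `createWithBackoff`, and the default values are hypothetical names and numbers chosen for illustration.

```javascript
'use strict';

// Hypothetical helper: delay before retry number `attempt` (0-based),
// doubling from baseMs and capped at maxMs.
function backoffDelay(attempt, baseMs, maxMs) {
    return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls create() up to maxAttempts times, waiting between failures.
// Unlike a bare retry loop, it rejects once the attempts are used up.
async function createWithBackoff(create, { maxAttempts = 4, baseMs = 100, maxMs = 5000 } = {}) {
    let lastError;
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
            return await create();
        } catch (err) {
            lastError = err;
            // No point sleeping after the final failed attempt.
            if (attempt < maxAttempts - 1) {
                await sleep(backoffDelay(attempt, baseMs, maxMs));
            }
        }
    }
    throw lastError;
}
```

The two properties that matter here are the pause between attempts, which avoids pinning the CPU, and the final `throw`, which lets the caller see that the server is unreachable.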