outbound: constant reconnect to unreachable servers #3100

Closed
gramakri opened this issue Sep 26, 2022 · 8 comments

Comments

@gramakri
Collaborator

gramakri commented Sep 26, 2022

Describe the bug
Haraka tries non-stop to connect to unreachable servers or servers that refuse a connection.

Expected behavior
Haraka should not try to connect non-stop.

Additional context

This issue is the same as #2694 and possibly #2507.

I have debugged this and found the core problem: the generic-pool module is built on the assumption that acquire() always succeeds. It busy-loops calling create() non-stop, and there is no way to make acquire() return an error.

The issue is easy to reproduce: just enable debug logs and send mail to a random IP.

@gramakri
Collaborator Author

A way to mitigate the issue is to set acquireTimeoutMillis. This makes acquire() fail after acquireTimeoutMillis milliseconds (the default is to try forever). Note that the pool will still busy-loop for at least acquireTimeoutMillis :-(

Not sure what else we can do here, since the upstream module lacks an abort feature.
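
For reference, a minimal sketch of what that mitigation looks like with generic-pool v3. The host/port and pool options here are illustrative, not Haraka's actual outbound settings:

```js
const net = require('net')
const genericPool = require('generic-pool')

// Factory in the shape generic-pool expects: create() returns a promise
// for a resource, destroy() disposes of it.
const factory = {
  create () {
    return new Promise((resolve, reject) => {
      const socket = net.connect({ host: '203.0.113.1', port: 25 }) // unreachable test IP
      socket.once('connect', () => resolve(socket))
      socket.once('error', reject)
    })
  },
  destroy (socket) {
    socket.destroy()
    return Promise.resolve()
  },
}

const pool = genericPool.createPool(factory, {
  max: 10,
  acquireTimeoutMillis: 10 * 1000, // reject acquire() after 10s instead of waiting forever
})

pool.acquire()
  .then((socket) => { /* use the socket, then pool.release(socket) */ })
  .catch((err) => console.error('acquire failed:', err.message))
```

Note the timeout only bounds how long the caller waits; internally the pool keeps retrying create() until it fires.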

@msimerson
Member

First, great job debugging the problem. I think the bit missing from outbound is an .on('factoryCreateError', ...) handler. It's referenced in both of the upstream issues you cited, and we have one in smtp_client, but I didn't see it in a quick glance at the outbound code.
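
For context, the workaround posted in those upstream issues looks roughly like this, given a pool created as in the earlier snippet. Note that _waitingClientsQueue is a private internal of generic-pool, not public API:

```js
pool.on('factoryCreateError', (err) => {
  // reject the oldest waiting acquire() so callers don't hang forever
  const clientResourceRequest = pool._waitingClientsQueue.dequeue()
  if (clientResourceRequest) clientResourceRequest.reject(err)
})
```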

@gramakri
Collaborator Author

gramakri commented Sep 26, 2022

@msimerson the factoryCreateError workaround posted in the upstream issues just dequeues the last item. That won't work if we have multiple items (sockets) being created in parallel, no? There is a race: the code does not ensure that the work item being dequeued is the one that actually errored.

@gramakri
Collaborator Author

It seems this is causing 100% CPU usage on quite a few of our servers. @msimerson do you have any suggestions for a workaround/fix?

For the moment, I can submit a PR that sets acquireTimeoutMillis to a hardcoded 10 seconds. Does that sound like a reasonable timeout? (I don't want to make this configurable, since it's just a workaround until we have a proper fix.)

@msimerson
Member

Yes, anything between 10 and 60 seconds seems eminently reasonable to me.

gramakri added a commit to cloudron-io/Haraka that referenced this issue Nov 9, 2022
…nreachable servers

The generic-pool module is built on the assumption that acquire
always succeeds. It is implemented as a busy loop of calling
create() non-stop and there is no way to make acquire() return
an error.

As a mitigation, we make acquire() fail after 10 seconds. Note
that it will still busy-loop for those 10 seconds. We have to fix
the upstream module or replace generic-pool to really fix the problem.

see haraka#3100
gramakri added a commit to cloudron-io/Haraka that referenced this issue Nov 10, 2022
…nreachable servers
msimerson pushed a commit that referenced this issue Nov 14, 2022
…nreachable servers (#3104)
@msimerson
Member

msimerson commented Nov 30, 2022

In looking into this further, I'm sorely tempted to just jettison all the pool code. It has been a persistent source of headaches and difficult-to-track-down bugs for ages, and the documentation for it stinks. While debugging this and a couple of other generic-pool related issues, I found a place where it was throwing errors because a function in the module had been removed; the removal wasn't mentioned in the upgrading guide.

@gramakri
Collaborator Author

I agree with that change. From my (casual) reading of the code while debugging this, the module offers a lot of options and features we don't quite need, and Haraka's usage is quite different from what the module is designed for. generic-pool is meant to pool database connections, and a database is expected to be always available. In Haraka, we pool connections to servers that may or may not be there; the pooling logic has to be aware of that, and currently it isn't. We could propose a fix upstream, but IMO it's easier to write our own simple pool (with an API similar to upstream's, to ease the initial migration). A sketch of what I have in mind is below.
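
To make that concrete, here is a hypothetical sketch of the kind of minimal, error-aware pool being discussed, with an acquire/release surface loosely modeled on generic-pool's. None of these names come from Haraka:

```js
// Minimal connection pool: one create() attempt per acquire(), so errors
// from unreachable servers propagate to the caller instead of looping.
class SimplePool {
  constructor (factory, { max = 10 } = {}) {
    this.factory = factory // { create(): Promise<resource>, destroy(resource) }
    this.max = max
    this.idle = []
    this.size = 0
  }

  async acquire () {
    if (this.idle.length) return this.idle.pop()
    if (this.size >= this.max) throw new Error('pool exhausted')
    this.size++
    try {
      return await this.factory.create()
    } catch (err) {
      this.size-- // give the slot back; do not retry here
      throw err
    }
  }

  release (resource) {
    this.idle.push(resource)
  }

  destroy (resource) {
    this.size--
    this.factory.destroy(resource)
  }
}
```

The key design difference from generic-pool: acquire() surfaces create() failures directly to the caller rather than retrying until a resource materializes.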

@baudehlo
Collaborator

baudehlo commented Nov 30, 2022 via email
