outbound: constant reconnect to unreachable servers #3100

Closed
gramakri opened this issue Sep 26, 2022 · 8 comments

Comments

@gramakri
Collaborator

gramakri commented Sep 26, 2022

Describe the bug
Haraka tries non-stop to connect to unreachable servers or servers that refuse a connection.

Expected behavior
Haraka should not try to connect non-stop.

Additional context

This issue is the same as #2694 and possibly #2507.

I have debugged this and found the core problem: the generic-pool module is built on the assumption that acquire() always succeeds. It busy-loops calling create() non-stop, and there is no way to make acquire() return an error.

The issue is easy to reproduce: just enable debug logs and send mail to a random IP.

@gramakri
Collaborator Author

A way to mitigate the issue is to set acquireTimeoutMillis. This makes acquire() fail after acquireTimeoutMillis milliseconds (the default is to try forever). Note that the pool will still busy-loop for at least acquireTimeoutMillis :-(

Not sure what else we can do here, since the upstream module lacks an abort feature.
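
For reference, a minimal sketch of what that mitigation looks like with generic-pool v3. The host/port and pool options here are illustrative, not Haraka's actual outbound settings:

```js
const net = require('net')
const genericPool = require('generic-pool')

// Factory in the shape generic-pool expects: create() returns a promise
// for a resource, destroy() disposes of it.
const factory = {
  create () {
    return new Promise((resolve, reject) => {
      const socket = net.connect({ host: '203.0.113.1', port: 25 }) // unreachable test IP
      socket.once('connect', () => resolve(socket))
      socket.once('error', reject)
    })
  },
  destroy (socket) {
    socket.destroy()
    return Promise.resolve()
  },
}

const pool = genericPool.createPool(factory, {
  max: 10,
  acquireTimeoutMillis: 10 * 1000, // reject acquire() after 10s instead of waiting forever
})

pool.acquire()
  .then((socket) => { /* use the socket, then pool.release(socket) */ })
  .catch((err) => console.error('acquire failed:', err.message))
```

Note the timeout only bounds how long the caller waits; internally the pool keeps retrying create() until it fires.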

@msimerson
Member

First, great job debugging the problem. I think the bit missing from outbound is an .on('factoryCreateError', ...) handler. It's referenced in both of the upstream issues you cited, and we have one in smtp_client, but I didn't see it in a quick glance at the outbound code.
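
For context, the workaround posted in those upstream issues looks roughly like this, given a pool created as in the earlier snippet. Note that _waitingClientsQueue is a private internal of generic-pool, not public API:

```js
pool.on('factoryCreateError', (err) => {
  // reject the oldest waiting acquire() so callers don't hang forever
  const clientResourceRequest = pool._waitingClientsQueue.dequeue()
  if (clientResourceRequest) clientResourceRequest.reject(err)
})
```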

@gramakri
Collaborator Author

gramakri commented Sep 26, 2022

@msimerson the factoryCreateError workaround posted in the upstream issues just dequeues the last item. That won't work if we have multiple items (sockets) being created in parallel, no? There is a race: the code does not ensure that the work item being dequeued is the one that actually errored.

@gramakri
Collaborator Author

It seems this is causing 100% CPU usage on quite a few of our servers. @msimerson do you have any suggestions for a workaround/fix?

For the moment, I can submit a PR that sets acquireTimeoutMillis to a hardcoded 10 seconds. Does that sound like a reasonable timeout? (I don't want to make this configurable, since it's just a workaround until we have a proper fix.)

@msimerson
Member

Yes, anything between 10 and 60 seconds seems eminently reasonable to me.

gramakri added a commit to cloudron-io/Haraka that referenced this issue Nov 9, 2022
…nreachable servers

The generic-pool module is built on the assumption that acquire
always succeeds. It is implemented as a busy loop of calling
create() non-stop and there is no way to make acquire() return
an error.

As a mitigation, we make acquire() fail after 10 seconds. Note
that it will still busy-loop for those 10 seconds. We have to fix
the upstream module or replace generic-pool to really fix the problem.

see haraka#3100
gramakri added a commit to cloudron-io/Haraka that referenced this issue Nov 10, 2022
…nreachable servers
msimerson pushed a commit that referenced this issue Nov 14, 2022
…nreachable servers (#3104)
@msimerson
Member

msimerson commented Nov 30, 2022

In looking into this further, I'm sorely tempted to just jettison all the pool code. It has been a persistent source of headaches and difficult-to-track-down bugs for ages, and the documentation for it stinks. While debugging this and a couple of other generic-pool related issues, I found a place where it was throwing errors because a function in the module had been removed; the removal wasn't mentioned in the upgrading guide.

@gramakri
Collaborator Author

I agree with that change. From my (casual) reading of the code while debugging this, the module offers a lot of options and features we don't quite need, and Haraka's usage is quite different from what the module is designed for. generic-pool is meant to pool database connections, and a database is expected to be always available. In Haraka, we pool connections to servers that may or may not be there; the pooling logic has to be aware of that, and currently it isn't. We could propose a fix upstream, but IMO it's easier to write our own simple pool (with an API similar to upstream's, to ease the initial migration). A sketch of what I have in mind is below.
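
To make that concrete, here is a hypothetical sketch of the kind of minimal, error-aware pool being discussed, with an acquire/release surface loosely modeled on generic-pool's. None of these names come from Haraka:

```js
// Minimal connection pool: one create() attempt per acquire(), so errors
// from unreachable servers propagate to the caller instead of looping.
class SimplePool {
  constructor (factory, { max = 10 } = {}) {
    this.factory = factory // { create(): Promise<resource>, destroy(resource) }
    this.max = max
    this.idle = []
    this.size = 0
  }

  async acquire () {
    if (this.idle.length) return this.idle.pop()
    if (this.size >= this.max) throw new Error('pool exhausted')
    this.size++
    try {
      return await this.factory.create()
    } catch (err) {
      this.size-- // give the slot back; do not retry here
      throw err
    }
  }

  release (resource) {
    this.idle.push(resource)
  }

  destroy (resource) {
    this.size--
    this.factory.destroy(resource)
  }
}
```

The key design difference from generic-pool: acquire() surfaces create() failures directly to the caller rather than retrying until a resource materializes.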

@baudehlo
Collaborator

baudehlo commented Nov 30, 2022 via email
