Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connect(): don't try next DNS record if operation is canceled #9711

Merged
merged 1 commit into from
Mar 1, 2023

Conversation

Al2Klimov
Copy link
Member

Instead return immediately to meet the caller's expectations.

ref/NC/777425
ref/IP/44784

closes #9708
closes #9710

Before

[2023-02-28 10:19:06 +0100] information/ApiListener: Reconnecting to endpoint 'al2klimov.de' via host 'al2klimov.de' and port '5665'
[2023-02-28 10:19:06 +0100] notice/ApiListener: Current zone master: alexandersmbp2.int.netways.de
[2023-02-28 10:19:06 +0100] information/NotificationComponent: 'notification' started.
[2023-02-28 10:19:06 +0100] notice/ApiListener: Connected endpoints:
[2023-02-28 10:19:06 +0100] debug/ConfigItem: Activating object 'checker' of type 'CheckerComponent' with priority 300
[2023-02-28 10:19:06 +0100] information/CheckerComponent: 'checker' started.
[2023-02-28 10:19:06 +0100] information/ConfigItem: Activated all objects.
[2023-02-28 10:19:06 +0100] notice/ApiListener: Updating object authority for objects at endpoint 'alexandersmbp2.int.netways.de'.
[2023-02-28 10:19:06 +0100] debug/IcingaApplication: In IcingaApplication::Main()
[2023-02-28 10:19:11 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:19:15 +0100] information/WorkQueue: #5 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2023-02-28 10:19:15 +0100] information/WorkQueue: #4 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2023-02-28 10:19:16 +0100] notice/ApiListener: Updating object authority for objects at endpoint 'alexandersmbp2.int.netways.de'.
[2023-02-28 10:19:16 +0100] debug/ApiListener: Not connecting to Endpoint 'al2klimov.de' because we're already trying to connect to it.
[2023-02-28 10:19:16 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:19:16 +0100] debug/ApiListener: Not connecting to Endpoint 'alexandersmbp2.int.netways.de' because that's us.
[2023-02-28 10:19:16 +0100] notice/ApiListener: Current zone master: alexandersmbp2.int.netways.de
[2023-02-28 10:19:16 +0100] notice/ApiListener: Connected endpoints:
[2023-02-28 10:19:21 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:19:21 +0100] critical/ApiListener: Timeout while reconnecting to endpoint 'al2klimov.de' via host 'al2klimov.de' and port '5665', cancelling attempt
[2023-02-28 10:19:26 +0100] notice/ApiListener: Updating object authority for objects at endpoint 'alexandersmbp2.int.netways.de'.
[2023-02-28 10:19:26 +0100] debug/ApiListener: Not connecting to Endpoint 'al2klimov.de' because we're already trying to connect to it.
[2023-02-28 10:19:26 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:19:26 +0100] debug/ApiListener: Not connecting to Endpoint 'alexandersmbp2.int.netways.de' because that's us.
[2023-02-28 10:19:26 +0100] notice/ApiListener: Current zone master: alexandersmbp2.int.netways.de
[2023-02-28 10:19:26 +0100] notice/ApiListener: Connected endpoints:
[2023-02-28 10:19:31 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:19:36 +0100] notice/ApiListener: Updating object authority for objects at endpoint 'alexandersmbp2.int.netways.de'.
[2023-02-28 10:19:36 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:19:36 +0100] debug/ApiListener: Not connecting to Endpoint 'al2klimov.de' because we're already trying to connect to it.
[2023-02-28 10:19:36 +0100] debug/ApiListener: Not connecting to Endpoint 'alexandersmbp2.int.netways.de' because that's us.
[2023-02-28 10:19:36 +0100] notice/ApiListener: Current zone master: alexandersmbp2.int.netways.de
[2023-02-28 10:19:36 +0100] notice/ApiListener: Connected endpoints:
[2023-02-28 10:19:41 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:19:46 +0100] notice/ApiListener: Updating object authority for objects at endpoint 'alexandersmbp2.int.netways.de'.
[2023-02-28 10:19:46 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:19:46 +0100] debug/ApiListener: Not connecting to Endpoint 'al2klimov.de' because we're already trying to connect to it.
[2023-02-28 10:19:46 +0100] debug/ApiListener: Not connecting to Endpoint 'alexandersmbp2.int.netways.de' because that's us.
[2023-02-28 10:19:46 +0100] notice/ApiListener: Current zone master: alexandersmbp2.int.netways.de
[2023-02-28 10:19:46 +0100] notice/ApiListener: Connected endpoints:
[2023-02-28 10:19:51 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:19:56 +0100] notice/ApiListener: Updating object authority for objects at endpoint 'alexandersmbp2.int.netways.de'.
[2023-02-28 10:19:56 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:19:56 +0100] debug/ApiListener: Not connecting to Endpoint 'al2klimov.de' because we're already trying to connect to it.
[2023-02-28 10:19:56 +0100] debug/ApiListener: Not connecting to Endpoint 'alexandersmbp2.int.netways.de' because that's us.
[2023-02-28 10:19:56 +0100] notice/ApiListener: Current zone master: alexandersmbp2.int.netways.de
[2023-02-28 10:19:56 +0100] notice/ApiListener: Connected endpoints:
[2023-02-28 10:20:01 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:20:06 +0100] notice/ApiListener: Updating object authority for objects at endpoint 'alexandersmbp2.int.netways.de'.
[2023-02-28 10:20:06 +0100] debug/ApiListener: Not connecting to Endpoint 'al2klimov.de' because we're already trying to connect to it.
[2023-02-28 10:20:06 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:20:06 +0100] debug/ApiListener: Not connecting to Endpoint 'alexandersmbp2.int.netways.de' because that's us.
[2023-02-28 10:20:06 +0100] notice/ApiListener: Current zone master: alexandersmbp2.int.netways.de
[2023-02-28 10:20:06 +0100] notice/ApiListener: Connected endpoints:
[2023-02-28 10:20:11 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:20:16 +0100] notice/ApiListener: Updating object authority for objects at endpoint 'alexandersmbp2.int.netways.de'.
[2023-02-28 10:20:16 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:20:16 +0100] debug/ApiListener: Not connecting to Endpoint 'al2klimov.de' because we're already trying to connect to it.
[2023-02-28 10:20:16 +0100] debug/ApiListener: Not connecting to Endpoint 'alexandersmbp2.int.netways.de' because that's us.
[2023-02-28 10:20:16 +0100] notice/ApiListener: Current zone master: alexandersmbp2.int.netways.de
[2023-02-28 10:20:16 +0100] notice/ApiListener: Connected endpoints:
[2023-02-28 10:20:21 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:20:26 +0100] notice/ApiListener: Updating object authority for objects at endpoint 'alexandersmbp2.int.netways.de'.
[2023-02-28 10:20:26 +0100] debug/ApiListener: Not connecting to Endpoint 'al2klimov.de' because we're already trying to connect to it.
[2023-02-28 10:20:26 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:20:26 +0100] debug/ApiListener: Not connecting to Endpoint 'alexandersmbp2.int.netways.de' because that's us.
[2023-02-28 10:20:26 +0100] notice/ApiListener: Current zone master: alexandersmbp2.int.netways.de
[2023-02-28 10:20:26 +0100] notice/ApiListener: Connected endpoints:
[2023-02-28 10:20:31 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:20:36 +0100] critical/ApiListener: Cannot connect to host 'al2klimov.de' on port '5665': Operation timed out

What's the matter

--- lib/base/tcpsocket.hpp
+++ lib/base/tcpsocket.hpp
@@ -79,7 +79,9 @@ void Connect(Socket& socket, const String& node, const String& service, boost::a
                        socket.async_connect(current->endpoint(), yc);

                        break;
-               } catch (const std::exception&) {
+               } catch (const std::exception& ex) {
+                       Log(LogCritical, "LOLCAT") << current->endpoint().address().to_string() << " -- " << ex.what();
+
                        if (++current == result.end()) {
                                throw;
                        }
[2023-02-28 10:25:06 +0100] information/ApiListener: Reconnecting to endpoint 'al2klimov.de' via host 'al2klimov.de' and port '5665'
[2023-02-28 10:25:06 +0100] information/NotificationComponent: 'notification' started.
[2023-02-28 10:25:06 +0100] notice/ApiListener: Current zone master: alexandersmbp2.int.netways.de
[2023-02-28 10:25:06 +0100] debug/ConfigItem: Activating object 'checker' of type 'CheckerComponent' with priority 300
[2023-02-28 10:25:06 +0100] notice/ApiListener: Connected endpoints:
[2023-02-28 10:25:06 +0100] information/CheckerComponent: 'checker' started.
[2023-02-28 10:25:06 +0100] information/ConfigItem: Activated all objects.
[2023-02-28 10:25:06 +0100] notice/ApiListener: Updating object authority for objects at endpoint 'alexandersmbp2.int.netways.de'.
[2023-02-28 10:25:06 +0100] debug/IcingaApplication: In IcingaApplication::Main()
[2023-02-28 10:25:11 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:25:16 +0100] information/WorkQueue: #4 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2023-02-28 10:25:16 +0100] information/WorkQueue: #5 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2023-02-28 10:25:16 +0100] notice/ApiListener: Updating object authority for objects at endpoint 'alexandersmbp2.int.netways.de'.
[2023-02-28 10:25:16 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:25:16 +0100] debug/ApiListener: Not connecting to Endpoint 'al2klimov.de' because we're already trying to connect to it.
[2023-02-28 10:25:16 +0100] debug/ApiListener: Not connecting to Endpoint 'alexandersmbp2.int.netways.de' because that's us.
[2023-02-28 10:25:16 +0100] notice/ApiListener: Current zone master: alexandersmbp2.int.netways.de
[2023-02-28 10:25:16 +0100] notice/ApiListener: Connected endpoints:
[2023-02-28 10:25:21 +0100] critical/ApiListener: Timeout while reconnecting to endpoint 'al2klimov.de' via host 'al2klimov.de' and port '5665', cancelling attempt
[2023-02-28 10:25:21 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:25:21 +0100] critical/LOLCAT: 2a01:4f8:c2c:16ce::1 -- Operation canceled

After

[2023-02-28 10:45:39 +0100] information/ApiListener: Reconnecting to endpoint 'al2klimov.de' via host 'al2klimov.de' and port '5665'
[2023-02-28 10:45:39 +0100] notice/ApiListener: Current zone master: alexandersmbp2.int.netways.de
[2023-02-28 10:45:39 +0100] information/NotificationComponent: 'notification' started.
[2023-02-28 10:45:39 +0100] notice/ApiListener: Connected endpoints:
[2023-02-28 10:45:39 +0100] debug/ConfigItem: Activating object 'checker' of type 'CheckerComponent' with priority 300
[2023-02-28 10:45:39 +0100] information/CheckerComponent: 'checker' started.
[2023-02-28 10:45:39 +0100] information/ConfigItem: Activated all objects.
[2023-02-28 10:45:39 +0100] notice/ApiListener: Updating object authority for objects at endpoint 'alexandersmbp2.int.netways.de'.
[2023-02-28 10:45:39 +0100] debug/IcingaApplication: In IcingaApplication::Main()
[2023-02-28 10:45:44 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:45:47 +0100] information/WorkQueue: #4 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2023-02-28 10:45:47 +0100] information/WorkQueue: #5 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2023-02-28 10:45:49 +0100] notice/ApiListener: Updating object authority for objects at endpoint 'alexandersmbp2.int.netways.de'.
[2023-02-28 10:45:49 +0100] debug/ApiListener: Not connecting to Endpoint 'al2klimov.de' because we're already trying to connect to it.
[2023-02-28 10:45:49 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:45:49 +0100] debug/ApiListener: Not connecting to Endpoint 'alexandersmbp2.int.netways.de' because that's us.
[2023-02-28 10:45:49 +0100] notice/ApiListener: Current zone master: alexandersmbp2.int.netways.de
[2023-02-28 10:45:49 +0100] notice/ApiListener: Connected endpoints:
[2023-02-28 10:45:54 +0100] critical/ApiListener: Timeout while reconnecting to endpoint 'al2klimov.de' via host 'al2klimov.de' and port '5665', cancelling attempt
[2023-02-28 10:45:54 +0100] critical/ApiListener: Cannot connect to host 'al2klimov.de' on port '5665': Operation canceled
[2023-02-28 10:45:54 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:45:59 +0100] notice/ApiListener: Updating object authority for objects at endpoint 'alexandersmbp2.int.netways.de'.
[2023-02-28 10:45:59 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
[2023-02-28 10:45:59 +0100] debug/ApiListener: Not connecting to Endpoint 'alexandersmbp2.int.netways.de' because that's us.
[2023-02-28 10:45:59 +0100] information/ApiListener: Reconnecting to endpoint 'al2klimov.de' via host 'al2klimov.de' and port '5665'

Instead return immediately to meet the caller's expectations.
@cla-bot cla-bot bot added the cla/signed label Feb 28, 2023
@icinga-probot icinga-probot bot added area/distributed Distributed monitoring (master, satellites, clients) bug Something isn't working ref/IP ref/NC labels Feb 28, 2023
@Al2Klimov
Copy link
Member Author

Al2Klimov commented Feb 28, 2023

TODO (a failure cancels following points)

@julianbrost
Copy link
Contributor

The current state looks fine for me for checking if this actually addresses the problem in that specific installation, so go for packages.

For a final review, I'd like to know what would happen if timeout based cancellation and a socket error happen at roughly the same time, i.e. whether there could be a second attempt without a timeout in that scenario.

By the way, this PR is a perfect candidate that could have a nice and fully self-contained PR description. It could just describe a sequence of events that leads to an undesirable behavior and you wouldn't have to descend into any references or the diff to understand what is changed and conclude that it's a good idea to change this.

@Al2Klimov
Copy link
Member Author

For a final review, I'd like to know what would happen if timeout based cancellation and a socket error happen at roughly the same time, i.e. whether there could be a second attempt without a timeout in that scenario.

This PR IMAO makes something better and nothing worse. Of course we can make this stuff even more better... one nice day.

@julianbrost julianbrost added this to the 2.14.0 milestone Mar 1, 2023
@julianbrost julianbrost added the consider backporting Should be considered for inclusion in a bugfix release label Mar 1, 2023
@julianbrost julianbrost merged commit cf51705 into master Mar 1, 2023
@icinga-probot icinga-probot bot deleted the connect-cancel branch March 1, 2023 14:50
@Lprod
Copy link

Lprod commented Mar 8, 2023

Hi,

after installing the update in our environment, the error unfortunately occurred again.

According to "rmp -qi", 2.13.5+1.g1c162f83d is definitely installed.

Unfortunately, the logs on the affected master still look the same as before the update.
A "cancelling attempt" is reported and no further connection attempt can be detected afterwards.

@julianbrost
Copy link
Contributor

Hi, looks like this a different issue from the one you're having then. We will probably need a build that has extra log messages in the relevant places so that we can figure out what's happening. Well come back to you with these next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/distributed Distributed monitoring (master, satellites, clients) bug Something isn't working cla/signed consider backporting Should be considered for inclusion in a bugfix release ref/IP ref/NC
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants