Improved connection errors #937

Open
fpagliughi opened this issue Aug 15, 2020 · 8 comments

@fpagliughi
Contributor

A problem I've been hearing about - and have hit myself a few times - is trying to figure out why a secure connection was refused to a remote broker. There are two distinct errors that are both reported by the library as the same thing (basically, "connection refused"):

  1. The client and broker are unable to create the secure SSL/TLS connection (bad certificates, etc)
  2. The underlying connection is established, but then the broker doesn't like some parameter in the CONNECT packet and immediately drops the connection.

The second one is common with a number of web services that aren't fully compliant with the protocol (AWS, Azure, etc). But when hit, most people assume it's a problem authenticating the secure connection and waste time there trying to figure out the wrong problem.

The only way I've been able to distinguish the two is by looking through the logs. But it would be great if there were separate errors back from the library for these things.

@keysight-daryl
Contributor

+1 - I've spent a ton of time diagnosing ambiguous connection issues

@icraggs
Contributor

icraggs commented Aug 24, 2020

You can get the TLS error messages by setting the ssl_error_cb function pointer in the SSL options structure. If you don't get any error messages from that, then the TLS negotiation has succeeded.
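
A minimal sketch of such a callback, assuming the `ssl_error_cb` member takes `(const char *str, size_t len, void *u)` and that `ssl_error_context` is passed back as `u`. The `tls_error_log` type and `on_tls_error` name are made up for illustration; the wiring into the SSL options is shown only as a comment because it needs the Paho headers.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical context: collects TLS error text so the caller can
 * check after a failed connect whether the TLS layer reported anything. */
typedef struct {
    char   text[512];
    size_t used;
    int    count;
} tls_error_log;

/* Callback with the signature assumed for ssl_error_cb:
 * (error text, its length, user context). */
int on_tls_error(const char *str, size_t len, void *u)
{
    tls_error_log *log = (tls_error_log *)u;

    fprintf(stderr, "TLS error: %.*s", (int)len, str);
    if (log != NULL) {
        size_t room = sizeof(log->text) - 1 - log->used;
        size_t n = len < room ? len : room;  /* truncate if the buffer is full */

        memcpy(log->text + log->used, str, n);
        log->used += n;
        log->text[log->used] = '\0';
        log->count++;
    }
    /* Assumption: a positive return means "keep reporting", as with
     * OpenSSL's ERR_print_errors_cb. */
    return 1;
}

/* Wiring (not compiled here; assumes MQTTClient.h and a conn_opts
 * variable of type MQTTClient_connectOptions):
 *
 *   tls_error_log log = {0};
 *   MQTTClient_SSLOptions ssl_opts = MQTTClient_SSLOptions_initializer;
 *   ssl_opts.ssl_error_cb = on_tls_error;
 *   ssl_opts.ssl_error_context = &log;
 *   conn_opts.ssl = &ssl_opts;
 */
```

If `log.count` is still zero after a failed connect, that would suggest the TLS negotiation itself succeeded and the failure happened elsewhere.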

I agree it could be a good idea to differentiate between TCP, TLS (and probably WebSocket) connection failures in the error code information. I think we were thinking that the TLS error callback would cover that.

On whether services return error codes in the CONNACK: as the writer of a service, I might decide I'd rather not give out exact information about the error, in case I'm aiding a malicious hacking attempt.

icraggs added this to the 1.3.8 milestone Oct 21, 2020

@fpagliughi
Contributor Author

I received this issue from a user of one of my MQTT apps:

The error messages from [the MQTT app] for common failure scenarios are quite vague. A more precise error message or error code would be helpful in diagnosing issues in the field.

On bad credentials

Unable to connect to MQTT broker: [-1] TCP connect completion failure

On invalid url

Unable to connect to MQTT broker: [-1] TCP connect completion failure

On Network disconnect

Unable to connect to MQTT broker: General failure

On DNS resolution errors

Unable to connect to MQTT broker: General failure

I'm not sure of the best way to proceed (Lots more error return codes? A thread-local type of Paho errno? etc). But I do agree that if we can provide some better details all around, it would be really helpful.

@icraggs
Contributor

icraggs commented Dec 3, 2020

One thing to check on the bad credentials error is what the behaviour of the broker is. If it just chops the TCP connection, then you're not going to get any more information. A broker MIGHT return an appropriate return code in the CONNACK, but it's not obliged to; it's within its rights to terminate the TCP connection. That applies to other CONNACK return codes too.

@fpagliughi
Contributor Author

Ah. Yeah. Can't wait until we can all move to v5! :-)

@icraggs
Contributor

icraggs commented Dec 3, 2020

That doesn't necessarily change with V5. It can be considered an exposure of information to say that the userid and password are wrong for instance, aiding hacking attempts.

There is already a message field in the failureData structure which provides some description. If there were a connack return code returned from the broker, then this message field should already be filled out with "CONNACK return code" so I suspect the broker is not sending back the connack.

The message field could be used to include more accurate information, about TLS errors for instance. The protocol trace does include all the needed info, so it's a matter of making sure it's included.
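
As a rough sketch of how an application might surface that message field today: the `failure_info` struct below is a stand-in that mirrors only the `code` and `message` fields (in the async client the real type would be `MQTTAsync_failureData`), and `describe_failure` is a hypothetical helper, not part of the library.

```c
#include <stdio.h>

/* Stand-in for the failure data handed to an onFailure callback.
 * Only the two fields discussed here are mirrored. */
typedef struct {
    int         code;
    const char *message;  /* may be NULL when no description is available */
} failure_info;

/* Hypothetical helper: format whatever detail is available into out.
 * Returns the value of snprintf (length that was or would be written). */
int describe_failure(const failure_info *f, char *out, size_t out_len)
{
    if (f == NULL)
        return snprintf(out, out_len, "connect failed: no failure data");
    if (f->message != NULL)
        return snprintf(out, out_len, "connect failed: [%d] %s",
                        f->code, f->message);
    /* No message: the broker may have closed the connection
     * without sending a CONNACK at all. */
    return snprintf(out, out_len,
                    "connect failed: [%d] (no message from library)", f->code);
}
```

An onFailure handler could call this with the `code` and `message` it receives and log the result, which at least separates "CONNACK return code" failures from ones where the broker sent nothing back.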

@fishkeeper87

fishkeeper87 commented Jan 4, 2022

You can get the TLS error messages by setting the ssl_error_cb function pointer in the SSL options structure. If you don't get any error messages from that, then the TLS negotiation has succeeded.

I agree it could be a good idea to differentiate between TCP, TLS (and probably WebSocket) connection failures in the error code information. I think we were thinking that the TLS error callback would cover that.

On whether services return error codes in the CONNACK: as the writer of a service, I might decide I'd rather not give out exact information about the error, in case I'm aiding a malicious hacking attempt.

Is there any good example online or in the tests that shows how to use the ssl_error_cb callback? I'm new to using OpenSSL and haven't really found anything yet, but will keep looking. I can connect to my local mosquitto broker using TLS 1.2 mutual auth with mosquitto_pub and my self-signed certs, but I cannot connect with paho_cs_pub.

paho_cs_pub -h 192.168.3.165 -m "test" -t tcusim/89011703278600892767/voice -p 8883 --insecure --cafile ca-certificates-local.crt --cert 89011703278600892767wIntcopy.crt --key 89011703278600892767.crt --trace protocol

Thanks!

Trace : 3, =========================================================
Trace : 3, Trace Output
Trace : 3, Product name: Eclipse Paho Synchronous MQTT C Client Library
Trace : 3, Version: 1.3.0
Trace : 3, Build level: Tue Jan 4 15:00:22 CST 2022
Trace : 3, OpenSSL version: OpenSSL 1.1.1 11 Sep 2018
Trace : 3, OpenSSL flags: compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-Flav1L/openssl-1.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX2
Trace : 3, OpenSSL build timestamp: built on: Mon Aug 23 17:02:39 2021 UTC
Trace : 3, OpenSSL platform: platform: debian-amd64
Trace : 3, OpenSSL directory: OPENSSLDIR: "/usr/lib/ssl"
Trace : 3, /proc/version: Linux version 5.4.0-91-generic (buildd@lgw01-amd64-024) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #102~18.04.1-Ubuntu SMP Thu Nov 11 14:46:36 UTC 2021

Trace : 3, =========================================================
Trace : 4, 20220104 154701.222 3 paho-cs-pub -> CONNECT cleansession: 1 (0)
Trace : 5, 20220104 154701.232 waitfor unexpectedly is NULL for client paho-cs-pub, packet_type 2, timeout 29880
Trace : 4, 20220104 154702.241 3 paho-cs-pub -> CONNECT cleansession: 1 (0)
Trace : 5, 20220104 154702.241 waitfor unexpectedly is NULL for client paho-cs-pub, packet_type 2, timeout 28861

@fpagliughi
Contributor Author

Looking at the logs from the C library, it seems like the useful information is being determined and logged, but not passed to the caller. Much of the info I was thinking about would be on the failure before or during the connection attempt itself.

I would love to know if the failure was one of these:

  • Address resolution failure (unknown host)
  • Socket connect TCP failure (nothing listening on host:port)
    • Maybe separate this by common TCP errors: ECONNREFUSED, ENETUNREACH, ETIMEDOUT
  • SSL/TLS error, separate from TCP error. (Caller should add SSL callback for details)
  • Timeout waiting for CONNACK (something listening on the port, but maybe it's not an MQTT broker?)
  • Server abruptly disconnected after receiving CONNECT packet. (i.e. non-conforming AWS doesn't support something you requested and just decided to hang up on you)

That sort of thing.

I assume this can be done in a non-breaking-API fashion by adding a whole bunch of new return codes. With a C int we still have room for thousands of new codes. :-)
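
To make the idea concrete, here is a minimal sketch of what a set of finer-grained codes could look like. Every name and value below is hypothetical (nothing like this exists in the library today), and the numeric range is arbitrary, picked only to stay clear of small negative values.

```c
#include <errno.h>

/* Hypothetical return codes, one per failure category listed above.
 * Names and values are illustrative only. */
enum connect_failure {
    CONN_FAIL_UNKNOWN               = -100,
    CONN_FAIL_DNS                   = -101, /* address resolution failed */
    CONN_FAIL_TCP_REFUSED           = -102, /* nothing listening on host:port */
    CONN_FAIL_TCP_UNREACHABLE       = -103,
    CONN_FAIL_TCP_TIMEOUT           = -104,
    CONN_FAIL_TLS                   = -105, /* details via the SSL error callback */
    CONN_FAIL_CONNACK_TIMEOUT       = -106,
    CONN_FAIL_DROPPED_AFTER_CONNECT = -107
};

/* Map the errno from a failed socket connect to one of the codes. */
enum connect_failure classify_tcp_errno(int err)
{
    switch (err) {
    case ECONNREFUSED: return CONN_FAIL_TCP_REFUSED;
    case ENETUNREACH:
    case EHOSTUNREACH: return CONN_FAIL_TCP_UNREACHABLE;
    case ETIMEDOUT:    return CONN_FAIL_TCP_TIMEOUT;
    default:           return CONN_FAIL_UNKNOWN;
    }
}

/* Short human-readable text for each code. */
const char *connect_failure_text(enum connect_failure f)
{
    switch (f) {
    case CONN_FAIL_DNS:                   return "address resolution failure";
    case CONN_FAIL_TCP_REFUSED:           return "TCP connection refused";
    case CONN_FAIL_TCP_UNREACHABLE:       return "network or host unreachable";
    case CONN_FAIL_TCP_TIMEOUT:           return "TCP connect timed out";
    case CONN_FAIL_TLS:                   return "SSL/TLS failure";
    case CONN_FAIL_CONNACK_TIMEOUT:       return "timed out waiting for CONNACK";
    case CONN_FAIL_DROPPED_AFTER_CONNECT: return "server closed connection after CONNECT";
    default:                              return "unknown connection failure";
    }
}
```

Existing callers that only test for "not success" would keep working, while callers that care could switch on the specific code.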
