Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lpstat hangs with stopped cupsd on Solaris #156

Closed
l1gi opened this issue Mar 22, 2021 · 9 comments · Fixed by #788
Closed

lpstat hangs with stopped cupsd on Solaris #156

l1gi opened this issue Mar 22, 2021 · 9 comments · Fixed by #788
Assignees
Labels
bug Something isn't working platform issue Issue is specific to an OS or desktop priority-medium
Milestone

Comments

@l1gi
Copy link

l1gi commented Mar 22, 2021

There is an issue with cups client time out on Solaris. If I stop cupsd or use -h localhost:632 for lpstat it does not fail even if there is no daemon listening and hangs.

$ lpstat -h localhost:632

I have rebuilt the cups with debug enabled and tried again to generate more debug info. This is the interesting part of it:

T001 07:37:07.242  httpConnect2(host="localhost", port=632, addrlist=0, family=0, encryption=0, blocking=1, msec=30000, cancel=0)
T001 07:37:07.242  http_create(host="localhost", port=632, addrlist=0, family=0, encryption=0, blocking=1, mode=0)
T001 07:37:07.243  httpAddrGetList(hostname="localhost", family=AF_UNSPEC, service="632")
T001 07:37:07.243  httpReconnect2(http=701cd5790, msec=30000, cancel=0)
T001 07:37:07.243  httpAddrString(addr=701cd5558, s=7fd24f9417f0, slen=256)
T001 07:37:07.243  httpAddrString: returning "[v1.::1]"...
T001 07:37:07.243  httpReconnect2: Address [v1.::1]:632
T001 07:37:07.243  httpAddrString(addr=701cd5678, s=7fd24f9417f0, slen=256)
T001 07:37:07.243  httpAddrString: returning "127.0.0.1"...
T001 07:37:07.243  httpReconnect2: Address 127.0.0.1:632
T001 07:37:07.243  httpAddrConnect2(addrlist=701cd5550, sock=701cd5790, msec=30000, cancel=0)
T001 07:37:07.243  httpAddrString(addr=701cd5558, s=7fd24f941670, slen=256)
T001 07:37:07.243  httpAddrString: returning "[v1.::1]"...
T001 07:37:07.243  httpAddrConnect2: Trying [v1.::1]:632...
T001 07:37:07.243  httpAddrConnect2: Setting non-blocking connect()
T001 07:37:07.243  httpAddrConnect2: Finishing async connect()
T001 07:37:07.244  httpAddrConnect2: poll() returned 1 (150)
T001 07:37:07.244  pfds[0].revents=5
T001 07:37:07.244  httpReconnect2: New socket=5
T001 07:37:07.244  httpAddrString(addr=701cd5558, s=7fd24f9417f0, slen=256)
T001 07:37:07.244  httpAddrString: returning "[v1.::1]"...
T001 07:37:07.244  httpReconnect2: Connected to [v1.::1]:632...
T001 07:37:07.244  cups_enum_dests(flags=0, msec=250, cancel=0, type=0, mask=0, cb=7fca01a55cb0, user_data=7fd24f941b00)

cups.log

Even if there is nothing listening on [::1]:632, it thinks it is connected and continues trying with re-connect.

This is an interresting part of truss:

so_socket(PF_INET6, SOCK_STREAM, IPPROTO_IP, NULL, SOV_XPG4_2) = 4
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, 0x7FF8F8CCB678, 4, SOV_DEFAULT) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEPORT, 0x7FF8F8CCB678, 4, SOV_DEFAULT) = 0
setsockopt(4, SOL_SOCKET, SO_NOSIGPIPE, 0x7FF8F8CCB678, 4, SOV_DEFAULT) = 0
setsockopt(4, tcp, TCP_NODELAY, 0x7FF8F8CCB678, 4, SOV_DEFAULT) = 0
fcntl(4, F_SETFD, 0x00000001) = 0
fcntl(4, F_GETFL) = 2
fcntl(4, F_SETFL, FWRITE|FNONBLOCK) = 0
connect(4, 0x71FDC20E8, 32, SOV_XPG4_2) Err#150 EINPROGRESS
fcntl(4, F_SETFL, FWRITE) = 0
pollsys(0x7FF8F8CCBB30, 1, 0x7FF8F8CCB620, 0x00000000) = 1
getpeername(4, 0x7FF8F8CCBE50, 0x7FF8F8CCB67C, SOV_DEFAULT) Err#134 ENOTCONN
uname(0x7FF8F8CCB8A0) = 1
send(4, " P O S T / H T T P /".., 203, 0) Err#32 EPIPE
send(4, " P O S T / H T T P /".., 203, 0) Err#32 EPIPE
close(4) = 0

Firstly I don't understand why httpConnect2 is called with blocking=1 and the argument is not used in the underlying functions while they do unconditional nonblocking connect.

If this is an intention, than there is a missing mechanism to check the state of the socket before it is being polled. Maybe getsockopt() is what would 'workaround' the issue though I am in doubt that is the right fix here.

I assume it works accidentally on linux as the non blocking connec() operation finishes sooner that poll() is called. Still the check should be there as far as I understand.

Here is the patch which consider a socket on which getsockopt() fails as non connected on Solaris.

04-nonblock_connect.txt

Though I still don't understand some details of httpAddrConnect2() behavior this change makes Solaris lpstat -t fail in the same manner as on linux:

$ lpstat -t
scheduler is not running
no system default destination
lpstat: Bad file number
lpstat: Bad file number
lpstat: Bad file number
lpstat: Bad file number
lpstat: Bad file number
$

Could I ask you to rework the patch and integrate it in current cups, please?

@michaelrsweet
Copy link
Member

@l1gi This code implements an asynchronous connect so that we can try all of the server's potential addresses at the same time. Clearly Solaris doesn't support this behaviour, so my inclination is to provide a non-async code path just for Solaris.

@michaelrsweet michaelrsweet self-assigned this Mar 22, 2021
@michaelrsweet michaelrsweet added bug Something isn't working platform issue Issue is specific to an OS or desktop priority-medium labels Mar 22, 2021
@michaelrsweet michaelrsweet added this to the v2.4.0 milestone Mar 22, 2021
@l1gi
Copy link
Author

l1gi commented Apr 19, 2021

Why do you think Solaris doesn't support this? What exactly does Solaris different?

As far as I can see, poll() returns 1 with errno 150 (EINPROGRESS) for ::1. Cups starts to believe it is already connected.

Unixv7 standard says:

RETURN VALUE
Upon successful completion, poll( ) shall return a non-negative value. A positive value indicates
the total number of pollfd structures that have selected events (that is, those for which the
revents member is non-zero). A value of 0 indicates that the call timed out and no file descriptors
have been selected. Upon failure, poll( ) shall return −1 and set errno to indicate the error.

When poll() returns non-negative number it doesn't mean that the socket is connected. It just says something has happened on the socket. We must check the socket status to ensure it is connected.

@michaelrsweet
Copy link
Member

@l1gi But the pollfd events we are looking for at POLLIN and POLLOUT, not POLLERR. poll() should only return >0 when the socket is connected (OK to write or OK to read) and not on an error. So that is a Solaris bug, and given that Solaris isn't exactly a current, well-supported OS anymore I am going to opt for a fix that has the fewest possible side-effects. I am not interested in adding a bunch of code that should never be executed.

@michaelrsweet
Copy link
Member

@l1gi Try the following patch (or grab the latest Github master code...):

diff --git a/cups/http-addrlist.c b/cups/http-addrlist.c
index 89b3d92fb..54e2e9b53 100644
--- a/cups/http-addrlist.c
+++ b/cups/http-addrlist.c
@@ -337,6 +337,18 @@ httpAddrConnect2(
 	else if (FD_ISSET(fds[i], &error_set))
 #  endif /* HAVE_POLL */
         {
+#  ifdef __sun
+          // Solaris incorrectly returns errors when you poll() a socket that is
+          // still connecting.  This check prevents us from removing the socket
+          // from the pool if the "error" is EINPROGRESS...
+          int		sockerr;	// Current error on socket
+          socklen_t	socklen = sizeof(serr);
+					// Size of error variable
+
+          if (!getsockopt(fds[i], SOL_SOCKET, SO_ERROR, &sockerr, &socklen) && (!sockerr || sockerr == EINPROGRESS))
+            continue;			// Not an error
+#  endif // __sun
+
          /*
           * Error on socket, remove from the "pool"...
           */

[master e50230a] Add a workaround for Solaris in httpAddrConnect2 (Issue #156)

@l1gi
Copy link
Author

l1gi commented Apr 29, 2021

Hi Michael,

the patch works fine, just

  •      int		sockerr;	// Current error on socket
    

should be

  •      int		serr;	// Current error on socket
    

Thank you!

@l1gi l1gi closed this as completed Apr 29, 2021
@zdohnal
Copy link
Member

zdohnal commented Apr 29, 2021

@l1gi if I understood you correctly (serr is undefined, use sockerr) this PR should fix it #181 . Please let me know if you meant it other way.

@l1gi
Copy link
Author

l1gi commented Apr 29, 2021

Oh sorry, yes, vice versa.

  •      socklen_t	socklen = sizeof(serr);
    

should be

  •      socklen_t	socklen = sizeof(sockerr);
    

Thanks again.

@l1gi
Copy link
Author

l1gi commented Nov 28, 2022

Hi,

I am sorry, but it seems I did not finish my testing of your change going into CUPS codebase mentioned here:

e50230a

The fix doesn't work. Before this change poll() didn't signalize any error though the socket's connection was refused. Later the socket started to be treated as connected. My change (patch) was not nice, but worked and did additional testing of the socket status using getsockopt() before CUPS decided if the socket is connected or not. Michael offered a better looking change, but it did not work because if is placed in 'socket error' code path and thus never run. I have approved Michael's patch by mistake, so blame me for that, please.

Now this change looks similar, but works this time.

$ diff -u cups-2.4.2/cups/http-addrlist.c.~1~ cups-2.4.2/cups/http-addrlist.c
--- cups-2.4.2/cups/http-addrlist.c.~1~ 2022-05-26 08:17:21.000000000 +0200
+++ cups-2.4.2/cups/http-addrlist.c     2022-11-24 14:53:14.027551282 +0100
@@ -318,6 +318,23 @@
       {
 #  ifdef HAVE_POLL
        DEBUG_printf(("pfds[%d].revents=%x\n", i, pfds[i].revents));
+
+#  ifdef __sun
+        // Solaris connect runs asynchronously returning EINPROGRESS. Following
+        // poll() does not detect the socket is not connected and returns
+        // POLLIN|POLLOUT. Check the connection status and update error flag.
+        int            sres, serr;
+        socklen_t      slen = sizeof(serr);
+        sres = getsockopt(fds[i], SOL_SOCKET, SO_ERROR, &serr, &slen);
+        if (sres || serr)
+        {
+          pfds[i].revents |= POLLERR;
+#  ifdef DEBUG
+          DEBUG_printf(("1httpAddrConnect2: getsockopt returned: %d with error: %s", sres, strerror(serr)));
+#  endif
+        }
+#  endif // __sun
+
        if (pfds[i].revents && !(pfds[i].revents & (POLLERR | POLLHUP)))
 #  else
        if (FD_ISSET(fds[i], &input_set) && !FD_ISSET(fds[i], &error_set))
@@ -340,18 +357,6 @@
        else if (FD_ISSET(fds[i], &error_set))
 #  endif /* HAVE_POLL */
         {
-#  ifdef __sun
-          // Solaris incorrectly returns errors when you poll() a socket that is
-          // still connecting.  This check prevents us from removing the socket
-          // from the pool if the "error" is EINPROGRESS...
-          int          sockerr;        // Current error on socket
-          socklen_t    socklen = sizeof(sockerr);
-                                       // Size of error variable
-
-          if (!getsockopt(fds[i], SOL_SOCKET, SO_ERROR, &sockerr, &socklen) && (!sockerr || sockerr == EINPROGRESS))
-            continue;                  // Not an error
-#  endif // __sun
-
          /*
           * Error on socket, remove from the "pool"...
           */

Could I kindly ask you to consider it for integration, please?

Thank you, Martin.

@l1gi l1gi reopened this Nov 28, 2022
@l1gi
Copy link
Author

l1gi commented Nov 28, 2022

Just a comment that on Linux lpstat without cupsd running says:

lpstat: Bad file descriptor

With my change above I am getting:

lpstat: Scheduler is not running.

on Solaris.

@zdohnal zdohnal modified the milestones: v2.4, v2.4.x Jun 14, 2023
zdohnal added a commit to zdohnal/cups that referenced this issue Sep 14, 2023
Solaris behaves differently regarding connecting to a socket, so our
current code caused hanging.

The patch is from @l1gi's comment -
OpenPrinting#156 (comment)

Fixes OpenPrinting#156
zdohnal added a commit to zdohnal/cups that referenced this issue Sep 15, 2023
Solaris behaves differently regarding connecting to a socket, so our
current code caused hanging.

The patch is from @l1gi's comment -
OpenPrinting#156 (comment)

Fixes OpenPrinting#156
zdohnal added a commit to zdohnal/cups that referenced this issue Sep 15, 2023
zdohnal added a commit that referenced this issue Sep 15, 2023
Solaris behaves differently regarding connecting to a socket, so our current code caused hanging.

The patch is from @l1gi's comment in #156 

Fixes #156
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working platform issue Issue is specific to an OS or desktop priority-medium
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants