
Broken interrupt handling when connection is inactive #1478

Closed
dgw opened this issue Feb 12, 2019 · 5 comments
Labels: Bug (Things to squish; generally used for issues), Low Priority
Comments

dgw (Member) commented Feb 12, 2019

For the record, I'm testing on macOS today because that's what I have available. However, the same behavior occurs with identical or similar tracebacks on my Ubuntu machine at home, so I don't think it's related to the fact that I'm using Homebrew Python or anything. It's probably related to Sopel's sometimes-wonky exception handling.

Here's what I did:

  1. Run sopel
  2. Press Ctrl-C after "Connecting to <server>..."
  3. Run sopel again
  4. Press Ctrl-C while Sopel is waiting to reconnect after an expected SSL failure

Interrupting the connection phase resulted in an AttributeError, probably as expected (since quitting tries to send something to the socket, and the socket object doesn't exist before connecting finishes):

Connecting to irc.network.net:6667...
^CGot quit signal, shutting down.
Traceback (most recent call last):
  File "/Users/dgw/github/sopel/sopel/__init__.py", line 91, in run
    p.run(config.core.host, int(config.core.port))
  File "/Users/dgw/github/sopel/sopel/irc.py", line 167, in run
    self.initiate_connect(host, port)
  File "/Users/dgw/github/sopel/sopel/irc.py", line 177, in initiate_connect
    source_address=source_address))
  File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/socket.py", line 707, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/socket.py", line 748, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
  File "/Users/dgw/github/sopel/sopel/__init__.py", line 73, in signal_handler
    p.quit('Closing')
  File "/Users/dgw/github/sopel/sopel/irc.py", line 199, in quit
    self.write(['QUIT'], message)
  File "/Users/dgw/github/sopel/sopel/bot.py", line 166, in write
    irc.Bot.write(self, args, text=text)
  File "/Users/dgw/github/sopel/sopel/irc.py", line 161, in write
    self.send(temp.encode('utf-8'))
  File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncore.py", line 360, in send
    result = self.socket.send(data)
AttributeError: 'NoneType' object has no attribute 'send'
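A minimal sketch of the guard dgw describes (the class and method names here are hypothetical stand-ins, not Sopel's actual API): check for a live socket before trying to send QUIT, so an interrupt during connection setup can still shut down cleanly.

```python
# Sketch only: hypothetical stand-in for the bot, not Sopel's real classes.
class Bot:
    def __init__(self):
        self.socket = None  # asyncore assigns this only once connected

    def connect(self):
        self.socket = object()  # stand-in for a connected socket

    def quit(self, message):
        # Guard: before the connection finishes, self.socket is None, and
        # calling .send() on it raises the AttributeError in the traceback
        # above. With the check, we just shut down locally instead.
        if self.socket is None:
            return "closed without sending QUIT"
        return "QUIT :" + message
```

With the guard in place, interrupting before `connect()` completes no longer raises.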

Interrupting the reconnect delay yields a completely different traceback, and additionally requires pressing Ctrl-C twice:

Warning: Disconnected. Reconnecting in 20 seconds...
^CGot quit signal, shutting down.
Traceback (most recent call last):
  File "./sopel.py", line 7, in <module>
    sys.exit(run_script.main())
  File "/Users/dgw/github/sopel/sopel/run_script.py", line 351, in main
    ret = run(config_module, pid_file_path)
  File "/Users/dgw/github/sopel/sopel/__init__.py", line 120, in run
    time.sleep(delay)
  File "/Users/dgw/github/sopel/sopel/__init__.py", line 73, in signal_handler
    p.quit('Closing')
  File "/Users/dgw/github/sopel/sopel/irc.py", line 199, in quit
    self.write(['QUIT'], message)
  File "/Users/dgw/github/sopel/sopel/bot.py", line 166, in write
    irc.Bot.write(self, args, text=text)
  File "/Users/dgw/github/sopel/sopel/irc.py", line 161, in write
    self.send(temp.encode('utf-8'))
  File "/Users/dgw/github/sopel/sopel/irc.py", line 327, in _ssl_send
    result = self.socket.send(data)
OSError: [Errno 9] Bad file descriptor
^CGot quit signal, shutting down.
Exception ignored in: <module 'threading' from '/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py'>
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 1273, in _shutdown
    t.join()
  File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 1032, in join
    self._wait_for_tstate_lock()
  File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 1048, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
  File "/Users/dgw/github/sopel/sopel/__init__.py", line 73, in signal_handler
    p.quit('Closing')
  File "/Users/dgw/github/sopel/sopel/irc.py", line 199, in quit
    self.write(['QUIT'], message)
  File "/Users/dgw/github/sopel/sopel/bot.py", line 166, in write
    irc.Bot.write(self, args, text=text)
  File "/Users/dgw/github/sopel/sopel/irc.py", line 161, in write
    self.send(temp.encode('utf-8'))
  File "/Users/dgw/github/sopel/sopel/irc.py", line 327, in _ssl_send
    result = self.socket.send(data)
OSError: [Errno 9] Bad file descriptor

I've run into this second issue relatively often while testing things in the last few months, since I often intentionally interrupt Sopel during states other than "Connected to network and running normally".

The exception itself isn't a big deal (though it would be nice not to spit it out). I consider needing to press Ctrl-C twice to be the main bug here. That simply shouldn't be required.

Interrupting Sopel after the "Loading modules..." line (but before it starts to connect) yields a clean exit, though. That's kind of weird, and I haven't had time to look into why.

Since I know we have a few people digging around in Sopel's internals and refactoring things now (and doing damn good work, too!), I'm hoping one of them will look into this at some point. 😹

This isn't a high-severity bug, but I do want to get it fixed if possible, ideally in the next year or two. That need to press Ctrl-C twice carries over into, for example, sopel --quit. Sopel should never need to be told to quit twice, unless something is catastrophically wrong—and waiting to reconnect is a totally normal situation that shouldn't break things the way it does now.

dgw added the Bug, Low Priority, and Patches Welcome labels on Feb 12, 2019
dgw added this to the 7.0.0 milestone on Feb 12, 2019
HumorBaby added a commit to HumorBaby/sopel that referenced this issue Mar 30, 2019
Because of the `time.sleep` during the disconnect phase, the signal
handling is thrown off and a new bot is spawned before the `bot.hasquit`
flag can be checked. Add a check at the top of the loop to see if the
`hasquit` flag was set during a disconnect.

Fixes sopel-irc#1478
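The idea in that commit message can be sketched like this (the `hasquit` flag and the loop shape are illustrative, not Sopel's exact code): re-check the quit flag at the top of the connect loop, so a Ctrl-C that lands during the reconnect delay stops the loop instead of spawning a new bot.

```python
import threading

hasquit = threading.Event()  # a signal handler would set this on Ctrl-C

def connect_loop(max_attempts=5):
    attempts = 0
    while attempts < max_attempts:
        # The added check: if the quit flag was set during the
        # disconnect/sleep phase, bail out before reconnecting.
        if hasquit.is_set():
            break
        attempts += 1
        # ... connect and run until disconnected ...
        hasquit.set()  # simulate Ctrl-C arriving during the reconnect delay
        # time.sleep(delay) would happen here in the real loop
    return attempts
```

Running `connect_loop()` here returns 1: the simulated Ctrl-C during the delay prevents a second connection attempt.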
HumorBaby (Contributor) commented:

I see the fix is "low priority", but it is a straightforward backport to 6.6.x (from CLI back to __init__.py). Is this something worth pursuing?

HumorBaby (Contributor) commented Mar 30, 2019

> Interrupting Sopel after the "Loading modules..." line (but before it starts to connect) yields a clean exit, though. That's kind of weird, and I haven't had time to look into why.

I believe it's because, at this point, the KeyboardInterrupt is caught as-is from the MainThread, as opposed to the socket thread (or any one of the many subsequent threads that are spawned from this point on).
See sopel/sopel/cli/run.py, lines 84-86 at f694b5b:

        p.run(config.core.host, int(config.core.port))
    except KeyboardInterrupt:
        break
vs. waiting for bot.hasquit to be set after a signal from another thread.
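A toy illustration of that difference (not Sopel code): an interrupt raised while the main thread is executing the try body surfaces as a plain KeyboardInterrupt and is caught by the except, which matches the clean exit seen during module loading.

```python
def load_modules():
    # Stand-in for startup work; pretend Ctrl-C arrives here, which in the
    # main thread surfaces as a KeyboardInterrupt at this exact frame.
    raise KeyboardInterrupt

def main():
    try:
        load_modules()
    except KeyboardInterrupt:
        # The interrupt lands inside the try body, so this handler runs
        # and the program exits cleanly.
        return "clean exit"
    return "finished normally"
```

Once the bot is waiting on sockets or on `bot.hasquit` being set from a signal handler, the interrupt instead fires in whatever frame happens to be active, producing the tracebacks above.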

HumorBaby (Contributor) commented:

> I see the fix is "low priority", but it is a straightforward backport to 6.6.x (from CLI back to __init__.py). Is this something worth pursuing?

During the development of #1527 (and, I have now noticed, with Docker containers as well), I found that more often than not, triggering the exceptions described above (^C) would lead to a complete hang (subsequent ^C did nothing) 😢 The only way to kill Sopel after this was ^Z, then kill -9 %1 (for Docker, containers needed to be manually force-removed with docker rm -f ...).

This Docker behavior is not new, but since it's related to this issue, I figured I would mention it.

HumorBaby added a commit to HumorBaby/sopel that referenced this issue Apr 1, 2019
dgw modified the milestone from 7.0.0 to 6.6.6 on Apr 4, 2019
dgw (Member, Author) commented Apr 4, 2019

The broken cases described here are fixed in #1534 (and presumably in #1527 also, on the new "run" CLI, though I haven't specifically tested that yet).

As @HumorBaby and I pinned down on IRC, though, fixing the handling of Ctrl-C during the "waiting to reconnect" state revealed another issue: the quit signal isn't processed until after the time.sleep(delay) period elapses, so Sopel appears to hang for several seconds.

Fixing that would require a fair bit of refactoring, so it's a separate issue. In fact, I'll just make one.
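One common pattern for making such a delay interruptible (a sketch of the general technique under assumed names, not what Sopel's refactor actually did) is to replace `time.sleep(delay)` with `threading.Event.wait`, which returns as soon as the quit signal sets the event:

```python
import threading

quit_requested = threading.Event()  # a signal handler would call .set()

def wait_before_reconnect(delay):
    # Event.wait(timeout) blocks for at most `delay` seconds, but returns
    # True immediately if/when the event is set -- unlike time.sleep,
    # which always blocks for the full duration.
    if quit_requested.wait(timeout=delay):
        return "quitting"
    return "reconnecting"
```

A handler calling `quit_requested.set()` would then cut even a 20-second reconnect delay short instead of leaving the bot apparently hung.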

dgw (Member, Author) commented Apr 10, 2019

As #1534 was merged, I'll close this as fixed. #1527 remains in the 7.0.0 milestone, so it won't get lost.

dgw closed this as completed on Apr 10, 2019
HumorBaby added a commit to HumorBaby/sopel that referenced this issue Apr 14, 2019
HumorBaby added a commit to HumorBaby/sopel that referenced this issue Apr 16, 2019
Fixes sopel-irc#1478. Reapplied because of changes from sopel-irc#1493.