
Fix executor shutdown #392

Open

dkropachev wants to merge 1 commit into master from dk/fix-executor-shutdown
Conversation

@dkropachev (Collaborator) commented Dec 19, 2024

Usually Cluster.executor is shut down after Cluster.scheduler, but in some cases the executor is shut down earlier, most likely when Cluster.executor is manipulated directly or when the process is killed while the Cluster instance is still alive.

When it happens, you will see this exception:

Traceback (most recent call last):
  File "lib/python3.13/threading.py", line 1041, in _bootstrap_inner
    self.run()
    ~~~~~~~~^^
  File "cassandra/cluster.py", line 4429, in run
    future = self._executor.submit(fn, *args, **kwargs)
  File "lib/python3.13/concurrent/futures/thread.py", line 171, in submit
    raise RuntimeError('cannot schedule new futures after shutdown')
RuntimeError: cannot schedule new futures after shutdown
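The failure is easy to reproduce outside the driver: ThreadPoolExecutor.submit raises exactly this RuntimeError once the pool has been shut down. A stand-alone sketch (plain concurrent.futures, no driver code):

```python
# Stand-alone reproducer: submitting to a ThreadPoolExecutor after
# shutdown() raises the same RuntimeError seen in the traceback above.
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)
executor.shutdown()
try:
    executor.submit(print, "too late")
except RuntimeError as exc:
    print(exc)  # cannot schedule new futures after shutdown
```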

This PR handles this case gracefully: when the executor is already shut down, nothing is scheduled and no exception is raised.
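A minimal sketch of that graceful handling, assuming the fix boils down to catching the RuntimeError around the submit call (the helper name below is hypothetical; the actual patch lives in the driver's scheduler loop):

```python
# Hypothetical sketch: if the executor is already shut down, drop the
# task and log, instead of letting the scheduler thread die with an
# unhandled RuntimeError.
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger(__name__)

def submit_if_running(executor, fn, *args, **kwargs):
    """Return a Future, or None if the executor is already shut down."""
    try:
        return executor.submit(fn, *args, **kwargs)
    except RuntimeError:
        # ThreadPoolExecutor.submit raises RuntimeError after shutdown().
        log.debug("Executor already shut down; dropping task %r", fn)
        return None
```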

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@dkropachev dkropachev force-pushed the dk/fix-executor-shutdown branch from cd0361c to ad9ffaa Compare December 20, 2024 14:11
@dkropachev dkropachev marked this pull request as ready for review December 20, 2024 17:19
Avoid exception:
Traceback (most recent call last):
  File "lib/python3.13/threading.py", line 1041, in _bootstrap_inner
    self.run()
    ~~~~~~~~^^
  File "cassandra/cluster.py", line 4429, in run
    future = self._executor.submit(fn, *args, **kwargs)
  File "lib/python3.13/concurrent/futures/thread.py", line 171, in submit
    raise RuntimeError('cannot schedule new futures after shutdown')
RuntimeError: cannot schedule new futures after shutdown
@dkropachev dkropachev force-pushed the dk/fix-executor-shutdown branch from ad9ffaa to 63110b0 Compare December 20, 2024 19:44
@Lorak-mmk

What is the task that is being scheduled? I know about this error, but I didn't fix it that way before because, in my opinion, it is the wrong way. We should first end all tasks / threads (or whatever they are called here), and only after they are stopped can the scheduler be stopped. Ignoring this error is not fixing the issue, just masking it.

I envision it could hurt us in the future - perhaps queries that were not finished will wait forever? Or there will be something that schedules a task and awaits its result, causing a hang? I'd like to avoid issues similar to cassandra-stress being stuck on shutdown.

@dkropachev (Collaborator, Author) commented Dec 21, 2024

> What is the task that is being scheduled? I know about this error, but I didn't fix it that way before because, in my opinion, it is the wrong way. We should first end all tasks / threads (or whatever they are called here), and only after they are stopped can the scheduler be stopped. Ignoring this error is not fixing the issue, just masking it.
>
> I envision it could hurt us in the future - perhaps queries that were not finished will wait forever? Or there will be something that schedules a task and awaits its result, causing a hang? I'd like to avoid issues similar to cassandra-stress being stuck on shutdown.

It is actually already done; here _Scheduler waits for its thread to complete on shutdown:

def shutdown(self):
    try:
        log.debug("Shutting down Cluster Scheduler")
    except AttributeError:
        # this can happen on interpreter shutdown
        pass
    self.is_shutdown = True
    self._queue.put_nowait((0, 0, None))
    self.join()

And here Cluster shuts down the _Scheduler first and the executor last:

self.scheduler.shutdown()
self.control_connection.shutdown()
for session in tuple(self.sessions):
    session.shutdown()
self.executor.shutdown()

Which means that this issue can happen only when the shutdown procedure does not go as it should: either when a test or a user manipulates the executor directly, or when the process is killed and the Cluster instance is not shut down properly.
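That out-of-order case is easy to simulate: shut the executor down while a loop on another thread is still submitting to it. This is illustrative only (the names below are not driver code), but it produces exactly the error in question:

```python
# Simulating the out-of-order shutdown: the executor is shut down
# while another thread is still submitting to it, which is when
# "cannot schedule new futures after shutdown" appears.
import threading
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)
errors = []
first_submit = threading.Event()

def scheduler_loop():
    while True:
        try:
            executor.submit(time.sleep, 0)
            first_submit.set()
        except RuntimeError as exc:  # executor was shut down under us
            errors.append(exc)
            return
        time.sleep(0.001)

t = threading.Thread(target=scheduler_loop)
t.start()
first_submit.wait()   # at least one submit succeeded
executor.shutdown()   # wrong order: the loop is still running
t.join()
print(len(errors))    # 1
```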

@Lorak-mmk

Wait, so the error can happen in one of 3 conditions:

  1. There is some unknown bug in our code
  2. Test does something weird
  3. Process is killed

The third case is not important with regard to warnings; I think it is ok to print them.
In the second case, I think the test should be fixed so that it does not produce the warning, instead of silencing the warning globally.
Perhaps the test needs to do some weird stuff, but if that is the case maybe it can temporarily silence the warnings?

The first case is why the warning should not be ignored, imo. If we fix the second case, don't kill the process, and still see the warning, then it indicates a bug which we should fix, right?

@dkropachev (Collaborator, Author)

> Wait, so the error can happen in one of 3 conditions:
>
>   1. There is some unknown bug in our code
>   2. Test does something weird
>   3. Process is killed
>
> The third case is not important with regard to warnings; I think it is ok to print them. In the second case, I think the test should be fixed so that it does not produce the warning, instead of silencing the warning globally. Perhaps the test needs to do some weird stuff, but if that is the case maybe it can temporarily silence the warnings?
>
> The first case is why the warning should not be ignored, imo. If we fix the second case, don't kill the process, and still see the warning, then it indicates a bug which we should fix, right?

Correct. Let's do this:

  1. Make it log the exception instead of throwing it.
  2. Fix the test case that manipulates the executor and see if this error appears in the CI/CD.

WDYT?
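Step 1 could look something like the following sketch. The helper name is hypothetical and the body is illustrative, not the actual driver patch; the point is only that the exception is logged (with the traceback preserved via exc_info) rather than propagated out of the scheduler thread:

```python
# Hypothetical sketch of step 1: log the exception, keep the thread
# alive, and return None instead of raising.
import logging

log = logging.getLogger("cassandra.cluster")

def run_task(executor, fn, *args, **kwargs):
    # Illustrative helper, not the actual driver code.
    try:
        return executor.submit(fn, *args, **kwargs)
    except RuntimeError:
        # Raised by ThreadPoolExecutor.submit after shutdown().
        log.warning("Not scheduling %r: executor is shut down", fn,
                    exc_info=True)
        return None
```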
