Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lower select timeout #263

Closed
wants to merge 3 commits into from

Conversation

zbenjamin
Copy link
Contributor

This pull request fixes the previous issue with supervisorctl stop hanging.

Having them as class variables didn't have any observable issues, but
it couldn't possibly have been correct.
The previous code was checking whether it had been self.delay seconds
since the last check, not whether it had been self.delay seconds
since we started delaying.
The various deferring classes in http.py would try to delay by 0.1
seconds when data was not yet ready.  However, since the select
timeout was 1 second, we'd end up delaying that long, instead.  This
amounted to each start or stop command taking 1 second to complete.

The comment about the minimum timeout appears to have never been
true.
@zbenjamin
Copy link
Contributor Author

Any feedback here? We'd like to push this upstream as it significantly improves service start and stop times.

@jsipprell
Copy link

+1

This pull solved a tremendous scalability headache for me when automating, via configuration management, restarting large numbers of processes in a group (64+).

@mcdonc
Copy link
Member

mcdonc commented Aug 3, 2014

Note to self: need to check CPU usage of otherwise idle supervisord after trying to merge this.

@mcdonc
Copy link
Member

mcdonc commented Aug 3, 2014

Related: #131

@mcdonc
Copy link
Member

mcdonc commented Aug 4, 2014

I'm afraid I'll have to pass on this pull request, because, while it does help things start and stop faster, it's treating the symptom instead of the cause. The commit at 50d1857 represents a more involved way to make things start and stop faster without requiring that the select timeout be dropped. Hopefully I'll be able to merge it to master soon, once its tests pass.

@mcdonc mcdonc closed this Aug 4, 2014
@ngocson2vn
Copy link
Contributor

@mcdonc: I agree with you that the root cause is not related to select timeout. I also tested supervisorctl start clock (clock is a Python script) with new code in branch feature.faststartstop but supervisorctl start clock still hang up (not return shell).

This is my clock script:

#!/usr/bin/env python3
import sys, time

while True:
    sys.stdout.write("The time is: " + time.strftime("%H:%M:%S") + "\n")
    sys.stdout.flush()

I think there are some reasons as follows:

  1. In rpcinterface.py, method startit() is not finished starting process clock. It returns NOT_DONE_YET
  2. The process clock continues outputting data to stdout_logfile so the file descriptor associated with POutputDispatcher will be always readable. Therefore self.options.select(r, w, x, timeout) will return immediately.
  3. Method writable in class deferring_http_channel calculated elapsed time incorrectly (I think so). It actually didn't do delay exactly:
    def writable(self, t=time.time):
        now = t()
        if self.delay:
            # we called a deferred producer via this channel (see refill_buffer)
            last_writable_check = self.writable_check
            self.writable_check = now           # WRONG here
            elapsed = now - last_writable_check # elapsed is probably less than self.delay
            if elapsed > self.delay:
                return True
            else:
                return False

From 1. 2. and 3. we see that dispatcher object which has type deferring_http_channel will almost never be writable. Therefore, supervisorctl start clock will almost never return.

It would be very nice if we can fix startit() to start process more quickly. However, to make sure that supervisorctl start clock can return, I really want to propose my simple patch as below:

    def writable(self, t=time.time):
        now = t()
        if self.delay:
            # we called a deferred producer via this channel (see refill_buffer)
            last_writable_check = self.writable_check
            elapsed = now - last_writable_check
            if elapsed > self.delay:
                self.writable_check = now  # UPDATE here
                return True
            else:
                return False

I think now writable will actually delay a period self.delay exactly.
With this patch, supervisorctl start clock returns quickly and the process clock is still started correctly.

@mcdonc
Copy link
Member

mcdonc commented Aug 9, 2014

@ngocson2vn nice analysis and fix! I'll commit it to the https://github.com/Supervisor/supervisor/tree/feature.faststartstop branch

@mcdonc
Copy link
Member

mcdonc commented Aug 11, 2014

50d1857 is also now merged to master, FTR, which represents the "more involved" way to make starting and stopping processes much faster

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants