Terminate block production loop when shutting down witness plugin #1314 #1332

Merged (4 commits) on Oct 8, 2018

Conversation

cogutvalera
Member

PR for "Terminate block production loop when shutting down witness plugin #1314"

@cogutvalera
Member Author

The Travis-CI build failed, perhaps related to issue #1303.

@cogutvalera
Member Author

I've restarted the Travis-CI build for this PR, just to see whether it fails again.

@abitmore
Member

Restarted again.

@cogutvalera
Member Author

Thanks! Now all checks have passed.

@pmconrad
Contributor

Thread handling in fc is still a mystery to me.

That said, I think there's a race condition in your code. _block_production_task recreates itself by calling schedule_production_loop() from inside block_production_loop(). That means if block_production_loop is being executed while you call cancel, the cancel will only affect the currently running loop but will not prevent it from scheduling the next one. I may be wrong though.
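
The interleaving described here can be modeled in a few lines. This is a minimal single-threaded sketch, not the fc API: `Task`, `Producer`, `cancel()`, and the `during_run` hook are hypothetical stand-ins, and only the names `schedule_production_loop`, `block_production_loop`, and `_block_production_task` come from the discussion.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-in for a scheduled task handle (not the fc API).
struct Task {
    bool canceled = false;
};

struct Producer {
    std::shared_ptr<Task> current;  // plays the role of _block_production_task
    int times_scheduled = 0;

    void schedule_production_loop() {
        current = std::make_shared<Task>();  // a fresh task replaces the handle
        ++times_scheduled;
    }

    // One loop iteration; `during_run` models an event arriving mid-iteration.
    void block_production_loop(const std::function<void()>& during_run) {
        during_run();
        schedule_production_loop();  // final step: reschedule unconditionally
    }

    // Marks only the task that the handle points to right now.
    void cancel() {
        if (current) current->canceled = true;
    }
};

// Walks through the interleaving: cancel arrives while the loop body runs.
inline bool next_task_survives_cancel() {
    Producer p;
    p.schedule_production_loop();
    std::shared_ptr<Task> first = p.current;
    p.block_production_loop([&p] { p.cancel(); });
    assert(first->canceled);      // the running task was canceled
    return !p.current->canceled;  // but its successor was not
}
```

In this model `next_task_survives_cancel()` returns true: the cancel reaches the task that was running, while the task created by the final reschedule is untouched.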

@cogutvalera
Member Author

> Thread handling in fc is still a mystery to me.
>
> That said, I think there's a race condition in your code. _block_production_task recreates itself by calling schedule_production_loop() from inside block_production_loop(). That means if block_production_loop is being executed while you call cancel, the cancel will only affect the currently running loop but will not prevent it from scheduling the next one. I may be wrong though.

It will wait for the current block production execution to finish and then cancel, so there should be no race condition here.

@pmconrad
Contributor

cancel_and_wait sends the cancel signal and waits for the task to be canceled.
Where does it wait for the currently running task to finish before sending the cancel signal?

@cogutvalera
Member Author

> cancel_and_wait sends the cancel signal and waits for the task to be canceled.
> Where does it wait for the currently running task to finish before sending the cancel signal?

  1. Yes, right: it waits for the currently running task after sending the cancel signal.
  2. It won't schedule the next loop; why would it? Which other currently running task do you mean, if cancel_and_wait waits for it?

@pmconrad
Contributor

cancel_and_wait itself doesn't schedule the next loop.
The currently running task is executing block_production_loop(), and will call schedule_production_loop() as its final step. https://github.com/bitshares/bitshares-core/blob/master/libraries/plugins/witness/witness.cpp#L207

@pmconrad
Contributor

That change makes the window for the race condition smaller, but doesn't eliminate it.

@pmconrad
Contributor

It just doesn't work like that. You must either use locking or an atomic check-and-set of some kind.
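
One way to read "atomic check-and-set" is a small state machine driven by compare-and-swap. The sketch below uses only the standard `std::atomic`; `LoopState`, `ProductionGate`, and its member names are hypothetical and do not appear in the PR.

```cpp
#include <atomic>
#include <cassert>

enum class LoopState { Idle, Scheduled, Stopped };

class ProductionGate {
    std::atomic<LoopState> state{LoopState::Idle};

public:
    // Claim the right to schedule: succeeds only on an Idle -> Scheduled swap.
    bool try_schedule() {
        LoopState expected = LoopState::Idle;
        return state.compare_exchange_strong(expected, LoopState::Scheduled);
    }

    // Called when a run ends; leaves Stopped untouched.
    void finish_run() {
        LoopState expected = LoopState::Scheduled;
        state.compare_exchange_strong(expected, LoopState::Idle);
    }

    // Unconditionally move to Stopped; later try_schedule() calls fail.
    void stop() { state.store(LoopState::Stopped); }
};

inline void demo_gate() {
    ProductionGate gate;
    bool claimed = gate.try_schedule();       // Idle -> Scheduled
    assert(claimed);
    gate.stop();                              // shutdown arrives mid-run
    gate.finish_run();                        // no effect: state is Stopped
    bool rescheduled = gate.try_schedule();   // fails after stop()
    assert(!rescheduled);
}
```

Because the check and the state change happen in a single atomic operation, a stop that lands between "check" and "set" cannot be missed, which is the property a separate read followed by a write lacks.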

Contributor

@jmjatlanta jmjatlanta left a comment

I believe the code as written will greatly reduce the chance of a production loop executing after shutdown. I have not walked through all scenarios, but I think this covers many of the bases. The comments below are just some opinions.

IMO: locking / check and set are overkill here. A volatile boolean would work if in this plugin we do not want to support a startup, shutdown, and then another startup. I'm not saying atomic_flag doesn't do the job, it is just some unneeded overhead.

I am not approving this PR, as I believe @pmconrad may be thinking of a scenario that I am not. So I will defer to him.

@pmconrad
Contributor

pmconrad commented Oct 1, 2018

I believe there is still a race condition.
The general idea would be to use a mutex to protect access to _block_production_task in both schedule_production_loop and stop_production_loop.
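
A rough sketch of the mutex idea, using `std::mutex` rather than whatever locking primitive the project would actually choose. `Task`, `GuardedLoop`, and `current()` are hypothetical, and the `stopped` flag is an added assumption: the comment only mentions protecting `_block_production_task`.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

struct Task { bool canceled = false; };  // hypothetical task handle

class GuardedLoop {
    std::mutex mtx;               // guards `task` and `stopped`
    std::shared_ptr<Task> task;   // plays the role of _block_production_task
    bool stopped = false;         // assumption: not named in the proposal

public:
    // Returns false once stop_production_loop() has run.
    bool schedule_production_loop() {
        std::lock_guard<std::mutex> lock(mtx);
        if (stopped) return false;
        task = std::make_shared<Task>();
        return true;
    }

    // Cancels the current task and blocks any later scheduling.
    void stop_production_loop() {
        std::lock_guard<std::mutex> lock(mtx);
        stopped = true;
        if (task) task->canceled = true;
    }

    std::shared_ptr<Task> current() {
        std::lock_guard<std::mutex> lock(mtx);
        return task;
    }
};

inline void demo_guarded_loop() {
    GuardedLoop loop;
    bool scheduled = loop.schedule_production_loop();
    assert(scheduled);
    loop.stop_production_loop();
    bool again = loop.schedule_production_loop();
    assert(!again);  // scheduling after stop is refused
}
```

Since both functions take the same lock, a stop can no longer slip in between reading the handle and replacing it.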

@pmconrad
Contributor

pmconrad commented Oct 1, 2018

@jmjatlanta you are right in that the implementation so far eliminates most of the risk. Nevertheless I think we should do it properly. Overhead should not be a problem since it affects only actual witnesses, and only once per second.

@abitmore abitmore added this to the 201810 - Feature Release milestone Oct 1, 2018
@abitmore
Member

abitmore commented Oct 2, 2018

Actually I didn't get why we're making it so complicated. Why do we need another stop_loop call to control the thread at all?

The production loop function is NOT a big task, so it won't take much time to execute. We can simply wait for it to exit if it is still running when trying to shut it down. So my solution would be:

  • add a member variable e.g. bool shutting_down = false to the class
  • when shutting down, change it to true
  • when entering the production loop, check the variable, if it is true, don't schedule a new task.

Thoughts?
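
The three steps above can be sketched as follows. This is a simplified single-threaded illustration, not the merged code: `WitnessSketch`, `plugin_shutdown()`, and the counter are hypothetical, and block production itself is omitted.

```cpp
#include <cassert>

class WitnessSketch {             // hypothetical, simplified stand-in
    bool shutting_down = false;   // step 1: the new member variable
    int tasks_scheduled = 0;

public:
    void schedule_production_loop() { ++tasks_scheduled; }

    void block_production_loop() {
        // ... produce a block if it is our turn (omitted) ...
        if (!shutting_down) {     // step 3: skip rescheduling during shutdown
            schedule_production_loop();
        }
    }

    void plugin_shutdown() { shutting_down = true; }  // step 2

    int scheduled() const { return tasks_scheduled; }
};

inline void demo_shutdown_flag() {
    WitnessSketch w;
    w.block_production_loop();
    assert(w.scheduled() == 1);   // normal operation reschedules
    w.plugin_shutdown();
    w.block_production_loop();
    assert(w.scheduled() == 1);   // nothing new after shutdown
}
```

If the flag is written by one thread and read by another, `std::atomic<bool>` would be the conservative type to use; that is an assumption of this sketch, not something stated in the proposal.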

@cogutvalera
Member Author

We need to use a mutex because there may be different threads, if I understood the production loop correctly.

@abitmore
Member

abitmore commented Oct 2, 2018

IMHO we don't care whether the loop runs one more time. We don't need a mutex for the new variable because there is only one thread that writes to it and another thread only reads it. (I'm talking about my solution mentioned above, not your code)

@cogutvalera
Member Author

Perhaps the production loop may be multi-threaded, so there may be N threads that write and N threads that read, or am I wrong?

@pmconrad
Contributor

pmconrad commented Oct 3, 2018

There is at most one thread executing production_loop at any time, and there should be at most one thread calling stop_production_loop.
@abitmore's solution should also work.

@cogutvalera
Member Author

Thank you @abitmore @pmconrad, will do it (I mean @abitmore's solution).

@pmconrad
Contributor

pmconrad commented Oct 3, 2018

IMO cancel_and_wait is still needed to prevent crash during shutdown.

@abitmore
Member

abitmore commented Oct 3, 2018

For my solution, it's better to add a check for shutting_down in block_production_loop() as well.

(BTW actually I don't know how cancel_and_wait works, so @pmconrad's comment could be correct)
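
Putting the last few comments together (a flag checked both when scheduling and on entry to the loop, plus a wait at shutdown), a toy model might look like this. The task queue is a hypothetical single-threaded stand-in for the real scheduler, and the drain loop in `shutdown()` only imitates the role the thread assigns to cancel_and_wait; none of this is the fc API or the merged code.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

class WitnessModel {
    std::deque<std::function<void()>> pending;  // hypothetical task queue
    bool shutting_down = false;
    int blocks_produced = 0;

    void schedule_production_loop() {
        if (shutting_down) return;  // no new task once shutdown began
        pending.push_back([this] { block_production_loop(); });
    }

    void block_production_loop() {
        if (shutting_down) return;  // extra check on entry
        ++blocks_produced;          // stand-in for producing a block
        schedule_production_loop();
    }

public:
    void startup() { schedule_production_loop(); }

    // Run at most `n` queued tasks, as the scheduler would over time.
    void run_pending(int n) {
        while (n-- > 0 && !pending.empty()) {
            std::function<void()> task = std::move(pending.front());
            pending.pop_front();
            task();
        }
    }

    // Set the flag, then wait for whatever is still queued to finish.
    void shutdown() {
        shutting_down = true;
        while (!pending.empty()) run_pending(1);
    }

    int produced() const { return blocks_produced; }
    bool idle() const { return pending.empty(); }
};

inline void demo_witness_model() {
    WitnessModel w;
    w.startup();
    w.run_pending(3);
    assert(w.produced() == 3 && !w.idle());  // one task is still queued
    w.shutdown();
    assert(w.produced() == 3 && w.idle());   // queued task ran but did nothing
}
```

The task that was already queued when shutdown began still runs, but the entry check turns it into a no-op, so no block is produced and nothing is rescheduled.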

@cogutvalera cogutvalera force-pushed the issue_1314 branch 2 times, most recently from 9e361df to be44d37 Compare October 3, 2018 22:54
@abitmore
Member

abitmore commented Oct 5, 2018

@cogutvalera the code looks fine to me, but the commit message "review changes" doesn't help when revisiting the code/changes in the future. How to Write a Git Commit Message: https://chris.beams.io/posts/git-commit/ . I'm learning this as well.

@cogutvalera
Member Author

@abitmore Thanks! I've changed the commit message.

Contributor

@pmconrad pmconrad left a comment

Ok, looks good and simple test worked ok. Thanks!

@cogutvalera
Member Author

Thanks !

@abitmore abitmore merged commit 67f313d into bitshares:develop Oct 8, 2018
@cogutvalera
Member Author

Thanks !

4 participants