Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

change SIGKILL to SIGTERM in local mode cancel trial job #3173

Merged
merged 7 commits into from
Dec 15, 2020

Conversation

J-shang
Copy link
Contributor

@J-shang J-shang commented Dec 10, 2020

No description provided.

if (isAlive(pid)) {
tkill(pid, 'SIGKILL');
}
}, 5 * 1000, pid);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with this solution, we have to wait for 5 seconds even the trail is stopped within 1 second?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, setTimeout() will not block the function, just put callback into the queue after delay.

tkill(trialJob.pid, 'SIGTERM');
const pid = trialJob.pid;
setTimeout((pid: number) => {
if (isAlive(pid)) {
Copy link
Contributor

@liuzhe-lz liuzhe-lz Dec 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if check is not atomic.
Please call tkill and catch exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed it.

setTimeout(((pid: number): void => {
tkill(pid, 'SIGKILL', (err) => {
if (err){
this.log.warning(`cancel trial job {pid: ${pid}} failed: ${err.message}`);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suddenly thought of a problem.
If the pid get reused between SIGTERM and SIGKILL, this will kill a random process.
If you don't want to elegantly solve the problem in this release, I suggest to reduce the delay to minimize risk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and other place use kill pid also has this risk, maybe elegant way is let them kill themselves, need discuss. This release reduce the delay to 1 second.

@maxxx580
Copy link

is this change targeted. for v2.0 ?

@J-shang
Copy link
Contributor Author

J-shang commented Dec 14, 2020

is this change targeted. for v2.0 ?

yes

let waitingTime = 0;
while(await isAlive(trialJob.pid)) {
if (waitingTime > 4999) {
tkill(trialJob.pid, 'SIGKILL');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about add break here, in case pid is not terminated even with sigkill?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SIGKILL can not kill process is almost impossible, if it is happened, I think there may have a serious system error. I will add catch error and break in this case.

@QuanluZhang QuanluZhang merged commit e9f832d into microsoft:master Dec 15, 2020
@J-shang J-shang deleted the killsignal branch January 7, 2021 05:51
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants