
bug: stopped job's tasks in raw_exec driver get a signal to exit only after a few minutes #2133

Closed
OferE opened this issue Dec 21, 2016 · 10 comments · Fixed by #2177


OferE commented Dec 21, 2016

Nomad version

Nomad v0.5.1

Operating system and Environment details

Ubuntu 14.04.5 LTS

Issue

I'm using raw_exec and I wrapped my executable with a script that traps signals.
This works well except at scale: when working with 20 machines and 10 jobs, not all tasks get the signal to stop fast enough.

The signal only reaches all the tasks after a few minutes, and a few minutes is far too long.
Once the signal is sent, the tasks exit correctly.
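
For reference, a stripped-down sketch of what that wrapper does (simplified and illustrative only; my real script is posted in a comment below):

#!/bin/bash
# Simplified sketch of the signal-trapping wrapper (illustrative, not the real script).
cleanup() {
   kill -TERM "$CHILD_PID" 2>/dev/null   # forward the stop signal to the wrapped process
   wait "$CHILD_PID"
}
trap 'cleanup; exit' SIGHUP SIGTERM SIGINT

"$@" &              # run the wrapped executable passed as arguments
CHILD_PID=$!

# stay in the foreground so bash can deliver the trapped signals
while kill -0 "$CHILD_PID" 2>/dev/null; do
   sleep 5
done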

Reproduction steps

I can't post my entire cluster config here, but it happens every time: a few containers are still running after the job is stopped, and the logs prove they didn't get the signal.

Just scale a cluster to 20 machines and 10 jobs and you'll see it.

Workaround

I work around this bug by adding a check in my wrapper script for

nomad status ${NOMAD_JOB_NAME}

and killing the container if the job is no longer there.
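
In sketch form (simplified; the full script is in a comment below), the check is roughly:

# Workaround sketch: if `nomad status` no longer finds the job, assume it was
# stopped and clean up the container ourselves. CID is the container id
# captured when the container was started.
if ! nomad status "${NOMAD_JOB_NAME}" &> /dev/null; then
   echo "job is gone - stopping container"
   docker stop --time=5 "$CID"
   docker rm -f "$CID"
fi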

Edit

This is a serious bug. I found out about it because I was running Docker Swarm side by side with Nomad and saw the containers still running. I hope you will run this experiment and prioritize a fix for this critical bug.

OferE changed the title from "bug: raw_exec doesn't get signal to exit on scale" to "bug: raw_exec tasks get signals to exit after few minutes." Dec 21, 2016
OferE changed the title from "bug: raw_exec tasks get signals to exit after few minutes." to "bug: stopped job's tasks in raw_exec driver get a signal to exit only after a few minutes" Dec 21, 2016
@pgporada
Contributor

Just out of curiosity, can you post your wrapper script?


OferE commented Jan 3, 2017

@pgporada Sorry for the delay. See the last lines of the script for the workaround; I hope you'll find it useful.

#!/bin/bash
# handler for the signals sent from nomad to stop the container
my_exit() 
{
   echo "killing $CID"
   docker stop --time=5 $CID # try to stop it gracefully
   docker rm -f $CID # remove the stopped container 
}

trap 'my_exit; exit' SIGHUP SIGTERM SIGINT

# for debugging
echo `env`
echo

# Building docker run command
CMD="docker run -d --name ${NOMAD_TASK_NAME}-${NOMAD_ALLOC_ID}"
for a in "$@"; do
   CMD="$CMD $a"
done

echo docker wrapper: the docker command that will run is: $CMD
echo from here on it is the container output:
echo 

# actually running the command
CID=`$CMD`

# docker logs is printed in the background
docker logs -f $CID &

# stay in the foreground so bash can handle the trapped signals; check the container every 5 seconds
while : 
   do 
      sleep 5

      # next few lines are for monitoring the container and exiting if it is not running
      CSTATUS=`docker inspect --format='{{.State.Status}}' $CID`
      if ! [ -z "${CSTATUS}" ]; then
         if [ "${CSTATUS}" != "running" ] && [ "${CSTATUS}" != "paused" ]; then
            echo "Error - container is not in desired state - status is ${CSTATUS}. exiting"
            my_exit; exit;
         fi
      else
         echo "Error - container cannot be found exiting task... $CSTATUS"
         my_exit; exit;
      fi

      # workaround nomad bug
      nomad status ${NOMAD_JOB_NAME} &> /dev/null
      RET_VAL=$?
      if [ "${RET_VAL}" -ne 0 ]; then
         echo "going to exit since task job is not running any more"
         my_exit; exit;
      fi

   done


dadgar commented Jan 3, 2017

@OferE Hey, when you say 10 jobs, can you clarify? Do you mean 10 task groups spread across 20 machines? And what are you doing: letting them run to completion or issuing a nomad stop?

Could you show any logs from a "missed signal"? Another test you can do to isolate a Nomad issue from a setup issue is to skip starting the Docker container and just run a sleep loop, then see whether the signal is received by your script. If it is, Nomad is doing the right thing.
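
Something along these lines would work as the isolation test (a bare-bones sketch: just a trap and a sleep loop, no Docker):

#!/bin/bash
# Minimal isolation test: log when Nomad's stop signal arrives, nothing else.
trap 'echo "$(date) got stop signal"; exit 0' SIGHUP SIGTERM SIGINT

while :; do
   echo "$(date) still running"
   sleep 5
done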


OferE commented Jan 5, 2017

10 jobs, each containing a few tasks.
I use the above script, and as you can see, it prints at the beginning of my_exit.
The logs don't show the "killing" message until a few minutes later.


dadgar commented Jan 5, 2017

@OferE Can you please give us client logs from where this occurs and the time at which it happens? I think this may be related to #2119.


OferE commented Jan 5, 2017

Hi, I'll try to reproduce next week, as I won't be at work this week and I have already worked around this (see the code above).
I need to deploy the cluster without my fix.


dadgar commented Jan 5, 2017

@OferE Thank you!


dadgar commented Jan 11, 2017

Closed by #2177.

@dadgar dadgar closed this as completed Jan 11, 2017

OferE commented Jan 13, 2017

Thank you so much for solving this, highly appreciated!
Sorry for not being able to generate the client logs in time.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 16, 2022