Nomad jobs not being killed on ubuntu #4218

Closed

jerrypeng opened this issue Apr 26, 2018 · 21 comments

@jerrypeng

jerrypeng commented Apr 26, 2018

If you have a question, prepend your issue with [question] or preferably use the nomad mailing list.

If filing a bug please include the following:

Nomad version

0.7.0 and 0.8.1

Operating system and Environment details

ubuntu 16.04

Issue

I noticed an interesting problem with Nomad running on Ubuntu.

I have a nomad job running via the raw_exec driver. When I try to kill the job via "nomad stop", the job is not killed but gets stuck between the "stop" state and the "running" state.

Here is some info on the job:

$ nomad status test4501e9e1c-aee9-4df4-ad88-4571e7d9e9bf-0

ID            = test4501e9e1c-aee9-4df4-ad88-4571e7d9e9bf-0
Name          = test4-0
Submit Date   = 2018-04-26T05:40:59Z
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = dead (stopped)
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
test4-0     0       0         1        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
12f136bf  672cdb7e  test4-0     0        stop     running  34s ago  6s ago


$ nomad status 12f136bf

ID                  = 12f136bf
Eval ID             = 20f04f99
Name                = test4501e9e1c-aee9-4df4-ad88-4571e7d9e9bf-0.test4-0[0]
Node ID             = 672cdb7e
Job ID              = test4501e9e1c-aee9-4df4-ad88-4571e7d9e9bf-0
Job Version         = 0
Client Status       = running
Client Description  = <none>
Desired Status      = stop
Desired Description = alloc not needed due to job update
Created             = 1m46s ago
Modified            = 1m18s ago

Task "test4-0" is "running"
Task Resources
CPU       Memory   Disk     IOPS  Addresses
2000 MHz  563 MiB  300 MiB  0     shell_port: 10.128.0.2:20167
                                  ckptmgr: 10.128.0.2:29970
                                  tmaster_stats: 10.128.0.2:30981
                                  scheduler: 10.128.0.2:29523
                                  master: 10.128.0.2:26743
                                  metrics_cache_s: 10.128.0.2:29581
                                  metrics_mgr: 10.128.0.2:28918
                                  tmaster_ctl: 10.128.0.2:23518
                                  metrics_cache_m: 10.128.0.2:24154

Task Events:
Started At     = 2018-04-26T05:40:59Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2018-04-26T05:41:26Z  Killing     Sent interrupt. Waiting 5s before force killing
2018-04-26T05:40:59Z  Started     Task started by client
2018-04-26T05:40:59Z  Task Setup  Building Task Directory

As you can see the "Desired" state for the job is "stop" but the actual state is "running". The job is stuck in this condition. I have waited a while but the job never actually gets killed or gets transitioned into the "stop" state.

The above output is only one of many jobs/containers that make up a Heron job or topology.

I only see this problem occurring on Ubuntu, not on CentOS. I have tested with both Nomad 0.7.0 and the latest 0.8.1; both releases seem to have this problem on Ubuntu.

This problem is impacting Heron Standalone cluster deployments (which use Nomad for scheduling) on Ubuntu. Users have said that they had to manually kill all the processes since Nomad failed to kill them.

Reproduction steps

Submit a Heron topology to a Heron standalone cluster that uses Nomad. Steps are described here:

https://apache.github.io/incubator-heron/docs/operators/deployment/schedulers/standalone/

Nomad Server logs (if appropriate)

nomad_server_log.txt

Nomad Client logs (if appropriate)

nomad_clien_log.txt

Job file (if appropriate)

jobfile.txt

@jerrypeng
Author

jerrypeng commented Apr 26, 2018

The nomad job described above basically runs the following script:


#!/bin/sh

set -e
set -x

CORE_RELEASE_FILE="heron-core.tar.gz"
CORE_RELEASE_DIR="heron-core"
TOPOLOGY_PACKAGE_FILE="topology.tar.gz"

# Create working directory if it does not exist
mkdir -p ${HERON_NOMAD_WORKING_DIR}

# Go to working directory
cd ${HERON_NOMAD_WORKING_DIR}

if [ "$HERON_USE_CORE_PACKAGE_URI" == "true" ]; then
  # download and extract heron core package
  curl ${HERON_CORE_PACKAGE_URI} -o ${CORE_RELEASE_FILE}
  tar zxf ${CORE_RELEASE_FILE} && rm -rf ${CORE_RELEASE_FILE}
else
  # link the heron core package directory
  ln -s ${HERON_CORE_PACKAGE_DIR} ${CORE_RELEASE_DIR}
fi

# download and extract heron topology package
${HERON_TOPOLOGY_DOWNLOAD_CMD}

# launch heron executor in the background and forward TERM/INT to it
trap 'kill -TERM $PID' TERM INT
${HERON_EXECUTOR_CMD} &
PID=$!
wait $PID
trap - TERM INT
wait $PID
EXIT_STATUS=$?

Since nomad doesn't yet support running a task and all subsequently started processes in one process group, I attempt to trap the kill signal and kill the child process that is spawned by this script. I think that is probably where the problem occurs.

It seems that the process running the script is killed after the nomad job is stopped, but the process spawned by ${HERON_TOPOLOGY_DOWNLOAD_CMD} remains, and the nomad job stays stuck between those two states.

After I manually killed the process spawned by ${HERON_TOPOLOGY_DOWNLOAD_CMD}, the status of the job finally went to "complete".
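
As an aside, a slightly more defensive version of this wrapper signals the child's whole process group instead of just the single PID, so anything the child starts inside that group is reached as well. This is only a minimal sketch following the structure of the script above, not what Heron actually ships:

#!/bin/sh

# The trap fires while the shell is blocked in wait. kill -- -"$PID" targets
# the process group whose PGID equals $PID, which only exists if the child has
# made itself a group leader; otherwise fall back to signalling the PID alone.
trap 'kill -TERM -- -"$PID" 2>/dev/null || kill -TERM "$PID"' TERM INT

${HERON_EXECUTOR_CMD} &
PID=$!

wait $PID        # returns early when a trapped signal arrives
trap - TERM INT
wait $PID        # reap the child after the trap has run
EXIT_STATUS=$?

If the child never changes its process group, the fallback behaves like the original script; if it does, the group-targeted kill is what actually reaches it.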

@jerrypeng
Author

jerrypeng commented Apr 26, 2018

I also tried running this simple script on ubuntu:

set -x
set -e

trap 'kill -TERM $PID' TERM INT
tail -f /dev/null &
PID=$!
wait $PID
trap - TERM INT
wait $PID
EXIT_STATUS=$?

This was to see if there was any issue with the trapping code. The above script seems to work correctly: when I kill the process running test.sh, the process running "tail -f /dev/null" is killed as expected.
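
For anyone who wants to repeat that check outside of Nomad, a hedged way to exercise the trap from a terminal (the sleep durations and the pgrep pattern are only illustrative):

sh test.sh &
SH_PID=$!
sleep 1

kill -TERM "$SH_PID"      # the trap in test.sh should forward TERM to tail
sleep 1

pgrep -f 'tail -f /dev/null' || echo "tail was killed as expected"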

@chelseakomlo
Contributor

Thanks for submitting this issue and the detailed information; we'll take a further look into this.

@jerrypeng
Author

@chelseakomlo thanks for looking into this.

FYI I just tested with the latest 0.8.3 release. The results are the same.

@dadgar
Contributor

dadgar commented May 23, 2018

@jerrypeng Can you try with this binary? It is a linux amd64 build of master as of e108f73: nomad.zip

From that build I ran this job file:

job "example" {
  datacenters = ["dc1"]
  type        = "batch"

  group "cache" {
    count = 1

    task "redis" {
      driver = "raw_exec"

      config {
        command = "/bin/sh"
        args    = ["local/script.sh"]
      }

      template {
        data = <<EOH
set -x
set -e

trap 'kill -TERM $PID' TERM INT
tail -f /dev/null &
sleep 1000 &
PID=$!
wait $PID
trap - TERM INT
wait $PID
EXIT_STATUS=$?
EOH

        destination = "local/script.sh"
        perms       = "777"
      }
    }
  }
}

When I do a nomad stop, both the sleep and tail processes get killed. Since Nomad 0.8.2, all the processes should be started in the same process group: #3572

You can see that they are all in the same process group here:
[Screenshot: ps output showing all the task's processes sharing one process group]
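
For anyone following along, the grouping is easy to verify from a shell, and the group is also something a single kill can target. A generic illustration, with a made-up PGID value:

PGID=21230    # example value; substitute the PGID of the nomad executor

# Show every process that shares that process group, as in the ps output
# elsewhere in this thread.
ps xao pid,ppid,pgid,comm | awk -v g="$PGID" '$3 == g'

# A negative PID argument makes kill signal the whole process group at once,
# which is one way a supervisor can clean up everything in the group together.
kill -TERM -- "-$PGID"

The catch, as the rest of the thread shows, is that a child which creates its own process group stops sharing that PGID, so a group-targeted kill no longer reaches it.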

@jerrypeng
Author

@dadgar thank you! I will try it out.

@jerrypeng
Author

jerrypeng commented May 23, 2018

@dadgar So I tried with the latest Nomad build you recommended:

ubuntu@ip-172-31-15-70:~$ ~/.heron/bin/heron-nomad version
Nomad v0.8.4-dev

I am still getting the problem I mentioned above.

Using that version of Nomad, I submitted a Heron topology application and then killed the application:

ubuntu@ip-172-31-15-70:~$ ~/.heron/bin/heron-nomad status 
ID                                           Type     Priority  Status          Submit Date
apiserver                                    service  50        running         2018-05-23T22:06:57Z
heron-tools                                  service  50        running         2018-05-23T22:06:59Z
test1a9043448-693c-4bee-9e85-1f2161dce424-0  service  50        dead (stopped)  2018-05-23T22:07:18Z
test1a9043448-693c-4bee-9e85-1f2161dce424-1  service  50        dead (stopped)  2018-05-23T22:07:18Z
test1a9043448-693c-4bee-9e85-1f2161dce424-2  service  50        dead (stopped)  2018-05-23T22:07:18Z
test1a9043448-693c-4bee-9e85-1f2161dce424-3  service  50        dead (stopped)  2018-05-23T22:07:18Z
test1a9043448-693c-4bee-9e85-1f2161dce424-4  service  50        dead (stopped)  2018-05-23T22:07:18Z

The nomad jobs that represent executors in Heron display as killed. However, upon further inspection of one of the jobs:

ubuntu@ip-172-31-15-70:~$ ~/.heron/bin/heron-nomad status test1a9043448-693c-4bee-9e85-1f2161dce424-1
ID            = test1a9043448-693c-4bee-9e85-1f2161dce424-1
Name          = test1-1
Submit Date   = 2018-05-23T22:07:18Z
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = dead (stopped)
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
test1-1     0       0         1        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
07fa11b5  1243af6d  test1-1     0        stop     running  10m22s ago  7m24s ago

As you can see the "Desired" state is "stop" but the "Status" remains "running". The jobs are stuck in this weird state. Some more details:

ubuntu@ip-172-31-15-70:~$ ~/.heron/bin/heron-nomad status 07fa11b5
ID                  = 07fa11b5
Eval ID             = 68753efb
Name                = test1a9043448-693c-4bee-9e85-1f2161dce424-1.test1-1[0]
Node ID             = 1243af6d
Job ID              = test1a9043448-693c-4bee-9e85-1f2161dce424-1
Job Version         = 0
Client Status       = running
Client Description  = <none>
Desired Status      = stop
Desired Description = alloc not needed due to job update
Created             = 11m35s ago
Modified            = 8m37s ago

Task "test1-1" is "running"
Task Resources
CPU       Memory   Disk     IOPS  Addresses
2000 MHz  563 MiB  300 MiB  0     master: 172.31.15.70:30114
                                  ckptmgr: 172.31.15.70:26987
                                  metrics_cache_m: 172.31.15.70:31885
                                  scheduler: 172.31.15.70:28023
                                  tmaster_ctl: 172.31.15.70:23355
                                  tmaster_stats: 172.31.15.70:29283
                                  shell_port: 172.31.15.70:27147
                                  metrics_cache_s: 172.31.15.70:30459
                                  metrics_mgr: 172.31.15.70:21734

Task Events:
Started At     = 2018-05-23T22:07:18Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2018-05-23T22:10:15Z  Killing     Sent interrupt. Waiting 5s before force killing
2018-05-23T22:07:18Z  Started     Task started by client
2018-05-23T22:07:18Z  Task Setup  Building Task Directory
2018-05-23T22:07:18Z  Received    Task received by client

So the process also doesn't get killed. The "force killing" mentioned in the "Description" never seems to happen.

I see that all the processes are in the same process group as the nomad executor:

ubuntu@ip-172-31-15-70:~$ ps f -g 11470
   PID TTY      STAT   TIME COMMAND
 11470 ?        Ssl    0:01 /home/ubuntu/.heron/bin/heron-nomad executor {"LogFile":"/tmp/slave/alloc/8cb4f96e-088e-233a-909c-4a9720f10bab/test1-4/executor.out","LogLevel":"DEBUG"}
 11705 ?        Sl     0:00 python2.7 ./heron-core/bin/heron-executor --topology-name=test1 --topology-id=test1a9043448-693c-4bee-9e85-1f2161dce424 --topology-defn-file=test1.defn --state-manager-connection=127.0.0.1:2181 --state-manage
 11965 ?        Sl     0:02  \_ /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xmx1024M -XX:+PrintCommandLineFlags -verbosegc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCCause -XX:+UseGCLogFileRotation -XX
 12000 ?        Rl     2:51  \_ ./heron-core/bin/heron-stmgr --topology_name=test1 --topology_id=test1a9043448-693c-4bee-9e85-1f2161dce424 --topologydefn_file=test1.defn --zkhostportlist=127.0.0.1:2181 --zkroot=/heron --stmgr_id=stmgr-4
 12023 ?        Sl    13:28  \_ /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xmx320M -Xms320M -Xmn160M -XX:MaxMetaspaceSize=128M -XX:MetaspaceSize=128M -XX:ReservedCodeCacheSize=64M -XX:+CMSScavengeBeforeRemark -XX:TargetSurvivorRatio=90
 12066 ?        S      0:00  \_ python2.7 ./heron-core/bin/heron-shell --port=27576 --log_file_prefix=log-files/heron-shell-4.log --secret=test1a9043448-693c-4bee-9e85-1f2161dce424

But the parent nomad executor never gets killed either. Any thoughts?

@jerrypeng
Author

jerrypeng commented May 23, 2018

@dadgar so actually something weird is going on. The "python2.7 ./heron-core/bin/heron-executor" process is not a child of the nomad executor:

ubuntu@ip-172-31-15-70:~$ ps f -g 19656
   PID TTY      STAT   TIME COMMAND
 19656 ?        Ssl    0:01 /home/ubuntu/.heron/bin/heron-nomad executor {"LogFile":"/tmp/slave/alloc/08c9215c-8752-38ef-f075-5f05e44b07e4/test1-4/executor.out","LogLevel":"DEBUG"}
 19886 ?        Sl     0:00 python2.7 ./heron-core/bin/heron-executor --topology-name=test1 --topology-id=test14e905c1a-0b20-4f49-aee7-7e2a96a89b78 --topology-defn-file=test1.defn --state-manager-connection=127.0.0.1:2181 --state-manage
 20208 ?        Sl     0:02  \_ /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xmx1024M -XX:+PrintCommandLineFlags -verbosegc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCCause -XX:+UseGCLogFileRotation -XX
 20216 ?        Rl     3:48  \_ ./heron-core/bin/heron-stmgr --topology_name=test1 --topology_id=test14e905c1a-0b20-4f49-aee7-7e2a96a89b78 --topologydefn_file=test1.defn --zkhostportlist=127.0.0.1:2181 --zkroot=/heron --stmgr_id=stmgr-4
 20234 ?        Sl    17:56  \_ /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xmx320M -Xms320M -Xmn160M -XX:MaxMetaspaceSize=128M -XX:MetaspaceSize=128M -XX:ReservedCodeCacheSize=64M -XX:+CMSScavengeBeforeRemark -XX:TargetSurvivorRatio=90
 20251 ?        S      0:00  \_ python2.7 ./heron-core/bin/heron-shell --port=31501 --log_file_prefix=log-files/heron-shell-4.log --secret=test14e905c1a-0b20-4f49-aee7-7e2a96a89b78
ubuntu@ip-172-31-15-70:~$ pstree 19886
python2.7─┬─heron-stmgr───2*[{heron-stmgr}]
          ├─java───45*[{java}]
          ├─java───27*[{java}]
          ├─python2.7
          └─7*[{python2.7}]
ubuntu@ip-172-31-15-70:~$ pstree 19886 -s
init───python2.7─┬─heron-stmgr───2*[{heron-stmgr}]
                 ├─java───45*[{java}]
                 ├─java───27*[{java}]
                 ├─python2.7
                 └─7*[{python2.7}]

but how come the example you ran:

ubuntu@ip-172-31-15-70:~$ ps f -g 21230
   PID TTY      STAT   TIME COMMAND
 21230 ?        Ssl    0:00 /home/ubuntu/.heron/bin/heron-nomad executor {"LogFile":"/tmp/slave/alloc/2ed1f5df-34aa-2868-98e7-b6146173d558/redis/executor.out","LogLevel":"DEBUG"}
 21247 ?        S      0:00  \_ /bin/sh local/script.sh
 21248 ?        S      0:00      \_ tail -f /dev/null
 21249 ?        S      0:00    

ubuntu@ip-172-31-15-70:~$ pstree 21247 -s
init───heron-nomad───heron-nomad───sh─┬─sleep
                                      └─tail

The shell script is a child process of the nomad executor. Why is that the case for the example script but not for the heron-nomad.sh script, when the two scripts are pretty much identical?

@jerrypeng
Author

Upon further investigation I noticed the following. When the nomad jobs of a Heron topology are running, the /bin/sh run_heron_executor.sh process is the parent of heron-executor and all its subprocesses, which is correct and expected:

ubuntu@ip-172-31-15-70:~$ ps f -g 22869
   PID TTY      STAT   TIME COMMAND
 22869 ?        Ssl    0:00 /home/ubuntu/.heron/bin/heron-nomad executor {"LogFile":"/tmp/slave/alloc/1fb2582c-33da-58ea-79e7-a0484e093a9a/test1-4/executor.out","LogLevel":"DEBUG"}
 22994 ?        S      0:00  \_ /bin/sh run_heron_executor.sh
 23105 ?        Sl     0:00      \_ python2.7 ./heron-core/bin/heron-executor --topology-name=test1 --topology-id=test1bf503684-5bde-417e-9dee-17940ceb5a40 --topology-defn-file=test1.defn --state-manager-connection=127.0.0.1:2181 --stat
 23480 ?        Sl     0:01          \_ /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xmx1024M -XX:+PrintCommandLineFlags -verbosegc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCCause -XX:+UseGCLogFileRota
 23500 ?        Sl     0:04          \_ ./heron-core/bin/heron-stmgr --topology_name=test1 --topology_id=test1bf503684-5bde-417e-9dee-17940ceb5a40 --topologydefn_file=test1.defn --zkhostportlist=127.0.0.1:2181 --zkroot=/heron --stmgr_id
 23507 ?        Sl     0:27          \_ /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xmx320M -Xms320M -Xmn160M -XX:MaxMetaspaceSize=128M -XX:MetaspaceSize=128M -XX:ReservedCodeCacheSize=64M -XX:+CMSScavengeBeforeRemark -XX:TargetSurvivor
 23520 ?        S      0:00          \_ python2.7 ./heron-core/bin/heron-shell --port=23752 --log_file_prefix=log-files/heron-shell-4.log --secret=test1bf503684-5bde-417e-9dee-17940ceb5a40

However, after I kill the topology and the corresponding nomad jobs get stopped:

22869 ?        Ssl    0:01 /home/ubuntu/.heron/bin/heron-nomad executor {"LogFile":"/tmp/slave/alloc/1fb2582c-33da-58ea-79e7-a0484e093a9a/test1-4/executor.out","LogLevel":"DEBUG"}
 23105 ?        Sl     0:00 python2.7 ./heron-core/bin/heron-executor --topology-name=test1 --topology-id=test1bf503684-5bde-417e-9dee-17940ceb5a40 --topology-defn-file=test1.defn --state-manager-connection=127.0.0.1:2181 --state-manage
 23480 ?        Sl     0:01  \_ /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xmx1024M -XX:+PrintCommandLineFlags -verbosegc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCCause -XX:+UseGCLogFileRotation -XX
 23500 ?        Rl     1:16  \_ ./heron-core/bin/heron-stmgr --topology_name=test1 --topology_id=test1bf503684-5bde-417e-9dee-17940ceb5a40 --topologydefn_file=test1.defn --zkhostportlist=127.0.0.1:2181 --zkroot=/heron --stmgr_id=stmgr-4
 23507 ?        Sl     5:53  \_ /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xmx320M -Xms320M -Xmn160M -XX:MaxMetaspaceSize=128M -XX:MetaspaceSize=128M -XX:ReservedCodeCacheSize=64M -XX:+CMSScavengeBeforeRemark -XX:TargetSurvivorRatio=90
 23520 ?        S      0:00  \_ python2.7 ./heron-core/bin/heron-shell --port=23752 --log_file_prefix=log-files/heron-shell-4.log --secret=test1bf503684-5bde-417e-9dee-17940ceb5a40

The "/bin/sh run_heron_executor.sh" process disappears or gets killed but the subprocess don't. Very weird.

@dadgar
Contributor

dadgar commented May 24, 2018

@jerrypeng I am having quite a hard time following the heron standalone guide. I am running this:

$ heron submit standalone \
  ~/.heron/examples/heron-streamlet-examples.jar \
  com.twitter.heron.examples.streamlet.WindowedWordCountTopology \
  WindowedWordCount

And I see the task is failing:

nomad job status WindowedWordCountf29268cd-32e9-4b2f-846a-23eb3fc64391-0
ID            = WindowedWordCountf29268cd-32e9-4b2f-846a-23eb3fc64391-0
Name          = WindowedWordCount-0
Submit Date   = 2018-05-24T00:53:23Z
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group           Queued  Starting  Running  Failed  Complete  Lost
WindowedWordCount-0  0       0         0        2       0         0

Future Rescheduling Attempts
Task Group           Eval ID   Eval Time
WindowedWordCount-0  2d426fb6  33s from now

Allocations
ID        Node ID   Task Group           Version  Desired  Status  Created   Modified
524860c4  b04bdc67  WindowedWordCount-0  0        run      failed  1m1s ago  27s ago
074c7bd0  b04bdc67  WindowedWordCount-0  0        run      failed  2m7s ago  1m31s ago
vagrant@linux:/opt/gopath/src/github.com/hashicorp/nomad$ nomad logs -stderr 074c7bd0-318e-5b5f-157c-3e9e18205466
+ CORE_RELEASE_FILE=heron-core.tar.gz
+ CORE_RELEASE_DIR=heron-core
+ TOPOLOGY_PACKAGE_FILE=topology.tar.gz
+ mkdir -p /home/vagrant/.herondata/topologies/standalone/vagrant/WindowedWordCountf29268cd-32e9-4b2f-846a-23eb3fc64391/container-0
+ cd /home/vagrant/.herondata/topologies/standalone/vagrant/WindowedWordCountf29268cd-32e9-4b2f-846a-23eb3fc64391/container-0
+ [ false == true ]
run_heron_executor.sh: 16: [: false: unexpected operator
+ ln -s /home/vagrant/.heron/dist/heron-core heron-core
+ ./heron-core/bin/heron-downloader http://null:9000/api/v1/file/download/3f55d20b-f118-4e9d-a0d4-49d7fb855e3a-topology.tar.gz .
Exception in thread "main" java.net.UnknownHostException: null
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at java.net.Socket.connect(Socket.java:538)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
        at sun.net.www.http.HttpClient.New(HttpClient.java:339)
        at sun.net.www.http.HttpClient.New(HttpClient.java:357)
        at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1220)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:984)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1564)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
        at java.net.URL.openStream(URL.java:1045)
        at com.twitter.heron.downloader.HttpDownloader.download(HttpDownloader.java:25)
        at com.twitter.heron.downloader.DownloadRunner.main(DownloadRunner.java:42)
+ CORE_RELEASE_FILE=heron-core.tar.gz
+ CORE_RELEASE_DIR=heron-core
+ TOPOLOGY_PACKAGE_FILE=topology.tar.gz
+ mkdir -p /home/vagrant/.herondata/topologies/standalone/vagrant/WindowedWordCountf29268cd-32e9-4b2f-846a-23eb3fc64391/container-0
+ cd /home/vagrant/.herondata/topologies/standalone/vagrant/WindowedWordCountf29268cd-32e9-4b2f-846a-23eb3fc64391/container-0
+ [ false == true ]
run_heron_executor.sh: 16: [: false: unexpected operator
+ ln -s /home/vagrant/.heron/dist/heron-core heron-core
ln: failed to create symbolic link 'heron-core/heron-core': File exists
+ CORE_RELEASE_FILE=heron-core.tar.gz
+ CORE_RELEASE_DIR=heron-core
+ TOPOLOGY_PACKAGE_FILE=topology.tar.gz
+ mkdir -p /home/vagrant/.herondata/topologies/standalone/vagrant/WindowedWordCountf29268cd-32e9-4b2f-846a-23eb3fc64391/container-0
+ cd /home/vagrant/.herondata/topologies/standalone/vagrant/WindowedWordCountf29268cd-32e9-4b2f-846a-23eb3fc64391/container-0
+ [ false == true ]
run_heron_executor.sh: 16: [: false: unexpected operator
+ ln -s /home/vagrant/.heron/dist/heron-core heron-core
ln: failed to create symbolic link 'heron-core/heron-core': File exists

Job Inspect: https://gist.github.com/dadgar/2316b24f0c985c7b46fc510f02fdf01e

Do you have a VM or simpler reproduction steps?

@dadgar
Contributor

dadgar commented May 24, 2018

@jerrypeng If you repeat the steps from #4218 (comment), can you show what the process group is for each of those? I wonder if the python executor is changing its process group.

@jerrypeng
Author

@dadgar are you trying to submit to an existing Nomad cluster or did you also start the cluster via "heron-admin standalone cluster"?

@dadgar
Contributor

dadgar commented May 24, 2018

@jerrypeng So I ran: heron-admin standalone cluster start

I followed the guide here exactly: https://apache.github.io/incubator-heron/docs/operators/deployment/schedulers/standalone/

The only difference is that I replaced heron-nomad with the 0.8.4-dev binary. I am also running zookeeper in the started Nomad cluster before starting that topology via:

job "zookeeper" {
  datacenters = ["dc1"]

  group "zookeeper" {
    task "zookeeper" {
      driver = "docker"

      config {
        image = "zookeeper:3.3"
	network_mode = "host"
      }

      resources {
        cpu    = 500
        memory = 256
        network {
          mbits = 10
          port "client" {
	    static = 2181
	  }
        }
      }
    }
  }
}

Did you see that the ENV for the downloader has null in it?

@jerrypeng
Author

jerrypeng commented May 24, 2018

@dadgar ya I saw the null in the URL. That usually means that the hostname could not be determined. Are you running this on a VM via vagrant? I will try to reproduce.

Ya, you are right about the heron-executor changing the process group:

ubuntu@ip-172-31-15-70:~$ ps f -g 27357
   PID TTY      STAT   TIME COMMAND
 27357 ?        Ssl    0:02 /home/ubuntu/.heron/bin/heron-nomad executor {"LogFile":"/tmp/slave/alloc/b79d8119-84e8-4235-36ca-f510c2a30379/test1-3/executor.out","LogLevel":"DEBUG"}
 27455 ?        S      0:00  \_ /bin/sh run_heron_executor.sh
 27623 ?        Sl     0:00      \_ python2.7 ./heron-core/bin/heron-executor --topology-name=test1 --topology-id=test1903256a5-4e27-4cd3-8118-f8eb8496b2f3 --topology-defn-file=test1.defn --state-manager-connection=127.0.0.1:2181 --stat
 27959 ?        Sl    15:46          \_ /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xmx320M -Xms320M -Xmn160M -XX:MaxMetaspaceSize=128M -XX:MetaspaceSize=128M -XX:ReservedCodeCacheSize=64M -XX:+CMSScavengeBeforeRemark -XX:TargetSurvivor
 27992 ?        S      0:00          \_ python2.7 ./heron-core/bin/heron-shell --port=25354 --log_file_prefix=log-files/heron-shell-3.log --secret=test1903256a5-4e27-4cd3-8118-f8eb8496b2f3
 28012 ?        Rl     3:24          \_ ./heron-core/bin/heron-stmgr --topology_name=test1 --topology_id=test1903256a5-4e27-4cd3-8118-f8eb8496b2f3 --topologydefn_file=test1.defn --zkhostportlist=127.0.0.1:2181 --zkroot=/heron --stmgr_id
 28036 ?        Sl     0:02          \_ /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xmx1024M -XX:+PrintCommandLineFlags -verbosegc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCCause -XX:+UseGCLogFileRota

ubuntu@ip-172-31-15-70:~$ ps xao pid,ppid,pgid | grep 27357
 27357  26975  27357
 27455  27357  27455
ubuntu@ip-172-31-15-70:~$ ps xao pid,ppid,pgid | grep 27455
 27455  27357  27455
 27623  27455  27623
ubuntu@ip-172-31-15-70:~$ ps xao pid,ppid,pgid | grep 27623
 27623  27455  27623
 27959  27623  27623
 27992  27623  27623
 28012  27623  27623
 28036  27623  27623

I just realized that in the code of heron_executor.py there is code to create a new process group:

https://github.com/apache/incubator-heron/blob/master/heron/executor/src/python/heron_executor.py#L1082
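
To make the effect concrete, here is a minimal sketch (not Heron code) of what a child creating its own process group does to a group-wide kill. setsid stands in for the os.setpgrp() call in heron_executor.py, and the sleeps are placeholders:

#!/bin/sh
# Assumes this script is itself a process group leader, as it is when started
# by raw_exec on Nomad 0.8.2+ or as a foreground job from an interactive shell.

sleep 300 &            # stays in this script's process group
SAME_GROUP=$!

setsid sleep 300 &     # gets a new session and process group, roughly what
DETACHED=$!            # os.setpgrp() does inside heron-executor

# Compare the PGID column: the detached child no longer shares ours.
ps -o pid,ppid,pgid,comm -p "$$,$SAME_GROUP,$DETACHED"

# A group-targeted kill reaches the first sleep (and this script) but never
# the detached child, which is exactly the kind of process that gets leaked.
kill -TERM -- "-$$"

That matches the pstree output earlier in the thread, where heron-executor ends up re-parented to init once run_heron_executor.sh goes away.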

@jerrypeng
Author

jerrypeng commented May 24, 2018

At the same time, though I am not a bash expert, shouldn't the trap clause in the script kill the heron-executor process when run_heron_executor.sh gets killed by Nomad as the job is stopped?

@jerrypeng
Author

@dadgar Kubernetes supports a prepare phase that executes some commands prior to the main service you want to invoke. Is there something similar in Nomad, i.e. a task that gets run before your "main" task? If there is, then all the downloading of jars and creation of directories done in run_heron_executor.sh could be moved to that phase, and the "main" task could directly invoke the heron-executor command, sidestepping this problem.

@jerrypeng
Author

jerrypeng commented May 24, 2018

So I can also reproduce the problem with the following job spec:

job "example" {
  datacenters = ["dc1"]
  type        = "batch"

  group "cache" {
    count = 1

    task "redis" {
      driver = "raw_exec"

      config {
        command = "/bin/sh"
        args    = ["local/script.sh"]
      }

      template {
        data = <<EOH
set -x
set -e

trap 'kill -TERM $PID' TERM INT
/tmp/test.py &
PID=$!
wait $PID
trap - TERM INT
wait $PID
EXIT_STATUS=$?
EOH

        destination = "local/script.sh"
        perms       = "777"
      }
    }
  }
}

and python file test.py:

#!/usr/bin/env python
import time
import subprocess
import os

# Detach into a new process group, mimicking what heron_executor.py does.
os.setpgrp()
process = subprocess.Popen(["sleep", "100000"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                           env=None, bufsize=1)
process = subprocess.Popen(["tail", "-f", "/dev/null"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                           env=None, bufsize=1)

while True:
  time.sleep(1)

Though I am not sure whether you would consider this an application issue or an issue with nomad.

The test.py process and its child processes are never killed by Nomad, and the nomad job stays in this limbo state:

ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
68e9f44f  6772973c  cache       2        stop     running   1m13s ago  1m3s ago

@dadgar
Contributor

dadgar commented Jun 1, 2018

@jerrypeng Okay, thanks for getting me the reproducers. I have good news and bad news.

The end outcome is that in the most common scenario, where Nomad is being run as root on Linux, this is entirely solved. When Nomad is not run as root or not on Linux, there isn't anything Nomad can do, because the task is essentially creating a daemon and not forwarding signals properly, so it gets leaked. I added a note about that in the docs.

Essentially if you are going to daemonize, you need to forward signals properly such that the grandchild process gets killed.
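
To spell out what forwarding signals properly can look like on the daemonizing side, here is a hedged sketch in the style of run_heron_executor.sh; the child commands are hypothetical placeholders and this is not the change Heron ultimately made:

#!/bin/sh
# Assume this script has been detached into its own process group (for example
# via setsid, analogous to os.setpgrp in heron_executor.py), so any TERM sent
# by Nomad or a wrapper must be passed on to its children explicitly.

long_running_server &     # hypothetical workload
CHILD1=$!
helper_process &          # hypothetical second child
CHILD2=$!

# Forward TERM/INT to the children we started, reap them, then exit 143
# (128 + SIGTERM) so the caller sees a signal-style exit status.
trap 'kill -TERM "$CHILD1" "$CHILD2" 2>/dev/null; wait; exit 143' TERM INT

wait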

@dadgar dadgar closed this as completed Jun 1, 2018
@jerrypeng
Author

@dadgar thanks a ton for taking the time to investigate this!

@dadgar
Contributor

dadgar commented Jun 4, 2018

@jerrypeng Of course! Give the RC a shot and let me know if it doesn't fix the issue when you run as root on linux: https://releases.hashicorp.com/nomad/0.8.4-rc1/

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 29, 2022