Task fails with "failed: Wait returned exit code 3762504530," #955

Closed
dotcomputercraft opened this issue Mar 21, 2016 · 34 comments

Comments

@dotcomputercraft

Nomad version - v0.3.1

Operating system and Environment details - Windows

Issue - I am trying to run an exe using the Nomad Raw Fork/Exec driver. The artifact source is a zip file that contains the executable and its dependencies (i.e. Windows DLLs); the job config is below. When I submit the job, I see the following error on the client:

2016/03/21 21:56:55 [INFO] client: Restarting task "microservice" for alloc "5a62cc7f-1837-174d-5a9f-fe6999b03f37" in 16.622121009s
2016/03/21 21:56:55 [DEBUG] plugin: C:\nomad\nomad.exe: plugin process exited
2016/03/21 21:56:55 [DEBUG] client: updated allocations at index 11029 (pulled 0) (filtered 9)
2016/03/21 21:56:55 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 9)
2016/03/21 21:57:11 [DEBUG] plugin: starting plugin: C:\nomad\nomad.exe []string{"C:\nomad\nomad.exe", "executor", "C:\nomad\data\alloc\5a62cc7f-1837-174d-5a9f-fe6999b03f37\microservice\microservice-executor.out"}
2016/03/21 21:57:44 [DEBUG] plugin: waiting for RPC address for: C:\nomad\nomad.exe
2016/03/21 21:57:44 [DEBUG] plugin: nomad.exe: 2016/03/21 21:57:44 [DEBUG] plugin: plugin address: tcp 127.0.0.1:14000
2016/03/21 21:57:44 [DEBUG] driver.raw_exec: started process with pid: 3756
2016/03/21 21:57:44 [INFO] client: task "microservice" for alloc "5a62cc7f-1837-174d-5a9f-fe6999b03f37" failed: Wait returned exit code 3762504530, signal 0, and error
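
A note on the exit code: Windows exit codes are unsigned 32-bit values, so the number in the log is easier to recognise in hex. A minimal Go sketch of the conversion follows; the helper name formatExitCode is made up for illustration and is not part of Nomad.

```go
package main

import "fmt"

// formatExitCode renders a Windows process exit code as an unsigned
// 32-bit hex value. Hypothetical helper; not a Nomad function.
func formatExitCode(code uint32) string {
	return fmt.Sprintf("0x%08X", code)
}

func main() {
	// Exit code copied from the client log above.
	fmt.Println(formatExitCode(3762504530)) // prints 0xE0434352
}
```

0xE0434352 is the code a .NET process typically exits with after an unhandled managed exception, which suggests looking at the task's own stderr for the real error.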

Job Config

job "windows-service" {
    region = "global"
    datacenters = ["dc1"]
    type = "service"
    constraint {
        attribute = "${attr.kernel.name}"
        value = "windows"
    }
    update {
        stagger = "30s"
        max_parallel = 1
    }
    group "microservices" {
        count = 1
        task "microservice" {
            driver = "raw_exec"
            config {
                artifact_source = "https://<ARTIFACT_LOCATION>/build.zip"
                command = "OwinSample.Host.exe"
            }
            env {
                DB_HOST = "db01.example.com"
                DB_USER = "web"
                DB_PASSWORD = "loremipsum"
            }
            resources {
                cpu = 500
                memory = 128
            }
        }
    }
}

@mohitarora - opened issue https://github.com/hashicorp/nomad/issues/923 that fixed one of the problems we had.

When I execute OwinSample.Host.exe by hand, the executable starts fine, i.e.

C:\\nomad\\data\\alloc\\5a62cc7f-1837-174d-5a9f-fe6999b03f37\\microservice\\OwinSample.Host.exe

Do I have to specify a working directory? Or is there a Go issue that is preventing the process from starting properly on Windows? The error logs seem to indicate that the process does not know its working directory, but I'm not 100% certain.

We would like to use Nomad in our environment and this issue is blocking me from moving forward. I look forward to working with you to solve this problem. Talk to you soon.

Best Regards,

John Montoya

@diptanu
Contributor

diptanu commented Mar 21, 2016

@dotcomputercraft Glad to hear that you are planning to use Nomad! So we would need some more information to help you with this.

Can you please use nomad fs cat <alloc id> /alloc/logs/<stdout/stderr log file name> to share the actual error that your executable is throwing? This would help us understand if we need to set up the command in a certain way or if there are some workarounds for this.

UPDATE: You can also use nomad fs ls <alloc id> /alloc/logs/ to find the log file names.

@dotcomputercraft
Author

@diptanu - Here is some of the contents of the microservice.stderr.0 file:

I see the same error in the file :

Unhandled Exception: System.IO.FileLoadException: Could not load file or assembly 'Microsoft.Owin.Hosting, Version=2.0.2.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies. Provider DLL failed to initialize correctly. (Exception from HRESULT: 0x8009001D)
   at Program.main(String[] argv)

This type of error should not happen because this DLL is in the same location as the EXE file. The error also suggests that the EXE process does not know its working directory. In fact, when I open a shell and execute the EXE by hand on the Windows machine, the process starts up fine without any errors, i.e.
C:\\nomad\\data\\alloc\\5a62cc7f-1837-174d-5a9f-fe6999b03f37\\microservice\\OwinSample.Host.exe

Is there something I need to include in the command parameter to help Nomad start the OwinSample.Host.exe process without any errors?

OwinSample.Host.exe is a long-running process that runs forever. Is this type of workload supported by Nomad?

Talk to you soon.

Best Regards,

John Montoya

@dotcomputercraft
Author

@diptanu - I downloaded the nomad code and found the code where the error is being reported.

task_runner.go - line 298
                // Log whether the task was successful or not.
                r.restartTracker.SetWaitResult(waitRes)
                r.setState(structs.TaskStateDead, r.waitErrorToEvent(waitRes))
                if !waitRes.Successful() {
                    r.logger.Printf("[INFO] client: task %q for alloc %q failed: %v", r.task.Name, r.alloc.ID, waitRes)
                } else {
                    r.logger.Printf("[INFO] client: task %q for alloc %q completed successfully", r.task.Name, r.alloc.ID)
                }

I wanted to add more logging to this area of code to figure out what is going on with nomad.exe. Thoughts?

@dotcomputercraft
Author

@diptanu - I'm using the Vagrantfile setup to do local development. I really want to figure out the problem we are having with nomad.exe on our Windows host machine, so I downloaded the source and ran make bin, but I get the following error:

1 errors occurred:
--> darwin/amd64 error: exit status 2
Stderr: # net
could not determine kind of name for C.AI_MASK
# github.com/hashicorp/nomad/vendor/github.com/shirou/gopsutil/cpu
vendor/github.com/shirou/gopsutil/cpu/cpu_darwin_cgo.go:10:28: fatal error: mach/mach_init.h: No such file or directory
 #include <mach/mach_init.h>

I added instrumentation to the code to find which working directory is being used when nomad.exe tries to invoke the OwinSample.Host.exe process. I think it is using the wrong working directory, but I can't see the logs to validate my idea. Talk to you soon.

I would love your help to either build Nomad successfully or identify and put in a fix for the problem I reported. Thank you in advance and talk to you soon.

Best Regards,

John Montoya

@diptanu
Contributor

diptanu commented Mar 22, 2016

@dotcomputercraft Sorry for the late response on this! Yeah so on the Vagrant box the build for Darwin doesn't work.

Are you trying to build Nomad for windows on Linux? If so then remove all the platforms and just keep windows here - https://github.com/hashicorp/nomad/blob/master/scripts/build.sh#L20

Let me know how I can help you. Also, you can jump on the irc #nomad-tool on freenode.

@diptanu
Contributor

diptanu commented Mar 22, 2016

@dotcomputercraft Also the working directory Nomad sets is the task dir and not the alloc dir. Regarding your earlier question about long-running processes - Nomad was designed to run both long running services and batch processes, so your use case definitely fits into what Nomad was designed for.
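
For reference, starting a child process with an explicit working directory in Go looks roughly like the sketch below. It uses the standard os/exec package; newTaskCommand and the example paths are hypothetical, and this is not Nomad's executor code, only an illustration of the task-dir behaviour described above.

```go
package main

import (
	"fmt"
	"os/exec"
	"path/filepath"
)

// newTaskCommand builds (but does not start) a command for a binary that
// lives in taskDir, with taskDir as the child's working directory.
// Hypothetical helper for illustration; not taken from Nomad.
func newTaskCommand(taskDir, command string, args ...string) *exec.Cmd {
	cmd := exec.Command(filepath.Join(taskDir, command), args...)
	// Relative paths opened by the child resolve against this directory.
	cmd.Dir = taskDir
	return cmd
}

func main() {
	// Example path only; a real task dir sits under the client's alloc dir.
	taskDir := filepath.Join("alloc", "example-alloc-id", "microservice")
	cmd := newTaskCommand(taskDir, "OwinSample.Host.exe")
	fmt.Println("binary:     ", cmd.Path)
	fmt.Println("working dir:", cmd.Dir)
}
```

Printing cmd.Dir like this before starting the process is one cheap way to confirm which directory the child will actually see.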

@dotcomputercraft
Author

@diptanu - No problem. Yes, I would like to build Nomad for the Windows platform. Let me make the changes. Thank you for the response about long-running processes on Nomad. I think there is a problem, but without logs it is hard to say for sure. The code in client/driver/executor/executor.go, lines 181-191, where the path is being created, may be wrong for my use case. Let me build the new code now. Talk to you soon.

Best Regards,

John Montoya

@diptanu
Contributor

diptanu commented Mar 22, 2016

@dotcomputercraft This is where the Directory is being set - https://github.com/hashicorp/nomad/blob/master/client/driver/executor/executor.go#L338

Let me know what you find out!

@dotcomputercraft
Author

@diptanu - I'm making another Windows build. I saw this error and I'm not sure what it means...

2016/03/22 22:00:34 [ERR] client: allocation "81d8c637-41d5-741e-eb19-bde840dd2d0a", task microservice, artifact &{<ARTIFACT_LOCATION>/build.zip map[] } (0) fails validation

<ARTIFACT_LOCATION> is a real HTTP GET URL.

I'm only building the Windows nomad.exe executable (using the Vagrantfile VM) and using the existing Nomad 0.3.1 server on Ubuntu.

Job Config

job "windows-service" {
    region = "global"
    datacenters = ["dc1"]
    type = "service"
    constraint {
        attribute = "${attr.kernel.name}"
        value = "windows"
    }
    update {
        stagger = "30s"
        max_parallel = 1
    }
    group "microservices" {
        count = 1
        task "microservice" {
            driver = "raw_exec"
            config {
                artifact_source = "https://<ARTIFACT_LOCATION>/build.zip"
                command = "OwinSample.Host.exe"
            }
            env {
                DB_HOST = "db01.example.com"
                DB_USER = "web"
                DB_PASSWORD = "loremipsum"
            }
            resources {
                cpu = 500
                memory = 128
            }
        }
    }
}

Thoughts?

@dotcomputercraft
Author

@diptanu - I confirmed that Nomad on Windows is using the correct task directory. I'm very confused about why Nomad fails to start a .NET application. The .NET application has everything it needs to start properly... but Nomad keeps exiting the process.

    2016/03/22 23:01:16 [DEBUG] plugin: starting plugin: C:\nomad\nomad.exe []string{"C:\\nomad\\nomad.exe", "executor", "C:\\nomad\\data\\alloc\\d737a7b8-f7a5-fa4e-f552-88fb173033d6\\microservice\\microservice-executor.out"}
    2016/03/22 23:01:16 [DEBUG] plugin: waiting for RPC address for: C:\nomad\nomad.exe
    2016/03/22 23:01:16 [DEBUG] plugin: nomad.exe: 2016/03/22 23:01:16 [DEBUG] plugin: plugin address: tcp 127.0.0.1:14000
    2016/03/22 23:01:16 [DEBUG] driver.raw_exec: started process with pid: 3812
    2016/03/22 23:01:16 [INFO] client: task "microservice" for alloc "d737a7b8-f7a5-fa4e-f552-88fb173033d6" failed: Wait returned exit code 3762504530, signal 0, and error <nil>
    2016/03/22 23:01:16 [INFO] client: Restarting task "microservice" for alloc "d737a7b8-f7a5-fa4e-f552-88fb173033d6" in 15.018887389s
    2016/03/22 23:01:16 [DEBUG] plugin: C:\nomad\nomad.exe: plugin process exited
    2016/03/22 23:01:16 [DEBUG] client: updated allocations at index 12938 (pulled 0) (filtered 14)
    2016/03/22 23:01:16 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
    2016/03/22 23:01:27 [DEBUG] client: updated allocations at index 12941 (pulled 0) (filtered 14)
    2016/03/22 23:01:27 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)
    2016/03/22 23:01:31 [DEBUG] plugin: starting plugin: C:\nomad\nomad.exe []string{"C:\\nomad\\nomad.exe", "executor", "C:\\nomad\\data\\alloc\\d737a7b8-f7a5-fa4e-f552-88fb173033d6\\microservice\\microservice-executor.out"}
    2016/03/22 23:01:31 [DEBUG] plugin: waiting for RPC address for: C:\nomad\nomad.exe
    2016/03/22 23:01:31 [DEBUG] plugin: nomad.exe: 2016/03/22 23:01:31 [DEBUG] plugin: plugin address: tcp 127.0.0.1:14000
    2016/03/22 23:01:31 [DEBUG] driver.raw_exec: started process with pid: 7320
    2016/03/22 23:01:31 [INFO] client: task "microservice" for alloc "d737a7b8-f7a5-fa4e-f552-88fb173033d6" failed: Wait returned exit code 3762504530, signal 0, and error <nil>
    2016/03/22 23:01:31 [INFO] client: Restarting task "microservice" for alloc "d737a7b8-f7a5-fa4e-f552-88fb173033d6" in 18.149770124s
    2016/03/22 23:01:31 [DEBUG] plugin: C:\nomad\nomad.exe: plugin process exited
    2016/03/22 23:01:32 [DEBUG] client: updated allocations at index 12942 (pulled 0) (filtered 14)
    2016/03/22 23:01:32 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 14)

Yes, nomad reports an error but this error should not be occurring because I'm able to run the task by hand.

@dadgar
Contributor

dadgar commented Mar 22, 2016

@dotcomputercraft can you just set the CD environment variable?

task {
    env {
        CD = "${NOMAD_TASK_DIR}"
    }
}

I also pushed a change that should show the error for the artifact validation.

@dadgar
Contributor

dadgar commented Mar 22, 2016

Could you also post the job file? I know you have it in your original post, but that syntax isn't up to date, so I am assuming it's been updated.

@dotcomputercraft
Author

hi @dadgar - I just updated the job file

job "windows-service" {
    region = "global"

    datacenters = ["dc1"]

    type = "service"

    constraint {
        attribute = "${attr.kernel.name}"
        value = "windows"
    }

    # Rolling updates should be sequential
    update {
        stagger = "30s"
        max_parallel = 1
    }

    group "microservices" {
        count = 1

        # Create a web front end using a docker image
        task "microservice" {
            driver = "raw_exec"

            artifact {
                source = "https://<ARTIFACT_LOCATION>/build.zip"
            }
            config {
                command = "OwinSample.Host.exe"
            }

            env {
                "CD" = "${NOMAD_TASK_DIR}"
            }

            resources {
                cpu = 500
                memory = 128
            }
        }
    }
}

but the process exited again. Thoughts?

@dadgar
Contributor

dadgar commented Mar 22, 2016

Were the task's logs any different?

@dadgar
Contributor

dadgar commented Mar 22, 2016

Is it possible to produce a non-sensitive binary that exhibits this behavior so you can just give us the files so we can test and resolve?

@dotcomputercraft
Author

let me post the build.zip in my github repo. give me 5 minutes

@dotcomputercraft
Author

@dadgar - here is the build.zip archive that contains the .NET application.

https://github.com/dotcomputercraft/OwinSample.Host/tree/master/deploy/build.zip

this is the original code for the executable
https://github.com/dotcomputercraft/OwinSample.Host

I work for jet.com and I really want for nomad to work because we would love to use it for some of our use cases.
Talk to you soon.

@dadgar
Contributor

dadgar commented Mar 22, 2016

@dotcomputercraft Cool! We will try to debug this and will get back to you once we have an update. Once this is resolved, would love to chat with you about jet.com's usage of Nomad and how we can help!

@dotcomputercraft
Author

@dadgar - Awesome... I would love to catch up and thank you for helping with this issue.

@dotcomputercraft
Author

@dadgar - Do you know if this problem can be fixed? Can you shoot me an email at john.montoya@jet.com when you have any updates? Thank you again for your help with this issue.

@diptanu
Contributor

diptanu commented Mar 23, 2016

@dotcomputercraft We are going to debug the issue, and get back to you! This issue can definitely be fixed.

@dotcomputercraft
Author

@diptanu , @dadgar - Thank you guys. Talk to you soon. Best regards, John

@dadgar
Contributor

dadgar commented Mar 23, 2016

You can unblock yourself by adding:

task {
    ...
            env {
                SYSTEMROOT="C:\\Windows"
            }
}

This fixed the task you linked. Will have to work more to fix this in Nomad proper.

@dotcomputercraft
Author

@dadgar - thank you for the amazing support... Yes, adding this key value in the env solved the problem. here is my updated job

Job Config

job "windows-service" {
    region = "global"
    datacenters = ["dc1"]
    type = "service"
    constraint {
        attribute = "${attr.kernel.name}"
        value = "windows"
    }
    update {
        stagger = "30s"
        max_parallel = 1
    }
    group "microservices" {
        count = 1
        task "microservice" {
            driver = "raw_exec"
            config {
                artifact_source = "https://<ARTIFACT_LOCATION>/build.zip"
                command = "OwinSample.Host.exe"
            }
            env {
                SYSTEMROOT = "C:\\Windows"
                CD = "${NOMAD_TASK_DIR}"
            }
            resources {
                cpu = 500
                memory = 128
            }
        }
    }
}

@dotcomputercraft
Author

@diptanu or @dadgar - please let us know when you have a new release of Nomad. Thank you again for the wonderful support.
Best Regards,
John Montoya

@dotcomputercraft
Author

@dadgar - what does the SYSTEMROOT env variable do?

@dadgar
Contributor

dadgar commented Mar 23, 2016

@dotcomputercraft - Of course! Glad you guys are looking into Nomad. I will fix this in master today so if you guys can compile from source you should be able to continue to use Nomad. Otherwise, we will be releasing Nomad 0.3.2 in the next 2 or 3 weeks.

Windows appears to have a few special system environment variables, and that happens to be one of them.
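
To illustrate the idea: when a launcher passes an explicit environment to a child process, variables such as SYSTEMROOT vanish unless they are copied over from the host. The Go sketch below shows one way to merge the two; buildTaskEnv and the passthrough list are assumptions made for illustration, not the change that went into Nomad.

```go
package main

import (
	"fmt"
	"os"
	"sort"
)

// buildTaskEnv returns sorted KEY=VALUE pairs for a child process: the
// task's own variables plus any passthrough variables the host defines.
// Hypothetical helper; key matching is case-sensitive for simplicity.
func buildTaskEnv(taskEnv map[string]string, passthrough []string, lookup func(string) (string, bool)) []string {
	merged := make(map[string]string, len(taskEnv)+len(passthrough))
	for _, key := range passthrough {
		if val, ok := lookup(key); ok {
			merged[key] = val
		}
	}
	// Task-level settings win over values inherited from the host.
	for key, val := range taskEnv {
		merged[key] = val
	}
	env := make([]string, 0, len(merged))
	for key, val := range merged {
		env = append(env, key+"="+val)
	}
	sort.Strings(env)
	return env
}

func main() {
	taskEnv := map[string]string{"DB_HOST": "db01.example.com"}
	// The result can be assigned to the Env field of an exec.Cmd.
	for _, kv := range buildTaskEnv(taskEnv, []string{"SYSTEMROOT"}, os.LookupEnv) {
		fmt.Println(kv)
	}
}
```

Task-level values take precedence here, so a job that sets SYSTEMROOT itself (as in the workaround above) still wins.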

@dotcomputercraft
Author

@dadgar - thank you. What platform do you use to build Nomad? Using the Vagrantfile that comes with Nomad does not produce a clean executable, so I was going to use OS X Yosemite (10.10.5). Thoughts?

@dadgar
Contributor

dadgar commented Mar 23, 2016

@dotcomputercraft: I build the linux binary on the provided vagrant and I build the darwin/windows binary on OS X El Capitan

@dadgar
Contributor

dadgar commented Mar 23, 2016

So the PR linked fixes this problem and will let you remove that environment flag from your PR. Should be merged shortly!

@mgenov
Contributor

mgenov commented Apr 5, 2016

I'm encountering a similar issue with 0.3.1 on Linux:

   2016/04/05 16:22:22 [INFO] client: task "proxy-test" for alloc "385efff6-3665-bf92-84c1-5ecc984ce85f" failed: Wait returned exit code -1, signal 0, and error <nil>
   2016/04/05 16:22:22 [INFO] client: Restarting task "proxy-test" for alloc "385efff6-3665-bf92-84c1-5ecc984ce85f" in 15.121273443s
   2016/04/05 16:22:22 [DEBUG] plugin: /usr/lib/nomad/nomad: plugin process exited

The stderr and stdout do not contain any crash information from the process. The same process works as expected when it is started manually.

Is it possible that the cause is the environment here too?

@mgenov
Contributor

mgenov commented Apr 5, 2016

I've added some more logging and it looks like Nomad is sending an interrupt signal to the process.

Log Output:

client: task "proxy-test" for alloc "385efff6-3665-bf92-84c1-5ecc984ce85f" failed: Wait returned exit code -1, signal 0, and error <nil>

This code is part of my app

        done := make(chan bool, 1)
        c := make(chan os.Signal, 1)
        signal.Notify(c, os.Interrupt, syscall.SIGHUP,
                syscall.SIGINT,
                syscall.SIGTERM,
                syscall.SIGQUIT)
        go func() {
                sig := <-c
                log.Printf("Got quit signal: %v", sig)
                done <- true
        }()
        <-done

And this is what is logged in stderr.0 in 385efff6-3665-bf92-84c1-5ecc984ce85f/alloc/logs

2016/04/05 16:26:11 Got quit signal: interrupt

@mgenov
Contributor

mgenov commented Apr 6, 2016

Added #1042 about this issue.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 24, 2022