
Data directory included in chroot causes infinite directory structure #2522

Closed
kyle-crumpton opened this issue Apr 5, 2017 · 4 comments · Fixed by #11334

Comments

@kyle-crumpton

kyle-crumpton commented Apr 5, 2017

Nomad version

0.5.6

Operating system and Environment details

OS: CentOS 7
Consul: 0.7.5

Issue

The job hangs at "Building task directory" when Nomad's data directory is included in the chroot (for example, when /etc/nomad.d/ is used as the data directory), until either the disk fills up or memory runs out.

Reproduction steps

Set Nomad's data directory to /etc/nomad.d/ and run a job that uses a chroot.
Alternatively, simply add the data directory to "chroot_env" in the Nomad client configuration; this reproduces the same behavior.
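
For illustration, a client configuration along these lines triggers the failure (an editor's sketch, not from the original report; the mapping of the data directory into chroot_env is the problematic part):

```hcl
data_dir = "/etc/nomad.d"

client {
  enabled = true

  chroot_env {
    # Embedding the data directory copies the chroot being built into itself.
    "/etc/nomad.d" = "/etc/nomad.d"
    "/bin"         = "/bin"
    "/usr"         = "/usr"
    "/lib"         = "/lib"
    "/lib64"       = "/lib64"
  }
}
```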

Nomad Server logs (if appropriate)

The server logs did not contain anything meaningful.

Nomad Client logs (if appropriate)

Setup Failure failed to build task directory for "example": Couldn't create symlink: symlink python2.7.1.gz /etc/nomad.d/alloc/ba334436-96e1-61bf-e6b1-9f8ff3c56a63/example/etc/nomad.d/alloc/ba334436-96e1-61bf-e6b1-9f8ff3c56a63/example/etc/nomad.d/alloc/2bb9b1a0-95c3-c06b-c7c9-6752eda18d2f/example/etc/nomad.d/alloc/2bb9b1a0-95c3-c06b-c7c9-6752eda18d2f/example/etc/nomad.d/alloc/b8318d40-4b90-5a65-90e5-99fdd57ec522/example/usr/share/man/man1/python2.1.gz: no space left on device

Job file (if appropriate)

job "sleep" {
     datacenters = ["us-central1-a-gce"]
     type = "service"
     group "sleep" {
       count = 1
       task "example" {
          driver = "exec"
          config {
               # When running a binary that exists on the host, the path must be absolute.
               command = "/bin/sleep"
               args    = ["1"]
           }
          resources {
	    memory = 500
          }
       }  
      }
}

Much of this boils down to the fact that actively writing data under a directory such as /etc is typically a no-no. From what I've seen and reproduced, though, the issue is not limited to that: if I set the data directory in my chroot_env, it causes this problem regardless of where the data directory lives. It would be nice to detect this kind of thing and actively prevent the data directory and the chroot from overlapping; a rough sketch of such a check follows. I didn't see any warnings about this in the server logs, and the documentation doesn't mention it.
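
A minimal sketch of the kind of overlap check being suggested, in Go, assuming a hypothetical validateChrootEnv hook run at agent startup (illustrative only, not Nomad's actual code):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// within reports whether path p equals dir or lies underneath it.
func within(p, dir string) bool {
	p, dir = filepath.Clean(p), filepath.Clean(dir)
	return p == dir || strings.HasPrefix(p, dir+string(os.PathSeparator))
}

// validateChrootEnv is a hypothetical startup check: reject any
// chroot_env source that contains, or is contained by, data_dir,
// since embedding it would copy the chroot being built into itself.
func validateChrootEnv(dataDir string, chrootEnv map[string]string) error {
	for src := range chrootEnv {
		if within(dataDir, src) || within(src, dataDir) {
			return fmt.Errorf("chroot_env source %q overlaps data_dir %q", src, dataDir)
		}
	}
	return nil
}

func main() {
	err := validateChrootEnv("/etc/nomad.d", map[string]string{"/etc": "/etc"})
	fmt.Println(err) // chroot_env source "/etc" overlaps data_dir "/etc/nomad.d"
}
```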

@dadgar

dadgar commented Apr 5, 2017

Yeah, I agree we should do some amount of checking, but this will largely come down to documentation and operator configuration.

@urjitbhatia

We just got hit by this, and it rendered our entire Nomad worker cluster useless. This should be called out in bold in the documentation.

@shantanugadgil

shantanugadgil commented Sep 7, 2018

I too experienced this recently (Nomad v0.8.4).

The "Building task directory" step looks as if things are hung and not working.

The time it takes is not obvious to a new user, and it takes a bit of "I have seen this symptom before" to understand what is going on, rather than being clear from the info/error messages.

Would there be a way to show some sort of progress indicator during the "Building task directory" step?
(something analogous to the Docker layer-pull progress indicator 😀)

@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
@tgross tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021
schmichael added a commit that referenced this issue Oct 16, 2021
Fixes #2522

Skip embedding client.alloc_dir when building chroot. If a user
configures a Nomad client agent so that the chroot_env will embed the
client.alloc_dir, Nomad will happily infinitely recurse while building
the chroot until something horrible happens. The best case scenario is
the filesystem's path length limit is hit. The worst case scenario is
disk space is exhausted.

A bad agent configuration will look something like this:

```hcl
data_dir = "/tmp/nomad-badagent"

client {
  enabled = true

  chroot_env {
    # Note that the source matches the data_dir
    "/tmp/nomad-badagent" = "/ohno"
    # ...
  }
}
```

Note that `/ohno/client` (the state_dir) will still be created but not
`/ohno/alloc` (the alloc_dir).

While I cannot think of a good reason why someone would want to embed
Nomad's client (and possibly server) directories in chroots, there
should be no cause for harm. chroots are only built when Nomad runs as
root, and Nomad disables running exec jobs as root by default. Therefore
even if client state is copied into chroots, it will be inaccessible to
tasks.

Skipping the `data_dir` and `{client,server}.state_dir` is possible, but
this PR attempts to implement the minimum viable solution to reduce risk
of unintended side effects or bugs.
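
A minimal sketch of the skip described above (an editor's illustration, not the actual change from #11334; embedDirs and copyIntoChroot are hypothetical names):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// copyIntoChroot is a stub standing in for the real link/copy logic.
func copyIntoChroot(path string, info os.FileInfo) error {
	fmt.Println("embed:", path)
	return nil
}

// embedDirs walks each configured source and embeds it into the
// chroot, skipping the client's alloc_dir so the chroot under
// construction is never copied into itself.
func embedDirs(allocDir string, sources []string) error {
	allocDir = filepath.Clean(allocDir)
	for _, src := range sources {
		err := filepath.Walk(src, func(path string, info os.FileInfo, err error) error {
			if err != nil {
				return err
			}
			if info.IsDir() && path == allocDir {
				// Prune the walk here so nothing beneath the
				// alloc dir is ever embedded.
				return filepath.SkipDir
			}
			return copyIntoChroot(path, info)
		})
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	_ = embedDirs("/tmp/nomad-badagent/alloc", []string{"/tmp/nomad-badagent"})
}
```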

When running tests as root in a vm without the fix, the following error
occurs:

```
=== RUN   TestAllocDir_SkipAllocDir
    alloc_dir_test.go:520:
                Error Trace:    alloc_dir_test.go:520
                Error:          Received unexpected error:
                                Couldn't create destination file /tmp/TestAllocDir_SkipAllocDir1457747331/001/nomad/test/testtask/nomad/test/testtask/.../nomad/test/testtask/secrets/.nomad-mount: open /tmp/TestAllocDir_SkipAllocDir1457747331/001/nomad/test/.../testtask/secrets/.nomad-mount: file name too long
                Test:           TestAllocDir_SkipAllocDir
--- FAIL: TestAllocDir_SkipAllocDir (22.76s)
```

Also removed unused Copy methods on AllocDir and TaskDir structs.

Thanks to @eveld for not letting me forget about this!
schmichael added a commit that referenced this issue Oct 18, 2021
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 14, 2022