
Elixir applications are not able to access files in the job task directory. #4194

Closed
dansteen opened this issue Apr 20, 2018 · 7 comments

@dansteen

dansteen commented Apr 20, 2018

Nomad versions

Nomad v0.8.1 (46aa11b)
(also 0.8.0)
(worked fine in Nomad v0.7.1 (0b295d3))

Operating system and Environment details

Debian 8

Issue

After upgrading to Nomad 0.8.0 (and then 0.8.1 as a test), my Elixir applications are no longer able to read files that are placed in the allocation directory during the "downloading artifacts" phase of the deployment (I get a "file not found" type of error).

These applications worked just fine under version 0.7.1.

The actual error I get is:

** (Conform.Schema.SchemaError) Schema at /local/org_service/releases/0.0.1/org_service.schema.exs doesn't exist!
    (conform) lib/conform/schema.ex:134: Conform.Schema.load!/1
    (conform) lib/conform.ex:95: Conform.process/1
    (elixir) lib/kernel/cli.ex:105: anonymous fn/3 in Kernel.CLI.exec_fun/2

but the file does exist:

-rw-r--r-- 1 root root 9193 Apr 20 13:15 /local/org_service/releases/0.0.1/org_service.schema.exs

To resolve this, I can manually run through and "touch" all the files that the application will need to read. Once I do that, the application is able to read the files and start up.
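The workaround boils down to refreshing every file's timestamps under the artifact destination. As a minimal Go sketch (assuming it is run from the task's working directory, with "local/org_service" being the artifact destination from the job file below):

package main

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

func main() {
	now := time.Now()
	// Equivalent of running `touch` on every regular file under the
	// artifact destination; only timestamps change, not contents or modes.
	err := filepath.Walk("local/org_service", func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		return os.Chtimes(path, now, now)
	})
	if err != nil {
		log.Fatal(err)
	}
}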

(note that I have also opened a support ticket for this issue - sorry about the duplicate)

Nomad Server logs (if appropriate)

There are no server error logs

Nomad Client logs (if appropriate)

There are no client error logs

Job file (if appropriate)

job "org-service" {
  datacenters = ["awse"]
  type = "service"
  constraint {
    attribute = "${meta.role}"
    value     = "api-cluster"
  }
  constraint {
    attribute = "${meta.env}"
    value     = "stag"
  }

  # set our update policy
  update {
    max_parallel     = 2
    health_check     = "checks"
    min_healthy_time = "30s"
    healthy_deadline = "3m"
    auto_revert      = false
    #canary           = 1
    #stagger          = "30s"
  }

  reschedule {
    delay          = "30s"
    delay_function = "exponential"
    max_delay      = "5m"
    unlimited      = true
  }

  group "app" {
    # set our restart policy
    restart {
      interval = "1m"
      attempts = 2
      delay    = "15s"
      mode     = "fail"
    }
    count = 2

    # needed for increased log file size
    ephemeral_disk {
      size    = "2600"
    }

    task "org-service" {
      leader = true
      # grab our files
      artifact {
        source = "https://<url>/org-service-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.tar.gz"
        destination = "local/org_service"
      }
      artifact {
        # for development
        source = "https://<url>/org-service-config-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.conf.tmpl"
      }
      # turn it into the correct config
      template {
        source = "local/org-service-config-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.conf.tmpl"
        # the underscore in org_service below is intentional since that's the technical name of the application
        destination = "local/org_service/releases/0.0.1/org_service.conf"
        change_mode = "restart"
        splay = "10m"
        vault_grace = "15m"
        perms = "664"
      }
      artifact {
        # for development
        source = "https://<url>/org-service-vm-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.args.tmpl"
      }
      # turn it into the correct config
      template {
        source = "local/org-service-vm-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.args.tmpl"
        destination = "local/org_service/vm.args"
        change_mode = "restart"
        splay = "10m"
        vault_grace = "15m"
        perms = "664"
      }
      # set our environment variables
      env {
        CHEF_ENV = "${meta.env}"
        APP_NAME = "org-service"
        LOCAL_HOSTNAME = "${node.unique.name}"
        # we need this so it doesn't try to write into the application
        RELEASE_MUTABLE_DIR = "/local/run_dir"
        PORT = "${NOMAD_PORT_app}"
        ERL_CRASH_DUMP = "/alloc/logs/erl_crash.dump"
        ERL_EPMD_PORT = "${NOMAD_PORT_epmd}"
        # when a new deploy runs, we haven't yet set the deploy_version to the new value so we need to specify the GIT_HASH that we are
        # using for the job and templates
        GIT_HASH = "b1fdcac2c596935f49c29aba2e630b97f5f6e28f"
      }
      # grant access to secrets
      vault {
        policies = [ "app-stag-org-service" ]
        change_mode = "noop"
      }
      # run our app
      driver = "exec"
      config {
        command = "local/org_service/bin/org_service"
        args = [ "foreground" ]
      }
      resources {
        cpu    = 2000
        memory = 3000
        network {
          port "app" {}
          port "admin" {}
          port "epmd" {
            static = "11001"
          }
        }
      }

      logs {
        max_files     = 5
        max_file_size = 500
      }
      
      # add in service discovery
      service {
        name = "org-service"
        # for now we use both <context>__<data> and <data> formats
        tags = [
          "${node.unique.name}", "host__${node.unique.name}",
          "b1fdcac2c596935f49c29aba2e630b97f5f6e28f", "version__b1fdcac2c596935f49c29aba2e630b97f5f6e28f",
          "${meta.env}", "env__${meta.env}",
          "${meta.env}-api-cluster-prefix-/v1/organizations",
          "${meta.env}-api-cluster-prefix-/swagger-orgs.json",
          "consuldogConfig:org-service-http_check.yaml.tmpl:http_check",
          "consuldogConfig:org-service-process.yaml.tmpl:process"
        ]

        port = "app"

        check {
          name = "app"
          path     = "/v1/organizations/monitor/ping"
          initial_status = "critical"
          type     = "http"
          protocol = "http"
          port     = "app"
          interval = "10s"
          timeout  = "2s"
        }
      }

      # add in service discovery so we can find the admin port from consul
      service {
        name = "org-service-admin"
        tags = [
          "${node.unique.name}",
          "host__${node.unique.name}",
          "b1fdcac2c596935f49c29aba2e630b97f5f6e28f", "version__b1fdcac2c596935f49c29aba2e630b97f5f6e28f",
          "${meta.env}", "env__${meta.env}",
          "consuldogConfig:org-service-admin.yaml.tmpl:admin"
        ]

        port = "admin"

        check {
          name = "admin"
          initial_status = "critical"
          type     = "tcp"
          port     = "app"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }

    task "log-shipper" {
      # grab our config file template
      artifact {
        # for development
        source = "https://<url>/org-service-remote-syslog2-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.yml.tmpl"
      }
      # turn it into the correct config
      template {
        source = "local/org-service-remote-syslog2-b1fdcac2c596935f49c29aba2e630b97f5f6e28f.yml.tmpl"
        destination = "local/remote-syslog2.yml"
        change_mode = "noop"
        perms = "664"
      }
      # set our environment variables
      env {
        CHEF_ENV = "${meta.env}"
        APP_NAME = "org-service"
        LOCAL_HOSTNAME = "${node.unique.name}"
        LOG_TASK_NAME = "org-service"
        # when a new deploy runs, we haven't yet set the deploy_version to the new value so we need to specify the GIT_HASH that we are
        # using for the job and templates
        GIT_HASH = "b1fdcac2c596935f49c29aba2e630b97f5f6e28f"
      }
      # run our log shipper
      driver = "exec"
      config {
        command = "/usr/local/bin/remote_syslog"
        args = [ "-c", "/local/remote-syslog2.yml", "-D" ]
      }
      resources {
        cpu    = 100
        memory = 100
      }
    }
  }
}
@preetapan
Contributor

Thanks for reporting this.

https://github.com/hashicorp/nomad/pull/4129/files looks like it could be related to this regression.

As a workaround, would you be able to make the files being downloaded world-readable? The changes we made in #4129 should preserve the original permissions.
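For context, "preserving original permissions" during artifact extraction means applying the mode recorded in each tar header rather than a fixed default. A generic Go sketch of that idea (not Nomad's actual implementation; the package and function names are made up for illustration):

package extractsketch

import (
	"archive/tar"
	"io"
	"os"
	"path/filepath"
)

// extractRegularFiles writes each regular file in the archive to dest,
// applying the mode stored in the tar header so the original permissions
// (for example, a non-world-readable 0600) survive extraction.
func extractRegularFiles(tr *tar.Reader, dest string) error {
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if hdr.Typeflag != tar.TypeReg {
			continue // directories, symlinks, etc. omitted for brevity
		}
		target := filepath.Join(dest, hdr.Name)
		if err := os.MkdirAll(filepath.Dir(target), 0755); err != nil {
			return err
		}
		f, err := os.OpenFile(target, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, os.FileMode(hdr.Mode))
		if err != nil {
			return err
		}
		if _, err := io.Copy(f, tr); err != nil {
			f.Close()
			return err
		}
		if err := f.Close(); err != nil {
			return err
		}
	}
}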

@dansteen
Author

Hi @preetapan

Thanks for the reply! Unfortunately, I don't think it's a permissions issue, as all the files and directories are world-readable all the way up the chain:

-rw-r--r-- 1 root root 8635 Apr 16 12:32 local/org_service/releases/0.0.1/org_service.schema.exs
drwxr-xr-x 5 root root 4096 Apr 16 17:04 local/org_service/releases/0.0.1/
drwxr-xr-x 3 root root 4096 Apr 16 12:32 local/org_service/releases
drwxr-xr-x 6 root root 4096 Apr 16 12:32 local/org_service
drwxrwxrwx 4 nobody nogroup 4096 Apr 16 12:32 local

Also, when I do a "touch" on the file (no permissions change at all), it becomes accessible to the service.

Thanks!

@dadgar
Contributor

dadgar commented Apr 23, 2018

@dansteen Could you share the same permission breakdown when you run it under 0.7.1? Also, as a test, could you run a batch version of that job (on 0.8.1) with the command just being cat and the arg being that file? I wonder if any application can read the file.

@dansteen
Author

Here is a super-simple test application that will demonstrate this issue:
hello_world.tar.gz

If you can't use the binary, or you want to test in other box types, here is the code that generates that application bundle:
hello_world_repo.tar.gz

To build it, you will need the following packages (on Debian):
apt-get install elixir erlang-dev erlang-parsetools erlang-eunit erlang-xmerl

Then run the following commands:

mix deps.get
MIX_ENV=prod mix do compile, release --env=prod

The binary bundle will then be located in _build/prod/rel/hello_world/releases/0.1.0/hello_world.tar.gz

And here is an associated job file that will demonstrate this issue:

job "test" {
  datacenters = ["awse"]
  type        = "service"

  constraint {
    attribute = "${meta.role}"
    value     = "api-cluster"
  }

  constraint {
    attribute = "${meta.env}"
    value     = "load"
  }

  # set our update policy
  update {
    max_parallel     = 1
    health_check     = "checks"
    min_healthy_time = "30s"
    healthy_deadline = "3m"
    auto_revert      = false

    #canary           = 1
    #stagger          = "30s"
  }

  reschedule {
    delay          = "30s"
    delay_function = "exponential"
    max_delay      = "5m"
    unlimited      = true
  }

  group "test" {
    # set our restart policy
    restart {
      interval = "1m"
      attempts = 2
      delay    = "15s"
      mode     = "fail"
    }

    count = 1

    task "test" {
      leader = true

      # grab our files
      artifact {
        source      = "https://<url to archive>/hello_world.tar.gz"
        destination = "local/hello_world"
      }

      # set our environment variables
      env {
        CHEF_ENV       = "${meta.env}"
        APP_NAME       = "org-service"
        LOCAL_HOSTNAME = "${node.unique.name}"

        # we need this so it doesn't try to write into the application
        RELEASE_MUTABLE_DIR = "/local/run_dir"
        PORT                = "${NOMAD_PORT_app}"
        ERL_CRASH_DUMP      = "/alloc/logs/erl_crash.dump"
        ERL_EPMD_PORT       = "${NOMAD_PORT_epmd}"
      }

      # run our app
      driver = "exec"

      config {
        command = "local/hello_world/bin/hello_world"
        args    = ["foreground"]
      }

      resources {
        cpu    = 200
        memory = 300

        network {
          port "app"{}
          port "admin"{}

          port "epmd" {
            static = "11001"
          }
        }
      }
    }
  }
}

The errors show up in stderr:

2018-04-23 12:53:44 std_error           2018-04-23 12:53:44 std_error           2018-04-23 12:53:44 std_error           2018-04-23 12:53:44 std_error           2018-04-23 12:53:44 std_error           2018-04-23 12:53:45 std_error 
** (Conform.Schema.SchemaError) Schema at /local/hello_world/releases/0.1.0/hello_world.schema.exs doesn't exist!
    (conform) lib/conform/schema.ex:134: Conform.Schema.load!/1
    (conform) lib/conform.ex:95: Conform.process/1
    (elixir) lib/kernel/cli.ex:105: anonymous fn/3 in Kernel.CLI.exec_fun/2

A successful run will generate the following log line:

==> Generated sys.config in /tmp/hello_world_sample/_build/prod/rel/hello_world/var

Some interesting tests

  1. If you run cp /local/hello_world/releases/0.1.0/hello_world.schema.exs /<any place at all>, the next time the job restarts it will be able to find the file. (Notice that we don't change anything about the file at all - it's just a forced read via a copy.)

  2. This error happens when you run the command inside a folder that Nomad creates as part of an allocation. However, you don't have to be running the command from within Nomad to get this error. If you just cd into the allocation directory and run the command manually (no chroot or anything), you get the same error. As long as the command is being run from within the allocation directory, it has issues.

Thanks!

@dansteen
Author

Hi @dadgar !

Here is the permissions breakdown under 0.7.1:

-rw-r--r-- 1 root root 9193 Apr 23 10:49 local/org_service/releases/0.0.1/org_service.schema.exs
drwxr-xr-x 5 root root 4096 Apr 23 10:49 local/org_service/releases/0.0.1/
drwxr-xr-x 3 root root 4096 Apr 23 10:49 local/org_service/releases/
drwxr-xr-x 6 root root 4096 Apr 23 10:49 local/org_service
drwxrwxrwx 4 nobody nogroup 4096 Apr 23 10:49 local

Here is the outcome of running cat as the command:

@moduledoc """
A schema is a keyword list which represents how to map, transform, and validate
...
<lots of stuff>

So it seems that the file is there and can be generally read.

Thanks!

@dadgar
Contributor

dadgar commented Apr 25, 2018

@dansteen Thanks for the reproducer. We were able to track it down to an issue that occurs when the archive does not have the access time set. This has been fixed and will be pulled into Nomad 0.8.2, which will be releasing shortly!
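To illustrate the class of bug (this is not Nomad's actual fix; the package and function names are made up for illustration): a tar entry written without an access time decodes to Go's zero time.Time, and applying that value verbatim to the extracted file can leave it with a nonsensical access time. A defensive extractor falls back to the modification time when the access time is unset:

package extractsketch

import (
	"archive/tar"
	"os"
	"time"
)

// restoreTimes applies a tar header's timestamps to an extracted file,
// guarding against an archive that never recorded an access time.
func restoreTimes(path string, hdr *tar.Header) error {
	atime, mtime := hdr.AccessTime, hdr.ModTime
	if atime.IsZero() {
		// No atime recorded in the archive: reuse mtime instead of
		// writing a zero timestamp to disk.
		atime = mtime
	}
	if mtime.IsZero() {
		mtime = time.Now()
	}
	return os.Chtimes(path, atime, mtime)
}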

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 30, 2022