[0.9.3] Issue accessing allocation logs #5920

Closed
mantoine96 opened this issue Jul 4, 2019 · 11 comments · Fixed by #5954
Labels
theme/api, theme/tls, type/bug

Comments

@mantoine96

mantoine96 commented Jul 4, 2019

Hey everyone,

I'll preface this by saying that the issue seems intermittent: sometimes nomad alloc logs works for an allocation, sometimes the call hangs. It happens on running, failed, and completed allocations alike.

When calling nomad alloc logs on an allocation, the call hangs. This also happens in the Web UI, where both the direct call to the node and the call through the master fail.

One thing to note is that logs are available on the filesystem of the clients and can be read without any problems.

I'm not sure whether this is a question or a bug report, as I can't reproduce the issue on other versions of Nomad (0.9.1).

Reproduction steps

$ nomad alloc logs -job myjob
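
The hang is also reproducible against the HTTP API directly, since the CLI wraps the client fs logs endpoint. A sketch, with the server address, ACL token, and allocation ID as placeholders:

$ curl -s --cacert /cert/cert.crt \
    -H "X-Nomad-Token: <token>" \
    "https://<server>:4646/v1/client/fs/logs/<alloc-id>?task=myservice&type=stdout&origin=end&offset=50000"

When the bug triggers, this curl hangs just like the CLI does (see below).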

Job file (if appropriate)

Doesn't seem dependent on the job, since it sometimes works for one allocation of a job and not for another.

Nomad Shared config

consul {
  # The address to the Consul agent.
  address = "127.0.0.1:8500"

  # The service name to register the server and client with Consul.
  server_service_name = "nomad"
  client_service_name = "nomadclient"

  # Enables automatically registering the services.
  auto_advertise = true

  # Enabling the server and client to bootstrap using Consul.
  server_auto_join = true
  client_auto_join = true
}

tls {
  http    = true
  rpc     = true
  ca_file = "/cert/cert.crt"
  cert_file = "/etc/ssl/server.eu-central-1.nomad/cert.pem"
  key_file  = "/etc/ssl/server.eu-central-1.nomad/key.pem"
  verify_server_hostname = true
}

telemetry {
  publish_allocation_metrics = true
  publish_node_metrics       = true
  datadog_address            = "127.0.0.1:8125"
  disable_hostname           = true
}

enable_syslog = true

acl {
  enabled    = true
  token_ttl  = "30s"
  policy_ttl = "60s"
}

Nomad Server config

datacenter = "datacenter"

data_dir = "/tmp/nomad"

region = "eu-central-1"

server {
  enabled          = true
  bootstrap_expect = 3
  encrypt          = "xxxxxxxx"

  job_gc_threshold        = "12h"
  eval_gc_threshold       = "6h"
  deployment_gc_threshold = "6h"

  authoritative_region = "eu-central-1"
}

vault {
  enabled = true

  address = "https://vault.service.consul"

  token = "XXXXXX"


  create_from_role = "system_nomad"
}

Nomad Client Config

datacenter = "datacenter"
data_dir = "/tmp/nomad"
region = "eu-central-1"

client {
  enabled = true
  meta = {
  }

  options = {
    "docker.auth.config" = "/root/.docker/config.json"
    "docker.caps.whitelist" = "CHOWN,DAC_OVERRIDE,FSETID,FOWNER,MKNOD,NET_RAW,SETGID,SETUID,SETFCAP, SETPCAP,NET_BIND_SERVICE,SYS_CHROOT,KILL,AUDIT_WRITE,NET_ADMIN"
  }


  gc_interval = "10m"
  gc_max_allocs = 100

}


advertise {
}



vault {
  enabled = true
  address = "https://vault.service.consul"
}

Nomad Client logs (if appropriate)

Couldn't find anything meaningful in Nomad logs.

In the browser I see:

nomad-ui-c53fd732837d85cbd816ec376e86c4f7.js:1164 LOG FETCH: Couldn't connect to //10.102.3.22:4646/v1/client/fs/logs/2ba31be3-db99-d243-7aea-27f41651b010?follow=true&offset=50000&origin=end&task=myservice&type=stdout
(anonymous) @ nomad-ui-c53fd732837d85cbd816ec376e86c4f7.js:1164

nomad-ui-c53fd732837d85cbd816ec376e86c4f7.js:1164 LOG FETCH: Couldn't connect to /v1/client/fs/logs/2ba31be3-db99-d243-7aea-27f41651b010?follow=true&offset=50000&origin=end&task=myservice&type=stdout

The first failure is expected: the client nodes don't have a certificate that corresponds to their IP.

The second call, which goes to the master and is then proxied to the client, is the one that is flaky for us. Sometimes refreshing the UI, or making another nomad logs call, is enough to get the logs. Other times, the call will just hang (through curl or nomad logs).
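
A quick way to confirm the SAN mismatch behind that first, expected failure is to inspect the certificate a client node presents (node IP taken from the log line above):

$ echo | openssl s_client -connect 10.102.3.22:4646 2>/dev/null \
    | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'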

I've been trying to figure it out for a few days at this point and can't find anything that could indicate what's happening.

Thanks

@cgbaker
Contributor

cgbaker commented Jul 5, 2019

What happens if you run the CLI command on the client node where the allocation was scheduled (i.e., where the logs are still present on the file system)?

$ nomad alloc logs -job myjob

@ilyaiqoqo
Contributor

We experience the same behavior; the logs only work intermittently.

@mantoine96
Author

mantoine96 commented Jul 10, 2019

Hey @cgbaker

From the node, I was able to consistently get the logs:

$ NOMAD_ADDR=https://localhost:4646 NOMAD_SKIP_VERIFY=1 nomad logs -stderr $myAlloc

From my machine, pointing NOMAD_ADDR to the masters, Nomad was still only displaying the logs intermittently (~ once every 5 to 10 tries)
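
A rough loop like the following (the 5-second timeout is arbitrary) makes the failure rate easy to measure:

$ for i in $(seq 1 10); do
    timeout 5 nomad logs -stderr $myAlloc > /dev/null 2>&1 \
      && echo "try $i: ok" || echo "try $i: hang"
  done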

@onomojaku

We have similar problems with both nomad alloc logs and nomad alloc exec. Typically it works fine on the node where the allocation is running, and fails from other nodes. We didn't see this behavior in Nomad 0.9.0, but we started seeing it in 0.9.2 and it persists in 0.9.3. Preetha asked for log/conf/job files, but it looks like you beat me to getting them off of our private network and out onto the internet. Thanks for the thorough documentation and bug report.

@jpclarks

Here is the example that @onomojaku referenced:

The alloc 3cdea42e is running on NOMAD_CLIENT_1. The command nomad alloc exec succeeds on NOMAD_CLIENT_1:

NOMAD_CLIENT_1 > nomad alloc exec 3cdea42e bash
root@3cdea42e:/#

Here are the logs:

NOMAD_CLIENT_1 > sudo /bin/cat /var/log/messages | grep 3cdea42e
Jul 10 15:04:37 NOMAD_CLIENT_1 nomad: 2019-07-10T15:04:37.919Z [INFO ] client: task exec  session starting: exec_id=0c3684e1-d128-237f-770a-c2213644878c alloc_id=3cdea42e-c682-c292-06f4-db96c0d8ca2d task=rabbitmq command=[bash] tty=true access_token_name= access_token_id=
Jul 10 15:04:40 NOMAD_CLIENT_1 nomad: 2019-07-10T15:04:40.028Z [DEBUG] http: request complete: method=GET path=/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true duration=2.108800229s

However, the command nomad alloc exec fails on NOMAD_CLIENT_2 (where the alloc 3cdea42e is not running):

NOMAD_CLIENT_2 > nomad alloc exec 3cdea42e bash
failed to exec into task: write tcp NOMAD_SERVER_1:43064->NOMAD_SERVER_2:4647: write: connection reset by peer

Here are the logs:

NOMAD_CLIENT_2 > sudo /bin/cat /var/log/messages | grep 3cdea42e
Jul 10 14:30:42 NOMAD_CLIENT_2 nomad: 2019-07-10T14:30:42.138Z [ERROR] http: request failed: method=GET path=/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true error="write tcp NOMAD_SERVER_IP_1:57470->NOMAD_SERVER_IP_2:4647: write: connection reset by peer" code=500
Jul 10 14:30:42 NOMAD_CLIENT_2 nomad: 2019-07-10T14:30:42.138Z [DEBUG] http: request complete: method=GET path=/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true duration=16.958344ms

Of note, creating a websocket with

 NOMAD_CLIENT_2 > curl -X GET \
  --include \
  --no-buffer \
  --verbose \
  --header "Connection: Upgrade" \
  --header "Upgrade: websocket" \
  --header "Sec-Websocket-Version: 13" \
  --header "Sec-WebSocket-Key: foo" \
  --cacert /etc/pki/cluster/nomad-ca.pem \
  --cert /etc/pki/cluster/nomad.cli.pem \
  --key /etc/pki/cluster/nomad.cli-key.pem \
  "https://localhost:4646/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true"

succeeds, after which the nomad alloc exec commands will succeed until the connection is terminated.
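
(That a single successful websocket "primes" subsequent exec calls suggests the forwarded streaming connection is reused once established; @notnoop's diagnosis below points the same way.)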

@daveadams

FWIW, we're seeing the same thing after upgrading to 0.9.3 from 0.9.1.

@notnoop
Contributor

notnoop commented Jul 12, 2019

Thank you so much for raising the issue. I pinpointed the problem: a bug in server-to-server streaming RPC forwarding when TLS is enabled. I'm aiming to have a PR with a fix up soon; it should target 0.9.4.
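
Once a build with the fix is out, a simple way to exercise the forwarding path is to point the CLI at a server that the affected client is not connected to, so the request has to be forwarded server-to-server (the address is a placeholder):

$ NOMAD_ADDR=https://<other-server>:4646 nomad alloc logs <alloc-id>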

@onomojaku

onomojaku commented Jul 12, 2019 via email

@mantoine96
Author

mantoine96 commented Jul 12, 2019

Glad to hear that this is a bug and that you were able to pinpoint it. I'm eager to see what the fix is :)

Thank you for the responsiveness, and looking forward to upgrading to 0.9.4!

Cheers

@yoadfe

yoadfe commented Jul 14, 2019

Tested with a dev binary, and it indeed solved our issue. Thank you :)

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 21, 2022