[0.9.3] Issue accessing allocation logs #5920

Closed
mantoine96 opened this issue Jul 4, 2019 · 11 comments · Fixed by #5954
Labels
theme/api, theme/tls, type/bug

Comments

@mantoine96

mantoine96 commented Jul 4, 2019

Hey everyone,

I'll preface this by saying that the issue seems intermittent: sometimes nomad alloc logs works for an allocation, sometimes the call hangs. It happens on running, failed, and completed allocations alike.

When calling nomad alloc logs on an allocation, the call hangs. This also happens in the Web UI, where both the direct call to the node and the call through the master fail.

One thing to note is that logs are available on the filesystem of the clients and can be read without any problems.

I'm not sure whether this is a question or a bug report, as I can't reproduce the issue on other versions of Nomad (0.9.1).

Reproduction steps

$ nomad alloc logs -job myjob
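
The hang is also reproducible against the HTTP API directly, since the CLI wraps the client fs logs endpoint. A sketch, with the server address, ACL token, and allocation ID as placeholders:

$ curl -s --cacert /cert/cert.crt \
    -H "X-Nomad-Token: <token>" \
    "https://<server>:4646/v1/client/fs/logs/<alloc-id>?task=myservice&type=stdout&origin=end&offset=50000"

When the bug triggers, this curl hangs just like the CLI does (see below).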

Job file (if appropriate)

Doesn't seem dependent on the job, since it sometimes works for one allocation of a job and not for another.

Nomad Shared config

consul {
  # The address to the Consul agent.
  address = "127.0.0.1:8500"

  # The service name to register the server and client with Consul.
  server_service_name = "nomad"
  client_service_name = "nomadclient"

  # Enables automatically registering the services.
  auto_advertise = true

  # Enabling the server and client to bootstrap using Consul.
  server_auto_join = true
  client_auto_join = true
}

tls {
  http    = true
  rpc     = true
  ca_file = "/cert/cert.crt"
  cert_file = "/etc/ssl/server.eu-central-1.nomad/cert.pem"
  key_file  = "/etc/ssl/server.eu-central-1.nomad/key.pem"
  verify_server_hostname = true
}

telemetry {
  publish_allocation_metrics = true
  publish_node_metrics       = true
  datadog_address            = "127.0.0.1:8125"
  disable_hostname           = true
}

enable_syslog = true

acl {
  enabled    = true
  token_ttl  = "30s"
  policy_ttl = "60s"
}

Nomad Server config

datacenter = "datacenter"

data_dir = "/tmp/nomad"

region = "eu-central-1"

server {
  enabled          = true
  bootstrap_expect = 3
  encrypt          = "xxxxxxxx"

  job_gc_threshold        = "12h"
  eval_gc_threshold       = "6h"
  deployment_gc_threshold = "6h"

  authoritative_region = "eu-central-1"
}

vault {
  enabled = true

  address = "https://vault.service.consul"

  token = "XXXXXX"


  create_from_role = "system_nomad"
}

Nomad Client Config

datacenter = "datacenter"
data_dir = "/tmp/nomad"
region = "eu-central-1"

client {
  enabled = true
  meta = {
  }

  options = {
    "docker.auth.config" = "/root/.docker/config.json"
    "docker.caps.whitelist" = "CHOWN,DAC_OVERRIDE,FSETID,FOWNER,MKNOD,NET_RAW,SETGID,SETUID,SETFCAP, SETPCAP,NET_BIND_SERVICE,SYS_CHROOT,KILL,AUDIT_WRITE,NET_ADMIN"
  }


  gc_interval = "10m"
  gc_max_allocs = 100

}


advertise {
}



vault {
  enabled = true
  address = "https://vault.service.consul"
}

Nomad Client logs (if appropriate)

Couldn't find anything meaningful in Nomad logs.

In the browser I see:

nomad-ui-c53fd732837d85cbd816ec376e86c4f7.js:1164 LOG FETCH: Couldn't connect to //10.102.3.22:4646/v1/client/fs/logs/2ba31be3-db99-d243-7aea-27f41651b010?follow=true&offset=50000&origin=end&task=myservice&type=stdout
(anonymous) @ nomad-ui-c53fd732837d85cbd816ec376e86c4f7.js:1164

nomad-ui-c53fd732837d85cbd816ec376e86c4f7.js:1164 LOG FETCH: Couldn't connect to /v1/client/fs/logs/2ba31be3-db99-d243-7aea-27f41651b010?follow=true&offset=50000&origin=end&task=myservice&type=stdout

The first failure is expected: the client nodes don't have a certificate that corresponds to their IP.

The second call, which goes to the master and is then proxied to the client, is the one that is flaky for us. Sometimes refreshing the UI, or making another nomad logs call, is enough to get the logs. Other times, the call will just hang (through curl or nomad logs).
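
A quick way to confirm the SAN mismatch behind that first, expected failure is to inspect the certificate a client node presents (node IP taken from the log line above):

$ echo | openssl s_client -connect 10.102.3.22:4646 2>/dev/null \
    | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'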

I've been trying to figure it out for a few days at this point and can't find anything that could indicate what's happening.

Thanks

@cgbaker
Contributor

cgbaker commented Jul 5, 2019

What happens if you run the CLI command on the client node where the allocation was scheduled (i.e., where the logs are still present on the file system)?

$ nomad alloc logs -job myjob

@ilyaiqoqo
Contributor

We experience the same behavior; the logs only work intermittently.

@mantoine96
Author

mantoine96 commented Jul 10, 2019

Hey @cgbaker

From the node, I was able to consistently get the logs:

$ NOMAD_ADDR=https://localhost:4646 NOMAD_SKIP_VERIFY=1 nomad logs -stderr $myAlloc

From my machine, pointing NOMAD_ADDR to the masters, Nomad was still only displaying the logs intermittently (~ once every 5 to 10 tries)
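
A rough loop like the following (the 5-second timeout is arbitrary) makes the failure rate easy to measure:

$ for i in $(seq 1 10); do
    timeout 5 nomad logs -stderr $myAlloc > /dev/null 2>&1 \
      && echo "try $i: ok" || echo "try $i: hang"
  done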

@onomojaku

We have similar problems with both nomad alloc logs and nomad alloc exec. Typically it works fine on the node where the allocation is running, and fails from other nodes. We didn't see this behavior in Nomad 0.9.0, but we started seeing it in 0.9.2 and it persists in 0.9.3. Preetha asked for log/conf/job files, but it looks like you beat me to getting them off of our private network and out onto the internet. Thanks for the thorough documentation and bug report.

@jpclarks

Here is the example that @onomojaku referenced:

The alloc 3cdea42e is running on NOMAD_CLIENT_1. The command nomad alloc exec succeeds on NOMAD_CLIENT_1:

NOMAD_CLIENT_1 > nomad alloc exec 3cdea42e bash
root@3cdea42e:/#

Here are the logs:

NOMAD_CLIENT_1 > sudo /bin/cat /var/log/messages | grep 3cdea42e
Jul 10 15:04:37 NOMAD_CLIENT_1 nomad: 2019-07-10T15:04:37.919Z [INFO ] client: task exec  session starting: exec_id=0c3684e1-d128-237f-770a-c2213644878c alloc_id=3cdea42e-c682-c292-06f4-db96c0d8ca2d task=rabbitmq command=[bash] tty=true access_token_name= access_token_id=
Jul 10 15:04:40 NOMAD_CLIENT_1 nomad: 2019-07-10T15:04:40.028Z [DEBUG] http: request complete: method=GET path=/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true duration=2.108800229s

However, the command nomad alloc exec fails on NOMAD_CLIENT_2 (where the alloc 3cdea42e is not running):

NOMAD_CLIENT_2 > nomad alloc exec 3cdea42e bash
failed to exec into task: write tcp NOMAD_SERVER_1:43064->NOMAD_SERVER_2:4647: write: connection reset by peer

Here are the logs:

NOMAD_CLIENT_2 > sudo /bin/cat /var/log/messages | grep 3cdea42e
Jul 10 14:30:42 NOMAD_CLIENT_2 nomad: 2019-07-10T14:30:42.138Z [ERROR] http: request failed: method=GET path=/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true error="write tcp NOMAD_SERVER_IP_1:57470->NOMAD_SERVER_IP_2:4647: write: connection reset by peer" code=500
Jul 10 14:30:42 NOMAD_CLIENT_2 nomad: 2019-07-10T14:30:42.138Z [DEBUG] http: request complete: method=GET path=/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true duration=16.958344ms

Of note, creating a websocket with

 NOMAD_CLIENT_2 > curl -X GET \
  --include \
  --no-buffer \
  --verbose \
  --header "Connection: Upgrade" \
  --header "Upgrade: websocket" \
  --header "Sec-Websocket-Version: 13" \
  --header "Sec-WebSocket-Key: foo" \
  --cacert /etc/pki/cluster/nomad-ca.pem \
  --cert /etc/pki/cluster/nomad.cli.pem \
  --key /etc/pki/cluster/nomad.cli-key.pem \
  "https://localhost:4646/v1/client/allocation/3cdea42e-c682-c292-06f4-db96c0d8ca2d/exec?command=%5B%22bash%22%5D&region=REGION_1&task=rabbitmq&tty=true"

succeeds, after which the nomad alloc exec commands will succeed until the connection is terminated.
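
(That a single successful websocket "primes" subsequent exec calls suggests the forwarded streaming connection is reused once established; @notnoop's diagnosis below points the same way.)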

@daveadams

FWIW, we're seeing the same thing after upgrading to 0.9.3 from 0.9.1.

@notnoop
Contributor

notnoop commented Jul 12, 2019

Thank you so much for raising the issue. I pinpointed the problem: a bug in server-to-server streaming RPC forwarding when TLS is enabled. I'm aiming to have a PR with a fix up soon; it should target 0.9.4.
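
Once a build with the fix is out, a simple way to exercise the forwarding path is to point the CLI at a server that the affected client is not connected to, so the request has to be forwarded server-to-server (the address is a placeholder):

$ NOMAD_ADDR=https://<other-server>:4646 nomad alloc logs <alloc-id>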

@onomojaku

onomojaku commented Jul 12, 2019 via email

@mantoine96
Author

mantoine96 commented Jul 12, 2019

Glad to hear that this is a bug and that you were able to pinpoint it. I'm eager to see what the fix is :)

Thank you for the responsiveness, and looking forward to upgrading to 0.9.4!

Cheers

@yoadfe

yoadfe commented Jul 14, 2019

Tested with a dev binary, and it indeed solved our issue. Thank you :)

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 21, 2022