
Error querying node allocations: invalid character 'I' in numeric literal #11100

Closed · watsonian opened this issue Aug 30, 2021 · 11 comments

@watsonian (Contributor)

Nomad version

Nomad v1.1.4+ent (fbad2c4f1c135306ecd6d9cd794508fbaaf0c4de)

Operating system and Environment details

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=21.04
DISTRIB_CODENAME=hirsute
DISTRIB_DESCRIPTION="Ubuntu 21.04"
nomad
  bootstrap = true
  known_regions = 1
  leader = true
  leader_addr = 172.28.128.151:4647
  server = true
raft
  applied_index = 14
  commit_index = 14
  fsm_pending = 0
  last_contact = 0
  last_log_index = 14
  last_log_term = 2
  last_snapshot_index = 0
  last_snapshot_term = 0
  latest_configuration = [{Suffrage:Voter ID:172.28.128.151:4647 Address:172.28.128.151:4647}]
  latest_configuration_index = 0
  num_peers = 0
  protocol_version = 2
  protocol_version_max = 3
  protocol_version_min = 0
  snapshot_version_max = 1
  snapshot_version_min = 0
  state = Leader
  term = 2
runtime
  arch = amd64
  cpu_count = 2
  goroutines = 78
  kernel.name = linux
  max_procs = 2
  version = go1.16.7
serf
  coordinate_resets = 0
  encrypted = false
  event_queue = 0
  event_time = 1
  failed = 0
  health_score = 0
  intent_queue = 0
  left = 0
  member_time = 3
  members = 2
  query_queue = 0
  query_time = 1
vault
  token_expire_time =
  token_ttl = 0s
  tracked_for_revoked = 0

Issue

In a two-datacenter deployment, when a job uses the spread stanza, requesting the status of one of its allocations returns an error:

$ nomad alloc status 59eeb575
Error querying allocation: invalid character 'I' in numeric literal

If you perform the request via curl, it comes back fine:

$ curl -s localhost:4646/v1/allocations?prefix=59eeb575 -v
*   Trying 127.0.0.1:4646...
* Connected to localhost (127.0.0.1) port 4646 (#0)
> GET /v1/allocations?prefix=59eeb575 HTTP/1.1
> Host: localhost:4646
> User-Agent: curl/7.74.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: application/json
< Vary: Accept-Encoding
< X-Nomad-Index: 253
< X-Nomad-Knownleader: true
< X-Nomad-Lastcontact: 0
< Date: Thu, 26 Aug 2021 17:29:30 GMT
< Transfer-Encoding: chunked
<
[{"ID":"59eeb575-0e75-9714-2262-c9c1c83cbfe7","EvalID":"5fa73e1c-8641-06dd-fc86-7a43eb7e87fd","Name":"error_repro.error_repro_group[1]","Namespace":"default","NodeID":"bdf919c8-1e52-a814-8067-2a25b23dc5fe","NodeName":"dc1-nomad-client-1","JobID":"error_repro","JobType":"service","JobVersion":0,"TaskGroup":"error_repro_group","AllocatedResources":null,"DesiredStatus":"run","DesiredDescription":"","ClientStatus":"running","ClientDescription":"Tasks are running","DesiredTransition":{"Migrate":null,"Reschedule":null,"ForceReschedule":null},"TaskStates":{"error_repro_task":{"State":"running","Failed":false,"Restarts":0,"LastRestart":null,"StartedAt":"2021-08-26T17:25:47.092349187Z","FinishedAt":null,"Events":[{"Type":"Received","Time":1629998746553224465,"Message":"","DisplayMessage":"Task received by client","Details":{},"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Task Setup","Time":1629998746561175588,"Message":"Building Task Directory","DisplayMessage":"Building Task Directory","Details":{"message":"Building Task Directory"},"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Started","Time":1629998747092173863,"Message":"","DisplayMessage":"Task started by client","Details":{},"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""}],"TaskHandle":null}},"DeploymentStatus":{"Healthy":true,"Timestamp":"2021-08-26T17:25:57.094420438Z","Canary":false,"ModifyIndex":251},"FollowupEvalID":"","RescheduleTracker":null,"PreemptedAllocations":null,"PreemptedByAllocation":"","CreateIndex":246,"ModifyIndex":251,"CreateTime":1629998746532089531,"ModifyTime":1629998757212308321}]
* Connection #0 to host localhost left intact

Reproduction steps

  1. Spin up an environment with two Nomad datacenters (dc1, dc2). Each datacenter should have a single Nomad server and a single Nomad client (this isn't necessarily important to the reproduction, but is how I reproduced the issue). Configs supplied below.
  2. Run the example job (supplied below). It should run successfully.
  3. Run nomad alloc status $ALLOC_ID for one of the allocs of the job. The error should appear. Commenting out the spread stanza and running the job again causes nomad alloc status to return the expected result.

Expected Result

The allocation status should be printed.

Actual Result

Error querying allocation: invalid character 'I' in numeric literal

Job file (if appropriate)

job "error_repro" {
  datacenters = ["dc1", "dc2"]

  type = "service"

  spread {
    attribute = "${node.datacenter}"
    weight    = 100

    target "dc1" {
      percent = 100 
    }
    target "dc2" {
      percent = 0
    }
  }

  update {
    max_parallel = 1
    min_healthy_time = "10s"
    healthy_deadline = "3m"
    progress_deadline = "10m"
    auto_revert = false
    canary = 0
  }

  migrate {
    max_parallel = 1
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  group "error_repro_group" {
    count = 2

    network {
      port "db" {
        to = 6379
      }
    }

    restart {
      attempts = 3
      interval = "30m"
      delay = "15s"
      mode = "delay"
    }

    task "error_repro_task" {
      driver = "docker"

      config {
        image = "redis:3.2"

        ports = ["db"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

Nomad Server Config

data_dir = "/tmp/data-nomad"

bind_addr = "0.0.0.0"

log_level = "trace"

advertise {
    http = "{{ GetInterfaceIP \"eth1\" }}"
    rpc  = "{{ GetInterfaceIP \"eth1\" }}"
    serf = "{{ GetInterfaceIP \"eth1\" }}"
}

server {
    enabled = true
    bootstrap_expect = 1

    license_path = "/vagrant/nomad.lic"
}

client {
    enabled = false
}

telemetry {
  publish_allocation_metrics = true
  publish_node_metrics       = true
  prometheus_metrics         = true
}

Nomad Client Config

data_dir = "/tmp/data-nomad"

bind_addr = "0.0.0.0"

log_level = "trace"

advertise {
    http = "{{ GetInterfaceIP \"eth1\" }}"
    rpc  = "{{ GetInterfaceIP \"eth1\" }}"
    serf = "{{ GetInterfaceIP \"eth1\" }}"
}

server {
    enabled = false

    license_path = "/vagrant/nomad.lic"
}

client {
    enabled = true
}

telemetry {
  publish_allocation_metrics = true
  publish_node_metrics       = true
  prometheus_metrics         = true
}
@lgfa29 (Contributor) commented Sep 2, 2021

Thanks for the report @watsonian.

Does the error still happen if the CLI is connected to a server in the datacenter where the allocation is running? For example, if you set the NOMAD_ADDR env var to point to the region that has the allocation:

$ NOMAD_ADDR=http://dc1.my.nomad.cluster.com:4646 nomad alloc status 59eeb575

@watsonian (Contributor, Author)

@lgfa29 In my tests, all DCs were in the same region, and I was running the CLI request against the server the job was allocated on.

@ron-savoia (Contributor)

This behavior is seen in 1.4.3 as well.

Repro Steps

  1. Create two single server clusters (East/West regions).
  2. Create one worker node for each cluster.
  3. Verify that each cluster is stable before moving on.

Setup

Region  Server   Server IP      Worker   Worker IP
West    server1  192.168.1.190  worker1  192.168.1.195
East    server2  192.168.1.191  worker2  192.168.1.196
  1. On server2 run: nomad server join 192.168.1.190:4648
  2. Run error.nomad on server1 (West)
root@server1:/tmp# nomad run error.nomad
==> 2023-01-27T19:57:00Z: Monitoring evaluation "375adef0"
    2023-01-27T19:57:00Z: Evaluation triggered by job "error_repro"
    2023-01-27T19:57:01Z: Evaluation within deployment: "c2e272fd"
    2023-01-27T19:57:01Z: Allocation "e0c03903" created: node "e63a5786", group "error_repro_group"
    2023-01-27T19:57:01Z: Allocation "f2dcf44a" created: node "e63a5786", group "error_repro_group"
    2023-01-27T19:57:01Z: Evaluation status changed: "pending" -> "complete"
==> 2023-01-27T19:57:01Z: Evaluation "375adef0" finished with status "complete"
==> 2023-01-27T19:57:01Z: Monitoring deployment "c2e272fd"
  ✓ Deployment "c2e272fd" successful

    2023-01-27T19:57:17Z
    ID          = c2e272fd
    Job ID      = error_repro
    Job Version = 0
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
    error_repro_group  2        2       2        0          2023-01-27T20:07:15Z
  3. Run nomad status:

server1

root@server1:/tmp# nomad status
ID           Type     Priority  Status   Submit Date
error_repro  service  50        running  2023-01-27T19:57:00Z

server2

root@server2:/etc/nomad.d# nomad status
No running jobs

worker1

root@worker1:/etc/nomad.d# nomad status
ID           Type     Priority  Status   Submit Date
error_repro  service  50        running  2023-01-27T19:57:00Z

worker2

root@worker2:/etc/nomad.d# nomad status
No running jobs
  4. Run nomad job status error_repro:

server1

root@server1:/tmp# nomad job status error_repro
ID            = error_repro
Name          = error_repro
Submit Date   = 2023-01-27T19:57:00Z
Type          = service
Priority      = 50
Datacenters   = east_dc1,west_dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group         Queued  Starting  Running  Failed  Complete  Lost  Unknown
error_repro_group  0       0         2        0       0         0     0

Latest Deployment
ID          = c2e272fd
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
error_repro_group  2        2       2        0          2023-01-27T20:07:15Z

Allocations
ID        Node ID   Task Group         Version  Desired  Status   Created    Modified
e0c03903  e63a5786  error_repro_group  0        run      running  4m14s ago  3m58s ago
f2dcf44a  e63a5786  error_repro_group  0        run      running  4m14s ago  3m58s ago

server2

root@server2:/etc/nomad.d# nomad job status error_repro
No job(s) with prefix or id "error_repro" found

worker1

root@worker1:/etc/nomad.d# nomad job status error_repro
ID            = error_repro
Name          = error_repro
Submit Date   = 2023-01-27T19:57:00Z
Type          = service
Priority      = 50
Datacenters   = east_dc1,west_dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group         Queued  Starting  Running  Failed  Complete  Lost  Unknown
error_repro_group  0       0         2        0       0         0     0

Latest Deployment
ID          = c2e272fd
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
error_repro_group  2        2       2        0          2023-01-27T20:07:15Z

Allocations
ID        Node ID   Task Group         Version  Desired  Status   Created    Modified
e0c03903  e63a5786  error_repro_group  0        run      running  4m14s ago  3m58s ago
f2dcf44a  e63a5786  error_repro_group  0        run      running  4m14s ago  3m58s ago

worker2

root@worker2:/etc/nomad.d# nomad job status error_repro
No job(s) with prefix or id "error_repro" found
  5. Run nomad alloc status <ALLOC_ID>:

server1

root@server1:/tmp# nomad alloc status e0c03903
Error querying allocation: invalid character 'I' in numeric literal

server2

root@server2:/etc/nomad.d# nomad alloc status e0c03903
No allocation(s) with prefix or id "e0c03903" found

worker1

root@worker1:/etc/nomad.d# nomad alloc status e0c03903
Error querying allocation: invalid character 'I' in numeric literal

worker2

root@worker2:/etc/nomad.d# nomad alloc status e0c03903
No allocation(s) with prefix or id "e0c03903" found
  6. Re-run nomad alloc status <ALLOC_ID> with the -region flag:

server1

root@server1:/tmp# nomad alloc status -region=west e0c03903
Error querying allocation: invalid character 'I' in numeric literal
root@server1:/tmp# nomad alloc status -region=east e0c03903
No allocation(s) with prefix or id "e0c03903" found

server2

root@server2:/etc/nomad.d# nomad alloc status -region=west e0c03903
Error querying allocation: invalid character 'I' in numeric literal
root@server2:/etc/nomad.d# nomad alloc status -region=east e0c03903
No allocation(s) with prefix or id "e0c03903" found

worker1

root@worker1:/etc/nomad.d# nomad alloc status -region=west e0c03903
Error querying allocation: invalid character 'I' in numeric literal
root@worker1:/etc/nomad.d# nomad alloc status -region=east e0c03903
No allocation(s) with prefix or id "e0c03903" found

worker2

root@worker2:/etc/nomad.d# nomad alloc status -region=west e0c03903
Error querying allocation: invalid character 'I' in numeric literal
root@worker2:/etc/nomad.d# nomad alloc status -region=east e0c03903
No allocation(s) with prefix or id "e0c03903" found
  7. Run nomad node status -self on the worker nodes:

worker1

root@worker1:/etc/nomad.d# nomad node status -self
Error querying node allocations: invalid character 'I' in numeric literal

worker2

root@worker2:/etc/nomad.d# nomad node status -self
ID              = 88833ba9-59b3-7ad7-d238-545e14d622e3
Name            = worker2
Class           = <none>
DC              = east_dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 2h22m12s
Host Volumes    = <none>
Host Networks   = <none>
CSI Volumes     = <none>
Driver Status   = docker,exec

Node Events
Time                  Subsystem  Message
2023-01-27T19:18:56Z  Cluster    Node registered

Allocated Resources
CPU         Memory       Disk
0/8000 MHz  0 B/3.8 GiB  0 B/21 GiB

Allocation Resource Utilization
CPU         Memory
0/8000 MHz  0 B/3.8 GiB

Host Resource Utilization
CPU         Memory           Disk
0/8000 MHz  298 MiB/3.8 GiB  (/dev/mapper/ubuntu--vg-ubuntu--lv)

Allocations
No allocations placed
  8. Purge the job and run gc

Re-run of the same steps, with the spread stanza removed (error_no_spread.nomad)

  1. Run error_no_spread.nomad on server1 (West):
root@server1:/tmp# nomad run error_no_spread.nomad
==> 2023-01-27T20:16:08Z: Monitoring evaluation "7d303505"
    2023-01-27T20:16:08Z: Evaluation triggered by job "error_repro"
    2023-01-27T20:16:09Z: Evaluation within deployment: "98639e8e"
    2023-01-27T20:16:09Z: Allocation "d8e322a0" created: node "e63a5786", group "error_repro_group"
    2023-01-27T20:16:09Z: Allocation "f5e66320" created: node "e63a5786", group "error_repro_group"
    2023-01-27T20:16:09Z: Evaluation status changed: "pending" -> "complete"
==> 2023-01-27T20:16:09Z: Evaluation "7d303505" finished with status "complete"
==> 2023-01-27T20:16:09Z: Monitoring deployment "98639e8e"
  ✓ Deployment "98639e8e" successful

    2023-01-27T20:16:20Z
    ID          = 98639e8e
    Job ID      = error_repro
    Job Version = 0
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
    error_repro_group  2        2       2        0          2023-01-27T20:26:19Z
  2. Run nomad status:

server1

root@server1:/tmp# nomad status
ID           Type     Priority  Status   Submit Date
error_repro  service  50        running  2023-01-27T20:16:08Z

server2

root@server2:/etc/nomad.d# nomad status
No running jobs

worker1

root@worker1:/etc/nomad.d# nomad status
ID           Type     Priority  Status   Submit Date
error_repro  service  50        running  2023-01-27T20:16:08Z

worker2

root@worker2:/etc/nomad.d# nomad status
No running jobs
  3. Run nomad job status error_repro:

server1

root@server1:/tmp# nomad job status error_repro
ID            = error_repro
Name          = error_repro
Submit Date   = 2023-01-27T20:16:08Z
Type          = service
Priority      = 50
Datacenters   = east_dc1,west_dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group         Queued  Starting  Running  Failed  Complete  Lost  Unknown
error_repro_group  0       0         2        0       0         0     0

Latest Deployment
ID          = 98639e8e
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
error_repro_group  2        2       2        0          2023-01-27T20:26:19Z

Allocations
ID        Node ID   Task Group         Version  Desired  Status   Created    Modified
d8e322a0  e63a5786  error_repro_group  0        run      running  3m30s ago  3m20s ago
f5e66320  e63a5786  error_repro_group  0        run      running  3m30s ago  3m20s ago

server2

root@server2:/etc/nomad.d# nomad job status error_repro
No job(s) with prefix or id "error_repro" found

worker1

root@worker1:/etc/nomad.d# nomad job status error_repro
ID            = error_repro
Name          = error_repro
Submit Date   = 2023-01-27T20:16:08Z
Type          = service
Priority      = 50
Datacenters   = east_dc1,west_dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group         Queued  Starting  Running  Failed  Complete  Lost  Unknown
error_repro_group  0       0         2        0       0         0     0

Latest Deployment
ID          = 98639e8e
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
error_repro_group  2        2       2        0          2023-01-27T20:26:19Z

Allocations
ID        Node ID   Task Group         Version  Desired  Status   Created    Modified
d8e322a0  e63a5786  error_repro_group  0        run      running  3m30s ago  3m20s ago
f5e66320  e63a5786  error_repro_group  0        run      running  3m30s ago  3m20s ago

worker2

root@worker2:/etc/nomad.d# nomad job status error_repro
No job(s) with prefix or id "error_repro" found
  4. Run nomad alloc status <ALLOC_ID>:

server1

root@server1:/tmp# nomad alloc status d8e322a0
ID                  = d8e322a0-c622-a2a4-eb96-acd0fc1196dc
Eval ID             = 7d303505
Name                = error_repro.error_repro_group[1]
Node ID             = e63a5786
Node Name           = worker1
Job ID              = error_repro
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 4m56s ago
Modified            = 4m46s ago
Deployment ID       = 98639e8e
Deployment Health   = healthy

Allocation Addresses:
Label  Dynamic  Address
*db    yes      192.168.1.195:20348 -> 6379

Task "error_repro_task" is "running"
Task Resources:
CPU        Memory           Disk     Addresses
0/500 MHz  2.5 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2023-01-27T20:16:09Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2023-01-27T20:16:09Z  Started     Task started by client
2023-01-27T20:16:08Z  Task Setup  Building Task Directory
2023-01-27T20:16:08Z  Received    Task received by client

server2

root@server2:/etc/nomad.d# nomad alloc status d8e322a0
No allocation(s) with prefix or id "d8e322a0" found

worker1

root@worker1:/etc/nomad.d# nomad alloc status d8e322a0
ID                  = d8e322a0-c622-a2a4-eb96-acd0fc1196dc
Eval ID             = 7d303505
Name                = error_repro.error_repro_group[1]
Node ID             = e63a5786
Node Name           = worker1
Job ID              = error_repro
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 4m56s ago
Modified            = 4m46s ago
Deployment ID       = 98639e8e
Deployment Health   = healthy

Allocation Addresses:
Label  Dynamic  Address
*db    yes      192.168.1.195:20348 -> 6379

Task "error_repro_task" is "running"
Task Resources:
CPU        Memory           Disk     Addresses
0/500 MHz  2.5 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2023-01-27T20:16:09Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2023-01-27T20:16:09Z  Started     Task started by client
2023-01-27T20:16:08Z  Task Setup  Building Task Directory
2023-01-27T20:16:08Z  Received    Task received by client

worker2

root@worker2:/etc/nomad.d# nomad alloc status d8e322a0
No allocation(s) with prefix or id "d8e322a0" found
  5. Re-run nomad alloc status <ALLOC_ID> with the -region flag:

server1

root@server1:/tmp# nomad alloc status -region=west d8e322a0
ID                  = d8e322a0-c622-a2a4-eb96-acd0fc1196dc
Eval ID             = 7d303505
Name                = error_repro.error_repro_group[1]
Node ID             = e63a5786
Node Name           = worker1
Job ID              = error_repro
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 6m55s ago
Modified            = 6m45s ago
Deployment ID       = 98639e8e
Deployment Health   = healthy

Allocation Addresses:
Label  Dynamic  Address
*db    yes      192.168.1.195:20348 -> 6379

Task "error_repro_task" is "running"
Task Resources:
CPU        Memory           Disk     Addresses
0/500 MHz  2.5 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2023-01-27T20:16:09Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2023-01-27T20:16:09Z  Started     Task started by client
2023-01-27T20:16:08Z  Task Setup  Building Task Directory
2023-01-27T20:16:08Z  Received    Task received by client
root@server1:/tmp# nomad alloc status -region=east d8e322a0
No allocation(s) with prefix or id "d8e322a0" found

server2

root@server2:/etc/nomad.d# nomad alloc status -region=west d8e322a0
ID                  = d8e322a0-c622-a2a4-eb96-acd0fc1196dc
Eval ID             = 7d303505
Name                = error_repro.error_repro_group[1]
Node ID             = e63a5786
Node Name           = worker1
Job ID              = error_repro
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 6m55s ago
Modified            = 6m45s ago
Deployment ID       = 98639e8e
Deployment Health   = healthy

Allocation Addresses:
Label  Dynamic  Address
*db    yes      192.168.1.195:20348 -> 6379

Task "error_repro_task" is "running"
Task Resources:
CPU        Memory           Disk     Addresses
0/500 MHz  2.5 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2023-01-27T20:16:09Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2023-01-27T20:16:09Z  Started     Task started by client
2023-01-27T20:16:08Z  Task Setup  Building Task Directory
2023-01-27T20:16:08Z  Received    Task received by client
root@server2:/etc/nomad.d# nomad alloc status -region=east d8e322a0
No allocation(s) with prefix or id "d8e322a0" found

worker1

root@worker1:/etc/nomad.d# nomad alloc status -region=west d8e322a0
ID                  = d8e322a0-c622-a2a4-eb96-acd0fc1196dc
Eval ID             = 7d303505
Name                = error_repro.error_repro_group[1]
Node ID             = e63a5786
Node Name           = worker1
Job ID              = error_repro
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 6m55s ago
Modified            = 6m45s ago
Deployment ID       = 98639e8e
Deployment Health   = healthy

Allocation Addresses:
Label  Dynamic  Address
*db    yes      192.168.1.195:20348 -> 6379

Task "error_repro_task" is "running"
Task Resources:
CPU        Memory           Disk     Addresses
0/500 MHz  2.5 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2023-01-27T20:16:09Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2023-01-27T20:16:09Z  Started     Task started by client
2023-01-27T20:16:08Z  Task Setup  Building Task Directory
2023-01-27T20:16:08Z  Received    Task received by client
root@worker1:/etc/nomad.d# nomad alloc status -region=east d8e322a0
No allocation(s) with prefix or id "d8e322a0" found

worker2

root@worker2:/etc/nomad.d# nomad alloc status -region=west d8e322a0
ID                  = d8e322a0-c622-a2a4-eb96-acd0fc1196dc
Eval ID             = 7d303505
Name                = error_repro.error_repro_group[1]
Node ID             = e63a5786
Node Name           = worker1
Job ID              = error_repro
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 6m55s ago
Modified            = 6m45s ago
Deployment ID       = 98639e8e
Deployment Health   = healthy

Allocation Addresses:
Label  Dynamic  Address
*db    yes      192.168.1.195:20348 -> 6379

Task "error_repro_task" is "running"
Task Resources:
CPU        Memory           Disk     Addresses
0/500 MHz  2.5 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2023-01-27T20:16:09Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2023-01-27T20:16:09Z  Started     Task started by client
2023-01-27T20:16:08Z  Task Setup  Building Task Directory
2023-01-27T20:16:08Z  Received    Task received by client
root@worker2:/etc/nomad.d# nomad alloc status -region=east d8e322a0
No allocation(s) with prefix or id "d8e322a0" found
  6. Run nomad node status -self on the worker nodes:

worker1

root@worker1:/etc/nomad.d# nomad node status -self
ID              = e63a5786-9a76-9831-1ef9-ab5001ff08c3
Name            = worker1
Class           = <none>
DC              = west_dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 2h46m40s
Host Volumes    = <none>
Host Networks   = <none>
CSI Volumes     = <none>
Driver Status   = docker,exec,java,raw_exec

Node Events
Time                  Subsystem  Message
2023-01-27T19:18:54Z  Cluster    Node registered

Allocated Resources
CPU            Memory           Disk
1000/8000 MHz  512 MiB/3.8 GiB  600 MiB/19 GiB

Allocation Resource Utilization
CPU         Memory
0/8000 MHz  5.0 MiB/3.8 GiB

Host Resource Utilization
CPU         Memory           Disk
0/8000 MHz  354 MiB/3.8 GiB  (/dev/mapper/ubuntu--vg-ubuntu--lv)

Allocations
ID        Node ID   Task Group         Version  Desired  Status   Created    Modified
d8e322a0  e63a5786  error_repro_group  0        run      running  8m46s ago  8m36s ago
f5e66320  e63a5786  error_repro_group  0        run      running  8m46s ago  8m36s ago

worker2

root@worker2:/etc/nomad.d# nomad node status -self
ID              = 88833ba9-59b3-7ad7-d238-545e14d622e3
Name            = worker2
Class           = <none>
DC              = east_dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 2h38m52s
Host Volumes    = <none>
Host Networks   = <none>
CSI Volumes     = <none>
Driver Status   = docker,exec

Node Events
Time                  Subsystem  Message
2023-01-27T19:18:56Z  Cluster    Node registered

Allocated Resources
CPU         Memory       Disk
0/8000 MHz  0 B/3.8 GiB  0 B/21 GiB

Allocation Resource Utilization
CPU         Memory
0/8000 MHz  0 B/3.8 GiB

Host Resource Utilization
CPU         Memory           Disk
0/8000 MHz  306 MiB/3.8 GiB  (/dev/mapper/ubuntu--vg-ubuntu--lv)

Allocations
No allocations placed

Jobspecs

error.nomad

job "error_repro" {
  datacenters = ["east_dc1", "west_dc1"]

  type = "service"

  spread {
    attribute = "${node.datacenter}"
    weight    = 100

    target "east_dc1" {
      percent = 100
    }
    target "west_dc1" {
      percent = 0
    }
  }

  update {
    max_parallel = 1
    min_healthy_time = "10s"
    healthy_deadline = "3m"
    progress_deadline = "10m"
    auto_revert = false
    canary = 0
  }

  migrate {
    max_parallel = 1
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  group "error_repro_group" {
    count = 2

    network {
      port "db" {
        to = 6379
      }
    }

    restart {
      attempts = 3
      interval = "30m"
      delay = "15s"
      mode = "delay"
    }

    task "error_repro_task" {
      driver = "docker"

      config {
        image = "redis:7"

        ports = ["db"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

error_no_spread.nomad

job "error_repro" {
  datacenters = ["east_dc1", "west_dc1"]

  type = "service"

  update {
    max_parallel = 1
    min_healthy_time = "10s"
    healthy_deadline = "3m"
    progress_deadline = "10m"
    auto_revert = false
    canary = 0
  }

  migrate {
    max_parallel = 1
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  group "error_repro_group" {
    count = 2

    network {
      port "db" {
        to = 6379
      }
    }

    restart {
      attempts = 3
      interval = "30m"
      delay = "15s"
      mode = "delay"
    }

    task "error_repro_task" {
      driver = "docker"

      config {
        image = "redis:7"

        ports = ["db"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

@tgross (Member) commented Jan 27, 2023

@ron-savoia do you have the specific API request from the server debug logs that's returning the error? That'd help narrow things down to something more easily reproducible. Your example adds a bunch of new quirks to the problem that @watsonian specifically said weren't in play, like multi-region federation.

@ron-savoia (Contributor)

@tgross the odd thing is that I am not seeing an error for a particular API call in the logs. I re-ran the job and the repro steps while capturing a debug bundle and have attached it to this issue. Hopefully you will see something I am not.

@ron-savoia (Contributor)

I forgot to add the last re-run info...

root@server1:/tmp# nomad run error.nomad
==> 2023-01-27T21:22:55Z: Monitoring evaluation "1760d36c"
    2023-01-27T21:22:55Z: Evaluation triggered by job "error_repro"
    2023-01-27T21:22:56Z: Evaluation within deployment: "464e057d"
    2023-01-27T21:22:56Z: Allocation "0e78b461" created: node "e63a5786", group "error_repro_group"
    2023-01-27T21:22:56Z: Allocation "8da880a0" created: node "e63a5786", group "error_repro_group"
    2023-01-27T21:22:56Z: Evaluation status changed: "pending" -> "complete"
==> 2023-01-27T21:22:56Z: Evaluation "1760d36c" finished with status "complete"
==> 2023-01-27T21:22:56Z: Monitoring deployment "464e057d"
  ✓ Deployment "464e057d" successful

    2023-01-27T21:23:08Z
    ID          = 464e057d
    Job ID      = error_repro
    Job Version = 0
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
    error_repro_group  2        2       2        0          2023-01-27T21:33:06Z
    
root@server1:/tmp# nomad job status error_repro
ID            = error_repro
Name          = error_repro
Submit Date   = 2023-01-27T21:22:55Z
Type          = service
Priority      = 50
Datacenters   = east_dc1,west_dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group         Queued  Starting  Running  Failed  Complete  Lost  Unknown
error_repro_group  0       0         2        0       0         0     0

Latest Deployment
ID          = 464e057d
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
error_repro_group  2        2       2        0          2023-01-27T21:33:06Z

Allocations
ID        Node ID   Task Group         Version  Desired  Status   Created  Modified
0e78b461  e63a5786  error_repro_group  0        run      running  19s ago  9s ago
8da880a0  e63a5786  error_repro_group  0        run      running  19s ago  9s ago

root@server1:/tmp# nomad alloc status 0e78b461
Error querying allocation: invalid character 'I' in numeric literal

root@server1:/tmp# nomad alloc status 8da880a0
Error querying allocation: invalid character 'I' in numeric literal

root@worker1:/etc/nomad.d# nomad node status -self
Error querying node allocations: invalid character 'I' in numeric literal

@tgross (Member) commented Jan 27, 2023

> @tgross the odd thing is that I am not seeing an error for a particular API call in the logs. I re-ran the job and the repro steps while capturing a debug bundle and have attached it to this issue. Hopefully you will see something I am not.

You need to get the logs from the agent your Nomad CLI is addressing. Just look at the journal for that host; don't bother with all the mess of the debug bundle -- that's all noise here.

@ron-savoia (Contributor) commented Jan 31, 2023

This has been reproduced in a single-region, two-worker-node cluster (v1.4.3):
worker 1 - dc1
worker 2 - dc2

Error querying allocation: invalid character 'I' in numeric literal was seen when querying the alloc on the server and both worker nodes.

root@nomad-server-1:/home/vagrant# nomad run error.nomad
==> 2023-01-31T14:24:39Z: Monitoring evaluation "6e22c4f6"
    2023-01-31T14:24:39Z: Evaluation triggered by job "error_repro"
    2023-01-31T14:24:40Z: Evaluation within deployment: "3bd81a3d"
    2023-01-31T14:24:40Z: Allocation "699907ab" created: node "144e3838", group "error_repro_group"
    2023-01-31T14:24:40Z: Allocation "f122d1ff" created: node "144e3838", group "error_repro_group"
    2023-01-31T14:24:40Z: Evaluation status changed: "pending" -> "complete"
==> 2023-01-31T14:24:40Z: Evaluation "6e22c4f6" finished with status "complete"
==> 2023-01-31T14:24:40Z: Monitoring deployment "3bd81a3d"
  ✓ Deployment "3bd81a3d" successful

    2023-01-31T14:24:55Z
    ID          = 3bd81a3d
    Job ID      = error_repro
    Job Version = 0
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
    error_repro_group  2        2       2        0          2023-01-31T14:34:54Z

root@nomad-server-1:/home/vagrant# nomad job status error_repro
ID            = error_repro
Name          = error_repro
Submit Date   = 2023-01-31T14:24:39Z
Type          = service
Priority      = 50
Datacenters   = dc1,dc2
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group         Queued  Starting  Running  Failed  Complete  Lost  Unknown
error_repro_group  0       0         2        0       0         0     0

Latest Deployment
ID          = 3bd81a3d
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
error_repro_group  2        2       2        0          2023-01-31T14:34:54Z

Allocations
ID        Node ID   Task Group         Version  Desired  Status   Created  Modified
699907ab  144e3838  error_repro_group  0        run      running  31s ago  16s ago
f122d1ff  144e3838  error_repro_group  0        run      running  31s ago  16s ago

root@nomad-server-1:/home/vagrant# nomad alloc status 699907ab
Error querying allocation: invalid character 'I' in numeric literal
root@nomad-server-1:/home/vagrant# nomad alloc status f122d1ff
Error querying allocation: invalid character 'I' in numeric literal

root@nomad-client-1:/home/vagrant# nomad alloc status 699907ab
Error querying allocation: invalid character 'I' in numeric literal
root@nomad-client-1:/home/vagrant# nomad alloc status f122d1ff
Error querying allocation: invalid character 'I' in numeric literal

root@nomad-client-2:/home/vagrant# nomad alloc status 699907ab
Error querying allocation: invalid character 'I' in numeric literal
root@nomad-client-2:/home/vagrant# nomad alloc status f122d1ff
Error querying allocation: invalid character 'I' in numeric literal

Trace-level logs as well as the journalctl logs have been collected and attached. No errors were seen in the logs when nomad alloc status <ALLOC_ID> was run on the server and workers.

The same jobspec that watsonian initially provided was used for the repro.

ReproLogs.zip

@tgross (Member) commented Jan 31, 2023

Thanks @ron-savoia! I'll take a look

tgross self-assigned this Jan 31, 2023
@tgross (Member) commented Feb 1, 2023

Ok, I was able to reproduce this on v1.4.3. I patched in a spew.Dump at the appropriate place in the code and was able to find that a piece of the scoring metadata has a value of negative infinity. In the API response this looks like the following:

$ curl -s "http://localhost:4646/v1/allocation/de64bc10-f96a-07d1-49a6-b06bc3f63e41?namespace=default" | jq .Metrics.ScoreMetaData
[
  {
    "NodeID": "2fc78255-ecad-f7aa-068b-9c952b741f3d",
    "NormScore": -0.4886056552847515,
    "Scores": {
      "node-reschedule-penalty": 0,
      "node-affinity": 0,
      "binpack": 0.022788689430496945,
      "job-anti-affinity": -1
    }
  },
  {
    "NodeID": "c65289de-32af-bc56-e230-d5056f9a4a4c",
    "NormScore": -1.7976931348623157e+308,
    "Scores": {
      "job-anti-affinity": 0,
      "node-reschedule-penalty": 0,
      "node-affinity": 0,
      "allocation-spread": -1.7976931348623157e+308,
      "binpack": 0.011454102865797388
    }
  }
]

(See the second block under Scores.allocation-spread)

If we look at the Go object before it's serialized, we see -Inf in the following:

   ScoreMetaData: ([]*structs.NodeScoreMeta) (len=2 cap=2) {
    (*structs.NodeScoreMeta)(0xc0002b48c0)(2fc78255-ecad-f7aa-068b-9c952b741f3d -0.488606 map[binpack:0.022788689430496945 job-anti-affinity:-1 node-affinity:0 node-reschedule-penalty:0]),
    (*structs.NodeScoreMeta)(0xc0002b48e0)(c65289de-32af-bc56-e230-d5056f9a4a4c -Inf map[allocation-spread:-Inf binpack:0.011454102865797388 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0])
   },

So what's happening here is that the command line is hitting this -Inf value when it deserializes the JSON response into a native object, and that's what throws the error. That makes this issue a duplicate of #8863 and #11130.
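
For reference, the error string matches what Go's encoding/json produces when it meets a literal -Inf: the standard library can neither marshal an IEEE 754 infinity nor parse one back. A minimal standalone sketch, not Nomad's own code; the field names are copied from the output above purely for illustration:

package main

import (
	"encoding/json"
	"fmt"
	"math"
)

func main() {
	// Encoding side: the standard library refuses to emit -Inf at all.
	_, err := json.Marshal(map[string]float64{"allocation-spread": math.Inf(-1)})
	fmt.Println("marshal:", err) // json: unsupported value: -Inf

	// Decoding side: if a response body carries a bare -Inf token, decoding
	// it fails with exactly the error reported in this issue.
	body := []byte(`{"Scores": {"allocation-spread": -Inf}}`)
	var out map[string]map[string]float64
	err = json.Unmarshal(body, &out)
	fmt.Println("unmarshal:", err) // invalid character 'I' in numeric literal
}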

A known workaround is to not have a spread.target.percent = 0 in your job spec. Particularly in the case of ${node.datacenter}, a zero-percent target doesn't give the scheduler any useful information anyway, since you can simply remove that datacenter from the job's datacenters list.

I'm going to close this issue as a duplicate. The underlying fix isn't currently on the roadmap but if you follow those issues and/or upvote them with a 👍 reaction it helps surface the problem when we're triaging the next batch of work.
