
Error querying node allocations: invalid character 'I' in numeric literal #11100

Closed · watsonian opened this issue Aug 30, 2021 · 11 comments

@watsonian (Contributor)

Nomad version

Nomad v1.1.4+ent (fbad2c4f1c135306ecd6d9cd794508fbaaf0c4de)

Operating system and Environment details

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=21.04
DISTRIB_CODENAME=hirsute
DISTRIB_DESCRIPTION="Ubuntu 21.04"
nomad
  bootstrap = true
  known_regions = 1
  leader = true
  leader_addr = 172.28.128.151:4647
  server = true
raft
  applied_index = 14
  commit_index = 14
  fsm_pending = 0
  last_contact = 0
  last_log_index = 14
  last_log_term = 2
  last_snapshot_index = 0
  last_snapshot_term = 0
  latest_configuration = [{Suffrage:Voter ID:172.28.128.151:4647 Address:172.28.128.151:4647}]
  latest_configuration_index = 0
  num_peers = 0
  protocol_version = 2
  protocol_version_max = 3
  protocol_version_min = 0
  snapshot_version_max = 1
  snapshot_version_min = 0
  state = Leader
  term = 2
runtime
  arch = amd64
  cpu_count = 2
  goroutines = 78
  kernel.name = linux
  max_procs = 2
  version = go1.16.7
serf
  coordinate_resets = 0
  encrypted = false
  event_queue = 0
  event_time = 1
  failed = 0
  health_score = 0
  intent_queue = 0
  left = 0
  member_time = 3
  members = 2
  query_queue = 0
  query_time = 1
vault
  token_expire_time =
  token_ttl = 0s
  tracked_for_revoked = 0

Issue

In a two-datacenter deployment, when a job uses the spread stanza, requesting the status of one of its allocations returns an error:

$ nomad alloc status 59eeb575
Error querying allocation: invalid character 'I' in numeric literal

If you perform the request via curl, it comes back fine:

$ curl -s localhost:4646/v1/allocations?prefix=59eeb575 -v
*   Trying 127.0.0.1:4646...
* Connected to localhost (127.0.0.1) port 4646 (#0)
> GET /v1/allocations?prefix=59eeb575 HTTP/1.1
> Host: localhost:4646
> User-Agent: curl/7.74.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: application/json
< Vary: Accept-Encoding
< X-Nomad-Index: 253
< X-Nomad-Knownleader: true
< X-Nomad-Lastcontact: 0
< Date: Thu, 26 Aug 2021 17:29:30 GMT
< Transfer-Encoding: chunked
<
[{"ID":"59eeb575-0e75-9714-2262-c9c1c83cbfe7","EvalID":"5fa73e1c-8641-06dd-fc86-7a43eb7e87fd","Name":"error_repro.error_repro_group[1]","Namespace":"default","NodeID":"bdf919c8-1e52-a814-8067-2a25b23dc5fe","NodeName":"dc1-nomad-client-1","JobID":"error_repro","JobType":"service","JobVersion":0,"TaskGroup":"error_repro_group","AllocatedResources":null,"DesiredStatus":"run","DesiredDescription":"","ClientStatus":"running","ClientDescription":"Tasks are running","DesiredTransition":{"Migrate":null,"Reschedule":null,"ForceReschedule":null},"TaskStates":{"error_repro_task":{"State":"running","Failed":false,"Restarts":0,"LastRestart":null,"StartedAt":"2021-08-26T17:25:47.092349187Z","FinishedAt":null,"Events":[{"Type":"Received","Time":1629998746553224465,"Message":"","DisplayMessage":"Task received by client","Details":{},"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Task Setup","Time":1629998746561175588,"Message":"Building Task Directory","DisplayMessage":"Building Task Directory","Details":{"message":"Building Task Directory"},"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""},{"Type":"Started","Time":1629998747092173863,"Message":"","DisplayMessage":"Task started by client","Details":{},"FailsTask":false,"RestartReason":"","SetupError":"","DriverError":"","ExitCode":0,"Signal":0,"KillTimeout":0,"KillError":"","KillReason":"","StartDelay":0,"DownloadError":"","ValidationError":"","DiskLimit":0,"FailedSibling":"","VaultError":"","TaskSignalReason":"","TaskSignal":"","DriverMessage":"","GenericSource":""}],"TaskHandle":null}},"DeploymentStatus":{"Healthy":true,"Timestamp":"2021-08-26T17:25:57.094420438Z","Canary":false,"ModifyIndex":251},"FollowupEvalID":"","RescheduleTracker":null,"PreemptedAllocations":null,"PreemptedByAllocation":"","CreateIndex":246,"ModifyIndex":251,"CreateTime":1629998746532089531,"ModifyTime":1629998757212308321}]
* Connection #0 to host localhost left intact

Reproduction steps

  1. Spin up an environment with two Nomad datacenters (dc1, dc2). Each datacenter should have a single Nomad server and a single Nomad client (this isn't necessarily important to the reproduction, but is how I reproduced the issue). Configs supplied below.
  2. Run the example job (supplied below). It should run successfully.
  3. Run nomad alloc status $ALLOC_ID for one of the allocs of the job. The error should appear. Commenting out the spread stanza and running the job again causes nomad alloc status to return the expected result.

Expected Result

The allocation status should be printed.

Actual Result

Error querying allocation: invalid character 'I' in numeric literal

Job file (if appropriate)

job "error_repro" {
  datacenters = ["dc1", "dc2"]

  type = "service"

  spread {
    attribute = "${node.datacenter}"
    weight    = 100

    target "dc1" {
      percent = 100 
    }
    target "dc2" {
      percent = 0
    }
  }

  update {
    max_parallel = 1
    min_healthy_time = "10s"
    healthy_deadline = "3m"
    progress_deadline = "10m"
    auto_revert = false
    canary = 0
  }

  migrate {
    max_parallel = 1
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  group "error_repro_group" {
    count = 2

    network {
      port "db" {
        to = 6379
      }
    }

    restart {
      attempts = 3
      interval = "30m"
      delay = "15s"
      mode = "delay"
    }

    task "error_repro_task" {
      driver = "docker"

      config {
        image = "redis:3.2"

        ports = ["db"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

Nomad Server Config

data_dir = "/tmp/data-nomad"

bind_addr = "0.0.0.0"

log_level = "trace"

advertise {
    http = "{{ GetInterfaceIP \"eth1\" }}"
    rpc  = "{{ GetInterfaceIP \"eth1\" }}"
    serf = "{{ GetInterfaceIP \"eth1\" }}"
}

server {
    enabled = true
    bootstrap_expect = 1

    license_path = "/vagrant/nomad.lic"
}

client {
    enabled = false
}

telemetry {
  publish_allocation_metrics = true
  publish_node_metrics       = true
  prometheus_metrics         = true
}

Nomad Client Config

data_dir = "/tmp/data-nomad"

bind_addr = "0.0.0.0"

log_level = "trace"

advertise {
    http = "{{ GetInterfaceIP \"eth1\" }}"
    rpc  = "{{ GetInterfaceIP \"eth1\" }}"
    serf = "{{ GetInterfaceIP \"eth1\" }}"
}

server {
    enabled = false

    license_path = "/vagrant/nomad.lic"
}

client {
    enabled = true
}

telemetry {
  publish_allocation_metrics = true
  publish_node_metrics       = true
  prometheus_metrics         = true
}
@lgfa29 (Contributor) commented Sep 2, 2021

Thanks for the report @watsonian.

Does the error still happen if the CLI is connected to a server in the datacenter where the allocation is running? For example, if you set the NOMAD_ADDR env var to point to the region that has the allocation:

$ NOMAD_ADDR=http://dc1.my.nomad.cluster.com:4646 nomad alloc status 59eeb575

@watsonian (Contributor, Author)

@lgfa29 In my tests, all DCs were in the same region, and I was running the CLI request against the server the job was allocated on.

@ron-savoia (Contributor)

This behavior is seen in 1.4.3 as well.

Repro Steps

  1. Create two single server clusters (East/West regions).
  2. Create one worker node for each cluster.
  3. Verify that each cluster is stable before moving on.

Setup

Region  Server   Server IP      Worker   Worker IP
West    server1  192.168.1.190  worker1  192.168.1.195
East    server2  192.168.1.191  worker2  192.168.1.196
  1. On server2 run: nomad server join 192.168.1.190:4648
  2. Run error.nomad on server1 (West)
root@server1:/tmp# nomad run error.nomad
==> 2023-01-27T19:57:00Z: Monitoring evaluation "375adef0"
    2023-01-27T19:57:00Z: Evaluation triggered by job "error_repro"
    2023-01-27T19:57:01Z: Evaluation within deployment: "c2e272fd"
    2023-01-27T19:57:01Z: Allocation "e0c03903" created: node "e63a5786", group "error_repro_group"
    2023-01-27T19:57:01Z: Allocation "f2dcf44a" created: node "e63a5786", group "error_repro_group"
    2023-01-27T19:57:01Z: Evaluation status changed: "pending" -> "complete"
==> 2023-01-27T19:57:01Z: Evaluation "375adef0" finished with status "complete"
==> 2023-01-27T19:57:01Z: Monitoring deployment "c2e272fd"
  ✓ Deployment "c2e272fd" successful

    2023-01-27T19:57:17Z
    ID          = c2e272fd
    Job ID      = error_repro
    Job Version = 0
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
    error_repro_group  2        2       2        0          2023-01-27T20:07:15Z
  3. Run nomad status:

server1

root@server1:/tmp# nomad status
ID           Type     Priority  Status   Submit Date
error_repro  service  50        running  2023-01-27T19:57:00Z

server2

root@server2:/etc/nomad.d# nomad status
No running jobs

worker1

root@worker1:/etc/nomad.d# nomad status
ID           Type     Priority  Status   Submit Date
error_repro  service  50        running  2023-01-27T19:57:00Z

worker2

root@worker2:/etc/nomad.d# nomad status
No running jobs
  4. Run nomad job status error_repro:

server1

root@server1:/tmp# nomad job status error_repro
ID            = error_repro
Name          = error_repro
Submit Date   = 2023-01-27T19:57:00Z
Type          = service
Priority      = 50
Datacenters   = east_dc1,west_dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group         Queued  Starting  Running  Failed  Complete  Lost  Unknown
error_repro_group  0       0         2        0       0         0     0

Latest Deployment
ID          = c2e272fd
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
error_repro_group  2        2       2        0          2023-01-27T20:07:15Z

Allocations
ID        Node ID   Task Group         Version  Desired  Status   Created    Modified
e0c03903  e63a5786  error_repro_group  0        run      running  4m14s ago  3m58s ago
f2dcf44a  e63a5786  error_repro_group  0        run      running  4m14s ago  3m58s ago

server2

root@server2:/etc/nomad.d# nomad job status error_repro
No job(s) with prefix or id "error_repro" found

worker1

root@worker1:/etc/nomad.d# nomad job status error_repro
ID            = error_repro
Name          = error_repro
Submit Date   = 2023-01-27T19:57:00Z
Type          = service
Priority      = 50
Datacenters   = east_dc1,west_dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group         Queued  Starting  Running  Failed  Complete  Lost  Unknown
error_repro_group  0       0         2        0       0         0     0

Latest Deployment
ID          = c2e272fd
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
error_repro_group  2        2       2        0          2023-01-27T20:07:15Z

Allocations
ID        Node ID   Task Group         Version  Desired  Status   Created    Modified
e0c03903  e63a5786  error_repro_group  0        run      running  4m14s ago  3m58s ago
f2dcf44a  e63a5786  error_repro_group  0        run      running  4m14s ago  3m58s ago

worker2

root@worker2:/etc/nomad.d# nomad job status error_repro
No job(s) with prefix or id "error_repro" found
  5. Run nomad alloc status <ALLOC_ID>:

server1

root@server1:/tmp# nomad alloc status e0c03903
Error querying allocation: invalid character 'I' in numeric literal

server2

root@server2:/etc/nomad.d# nomad alloc status e0c03903
No allocation(s) with prefix or id "e0c03903" found

worker1

root@worker1:/etc/nomad.d# nomad alloc status e0c03903
Error querying allocation: invalid character 'I' in numeric literal

worker2

root@worker2:/etc/nomad.d# nomad alloc status e0c03903
No allocation(s) with prefix or id "e0c03903" found
  6. Re-run nomad alloc status <ALLOC_ID> with the -region flag:

server1

root@server1:/tmp# nomad alloc status -region=west e0c03903
Error querying allocation: invalid character 'I' in numeric literal
root@server1:/tmp# nomad alloc status -region=east e0c03903
No allocation(s) with prefix or id "e0c03903" found

server2

root@server2:/etc/nomad.d# nomad alloc status -region=west e0c03903
Error querying allocation: invalid character 'I' in numeric literal
root@server2:/etc/nomad.d# nomad alloc status -region=east e0c03903
No allocation(s) with prefix or id "e0c03903" found

worker1

root@worker1:/etc/nomad.d# nomad alloc status -region=west e0c03903
Error querying allocation: invalid character 'I' in numeric literal
root@worker1:/etc/nomad.d# nomad alloc status -region=east e0c03903
No allocation(s) with prefix or id "e0c03903" found

worker2

root@worker2:/etc/nomad.d# nomad alloc status -region=west e0c03903
Error querying allocation: invalid character 'I' in numeric literal
root@worker2:/etc/nomad.d# nomad alloc status -region=east e0c03903
No allocation(s) with prefix or id "e0c03903" found
  7. Run nomad node status -self on the worker nodes:

worker1

root@worker1:/etc/nomad.d# nomad node status -self
Error querying node allocations: invalid character 'I' in numeric literal

worker2

root@worker2:/etc/nomad.d# nomad node status -self
ID              = 88833ba9-59b3-7ad7-d238-545e14d622e3
Name            = worker2
Class           = <none>
DC              = east_dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 2h22m12s
Host Volumes    = <none>
Host Networks   = <none>
CSI Volumes     = <none>
Driver Status   = docker,exec

Node Events
Time                  Subsystem  Message
2023-01-27T19:18:56Z  Cluster    Node registered

Allocated Resources
CPU         Memory       Disk
0/8000 MHz  0 B/3.8 GiB  0 B/21 GiB

Allocation Resource Utilization
CPU         Memory
0/8000 MHz  0 B/3.8 GiB

Host Resource Utilization
CPU         Memory           Disk
0/8000 MHz  298 MiB/3.8 GiB  (/dev/mapper/ubuntu--vg-ubuntu--lv)

Allocations
No allocations placed
  8. Purge the job and run gc

Re-run of the same steps, with the spread stanza removed (error_no_spread.nomad)

  1. Run error_no_spread.nomad on server1 (West):
root@server1:/tmp# nomad run error_no_spread.nomad
==> 2023-01-27T20:16:08Z: Monitoring evaluation "7d303505"
    2023-01-27T20:16:08Z: Evaluation triggered by job "error_repro"
    2023-01-27T20:16:09Z: Evaluation within deployment: "98639e8e"
    2023-01-27T20:16:09Z: Allocation "d8e322a0" created: node "e63a5786", group "error_repro_group"
    2023-01-27T20:16:09Z: Allocation "f5e66320" created: node "e63a5786", group "error_repro_group"
    2023-01-27T20:16:09Z: Evaluation status changed: "pending" -> "complete"
==> 2023-01-27T20:16:09Z: Evaluation "7d303505" finished with status "complete"
==> 2023-01-27T20:16:09Z: Monitoring deployment "98639e8e"
  ✓ Deployment "98639e8e" successful

    2023-01-27T20:16:20Z
    ID          = 98639e8e
    Job ID      = error_repro
    Job Version = 0
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
    error_repro_group  2        2       2        0          2023-01-27T20:26:19Z
  2. Run nomad status:

server1

root@server1:/tmp# nomad status
ID           Type     Priority  Status   Submit Date
error_repro  service  50        running  2023-01-27T20:16:08Z

server2

root@server2:/etc/nomad.d# nomad status
No running jobs

worker1

root@worker1:/etc/nomad.d# nomad status
ID           Type     Priority  Status   Submit Date
error_repro  service  50        running  2023-01-27T20:16:08Z

worker2

root@worker2:/etc/nomad.d# nomad status
No running jobs
  3. Run nomad job status error_repro:

server1

root@server1:/tmp# nomad job status error_repro
ID            = error_repro
Name          = error_repro
Submit Date   = 2023-01-27T20:16:08Z
Type          = service
Priority      = 50
Datacenters   = east_dc1,west_dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group         Queued  Starting  Running  Failed  Complete  Lost  Unknown
error_repro_group  0       0         2        0       0         0     0

Latest Deployment
ID          = 98639e8e
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
error_repro_group  2        2       2        0          2023-01-27T20:26:19Z

Allocations
ID        Node ID   Task Group         Version  Desired  Status   Created    Modified
d8e322a0  e63a5786  error_repro_group  0        run      running  3m30s ago  3m20s ago
f5e66320  e63a5786  error_repro_group  0        run      running  3m30s ago  3m20s ago

server2

root@server2:/etc/nomad.d# nomad job status error_repro
No job(s) with prefix or id "error_repro" found

worker1

root@worker1:/etc/nomad.d# nomad job status error_repro
ID            = error_repro
Name          = error_repro
Submit Date   = 2023-01-27T20:16:08Z
Type          = service
Priority      = 50
Datacenters   = east_dc1,west_dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group         Queued  Starting  Running  Failed  Complete  Lost  Unknown
error_repro_group  0       0         2        0       0         0     0

Latest Deployment
ID          = 98639e8e
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
error_repro_group  2        2       2        0          2023-01-27T20:26:19Z

Allocations
ID        Node ID   Task Group         Version  Desired  Status   Created    Modified
d8e322a0  e63a5786  error_repro_group  0        run      running  3m30s ago  3m20s ago
f5e66320  e63a5786  error_repro_group  0        run      running  3m30s ago  3m20s ago

worker2

root@worker2:/etc/nomad.d# nomad job status error_repro
No job(s) with prefix or id "error_repro" found
  4. Run nomad alloc status <ALLOC_ID>:

server1

root@server1:/tmp# nomad alloc status d8e322a0
ID                  = d8e322a0-c622-a2a4-eb96-acd0fc1196dc
Eval ID             = 7d303505
Name                = error_repro.error_repro_group[1]
Node ID             = e63a5786
Node Name           = worker1
Job ID              = error_repro
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 4m56s ago
Modified            = 4m46s ago
Deployment ID       = 98639e8e
Deployment Health   = healthy

Allocation Addresses:
Label  Dynamic  Address
*db    yes      192.168.1.195:20348 -> 6379

Task "error_repro_task" is "running"
Task Resources:
CPU        Memory           Disk     Addresses
0/500 MHz  2.5 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2023-01-27T20:16:09Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2023-01-27T20:16:09Z  Started     Task started by client
2023-01-27T20:16:08Z  Task Setup  Building Task Directory
2023-01-27T20:16:08Z  Received    Task received by client

server2

root@server2:/etc/nomad.d# nomad alloc status d8e322a0
No allocation(s) with prefix or id "d8e322a0" found

worker1

root@worker1:/etc/nomad.d# nomad alloc status d8e322a0
ID                  = d8e322a0-c622-a2a4-eb96-acd0fc1196dc
Eval ID             = 7d303505
Name                = error_repro.error_repro_group[1]
Node ID             = e63a5786
Node Name           = worker1
Job ID              = error_repro
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 4m56s ago
Modified            = 4m46s ago
Deployment ID       = 98639e8e
Deployment Health   = healthy

Allocation Addresses:
Label  Dynamic  Address
*db    yes      192.168.1.195:20348 -> 6379

Task "error_repro_task" is "running"
Task Resources:
CPU        Memory           Disk     Addresses
0/500 MHz  2.5 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2023-01-27T20:16:09Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2023-01-27T20:16:09Z  Started     Task started by client
2023-01-27T20:16:08Z  Task Setup  Building Task Directory
2023-01-27T20:16:08Z  Received    Task received by client

worker2

root@worker2:/etc/nomad.d# nomad alloc status d8e322a0
No allocation(s) with prefix or id "d8e322a0" found
  5. Re-run nomad alloc status <ALLOC_ID> with the -region flag:

server1

root@server1:/tmp# nomad alloc status -region=west d8e322a0
ID                  = d8e322a0-c622-a2a4-eb96-acd0fc1196dc
Eval ID             = 7d303505
Name                = error_repro.error_repro_group[1]
Node ID             = e63a5786
Node Name           = worker1
Job ID              = error_repro
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 6m55s ago
Modified            = 6m45s ago
Deployment ID       = 98639e8e
Deployment Health   = healthy

Allocation Addresses:
Label  Dynamic  Address
*db    yes      192.168.1.195:20348 -> 6379

Task "error_repro_task" is "running"
Task Resources:
CPU        Memory           Disk     Addresses
0/500 MHz  2.5 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2023-01-27T20:16:09Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2023-01-27T20:16:09Z  Started     Task started by client
2023-01-27T20:16:08Z  Task Setup  Building Task Directory
2023-01-27T20:16:08Z  Received    Task received by client
root@server1:/tmp# nomad alloc status -region=east d8e322a0
No allocation(s) with prefix or id "d8e322a0" found

server2

root@server2:/etc/nomad.d# nomad alloc status -region=west d8e322a0
ID                  = d8e322a0-c622-a2a4-eb96-acd0fc1196dc
Eval ID             = 7d303505
Name                = error_repro.error_repro_group[1]
Node ID             = e63a5786
Node Name           = worker1
Job ID              = error_repro
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 6m55s ago
Modified            = 6m45s ago
Deployment ID       = 98639e8e
Deployment Health   = healthy

Allocation Addresses:
Label  Dynamic  Address
*db    yes      192.168.1.195:20348 -> 6379

Task "error_repro_task" is "running"
Task Resources:
CPU        Memory           Disk     Addresses
0/500 MHz  2.5 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2023-01-27T20:16:09Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2023-01-27T20:16:09Z  Started     Task started by client
2023-01-27T20:16:08Z  Task Setup  Building Task Directory
2023-01-27T20:16:08Z  Received    Task received by client
root@server2:/etc/nomad.d# nomad alloc status -region=east d8e322a0
No allocation(s) with prefix or id "d8e322a0" found

worker1

root@worker1:/etc/nomad.d# nomad alloc status -region=west d8e322a0
ID                  = d8e322a0-c622-a2a4-eb96-acd0fc1196dc
Eval ID             = 7d303505
Name                = error_repro.error_repro_group[1]
Node ID             = e63a5786
Node Name           = worker1
Job ID              = error_repro
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 6m55s ago
Modified            = 6m45s ago
Deployment ID       = 98639e8e
Deployment Health   = healthy

Allocation Addresses:
Label  Dynamic  Address
*db    yes      192.168.1.195:20348 -> 6379

Task "error_repro_task" is "running"
Task Resources:
CPU        Memory           Disk     Addresses
0/500 MHz  2.5 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2023-01-27T20:16:09Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2023-01-27T20:16:09Z  Started     Task started by client
2023-01-27T20:16:08Z  Task Setup  Building Task Directory
2023-01-27T20:16:08Z  Received    Task received by client
root@worker1:/etc/nomad.d# nomad alloc status -region=east d8e322a0
No allocation(s) with prefix or id "d8e322a0" found

worker2

root@worker2:/etc/nomad.d# nomad alloc status -region=west d8e322a0
ID                  = d8e322a0-c622-a2a4-eb96-acd0fc1196dc
Eval ID             = 7d303505
Name                = error_repro.error_repro_group[1]
Node ID             = e63a5786
Node Name           = worker1
Job ID              = error_repro
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 6m55s ago
Modified            = 6m45s ago
Deployment ID       = 98639e8e
Deployment Health   = healthy

Allocation Addresses:
Label  Dynamic  Address
*db    yes      192.168.1.195:20348 -> 6379

Task "error_repro_task" is "running"
Task Resources:
CPU        Memory           Disk     Addresses
0/500 MHz  2.5 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2023-01-27T20:16:09Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2023-01-27T20:16:09Z  Started     Task started by client
2023-01-27T20:16:08Z  Task Setup  Building Task Directory
2023-01-27T20:16:08Z  Received    Task received by client
root@worker2:/etc/nomad.d# nomad alloc status -region=east d8e322a0
No allocation(s) with prefix or id "d8e322a0" found
  6. Run nomad node status -self on the worker nodes:

worker1

root@worker1:/etc/nomad.d# nomad node status -self
ID              = e63a5786-9a76-9831-1ef9-ab5001ff08c3
Name            = worker1
Class           = <none>
DC              = west_dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 2h46m40s
Host Volumes    = <none>
Host Networks   = <none>
CSI Volumes     = <none>
Driver Status   = docker,exec,java,raw_exec

Node Events
Time                  Subsystem  Message
2023-01-27T19:18:54Z  Cluster    Node registered

Allocated Resources
CPU            Memory           Disk
1000/8000 MHz  512 MiB/3.8 GiB  600 MiB/19 GiB

Allocation Resource Utilization
CPU         Memory
0/8000 MHz  5.0 MiB/3.8 GiB

Host Resource Utilization
CPU         Memory           Disk
0/8000 MHz  354 MiB/3.8 GiB  (/dev/mapper/ubuntu--vg-ubuntu--lv)

Allocations
ID        Node ID   Task Group         Version  Desired  Status   Created    Modified
d8e322a0  e63a5786  error_repro_group  0        run      running  8m46s ago  8m36s ago
f5e66320  e63a5786  error_repro_group  0        run      running  8m46s ago  8m36s ago

worker2

root@worker2:/etc/nomad.d# nomad node status -self
ID              = 88833ba9-59b3-7ad7-d238-545e14d622e3
Name            = worker2
Class           = <none>
DC              = east_dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 2h38m52s
Host Volumes    = <none>
Host Networks   = <none>
CSI Volumes     = <none>
Driver Status   = docker,exec

Node Events
Time                  Subsystem  Message
2023-01-27T19:18:56Z  Cluster    Node registered

Allocated Resources
CPU         Memory       Disk
0/8000 MHz  0 B/3.8 GiB  0 B/21 GiB

Allocation Resource Utilization
CPU         Memory
0/8000 MHz  0 B/3.8 GiB

Host Resource Utilization
CPU         Memory           Disk
0/8000 MHz  306 MiB/3.8 GiB  (/dev/mapper/ubuntu--vg-ubuntu--lv)

Allocations
No allocations placed

Jobspecs

error.nomad

job "error_repro" {
  datacenters = ["east_dc1", "west_dc1"]

  type = "service"

  spread {
    attribute = "${node.datacenter}"
    weight    = 100

    target "east_dc1" {
      percent = 100
    }
    target "west_dc1" {
      percent = 0
    }
  }

  update {
    max_parallel = 1
    min_healthy_time = "10s"
    healthy_deadline = "3m"
    progress_deadline = "10m"
    auto_revert = false
    canary = 0
  }

  migrate {
    max_parallel = 1
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  group "error_repro_group" {
    count = 2

    network {
      port "db" {
        to = 6379
      }
    }

    restart {
      attempts = 3
      interval = "30m"
      delay = "15s"
      mode = "delay"
    }

    task "error_repro_task" {
      driver = "docker"

      config {
        image = "redis:7"

        ports = ["db"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

error_no_spread.nomad

job "error_repro" {
  datacenters = ["east_dc1", "west_dc1"]

  type = "service"

  update {
    max_parallel = 1
    min_healthy_time = "10s"
    healthy_deadline = "3m"
    progress_deadline = "10m"
    auto_revert = false
    canary = 0
  }

  migrate {
    max_parallel = 1
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  group "error_repro_group" {
    count = 2

    network {
      port "db" {
        to = 6379
      }
    }

    restart {
      attempts = 3
      interval = "30m"
      delay = "15s"
      mode = "delay"
    }

    task "error_repro_task" {
      driver = "docker"

      config {
        image = "redis:7"

        ports = ["db"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

@tgross (Member) commented Jan 27, 2023

@ron-savoia do you have the specific API request from the server debug logs that's returning the error? That'd help narrow things down to something more easily reproducible. Your example adds a bunch of new quirks to the problem that @watsonian specifically said weren't in play, like multi-region federation.

@ron-savoia (Contributor)

@tgross the odd thing is that I am not seeing an error for a particular API call in the logs. I re-ran the job and the repro steps while capturing a debug bundle and have attached it to this issue. Hopefully you will see something I am not.

@ron-savoia (Contributor)

I forgot to add the last re-run info...

root@server1:/tmp# nomad run error.nomad
==> 2023-01-27T21:22:55Z: Monitoring evaluation "1760d36c"
    2023-01-27T21:22:55Z: Evaluation triggered by job "error_repro"
    2023-01-27T21:22:56Z: Evaluation within deployment: "464e057d"
    2023-01-27T21:22:56Z: Allocation "0e78b461" created: node "e63a5786", group "error_repro_group"
    2023-01-27T21:22:56Z: Allocation "8da880a0" created: node "e63a5786", group "error_repro_group"
    2023-01-27T21:22:56Z: Evaluation status changed: "pending" -> "complete"
==> 2023-01-27T21:22:56Z: Evaluation "1760d36c" finished with status "complete"
==> 2023-01-27T21:22:56Z: Monitoring deployment "464e057d"
  ✓ Deployment "464e057d" successful

    2023-01-27T21:23:08Z
    ID          = 464e057d
    Job ID      = error_repro
    Job Version = 0
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
    error_repro_group  2        2       2        0          2023-01-27T21:33:06Z
    
root@server1:/tmp# nomad job status error_repro
ID            = error_repro
Name          = error_repro
Submit Date   = 2023-01-27T21:22:55Z
Type          = service
Priority      = 50
Datacenters   = east_dc1,west_dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group         Queued  Starting  Running  Failed  Complete  Lost  Unknown
error_repro_group  0       0         2        0       0         0     0

Latest Deployment
ID          = 464e057d
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
error_repro_group  2        2       2        0          2023-01-27T21:33:06Z

Allocations
ID        Node ID   Task Group         Version  Desired  Status   Created  Modified
0e78b461  e63a5786  error_repro_group  0        run      running  19s ago  9s ago
8da880a0  e63a5786  error_repro_group  0        run      running  19s ago  9s ago

root@server1:/tmp# nomad alloc status 0e78b461
Error querying allocation: invalid character 'I' in numeric literal

root@server1:/tmp# nomad alloc status 8da880a0
Error querying allocation: invalid character 'I' in numeric literal

root@worker1:/etc/nomad.d# nomad node status -self
Error querying node allocations: invalid character 'I' in numeric literal

@tgross (Member) commented Jan 27, 2023

> @tgross the odd thing is that I am not seeing an error for a particular API call in the logs. I re-ran the job and the repro steps while capturing a debug bundle and have attached it to this issue. Hopefully you will see something I am not.

You need to get the logs from the agent your Nomad CLI is addressing. Just look at the journal for that host; don't bother with all the mess of the debug bundle -- that's all noise here.

@ron-savoia (Contributor) commented Jan 31, 2023

This has been reproduced in a single-region, two-worker-node cluster (v1.4.3):
worker 1 - dc1
worker 2 - dc2

Error querying allocation: invalid character 'I' in numeric literal was seen when querying the alloc on the server and both worker nodes.

root@nomad-server-1:/home/vagrant# nomad run error.nomad
==> 2023-01-31T14:24:39Z: Monitoring evaluation "6e22c4f6"
    2023-01-31T14:24:39Z: Evaluation triggered by job "error_repro"
    2023-01-31T14:24:40Z: Evaluation within deployment: "3bd81a3d"
    2023-01-31T14:24:40Z: Allocation "699907ab" created: node "144e3838", group "error_repro_group"
    2023-01-31T14:24:40Z: Allocation "f122d1ff" created: node "144e3838", group "error_repro_group"
    2023-01-31T14:24:40Z: Evaluation status changed: "pending" -> "complete"
==> 2023-01-31T14:24:40Z: Evaluation "6e22c4f6" finished with status "complete"
==> 2023-01-31T14:24:40Z: Monitoring deployment "3bd81a3d"
  ✓ Deployment "3bd81a3d" successful

    2023-01-31T14:24:55Z
    ID          = 3bd81a3d
    Job ID      = error_repro
    Job Version = 0
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
    error_repro_group  2        2       2        0          2023-01-31T14:34:54Z

root@nomad-server-1:/home/vagrant# nomad job status error_repro
ID            = error_repro
Name          = error_repro
Submit Date   = 2023-01-31T14:24:39Z
Type          = service
Priority      = 50
Datacenters   = dc1,dc2
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group         Queued  Starting  Running  Failed  Complete  Lost  Unknown
error_repro_group  0       0         2        0       0         0     0

Latest Deployment
ID          = 3bd81a3d
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group         Desired  Placed  Healthy  Unhealthy  Progress Deadline
error_repro_group  2        2       2        0          2023-01-31T14:34:54Z

Allocations
ID        Node ID   Task Group         Version  Desired  Status   Created  Modified
699907ab  144e3838  error_repro_group  0        run      running  31s ago  16s ago
f122d1ff  144e3838  error_repro_group  0        run      running  31s ago  16s ago

root@nomad-server-1:/home/vagrant# nomad alloc status 699907ab
Error querying allocation: invalid character 'I' in numeric literal
root@nomad-server-1:/home/vagrant# nomad alloc status f122d1ff
Error querying allocation: invalid character 'I' in numeric literal

root@nomad-client-1:/home/vagrant# nomad alloc status 699907ab
Error querying allocation: invalid character 'I' in numeric literal
root@nomad-client-1:/home/vagrant# nomad alloc status f122d1ff
Error querying allocation: invalid character 'I' in numeric literal

root@nomad-client-2:/home/vagrant# nomad alloc status 699907ab
Error querying allocation: invalid character 'I' in numeric literal
root@nomad-client-2:/home/vagrant# nomad alloc status f122d1ff
Error querying allocation: invalid character 'I' in numeric literal

Trace-level logs as well as the journalctl logs have been collected and attached. No errors were seen in the logs when nomad alloc status <ALLOC_ID> was run on the server and workers.

The same jobspec that watsonian initially provided was used for the repro.

ReproLogs.zip

@tgross (Member) commented Jan 31, 2023

Thanks @ron-savoia! I'll take a look

tgross self-assigned this Jan 31, 2023
@tgross (Member) commented Feb 1, 2023

Ok, I was able to reproduce this on v1.4.3. I patched in a spew.Dump at the appropriate place in the code and was able to find that a piece of the scoring metadata has a value of negative infinity. In the API response this looks like the following:

$ curl -s "http://localhost:4646/v1/allocation/de64bc10-f96a-07d1-49a6-b06bc3f63e41?namespace=default" | jq .Metrics.ScoreMetaData
[
  {
    "NodeID": "2fc78255-ecad-f7aa-068b-9c952b741f3d",
    "NormScore": -0.4886056552847515,
    "Scores": {
      "node-reschedule-penalty": 0,
      "node-affinity": 0,
      "binpack": 0.022788689430496945,
      "job-anti-affinity": -1
    }
  },
  {
    "NodeID": "c65289de-32af-bc56-e230-d5056f9a4a4c",
    "NormScore": -1.7976931348623157e+308,
    "Scores": {
      "job-anti-affinity": 0,
      "node-reschedule-penalty": 0,
      "node-affinity": 0,
      "allocation-spread": -1.7976931348623157e+308,
      "binpack": 0.011454102865797388
    }
  }
]

(See the second block under Scores.allocation-spread)

If we look at the Go object before it's serialized, we see -Inf in the following:

   ScoreMetaData: ([]*structs.NodeScoreMeta) (len=2 cap=2) {
    (*structs.NodeScoreMeta)(0xc0002b48c0)(2fc78255-ecad-f7aa-068b-9c952b741f3d -0.488606 map[binpack:0.022788689430496945 job-anti-affinity:-1 node-affinity:0 node-reschedule-penalty:0]),
    (*structs.NodeScoreMeta)(0xc0002b48e0)(c65289de-32af-bc56-e230-d5056f9a4a4c -Inf map[allocation-spread:-Inf binpack:0.011454102865797388 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0])
   },

So what's happening here is that the command line is hitting this -Inf value when it deserializes the JSON response into a native object, and that's what throws the error. That makes this issue a duplicate of #8863 and #11130.
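
For reference, the error string matches what Go's encoding/json produces when it meets a literal -Inf: the standard library can neither marshal an IEEE 754 infinity nor parse one back. A minimal standalone sketch, not Nomad's own code; the field names are copied from the output above purely for illustration:

package main

import (
	"encoding/json"
	"fmt"
	"math"
)

func main() {
	// Encoding side: the standard library refuses to emit -Inf at all.
	_, err := json.Marshal(map[string]float64{"allocation-spread": math.Inf(-1)})
	fmt.Println("marshal:", err) // json: unsupported value: -Inf

	// Decoding side: if a response body carries a bare -Inf token, decoding
	// it fails with exactly the error reported in this issue.
	body := []byte(`{"Scores": {"allocation-spread": -Inf}}`)
	var out map[string]map[string]float64
	err = json.Unmarshal(body, &out)
	fmt.Println("unmarshal:", err) // invalid character 'I' in numeric literal
}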

A known workaround is to not have a spread.target.percent = 0 in your job spec. Particularly in the case of ${node.datacenter}, a zero-percent target doesn't give the scheduler any useful information anyway, since you can simply remove that datacenter from the job's datacenters list.

I'm going to close this issue as a duplicate. The underlying fix isn't currently on the roadmap but if you follow those issues and/or upvote them with a 👍 reaction it helps surface the problem when we're triaging the next batch of work.
