
CSI volume per_alloc availability zone placement #11778

Closed
ygersie opened this issue Jan 5, 2022 · 18 comments · Fixed by #12129 or #13274

Labels: hcc/cst (Admin - internal), stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/storage, type/bug

ygersie (Contributor) commented Jan 5, 2022

Nomad version

v1.2.3

Issue

According to the recommendation in #10793 (comment) and in the docs, the proper way to schedule volumes that must be mounted in the same AZ is to run a plugin controller per AZ. When using the per_alloc volume option, this doesn't work as expected. I assume this is because an alloc id isn't known at scheduling time, so Nomad may try to assign an alloc to a node that can't satisfy the volume requirement, which conflicts with the purpose of the per_alloc directive. It also makes me wonder whether the alloc index is even supposed to be a runtime variable, since as far as I can tell it is determined at scheduling time.
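
For example, with source = "myvolume", per_alloc = true, and count = 2, allocation index 0 claims the volume ID myvolume[0] and index 1 claims myvolume[1], matching the volume IDs shown below.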

With the following volumes and job:

$ nomad volume status myvolume[0]
ID                   = myvolume[0]
Name                 = myvolume
External ID          = vol-xxxxxxxxxxxxxx
Plugin ID            = aws-ebs-us-west-2a
Provider             = ebs.csi.aws.com
Version              = v1.4.0
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 1
Nodes Expected       = 1
Access Mode          = <none>
Attachment Mode      = <none>
Mount Options        = <none>
Namespace            = default

Allocations
No allocations placed

$ nomad volume status myvolume[1]
ID                   = myvolume[1]
Name                 = myvolume
External ID          = vol-xxxxxxxxxxxxxx
Plugin ID            = aws-ebs-us-west-2b
Provider             = ebs.csi.aws.com
Version              = v1.4.0
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 1
Nodes Expected       = 1
Access Mode          = <none>
Attachment Mode      = <none>
Mount Options        = <none>
Namespace            = default

Allocations
No allocations placed

job "example" {
  type = "service"

  region      = "us-west-2"
  datacenters = ["us-west-2a", "us-west-2b", "us-west-2c"]

  group "example" {
    count = 2

    volume "ebs" {
      type      = "csi"
      source    = "myvolume"
      read_only = false
      per_alloc = true

      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type = "ext4"
      }
    }

    task "example" {
      driver = "docker"
      config {
        image = "alpine"
        args  = ["tail", "-f", "/dev/null"]
      }

      volume_mount {
        volume      = "ebs"
        destination = "/data"
        read_only   = false
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}

The result is that placement intermittently fails:

$ nomad plan example.hcl
+ Job: "example"
+ Task Group: "example" (2 create)
  + Task: "example" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "example" (failed to place 1 allocation):
    * Class "default": 1 nodes excluded by filter
    * Constraint "CSI plugin aws-ebs-us-west-2b is missing from client <clientNodeID>": 1 nodes excluded by filter

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 example.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
tgross (Member) commented Jan 5, 2022

Hi @ygersie! Yeah, unfortunately you'll probably want to namespace the volume names by AZ as well if you're using the per_alloc field. Note this should all be fixed more nicely once we've implemented topology (coming soon!)

ygersie (Contributor, Author) commented Jan 5, 2022

Hi @tgross, thanks for the quick response! Forgive my ignorance, but could you elaborate a bit on what you mean by namespacing the volumes per AZ?

tgross (Member) commented Jan 5, 2022

The jobspec you have crosses AZs, so the allocations are being placed based on binpacking across those AZs. So if you use per_alloc = true then the volume names created during scheduling will point to volumes that might not exist in those AZs, as you've seen.

So as a very hacky workaround, you can name the volumes myvolume-2a[0], myvolume-2a[1], etc. and then target jobs to each AZ independently. I recognize this isn't ideal, which is why we want topology so that we don't have this requirement anymore.

ygersie (Contributor, Author) commented Jan 6, 2022

@tgross thanks, yeah it's not pretty. In case someone runs into the same issue I came up with the following for now:

variable "datacenters" {
  type    = list(string)
  default = ["us-west-2a", "us-west-2b", "us-west-2c"]
}

variable "instance_count" {
  type    = number
  default = 3
}

locals {
  nr_of_dcs   = length(var.datacenters)
  alloc_to_az = { for i, dc in sort(var.datacenters) : i => dc }
}

job "example" {
  type = "service"

  region      = "us-west-2"
  datacenters = var.datacenters

  dynamic "group" {
    for_each = range(var.instance_count)
    iterator = alloc
    labels   = ["example-${alloc.key}"]

    content {
      volume "ebs" {
        type = "csi"
        # creates volume source ids like:
        # us-west-2a-myvolume[0]
        # us-west-2b-myvolume[1]
        source    = "${local.alloc_to_az[alloc.key % local.nr_of_dcs]}-myvolume[${alloc.key}]"
        read_only = false

        attachment_mode = "file-system"
        access_mode     = "single-node-writer"

        mount_options {
          fs_type = "ext4"
        }
      }

      task "example" {
        driver = "docker"
        config {
          image = "alpine"
          args  = ["tail", "-f", "/dev/null"]
        }

        volume_mount {
          volume      = "ebs"
          destination = "/data"
          read_only   = false
        }

        resources {
          cpu    = 100
          memory = 64
        }
      }
    }
  }
}

At least this makes it easier to schedule a different number of instances in a single job.
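To spell out the mapping with the default variables above: alloc.key values 0, 1, and 2 map (via alloc.key % nr_of_dcs over the sorted datacenters) to us-west-2a, us-west-2b, and us-west-2c, producing the volume sources us-west-2a-myvolume[0], us-west-2b-myvolume[1], and us-west-2c-myvolume[2].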

@tgross tgross added the stage/accepted label Feb 3, 2022
tgross (Member) commented Feb 24, 2022

Just a heads up that I'm actively working on #7669 which will resolve this issue.

@tgross tgross added this to the 1.3.0 milestone Feb 24, 2022
@tgross tgross self-assigned this Feb 24, 2022
tgross (Member) commented Mar 1, 2022

Resolved by #12129, which will ship in Nomad 1.3.0

ygersie (Contributor, Author) commented Mar 1, 2022

Awesome stuff @tgross !

ygersie (Contributor, Author) commented May 19, 2022

Hey @tgross, I just got time to test out the per_alloc feature with Nomad 1.3.0, but I'm still running into this issue. My test case:

Volume specs:

id        = "ygersie[0]"
name      = "ygersie[0]"
namespace = "mynamespace"

type        = "csi"
plugin_id   = "aws-ebs-us-west-2a"
external_id = "vol-asdfasdfasdfasdfasdf"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

mount_options {
  fs_type = "ext4"
}

topology_request {
  required {
    topology {
      segments {
        "topology.ebs.csi.aws.com/zone" = "us-west-2a"
      }
    }
  }
}

and the second:

id        = "ygersie[1]"
name      = "ygersie[1]"
namespace = "mynamespace"

type        = "csi"
plugin_id   = "aws-ebs-us-west-2b"
external_id = "vol-asdfasdfasdfasdfasdf"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

mount_options {
  fs_type = "ext4"
}

topology_request {
  required {
    topology {
      segments {
        "topology.ebs.csi.aws.com/zone" = "us-west-2b"
      }
    }
  }
}

and the job:

job "ygersie" {
  region      = "us-west-2"
  datacenters = ["us-west-2a", "us-west-2b", "us-west-2c", "us-west-2d"]
  namespace   = "mynamespace"

  group "example" {
    count = 2

    volume "ebs" {
      type      = "csi"
      source    = "ygersie"
      read_only = false
      per_alloc = true

      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type = "ext4"
      }
    }

    task "example" {
      driver = "docker"
      config {
        image = "alpine"
        args  = ["tail", "-f", "/dev/null"]
      }

      volume_mount {
        volume      = "ebs"
        destination = "/data"
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}

The plan is sometimes completely fine and sometimes still shows:

$ nomad job plan example-job-volume.hcl
+ Job: "ygersie"
+ Task Group: "example" (2 create)
  + Task: "example" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "example" (failed to place 1 allocation):
    * Class "default": 1 nodes excluded by filter
    * Constraint "CSI plugin aws-ebs-us-west-2b is missing from client 17224e99-393e-16d2-aa0a-329ff47ca63b": 1 nodes excluded by filter

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 example-job-volume.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

If I do a job run it'll fail initially, and on the second attempt it seems to schedule correctly. However, I wonder if it is just luck that the job schedules on the second attempt due to the low number of (test) nodes I have, or if there's a difference between the plan and the evaluation generated on job creation. As long as Nomad guarantees it will be placed once the job is created, I can get away with marking this as a "cosmetic" bug.

I'm still using the plugin_id with the AZ in there as this is how we currently have our plugins deployed. I wouldn't expect this to be related to the issue I'm seeing here.

tgross (Member) commented May 19, 2022

Hi @ygersie! That message bubbles up from the per-node feasibility check (ref feasible.go#L305-L309), which looks like this:

plugin, ok := n.CSINodePlugins[vol.PluginID]
if !ok {
	return false, fmt.Sprintf(FilterConstraintCSIPluginTemplate, vol.PluginID, n.ID)
}

The reason you're seeing different behaviors on different runs is most likely because the scheduler shuffles the list of nodes. For each allocation the list is feasibility checked (filtered) and scored until we either have at least 1 (and a maximum of 2) viable node scores or we run out of nodes.

So suppose we have (nodeA, nodeB, nodeC, nodeD) one for each DC. Our example[0] has a volume[0] in zone A, and example[1] has a volume[1] in zone B. One possible iteration through the scheduler might look like this:

  • example[0]:
    • nodeD is feasible? no
    • nodeB is feasible? no
    • nodeA is feasible? yes, score it.
    • nodeC is feasible? no
    • no more nodes
    • take node with max score (out of first 2 scored, but we scored only 1)
    • planned for nodeA
  • example[1]:
    • nodeD is feasible? no
    • nodeB is feasible? yes, score it.
    • nodeA is feasible? no
    • nodeC is feasible? no
    • no more nodes
    • take node with max score (out of first 2 scored, but we scored only 1)
    • planned for nodeB
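
To make that walkthrough concrete, below is a minimal, self-contained Go sketch of the shuffle/filter/score loop described above. This is not Nomad's scheduler code: node names and plugin IDs are just the ones used in this thread, the scoring limit is passed in as 2, and "scoring" is reduced to taking the first feasible node.

// Sketch of the per-allocation node selection loop described in this thread.
package main

import (
	"fmt"
	"math/rand"
)

type node struct {
	id      string
	zone    string
	plugins map[string]bool // CSI node plugins fingerprinted on this client
}

// feasible mirrors the CSINodePlugins lookup quoted earlier: the node must run
// the plugin the requested volume is registered against.
func feasible(n node, pluginID string) bool {
	return n.plugins[pluginID]
}

// pickNode walks a pre-shuffled candidate list, skips infeasible nodes, and
// returns a node after scoring at most `limit` feasible candidates. The real
// scoring (binpacking etc.) is elided; we just keep the first feasible node.
func pickNode(candidates []node, pluginID string, limit int) (node, bool) {
	var best node
	found := false
	scored := 0
	for _, n := range candidates {
		if !feasible(n, pluginID) {
			continue // filtered: "CSI plugin <id> is missing from client <node>"
		}
		if !found {
			best, found = n, true
		}
		scored++
		if scored >= limit {
			break
		}
	}
	return best, found
}

func main() {
	nodes := []node{
		{"nodeA", "us-west-2a", map[string]bool{"aws-ebs-us-west-2a": true}},
		{"nodeB", "us-west-2b", map[string]bool{"aws-ebs-us-west-2b": true}},
		{"nodeC", "us-west-2c", map[string]bool{"aws-ebs-us-west-2c": true}},
		{"nodeD", "us-west-2d", map[string]bool{"aws-ebs-us-west-2d": true}},
	}
	// The scheduler shuffles the node list, which is why repeated plans can
	// behave differently.
	rand.Shuffle(len(nodes), func(i, j int) { nodes[i], nodes[j] = nodes[j], nodes[i] })

	// example[0] needs the zone-A plugin's volume, example[1] the zone-B one.
	for i, pluginID := range []string{"aws-ebs-us-west-2a", "aws-ebs-us-west-2b"} {
		if n, ok := pickNode(nodes, pluginID, 2); ok {
			fmt.Printf("example[%d] planned for %s (%s)\n", i, n.id, n.zone)
		} else {
			fmt.Printf("example[%d] failed to place\n", i)
		}
	}
}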

The failure you're getting suggests that one of the zones doesn't have a plugin (much less a healthy one). So when you say:

I'm still using the plugin_id with the AZ in there as this is how we currently have our plugins deployed. I wouldn't expect this to be related to the issue I'm seeing here.

I think this may unexpectedly still be a factor. If it were topology-related, I'd expect to see "did not meet topology requirement" (ref feasible.go#L317-L323).

I'll re-open this issue while we debug this. Some questions on your test environment that might help narrow down the behavior:

  • Can you verify that the volumes were created in the correct AWS AZ as their topology says?
  • Do you have more than one node in each DC with the running plugins?
  • Do any nodes in the DC not have a plugin?
  • Which DC is 17224e99-393e-16d2-aa0a-329ff47ca63b in? It would be interesting to see whether that was a DC with a different plugin or a DC without a plugin at all.

@tgross tgross reopened this May 19, 2022
@tgross tgross removed this from the 1.3.0 milestone May 19, 2022
ygersie (Contributor, Author) commented May 19, 2022

Hey @tgross

Can you verify that the volumes were created in the correct AWS AZ as their topology says?

They are, otherwise it would never work as planned, but here is the confirmation:

$ aws ec2 describe-volumes --volume-ids vol-0f53db9b7f68bca2c vol-07aae447a2bec7826 --query 'Volumes[].[VolumeId,AvailabilityZone]'
[
    [
        "vol-07aae447a2bec7826",
        "us-west-2b"
    ],
    [
        "vol-0f53db9b7f68bca2c",
        "us-west-2a"
    ]
]

and the volumes in nomad:

$ nomad volume status -json ygersie[0] | jq -r '"\(.ExternalID)\n\(.Topologies)"'
vol-0f53db9b7f68bca2c
[{"Segments":{"topology.ebs.csi.aws.com/zone":"us-west-2a"}}]

$ nomad volume status -json ygersie[1] | jq -r '"\(.ExternalID)\n\(.Topologies)"'
vol-07aae447a2bec7826
[{"Segments":{"topology.ebs.csi.aws.com/zone":"us-west-2b"}}]

Do you have more than one node in each DC with the running plugins?

I do have more than one node running the plugins; here's the nomad plugin output:

$ nomad plugin status -verbose
Container Storage Interface
ID                  Provider         Controllers Healthy/Expected  Nodes Healthy/Expected
aws-ebs-us-west-2a  ebs.csi.aws.com  2/2                           3/3
aws-ebs-us-west-2b  ebs.csi.aws.com  2/2                           3/3
aws-ebs-us-west-2c  ebs.csi.aws.com  2/2                           3/3
aws-ebs-us-west-2d  ebs.csi.aws.com  2/2                           3/3
aws-efs             efs.csi.aws.com  2/2                           12/12

I have 2 controllers running per AZ (== nomad datacenter) and each node then runs the node plugin as a system job. All plugins are reported healthy and there is sufficient capacity for placement.

Do any nodes in the DC not have a plugin?

No, each node has an EBS node plugin running.

Which DC is 17224e99-393e-16d2-aa0a-329ff47ca63b in? It would be interesting to see whether that was a DC with a different plugin or a DC without a plugin at all.

That node is running in us-west-2d, which doesn't have a volume at all. I only have 2 volumes created, one in us-west-2a and one in us-west-2b. They both have the required topology configured, as shown in my previous comment. Also, the plugins correctly report their accessible topologies:

$ nomad plugin status -verbose aws-ebs-us-west-2a
ID                   = aws-ebs-us-west-2a
Provider             = ebs.csi.aws.com
Version              = v1.4.0
Controllers Healthy  = 2
Controllers Expected = 2
Nodes Healthy        = 3
Nodes Expected       = 3

Controller Capabilities
  ATTACH_READONLY
  CLONE_VOLUME
  CONTROLLER_ATTACH_DETACH
  CREATE_DELETE_SNAPSHOT
  CREATE_DELETE_VOLUME
  EXPAND_VOLUME
  GET_CAPACITY
  GET_VOLUME
  LIST_SNAPSHOTS
  LIST_VOLUMES
  LIST_VOLUMES_PUBLISHED_NODES
  VOLUME_CONDITION

Node Capabilities
  EXPAND_VOLUME
  GET_VOLUME_STATS
  STAGE_UNSTAGE_VOLUME
  VOLUME_ACCESSIBILITY_CONSTRAINTS
  VOLUME_CONDITION

Accessible Topologies
Node ID   Accessible Topology
6095ae77  topology.ebs.csi.aws.com/zone=us-west-2a
a2fca83c  topology.ebs.csi.aws.com/zone=us-west-2a
4a3e8c7c  topology.ebs.csi.aws.com/zone=us-west-2a

Allocations
No allocations placed

and

$ nomad plugin status -verbose aws-ebs-us-west-2b
ID                   = aws-ebs-us-west-2b
Provider             = ebs.csi.aws.com
Version              = v1.4.0
Controllers Healthy  = 2
Controllers Expected = 2
Nodes Healthy        = 3
Nodes Expected       = 3

Controller Capabilities
  ATTACH_READONLY
  CLONE_VOLUME
  CONTROLLER_ATTACH_DETACH
  CREATE_DELETE_SNAPSHOT
  CREATE_DELETE_VOLUME
  EXPAND_VOLUME
  GET_CAPACITY
  GET_VOLUME
  LIST_SNAPSHOTS
  LIST_VOLUMES
  LIST_VOLUMES_PUBLISHED_NODES
  VOLUME_CONDITION

Node Capabilities
  EXPAND_VOLUME
  GET_VOLUME_STATS
  STAGE_UNSTAGE_VOLUME
  VOLUME_ACCESSIBILITY_CONSTRAINTS
  VOLUME_CONDITION

Accessible Topologies
Node ID   Accessible Topology
e2044d7c  topology.ebs.csi.aws.com/zone=us-west-2b
d45499f7  topology.ebs.csi.aws.com/zone=us-west-2b
2b5f4317  topology.ebs.csi.aws.com/zone=us-west-2b

Allocations
No allocations placed

tgross (Member) commented May 19, 2022

@ygersie that's all super helpful.

That node is running in us-west-2d which doesn't have a volume at all.

Very interesting! Ok, you've provided a ton of info here, so I can probably write a standalone test that exercises the whole scheduler with roughly the same state. Let me have a go at that; hopefully it'll turn up a reproduction and clues as to what's going wrong. Thanks again!

ygersie (Contributor, Author) commented May 19, 2022

@tgross you're welcome, and thanks for taking a quick look; the sooner this gets resolved the better :)
I'm not sure why the plugins report "No allocations placed"; the job is running and both volumes are now attached:

$ nomad volume status ygersie[0]
ID                   = ygersie[0]
Name                 = ygersie[0]
External ID          = vol-0f53db9b7f68bca2c
Plugin ID            = aws-ebs-us-west-2a
Provider             = ebs.csi.aws.com
Version              = v1.4.0
Schedulable          = true
Controllers Healthy  = 2
Controllers Expected = 2
Nodes Healthy        = 3
Nodes Expected       = 3
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = fs_type: ext4
Namespace            = mynamespace

Topologies
Topology  Segments
00        topology.ebs.csi.aws.com/zone=us-west-2a

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
bdb1e655  4a3e8c7c  example     0        run      running  8h32m ago  8h32m ago

and

$ nomad volume status ygersie[1]
ID                   = ygersie[1]
Name                 = ygersie[1]
External ID          = vol-07aae447a2bec7826
Plugin ID            = aws-ebs-us-west-2b
Provider             = ebs.csi.aws.com
Version              = v1.4.0
Schedulable          = true
Controllers Healthy  = 2
Controllers Expected = 2
Nodes Healthy        = 3
Nodes Expected       = 3
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = fs_type: ext4
Namespace            = mynamespace

Topologies
Topology  Segments
00        topology.ebs.csi.aws.com/zone=us-west-2b

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
eb6d7301  2b5f4317  example     0        run      running  8h34m ago  8h34m ago

ygersie (Contributor, Author) commented May 24, 2022

It seems this issue even occurs with the manual AZ mapping of source volumes; this worked before the 1.3.0 upgrade, at least. I have a job that is failing to be scheduled with the same constraint issue. The job has 3 task groups, each with a volume claim looking like:

                "Volumes": {
                    "data": {
                        "AccessMode": "single-node-writer",
                        "AttachmentMode": "file-system",
                        "MountOptions": {
                            "FSType": "ext4",
                            "MountFlags": null
                        },
                        "Name": "data",
                        "PerAlloc": false,
                        "ReadOnly": false,
                        "Source": "us-west-2c-redis[2]",
                        "Type": "csi"
                    }
                }

the volume status:

$ nomad volume status us-west-2c-redis[2]
ID                   = us-west-2c-redis[2]
Name                 = redis[2]
External ID          = vol-asdfasdfasdfasdf
Plugin ID            = aws-ebs-us-west-2c
Provider             = ebs.csi.aws.com
Version              = v1.4.0
Schedulable          = true
Controllers Healthy  = 2
Controllers Expected = 2
Nodes Healthy        = 3
Nodes Expected       = 3
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = fs_type: ext4
Namespace            = mynamespace

Allocations
No allocations placed

And the job status:

Placement Failure
Task Group "redis-0":
  * Class "default": 1 nodes excluded by filter
  * Constraint "CSI plugin aws-ebs-us-west-2a is missing from client f13406f2-1cb1-008b-ebf5-74c27f418a46": 1 nodes excluded by filter

Task Group "redis-1":
  * Class "default": 1 nodes excluded by filter
  * Constraint "CSI plugin aws-ebs-us-west-2b is missing from client f562d820-a589-6c02-807a-d653ebfb7b14": 1 nodes excluded by filter

Task Group "redis-2":
  * Class "default": 1 nodes excluded by filter
  * Constraint "CSI plugin aws-ebs-us-west-2c is missing from client 17224e99-393e-16d2-aa0a-329ff47ca63b": 1 nodes excluded by filter

The node ids are indeed in completely different datacenters, so they are rightfully missing the plugin.

tgross (Member) commented May 24, 2022

Ok, I haven't had a chance to come back to write that test. But:

I'm not sure why the plugins report "No allocations placed",

Just FYI on this: the nomad plugin status command has to hit the Allocations API in order to get the list of allocations. But because plugins don't have a namespace while the plugin jobs do, the Allocations API will respond with an empty list if the plugin jobs aren't in the default namespace and you don't pass a namespace to nomad plugin status.
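
(For example, and assuming the usual namespace options apply to this command: running it with the namespace the plugin jobs were submitted in, e.g. NOMAD_NAMESPACE=mynamespace nomad plugin status -verbose aws-ebs-us-west-2a, should make the allocations show up.)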

ygersie (Contributor, Author) commented May 26, 2022

I'm now also seeing this every now and then:

Placement Failure
Task Group "example":
  * Class "default": 1 nodes excluded by filter
  * Constraint "did not meet topology requirement": 1 nodes excluded by filter

There really seems to be a problem with selecting feasible nodes.

tgross (Member) commented May 31, 2022

Ok, I've got a failing test that demonstrates the issue:

$ NOMAD_TEST_LOG_LEVEL=debug go test -v ./scheduler -run TestServiceSched_CSITopology -count=1
=== RUN   TestServiceSched_CSITopology
=== PAUSE TestServiceSched_CSITopology
=== CONT  TestServiceSched_CSITopology
2022-05-31T10:26:56.851-0400 [DEBUG] scheduler/generic_sched.go:384: service_sched: reconciled current state with desired state: eval_id=dcbcac79-43e6-780a-6500-accff939ed44 job_id=mock-service-70f9fa3a-4a5d-f7ff-8568-d593e72cbd87 namespace=default
  results=
  | Total changes: (place 2) (destructive 0) (inplace 0) (stop 0) (disconnect 0) (reconnect 0)
  | Desired Changes for "web": (place 2) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)

2022-05-31T10:26:56.851-0400 [DEBUG] scheduler/generic_sched.go:301: service_sched: failed to place all allocations, blocked eval created: eval_id=dcbcac79-43e6-780a-6500-accff939ed44 job_id=mock-service-70f9fa3a-4a5d-f7ff-8568-d593e72cbd87 namespace=default blocked_eval_id=94ca12a5-b216-677a-bc29-bd31340c4ec8
2022-05-31T10:26:56.851-0400 [DEBUG] scheduler/util.go:796: service_sched: setting eval status: eval_id=dcbcac79-43e6-780a-6500-accff939ed44 job_id=mock-service-70f9fa3a-4a5d-f7ff-8568-d593e72cbd87 namespace=default status=complete
FILTERED: CSI plugin test-plugin-zone0 is missing from client 1dc7bc5b-930e-417f-3cf8-3222e4d98413
    generic_sched_test.go:6486:
                Error Trace:    generic_sched_test.go:6486
                Error:          "[]" should have 1 item(s), but has 0
                Test:           TestServiceSched_CSITopology
                Messages:       expected one plan
--- FAIL: TestServiceSched_CSITopology (0.00s)
FAIL
FAIL    github.com/hashicorp/nomad/scheduler    0.706s
FAIL

That test can be found in the b-csi-feasibility-check branch. Interestingly, if I change this test so that all the plugins have the same ID, everything works as expected. This gives us two things:

  • The test narrows down the range of problems we need to figure out; the topology feasibility check itself seems to be correct on an individual node basis. We can only demonstrate the behavior at the whole-scheduler level.
  • Using the same plugin ID and relying solely on topology may be a useful temporary workaround while we get this figured out.

Edit: it occurred to me that this plugin ID feasibility check happens before topology entirely. So I removed topology from the plugins and volume requests and it still fails the same way. That probably eliminates topology as the source of the issue.

Edit 2: I've run out of my timebox for this today, but in some detailed exercising of this test (with a lot of printf debugging), I'm finding that the CSI feasibility checker simply isn't getting the full set of nodes to process, which suggests there's some deeper bug lurking in the scheduler that we've been missing for a while. I saw your comment on #12748 @ygersie, so I'll pick this up again a bit later this week with that in mind.

@dhung-hashicorp dhung-hashicorp added the hcc/cst label Jun 6, 2022
tgross (Member) commented Jun 7, 2022

I've just opened a PR, #13274, which should ship in Nomad 1.3.2 with backports to Nomad 1.2.x and Nomad 1.1.x.

Something interesting we discovered here is that this bug has existed since we first implemented CSI, but topology constraints make the feasibility check "sparser" and therefore more likely to hit this bug. But anyone running CSI on a cluster with a lot of heterogeneity could have easily hit it as well.

github-actions bot commented Oct 7, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 7, 2022