CSI: EBS plugin requires topology feature #10891

Closed
m1keil opened this issue Jul 13, 2021 · 17 comments · Fixed by #12129
Labels: stage/accepted (confirmed, and intend to work on; no timeline commitment though), theme/storage, type/bug

Comments

m1keil commented Jul 13, 2021

Nomad version

Nomad v1.1.1 (7feec97c04de4f8afff54ca9e56d66a61dfbfeb3)

Operating system and Environment details

AWS VPC with 3 subnets, each in a separate availability zone. We run CSI EBS controller on every AZ (3 in total). Every node plugin job is configured to run against the suitable controller in the same AZ.

Issue

When using nomad volume create, the volume is not created in the same availability zone that the plugin is running in.

Reproduction steps

Assuming our plugin IDs are aws-ebs0-eu-west-1a, aws-ebs0-eu-west-1b and aws-ebs0-eu-west-1c, save the following config in volume.hcl and run nomad volume create volume.hcl.

id        = "test"
name      = "test"
type      = "csi"
plugin_id = "aws-ebs0-eu-west-1c"

capacity_min = "30GiB"
capacity_max = "30GiB"

capability {
  access_mode = "single-node-writer"
  attachment_mode = "file-system"
}

Expected Result

Volume created in the eu-west-1c availability zone.

Actual Result

The volume is consistently created in the eu-west-1a availability zone. I'm not sure why it's always created by that controller.

jrasell (Member) commented Jul 14, 2021

Hi @m1keil, and thanks for the report. I'll need a little time to dig into this, after which I'll get back to you.

m1keil (Author) commented Jul 15, 2021

Cheers. If there are any logs or other debug info I can provide let me know.

mrproper commented Nov 7, 2021

I can confirm this is still the case, and it's more about the CSI plugin itself than Nomad:

variable "image" {
  type    = string
  default = "amazon/aws-ebs-csi-driver:release-1.4"
}
job "plugin-aws-ebs-controller" {
  datacenters = ["ops"]
  type        = "service"

  dynamic "group" {
    for_each = { "0" = "us-west-2a", "1" = "us-west-2b", "2" = "us-west-2c" }
    labels = ["controller-${group.key}"]
    content {
      count = 1
      constraint {
        attribute = "${attr.platform.aws.placement.availability-zone}"
        value     = "${group.value}"
      }
      task "plugin" {
        driver = "docker"
        config {
          image = var.image
          args = [
            "controller",
            "--endpoint=unix://csi/csi.sock",
            "--logtostderr",
            "--v=5",
            "--extra-tags=NomadCluster=hashi-prime1-ops",
          ]
        }
        csi_plugin {
          id        = "aws-ebs-${group.key}"
          type      = "controller"
          mount_dir = "/csi"
        }
        resources {
          cpu    = 500
          memory = 256
        }
      }
    }
  }
}

This places one controller in each availability zone, with plugin IDs aws-ebs-(0|1|2) constrained to us-west-2(a|b|c) respectively.

Then I define a volume:

id           = "prometheus[2]"
name         = "prometheus-2"
type         = "csi"
plugin_id    = "aws-ebs-2"
capacity_max = "1T"
capacity_min = "50G"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

mount_options {
  fs_type     = "ext4"
}

Then I create the volume:

$ nomad volume create ../nomad-volume-prometheus-2.hcl 
Created external volume vol-0bb2aa18a49f69f44 with ID prometheus[2]

The plugin that takes this request is the one on us-west-2c, and these are the relevant logs:

I1107 21:17:19.136131       1 controller.go:101] CreateVolume: called with args {Name:prometheus-2 CapacityRange:required_bytes:50000000000 limit_bytes:1000000000000  VolumeCapabilities:[mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > ] Parameters:map[] Secrets:map[] VolumeContentSource:<nil> AccessibilityRequirements: XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I1107 21:17:22.362823       1 cloud.go:319] [Debug] AZ is not provided. Using node AZ [us-west-2a]
I1107 21:17:25.680612       1 inflight.go:73] Node Service: volume="prometheus-2" operation finished

The AWS EBS CSI driver (https://github.com/kubernetes-sigs/aws-ebs-csi-driver) does not expose a parameter to set the availability zone in a volume create request, nor does the controller have an argument to limit itself to a particular AZ.
In fact, the plugin's feature list says:

Dynamic Provisioning - uses persistence volume claim (PVC) to request the Kubernetes to create the EBS volume on behalf of user and consumes the volume from inside container. Storage class's allowedTopologies could be used to restrict which AZ the volume should be provisioned in. The topology key should be topology.ebs.csi.aws.com/zone.

Nomad does nothing with the struct that contains allowedTopologies
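
To make the gap concrete, here is a minimal Python sketch of the behavior the logs above suggest. It is not the driver's or Nomad's actual code. When a CreateVolume request carries no accessibility requirements, the controller apparently falls back to the AZ of the node it reads from instance metadata. The topology key comes from the plugin docs quoted above; the function name pick_zone, the dict layout (loosely modeled on the CSI TopologyRequirement message with its requisite and preferred lists), and the order in which those lists are checked are all assumptions made for illustration.

```python
from typing import Optional

# Topology key named in the EBS plugin documentation quoted above.
ZONE_KEY = "topology.ebs.csi.aws.com/zone"


def pick_zone(accessibility_requirements: Optional[dict], node_az: str) -> str:
    """Hypothetical helper: choose the AZ for a new volume.

    `accessibility_requirements` loosely mirrors the CSI TopologyRequirement
    message: {"requisite": [{"segments": {...}}], "preferred": [...]}.
    Checking `preferred` before `requisite` is an assumption of this sketch.
    """
    reqs = accessibility_requirements or {}
    for field in ("preferred", "requisite"):
        for topology in reqs.get(field, []):
            zone = topology.get("segments", {}).get(ZONE_KEY)
            if zone:
                return zone
    # No topology supplied: fall back to the node's own AZ, which is what
    # the "AZ is not provided. Using node AZ" log line reports.
    return node_az


# Request as it is sent today: AccessibilityRequirements is empty.
without_topology = pick_zone(None, node_az="us-west-2a")

# Request that carries the zone the operator actually asked for.
with_topology = pick_zone(
    {"requisite": [{"segments": {ZONE_KEY: "us-west-2c"}}]},
    node_az="us-west-2a",
)

print(without_topology)
print(with_topology)
```

Under these assumptions, the only way to steer the volume into us-west-2c is for the caller to populate the topology segments, which is the part that never reaches the plugin here.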

tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Nov 8, 2021

tgross (Member) commented Nov 10, 2021

Hi @m1keil! Yeah, as @mrproper has noted this is a behavior specific to the AWS EBS plugin. We have a note about this kind of thing (which calls out that plugin specifically) in the csi_plugin docs:

Some plugins will create volumes only in the same location as the plugin. For example, the AWS EBS plugin will create and mount volumes only within the same Availability Zone. You should deploy these plugins with a unique-per-AZ plugin_id to allow Nomad to place allocations in the correct AZ.

mrproper commented Nov 10, 2021

This shouldn't be closed. The EBS plugin already allows for locking a storage class to particular availability zones:
https://kubernetes.io/docs/concepts/storage/storage-classes/#allowed-topologies
Nomad does not allow you to do this, as it dumps the topologies from the returned struct into /dev/null.

https://cluster-api-aws.sigs.k8s.io/topics/external-cloud-provider-with-ebs-csi-driver.html

tgross (Member) commented Nov 10, 2021

It's true we don't yet support CSI topology (#7669). But isn't the matchLabelExpressions in that spec just the labels on the K8s nodes and not some kind of automatically-derived AWS availability zone? In any case, if it's fixable via topology, we'll handle that in #7669.

m1keil (Author) commented Nov 11, 2021

@tgross I'm aware of the documentation (I think I'm the reason it was added :))
But this is exactly where the issue is:

When using nomad volume create, the volume is not created in the same availability zone that the plugin is running in.

The setup I have follows the directions as instructed and yet the volume isn't being created where I execute the create command.

tgross (Member) commented Nov 11, 2021

Hm... @m1keil can you provide the output of nomad job inspect on the plugin that's supposed to be getting the volume?

m1keil (Author) commented Nov 12, 2021

Which AZ am I in?

$ ec2metadata --availability-zone
eu-west-1b

Agent CSI info:

$ nomad node status -self | grep -i CSI
CSI Controllers = <none>
CSI Drivers     = aws-ebs0-eu-west-1b
CSI Volumes     = <none>

Plugin job (nodes):

$ nomad job inspect plugin-aws-ebs-nodes
{
    "Job": {
        "Affinities": null,
        "AllAtOnce": false,
        "Constraints": [
            {
                "LTarget": "${node.class}",
                "Operand": "=",
                "RTarget": "generic-app"
            }
        ],
        "ConsulNamespace": "",
        "ConsulToken": "",
        "CreateIndex": 489199,
        "Datacenters": [
            "dc1"
        ],
        "Dispatched": false,
        "ID": "plugin-aws-ebs-nodes",
        "JobModifyIndex": 864990,
        "Meta": null,
        "Migrate": null,
        "ModifyIndex": 864990,
        "Multiregion": null,
        "Name": "plugin-aws-ebs-nodes",
        "Namespace": "default",
        "NomadTokenID": "",
        "ParameterizedJob": null,
        "ParentID": "",
        "Payload": null,
        "Periodic": null,
        "Priority": 80,
        "Region": "eu-west-1_dev",
        "Reschedule": null,
        "Spreads": null,
        "Stable": false,
        "Status": "running",
        "StatusDescription": "",
        "Stop": false,
        "SubmitTime": 1633169101351091741,
        "TaskGroups": [
            {
                "Affinities": null,
                "Constraints": [
                    {
                        "LTarget": "${attr.platform.aws.placement.availability-zone}",
                        "Operand": "=",
                        "RTarget": "eu-west-1a"
                    }
                ],
                "Consul": {
                    "Namespace": ""
                },
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "Meta": null,
                "Migrate": null,
                "Name": "nodes-eu-west-1a",
                "Networks": null,
                "ReschedulePolicy": null,
                "RestartPolicy": {
                    "Attempts": 2,
                    "Delay": 15000000000,
                    "Interval": 1800000000000,
                    "Mode": "fail"
                },
                "Scaling": null,
                "Services": null,
                "ShutdownDelay": null,
                "Spreads": null,
                "StopAfterClientDisconnect": null,
                "Tasks": [
                    {
                        "Affinities": null,
                        "Artifacts": null,
                        "CSIPluginConfig": {
                            "ID": "aws-ebs0-eu-west-1a",
                            "MountDir": "/csi",
                            "Type": "node"
                        },
                        "Config": {
                            "image": "amazon/aws-ebs-csi-driver:v1.1.0",
                            "args": [
                                "node",
                                "--endpoint=unix://csi/csi.sock",
                                "--logtostderr",
                                "--v=5"
                            ],
                            "privileged": true
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": null,
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Kind": "",
                        "Leader": false,
                        "Lifecycle": null,
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "plugin",
                        "Resources": {
                            "CPU": 500,
                            "Cores": 0,
                            "Devices": null,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 256,
                            "MemoryMaxMB": 0,
                            "Networks": null
                        },
                        "RestartPolicy": {
                            "Attempts": 2,
                            "Delay": 15000000000,
                            "Interval": 1800000000000,
                            "Mode": "fail"
                        },
                        "ScalingPolicies": null,
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": null,
                        "User": "",
                        "Vault": null,
                        "VolumeMounts": null
                    }
                ],
                "Update": null,
                "Volumes": null
            },
            {
                "Affinities": null,
                "Constraints": [
                    {
                        "LTarget": "${attr.platform.aws.placement.availability-zone}",
                        "Operand": "=",
                        "RTarget": "eu-west-1b"
                    }
                ],
                "Consul": {
                    "Namespace": ""
                },
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "Meta": null,
                "Migrate": null,
                "Name": "nodes-eu-west-1b",
                "Networks": null,
                "ReschedulePolicy": null,
                "RestartPolicy": {
                    "Attempts": 2,
                    "Delay": 15000000000,
                    "Interval": 1800000000000,
                    "Mode": "fail"
                },
                "Scaling": null,
                "Services": null,
                "ShutdownDelay": null,
                "Spreads": null,
                "StopAfterClientDisconnect": null,
                "Tasks": [
                    {
                        "Affinities": null,
                        "Artifacts": null,
                        "CSIPluginConfig": {
                            "ID": "aws-ebs0-eu-west-1b",
                            "MountDir": "/csi",
                            "Type": "node"
                        },
                        "Config": {
                            "args": [
                                "node",
                                "--endpoint=unix://csi/csi.sock",
                                "--logtostderr",
                                "--v=5"
                            ],
                            "privileged": true,
                            "image": "amazon/aws-ebs-csi-driver:v1.1.0"
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": null,
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Kind": "",
                        "Leader": false,
                        "Lifecycle": null,
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "plugin",
                        "Resources": {
                            "CPU": 500,
                            "Cores": 0,
                            "Devices": null,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 256,
                            "MemoryMaxMB": 0,
                            "Networks": null
                        },
                        "RestartPolicy": {
                            "Attempts": 2,
                            "Delay": 15000000000,
                            "Interval": 1800000000000,
                            "Mode": "fail"
                        },
                        "ScalingPolicies": null,
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": null,
                        "User": "",
                        "Vault": null,
                        "VolumeMounts": null
                    }
                ],
                "Update": null,
                "Volumes": null
            },
            {
                "Affinities": null,
                "Constraints": [
                    {
                        "LTarget": "${attr.platform.aws.placement.availability-zone}",
                        "Operand": "=",
                        "RTarget": "eu-west-1c"
                    }
                ],
                "Consul": {
                    "Namespace": ""
                },
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "Meta": null,
                "Migrate": null,
                "Name": "nodes-eu-west-1c",
                "Networks": null,
                "ReschedulePolicy": null,
                "RestartPolicy": {
                    "Attempts": 2,
                    "Delay": 15000000000,
                    "Interval": 1800000000000,
                    "Mode": "fail"
                },
                "Scaling": null,
                "Services": null,
                "ShutdownDelay": null,
                "Spreads": null,
                "StopAfterClientDisconnect": null,
                "Tasks": [
                    {
                        "Affinities": null,
                        "Artifacts": null,
                        "CSIPluginConfig": {
                            "ID": "aws-ebs0-eu-west-1c",
                            "MountDir": "/csi",
                            "Type": "node"
                        },
                        "Config": {
                            "privileged": true,
                            "image": "amazon/aws-ebs-csi-driver:v1.1.0",
                            "args": [
                                "node",
                                "--endpoint=unix://csi/csi.sock",
                                "--logtostderr",
                                "--v=5"
                            ]
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": null,
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Kind": "",
                        "Leader": false,
                        "Lifecycle": null,
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "plugin",
                        "Resources": {
                            "CPU": 500,
                            "Cores": 0,
                            "Devices": null,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 256,
                            "MemoryMaxMB": 0,
                            "Networks": null
                        },
                        "RestartPolicy": {
                            "Attempts": 2,
                            "Delay": 15000000000,
                            "Interval": 1800000000000,
                            "Mode": "fail"
                        },
                        "ScalingPolicies": null,
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": null,
                        "User": "",
                        "Vault": null,
                        "VolumeMounts": null
                    }
                ],
                "Update": null,
                "Volumes": null
            }
        ],
        "Type": "system",
        "Update": {
            "AutoPromote": false,
            "AutoRevert": false,
            "Canary": 0,
            "HealthCheck": "",
            "HealthyDeadline": 0,
            "MaxParallel": 0,
            "MinHealthyTime": 0,
            "ProgressDeadline": 0,
            "Stagger": 0
        },
        "VaultNamespace": "",
        "VaultToken": "",
        "Version": 4
    }
}
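
Three near-identical task groups are hard to compare by eye, so here is a small Python sketch that summarizes which plugin ID is pinned to which AZ. It assumes the JSON layout shown in the output above (Job.TaskGroups[].Constraints and Tasks[].CSIPluginConfig). The function name plugin_placements and the trimmed SAMPLE document are illustrative and not part of Nomad.

```python
import json

# Node attribute used by the group-level constraints in the output above.
AZ_ATTR = "${attr.platform.aws.placement.availability-zone}"

# Trimmed stand-in for `nomad job inspect` output; only the fields read below.
SAMPLE = """
{
  "Job": {
    "ID": "plugin-aws-ebs-nodes",
    "TaskGroups": [
      {
        "Name": "nodes-eu-west-1a",
        "Constraints": [
          {"LTarget": "${attr.platform.aws.placement.availability-zone}",
           "Operand": "=", "RTarget": "eu-west-1a"}
        ],
        "Tasks": [
          {"Name": "plugin", "Constraints": null,
           "CSIPluginConfig": {"ID": "aws-ebs0-eu-west-1a",
                               "MountDir": "/csi", "Type": "node"}}
        ]
      },
      {
        "Name": "nodes-eu-west-1b",
        "Constraints": [
          {"LTarget": "${attr.platform.aws.placement.availability-zone}",
           "Operand": "=", "RTarget": "eu-west-1b"}
        ],
        "Tasks": [
          {"Name": "plugin", "Constraints": null,
           "CSIPluginConfig": {"ID": "aws-ebs0-eu-west-1b",
                               "MountDir": "/csi", "Type": "node"}}
        ]
      }
    ]
  }
}
"""


def plugin_placements(inspect_output: str) -> list:
    """Return one dict per CSI plugin task: group, plugin ID, type, pinned AZ."""
    job = json.loads(inspect_output)["Job"]
    rows = []
    for group in job.get("TaskGroups") or []:
        # A group-level "=" constraint on the AZ attribute pins the group.
        az = None
        for constraint in group.get("Constraints") or []:
            if constraint.get("LTarget") == AZ_ATTR and constraint.get("Operand") == "=":
                az = constraint.get("RTarget")
        for task in group.get("Tasks") or []:
            plugin = task.get("CSIPluginConfig")
            if plugin:
                rows.append({
                    "group": group["Name"],
                    "plugin_id": plugin["ID"],
                    "type": plugin["Type"],
                    "az": az,
                })
    return rows


rows = plugin_placements(SAMPLE)
for row in rows:
    # Plugin IDs in this thread end with the AZ name, so compare suffixes.
    ok = row["az"] is not None and row["plugin_id"].endswith(row["az"])
    print(row["group"], row["plugin_id"], row["type"], row["az"],
          "ok" if ok else "MISMATCH")
```

Against a live cluster, the input would be the output of nomad job inspect plugin-aws-ebs-nodes rather than SAMPLE. The suffix check only makes sense for plugin IDs that embed the AZ name, as they do in this setup.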

Plugin job (controller):

$ nomad job inspect plugin-aws-ebs-controller
{
    "Job": {
        "Affinities": null,
        "AllAtOnce": false,
        "Constraints": [
            {
                "LTarget": "${node.class}",
                "Operand": "=",
                "RTarget": "generic-app"
            }
        ],
        "ConsulNamespace": "",
        "ConsulToken": "",
        "CreateIndex": 342473,
        "Datacenters": [
            "dc1"
        ],
        "Dispatched": false,
        "ID": "plugin-aws-ebs-controller",
        "JobModifyIndex": 864976,
        "Meta": null,
        "Migrate": null,
        "ModifyIndex": 865039,
        "Multiregion": null,
        "Name": "plugin-aws-ebs-controller",
        "Namespace": "default",
        "NomadTokenID": "",
        "ParameterizedJob": null,
        "ParentID": "",
        "Payload": null,
        "Periodic": null,
        "Priority": 80,
        "Region": "eu-west-1_dev",
        "Reschedule": null,
        "Spreads": null,
        "Stable": true,
        "Status": "running",
        "StatusDescription": "",
        "Stop": false,
        "SubmitTime": 1633169094044568531,
        "TaskGroups": [
            {
                "Affinities": null,
                "Constraints": [
                    {
                        "LTarget": "${attr.platform.aws.placement.availability-zone}",
                        "Operand": "=",
                        "RTarget": "eu-west-1a"
                    }
                ],
                "Consul": {
                    "Namespace": ""
                },
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "Meta": null,
                "Migrate": {
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000
                },
                "Name": "controller-eu-west-1a",
                "Networks": null,
                "ReschedulePolicy": {
                    "Attempts": 0,
                    "Delay": 30000000000,
                    "DelayFunction": "exponential",
                    "Interval": 0,
                    "MaxDelay": 3600000000000,
                    "Unlimited": true
                },
                "RestartPolicy": {
                    "Attempts": 2,
                    "Delay": 15000000000,
                    "Interval": 1800000000000,
                    "Mode": "fail"
                },
                "Scaling": null,
                "Services": null,
                "ShutdownDelay": null,
                "Spreads": null,
                "StopAfterClientDisconnect": null,
                "Tasks": [
                    {
                        "Affinities": null,
                        "Artifacts": null,
                        "CSIPluginConfig": {
                            "ID": "aws-ebs0-eu-west-1a",
                            "MountDir": "/csi",
                            "Type": "controller"
                        },
                        "Config": {
                            "image": "amazon/aws-ebs-csi-driver:v1.1.0",
                            "args": [
                                "controller",
                                "--endpoint=unix://csi/csi.sock",
                                "--logtostderr",
                                "--v=5"
                            ]
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": null,
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Kind": "",
                        "Leader": false,
                        "Lifecycle": null,
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "plugin",
                        "Resources": {
                            "CPU": 500,
                            "Cores": 0,
                            "Devices": null,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 256,
                            "MemoryMaxMB": 0,
                            "Networks": null
                        },
                        "RestartPolicy": {
                            "Attempts": 2,
                            "Delay": 15000000000,
                            "Interval": 1800000000000,
                            "Mode": "fail"
                        },
                        "ScalingPolicies": null,
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": null,
                        "User": "",
                        "Vault": null,
                        "VolumeMounts": null
                    }
                ],
                "Update": {
                    "AutoPromote": false,
                    "AutoRevert": false,
                    "Canary": 0,
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000,
                    "ProgressDeadline": 600000000000,
                    "Stagger": 30000000000
                },
                "Volumes": null
            },
            {
                "Affinities": null,
                "Constraints": [
                    {
                        "LTarget": "${attr.platform.aws.placement.availability-zone}",
                        "Operand": "=",
                        "RTarget": "eu-west-1b"
                    }
                ],
                "Consul": {
                    "Namespace": ""
                },
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "Meta": null,
                "Migrate": {
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000
                },
                "Name": "controller-eu-west-1b",
                "Networks": null,
                "ReschedulePolicy": {
                    "Attempts": 0,
                    "Delay": 30000000000,
                    "DelayFunction": "exponential",
                    "Interval": 0,
                    "MaxDelay": 3600000000000,
                    "Unlimited": true
                },
                "RestartPolicy": {
                    "Attempts": 2,
                    "Delay": 15000000000,
                    "Interval": 1800000000000,
                    "Mode": "fail"
                },
                "Scaling": null,
                "Services": null,
                "ShutdownDelay": null,
                "Spreads": null,
                "StopAfterClientDisconnect": null,
                "Tasks": [
                    {
                        "Affinities": null,
                        "Artifacts": null,
                        "CSIPluginConfig": {
                            "ID": "aws-ebs0-eu-west-1b",
                            "MountDir": "/csi",
                            "Type": "controller"
                        },
                        "Config": {
                            "image": "amazon/aws-ebs-csi-driver:v1.1.0",
                            "args": [
                                "controller",
                                "--endpoint=unix://csi/csi.sock",
                                "--logtostderr",
                                "--v=5"
                            ]
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": null,
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Kind": "",
                        "Leader": false,
                        "Lifecycle": null,
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "plugin",
                        "Resources": {
                            "CPU": 500,
                            "Cores": 0,
                            "Devices": null,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 256,
                            "MemoryMaxMB": 0,
                            "Networks": null
                        },
                        "RestartPolicy": {
                            "Attempts": 2,
                            "Delay": 15000000000,
                            "Interval": 1800000000000,
                            "Mode": "fail"
                        },
                        "ScalingPolicies": null,
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": null,
                        "User": "",
                        "Vault": null,
                        "VolumeMounts": null
                    }
                ],
                "Update": {
                    "AutoPromote": false,
                    "AutoRevert": false,
                    "Canary": 0,
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000,
                    "ProgressDeadline": 600000000000,
                    "Stagger": 30000000000
                },
                "Volumes": null
            },
            {
                "Affinities": null,
                "Constraints": [
                    {
                        "LTarget": "${attr.platform.aws.placement.availability-zone}",
                        "Operand": "=",
                        "RTarget": "eu-west-1c"
                    }
                ],
                "Consul": {
                    "Namespace": ""
                },
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "Meta": null,
                "Migrate": {
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000
                },
                "Name": "controller-eu-west-1c",
                "Networks": null,
                "ReschedulePolicy": {
                    "Attempts": 0,
                    "Delay": 30000000000,
                    "DelayFunction": "exponential",
                    "Interval": 0,
                    "MaxDelay": 3600000000000,
                    "Unlimited": true
                },
                "RestartPolicy": {
                    "Attempts": 2,
                    "Delay": 15000000000,
                    "Interval": 1800000000000,
                    "Mode": "fail"
                },
                "Scaling": null,
                "Services": null,
                "ShutdownDelay": null,
                "Spreads": null,
                "StopAfterClientDisconnect": null,
                "Tasks": [
                    {
                        "Affinities": null,
                        "Artifacts": null,
                        "CSIPluginConfig": {
                            "ID": "aws-ebs0-eu-west-1c",
                            "MountDir": "/csi",
                            "Type": "controller"
                        },
                        "Config": {
                            "image": "amazon/aws-ebs-csi-driver:v1.1.0",
                            "args": [
                                "controller",
                                "--endpoint=unix://csi/csi.sock",
                                "--logtostderr",
                                "--v=5"
                            ]
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": null,
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Kind": "",
                        "Leader": false,
                        "Lifecycle": null,
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "plugin",
                        "Resources": {
                            "CPU": 500,
                            "Cores": 0,
                            "Devices": null,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 256,
                            "MemoryMaxMB": 0,
                            "Networks": null
                        },
                        "RestartPolicy": {
                            "Attempts": 2,
                            "Delay": 15000000000,
                            "Interval": 1800000000000,
                            "Mode": "fail"
                        },
                        "ScalingPolicies": null,
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": null,
                        "User": "",
                        "Vault": null,
                        "VolumeMounts": null
                    }
                ],
                "Update": {
                    "AutoPromote": false,
                    "AutoRevert": false,
                    "Canary": 0,
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000,
                    "ProgressDeadline": 600000000000,
                    "Stagger": 30000000000
                },
                "Volumes": null
            }
        ],
        "Type": "service",
        "Update": {
            "AutoPromote": false,
            "AutoRevert": false,
            "Canary": 0,
            "HealthCheck": "",
            "HealthyDeadline": 0,
            "MaxParallel": 1,
            "MinHealthyTime": 0,
            "ProgressDeadline": 0,
            "Stagger": 30000000000
        },
        "VaultNamespace": "",
        "VaultToken": "",
        "Version": 6
    }
}

Let's create the volume:

$ cat volume.hcl 
id        = "test"
name      = "test"
type      = "csi"
plugin_id = "aws-ebs0-eu-west-1b"

capacity_min = "30GiB"
capacity_max = "30GiB"

capability {
  access_mode = "single-node-writer"
  attachment_mode = "file-system"
}

$ nomad volume create volume.hcl
Created external volume vol-0c3404a26a0ad230b with ID test

Show volume

$ aws ec2 describe-volumes --volume-id vol-0c3404a26a0ad230b --query 'Volumes[].AvailabilityZone'
[
    "eu-west-1a"
]

@tgross
Member

tgross commented Nov 12, 2021

Ok, so the controllers certainly seem to line up on the right AZs:

$ jq '.Job.TaskGroups[] | [ .Constraints[0].RTarget, .Tasks[].CSIPluginConfig.ID ]' < controller.json
[
  "eu-west-1a",
  "aws-ebs0-eu-west-1a"
]
[
  "eu-west-1b",
  "aws-ebs0-eu-west-1b"
]
[
  "eu-west-1c",
  "aws-ebs0-eu-west-1c"
]
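
The same pairing check can be sketched in Python for anyone who wants to assert on it rather than eyeball the output. This is only an illustration: `controller_az_pairs`, `mismatched`, and the trimmed `SAMPLE` document are hypothetical names and data, and the check assumes the naming convention used in this thread, where each plugin ID ends with the AZ it is constrained to.

```python
import json


def controller_az_pairs(job_doc):
    """Return (constraint AZ, CSI plugin ID) pairs, one per task in each task group."""
    pairs = []
    for group in job_doc["Job"]["TaskGroups"]:
        constraints = group.get("Constraints") or []
        # Like the jq filter, only the first constraint of the group is read.
        az = constraints[0]["RTarget"] if constraints else None
        for task in group.get("Tasks") or []:
            plugin = task.get("CSIPluginConfig") or {}
            pairs.append((az, plugin.get("ID")))
    return pairs


def mismatched(pairs):
    """Return the pairs whose plugin ID does not end with its constraint AZ."""
    return [(az, pid) for az, pid in pairs if not (az and pid and pid.endswith(az))]


# Trimmed stand-in for controller.json, keeping only the fields the check reads.
SAMPLE = json.loads("""
{"Job": {"TaskGroups": [
  {"Constraints": [{"RTarget": "eu-west-1a"}],
   "Tasks": [{"CSIPluginConfig": {"ID": "aws-ebs0-eu-west-1a"}}]},
  {"Constraints": [{"RTarget": "eu-west-1b"}],
   "Tasks": [{"CSIPluginConfig": {"ID": "aws-ebs0-eu-west-1b"}}]},
  {"Constraints": [{"RTarget": "eu-west-1c"}],
   "Tasks": [{"CSIPluginConfig": {"ID": "aws-ebs0-eu-west-1c"}}]}
]}}
""")

pairs = controller_az_pairs(SAMPLE)
print(pairs)
print("mismatched:", mismatched(pairs))
```

An empty `mismatched` list only confirms that each controller is constrained to the AZ its ID names; it says nothing about which AZ the plugin uses when it creates a volume.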

I know our E2E tests don't run into this issue but they're also running an older version of the plugin (amazon/aws-ebs-csi-driver:v0.9.0). I don't see anything in the post-0.9 changelog that suggests this is something new, but I may just be missing it. Ultimately if the plugin expects topology support, then that's almost certainly the right way to do this of course, but it's not on our immediate roadmap to do that work.

Is the plugin fingerprinting its availability zone correctly? What does the controller plugin for an AZ say in its logs (both stdout and stderr) when it comes up?

@tgross tgross reopened this Nov 12, 2021
Nomad - Community Issues Triage automation moved this from Done to Needs Triage Nov 12, 2021
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Nov 12, 2021
@m1keil
Author

m1keil commented Nov 13, 2021

Yes, fingerprinting seems to be fine.
I also went ahead and downgraded to 0.9.0 to see if that would help, but no luck there either.

Logs:

I1112 23:57:59.854930       1 driver.go:68] Driver: ebs.csi.aws.com Version: v0.9.0
W1112 23:58:03.119419       1 metadata.go:136] Failed to parse the outpost arn: 
I1112 23:58:03.119836       1 driver.go:138] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I1112 23:58:05.908775       1 controller.go:334] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I1112 23:58:35.918369       1 controller.go:334] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
....
(truncated)

I1113 00:09:09.323524       1 controller.go:95] CreateVolume: called with args {Name:test CapacityRange:required_bytes:32212254720 limit_bytes:32212254720  VolumeCapabilities:[mount:<> access_mode:<mode:SINGLE_NODE_WRITER > ] Parameters:map[] Secrets:map[] VolumeContentSource:<nil> AccessibilityRequirements: XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I1113 00:09:09.434854       1 cloud.go:284] AZ is not provided. Using node AZ []
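
The telling detail in the CreateVolume line is that `AccessibilityRequirements:` is followed by nothing. A rough Python sketch for spotting that in a captured log is below. The helper names `accessibility_requirements` and `sent_topology` are hypothetical, and the regular expression assumes the field is always followed by ` XXX_NoUnkeyedLiteral`, as it is in the line above; other driver versions may print the request differently.

```python
import re

# Assumption: the field is printed immediately before " XXX_NoUnkeyedLiteral",
# as in the log line quoted in this thread.
_FIELD = re.compile(r"AccessibilityRequirements:(.*?) XXX_NoUnkeyedLiteral")

LOG_LINE = (
    "I1113 00:09:09.323524       1 controller.go:95] CreateVolume: called with args "
    "{Name:test CapacityRange:required_bytes:32212254720 limit_bytes:32212254720  "
    "VolumeCapabilities:[mount:<> access_mode:<mode:SINGLE_NODE_WRITER > ] "
    "Parameters:map[] Secrets:map[] VolumeContentSource:<nil> "
    "AccessibilityRequirements: XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}"
)


def accessibility_requirements(log_line):
    """Return the raw text printed for the field, or None if the field is absent."""
    match = _FIELD.search(log_line)
    if match is None:
        return None
    return match.group(1).strip()


def sent_topology(log_line):
    """True only when the request carried a non-empty topology requirement."""
    return bool(accessibility_requirements(log_line))


print(repr(accessibility_requirements(LOG_LINE)))
print("topology sent:", sent_topology(LOG_LINE))
```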

@m1keil
Author

m1keil commented Nov 13, 2021

I've now upgraded to the latest v1.4.0 controller and ran the create command again. The logs now show this:

I1113 00:23:45.612155       1 controller.go:101] CreateVolume: called with args {Name:test CapacityRange:required_bytes:32212254720 limit_bytes:32212254720  VolumeCapabilities:[mount:<> access_mode:<mode:SINGLE_NODE_WRITER > ] Parameters:map[] Secrets:map[] VolumeContentSource:<nil> AccessibilityRequirements: XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I1113 00:23:45.647943       1 cloud.go:319] [Debug] AZ is not provided. Using node AZ [eu-west-1a]

but this is being run on a node located in eu-west-1b. So it seems to me like the AZ is just being picked up at random.

As a final test, I shut down the controller in eu-west-1a and re-ran the test. The volume still ended up in eu-west-1a.

This seems to be an identical issue and indeed related to topology support.
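
To make the topology dependency concrete, here is a minimal Python sketch of how a controller might choose a zone from a CSI CreateVolume request. It is not the EBS driver's actual code. The `requisite`, `preferred`, and `segments` field names come from the CSI spec's `TopologyRequirement` message; the segment key `topology.ebs.csi.aws.com/zone`, the function `pick_zone`, and the `fallback_zone` parameter (standing in for whatever default the plugin uses when no AZ is provided) are assumptions for illustration.

```python
# Assumption: the zone segment key the EBS plugin reports; verify against the driver docs.
ZONE_KEY = "topology.ebs.csi.aws.com/zone"


def pick_zone(accessibility_requirements, fallback_zone):
    """Pick a zone from a topology requirement, else return the fallback.

    accessibility_requirements is a dict shaped like the CSI TopologyRequirement
    message: optional "preferred" and "requisite" lists of {"segments": {...}}.
    """
    requirement = accessibility_requirements or {}
    # Try the orchestrator's preferences first, then the hard requirements.
    for field in ("preferred", "requisite"):
        for topology in requirement.get(field) or []:
            zone = (topology.get("segments") or {}).get(ZONE_KEY)
            if zone:
                return zone
    # No topology in the request: where the controller runs plays no part.
    return fallback_zone


# Request as seen in the logs above: no accessibility requirements at all.
print(pick_zone(None, fallback_zone="eu-west-1a"))

# Request that does carry a zone requirement.
with_topology = {"requisite": [{"segments": {ZONE_KEY: "eu-west-1b"}}]}
print(pick_zone(with_topology, fallback_zone="eu-west-1a"))
```

Under these assumptions, the only way to steer the volume into a given AZ is for the orchestrator to send a topology requirement with the request, which matches the behaviour observed in the tests above.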

@tgross
Member

tgross commented Nov 15, 2021

Thanks @m1keil. Looks like that's the solution then. It hasn't been on our immediate roadmap but if it blocks a common usage like this that's something we'll want to bump up in priority.

@tgross tgross moved this from In Progress to Needs Roadmapping in Nomad - Community Issues Triage Nov 15, 2021
@tgross tgross removed their assignment Nov 15, 2021
@tgross tgross changed the title CSI: EBS volume created in the wrong availability zone CSI: EBS plugin requires topology feature Nov 15, 2021
@tgross tgross added the stage/accepted Confirmed, and intend to work on. No timeline committment though. label Feb 3, 2022
@tgross
Member

tgross commented Feb 24, 2022

I'm actively working on #7669 which will resolve this issue.

@tgross tgross self-assigned this Feb 24, 2022
@tgross tgross added this to the 1.3.0 milestone Feb 24, 2022
Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Mar 1, 2022
@tgross
Member

tgross commented Mar 1, 2022

Resolved by #12129, which will ship in Nomad 1.3.0

@m1keil
Author

m1keil commented Mar 2, 2022

Thanks, I'm planning to upgrade and test once this is released. Will report back.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 10, 2022