
[feature request] Ability to use variable interpolation in volume {} stanza #7877

Closed
benvanstaveren opened this issue May 6, 2020 · 19 comments · Fixed by #10136
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/hcl theme/storage type/enhancement

Comments

@benvanstaveren

Per a topic on Discuss - I've run into a few situations in our setup where it would be absolutely beneficial to use variable interpolation in the volume and host_volume stanzas. As an example, we currently set up various cloud instances with their own mounted volumes that belong to a particular instance. On these instances we like to run system jobs, so that every instance has one - but the problem is that there is currently no easy way to automatically mount the volume that "belongs" to a particular node.

In essence, being able to do something like...

volume "persistent-data" {
  type      = "csi"
  source    = "pd-${node.attr.id}"
  read_only = false
}

Would be most excellent. At that point each system job picks up its own volume automatically, which makes my life easier (and saves on job definitions; currently we use a single job with 40+ task groups to get this to work properly).

Outside the scope of this feature request (but maybe related) would be a change to host volumes - currently we have a few servers that use host volumes, but the same volume is shared between multiple jobs because we can't mount a subdirectory out of a host volume. Perhaps a change to allow something like:

volume "subdirectory-mounted-host-volume" {
  type = "host"
  source = "my-host-volume/${env "nomad_job_name"}"
  read_only = false
}

So that each job gets itself a named volume to be used with volume_mount that points to a subdirectory of the actual host volume. Currently, if we want to achieve this, we need to add new host volumes to the client configuration and restart the Nomad process on the node, which isn't something we're very keen on. (Alternatively, having new host volumes picked up on a reload, if that isn't already the case, would be nice too.)
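For context, host volumes are declared in the Nomad client agent configuration, so adding one means editing that file and (today) restarting the agent. A minimal sketch of what that looks like; the volume name and path here are illustrative assumptions:

```hcl
# Nomad client agent configuration (e.g. client.hcl)
client {
  enabled = true

  # Each host_volume must be declared up front; adding a new one
  # currently requires restarting the Nomad client on the node.
  host_volume "my-host-volume" {
    path      = "/opt/volumes/my-host-volume" # assumed path, for illustration
    read_only = false
  }
}
```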

This would probably entail some tweaks to the order in which things are resolved and set up, but would (again) make my life easier.

Can't speak for anyone else though so maybe I'm just the only one with this particular use case :D

@tgross
Member

tgross commented May 6, 2020

Hi @benvanstaveren! Thanks for opening this issue and giving a solid description of a use case. Will get this onto the road map.

Outside the scope of this feature request (but maybe related) would be a change to host volumes - currently we have a few servers that use host volumes, but the same volume is shared between multiple jobs because we can't mount a subdirectory out of a host volume - perhaps a change to allow something like:

Yup! Looks like we've had a few requests for this feature: #7110 #6536

@shantanugadgil
Contributor

cross referencing #8262

@ryanmickler
Contributor

ryanmickler commented Jul 27, 2020

a huge +1, as this is blocking our progress with CSI. In particular, we'd need the Nomad job environment variables e.g.

...

    group "group" {

        volume "group_csi_volume" {
            type = "csi"
            read_only = false
            source = "csi-test-${NOMAD_ALLOC_INDEX}"
        }

        ...

@tgross tgross added the stage/accepted Confirmed, and intend to work on. No timeline committment though. label Aug 24, 2020
@vrenjith
Contributor

Waiting for this badly 👍

@jokramer123

jokramer123 commented Nov 13, 2020

It would be super cool if we could simply do this with our nomad jobs:

parameterized {
  meta_required = ["volume"]
}

volume "efs" {
  type   = "csi"
  source = "${NOMAD_META_volume}"
}

@tgross
Member

tgross commented Nov 25, 2020

Wanted to follow up on this just to make sure folks don't think we're ignoring this important issue. This feature request seems like a single feature on the surface, but there are actually two parts to it:

  • Interpolating the volume block with information that's only available after placement on the host. This isn't going to be easily done (if at all), because we use the volume.source at the time of scheduling in order to determine where the volume can go. So there's a chicken-and-the-egg situation where the scheduler would not have enough information to place if we allowed the volume.source to include things like allocation index or client metadata. There may be improvements we can do here, but this is non-trivial. (I'm inclined to consider that as a larger Nomad issue than this feature request, because it impacts the entire scheduler.)

  • Interpolating the volume block's HCL with information that can be used for scheduling. We already can do this with the HCL2 feature in Nomad 1.0!

Here's an example of a job that consumes CSI volumes. The commented-out bits are how the job originally worked with a static CSI volume, and the dynamic blocks are the replacements that let us interpolate the source with data that's available at job submission time. (We're still working on getting better documentation for all the HCL2 stuff, but we should have it in good shape by the time we ship 1.0 GA.)

variables {
  volume_ids = ["volume0"]
}

job "example" {
  datacenters = ["dc1"]

  group "cache" {

    dynamic "volume" {
      for_each = var.volume_ids
      labels   = [volume.value]
      content {
        type      = "csi"
        source    = "test-${volume.value}"
        read_only = true
      }
    }

    # volume "volume0" {
    #   type      = "csi"
    #   source    = "test-volume0"
    #   read_only = true
    #   mount_options {
    #     fs_type = "ext4"
    #   }
    # }

    count = 2

    task "redis" {

      driver = "docker"

      config {
        image = "redis:3.2"

        port_map {
          db = 6379
        }
      }

      dynamic "volume_mount" {
        for_each = var.volume_ids
        content {
          volume      = "${volume_mount.value}"
          destination = "${NOMAD_ALLOC_DIR}/${volume_mount.value}"
        }
      }

      # volume_mount {
      #   volume      = "volume0"
      #   destination = "${NOMAD_ALLOC_DIR}/test"
      # }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

cc @notnoop to point out a working case of dynamic for our docs.
cc @galeep @yishan-lin so that internal/customer discussions of this feature request are using up-to-date information.

@tgross
Member

tgross commented Nov 25, 2020

Example of using HCL2 to accomplish this is in #9449

@ryanmickler
Contributor

ryanmickler commented Nov 26, 2020

Thanks a lot for the update @tgross.

This issue is still blocking adoption of CSI in our stack. A few comments:

Interpolating the volume block with information that's only available after placement on the host. This isn't going to be easily done (if at all), because we use the volume.source at the time of scheduling in order to determine where the volume can go. So there's a chicken-and-the-egg situation where the scheduler would not have enough information to place if we allowed the volume.source to include things like allocation index or client metadata. There may be improvements we can do here, but this is non-trivial. (I'm inclined to consider that as a larger Nomad issue than this feature request, because it impacts the entire scheduler.)

I certainly see that 'allocation index' is insufficient to use for this purpose. You actually just want the container to be able to interpolate 'CSI volume index' - because really, the indexing should be of the persistent objects, not at the allocation level.

So perhaps if we could specify that the 'count' of the group can be used to reference volumes, through something like a 'volume group':

volume_group "redis-volumes" {
  type      = "csi"
  source    = "test-${count.index}"
  read_only = true
  mount_options {
    fs_type = "ext4"
  }
}

and then we'd do something like

  task "redis" {
      driver = "docker"
      ...
      volume_mount {
        volume_group = "redis-volumes"
        destination = "${NOMAD_ALLOC_DIR}/test"
      }
      ...
      env {
       MY_ID = "${NOMAD_VOLUME_GROUP_redis_volumes_ID}" // internally calculated, persistent per volume 
       #MY_ID = "${NOMAD_ALLOC_INDEX}" // not needed anymore.
      }

How does that sound?

@tgross
Member

tgross commented Nov 30, 2020

This issue is still blocking adoption of CSI in our stack.

Sorry to hear that. Could you dig into that a bit more? I want to make sure this isn't a matter of having written an incomplete HCL2 example instead of something we can't work around for you.

Effectively there's a gap in the job specification:

  • A volume block can define a single volume that's used by a single allocation. Ex. AWS EBS or Azure Block Store.
  • A volume block can define a single volume that's shared between multiple allocations. Ex. an NFS volume.
  • A volume block cannot currently but should allow defining multiple independent volumes, each of which is mapped to one of the job's allocations.

The HCL2 example in #9449 effectively tries to turn case (3) into case (1).

So perhaps if we could specify that the 'count' of the group can be used to reference volumes, through something like a 'volume group'

I think you're on the right track here in terms of the job spec. Arguably, volumes aren't really group-level resources once you have CSI in the mix -- they're cluster-level resources that are being consumed by the job. And then the volume block is a reference to that cluster-level resource. But we could do everything we're talking about with volume_group today in volume and still run into the issue of not having any kind of count information available for mapping to allocations until after we've created a plan.

I think this all comes down to wanting to create the count of volumes in the scheduler where the plan is being created. Adding the count into the planning stage opens up a bunch of questions around updates: what happens when we reschedule an allocation, or have canaries, or have rolling updates? Probably not intractable but there's a lot of small details we'll need to get right.

Also we haven't yet implemented Volume Creation #8212, which I'm just starting design work on. If we want to be able to support some kind of count/index attribute, we'll need to handle counts at volume creation time too (and I haven't figured out what stage that happens in yet).

Our process around design work like this is usually to write up an RFC which gets reviewed within the team and then engineering-wide. I'm going to push for making this something we can share with the community for this issue and #8212 so that we can get some feedback.

@ryanmickler
Contributor

ryanmickler commented Dec 1, 2020

Thanks for that @tgross.

I'm a bit short of time, so I'll just ramble a bit.

First, let's address the use case:

  1. Need for a fleet of containers with persistent storage per-container.

e.g. a 3-node Redis cluster of AWS ECS containers (1 master, 2 replicas), each with a persistent EBS volume. If one of the containers goes down, we'll need to start up a new container with the same volume mounted (so it must be in the same AZ as the volume of the downed instance - a CSI topology constraint).

Let's assume for now that we have the CSI volume create functionality.

Just to abstractly use the spec I laid out before - with more detail. We need to indicate that we want 3 (volume-container) pairs that intersect a set of constraints.

First, at the group level, we set up the (volume- half of the joint constraints:

 group "redis-fleet" {
    count = 3

    volume_group "redis-volumes" {
       type      = "csi-ebs"
       read_only = true
       size = 10gb
     }

Then, also at the group level, we set up the -container) half of the constraints:

  task "redis" {
      driver = "docker"
      ...
      volume_mount {
        volume_group = "redis-volumes"
        destination = "${NOMAD_ALLOC_DIR}/test"
      }
      ...
      env {
       MY_ID = "${NOMAD_VOLUME_GROUP_redis_volumes_ID}" // internally calculated, persistent with the volume 
      }

The intersection of the two scheduling constraints is done by the 'volume_mount' stanza in the task. I.e., it's saying that when we schedule a container from this task, we also need to schedule (or satisfy) the placement constraints of the volume_group simultaneously.

If there was another task in this group besides "redis" we could also attach it to the same 'volume group' and have both containers (scheduled on the same host because they are in the same 'group') access the shared volume.

++ My general understanding of the syntax elements inside the 'group' stanza is that 'you get one of these per count' for all of the stanzas subordinate to 'group'. Which is why I feel that the way 'volume' is currently used is confusing. I would call 'volume' as it is used now 'common_volume', meaning that, despite what 'count' says, you are only going to get one of these. I almost feel that the default behavior should be 'volume_group' as above.

Other use cases:

  1. Scaling up the count.

This is relatively straightforward. There needs to be some care about scale-down-then-scale-up; i.e., do you want to reuse the previously initialized volumes, or do they get cleaned up?

  2. Canary

There should be a way to specify whether you want the canary container to use the existing persistent volume (i.e. get scheduled alongside the existing deployment), or be scheduled onto new persistent volumes (or temporary volumes?).

@tgross
Member

tgross commented Feb 3, 2021

We've done some design work on this and here's how we're going to solve this problem.

The existing workflow for mounting a CSI volume is as follows:

  • Job is registered.
  • Scheduler creates allocations in the reconciler, generating the allocation name that includes the index.
  • Scheduler selects the next client nodes where the volume request is feasible (the volume exists and has free claims).
  • Allocation is placed on a client.
  • The allocation runner’s csi_hook on the client makes a CSIVolumeClaim RPC to the server. This RPC:
    • Optionally makes a Controller RPC to attach the volume to the client host (ex. the AWS AttachVolume API, which creates a /dev/sdx device on the host).
    • Makes CSI Node RPCs to mount the volume on the host (ex. Mount /dev/sdx to /var/nomad/alloc/:csi-plugin-alloc-id/csi/:volume-id)
  • The task runner’s volume_hook on the client creates the mount configuration, which the task driver uses to mount the volume to the task (ex. a Docker bind-mount).

Note that in this workflow, the volume ID is used for feasibility checking and then later during the claim workflow on the client.

When the schedulers select the next client node they pass the placement along with a SelectOptions struct. We’ll add the allocation name as a field to this struct, which will allow the feasibility checker to extract the index (similarly to how the structs.Allocation.Index() method works). This lets us pass the allocation name with minimal disruption to the rest of the scheduler code and only adds that string to the memory usage of the scheduler.

We’ll add a per_alloc boolean field to the existing volume block. Jobs with this field set to true will fail validation if the update block has canaries, with an error message that tells the user that canaries are incompatible with per_alloc volumes.

volume "data" {
  type      = "csi"
  source    = "my-ebs-volume"
  per_alloc = true
}

When a job with per_alloc=true is scheduled, the VolumeChecker uses a separate feasibility check algorithm:

  • Extract the index from the allocation name, and interpolate the source with the allocation index so that, for example, my-ebs-volume becomes my-ebs-volume[0]. We're not going to allow custom interpolation of $NOMAD_ALLOC_INDEX here because we want the naming pattern to match what we're doing for allocations.
  • Query CSIVolumeByID to get a list of volumes that matches the interpolated source.
  • Verify the volume’s plugin exists and is healthy (as the existing checker does).
  • Compare these claims against the free claims of all volumes that meet the requirements.
  • If there are enough valid claims, return true from the feasibility check.

Allocations don’t carry CSIVolume information, only the un-interpolated volume request in its TaskGroup pointer. When the allocation is placed on the client, the client will extract the index from the allocation name, and re-interpolate the source as was done in the scheduler. The client will send the CSIVolumeClaim RPC to the server with that volume ID and so the RPC can remain unchanged.

Because the index is being generated when the scheduler does placements, an allocation with more than one CSI volume receives the same group of volumes after an update, and any state from ephemeral_disk migrations stays with the volumes.
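For reference, here's a minimal sketch of how per_alloc would be used end to end; the job, group, and task names and the count are illustrative assumptions, and the two volumes would have to be registered ahead of time as "my-ebs-volume[0]" and "my-ebs-volume[1]":

```hcl
job "example" {
  datacenters = ["dc1"]

  group "cache" {
    count = 2

    # With per_alloc = true, the scheduler interpolates the source per
    # allocation: allocation 0 claims "my-ebs-volume[0]" and allocation 1
    # claims "my-ebs-volume[1]". Canaries are incompatible with per_alloc.
    volume "data" {
      type      = "csi"
      source    = "my-ebs-volume"
      per_alloc = true
    }

    task "redis" {
      driver = "docker"

      volume_mount {
        volume      = "data"
        destination = "${NOMAD_ALLOC_DIR}/data"
      }
    }
  }
}
```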

@tgross
Member

tgross commented Mar 5, 2021

Just an update on this long-awaited feature, I'll be working on this over the next few weeks.

@ryanmickler
Contributor

Thanks so much for this @tgross, we're patiently watching the progress!

@apollo13
Contributor

Extract the index from the allocation name, and interpolate the source with the allocation index so that, for example, my-ebs-volume becomes my-ebs-volume[0]. We're not going to allow custom interpolation of $NOMAD_ALLOC_INDEX here because we want the naming pattern to match what we're doing for allocations.

So the final name will be my-ebs-volume[0]? This will then be the name under which one can manually register the volumes? I'd like to ask to maybe change it to -0 if that will indeed be the final name. CSI plugins might place restrictions on volume names, so it seems it would be better not to use special characters like [] (even though they make the intent rather clear).

@tgross
Member

tgross commented Mar 10, 2021

CSI plugins might place restrictions on volume names, so it seems it would be better not to use special characters like [] (even though they make the intent rather clear).

Do you have an example of that? Keep in mind this isn't the same as the External ID used by the storage provider.

@apollo13
Contributor

Oh right, yes the limitations (if at all) will be on external ID. Sorry!

@tgross
Member

tgross commented Mar 10, 2021

I've opened #10159 to tack onto the work for #8212 which should make working with those IDs a little nicer.

@tgross
Member

tgross commented Mar 18, 2021

#10136 has been merged and will ship with Nomad 1.1.

@tgross tgross added this to the 1.1.0 milestone Mar 18, 2021
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 21, 2022
7 participants