Nomad job restart #11533

Closed · wants to merge 5 commits

Conversation

@shishir-a412ed (Contributor) commented Nov 18, 2021

@schmichael
Thank you for writing the nomad-job-restart-rfc. This was super helpful!

This PR is not ready for review. I am opening it to write down my understanding of the flow and to ask some questions. We can use this PR as a brainstorming ground to make progress on the RFC and eventually get it into shape to be merged.

Based on nomad/checklist-command, this PR only contains code changes; no documentation changes are included yet. The PR also does not implement the monitor (and detach) mode yet, where job restart starts in monitor mode by default and the user can pass -detach to run the command in detach mode.

In Phase 1, we'll just implement:

  • nomad job restart -batch-size n -batch-wait d <job_id>
  • nomad job restart -attach <restart_id>
  • nomad job restart -cancel <restart_id>

Pause/resume can come in Phase 2.
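
Assuming the flags land as proposed (the job name example and the concrete batch values here are just placeholders), a Phase 1 run could look like:

    nomad job restart -batch-size 2 -batch-wait 30s example
    nomad job restart -attach <restart_id>
    nomad job restart -cancel <restart_id>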

My understanding of the changes is as follows:

  • New CLI command: nomad job restart

  • From the CLI, you jump to the HTTP API client code in api/jobs.go.

  • From the HTTP API client, you call Restart, which calls into the actual HTTP API defined in command/agent/job_endpoint.go.

  • The jobRestart function in command/agent/job_endpoint.go decodes the HTTP request into the API struct api.JobRestartRequest, which is then mapped into the RPC struct structs.JobRestartRequest before sending the RPC.

  • The Restart RPC is called. It checks permissions on the auth token, gets a snapshot from the state store, gets the allocations for the job, and restarts all the allocations (restarting allocations is not coded yet). Every time we restart an allocation, we update the modify index and commit this into Raft.

  • A new Raft message type is defined: JobRestartRequestType

  • Raft apply happens here

  • The Raft log message is applied by the FSM in nomad/fsm.go

  • This calls applyRestartJob, which decodes the Raft message into the RPC struct and calls into the state store: n.state.UpdateJobRestart (a rough sketch of this handler follows the schema below)

  • State store logic for committing the transaction to boltDB:

    // UpdateJobRestart updates jobRestart object into memdb.
    func (s *StateStore) UpdateJobRestart(msgType structs.MessageType, index uint64, req *structs.JobRestartRequest) error {
        fmt.Println("HELLO state/state_store.go: UpdateJobRestart") // debug print, left in to trace the flow

        txn := s.db.WriteTxnMsgT(msgType, index)
        defer txn.Abort()

        if err := s.updateJobRestartImpl(txn, index, req); err != nil {
            return err
        }
        return txn.Commit()
    }

    func (s *StateStore) updateJobRestartImpl(txn *txn, index uint64, req *structs.JobRestartRequest) error {
        fmt.Println("HELLO: state/state_store.go: updateJobRestartImpl") // debug print, left in to trace the flow

        req.ModifyIndex = index

        // Insert the job restart object into the "restart" table
        if err := txn.Insert("restart", req); err != nil {
            return fmt.Errorf("job restart update failed: %v", err)
        }
        return nil
    }

  • New state store (memdb) table schema defined:

    nomad/nomad/state/schema.go (lines 101 to 120 in 3344707)

    // jobRestartSchema returns the MemDB schema for the job restart table.
    // This table is used to store uuid and job restart state object.
    func jobRestartSchema() *memdb.TableSchema {
        return &memdb.TableSchema{
            Name: "restart",
            Indexes: map[string]*memdb.IndexSchema{
                // Primary index is used for job restart management
                // and simple direct lookup. ID (UUID) is required to be
                // unique.
                "id": {
                    Name:         "id",
                    AllowMissing: false,
                    Unique:       true,
                    Indexer: &memdb.UUIDFieldIndex{
                        Field: "ID",
                    },
                },
            },
        }
    }
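
To make the FSM step concrete, here is a rough sketch of what the applyRestartJob handler could look like inside nomad/fsm.go, modeled on the existing FSM apply handlers. The exact signature and error handling are assumptions for illustration, not necessarily what this patch does:

    // applyRestartJob decodes a JobRestartRequestType Raft log entry into the
    // RPC struct and hands it to the state store. Sketch only, modeled on
    // other FSM apply handlers; not necessarily what this patch implements.
    func (n *nomadFSM) applyRestartJob(msgType structs.MessageType, buf []byte, index uint64) interface{} {
        var req structs.JobRestartRequest
        if err := structs.Decode(buf, &req); err != nil {
            panic(fmt.Errorf("failed to decode JobRestartRequest: %v", err))
        }

        if err := n.state.UpdateJobRestart(msgType, index, &req); err != nil {
            n.logger.Error("UpdateJobRestart failed", "error", err)
            return err
        }
        return nil
    }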

As I mentioned above, this patch is not ready for review yet. It has a bunch of print statements left in; I added those just to test the flow and see whether the data is being passed around correctly. For example, when I run this on a test job with 5 allocations, I can print some info like:

Nov 18 20:08:10 linux nomad[1955]:     2021-11-18T20:08:10.489Z [INFO]  client: started client: node_id=3757b276-95bb-1ed4-4fd7-c3169f6d919d
Nov 18 20:08:18 linux nomad[1955]:     2021-11-18T20:08:18.984Z [INFO]  client: node registration complete
Nov 18 20:10:06 linux nomad[1955]: HELLO HELLO: nomad/job_endpoint.go JobRestartRequest: &{ID:58d503f8-2c54-d3f5-4ca3-8484d123206d JobID:count BatchSize:5 BatchWait:10 Status:running RestartedAllocs:[] StartedAt:2021-11-18 20:10:06.373692499 +0000 UTC m=+122.230232254 UpdatedAt:2021-11-18 20:10:06.373692568 +0000 UTC m=+122.230232321 CreateIndex:0 ModifyIndex:0 WriteRequest:{Region:global Namespace:default AuthToken: IdempotencyToken: InternalRpcInfo:{Forwarded:false}}}
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: 4089ce63-50e3-c370-54fe-6ab8a90d647e
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: 910ed5bc-20dc-c027-1a40-e3d9c5d1eb2a
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: c8d77bff-95bb-b4cb-b1d7-1dc2e8f294ef
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: df98f5ce-b716-8e6e-cd48-55c37cd2d1f1
Nov 18 20:10:06 linux nomad[1955]: Hello alloc ID: fd9661e3-b167-6916-0350-84da97cae0a2
Nov 18 20:10:06 linux nomad[1955]: HELLO nomad/fsm.go: applyRestartJob
Nov 18 20:10:06 linux nomad[1955]: Hello JobRestartRequest object: {ID:58d503f8-2c54-d3f5-4ca3-8484d123206d JobID:count BatchSize:5 BatchWait:10 Status:running RestartedAllocs:[] StartedAt:2021-11-18 20:10:06.373692499 +0000 UTC UpdatedAt:2021-11-18 20:10:06.373692568 +0000 UTC CreateIndex:0 ModifyIndex:0 WriteRequest:{Region:global Namespace:default AuthToken: IdempotencyToken: InternalRpcInfo:{Forwarded:false}}}
Nov 18 20:10:06 linux nomad[1955]: HELLO state/state_store.go: UpdateJobRestart
Nov 18 20:10:06 linux nomad[1955]: HELLO: state/state_store.go: updateJobRestartImpl

Questions:

  1. What would be the best way to restart allocations in the RPC code? (Are there any existing code references in the RPC layer I could refer to?) Should we signal the restart right before raftApply, i.e. for each alloc that we restart, update the modify index and apply a new log message into Raft? My understanding of the indexes is that there are two: CreateIndex and ModifyIndex. The first time we create the JobRestart object on the server, we set CreateIndex, and from that point onwards, every time we restart an alloc, ModifyIndex is updated.

  2. I am still not 100% clear on the state store table schema, since my table only has a single key (id). Does that mean we can just store the entire JobRestartRequest object against that key? It would help if you could take a look at the state store code where I am committing the transaction and let me know if I am missing something. (A sketch of how a lookup against this table might work follows this list.)
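
For what it's worth on question 2, other Nomad tables do store the whole struct under their primary "id" index and read it back the same way. A hypothetical read-side helper inside nomad/state/state_store.go, modeled on existing getters such as DeploymentByID (not part of this patch), could look like:

    // JobRestartByID returns the stored JobRestartRequest for the given UUID,
    // illustrating that memdb keeps the entire object under the "id" index.
    // Hypothetical helper for illustration only; not part of this patch.
    func (s *StateStore) JobRestartByID(ws memdb.WatchSet, id string) (*structs.JobRestartRequest, error) {
        txn := s.db.ReadTxn()

        watchCh, existing, err := txn.FirstWatch("restart", "id", id)
        if err != nil {
            return nil, fmt.Errorf("job restart lookup failed: %v", err)
        }
        ws.Add(watchCh)

        if existing != nil {
            return existing.(*structs.JobRestartRequest), nil
        }
        return nil, nil
    }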

This is probably a lot to read, and if it doesn't make a lot of sense, let me know. We can also take some of this conversation back to our internal Slack. I just wanted to put the patch out so it's easier to look at some code and go from there.

References: Issue #698

@lgfa29 (Contributor) commented Dec 21, 2021

Thanks @shishir-a412ed! As you mentioned, this is still a work in progress, so I will convert it to Draft until it's ready for review. I think you will be able to switch it back yourself, but if you can't, just let us know 🙂

@mikenomitch (Contributor) commented

Hey @shishir-a412ed, just wanted to check in and see how this is going. Is anything blocking you?

5 commits, each signed off: Signed-off-by: Shishir Mahajan <smahajan@roblox.com>
@mikenomitch (Contributor) commented

FYI, I spoke with @shishir-a412ed, who has made a great start on this PR but probably won't be able to get around to finishing it.

If anybody internal or external to the Nomad team wants to take a swing at finishing this off, feel free. Please let us know if you do, so multiple people don't attempt it at the same time!

@mikenomitch mikenomitch added the help-wanted We encourage community PRs for these issues! label Oct 19, 2022
@lgfa29 lgfa29 self-assigned this Feb 10, 2023
@lgfa29 (Contributor) commented Mar 1, 2023

Closing this in favour of #16278.

Thank you so much for the great start @shishir-a412ed!

@zhixinwen commented Jun 21, 2024

This command caused me some surprise.
When I ran nomad job restart $JOB_ID, I was expecting each alloc in the job to restart one by one without causing issues in our running service. I expected restart to work similarly to update.

But the service was brought down quickly; it looks like the command restarts allocs without waiting for the restarted ones to come back online, and users see errors during the restart. Is there any way for me to make restart graceful?

@lgfa29 (Contributor) commented Jun 21, 2024

Hi @zhixinwen 👋

This is the expected and documented behaviour of this command. From the docs:

The command waits until the new allocations have client status ready before proceeding with the remaining batches. Services health checks are not taken into account.

I thought there was an open issue to track this feature request, but I can't seem to find it. If you don't mind, it would be useful to create one. But implementing it would be a little tricky, since this command runs purely on the machine that called it, which may not have access to Consul.

Is there any way for me to make restart graceful?

The -batch-wait flag can help you. If you have a sense of how long it usually takes for the service to be healthy, you could set a fixed wait time. A more complex solution could involve using -batch-wait=ask, monitoring the service health out-of-band, and sending a response on the job restart command's stdin when healthy.
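
For example, assuming a job named example (a placeholder) whose service usually becomes healthy within 30 seconds, either of these restarts one allocation at a time:

    # fixed wait between batches
    nomad job restart -batch-size=1 -batch-wait=30s example

    # pause after each batch and continue manually once health looks good
    nomad job restart -batch-size=1 -batch-wait=ask example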

Labels: help-wanted (We encourage community PRs for these issues!), theme/api (HTTP API and SDK issues), theme/cli