Add Job awaiter #633

Merged: 2 commits merged into master from lblackstone/job-await on Sep 30, 2019

@lblackstone (Member) commented Jul 11, 2019

Fixes #449.

TODO:

  • Report errors
  • Add tests
  • Await deletion

Here's how things look at this point:

Failed Job:

[0/2] Waiting for Job "foo" to start
[1/2] Waiting for Job "foo" to succeed
warning: [Pod foo-vk6wg]: containers with unready status: [pi]

warning: [Pod foo-vk6wg]: containers with unready status: [pi] -- [ContainerCannotRun] OCI runtime create failed: container_linux.go:345: starting container process caused "exec: \"perly\": executable file not found in $PATH": unknown

error: Job has reached the specified backoff limit -- [BackoffLimitExceeded] Job has reached the specified backoff limit
 
error: Plan apply failed: 4 errors occurred:
	* resource foo was successfully created, but the Kubernetes API server reported that it failed to fully initialize or become live: Resource 'foo' was created but failed to initialize
	* [Pod foo-vk6wg]: containers with unready status: [pi]
	* [Pod foo-vk6wg]: containers with unready status: [pi] -- [ContainerCannotRun] OCI runtime create failed: container_linux.go:345: starting container process caused "exec: \"perly\": executable file not found in $PATH": unknown
	* Job has reached the specified backoff limit -- [BackoffLimitExceeded] Job has reached the specified backoff limit

Successful Job:

[0/2] Waiting for Job "foo" to start
[1/2] Waiting for Job "foo" to succeed
warning: [Pod foo-wsq8r]: containers with unready status: [pi]
Job ready

@lblackstone lblackstone force-pushed the lblackstone/job-await branch 3 times, most recently from 07aed3d to 59f6ea7 Compare September 4, 2019 17:51
@nesl247 commented Sep 19, 2019

Looking forward to this. Should really simplify our code base. Any idea when this is targeted to be completed?

@lblackstone (Member, Author) replied:

> Any idea when this is targeted to be completed?

I'd guess that it will be in a dev build within the next week.

@nesl247 commented Sep 19, 2019

Awesome, so glad to hear that. This is a HUGE feature for us, and for pretty much anyone using Pulumi to deploy applications that require things like DB migrations.

@lblackstone lblackstone force-pushed the lblackstone/job-await branch 2 times, most recently from 377ce21 to 9759d2f Compare September 20, 2019 16:12
@lblackstone lblackstone marked this pull request as ready for review September 20, 2019 16:16
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

Contributor commented:

Really digging the test scenarios here 🙂. I know we do similar, extended coverage with fixtures for all other resources, but I'm not sure to what extent. How is the parity in this space across all resources?

@lblackstone (Member, Author) replied:

This is similar to the way we're testing Pod, and is the general direction I'm taking the test coverage. Existing tests for other resources are basically black-box tests, and generally make it harder to reason about correctness.

There's room for another layer of tests to make sure the awaiter channels/timeouts are wired up properly, but I think that's a lot less critical and less error-prone than the state-checking logic I'm testing here.
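
As a rough illustration of the fixture-style state testing described above, here is a minimal Go sketch. It is not the code from this PR: `jobSnapshot`, `check`, and `replay` are hypothetical names, and the snapshot is a simplified stand-in for a recorded Job status. The point is only that the verdict logic is a pure function that can be driven by recorded states, with no channels or timeouts involved.

```go
package main

import "fmt"

// jobSnapshot is a hypothetical, simplified stand-in for one recorded Job status.
type jobSnapshot struct {
	Active    int  // Pods currently running
	Succeeded int  // Pods that finished successfully
	Failed    bool // true once the Job reports a terminal failure
}

type verdict int

const (
	statePending verdict = iota
	stateSucceeded
	stateFailed
)

// check is a pure function: a snapshot goes in, a verdict comes out.
func check(s jobSnapshot) verdict {
	switch {
	case s.Failed:
		return stateFailed
	case s.Succeeded > 0 && s.Active == 0:
		return stateSucceeded
	default:
		return statePending
	}
}

// replay feeds recorded snapshots to check in order and stops at the
// first terminal verdict, the way a test would replay a fixture.
func replay(fixture []jobSnapshot) verdict {
	for _, s := range fixture {
		if v := check(s); v != statePending {
			return v
		}
	}
	return statePending
}

func main() {
	success := []jobSnapshot{{}, {Active: 1}, {Succeeded: 1}}
	failure := []jobSnapshot{{}, {Active: 1}, {Failed: true}}
	fmt.Println(replay(success) == stateSucceeded, replay(failure) == stateFailed)
}
```

Because `check` has no side effects, each scenario is just a slice of snapshots plus an expected verdict, which is what makes this style easier to reason about than a black-box test.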

@metral (Contributor) left a comment:

Overall LGTM. A couple of logical changes would be nice to see, to reduce complexity.

@metral (Contributor) left a comment:

LGTM

cc @hausdorff - PTAL and review.

@hausdorff (Contributor) left a comment:

Adding await semantics to Job simply enriches the existing functionality, so I think this is the right move.

But, before we click the merge button, I just want to make sure we're all on the same page about the (very weird) implications of using Job in Pulumi. All of these were true before, but we never had this conversation before, so let me state some things I believe to be true about the Job, and if any of them are wrong, please correct me:

  • If you run pulumi up with a Job, it will stick around until you delete it. So subsequent runs of pulumi up will not cause the job to re-run.
  • Users should be very cautious of including Job in Pulumi programs! Unlike other resource types, Job is intended to run once (e.g., for a DB schema migration), so when and how it runs really matters. Once you add a Job to your Pulumi project, ordering suddenly matters a lot—so if you run a fresh pulumi up and your Job does not run exactly when it is supposed to, it could fail the whole deployment.
  • We make no attempt to be smart about automated cleanup from the TTL controller. So, if a user sets .spec.ttlSecondsAfterFinished and the Job gets cleaned up, another run of pulumi up after the TTL will re-deploy the Job.

Like I said, I think all of this is fine, especially since we support it all already, but I just want us to go in with eyes open.

// A Job is a construct that allows users to run a workload as a Pod that terminates with a
// success or failure.
//
// A Job is considered "ready" if the following conditions are true:
Contributor commented:

We say that these are the conditions required to determine a job is "ready", but it sounds below like we're describing jobs that have completed?

@lblackstone (Member, Author) replied:

Yes, I meant "ready" in the sense that we're done waiting on the resource. For Job, that would mean it is complete.
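
To make the "done waiting" reading concrete, here is a minimal Go sketch of such a check. It is an assumption-laden illustration, not the logic in pkg/await/states/job.go: the `condition` struct is a hypothetical local type that mirrors only a few fields of a Kubernetes Job condition, and `jobDone` is a made-up name. It treats a `Complete` condition as success and a `Failed` condition as a terminal error, formatted like the `[BackoffLimitExceeded]` message in the output earlier in this thread.

```go
package main

import "fmt"

// condition is a hypothetical local type mirroring a few fields of a
// Kubernetes Job condition (type, status, reason, message).
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// jobDone reports whether waiting can stop. A true result with a nil error
// means the Job completed; a true result with an error means it failed.
func jobDone(conds []condition) (bool, error) {
	for _, c := range conds {
		if c.Status != "True" {
			continue // only conditions that currently hold are relevant
		}
		switch c.Type {
		case "Complete":
			return true, nil
		case "Failed":
			return true, fmt.Errorf("[%s] %s", c.Reason, c.Message)
		}
	}
	return false, nil // neither terminal condition holds yet: keep waiting
}

func main() {
	done, err := jobDone([]condition{{
		Type:    "Failed",
		Status:  "True",
		Reason:  "BackoffLimitExceeded",
		Message: "Job has reached the specified backoff limit",
	}})
	fmt.Println(done, err)
}
```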

@lukehoban (Contributor) commented:

> If you run pulumi up with a Job, it will stick around until you delete it. So subsequent runs of pulumi up will not cause the job to re-run.

Yes - that is expected and desired - unless you do something to force it to replace.

> Users should be very cautious of including Job in Pulumi programs! Unlike other resource types, Job is intended to run once (e.g., for a DB schema migration), so when and how it runs really matters. Once you add a Job to your Pulumi project, ordering suddenly matters a lot—so if you run a fresh pulumi up and your Job does not run exactly when it is supposed to, it could fail the whole deployment.

I think this behaves exactly as you want for scenarios where it is useful - as long as you can force it to replace when the thing that should trigger it to re-run changes.
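
One possible way to "force it to replace" is to derive the Job's name from whatever should trigger a re-run, on the assumption that a changed resource name leads Pulumi to replace the Job rather than leave it in place. The Go sketch below only shows the naming step. `jobName` is a hypothetical helper, and the trigger string (for example, a migration version) is an assumption chosen for illustration; nothing in this PR prescribes this approach.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// jobName is a hypothetical helper: it appends a short hash of the trigger
// value to the base name, so the name changes only when the trigger changes.
func jobName(base, trigger string) string {
	sum := sha256.Sum256([]byte(trigger))
	return base + "-" + hex.EncodeToString(sum[:])[:8]
}

func main() {
	// The same trigger yields the same name, so repeated deployments leave
	// the Job alone; a new trigger value yields a new name.
	fmt.Println(jobName("db-migrate", "schema-v1") == jobName("db-migrate", "schema-v1"))
	fmt.Println(jobName("db-migrate", "schema-v1") == jobName("db-migrate", "schema-v2"))
}
```

The resulting string would then be used as the Job's name in the Pulumi program, so that changing the trigger changes the resource's identity.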

@hausdorff (Contributor) left a comment:

This is simple enough that we can probably merge now. I'm fully confident we'll find more bugs, but I think we're well within our risk tolerance here.

I left a couple of comments. The biggest gap is that error reporting is not great in interactive mode. We can follow up on that, though.

@metral metral merged commit 5e311f8 into master Sep 30, 2019
@pulumi-bot pulumi-bot deleted the lblackstone/job-await branch September 30, 2019 23:38