fix improper remote-exec timeout and communicator error handling #17596

jbardin · 2018-03-15T17:00:26Z

The remote.Cmd struct could not convey any transport related error to
the caller, meaning that interrupted commands would show that they
succeeded.

Change Cmd.SetExited to accept an exit status, as well as an error to
store for the caller. Make the status and error fields internal,
require serialized access through the getter methods.

Users of remote.Cmd should not check both Cmd.Err() and Cmd.ExitStatus()
until after Wait() returns.

Require communicators to call Cmd.Init before executing the command.
This will indicate incorrect usage of the remote.Cmd by causing a panic
in SetExitStatus.

The timeout for the remote command was taken from the wrong config
field, and the connection timeout was being used which is 5 min. Any
remote command taking more than 5 min would be terminated by
disconnecting the communicator. Remove the timeout from the context, and
rely on the global timeout provided by terraform.

There was no way to get the error from the communicator previously, so
the broken connection was silently ignored and the provisioner returned
successfully. Now we can use the new cmd.Err() method to retrieve any
errors encountered during execution.

apparentlymart · 2018-03-15T18:58:14Z

builtin/provisioners/chef/resource_provisioner.go

@@ -697,18 +704,15 @@ func (p *provisioner) runCommand(o terraform.UIOutput, comm communicator.Communi
 	}

 	cmd.Wait()


(Can save this for another day since subjective design discussion shouldn't block bug fixes)

Could we make the interface here instead be that cmd.Wait() returns (int, error) to make it clearer that there are some return values to deal with here? The pattern of mutating the object and getting the result via accessors -- particularly now that there are two of them -- feels less obvious and easier to do incorrectly.

I'd also consider following the lead of os/exec here by making a non-zero exit status be a special type of error that implements String to return the error message you used below, thus making it simpler to handle in cases like this where the caller doesn't really care to distinguish errors returned by the program or operational errors along the way:

return cmd.Wait() // returns an error which might be a typed error wrapping an exit status

os/exec wraps the entire process state, which is not possible here (due to the remoteness) and also doesn't seem necessary. A wrapper around just the int feels sufficient.

It looks like there a few cases below where the caller does care to recognize the exit status, which we could make more ergonomic by a helper method remote.GetExitStatus(err) that returns the exit status if it's that sort of error or zero otherwise:

if exitStatus := remote.GetExitStatus(err); exitStatus != 0 { // ... }

What do you think? (Again, we can totally ignore this for now just to get the bug fixed, if you prefer.)

Yes, I was going to do the same thing, but then decided to try and keep the API changes minimal, while getting rid of the races and adding an error value. It turns out that I probably would have had the same amount of changes to handle had I added an error value to Wait, with a specific ExitError error type for the return code. 🤷‍♂️

I think we can make that a separate patch, since it's more cosmetic than anything at this point.

The remote.Cmd struct could not convey any transport related error to the caller, meaning that interrupted commands would show that they succeeded. Change Cmd.SetExited to accept an exit status, as well as an error to store for the caller. Make the status and error fields internal, require serialized access through the getter methods. Users of remote.Cmd should not check both Cmd.Err() and Cmd.ExitStatus() until after Wait returns. Require communicators to call Cmd.Init before executing the command. This will indicate incorrect usage of the remote.Cmd by causing a panic in SetExitStatus.

The timeout for the remote command was taken from the wrong config field, and the connection timeout was being used which is 5 min. Any remote command taking more than 5 min would be terminated by disconnecting the communicator. Remove the timeout from the context, and rely on the global timeout provided by terraform. There was no way to get the error from the communicator previously, so the broken connection was silently ignored and the provisioner returned successfully. Now we can use the new cmd.Err() method to retrieve any errors encountered during execution.

This was updated to see if we can get at any error status from the remote command and transport, which still is not available, but kept the latest version since it fixes a couple race conditions.

Convert to the new Cmd.ExitStatus() method in the salt-masterless provisioner. Add calls to Wait and remove race conditions around setting the status.

Use the new ExitStatus method, and also check the cmd.Err() method for errors. Remove leaks from the output goroutines in both provisioners by deferring their cleanup, and returning early on all error conditions.

EvaApplyProvisioners was not returning errors if there was already a multierror stored in the Error field. Always return the error to the caller.

The provisionerFail_createBeforeDestroy test was verifying the incorrect output. The create_before_destroy instance in the state has an ID of "bar" with require_new="abc", and a new instance would get an ID of "foo" with require_new="xyz". The existing test was expecting the following state: aws_instance.bar: (1 deposed) ID = bar provider = provider.aws require_new = abc Deposed ID 1 = foo (tainted) Which showed "bar" still the primary instance in the state, with the new instance "foo" as being the deposed instance, though properly tainted. The new output is: aws_instance.bar: (tainted) (1 deposed) ID = foo provider = provider.aws require_new = xyz type = aws_instance Deposed ID 1 = bar Showing the new "foo instance as being the primary instance in the state, with "bar" as the deposed instance.

ghost · 2020-04-04T02:15:41Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

jbardin requested a review from apparentlymart March 15, 2018 17:00

apparentlymart approved these changes Mar 15, 2018

View reviewed changes

jbardin added 7 commits March 15, 2018 16:03

update the vendored winrm release

b214834

This was updated to see if we can get at any error status from the remote command and transport, which still is not available, but kept the latest version since it fixes a couple race conditions.

fix exit status handling in salt-masterless

a715430

Convert to the new Cmd.ExitStatus() method in the salt-masterless provisioner. Add calls to Wait and remove race conditions around setting the status.

update the chef and habitat error handling

a1061ed

Use the new ExitStatus method, and also check the cmd.Err() method for errors. Remove leaks from the output goroutines in both provisioners by deferring their cleanup, and returning early on all error conditions.

return provisioner Apply errors

b2d111c

EvaApplyProvisioners was not returning errors if there was already a multierror stored in the Error field. Always return the error to the caller.

jbardin force-pushed the jbardin/remote-exec-error branch from 0879adf to 88e911a Compare March 15, 2018 20:04

jbardin merged commit a17f791 into master Mar 15, 2018

jbardin deleted the jbardin/remote-exec-error branch March 15, 2018 20:12

jbardin mentioned this pull request Mar 16, 2018

clean up remote.Cmd api #17609

Merged

ghost locked and limited conversation to collaborators Apr 4, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix improper remote-exec timeout and communicator error handling #17596

fix improper remote-exec timeout and communicator error handling #17596

jbardin commented Mar 15, 2018 •

edited

Loading

apparentlymart Mar 15, 2018

jbardin Mar 15, 2018

ghost commented Apr 4, 2020

		@@ -697,18 +704,15 @@ func (p *provisioner) runCommand(o terraform.UIOutput, comm communicator.Communi
		}

		cmd.Wait()

fix improper remote-exec timeout and communicator error handling #17596

fix improper remote-exec timeout and communicator error handling #17596

Conversation

jbardin commented Mar 15, 2018 • edited Loading

apparentlymart Mar 15, 2018

Choose a reason for hiding this comment

jbardin Mar 15, 2018

Choose a reason for hiding this comment

ghost commented Apr 4, 2020

jbardin commented Mar 15, 2018 •

edited

Loading