Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix improper remote-exec timeout and communicator error handling #17596

Merged
merged 7 commits into from
Mar 15, 2018

Conversation

jbardin
Copy link
Member

@jbardin jbardin commented Mar 15, 2018

The remote.Cmd struct could not convey any transport related error to
the caller, meaning that interrupted commands would show that they
succeeded.

Change Cmd.SetExited to accept an exit status, as well as an error to
store for the caller. Make the status and error fields internal,
require serialized access through the getter methods.

Users of remote.Cmd should not check both Cmd.Err() and Cmd.ExitStatus()
until after Wait() returns.

Require communicators to call Cmd.Init before executing the command.
This will indicate incorrect usage of the remote.Cmd by causing a panic
in SetExitStatus.

The timeout for the remote command was taken from the wrong config
field, and the connection timeout was being used which is 5 min. Any
remote command taking more than 5 min would be terminated by
disconnecting the communicator. Remove the timeout from the context, and
rely on the global timeout provided by terraform.

There was no way to get the error from the communicator previously, so
the broken connection was silently ignored and the provisioner returned
successfully. Now we can use the new cmd.Err() method to retrieve any
errors encountered during execution.

@jbardin jbardin requested a review from apparentlymart March 15, 2018 17:00
@@ -697,18 +704,15 @@ func (p *provisioner) runCommand(o terraform.UIOutput, comm communicator.Communi
}

cmd.Wait()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Can save this for another day since subjective design discussion shouldn't block bug fixes)

Could we make the interface here instead be that cmd.Wait() returns (int, error) to make it clearer that there are some return values to deal with here? The pattern of mutating the object and getting the result via accessors -- particularly now that there are two of them -- feels less obvious and easier to do incorrectly.

I'd also consider following the lead of os/exec here by making a non-zero exit status be a special type of error that implements String to return the error message you used below, thus making it simpler to handle in cases like this where the caller doesn't really care to distinguish errors returned by the program or operational errors along the way:

return cmd.Wait() // returns an error which might be a typed error wrapping an exit status

os/exec wraps the entire process state, which is not possible here (due to the remoteness) and also doesn't seem necessary. A wrapper around just the int feels sufficient.

It looks like there a few cases below where the caller does care to recognize the exit status, which we could make more ergonomic by a helper method remote.GetExitStatus(err) that returns the exit status if it's that sort of error or zero otherwise:

if exitStatus := remote.GetExitStatus(err); exitStatus != 0 {
    // ...
}

What do you think? (Again, we can totally ignore this for now just to get the bug fixed, if you prefer.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was going to do the same thing, but then decided to try and keep the API changes minimal, while getting rid of the races and adding an error value. It turns out that I probably would have had the same amount of changes to handle had I added an error value to Wait, with a specific ExitError error type for the return code. 🤷‍♂️

I think we can make that a separate patch, since it's more cosmetic than anything at this point.

jbardin added 7 commits March 15, 2018 16:03
The remote.Cmd struct could not convey any transport related error to
the caller, meaning that interrupted commands would show that they
succeeded.

Change Cmd.SetExited to accept an exit status, as well as an error to
store for the caller.  Make the status and error fields internal,
require serialized access through the getter methods.

Users of remote.Cmd should not check both Cmd.Err() and Cmd.ExitStatus()
until after Wait returns.

Require communicators to call Cmd.Init before executing the command.
This will indicate incorrect usage of the remote.Cmd by causing a panic
in SetExitStatus.
The timeout for the remote command was taken from the wrong config
field, and the connection timeout was being used which is 5 min. Any
remote command taking more than 5 min would be terminated by
disconnecting the communicator. Remove the timeout from the context, and
rely on the global timeout provided by terraform.

There was no way to get the error from the communicator previously, so
the broken connection was silently ignored and the provisioner returned
successfully. Now we can use the new cmd.Err() method to retrieve any
errors encountered during execution.
This was updated to see if we can get at any error status from the
remote command and transport, which still is not available, but kept the
latest version since it fixes a couple race conditions.
Convert to the new Cmd.ExitStatus() method in the salt-masterless
provisioner. Add calls to Wait and remove race conditions around setting
the status.
Use the new ExitStatus method, and also check the cmd.Err() method for
errors.

Remove leaks from the output goroutines in both provisioners by
deferring their cleanup, and returning early on all error conditions.
EvaApplyProvisioners was not returning errors if there was already a
multierror stored in the Error field. Always return the error to the
caller.
The provisionerFail_createBeforeDestroy test was verifying the incorrect
output. The create_before_destroy instance in the state has an ID of
"bar" with require_new="abc", and a new instance would get an ID of
"foo" with require_new="xyz". The existing test was expecting the
following state:

aws_instance.bar: (1 deposed)
  ID = bar
  provider = provider.aws
  require_new = abc
  Deposed ID 1 = foo (tainted)

Which showed "bar" still the primary instance in the state, with the new
instance "foo" as being the deposed instance, though properly tainted.

The new output is:

aws_instance.bar: (tainted) (1 deposed)
  ID = foo
  provider = provider.aws
  require_new = xyz
  type = aws_instance
  Deposed ID 1 = bar

Showing the new "foo instance as being the primary instance in the
state, with "bar" as the deposed instance.
@jbardin jbardin force-pushed the jbardin/remote-exec-error branch from 0879adf to 88e911a Compare March 15, 2018 20:04
@jbardin jbardin merged commit a17f791 into master Mar 15, 2018
@jbardin jbardin deleted the jbardin/remote-exec-error branch March 15, 2018 20:12
@ghost
Copy link

ghost commented Apr 4, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 4, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants