Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/exec: consider changing Wait to stop copying goroutines rather than waiting for them #23019

Closed
ianlancetaylor opened this issue Dec 6, 2017 · 36 comments
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. FrozenDueToAge Proposal Proposal-Accepted
Milestone

Comments

@ianlancetaylor
Copy link
Contributor

When the Stdin, Stdout, or Stderr fields of os/exec.Cmd are set to anything other than nil or a *os.File (a common case is *bytes.Buffer), we call os.Pipe to get a pipe and create goroutines to copy data in or out. The (*Cmd).Wait method first waits for the subprocess to exit, then waits for those goroutines to finish copying data.

If the subprocess C1 itself starts a subsubprocess C2, and if C1 passes any of its stdin/stdout/stderr descriptors to C2, and if C1 exits without waiting for C2 to exit, then C2 will hold an open end of the pipes created by the os/exec package. The (*Cmd).Wait method will wait until the goroutines complete, which means waiting until those pipes are closed, which in practice means waiting until C2 exits. This is confusing, because the user sees that C1 is done, and doesn't understand why their program is still waiting for it.

This confusion has been filed as an issue multiple times, at least #7378, #18874, #20730, #21922, #22485.

It doesn't have to work this way. Although the current goroutines call io.Copy, we could change them to use a loop. After every Read, the loop could check whether the process has been waited for. Then Wait could wait for the child, tell the goroutines to stop, give them a chance for a final write, and then return. The stdout/stderr goroutines would close their end of the pipe. There wouldn't be any race and there wouldn't be any unexpected waits. But in cases where there is a C2 process, not all of the standard output and standard error output that we currently collect would be available.

To be clear, programmers can already handle these cases however they like by calling os.Pipe themselves and using the pipe in the Cmd struct. That is true today and it would be true if change how it works.

The questions I want to raise are: would making this change be less confusing for people? Can we make this change without breaking the Go 1 guarantee?

CC @bradfitz

@ianlancetaylor ianlancetaylor added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Dec 6, 2017
@ianlancetaylor ianlancetaylor added this to the Go1.11 milestone Dec 6, 2017
@karkason
Copy link

This change may break existing code, but it does fix unspecified behavior in (*Cmd).Wait that is also confusing and unexpected and will definitely make the use of os.Cmd less confusing.

@ianlancetaylor
Copy link
Contributor Author

Another case: #24050.

@eliasnaur
Copy link
Contributor

FWIW, it took me quite a while to figure out that this issue was root cause for mobile tests to sometimes hang indefinitely. https://go-review.googlesource.com/#/c/42271/ is the workaround for the Android test harness.

@slrz
Copy link

slrz commented Mar 2, 2018

I actually prefer the semantics we have today. That way, it's at least apparent that something is wrong instead of silently leaving long-running processes behind.

The underlying issue of held-open pipe FDs is explained in any basic OS course and can be debugged using readily available Unix tools. I think the alternative will result in more complex failure cases.

@tsl0922
Copy link

tsl0922 commented Mar 3, 2018

@slrz It makes many people confusing that Wait do not return after the process exited. Although the document has been improved, but we still don't have a way to cancel the io.Copy in the internal goroutines unless using io.Pipe for the os.Cmd struct which will make code complex.

@ianlancetaylor
Copy link
Contributor Author

In order to handle the case where C2 doesn't output anything, we would have to put a deadline on reading from the pipe, so that while waiting for output we periodically check whether C1 has exited.

@tv42
Copy link

tv42 commented May 31, 2018

But in cases where there is a C2 process, not all of the standard output and standard error output that we currently collect would be available.

This sounds like a bug to me, and very much unexpected. I wouldn't ever expect Wait to be guaranteed to return until after I have consumed all the output; in this scenario, the output has not ended yet.

Also, I'm not sure if I understood the "give them a chance for a final write" part, but that sounds like there's a race that could also lose C1 output if C1 exits immediately after filling the pipe buffer; that's not guaranteed to come through in a single Read.

smola added a commit to smola/ci-tricks that referenced this issue Jun 1, 2018
A workaround for golang/go#23019
has been implemented to be able to execute tests on Go 1.10.
@kgadams
Copy link

kgadams commented Jun 4, 2018

Would it be possible to get back the go1.9 behavior with a flag?
I have a use case where I need to cancel C1 now and where I really don't care about C2 output and where C2 can be left hanging. I have no control over C2. That used to work with go 1.9, but with go 1.10 it insists on waiting for C2 (which I can't control).

@ianlancetaylor
Copy link
Contributor Author

@kgadams There was no intentional change in this area in 1.10 This issue is about a proposed change to os/exec; the change has not been implemented. If you are finding an unexpected change that you think is a bug, please open a new issue. Thanks.

@tv42
Copy link

tv42 commented Jun 6, 2018

@kgadams But you are interested in C1 output, and there is no way to tell which is which.

@kgadams
Copy link

kgadams commented Jun 7, 2018

My scenario is like this: we are writing an interpreter for a specific content. Other people write the content (I have no control over that). Part of that content is shell script snippets.
Now, during execution (which happens behind a UI), the end user can Cancel us at any time.

If at the point of Cancelation by the end user I am executing the shell (C1), and the shell is executing a long running command (C2), I do not care about C1 output nor about C2 output. All I care about is that the whole enchilada is killed and that the function executing C1 returns without waiting for C2.
I even can live with C2 continuing to run in the background.

Cancelation works by canceling the Context that is passed to exec.Cmd.

I'll provide an example later of a test that succeeded with go-1.9 but fails with go-1.10.

@kgadams
Copy link

kgadams commented Jun 7, 2018

Hi,
this is the example that demonstrates the problem.
We read from the StdoutPipe.
Context is canceled after one second.
With go 1.9 Cmd.Wait() returns immediately.
With go 1.10 it waits till C2 (sleep 5) is done.

If you can tell me a way to get the go 1.9 behavior with go 1.10, my problem would be solved.

package cmd_test

import (
	"bufio"
	"context"
	"os/exec"
	"strings"
	"syscall"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestCmd(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	cmd := exec.CommandContext(ctx, "sh", "-c", "echo hello world && sleep 5")
	assert.NotNil(t, cmd)

	stdoutPipe, err := cmd.StdoutPipe()
	assert.Nil(t, err)
	assert.NotNil(t, stdoutPipe)

	start := time.Now()

	err = cmd.Start()
	assert.Nil(t, err)

	var stdout string
	go func() {
		buf := bufio.NewReader(stdoutPipe)
		for {
			line, err := buf.ReadString('\n')
			if len(line) > 0 {
				stdout = stdout + line + "\n"
			}
			if err != nil {
				return
			}
		}
	}()

	err = cmd.Wait()
	d := time.Since(start)

	if err != nil {
		exiterr, ok := err.(*exec.ExitError)
		require.True(t, ok)
		status, ok := exiterr.Sys().(syscall.WaitStatus)
		require.True(t, ok)
		assert.NotEqual(t, 0, status.ExitStatus())
	}
	assert.True(t, strings.HasPrefix(stdout, "hello world"), "Stdout: %v", stdout)

	assert.True(t, d.Seconds() < 3, "Duration was %v", d)
}

@davecheney
Copy link
Contributor

davecheney commented Jun 8, 2018 via email

@kgadams
Copy link

kgadams commented Jun 8, 2018

Ok, thanks for pointing this out.
For me go test -race did not detect the problem. However moving stdout and the assert into the goroutine fixes the race (which was not germane to my problem: Wait() not returning in go 1.10, but in 1.9).

The go routine now looks like this:

go func() {
        var stdout string
        buf := bufio.NewReader(stdoutPipe)
        for {
            line, err := buf.ReadString('\n')
            if len(line) > 0 {
                stdout = stdout + line + "\n"
            }
            if err != nil {
                assert.True(t, strings.HasPrefix(stdout, "hello world"), "Stdout: %v", stdout)
                return
            }
        }
    }()

@kgadams
Copy link

kgadams commented Jun 8, 2018

Sorry, I did not yet figure out how to correctly insert code blocks here. Any pointers?

@ysmolski
Copy link
Member

ysmolski commented Jun 8, 2018

@crvv
Copy link
Contributor

crvv commented Jun 8, 2018

I guess you are talking about #24050, which is a change of go test.
Can you reproduce it without go test?

@kgadams
Copy link

kgadams commented Jun 8, 2018

Here is the reproduction as a main function with correct markdown :-)
THX @ysmolsky!

Result does not change: works with go 1.9, breaks with go 1.10.

package main

import (
	"bufio"
	"context"
	"fmt"
	"os"
	"os/exec"
	"strings"
	"syscall"
	"time"
)

func fatal(format string, args ...interface{}) {
	fmt.Println(fmt.Sprintf(format, args...))
	os.Exit(1)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	cmd := exec.CommandContext(ctx, "sh", "-c", "echo hello world && sleep 5")
	stdoutPipe, err := cmd.StdoutPipe()
	if err != nil {
		fatal("no pipe: %v", err)
	}
	start := time.Now()

	if err = cmd.Start(); err != nil {
		fatal("start failed: %v", err)
	}

	go func() {
		var stdout string
		buf := bufio.NewReader(stdoutPipe)
		for {
			line, err := buf.ReadString('\n')
			if len(line) > 0 {
				stdout = stdout + line + "\n"
			}
			if err != nil {
				if !strings.HasPrefix(stdout, "hello world") {
					fatal("wrong output: %q", stdout)
				}
				return
			}
		}
	}()

	err = cmd.Wait()
	d := time.Since(start)

	if err != nil {
		exiterr := err.(*exec.ExitError)
		status := exiterr.Sys().(syscall.WaitStatus)
		if status.ExitStatus() == 0 {
			fatal("wrong exit status: %v", status.ExitStatus())
		}
	}

	if d.Seconds() >= 3 {
		fatal("Cancelation took too long: %v", d)
	}
	fmt.Println("Success!")
}

@fraenkel
Copy link
Contributor

fraenkel commented Jun 8, 2018

I get Success! with 1.10.2 and tip on my Ubuntu laptop.

@rsc rsc changed the title os/exec: consider changing Wait to stop copying goroutines rather than waiting for them proposal: os/exec: consider changing Wait to stop copying goroutines rather than waiting for them Jun 11, 2018
rickystewart added a commit to cockroachdb/stress that referenced this issue Mar 8, 2022
The documentation for `exec.Cmd` says this about `Stdout` and `Stderr`:

    // If either is an *os.File, the corresponding output from the process
    // is connected directly to that file.
    //
    // Otherwise, during the execution of the command a separate goroutine
    // reads from the process over a pipe and delivers that data to the
    // corresponding Writer. In this case, Wait does not complete until the
    // goroutine reaches EOF or encounters an error.

When calling `CombinedOutput()`, `Stdout` and `Stderr` are
`bytes.Buffer`s and are therefore not `*os.File`s so they fall into
this second group. This resulted in a race condition where cancelling
the context when `maxtime` has passed could cause `CombinedOutput()` to
hang indefinitely waiting for the (finished) subprocess to "finish"
writing to its pipes.

This has been reported as an issue several times. The tracking issue is
golang/go#23019 which itself links to several
other issues that are duplicates.

To work around the issue we simply force the other behavior by creating
a temporary `*os.File` for the combined `stdout` and `stderr`.
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/400877 mentions this issue: cmd/go/internal/test: used lockedStdout instead of os.Stdout

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/400877 mentions this issue: cmd/go/internal/test: wrap os.Stdout always

gopherbot pushed a commit that referenced this issue Apr 21, 2022
There is an issue where 'go test' will hang after the tests complete if
a test starts a sub-process that does not exit (see #24050).

However, go test only exhibits that behavior when a package name is
explicitly passed as an argument. If 'go test' is invoked without any
package arguments then the package in the working directory is assumed,
however in that case (and only that case) os.Stdout is used as the test
process's cmd.Stdout, which does *not* cause 'go test' wait for the
sub-process to exit (see #23019).

This change wraps os.Stdout in an io.Writer struct in this case, hiding
the *os.File from the os/exec package, causing cmd.Wait to always wait
for the full output from the test process and any of its sub-processes.

In other words, this makes 'go test' exhibit the same behavior as
'go test .' (or 'go test ./...' and so on).

Update #23019
Update #24050

Change-Id: Ica09bf156f3b017f9a31aad91ed0f16a7837195b
Reviewed-on: https://go-review.googlesource.com/c/go/+/400877
Reviewed-by: Bryan Mills <bcmills@google.com>
Run-TryBot: Andrew Gerrand <adg@golang.org>
Auto-Submit: Andrew Gerrand <adg@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Andrew Gerrand <adg@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
lornasong added a commit to lornasong/terraform-exec that referenced this issue May 10, 2022
`TestContext_sleepTimeoutExpired` can occasionally hang when killing a command
that has Stdout or Stderr set to anything besides `nil` or `*os.File`.
golang/go#23019

Use workaround to read from StdoutPipe and StderrPipe rather than setting
Stdout / Stderr
radeksimko pushed a commit to hashicorp/terraform-exec that referenced this issue May 20, 2022
* Update TestContext_sleepTimeoutExpired to check for canceling within timeframe

Add a timeout to the test to ensure that the terraform apply cancels within
a reasonable time of the 5s timeout.

Currently, this test is not canceling the terraform apply as expected. In the
logs you can see that the test takes 1 min rather than ~5s:
```
    --- PASS: TestContext_sleepTimeoutExpired/sleep-0.12.31 (62.13s)
```

```
=== RUN   TestContext_sleepTimeoutExpired/sleep-0.12.31
    util_test.go:113: [INFO] running Terraform command: /var/folders/6y/gy9gggt14379c_k39vwb50lc0000gn/T/terraform_1378921380/terraform apply -no-color -auto-approve -input=false -lock=true -parallelism=10 -refresh=true
    util_test.go:103: CLI Output:
	    // truncated ...
        time_sleep.sleep: Creating...
        time_sleep.sleep: Still creating... [10s elapsed]
        time_sleep.sleep: Still creating... [20s elapsed]
        time_sleep.sleep: Still creating... [30s elapsed]
        time_sleep.sleep: Still creating... [41s elapsed]
        time_sleep.sleep: Still creating... [51s elapsed]
        time_sleep.sleep: Creation complete after 1m0s [id=2022-05-06T17:40:20Z]

        Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
```

* Remove runTerraformCmd check for cmd.ProcessState

Processes were not being killed because cmd.ProcessState was nil. With this
change, processes will be able to make the request to Kill().

Added a temporary log to printout cmd.ProcessState to demonstrate. Will be
removed in next commit.

Note: this will cause hanging `TestContext_sleepTimeoutExpired` due to a known
Golang issue with killing a command when Stdout or Stderr are set to anything
besides `nil` or `*os.File`. This is because the Kill does not notify the
stdout/stderr subprocesses to stop. `cmd.Wait` (called by `cmd.Run`) waits
indefinitely for those subprocesses to stop.

* Read logs from Stderr/out Pipe to avoid hanging

`TestContext_sleepTimeoutExpired` can occasionally hang when killing a command
that has Stdout or Stderr set to anything besides `nil` or `*os.File`.
golang/go#23019

Use workaround to read from StdoutPipe and StderrPipe rather than setting
Stdout / Stderr

* Test for runTerraformCmd leaked go-routine

Currently, when runTerraformCmd is called, it starts a go-routine to kill the
Terraform CLI on context.Done(). However, when the Terraform CLI completes and
runTerraformCmd() finishes, the go-routine continues running unnecessarily.

If the caller cancels the context down the line, this will stop the go-routine
and it will log the error: "error from kill: os: process already finished" because
the Terraform CLI has already finished.

Added a test for this in cmd_default.go and cmd_linux.go. Have not tried it in
linux yet though.

When running with the race detector:
```
==================
WARNING: DATA RACE
Read at 0x00c0002516c8 by goroutine 7:
  bytes.(*Buffer).String()
      /usr/local/go/src/bytes/buffer.go:65 +0x36a
  github.com/hashicorp/terraform-exec/tfexec.Test_runTerraformCmd_default()
      /Users/lornasong/go/src/github.com/hashicorp/terraform-exec/tfexec/cmd_default_test.go:35 +0x360
  testing.tRunner()
  // truncated ...

Previous write at 0x00c0002516c8 by goroutine 8:
  bytes.(*Buffer).grow()
      /usr/local/go/src/bytes/buffer.go:147 +0x3b1
  bytes.(*Buffer).Write()
      /usr/local/go/src/bytes/buffer.go:172 +0xcd
  log.(*Logger).Output()
      /usr/local/go/src/log/log.go:184 +0x466
  log.(*Logger).Printf()
      /usr/local/go/src/log/log.go:191 +0x6e
  github.com/hashicorp/terraform-exec/tfexec.(*Terraform).runTerraformCmd.func1()
      /Users/lornasong/go/src/github.com/hashicorp/terraform-exec/tfexec/cmd_default.go:24 +0x2a5
  // truncated ...
==================

* Use CommandContext to kill instead of manually doing it

* Fix EOF error check to use error.Is()

This also fixes a race condition caused by using the pointer to the io.EOF

* Update tests to use separate string.Builder-s for stdout and stderr

string.Builder is a non-comparable type which is not safe for concurrent use
when shared by Cmd.Stdout and Cmd.Stderr. Causes a race condition when accessing
the the builder when Cmd is running.

* Fixes to runTerraformCmd for race conditions

 - Use waitgroups for more readability
 - Improve handling errors from writeOutput
 - Finish reading from pipes before calling cmd.Wait - fixes a race condition
 that leads to an error :`read |0: file already closed`
 - Because now waiting for pipes to finish reading, need to update waitGroup to
 close buf.Read on context cancel. Otherwise buf.Read blocks until next line
 before stopping. Causes TestContext_sleepTimeoutExpired takes a little too long
 to cancel (~20s)

Co-authored-by: Kyle Carberry <kyle@carberry.com>
rstandt pushed a commit to rstandt/terraform-exec that referenced this issue Jun 21, 2022
* Update TestContext_sleepTimeoutExpired to check for canceling within timeframe

Add a timeout to the test to ensure that the terraform apply cancels within
a reasonable time of the 5s timeout.

Currently, this test is not canceling the terraform apply as expected. In the
logs you can see that the test takes 1 min rather than ~5s:
```
    --- PASS: TestContext_sleepTimeoutExpired/sleep-0.12.31 (62.13s)
```

```
=== RUN   TestContext_sleepTimeoutExpired/sleep-0.12.31
    util_test.go:113: [INFO] running Terraform command: /var/folders/6y/gy9gggt14379c_k39vwb50lc0000gn/T/terraform_1378921380/terraform apply -no-color -auto-approve -input=false -lock=true -parallelism=10 -refresh=true
    util_test.go:103: CLI Output:
	    // truncated ...
        time_sleep.sleep: Creating...
        time_sleep.sleep: Still creating... [10s elapsed]
        time_sleep.sleep: Still creating... [20s elapsed]
        time_sleep.sleep: Still creating... [30s elapsed]
        time_sleep.sleep: Still creating... [41s elapsed]
        time_sleep.sleep: Still creating... [51s elapsed]
        time_sleep.sleep: Creation complete after 1m0s [id=2022-05-06T17:40:20Z]

        Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
```

* Remove runTerraformCmd check for cmd.ProcessState

Processes were not being killed because cmd.ProcessState was nil. With this
change, processes will be able to make the request to Kill().

Added a temporary log to printout cmd.ProcessState to demonstrate. Will be
removed in next commit.

Note: this will cause hanging `TestContext_sleepTimeoutExpired` due to a known
Golang issue with killing a command when Stdout or Stderr are set to anything
besides `nil` or `*os.File`. This is because the Kill does not notify the
stdout/stderr subprocesses to stop. `cmd.Wait` (called by `cmd.Run`) waits
indefinitely for those subprocesses to stop.

* Read logs from Stderr/out Pipe to avoid hanging

`TestContext_sleepTimeoutExpired` can occasionally hang when killing a command
that has Stdout or Stderr set to anything besides `nil` or `*os.File`.
golang/go#23019

Use workaround to read from StdoutPipe and StderrPipe rather than setting
Stdout / Stderr

* Test for runTerraformCmd leaked go-routine

Currently, when runTerraformCmd is called, it starts a go-routine to kill the
Terraform CLI on context.Done(). However, when the Terraform CLI completes and
runTerraformCmd() finishes, the go-routine continues running unnecessarily.

If the caller cancels the context down the line, this will stop the go-routine
and it will log the error: "error from kill: os: process already finished" because
the Terraform CLI has already finished.

Added a test for this in cmd_default.go and cmd_linux.go. Have not tried it in
linux yet though.

When running with the race detector:
```
==================
WARNING: DATA RACE
Read at 0x00c0002516c8 by goroutine 7:
  bytes.(*Buffer).String()
      /usr/local/go/src/bytes/buffer.go:65 +0x36a
  github.com/hashicorp/terraform-exec/tfexec.Test_runTerraformCmd_default()
      /Users/lornasong/go/src/github.com/hashicorp/terraform-exec/tfexec/cmd_default_test.go:35 +0x360
  testing.tRunner()
  // truncated ...

Previous write at 0x00c0002516c8 by goroutine 8:
  bytes.(*Buffer).grow()
      /usr/local/go/src/bytes/buffer.go:147 +0x3b1
  bytes.(*Buffer).Write()
      /usr/local/go/src/bytes/buffer.go:172 +0xcd
  log.(*Logger).Output()
      /usr/local/go/src/log/log.go:184 +0x466
  log.(*Logger).Printf()
      /usr/local/go/src/log/log.go:191 +0x6e
  github.com/hashicorp/terraform-exec/tfexec.(*Terraform).runTerraformCmd.func1()
      /Users/lornasong/go/src/github.com/hashicorp/terraform-exec/tfexec/cmd_default.go:24 +0x2a5
  // truncated ...
==================

* Use CommandContext to kill instead of manually doing it

* Fix EOF error check to use error.Is()

This also fixes a race condition caused by using the pointer to the io.EOF

* Update tests to use separate string.Builder-s for stdout and stderr

string.Builder is a non-comparable type which is not safe for concurrent use
when shared by Cmd.Stdout and Cmd.Stderr. Causes a race condition when accessing
the the builder when Cmd is running.

* Fixes to runTerraformCmd for race conditions

 - Use waitgroups for more readability
 - Improve handling errors from writeOutput
 - Finish reading from pipes before calling cmd.Wait - fixes a race condition
 that leads to an error :`read |0: file already closed`
 - Because now waiting for pipes to finish reading, need to update waitGroup to
 close buf.Read on context cancel. Otherwise buf.Read blocks until next line
 before stopping. Causes TestContext_sleepTimeoutExpired takes a little too long
 to cancel (~20s)

Co-authored-by: Kyle Carberry <kyle@carberry.com>
@lixd
Copy link

lixd commented Jul 7, 2022

there are some workarounds to use before this proposal implement.

we can kill all child process when context is done.

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	cmd := exec.CommandContext(ctx, "bash","/root/sleep.sh")
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	doneCh := make(chan struct{}, 1)
	defer func() {
		doneCh <- struct{}{}
		close(doneCh)
	}()
	go func() {
		select {
		case <-ctx.Done():
			err := syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
			if err != nil {
				fmt.Printf("kill error   : [%v]\n", err)
				return
			}
			fmt.Printf("run command(%s) timeout,kill all child process\n", cmd.String())
		case <-doneCh:
		}
	}()
	output, err := cmd.CombinedOutput()
	if err != nil {
		fmt.Println(err)
		return
	}

	fmt.Println("output:", string(output))
	fmt.Printf("ctx.Err : [%v]\n", ctx.Err())
	fmt.Printf("error   : [%v]\n", err)
}

or set stdout、stderr use *os.File type to avoid bloking

in this case,when context is done,program will return and child process will run background.

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	cmd := exec.CommandContext(ctx, "bash", sh)
	combinedOutput, err := ioutil.TempFile("", "stdouterr")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer func() { _ = os.Remove(combinedOutput.Name()) }()
	cmd.Stdout = combinedOutput
	cmd.Stderr = combinedOutput
	err = cmd.Run()
	if err != nil {
		fmt.Println(err)
	}
	_, err = combinedOutput.Seek(0, 0)

	var b bytes.Buffer
	_, err = io.Copy(&b, combinedOutput)
	if err != nil {
		fmt.Println(err)
		return
	}
	err = combinedOutput.Close()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("output:", b.String())

	fmt.Printf("ctx.Err : [%v]\n", ctx.Err())
	fmt.Printf("error   : [%v]\n", err)
}

@ianlancetaylor
Copy link
Contributor Author

I'm retracting this proposal in favor of #50436.

@ianlancetaylor ianlancetaylor closed this as not planned Won't fix, can't repro, duplicate, stale Oct 10, 2022
minamijoyo added a commit to minamijoyo/tfmigrate that referenced this issue Mar 9, 2023
Starting from Terraform v1.4, launching terraform providers in the
acceptance test has been failing more frequently with a text file busy
error.

```
--- FAIL: TestAccMultiStateMigratorApplySimple (1.07s)
    multi_state_migrator_test.go:123: failed to run terraform init: failed to run command (exited 1): terraform init -input=false -no-color
        stdout:

        Initializing the backend...

        Successfully configured the backend "s3"! Terraform will automatically
        use this backend unless the backend configuration changes.

        Initializing provider plugins...
        - Finding latest version of hashicorp/null...
        - Installing hashicorp/null v3.2.1...

        stderr:

        Error: Failed to install provider

        Error while installing hashicorp/null v3.2.1: open
        /tmp/plugin-cache/registry.terraform.io/hashicorp/null/3.2.1/linux_amd64/terraform-provider-null_v3.2.1_x5:
        text file busy
```

After some investigation, I found Go's `os/exec.Cmd.Run()` does not wait
for the grandchild process to complete; from the point of view of
tfmigrate, the terraform command is the child process, and the provider
is the grandchild process.

golang/go#23019

If I understand correctly, this is not a Terraform issue and
theoretically should occur in versions older than v1.4; the changes in
v1.4 may have broken the balance of execution timing and made the test
very flaky. I experimented with inserting some sleep but could not get
the test to stabilize correctly. After trying various things, I found
that the test became stable by enabling the
`TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE` flag was introduced in
v1.4. This is an escape hatch to revert to the v1.3 equivalent of the
global cache behavior change in v1.4.
hashicorp/terraform#32726

This behavior change has already been addressed in the previous commit
using a local file system mirror, so activating this flag does not seem
to make any sense. Even though I have no other reasonable solutions now,
please let me know if anyone finds a better solution.
benhoyt added a commit to benhoyt/pebble that referenced this issue Aug 15, 2023
Add WaitDelay to ensure cmd.Wait() returns in a reasonable timeframe if
the goroutines that cmd.Start() uses to copy Stdin/Stdout/Stderr are
blocked when copying due to a sub-subprocess holding onto them. Read
more details in these issues:

- golang/go#23019
- golang/go#50436

This isn't the original intent of kill-delay, but it seems reasonable
to reuse it in this context.

Fixes canonical#149
benhoyt added a commit to canonical/pebble that referenced this issue Aug 18, 2023
…/out/err (#275)

Use os.exec's Cmd.WaitDelay to ensure cmd.Wait returns in a reasonable
timeframe if the goroutines that cmd.Start() uses to copy stdin/out/err
are blocked when copying due to a sub-subprocess holding onto them.
Read more details about the issue in
golang/go#23019 and the proposed solution
(that was added in Go 1.20) in
golang/go#50436.

This solves issue #149, where Patroni
wasn't restarting properly even after a `KILL` signal was sent to it.
I had originally mis-diagnosed this problem as an issue with Pebble
not tracking the process tree of processes that daemonise and change
their process group (which is still an issue, but is not causing this
problem). The Patroni process wasn't being marked as finished at all
due to being blocked on the `cmd.Wait()`. Patroni starts
sub-processes and "forwards" stdin/out/err, so the copy goroutines
block. Thankfully Go 1.20 introduced `WaitDelay` to allow you to
easily work around this exact problem.

The fix itself is [this one-liner]
(#275):

s.cmd.WaitDelay = s.killDelay() * 9 / 10  // 90% of kill-delay

This will really only be a problem for services, but we make the same
change for exec and exec health checks as it won't hurt there
either.

Also, as a drive-by, this PR also canonicalises some log messages: our
style is to start with an uppercase letter (for logs, not errors) and
to use "Cannot X" rather than "Error Xing".

Fixes #149.
@golang golang locked and limited conversation to collaborators Oct 10, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. FrozenDueToAge Proposal Proposal-Accepted
Projects
None yet
Development

No branches or pull requests