os/exec: consider changing Wait to stop copying goroutines rather than waiting for them #23019
This change may break existing code, but it does fix unspecified behavior in |
Another case: #24050. |
FWIW, it took me quite a while to figure out that this issue was root cause for mobile tests to sometimes hang indefinitely. https://go-review.googlesource.com/#/c/42271/ is the workaround for the Android test harness. |
I actually prefer the semantics we have today. That way, it's at least apparent that something is wrong instead of silently leaving long-running processes behind. The underlying issue of held-open pipe FDs is explained in any basic OS course and can be debugged using readily available Unix tools. I think the alternative will result in more complex failure cases. |
@slrz It confuses many people that |
In order to handle the case where C2 doesn't output anything, we would have to put a deadline on reading from the pipe, so that while waiting for output we periodically check whether C1 has exited. |
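The deadline idea above can be sketched roughly as follows. This is only an illustration, not what os/exec does: `copyUntilDone`, `demo`, the `done` channel and the poll interval are hypothetical names and values, and the sketch assumes a platform where `SetReadDeadline` works on pipes created by `os.Pipe`.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"os"
	"time"
)

// copyUntilDone is a hypothetical helper (not part of os/exec). It copies from
// r to w until EOF, until a real read error, or until done is closed. The read
// deadline wakes the loop every poll interval even if the writer is silent.
func copyUntilDone(w io.Writer, r *os.File, done <-chan struct{}, poll time.Duration) error {
	buf := make([]byte, 32*1024)
	for {
		if err := r.SetReadDeadline(time.Now().Add(poll)); err != nil {
			return err // this file does not support deadlines
		}
		n, err := r.Read(buf)
		if n > 0 {
			if _, werr := w.Write(buf[:n]); werr != nil {
				return werr
			}
		}
		if errors.Is(err, io.EOF) {
			return nil // every write end has been closed
		}
		if err != nil && !errors.Is(err, os.ErrDeadlineExceeded) {
			return err
		}
		// After every Read (successful or timed out), check whether C1 is gone.
		select {
		case <-done:
			return nil // stop although the pipe is still open
		default:
		}
	}
}

// demo stands in for the C1/C2 situation without starting any process: the
// write end stays open (as if C2 had inherited it) while done plays the role
// of "C1 has been waited for".
func demo() (string, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return "", err
	}
	defer r.Close()
	defer w.Close()

	if _, err := w.Write([]byte("hello\n")); err != nil {
		return "", err
	}
	done := make(chan struct{})
	go func() {
		time.Sleep(50 * time.Millisecond)
		close(done)
	}()

	var out bytes.Buffer
	err = copyUntilDone(&out, r, done, 10*time.Millisecond)
	return out.String(), err
}

func main() {
	out, err := demo()
	fmt.Printf("output=%q err=%v\n", out, err)
}
```

Note that checking `done` once after each `Read` can still drop data that is sitting in the pipe when the loop stops, so a real implementation would need a more careful final drain.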
This sounds like a bug to me, and very much unexpected. I wouldn't ever expect Wait to be guaranteed to return until after I have consumed all the output; in this scenario, the output has not ended yet. Also, I'm not sure if I understood the "give them a chance for a final write" part, but that sounds like there's a race that could also lose C1 output if C1 exits immediately after filling the pipe buffer; that's not guaranteed to come through in a single Read. |
A workaround for golang/go#23019 has been implemented to be able to execute tests on Go 1.10.
Would it be possible to get back the go1.9 behavior with a flag? |
@kgadams There was no intentional change in this area in 1.10. This issue is about a proposed change to os/exec; the change has not been implemented. If you are finding an unexpected change that you think is a bug, please open a new issue. Thanks. |
@kgadams But you are interested in C1 output, and there is no way to tell which is which. |
My scenario is like this: we are writing an interpreter for a specific content. Other people write the content (I have no control over that). Part of that content is shell script snippets. If at the point of Cancelation by the end user I am executing the shell (C1), and the shell is executing a long running command (C2), I do not care about C1 output nor about C2 output. All I care about is that the whole enchilada is killed and that the function executing C1 returns without waiting for C2. Cancelation works by canceling the Context that is passed to exec.Cmd. I'll provide an example later of a test that succeeded with go-1.9 but fails with go-1.10. |
Hi, If you can tell me a way to get the go 1.9 behavior with go 1.10, my problem would be solved.

package cmd_test

import (
	"bufio"
	"context"
	"os/exec"
	"strings"
	"syscall"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestCmd(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	cmd := exec.CommandContext(ctx, "sh", "-c", "echo hello world && sleep 5")
	assert.NotNil(t, cmd)
	stdoutPipe, err := cmd.StdoutPipe()
	assert.Nil(t, err)
	assert.NotNil(t, stdoutPipe)
	start := time.Now()
	err = cmd.Start()
	assert.Nil(t, err)
	var stdout string
	go func() {
		buf := bufio.NewReader(stdoutPipe)
		for {
			line, err := buf.ReadString('\n')
			if len(line) > 0 {
				stdout = stdout + line + "\n"
			}
			if err != nil {
				return
			}
		}
	}()
	err = cmd.Wait()
	d := time.Since(start)
	if err != nil {
		exiterr, ok := err.(*exec.ExitError)
		require.True(t, ok)
		status, ok := exiterr.Sys().(syscall.WaitStatus)
		require.True(t, ok)
		assert.NotEqual(t, 0, status.ExitStatus())
	}
	assert.True(t, strings.HasPrefix(stdout, "hello world"), "Stdout: %v", stdout)
	assert.True(t, d.Seconds() < 3, "Duration was %v", d)
}
|
I’m sorry but your program has a data race assigning to the stdout variable which is shared between goroutines.
Please run your program with -race to find out more.
… On 7 Jun 2018, at 17:57, kgadams ***@***.***> wrote:
Hi,
this is the example that demonstrates the problem.
We read from the StdoutPipe.
Context is canceled after one second.
With go 1.9 Cmd.Wait() returns immediately.
With go 1.10 it waits till C2 (sleep 5) is done.
If you can tell me a way to get the go 1.9 behavior with go 1.10, my problem would be solved.
|
Ok, thanks for pointing this out. The goroutine now looks like this:

go func() {
	var stdout string
	buf := bufio.NewReader(stdoutPipe)
	for {
		line, err := buf.ReadString('\n')
		if len(line) > 0 {
			stdout = stdout + line + "\n"
		}
		if err != nil {
			assert.True(t, strings.HasPrefix(stdout, "hello world"), "Stdout: %v", stdout)
			return
		}
	}
}()
|
Sorry, I did not yet figure out how to correctly insert code blocks here. Any pointers? |
I guess you are talking about #24050, which is a change of |
Here is the reproduction as a main function with correct markdown :-) Result does not change: works with go 1.9, breaks with go 1.10.

package main

import (
	"bufio"
	"context"
	"fmt"
	"os"
	"os/exec"
	"strings"
	"syscall"
	"time"
)

func fatal(format string, args ...interface{}) {
	fmt.Println(fmt.Sprintf(format, args...))
	os.Exit(1)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	cmd := exec.CommandContext(ctx, "sh", "-c", "echo hello world && sleep 5")
	stdoutPipe, err := cmd.StdoutPipe()
	if err != nil {
		fatal("no pipe: %v", err)
	}
	start := time.Now()
	if err = cmd.Start(); err != nil {
		fatal("start failed: %v", err)
	}
	go func() {
		var stdout string
		buf := bufio.NewReader(stdoutPipe)
		for {
			line, err := buf.ReadString('\n')
			if len(line) > 0 {
				stdout = stdout + line + "\n"
			}
			if err != nil {
				if !strings.HasPrefix(stdout, "hello world") {
					fatal("wrong output: %q", stdout)
				}
				return
			}
		}
	}()
	err = cmd.Wait()
	d := time.Since(start)
	if err != nil {
		exiterr := err.(*exec.ExitError)
		status := exiterr.Sys().(syscall.WaitStatus)
		if status.ExitStatus() == 0 {
			fatal("wrong exit status: %v", status.ExitStatus())
		}
	}
	if d.Seconds() >= 3 {
		fatal("Cancelation took too long: %v", d)
	}
	fmt.Println("Success!")
}
|
I get Success! with 1.10.2 and tip on my Ubuntu laptop. |
The documentation for `exec.Cmd` says this about `Stdout` and `Stderr`:

// If either is an *os.File, the corresponding output from the process
// is connected directly to that file.
//
// Otherwise, during the execution of the command a separate goroutine
// reads from the process over a pipe and delivers that data to the
// corresponding Writer. In this case, Wait does not complete until the
// goroutine reaches EOF or encounters an error.

When calling `CombinedOutput()`, `Stdout` and `Stderr` are `bytes.Buffer`s and are therefore not `*os.File`s so they fall into this second group. This resulted in a race condition where cancelling the context when `maxtime` has passed could cause `CombinedOutput()` to hang indefinitely waiting for the (finished) subprocess to "finish" writing to its pipes.

This has been reported as an issue several times. The tracking issue is golang/go#23019 which itself links to several other issues that are duplicates.

To work around the issue we simply force the other behavior by creating a temporary `*os.File` for the combined `stdout` and `stderr`.
Change https://go.dev/cl/400877 mentions this issue: |
There is an issue where 'go test' will hang after the tests complete if a test starts a sub-process that does not exit (see #24050).

However, go test only exhibits that behavior when a package name is explicitly passed as an argument. If 'go test' is invoked without any package arguments then the package in the working directory is assumed, however in that case (and only that case) os.Stdout is used as the test process's cmd.Stdout, which does *not* cause 'go test' to wait for the sub-process to exit (see #23019).

This change wraps os.Stdout in an io.Writer struct in this case, hiding the *os.File from the os/exec package, causing cmd.Wait to always wait for the full output from the test process and any of its sub-processes.

In other words, this makes 'go test' exhibit the same behavior as 'go test .' (or 'go test ./...' and so on).

Update #23019
Update #24050

Change-Id: Ica09bf156f3b017f9a31aad91ed0f16a7837195b
Reviewed-on: https://go-review.googlesource.com/c/go/+/400877
Reviewed-by: Bryan Mills <bcmills@google.com>
Run-TryBot: Andrew Gerrand <adg@golang.org>
Auto-Submit: Andrew Gerrand <adg@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Andrew Gerrand <adg@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
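The wrapping trick described in that commit message can be illustrated with a small sketch. It is not the code from the CL: `onlyWriter`, `isOSFile` and `check` are hypothetical names chosen here, and the type assertion only mimics the kind of `*os.File` check the os/exec documentation describes.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// onlyWriter exposes nothing but the Write method of the wrapped value, so a
// type assertion to *os.File on the wrapper fails.
type onlyWriter struct{ io.Writer }

// isOSFile reports whether w would be recognised as an *os.File.
func isOSFile(w io.Writer) bool {
	_, ok := w.(*os.File)
	return ok
}

// check returns the result for os.Stdout itself and for the wrapped os.Stdout.
func check() (direct, hidden bool) {
	return isOSFile(os.Stdout), isOSFile(onlyWriter{os.Stdout})
}

func main() {
	direct, hidden := check()
	fmt.Println(direct, hidden) // prints "true false"
}
```

With the wrapper in place, a `Cmd` whose `Stdout` is set to it would take the pipe-and-goroutine path, which is the behavior the commit wants 'go test' to have in every case.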
`TestContext_sleepTimeoutExpired` can occasionally hang when killing a command that has Stdout or Stderr set to anything besides `nil` or `*os.File`. golang/go#23019 Use workaround to read from StdoutPipe and StderrPipe rather than setting Stdout / Stderr
* Update TestContext_sleepTimeoutExpired to check for canceling within timeframe

  Add a timeout to the test to ensure that the terraform apply cancels within a reasonable time of the 5s timeout. Currently, this test is not canceling the terraform apply as expected. In the logs you can see that the test takes 1 min rather than ~5s:

  ```
  --- PASS: TestContext_sleepTimeoutExpired/sleep-0.12.31 (62.13s)
  ```

  ```
  === RUN TestContext_sleepTimeoutExpired/sleep-0.12.31
  util_test.go:113: [INFO] running Terraform command: /var/folders/6y/gy9gggt14379c_k39vwb50lc0000gn/T/terraform_1378921380/terraform apply -no-color -auto-approve -input=false -lock=true -parallelism=10 -refresh=true
  util_test.go:103: CLI Output:
  // truncated ...
  time_sleep.sleep: Creating...
  time_sleep.sleep: Still creating... [10s elapsed]
  time_sleep.sleep: Still creating... [20s elapsed]
  time_sleep.sleep: Still creating... [30s elapsed]
  time_sleep.sleep: Still creating... [41s elapsed]
  time_sleep.sleep: Still creating... [51s elapsed]
  time_sleep.sleep: Creation complete after 1m0s [id=2022-05-06T17:40:20Z]
  Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
  ```

* Remove runTerraformCmd check for cmd.ProcessState

  Processes were not being killed because cmd.ProcessState was nil. With this change, processes will be able to make the request to Kill(). Added a temporary log to print out cmd.ProcessState to demonstrate. Will be removed in next commit.

  Note: this will cause hanging `TestContext_sleepTimeoutExpired` due to a known Golang issue with killing a command when Stdout or Stderr are set to anything besides `nil` or `*os.File`. This is because the Kill does not notify the stdout/stderr subprocesses to stop. `cmd.Wait` (called by `cmd.Run`) waits indefinitely for those subprocesses to stop.

* Read logs from Stderr/out Pipe to avoid hanging

  `TestContext_sleepTimeoutExpired` can occasionally hang when killing a command that has Stdout or Stderr set to anything besides `nil` or `*os.File`. golang/go#23019

  Use workaround to read from StdoutPipe and StderrPipe rather than setting Stdout / Stderr

* Test for runTerraformCmd leaked go-routine

  Currently, when runTerraformCmd is called, it starts a go-routine to kill the Terraform CLI on context.Done(). However, when the Terraform CLI completes and runTerraformCmd() finishes, the go-routine continues running unnecessarily. If the caller cancels the context down the line, this will stop the go-routine and it will log the error: "error from kill: os: process already finished" because the Terraform CLI has already finished.

  Added a test for this in cmd_default.go and cmd_linux.go. Have not tried it in linux yet though.

  When running with the race detector:

  ```
  ==================
  WARNING: DATA RACE
  Read at 0x00c0002516c8 by goroutine 7:
    bytes.(*Buffer).String()
        /usr/local/go/src/bytes/buffer.go:65 +0x36a
    github.com/hashicorp/terraform-exec/tfexec.Test_runTerraformCmd_default()
        /Users/lornasong/go/src/github.com/hashicorp/terraform-exec/tfexec/cmd_default_test.go:35 +0x360
    testing.tRunner()
    // truncated ...
  Previous write at 0x00c0002516c8 by goroutine 8:
    bytes.(*Buffer).grow()
        /usr/local/go/src/bytes/buffer.go:147 +0x3b1
    bytes.(*Buffer).Write()
        /usr/local/go/src/bytes/buffer.go:172 +0xcd
    log.(*Logger).Output()
        /usr/local/go/src/log/log.go:184 +0x466
    log.(*Logger).Printf()
        /usr/local/go/src/log/log.go:191 +0x6e
    github.com/hashicorp/terraform-exec/tfexec.(*Terraform).runTerraformCmd.func1()
        /Users/lornasong/go/src/github.com/hashicorp/terraform-exec/tfexec/cmd_default.go:24 +0x2a5
    // truncated ...
  ==================
  ```

* Use CommandContext to kill instead of manually doing it

* Fix EOF error check to use errors.Is()

  This also fixes a race condition caused by using the pointer to the io.EOF

* Update tests to use separate strings.Builder-s for stdout and stderr

  strings.Builder is a non-comparable type which is not safe for concurrent use when shared by Cmd.Stdout and Cmd.Stderr. This causes a race condition when accessing the builder while Cmd is running.

* Fixes to runTerraformCmd for race conditions

  - Use waitgroups for more readability
  - Improve handling errors from writeOutput
  - Finish reading from pipes before calling cmd.Wait - fixes a race condition that leads to an error: `read |0: file already closed`
  - Because now waiting for pipes to finish reading, need to update waitGroup to close buf.Read on context cancel. Otherwise buf.Read blocks until next line before stopping. This causes TestContext_sleepTimeoutExpired to take a little too long to cancel (~20s)

Co-authored-by: Kyle Carberry <kyle@carberry.com>
There are some workarounds to use until this proposal is implemented. We can kill all child processes when the context is done:

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	cmd := exec.CommandContext(ctx, "bash", "/root/sleep.sh")
	// Put the command in its own process group so the whole group can be signalled.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	doneCh := make(chan struct{}, 1)
	defer func() {
		doneCh <- struct{}{}
		close(doneCh)
	}()
	go func() {
		select {
		case <-ctx.Done():
			// A negative PID addresses every process in the group.
			err := syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
			if err != nil {
				fmt.Printf("kill error : [%v]\n", err)
				return
			}
			fmt.Printf("run command(%s) timeout,kill all child process\n", cmd.String())
		case <-doneCh:
		}
	}()
	output, err := cmd.CombinedOutput()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("output:", string(output))
	fmt.Printf("ctx.Err : [%v]\n", ctx.Err())
	fmt.Printf("error : [%v]\n", err)
}

Or set stdout and stderr to an *os.File to avoid blocking:

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	// sh is the path of the script to run; it is not defined in this snippet.
	cmd := exec.CommandContext(ctx, "bash", sh)
	combinedOutput, err := ioutil.TempFile("", "stdouterr")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer func() { _ = os.Remove(combinedOutput.Name()) }()
	// An *os.File is handed to the child directly, so Wait has no copying goroutine to wait for.
	cmd.Stdout = combinedOutput
	cmd.Stderr = combinedOutput
	err = cmd.Run()
	if err != nil {
		fmt.Println(err)
	}
	// Rewind the file before reading back what the command wrote.
	if _, err = combinedOutput.Seek(0, 0); err != nil {
		fmt.Println(err)
		return
	}
	var b bytes.Buffer
	_, err = io.Copy(&b, combinedOutput)
	if err != nil {
		fmt.Println(err)
		return
	}
	err = combinedOutput.Close()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("output:", b.String())
	fmt.Printf("ctx.Err : [%v]\n", ctx.Err())
	fmt.Printf("error : [%v]\n", err)
}
|
I'm retracting this proposal in favor of #50436. |
Starting from Terraform v1.4, launching terraform providers in the acceptance test has been failing more frequently with a text file busy error.

```
--- FAIL: TestAccMultiStateMigratorApplySimple (1.07s)
multi_state_migrator_test.go:123: failed to run terraform init: failed to run command (exited 1): terraform init -input=false -no-color
stdout:
Initializing the backend...
Successfully configured the backend "s3"! Terraform will automatically use this backend unless the backend configuration changes.
Initializing provider plugins...
- Finding latest version of hashicorp/null...
- Installing hashicorp/null v3.2.1...
stderr:
Error: Failed to install provider
Error while installing hashicorp/null v3.2.1: open /tmp/plugin-cache/registry.terraform.io/hashicorp/null/3.2.1/linux_amd64/terraform-provider-null_v3.2.1_x5: text file busy
```

After some investigation, I found Go's `os/exec.Cmd.Run()` does not wait for the grandchild process to complete; from the point of view of tfmigrate, the terraform command is the child process, and the provider is the grandchild process. golang/go#23019

If I understand correctly, this is not a Terraform issue and theoretically should occur in versions older than v1.4; the changes in v1.4 may have broken the balance of execution timing and made the test very flaky.

I experimented with inserting some sleep but could not get the test to stabilize correctly. After trying various things, I found that the test became stable by enabling the `TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE` flag introduced in v1.4. This is an escape hatch to revert to the v1.3 equivalent of the global cache behavior change in v1.4. hashicorp/terraform#32726

This behavior change has already been addressed in the previous commit using a local file system mirror, so activating this flag does not seem to make any sense. Even though I have no other reasonable solutions now, please let me know if anyone finds a better solution.
Add WaitDelay to ensure cmd.Wait() returns in a reasonable timeframe if the goroutines that cmd.Start() uses to copy Stdin/Stdout/Stderr are blocked when copying due to a sub-subprocess holding onto them. Read more details in these issues: - golang/go#23019 - golang/go#50436 This isn't the original intent of kill-delay, but it seems reasonable to reuse it in this context. Fixes canonical#149
…/out/err (#275)

Use os/exec's Cmd.WaitDelay to ensure cmd.Wait returns in a reasonable timeframe if the goroutines that cmd.Start() uses to copy stdin/out/err are blocked when copying due to a sub-subprocess holding onto them. Read more details about the issue in golang/go#23019 and the proposed solution (that was added in Go 1.20) in golang/go#50436.

This solves issue #149, where Patroni wasn't restarting properly even after a `KILL` signal was sent to it. I had originally mis-diagnosed this problem as an issue with Pebble not tracking the process tree of processes that daemonise and change their process group (which is still an issue, but is not causing this problem).

The Patroni process wasn't being marked as finished at all due to being blocked on the `cmd.Wait()`. Patroni starts sub-processes and "forwards" stdin/out/err, so the copy goroutines block. Thankfully Go 1.20 introduced `WaitDelay` to allow you to easily work around this exact problem.

The fix itself is [this one-liner] (#275):

s.cmd.WaitDelay = s.killDelay() * 9 / 10 // 90% of kill-delay

This will really only be a problem for services, but we make the same change for exec and exec health checks as it won't hurt there either.

Also, as a drive-by, this PR also canonicalises some log messages: our style is to start with an uppercase letter (for logs, not errors) and to use "Cannot X" rather than "Error Xing".

Fixes #149.
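For readers who have not used `WaitDelay`, here is a minimal sketch of the effect, assuming Go 1.20 or later and a POSIX `sh` on the PATH. The shell command, the 200ms delay and the function name `runWithWaitDelay` are illustrative choices, not taken from the commit above.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// runWithWaitDelay starts a shell (C1) that prints a line, leaves a background
// sleep (C2) holding its stdout, and exits at once.
func runWithWaitDelay() (out string, hitDelay bool, elapsed time.Duration) {
	var buf bytes.Buffer
	cmd := exec.Command("sh", "-c", "echo hello; sleep 2 &")
	cmd.Stdout = &buf                      // not an *os.File, so os/exec copies through a pipe
	cmd.WaitDelay = 200 * time.Millisecond // bound the wait for the copying goroutine

	start := time.Now()
	err := cmd.Run()
	elapsed = time.Since(start)

	// exec.ErrWaitDelay reports that the process exited successfully but its
	// output pipes were still open when WaitDelay expired.
	return buf.String(), errors.Is(err, exec.ErrWaitDelay), elapsed
}

func main() {
	out, hitDelay, elapsed := runWithWaitDelay()
	fmt.Printf("output=%q hitDelay=%v elapsed=%v\n", out, hitDelay, elapsed)
}
```

Without the `WaitDelay` line, `Run` would be expected to block until the background `sleep` exits and releases the pipe. With it, the output written before the delay expired is kept, and anything C2 writes afterwards is lost.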
When the `Stdin`, `Stdout`, or `Stderr` fields of `os/exec.Cmd` are set to anything other than `nil` or a `*os.File` (a common case is `*bytes.Buffer`), we call `os.Pipe` to get a pipe and create goroutines to copy data in or out. The `(*Cmd).Wait` method first waits for the subprocess to exit, then waits for those goroutines to finish copying data.

If the subprocess C1 itself starts a subsubprocess C2, and if C1 passes any of its stdin/stdout/stderr descriptors to C2, and if C1 exits without waiting for C2 to exit, then C2 will hold an open end of the pipes created by the os/exec package. The `(*Cmd).Wait` method will wait until the goroutines complete, which means waiting until those pipes are closed, which in practice means waiting until C2 exits. This is confusing, because the user sees that C1 is done, and doesn't understand why their program is still waiting for it.

This confusion has been filed as an issue multiple times, at least #7378, #18874, #20730, #21922, #22485.

It doesn't have to work this way. Although the current goroutines call `io.Copy`, we could change them to use a loop. After every `Read`, the loop could check whether the process has been waited for. Then `Wait` could wait for the child, tell the goroutines to stop, give them a chance for a final write, and then return. The stdout/stderr goroutines would close their end of the pipe. There wouldn't be any race and there wouldn't be any unexpected waits. But in cases where there is a C2 process, not all of the standard output and standard error output that we currently collect would be available.

To be clear, programmers can already handle these cases however they like by calling `os.Pipe` themselves and using the pipe in the `Cmd` struct. That is true today and it would be true if we change how it works.

The questions I want to raise are: would making this change be less confusing for people? Can we make this change without breaking the Go 1 guarantee?
CC @bradfitz
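The "call `os.Pipe` yourself" option mentioned in the issue text can be sketched as follows. This is one possible arrangement under stated assumptions, not a recommended or official pattern: `runWithOwnPipe`, the shell command and the 200ms grace period are hypothetical, and a POSIX `sh` is assumed to be available.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"os/exec"
	"time"
)

// runWithOwnPipe gives the child the write end of a pipe created by the caller.
// Because cmd.Stdout is an *os.File, os/exec starts no copying goroutine and
// Wait returns as soon as C1 (the shell) exits.
func runWithOwnPipe() (string, error) {
	pr, pw, err := os.Pipe()
	if err != nil {
		return "", err
	}
	cmd := exec.Command("sh", "-c", "echo hello; sleep 2 &")
	cmd.Stdout = pw
	if err := cmd.Start(); err != nil {
		pr.Close()
		pw.Close()
		return "", err
	}
	pw.Close() // the child has its own copy; drop the parent's write end

	var buf bytes.Buffer
	copied := make(chan struct{})
	go func() {
		defer close(copied)
		io.Copy(&buf, pr) // ends at EOF or when pr is closed below
	}()

	waitErr := cmd.Wait() // does not wait for C2 (the background sleep)

	// The caller decides how long to keep collecting output after C1 exits.
	select {
	case <-copied:
	case <-time.After(200 * time.Millisecond):
	}
	pr.Close() // unblocks the reader if C2 still holds the write end
	<-copied
	return buf.String(), waitErr
}

func main() {
	out, err := runWithOwnPipe()
	fmt.Printf("output=%q err=%v\n", out, err)
}
```

The design choice here is that the policy (how long to wait, what to do with late output) sits entirely in the caller, which is the flexibility the issue text points out is already available today.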