Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't call T.FatalX inside goroutines in tests #2151

Closed
lrettig opened this issue Sep 25, 2020 · 5 comments
Closed

Don't call T.FatalX inside goroutines in tests #2151

lrettig opened this issue Sep 25, 2020 · 5 comments

Comments

@lrettig
Copy link
Member

lrettig commented Sep 25, 2020

Description

Many of the tests in https://github.com/spacemeshos/go-spacemesh/blob/develop/cmd/node/node_test.go use goroutines to run and interact with a process concurrently. They have require statements inside these goroutines, which is a no-no because when a require statement fails inside a goroutine it does not stop the test, in some cases causing tests to hang indefinitely. Here's one such example:

go func() {
// This makes sure the test doesn't end until this goroutine closes
defer wg.Done()
str, err := testArgs(app,
"--grpc-port-new", strconv.Itoa(port),
"--grpc", "node",
// the following prevents obnoxious warning in macOS
"--acquire-port=false",
"--tcp-interface", "127.0.0.1",
"--grpc-interface-new", "localhost",
)
require.Empty(t, str)
require.NoError(t, err)
}()

Which is causing this: https://travis-ci.org/github/spacemeshos/go-spacemesh/jobs/730399257

Steps To Reproduce

  1. Insert a require.FailNow() inside one of these goroutines
  2. Watch it break everything

Expected Behavior

The test should fail completely and cleanly when one of these test statements fails

Environment

Please complete the following information:

Additional Resources

For context see golang/go#15758

How to fix

Unclear. These tests do some messy concurrency. Either the tests need to be rewritten to add a failChannel and be smart enough to stop when a test fails (for inspiration see golang/mock#346 (comment)), or else we could just use panic() now as a workaround.

lrettig added a commit that referenced this issue Sep 25, 2020
Update command line flags. This should stop tests from running forever,
but the underlying issue is unaddressed, see
#2151.
@lrettig
Copy link
Member Author

lrettig commented Oct 6, 2020

Here's some sample code that does this in a relatively canonical way, for reference:

// Scan and print the poet output, and catch errors early
l := lg.New(os.Stderr, "[poet]\t", 0)
failChan := make(chan struct{})
go func() {
scanner := bufio.NewScanner(io.MultiReader(poetHarness.Stdout, poetHarness.Stderr))
for scanner.Scan() {
line := scanner.Text()
matched, err := regexp.MatchString(`\bERROR\b`, line)
require.NoError(suite.T(), err)
// Fail fast if we encounter a poet error
// Must use a channel since we're running inside a goroutine
if matched {
close(failChan)
suite.T().Fatalf("got error from poet: %s", line)
}
l.Println(line)
}
}()
rolacle := eligibility.New()
rng := BLS381.DefaultSeed()
gTime, err := time.Parse(time.RFC3339, genesisTime)
if err != nil {
log.Error("cannot parse genesis time %v", err)
}
ld := time.Duration(20) * time.Second
clock := timesync.NewClock(timesync.RealClock{}, ld, gTime, log.NewDefault("clock"))
suite.initMultipleInstances(cfg, rolacle, rng, 5, path, genesisTime, poetHarness.HTTPPoetClient, clock, net)
defer GracefulShutdown(suite.apps)
for _, a := range suite.apps {
a.startServices()
}
ActivateGrpcServer(suite.apps[0])
if err := poetHarness.Start([]string{"127.0.0.1:9092"}); err != nil {
suite.T().Fatalf("failed to start poet server: %v", err)
}
timeout := time.After(6 * time.Minute)
// Run setup first. We need to allow this to timeout, and monitor the failure channel too,
// as this can also loop forever.
doneChan := make(chan struct{})
go func() {
setupTests(suite)
close(doneChan)
}()
loopSetup:
for {
select {
case <-doneChan:
break loopSetup
case <-timeout:
suite.T().Fatal("timed out")
case <-failChan:
suite.T().Fatal("error from poet harness")
}
}

@Amirhossein2000
Copy link

Hi
May I do this issue?
I know how to do this job and I think using Context.Context is a better idea.

@lrettig
Copy link
Member Author

lrettig commented Oct 7, 2020

Hi @Amirhossein2000. That's an interesting idea, and one I hadn't previously considered. My understanding of golang's context object is that it's used to propagate cancel signals down to sub-processes, not the other way, up to the top-level context (or throughout the context tree). For this reason I'm not sure it's the right solution here. We need a way for a goroutine to send an immediate cancel/failure signal up to its parent. What do you think?

@Amirhossein2000
Copy link

Amirhossein2000 commented Oct 7, 2020

I changed your example and this is going to be like this but we can't separate timeout and fail errors now.
if these separate logs are not important this way is a better way.

 	// Scan and print the poet output, and catch errors early 
 	l := lg.New(os.Stderr, "[poet]\t", 0) 
 	ctx, cancel := context.WithTimeout(context.Background(), time.minute * 6)
 	go func(cancelFunc context.Cancel) { 
 		scanner := bufio.NewScanner(io.MultiReader(poetHarness.Stdout, poetHarness.Stderr)) 
 		for scanner.Scan() { 
 			line := scanner.Text() 
 			matched, err := regexp.MatchString(`\bERROR\b`, line) 
 			require.NoError(suite.T(), err) 
 			// Fail fast if we encounter a poet error 
 			// Must use a channel since we're running inside a goroutine 
 			if matched { 
 				cancelFunc()
 				suite.T().Fatalf("got error from poet: %s", line) 
 			} 
 			l.Println(line) 
 		} 
 	}() 
  
 	rolacle := eligibility.New() 
 	rng := BLS381.DefaultSeed() 
  
 	gTime, err := time.Parse(time.RFC3339, genesisTime) 
 	if err != nil { 
 		log.Error("cannot parse genesis time %v", err) 
 	} 
 	ld := time.Duration(20) * time.Second 
 	clock := timesync.NewClock(timesync.RealClock{}, ld, gTime, log.NewDefault("clock")) 
 	suite.initMultipleInstances(cfg, rolacle, rng, 5, path, genesisTime, poetHarness.HTTPPoetClient, clock, net) 
 	defer GracefulShutdown(suite.apps) 
 	for _, a := range suite.apps { 
 		a.startServices() 
 	} 
  
 	ActivateGrpcServer(suite.apps[0]) 
  
 	if err := poetHarness.Start([]string{"127.0.0.1:9092"}); err != nil { 
 		suite.T().Fatalf("failed to start poet server: %v", err) 
 	} 
  
 	timeout := time.After(6 * time.Minute) 
  
 	// Run setup first. We need to allow this to timeout, and monitor the failure channel too, 
 	// as this can also loop forever. 
 	doneChan := make(chan struct{}) 
 	go func() { 
 		setupTests(suite) 
 		close(doneChan) 
 	}() 
  
 loopSetup: 
 	for { 
 		select { 
 		case <-doneChan: 
 			break loopSetup 
 		case <-ctx.Done(): 
 			suite.T().Fatal("Test Failed")
 		} 
 	} 

@lrettig
Copy link
Member Author

lrettig commented Oct 8, 2020

@Amirhossein2000 thanks for putting together this example, it's helpful. I think it could probably be made to work, but it seems like it's not idiomatic go. It seems like it's using context in a way that it's not intended to be used. Here are the sources I'm referring to:

The first of these explains how Context is supposed to be used, and the second contains a wealth of information on how channels can be used to propagate cancel signals.

Let me know if you disagree!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants