fix pump storage quit bug #739

Merged
merged 3 commits into pingcap:release-2.1 from czli/fixPumpStorageQuit on Sep 5, 2019

Conversation

@lichunzhu (Contributor) commented Sep 3, 2019

What problem does this PR solve?

A child PR of #735.
When we close the pump server, we cancel s.ctx first. However, we want storage to keep pulling the stored binlogs until drainer has received them, so that pump can quit safely. If we use s.ctx for storage.PullCommitBinlog, the pull quits immediately and drainer may never receive the remaining binlogs from this pump.

What is changed and how it works?

Change s.ctx to context.Background() and add an s.pullClose channel to control when PullBinlogs stops. Call close(s.pullClose) after commitStatus.
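
A minimal sketch of that pattern (illustrative only, not the actual diff: pullClose and PullCommitBinlog come from the description above, while the server shape, pullBinlogs and send are assumptions):

package pumpsketch

import "context"

// pullStorage is a stand-in for pump/storage: PullCommitBinlog streams
// committed binlogs greater than `last` until its ctx is canceled.
type pullStorage interface {
	PullCommitBinlog(ctx context.Context, last int64) <-chan []byte
}

type server struct {
	ctx       context.Context // canceled early in Close
	cancel    context.CancelFunc
	storage   pullStorage
	pullClose chan struct{} // closed only after commitStatus
}

// pullBinlogs streams binlogs to drainer. It starts from context.Background()
// instead of s.ctx, so canceling s.ctx no longer stops the stream; only
// closing s.pullClose does.
func (s *server) pullBinlogs(last int64, send func([]byte) error) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	go func() {
		select {
		case <-s.pullClose:
			cancel()
		case <-ctx.Done():
		}
	}()

	for data := range s.storage.PullCommitBinlog(ctx, last) {
		if err := send(data); err != nil {
			return err
		}
	}
	return nil
}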

Check List

Tests

  • Unit test
  • Integration test

Code changes

Side effects

Related changes

@lichunzhu (Contributor, Author)

/run-all-tests

@july2993 (Contributor) commented Sep 3, 2019

Why did CI pass before? This bug seems to exist before PR #735... it means we can't offline a pump.

@lichunzhu (Contributor, Author) commented Sep 4, 2019

Why did CI pass before? This bug seems to exist before PR #735... it means we can't offline a pump.

Perhaps in past versions drainer had already cached the binlogs from pump.storage before pump started to quit. After we reduced the cache size to 0, this problem appeared.

@lichunzhu (Contributor, Author)

Why did CI pass before? This bug seems to exist before PR #735... it means we can't offline a pump.

[screenshot: drainer log]
Meanwhile, if you check drainer's log you will find that drainer keeps creating connections to pump but never receives a binlog, as the screenshot shows. That's because PullBinlog always returns codes.Canceled after s.cancel in pump.storage. After I changed s.ctx to context.Background(), drainer received binlogs normally. I've tested it on my server.

@july2993 (Contributor) commented Sep 4, 2019

      // notify other goroutines to exit
      s.cancel()
      s.wg.Wait()
      log.Info("background goroutins are stopped")

      s.commitStatus()

We cancel the ctx before commitStatus(). In commitStatus we write a fake binlog and need to make sure drainer consumes it, the same as before PR #735. But because the ctx is already canceled, drainer can't receive it (so this is not related to the cache).

@lichunzhu (Contributor, Author) commented Sep 4, 2019

      // notify other goroutines to exit
      s.cancel()
      s.wg.Wait()
      log.Info("background goroutins are stopped")

      s.commitStatus()

We cancel the ctx before commitStatus(). In commitStatus we write a fake binlog and need to make sure drainer consumes it, the same as before PR #735. But because the ctx is already canceled, drainer can't receive it (so this is not related to the cache).

I have a new conjecture now. I will run some tests to confirm it and give a conclusion later.

@lichunzhu (Contributor, Author) commented Sep 4, 2019

@july2993
When /drainer/pump.go finds that pump.PullBinlogs is closed, it keeps trying to start a new one again and again, since pump is still not offline.
This code is from /pump/storage/storage.go, in PullCommitBinlog:

				select {
				case values <- value:
					log.Debug("send value success")
				case <-ctx.Done():
					iter.Release()
					return
				}

Although ctx.Done() is triggered, there is still a chance for pump to send logs out, because select picks among ready cases at random. If the remaining unsent logs are not too many, we can still send them all to drainer within 15s (the time limit set in check_status). But with the new PR there are more unsent logs, which makes the process very slow (since it depends on that random chance), so it failed.
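
A standalone demonstration of that select behavior (not code from this repository): when both the send case and the <-ctx.Done() case are ready, Go picks one of them pseudo-randomly, so a canceled context does not deterministically stop the sends.

package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // ctx.Done() is already closed, like after s.cancel()

	values := make(chan int, 1) // buffered, so the send case is always ready

	sent, canceled := 0, 0
	for i := 0; i < 1000; i++ {
		select {
		case values <- i:
			<-values // drain so the send case stays ready next round
			sent++
		case <-ctx.Done():
			canceled++
		}
	}
	// Both counters end up non-zero: roughly half the iterations still
	// manage to send even though the context is canceled.
	fmt.Println("sent:", sent, "canceled:", canceled)
}

That randomness is why a small backlog may still drain within the 15s window, while a larger backlog usually cannot.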

@lichunzhu (Contributor, Author)

[screenshot: part of drainer.log]
Besides, this is part of the drainer.log from the newest master branch, which passed CI. It creates pullCli 3 times and finally receives the binlog on the 3rd pullCli. Then pump quits successfully.

@july2993 (Contributor) commented Sep 5, 2019

    blesCount:0 LevelSizes:[] LevelTablesCounts:[] LevelRead:[] LevelWrite:[] LevelDurations:[]}^[[0m
 32 2019/09/05 13:04:37 server.go:546: ^[[0;37m[info] writeBinlogCount: 40, alivePullerCount: 1,  maxCommitTS: 410951686898057217^[[0m
 33 2019/09/05 13:04:39 main.go:50: ^[[0;37m[info] got signal [15] to exit.^[[0m
 34 2019/09/05 13:04:39 server.go:854: ^[[0;37m[info] begin to close pump server ^[[0m
 35 2019/09/05 13:04:39 server.go:520: ^[[0;37m[info] detect drainer checkpoint routine exit ^[[0m
 36 2019/09/05 13:04:39 node.go:177: ^[[0;37m[info] Heartbeat goroutine exited ^[[0m
 37 2019/09/05 13:04:39 server.go:543: ^[[0;37m[info] printServerInfo exit ^[[0m
 38 2019/09/05 13:04:39 server.go:496: ^[[0;37m[info] genFakeBinlog exit ^[[0m
 39 2019/09/05 13:04:39 server.go:559: ^[[0;37m[info] gcBinlogFile exit ^[[0m
 40 2019/09/05 13:04:39 server.go:863: ^[[0;37m[info] background goroutins are stopped ^[[0m
 41 2019/09/05 13:04:39 server.go:849: ^[[0;37m[info] pump:8250 has update status to paused^[[0m
 42 2019/09/05 13:04:39 server.go:866: ^[[0;37m[info] commit status done ^[[0m
 43 2019/09/05 13:04:47 storage.go:374: ^[[0;37m[info] DBStats: {WriteDelayCount:0 WriteDelayDuration:0s WritePaused:false AliveSnapshots:0 AliveIterators:0 IOWrite:5437 IORead:0 BlockCacheSize:0 OpenedTa    blesCount:0 LevelSizes:[] LevelTablesCounts:[] LevelRead:[] LevelWrite:[] LevelDurations:[]}^[[0m
 44 2019/09/05 13:04:57 storage.go:374: ^[[0;37m[info] DBStats: {WriteDelayCount:0 WriteDelayDuration:0s WritePaused:false AliveSnapshots:0 AliveIterators:0 IOWrite:5437 IORead:0 BlockCacheSize:0 OpenedTa    blesCount:0 LevelSizes:[] LevelTablesCounts:[] LevelRead:[] LevelWrite:[] LevelDurations:[]}^[[0m

It will block at s.gs.GracefulStop.

My guess is that the in-flight PullBinlog gRPC calls never quit.

@lichunzhu (Contributor, Author) commented Sep 5, 2019

If so, we have two solutions.

  1. Run s.storage.Close() before s.gs.GracefulStop.
     Current code in pump.Close:

	// stop the gRPC server
	s.gs.GracefulStop()
	log.Info("grpc is stopped")

	if err := s.storage.Close(); err != nil {
		log.Error("close storage failed", zap.Error(err))
	}

Code in pump/storage.PullCommitBinlog:

// PullCommitBinlog return commit binlog  > last
func (a *Append) PullCommitBinlog(ctx context.Context, last int64) <-chan []byte {
	log.Debug("new PullCommitBinlog", zap.Int64("last ts", last))

	ctx, cancel := context.WithCancel(ctx)
	go func() {
		select {
		case <-a.close:
			cancel()
		case <-ctx.Done():
		}
	}()

After s.storage.Close(), the in-flight PullBinlog gRPC calls will quit.
  2. Add a new channel or context to control when PullBinlog quits.

With option 1 we may read metadata after the storage is closed, which may cause a panic.
So maybe the best solution is to add a channel that closes PullBinlog after commitStatus?
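
A rough sketch of what option 2's shutdown ordering could look like, only as a sketch under the assumptions in this thread: cancel, wg, commitStatus, pullClose, gs and storage mirror the names quoted above, while the stub types here are stand-ins rather than the real ones.

package pumpsketch2

import (
	"context"
	"log"
	"sync"
)

// minimal stand-ins so the sketch is self-contained
type closableStorage interface{ Close() error }
type grpcServer interface{ GracefulStop() }

type server struct {
	cancel    context.CancelFunc
	wg        sync.WaitGroup
	pullClose chan struct{}
	gs        grpcServer
	storage   closableStorage
}

// commitStatus updates the node status and writes the final fake binlog
// that drainer still needs to consume.
func (s *server) commitStatus() {}

// Close sketches option 2: the PullBinlog streams are only told to stop
// after commitStatus has written the final fake binlog, so GracefulStop
// can return once no stream is left in flight.
func (s *server) Close() {
	// notify background goroutines to exit and wait for them
	s.cancel()
	s.wg.Wait()

	// write the paused status and the fake binlog
	s.commitStatus()

	// now stop the PullBinlog handlers so in-flight streams return
	close(s.pullClose)

	// GracefulStop blocks until in-flight gRPC calls have finished
	s.gs.GracefulStop()

	// finally close storage, after no gRPC handler can touch it
	if err := s.storage.Close(); err != nil {
		log.Println("close storage failed:", err)
	}
}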

@july2993 (Contributor) commented Sep 5, 2019

For 1, in-flight WriteBinlog() gRPC calls need to use s.storage, so we can't just close s.storage first.
For 2, it can fix this problem quickly.

pump/server.go (review comment, outdated and resolved)
@july2993 (Contributor) left a comment:
LGTM

@july2993 (Contributor) commented Sep 5, 2019

@leoppro @suzaku PTAL

@lichunzhu (Contributor, Author)

/run-all-tests

@suzaku (Contributor) left a comment:

LGTM

@lichunzhu merged commit 277f113 into pingcap:release-2.1 on Sep 5, 2019
@lichunzhu deleted the czli/fixPumpStorageQuit branch on September 5, 2019 08:20