fix pump storage quit bug #739

Merged
merged 3 commits into pingcap:release-2.1 from czli/fixPumpStorageQuit on Sep 5, 2019

Conversation

@lichunzhu (Contributor) commented Sep 3, 2019

What problem does this PR solve?

A child PR of #735.
When we close the pump server, we cancel s.ctx first. However, we want storage to keep pulling the stored binlogs until drainer has received them, so that pump can quit safely. If we use s.ctx for storage.PullCommitBinlog, the pull quits immediately and drainer may never receive the remaining binlogs from this pump.

What is changed and how it works?

Change s.ctx to context.Background() and add an s.pullClose channel to control when PullBinlogs stops. Call close(s.pullClose) after commitStatus.
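
A minimal sketch of that pattern (illustrative only, not the actual diff: pullClose and PullCommitBinlog come from the description above, while the server shape, pullBinlogs and send are assumptions):

package pumpsketch

import "context"

// pullStorage is a stand-in for pump/storage: PullCommitBinlog streams
// committed binlogs greater than `last` until its ctx is canceled.
type pullStorage interface {
	PullCommitBinlog(ctx context.Context, last int64) <-chan []byte
}

type server struct {
	ctx       context.Context // canceled early in Close
	cancel    context.CancelFunc
	storage   pullStorage
	pullClose chan struct{} // closed only after commitStatus
}

// pullBinlogs streams binlogs to drainer. It starts from context.Background()
// instead of s.ctx, so canceling s.ctx no longer stops the stream; only
// closing s.pullClose does.
func (s *server) pullBinlogs(last int64, send func([]byte) error) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	go func() {
		select {
		case <-s.pullClose:
			cancel()
		case <-ctx.Done():
		}
	}()

	for data := range s.storage.PullCommitBinlog(ctx, last) {
		if err := send(data); err != nil {
			return err
		}
	}
	return nil
}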

Check List

Tests

  • Unit test
  • Integration test

Code changes

Side effects

Related changes

@lichunzhu (Contributor, Author)

/run-all-tests

@july2993 (Contributor) commented Sep 3, 2019

Why did CI pass before? This bug seems to exist before PR #735... it means we can't offline a pump.

@lichunzhu (Contributor, Author) commented Sep 4, 2019

Why did CI pass before? This bug seems to exist before PR #735... it means we can't offline a pump.

Perhaps in past versions drainer had already cached the binlogs from pump.storage before pump started to quit. After we reduced the cache size to 0, this problem appeared.

@lichunzhu (Contributor, Author)

Why did CI pass before? This bug seems to exist before PR #735... it means we can't offline a pump.

[screenshot: drainer log]
Meanwhile, if you check drainer's log you will find that drainer keeps creating connections to pump but never receives a binlog, as the screenshot shows. That's because PullBinlog always returns codes.Canceled after s.cancel in pump.storage. After I changed s.ctx to context.Background(), drainer received binlogs normally. I've tested it on my server.

@july2993 (Contributor) commented Sep 4, 2019

      // notify other goroutines to exit
      s.cancel()
      s.wg.Wait()
      log.Info("background goroutins are stopped")

      s.commitStatus()

We cancel the ctx before commitStatus(). In commitStatus we write a fake binlog and need to make sure drainer consumes it, the same as before PR #735. But because the ctx is already canceled, drainer can't receive it (so this is not related to the cache).

@lichunzhu (Contributor, Author) commented Sep 4, 2019

      // notify other goroutines to exit
      s.cancel()
      s.wg.Wait()
      log.Info("background goroutins are stopped")

      s.commitStatus()

We cancel the ctx before commitStatus(). In commitStatus we write a fake binlog and need to make sure drainer consumes it, the same as before PR #735. But because the ctx is already canceled, drainer can't receive it (so this is not related to the cache).

I have a new conjecture now. I will run some tests to confirm it and give a conclusion later.

@lichunzhu (Contributor, Author) commented Sep 4, 2019

@july2993
When /drainer/pump.go finds that pump.PullBinlogs is closed, it keeps trying to start a new one again and again, since pump is still not offline.
This code is from /pump/storage/storage.go, in PullCommitBinlog:

				select {
				case values <- value:
					log.Debug("send value success")
				case <-ctx.Done():
					iter.Release()
					return
				}

Although ctx.Done() is triggered, there is still a chance for pump to send logs out, because select picks among ready cases at random. If the remaining unsent logs are not too many, we can still send them all to drainer within 15s (the time limit set in check_status). But with the new PR there are more unsent logs, which makes the process very slow (since it depends on that random chance), so it failed.
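
A standalone demonstration of that select behavior (not code from this repository): when both the send case and the <-ctx.Done() case are ready, Go picks one of them pseudo-randomly, so a canceled context does not deterministically stop the sends.

package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // ctx.Done() is already closed, like after s.cancel()

	values := make(chan int, 1) // buffered, so the send case is always ready

	sent, canceled := 0, 0
	for i := 0; i < 1000; i++ {
		select {
		case values <- i:
			<-values // drain so the send case stays ready next round
			sent++
		case <-ctx.Done():
			canceled++
		}
	}
	// Both counters end up non-zero: roughly half the iterations still
	// manage to send even though the context is canceled.
	fmt.Println("sent:", sent, "canceled:", canceled)
}

That randomness is why a small backlog may still drain within the 15s window, while a larger backlog usually cannot.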

@lichunzhu (Contributor, Author)

[screenshot: part of drainer.log]
Besides, this is part of the drainer.log from the newest master branch, which passed CI. It creates pullCli 3 times and finally receives the binlog on the 3rd pullCli. Then pump quits successfully.

@july2993 (Contributor) commented Sep 5, 2019

    blesCount:0 LevelSizes:[] LevelTablesCounts:[] LevelRead:[] LevelWrite:[] LevelDurations:[]}^[[0m
 32 2019/09/05 13:04:37 server.go:546: ^[[0;37m[info] writeBinlogCount: 40, alivePullerCount: 1,  maxCommitTS: 410951686898057217^[[0m
 33 2019/09/05 13:04:39 main.go:50: ^[[0;37m[info] got signal [15] to exit.^[[0m
 34 2019/09/05 13:04:39 server.go:854: ^[[0;37m[info] begin to close pump server ^[[0m
 35 2019/09/05 13:04:39 server.go:520: ^[[0;37m[info] detect drainer checkpoint routine exit ^[[0m
 36 2019/09/05 13:04:39 node.go:177: ^[[0;37m[info] Heartbeat goroutine exited ^[[0m
 37 2019/09/05 13:04:39 server.go:543: ^[[0;37m[info] printServerInfo exit ^[[0m
 38 2019/09/05 13:04:39 server.go:496: ^[[0;37m[info] genFakeBinlog exit ^[[0m
 39 2019/09/05 13:04:39 server.go:559: ^[[0;37m[info] gcBinlogFile exit ^[[0m
 40 2019/09/05 13:04:39 server.go:863: ^[[0;37m[info] background goroutins are stopped ^[[0m
 41 2019/09/05 13:04:39 server.go:849: ^[[0;37m[info] pump:8250 has update status to paused^[[0m
 42 2019/09/05 13:04:39 server.go:866: ^[[0;37m[info] commit status done ^[[0m
 43 2019/09/05 13:04:47 storage.go:374: ^[[0;37m[info] DBStats: {WriteDelayCount:0 WriteDelayDuration:0s WritePaused:false AliveSnapshots:0 AliveIterators:0 IOWrite:5437 IORead:0 BlockCacheSize:0 OpenedTa    blesCount:0 LevelSizes:[] LevelTablesCounts:[] LevelRead:[] LevelWrite:[] LevelDurations:[]}^[[0m
 44 2019/09/05 13:04:57 storage.go:374: ^[[0;37m[info] DBStats: {WriteDelayCount:0 WriteDelayDuration:0s WritePaused:false AliveSnapshots:0 AliveIterators:0 IOWrite:5437 IORead:0 BlockCacheSize:0 OpenedTa    blesCount:0 LevelSizes:[] LevelTablesCounts:[] LevelRead:[] LevelWrite:[] LevelDurations:[]}^[[0m

It will block at s.gs.GracefulStop.

My guess is that the in-flight PullBinlog gRPC calls never quit.

@lichunzhu (Contributor, Author) commented Sep 5, 2019

If so, we have two solutions.

  1. Run s.storage.Close() before s.gs.GracefulStop.
     Current code in pump.Close:

	// stop the gRPC server
	s.gs.GracefulStop()
	log.Info("grpc is stopped")

	if err := s.storage.Close(); err != nil {
		log.Error("close storage failed", zap.Error(err))
	}

Code in pump/storage.PullCommitBinlog:

// PullCommitBinlog return commit binlog  > last
func (a *Append) PullCommitBinlog(ctx context.Context, last int64) <-chan []byte {
	log.Debug("new PullCommitBinlog", zap.Int64("last ts", last))

	ctx, cancel := context.WithCancel(ctx)
	go func() {
		select {
		case <-a.close:
			cancel()
		case <-ctx.Done():
		}
	}()

After s.storage.Close(), the in-flight PullBinlog gRPC calls will quit.
  2. Add a new channel or context to control when PullBinlog quits.

With option 1 we may read metadata after the storage is closed, which may cause a panic.
So maybe the best solution is to add a channel that closes PullBinlog after commitStatus?
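
A rough sketch of what option 2's shutdown ordering could look like, only as a sketch under the assumptions in this thread: cancel, wg, commitStatus, pullClose, gs and storage mirror the names quoted above, while the stub types here are stand-ins rather than the real ones.

package pumpsketch2

import (
	"context"
	"log"
	"sync"
)

// minimal stand-ins so the sketch is self-contained
type closableStorage interface{ Close() error }
type grpcServer interface{ GracefulStop() }

type server struct {
	cancel    context.CancelFunc
	wg        sync.WaitGroup
	pullClose chan struct{}
	gs        grpcServer
	storage   closableStorage
}

// commitStatus updates the node status and writes the final fake binlog
// that drainer still needs to consume.
func (s *server) commitStatus() {}

// Close sketches option 2: the PullBinlog streams are only told to stop
// after commitStatus has written the final fake binlog, so GracefulStop
// can return once no stream is left in flight.
func (s *server) Close() {
	// notify background goroutines to exit and wait for them
	s.cancel()
	s.wg.Wait()

	// write the paused status and the fake binlog
	s.commitStatus()

	// now stop the PullBinlog handlers so in-flight streams return
	close(s.pullClose)

	// GracefulStop blocks until in-flight gRPC calls have finished
	s.gs.GracefulStop()

	// finally close storage, after no gRPC handler can touch it
	if err := s.storage.Close(); err != nil {
		log.Println("close storage failed:", err)
	}
}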

@july2993 (Contributor) commented Sep 5, 2019

For 1, in-flight WriteBinlog() gRPC calls need to use s.storage, so we can't just close s.storage first.
For 2, it can fix this problem quickly.

pump/server.go (review comment, outdated and resolved)
@july2993 (Contributor) left a comment:
LGTM

@july2993 (Contributor) commented Sep 5, 2019

@leoppro @suzaku PTAL

@lichunzhu (Contributor, Author)

/run-all-tests

@suzaku (Contributor) left a comment:

LGTM

@lichunzhu merged commit 277f113 into pingcap:release-2.1 on Sep 5, 2019
@lichunzhu deleted the czli/fixPumpStorageQuit branch on September 5, 2019 08:20