Many changefeeds' checkpoint suspended but status is normal #5142

Closed
Tammyxia opened this issue Apr 11, 2022 · 6 comments
Labels
area/ticdc Issues or PRs related to TiCDC. severity/moderate type/bug The issue is confirmed as a bug.

Comments

@Tammyxia

What did you do?

  • Create a new v6.0.0 cluster.
  • Create 4 Kafka changefeeds against 2 Kafka servers:
  1. tiup cdc:v6.0.0 cli changefeed create --pd=http://172.16.6.5:2379 --sink-uri="kafka://172.16.6.46:9092/cdc-events-n1?kafka-version=3.1.0&partition-num=4&protocol=canal-json" --changefeed-id="kafka-changefeed-canal-json-46-n1"
  2. tiup cdc:v6.0.0 cli changefeed create --pd=http://172.16.6.5:2379 --sink-uri="kafka://172.16.6.46:9092/cdc-events-n2?kafka-version=3.1.0&partition-num=4&protocol=canal-json" --changefeed-id="kafka-changefeed-canal-json-46-n2"
  3. tiup cdc:v6.0.0 cli changefeed create --pd=http://172.16.6.5:2379 --sink-uri="kafka://172.16.6.47:9092/cdc-events-n2?kafka-version=3.1.0&partition-num=4&protocol=canal-json" --changefeed-id="kafka-changefeed-canal-json-47-n2"
  4. tiup cdc:v6.0.0 cli changefeed create --pd=http://172.16.6.5:2379 --sink-uri="kafka://172.16.6.47:9092/cdc-events-n1?kafka-version=3.1.0&partition-num=4&protocol=canal-json" --changefeed-id="kafka-changefeed-canal-json-47-n1"
  • Start a sysbench workload.
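
The four commands above differ only in the Kafka host, the topic, and the changefeed ID. As a rough sketch for anyone scripting the reproduction, the helper below rebuilds them as strings; the function names `changefeed_command` and `reproduction_commands` are made up for illustration, and every flag and value is copied from the commands in this report.

```python
from urllib.parse import urlencode


def changefeed_command(pd, kafka_host, topic, changefeed_id, cdc_version="v6.0.0"):
    """Build one `tiup cdc ... cli changefeed create` command line as a string."""
    # Sink parameters exactly as used in the report.
    query = urlencode(
        {"kafka-version": "3.1.0", "partition-num": 4, "protocol": "canal-json"}
    )
    sink_uri = f"kafka://{kafka_host}:9092/{topic}?{query}"
    return (
        f"tiup cdc:{cdc_version} cli changefeed create --pd={pd} "
        f'--sink-uri="{sink_uri}" --changefeed-id="{changefeed_id}"'
    )


def reproduction_commands(pd="http://172.16.6.5:2379"):
    """Return the four commands from the report (hypothetical helper)."""
    commands = []
    for host in ("172.16.6.46", "172.16.6.47"):
        suffix = host.rsplit(".", 1)[1]  # last octet, e.g. "46"
        for n in ("n1", "n2"):
            commands.append(
                changefeed_command(
                    pd,
                    host,
                    f"cdc-events-{n}",
                    f"kafka-changefeed-canal-json-{suffix}-{n}",
                )
            )
    return commands


if __name__ == "__main__":
    for command in reproduction_commands():
        print(command)
```

The script only prints the commands; it does not run them.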

What did you expect to see?

No response

What did you see instead?

  • All changefeeds get stuck after 5 min.

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

6.0.0

Upstream TiKV version (execute tikv-server --version):

6.0.0

TiCDC version (execute cdc version):

6.0.0
Tammyxia added the type/bug and area/ticdc labels on Apr 11, 2022
@Tammyxia
Author

There are 3 captures. cdc_stderr output:
2022/04/11 17:12:11 000010.log:
rename /tmp/cdc_data/tmp/sorter/0004/000002.log /tmp/cdc_data/tmp/sorter/0004/000010.log: no such file or directory
directory contains 1037 files, 0 unknown, 1027 tables, 6 logs, 1 manifests
2022/04/11 17:12:44 000429.log:
rename /tmp/cdc_data/tmp/sorter/0004/000417.log /tmp/cdc_data/tmp/sorter/0004/000429.log: no such file or directory
directory contains 39 files, 0 unknown, 33 tables, 2 logs, 1 manifests

Error: run server: create sorter system: [CDC:ErrNewCaptureFailed]new capture failed: resource temporarily unavailable
run server: create sorter system: [CDC:ErrNewCaptureFailed]new capture failed: resource temporarily unavailable
Error: run server: owner exited with error: [CDC:ErrOwnerNotFound]owner not found
run server: owner exited with error: [CDC:ErrOwnerNotFound]owner not found
panic: The CommitTs must be greater than the resolvedTs

goroutine 1661 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc002a06000, {0xc004fb5400, 0x4, 0x4})
go.uber.org/zap@v1.20.0/zapcore/entry.go:232 +0x44c
go.uber.org/zap.(*Logger).Panic(0xc00154a300?, {0x31b50f1?, 0x1?}, {0xc004fb5400, 0x4, 0x4})
go.uber.org/zap@v1.20.0/logger.go:230 +0x59
github.com/pingcap/log.Panic({0x31b50f1, 0x30}, {0xc004fb5400, 0x4, 0x4})
github.com/pingcap/log@v0.0.0-20211215031037-e024ba4eb0ee/global.go:54 +0x10c
github.com/pingcap/tiflow/cdc/kv.(*regionWorker).handleEventEntry(0xc004059c20, {0x37fff88, 0xc004470d00}, 0x0?, 0xc00d22f6b0)
github.com/pingcap/tiflow/cdc/kv/region_worker.go:694 +0xee9
github.com/pingcap/tiflow/cdc/kv.(*regionWorker).processEvent(0xc004059c20, {0x37fff88, 0xc004470d00}, 0xc016071530)
github.com/pingcap/tiflow/cdc/kv/region_worker.go:391 +0x3e9
github.com/pingcap/tiflow/cdc/kv.(*regionWorker).eventHandler(0xc004059c20, {0x37fff88, 0xc004470d00})
github.com/pingcap/tiflow/cdc/kv/region_worker.go:486 +0x1b4
github.com/pingcap/tiflow/cdc/kv.(*regionWorker).run.func3()
github.com/pingcap/tiflow/cdc/kv/region_worker.go:611 +0x25
golang.org/x/sync/errgroup.(*Group).Go.func1()
golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57 +0x67
created by golang.org/x/sync/errgroup.(*Group).Go
golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:54 +0x8d
panic: The CommitTs must be greater than the resolvedTs

@overvenus
Member

After offline investigation, we identified three issues:

  1. The TiCDC data-dir setting is not applied properly in v6.0.0.
  2. The fd limit (1000) is too small for TiCDC.
  3. The changefeed status stays normal while TiCDC keeps panicking.
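
The third issue (status normal, no progress) can be caught from outside by watching the checkpoint rather than the status. Below is a minimal sketch of that check, assuming some poller already collects the changefeed state and checkpoint timestamp; the `Sample` type, its field names, and the 300-second threshold are hypothetical and not part of TiCDC.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    """One poll of a changefeed (hypothetical shape, not a TiCDC type)."""
    polled_at: float    # wall-clock time of the poll, in seconds
    state: str          # reported changefeed state, e.g. "normal"
    checkpoint_ts: int  # reported checkpoint timestamp


def stalled_while_normal(samples: Sequence[Sample], min_stall_seconds: float = 300.0) -> bool:
    """Return True if the newest samples all say "normal" while the
    checkpoint has not moved for at least min_stall_seconds."""
    if len(samples) < 2:
        return False
    latest = samples[-1]
    if latest.state != "normal":
        return False
    stall_start = latest.polled_at
    # Walk backwards while the state is normal and the checkpoint is unchanged.
    for sample in reversed(samples[:-1]):
        if sample.state != "normal" or sample.checkpoint_ts != latest.checkpoint_ts:
            break
        stall_start = sample.polled_at
    return latest.polled_at - stall_start >= min_stall_seconds
```

The threshold is an arbitrary choice and would need tuning against the normal checkpoint lag of the workload.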

@hicqu
Contributor

hicqu commented Apr 18, 2022

panic: The CommitTs must be greater than the resolvedTs

https://github.com/tikv/tikv/blob/master/components/cdc/src/delegate.rs#L640 shows TiKV will panic if commitTs is less than or equal to resolvedTS.

And although the resolve_ts module can advance resolved timestamps when a region leader hasn't applied to its current term, cdc will fail because of "stale command", which means incorrect resolved timestamps won't be pushed to clients.
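
For readers unfamiliar with the invariant behind the panic message, here is a small illustration: an entry that arrives after a resolved timestamp must carry a commit timestamp strictly greater than it. This is only a sketch of the rule, not code from tiflow or TiKV; `check_commit_ts` and `TsOrderError` are made-up names, and the sketch raises an exception where the real code panics.

```python
class TsOrderError(ValueError):
    """Raised when a commit timestamp does not exceed the resolved timestamp."""


def check_commit_ts(commit_ts: int, resolved_ts: int) -> int:
    """Return commit_ts if it is strictly greater than resolved_ts, else raise."""
    if commit_ts <= resolved_ts:
        raise TsOrderError(
            f"CommitTs {commit_ts} must be greater than resolvedTs {resolved_ts}"
        )
    return commit_ts
```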

@Tammyxia
Author

Verified this scenario again. Updated results:

  1. The fd limit for the cdc process is 1 million, which is hard-coded by tiup, so there is no fd limit issue.
  2. When the sorter directory is full, LevelDB compacts continuously, which stops the changefeeds from working while their status stays normal.
  3. After freeing enough capacity in the sorter directory (by cleaning up other data), the changefeeds recover.
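
Since the stall came down to a full sorter directory, a free-space check on that filesystem is a cheap early warning. A minimal sketch, assuming the sorter path seen in the stderr log above (`/tmp/cdc_data/tmp/sorter`); the function name and the 10% threshold are arbitrary choices for illustration.

```python
import shutil


def sorter_dir_has_headroom(path: str, min_free_ratio: float = 0.1) -> bool:
    """Return True if the filesystem holding `path` has at least
    min_free_ratio of its capacity free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return False
    return usage.free / usage.total >= min_free_ratio


if __name__ == "__main__":
    # Path taken from the cdc_stderr output in this issue.
    print(sorter_dir_has_headroom("/tmp/cdc_data/tmp/sorter"))
```

The check covers the whole filesystem, so it also reflects space taken by other data on the same volume, which matches how the directory filled up in this case.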

Tammyxia changed the title from "Many kafka changefeed get stuck..." to "Many changefeeds' checkpoint suspended but status is normal" on Apr 21, 2022
@nongfushanquan
Contributor

/close

@ti-chi-bot
Member

@nongfushanquan: Closing this issue.

