[Bug]: Panic: etcdserver: too many operations in txn request #26748

Closed
1 task done
darkerin opened this issue Aug 31, 2023 · 14 comments
Labels: kind/bug (issues or changes related to a bug), triage/accepted (indicates an issue or PR is ready to be actively worked on)

@darkerin

darkerin commented Aug 31, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.3.0
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq   
- SDK version(e.g. pymilvus v2.0.0rc2): 2.3.0
- OS(Ubuntu or CentOS): Ubuntu18.04
- CPU/Memory: 4C8G
- GPU: 
- Others:

Current Behavior

  1. I want to use a script to synchronously create 500 partitions in a collection, each holding 1000 entities (1536 dimensions).

  2. After inserting into about 300 partitions, the container crashed. The logs seem to show "DataCoord is not serving".

  3. When I tried to start the container again, it failed with the following error:

[2023/08/31 04:05:08.752 +00:00] [WARN] [meta/collection_manager.go:202] ["upgrade recover failed"] [error="etcdserver: too many operations in txn request"]
[2023/08/31 04:05:08.752 +00:00] [WARN] [querycoordv2/server.go:313] ["failed to recover collections"]
[2023/08/31 04:05:08.752 +00:00] [ERROR] [components/query_coord.go:54] ["QueryCoord starts error"] [error="etcdserver: too many operations in txn request"] [stack="github.com/milvus-io/milvus/cmd/components.(*QueryCoord).Run\n\t/go/src/github.com/milvus-io/milvus/cmd/components/query_coord.go:54\ngithub.com/milvus-io/milvus/cmd/roles.runComponent[...].func1\n\t/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:112"]
panic: etcdserver: too many operations in txn request

goroutine 237 [running]:
panic({0x3ceae00, 0xc0010adb48})
	/usr/local/go/src/runtime/panic.go:987 +0x3bb fp=0xc000317f58 sp=0xc000317e98 pc=0x153fbdb
github.com/milvus-io/milvus/cmd/roles.runComponent[...].func1()
	/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:113 +0x185 fp=0xc000317fe0 sp=0xc000317f58 pc=0x3755225
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000317fe8 sp=0xc000317fe0 pc=0x1579301
created by github.com/milvus-io/milvus/cmd/roles.runComponent[...]
	/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:99 +0x15c

goroutine 1 [semacquire]:
runtime.gopark(0x0?, 0xc000fdd7b8?, 0x60?, 0x83?, 0xc000cf7c60?)
	/usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc00179f778 sp=0xc00179f758 pc=0x1543096
runtime.goparkunlock(...)
	/usr/local/go/src/runtime/proc.go:387
runtime.semacquire1(0xc00065cc68, 0x60?, 0x1, 0x0, 0x1?)
	/usr/local/go/src/runtime/sema.go:160 +0x20f fp=0xc00179f7e0 sp=0xc00179f778 pc=0x1554e6f
sync.runtime_Semacquire(0x153e47f?)
	/usr/local/go/src/runtime/sema.go:62 +0x27 fp=0xc00179f818 sp=0xc00179f7e0 pc=0x1574d07
sync.(*WaitGroup).Wait(0xc000d804c0?)
	/usr/local/go/src/sync/waitgroup.go:116 +0x4b fp=0xc00179f840 sp=0xc00179f818 pc=0x1588f4b
github.com/milvus-io/milvus/cmd/roles.(*MilvusRoles).Run(0xc00179fe48, 0x1, {0xc00045c080?, 0xe?})
	/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:340 +0x8fa fp=0xc00179fdf8 sp=0xc00179f840 pc=0x3754a3a
github.com/milvus-io/milvus/cmd/milvus.(*run).execute(0xc000474000, {0xc000052090?, 0x3, 0x3}, 0xc000448240)
	/go/src/github.com/milvus-io/milvus/cmd/milvus/run.go:117 +0x68e fp=0xc00179fee0 sp=0xc00179fdf8 pc=0x3760f2e
github.com/milvus-io/milvus/cmd/milvus.RunMilvus({0xc000052090?, 0x3, 0x3})
	/go/src/github.com/milvus-io/milvus/cmd/milvus/milvus.go:60 +0x21e fp=0xc00179ff58 sp=0xc00179fee0 pc=0x376079e
main.main()
	/go/src/github.com/milvus-io/milvus/cmd/main.go:26 +0x2e fp=0xc00179ff80 sp=0xc00179ff58 pc=0x376376e
runtime.main()
	/usr/local/go/src/runtime/proc.go:250 +0x207 fp=0xc00179ffe0 sp=0xc00179ff80 pc=0x1542c67
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc00179ffe8 sp=0xc00179ffe0 pc=0x1579301

Expected Behavior

I would like some help with the following:

  1. Why is this happening, and why doesn't a restart solve it?
  2. How many partitions and entities can a collection hold without performance problems?
  3. If I run into this kind of problem in a production environment, how should I recover the data?

Thanks a lot!

Steps To Reproduce

No response

Milvus Log

milvus0831.log

Anything else?

No response

@darkerin darkerin added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 31, 2023
@darkerin
Author

darkerin commented Aug 31, 2023

My sequence of operations is as follows:

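from pymilvus import Collection, Partition

# Assumes a connection is already open and insert_list holds the rows to insert.
c = Collection("name")

# Create the partition, load it, then insert into it.
p1 = Partition(c, "p1")
p1.load()
c.insert(insert_list, partition_name="p1")
# ..... (same pattern for the partitions in between)

p300 = Partition(c, "p300")
p300.load()
c.insert(insert_list, partition_name="p300")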

@xiaofan-luan
Contributor

p300 = Partition(c, "p300")
p300.load()
c.insert(insert_list, partition_name="p2")

Should this be c.insert(insert_list, partition_name="p300")?

@darkerin
Author

> p300 = Partition(c, "p300")
> p300.load()
> c.insert(insert_list, partition_name="p2")
>
> Should this be c.insert(insert_list, partition_name="p300")?

That was just a typo. In the actual code I did this in a loop.

@xiaofan-luan
Contributor

After checking the code, I think there is a bug in 2.3 when the number of partitions is larger than 128.
We will fix it and release the fix in 2.3.1 next week.
For now, you can change the etcd configuration to work around it.

@xiaofan-luan
Contributor

See etcd-io/etcd#10048 for details.
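
For context, the error comes from etcd's cap on the number of operations in a single transaction (the `--max-txn-ops` server flag, which defaults to 128). The following is only a rough sketch of one way a client can stay under such a cap, by splitting a large set of writes into several smaller transactions. It is not Milvus's actual fix, and the helper name and the example operations are hypothetical.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# etcd's default for --max-txn-ops; assumed here, the server may be configured differently.
DEFAULT_MAX_TXN_OPS = 128


def split_into_txns(ops: Sequence[T], max_ops: int = DEFAULT_MAX_TXN_OPS) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `max_ops` operations each."""
    if max_ops <= 0:
        raise ValueError("max_ops must be positive")
    for start in range(0, len(ops), max_ops):
        yield list(ops[start:start + max_ops])


# Hypothetical workload: one metadata write per partition for 300 partitions.
partition_ops = [f"put partition-meta/p{i}" for i in range(1, 301)]
batches = list(split_into_txns(partition_ops))
print([len(b) for b in batches])  # [128, 128, 44]
```

Note that batching trades away atomicity: each batch commits on its own, so a failure partway through leaves some writes applied and others not.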

@darkerin
Author

> See etcd-io/etcd#10048 for details.

OK, thanks. Is there a way for me to modify this parameter in milvus.yaml? I don't know how to modify the etcd server config.

@darkerin
Author

> After checking the code, I think there is a bug in 2.3 when the number of partitions is larger than 128. We will fix it and release the fix in 2.3.1 next week. For now, you can change the etcd configuration to work around it.

Is the limit of 128 on the total number of partitions in Milvus, or on the number of partitions in a single collection?

@xiaofan-luan
Contributor

This is an etcd parameter, so milvus.yaml won't help.

If you use Docker Compose, adding ETCD_MAX_TXN_OPS=1024 under the etcd service's environment section should help.
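
As a minimal sketch of that workaround, the fragment below shows where the variable would go in a docker-compose.yml. The service name `etcd` is an assumption; use whatever name your compose file gives the etcd container, and keep its existing settings (image, volumes, other environment entries) as they are.

```yaml
services:
  etcd:
    # ...existing image, volumes, and command stay unchanged...
    environment:
      # Raises etcd's per-transaction operation limit (--max-txn-ops, default 128).
      - ETCD_MAX_TXN_OPS=1024
```

The etcd container has to be recreated (for example with `docker compose up -d`) for the new environment variable to take effect.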

@darkerin
Author

> This is an etcd parameter, so milvus.yaml won't help.
>
> If you use Docker Compose, adding ETCD_MAX_TXN_OPS=1024 under the etcd service's environment section should help.

OK, it works!

@darkerin
Author

> > After checking the code, I think there is a bug in 2.3 when the number of partitions is larger than 128. We will fix it and release the fix in 2.3.1 next week. For now, you can change the etcd configuration to work around it.
>
> Is the limit of 128 on the total number of partitions in Milvus, or on the number of partitions in a single collection?

Or is it a limit on the number of partition-creation operations within a period of time?

@yanliang567
Contributor

/assign @yah01
/unassign

@sre-ci-robot sre-ci-robot assigned yah01 and unassigned yanliang567 Aug 31, 2023
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 31, 2023
@yanliang567 yanliang567 added this to the 2.3 milestone Aug 31, 2023
@darkerin
Author

darkerin commented Sep 5, 2023

I have modified the etcd parameter ETCD_MAX_TXN_OPS. Is it related to this?

See #26855

@yah01
Member

yah01 commented Sep 12, 2023

Fixed by #26763.
/assign @yanliang567 could you help add tests for this?

@yanliang567
Contributor

I think we have already added the tests. Not reproduced on 2.3.0-20230907-264c542b.
