Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] v3.3.10 panic: runtime error: index out of range #10399

Closed
clearcodecn opened this issue Jan 12, 2019 · 12 comments
Closed

[bug] v3.3.10 panic: runtime error: index out of range #10399

clearcodecn opened this issue Jan 12, 2019 · 12 comments
Labels

Comments

@clearcodecn
Copy link

version:

etcd Version: 3.3.10
Git SHA: 27fc7e2
Go Version: go1.10.4
Go OS/Arch: linux/amd64
2019-01-11 02:26:23.219865 W | etcdserver: request "header:<ID:7587834754236963201 > lease_grant:<ttl:5-second id:694d6781c4a85980>" with result "size:43" took too long (426.998155ms) to execute
2019-01-11 03:18:24.078247 W | etcdserver: request "header:<ID:7587834754236977093 > lease_grant:<ttl:5-second id:694d6781c4a88fc4>" with result "size:43" took too long (131.060197ms) to execute
2019-01-11 03:18:35.460940 W | etcdserver: request "header:<ID:7587834754236977127 > lease_revoke:<id:694d6781c4a88fd9>" with result "size:31" took too long (291.459845ms) to execute
2019-01-11 03:37:03.162383 W | wal: sync duration of 1.175994731s, expected less than 1s
2019-01-11 04:05:46.091414 W | wal: sync duration of 1.118662548s, expected less than 1s
2019-01-11 04:05:46.092621 W | etcdserver: failed to revoke 694d6781c4a8c1ae ("lease not found")
2019-01-11 04:05:55.724921 W | etcdserver: request "header:<ID:7587834754236989968 > lease_revoke:<id:694d6781c4a8c203>" with result "size:31" took too long (249.44156ms) to execute
2019-01-11 04:09:05.888778 I | wal: segmented wal file default.etcd/member/wal/0000000000000024-0000000000ae666e.wal is created
2019-01-11 04:09:11.563298 I | pkg/fileutil: purged file default.etcd/member/wal/000000000000001f-00000000009624dd.wal successfully
2019-01-11 04:12:44.651381 W | wal: sync duration of 1.021896404s, expected less than 1s
2019-01-11 04:12:44.964374 W | etcdserver: request "header:<ID:7587834754236991841 > lease_grant:<ttl:5-second id:694d6781c4a8c960>" with result "size:43" took too long (312.863854ms) to execute
2019-01-11 04:12:56.065315 W | etcdserver: request "header:<ID:7587834754236991897 > lease_grant:<ttl:5-second id:694d6781c4a8c998>" with result "size:43" took too long (645.098486ms) to execute
2019-01-11 04:36:03.663307 W | etcdserver: request "header:<ID:7587834754236998275 > lease_revoke:<id:694d6781c4a8e239>" with result "size:31" took too long (223.688324ms) to execute
2019-01-11 05:24:08.962590 W | etcdserver: request "header:<ID:7587834754237011116 > lease_grant:<ttl:5-second id:694d6781c4a914ab>" with result "size:43" took too long (184.729395ms) to execute
2019-01-11 05:24:19.042241 W | etcdserver: request "header:<ID:7587834754237011170 > put:<key:\"/services/heqingsongdeMacBook-Pro.local/5cd3fd05-2e01-4885-a330-b0310f08f5e5\" value_size:261 lease:7587834754237011168 >" with result "size:7" took too long (143.070838ms) to execute
2019-01-11 05:52:54.804059 W | wal: sync duration of 1.142695399s, expected less than 1s
2019-01-11 05:52:55.077332 W | etcdserver: request "header:<ID:7587834754237018849 > lease_revoke:<id:694d6781c4a932a5>" with result "size:31" took too long (273.114568ms) to execute
2019-01-11 05:52:55.077473 W | etcdserver: failed to revoke 694d6781c4a932a5 ("lease not found")
2019-01-11 05:52:55.077554 W | etcdserver: failed to revoke 694d6781c4a932a5 ("lease not found")
2019-01-11 06:12:31.988739 W | etcdserver: request "header:<ID:7587834754237023910 > put:<key:\"/services/wolf.local/d553bcbc-a2d7-498f-b39b-055c8ca9a408\" value_size:252 lease:7587834754237023908 >" with result "size:7" took too long (109.355496ms) to execute
2019-01-11 07:26:20.728715 W | wal: sync duration of 1.215774461s, expected less than 1s
panic: runtime error: index out of range

goroutine 131 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*DB).page(...)
        /tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/db.go:859
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Tx).rollback(0xc43906f500)
        /tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/tx.go:267 +0xe9
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Tx).Commit(0xc43906f500, 0xb3474f1521ed2, 0x165d340)
        /tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/tx.go:186 +0x4fe
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.(*batchTx).commit(0xc420545ef0, 0x0)
        /tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/batch_tx.go:167 +0x8f
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.(*batchTxBuffered).unsafeCommit(0xc420545ef0, 0xc4201e0000)
        /tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/batch_tx.go:239 +0x49
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.(*batchTxBuffered).commit(0xc420545ef0, 0xc4201e0000)
        /tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/batch_tx.go:227 +0x4c
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.(*batchTxBuffered).Commit(0xc420545ef0)
        /tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/batch_tx.go:214 +0x40
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.(*backend).run(0xc4201240e0)
        /tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/backend.go:273 +0xbb
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.newBackend
        /tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/backend.go:161 +0x2ea

@hexfusion
Copy link
Contributor

@clearcodecn thank you for the report. I have a few questions so that we can better address this.

  • Any additional context to this panic? I can see a reference to TXN but perhaps you have more details.

  • Is it reproducible? If yes can you please provide steps?

  • Any additional server and workload detals, k8s, heavy traffic etc.

@clearcodecn
Copy link
Author

@hexfusion

  • etcd running on a single server. command was
nohup ./etcd --advertise-client-urls http://0.0.0.0:2379 --listen-client-urls http://0.0.0.0:2379 &
  • It is't reproducible. It was suddenly. restart it then ok .

  • my program was just get and set key and values and it was very healthy for a long time when etcd was run , about above 20 days . but panic suddenly. I can show my code .

  • there was main code


type etcdv3Registry struct {
	client  *clientv3.Client
	options Options
	sync.Mutex
	register map[string]uint64
	leases   map[string]clientv3.LeaseID
}

func encode(s *Service) string {
	b, _ := json.Marshal(s)
	return string(b)
}

func decode(ds []byte) *Service {
	var s *Service
	json.Unmarshal(ds, &s)
	return s
}

func nodePath(s, id string) string {
	service := strings.Replace(s, "/", "-", -1)
	node := strings.Replace(id, "/", "-", -1)
	return path.Join(prefix, service, node)
}

func servicePath(s string) string {
	return path.Join(prefix, strings.Replace(s, "/", "-", -1))
}

func (e *etcdv3Registry) Init(opts ...Option) error {
	return configure(e, opts...)
}

func (e *etcdv3Registry) Options() Options {
	return e.options
}

func (e *etcdv3Registry) Deregister(s *Service) error {
	if len(s.Nodes) == 0 {
		return errors.New("Require at least one node")
	}

	e.Lock()
	// delete our hash of the service
	delete(e.register, s.Name)
	// delete our lease of the service
	delete(e.leases, s.Name)
	e.Unlock()

	ctx, cancel := context.WithTimeout(context.Background(), e.options.Timeout)
	defer cancel()

	etcdClient := getEtcdClient()

	for _, node := range s.Nodes {
		_, err := etcdClient.Delete(ctx, nodePath(s.Name, node.Id))
		if err != nil {
			return err
		}
	}
	return nil
}

func (e *etcdv3Registry) Register(s *Service, opts ...RegisterOption) error {
	if len(s.Nodes) == 0 {
		return errors.New("Require at least one node")
	}

	etcdClient := getEtcdClient()
	var leaseNotFound bool
	//refreshing lease if existing
	leaseID, ok := e.leases[s.Name]
	if ok {
		if _, err := etcdClient.KeepAliveOnce(context.TODO(), leaseID); err != nil {
			if err != rpctypes.ErrLeaseNotFound {
				return err
			}

			// lease not found do register
			leaseNotFound = true
		}
	}

	// create hash of service; uint64
	h, err := hash.Hash(s, nil)
	if err != nil {
		return err
	}

	// get existing hash
	e.Lock()
	v, ok := e.register[s.Name]
	e.Unlock()

	// the service is unchanged, skip registering
	if ok && v == h && !leaseNotFound {
		return nil
	}

	service := &Service{
		Name:      s.Name,
		Version:   s.Version,
		Metadata:  s.Metadata,
		Endpoints: s.Endpoints,
	}

	var options RegisterOptions
	for _, o := range opts {
		o(&options)
	}

	ctx, cancel := context.WithTimeout(context.Background(), e.options.Timeout)
	defer cancel()

	var lgr *clientv3.LeaseGrantResponse
	if options.TTL.Seconds() > 0 {
		lgr, err = etcdClient.Grant(ctx, int64(options.TTL.Seconds()))
		if err != nil {
			return err
		}
	}

	for _, node := range s.Nodes {
		service.Nodes = []*Node{node}
		if lgr != nil {
			_, err = etcdClient.Put(ctx, nodePath(service.Name, node.Id), encode(service), clientv3.WithLease(lgr.ID))
		} else {
			_, err = etcdClient.Put(ctx, nodePath(service.Name, node.Id), encode(service))
		}
		if err != nil {
			return err
		}
	}

	e.Lock()
	// save our hash of the service
	e.register[s.Name] = h
	// save our leaseID of the service
	if lgr != nil {
		e.leases[s.Name] = lgr.ID
	}
	e.Unlock()

	return nil
}

func (e *etcdv3Registry) GetService(name string) ([]*Service, error) {
	ctx, cancel := context.WithTimeout(context.Background(), e.options.Timeout)
	defer cancel()

	etcdClient := getEtcdClient()

	rsp, err := etcdClient.Get(ctx, servicePath(name)+"/", clientv3.WithPrefix(), clientv3.WithSort(clientv3.SortByKey, clientv3.SortDescend))
	if err != nil {
		return nil, err
	}

	if len(rsp.Kvs) == 0 {
		return nil, ErrNotFound
	}

	serviceMap := map[string]*Service{}

	for _, n := range rsp.Kvs {
		if sn := decode(n.Value); sn != nil {
			s, ok := serviceMap[sn.Version]
			if !ok {
				s = &Service{
					Name:      sn.Name,
					Version:   sn.Version,
					Metadata:  sn.Metadata,
					Endpoints: sn.Endpoints,
				}
				serviceMap[s.Version] = s
			}

			for _, node := range sn.Nodes {
				s.Nodes = append(s.Nodes, node)
			}
		}
	}

	var services []*Service
	for _, service := range serviceMap {
		services = append(services, service)
	}
	return services, nil
}

// 原始版本不带任何信息,只带服务名称
// 更改成了带上Node的metaData 信息.
func (e *etcdv3Registry) ListServices() ([]*Service, error) {
	var services []*Service

	ctx, cancel := context.WithTimeout(context.Background(), e.options.Timeout)
	defer cancel()

	etcdClient := getEtcdClient()

	rsp, err := etcdClient.Get(ctx, prefix, clientv3.WithPrefix(), clientv3.WithSort(clientv3.SortByKey, clientv3.SortDescend))
	if err != nil {
		return nil, err
	}

	if len(rsp.Kvs) == 0 {
		return []*Service{}, nil
	}

	for _, n := range rsp.Kvs {
		if sn := decode(n.Value); sn != nil {
			service := &Service{
				Name:      sn.Name,
				Version:   sn.Version,
				Metadata:  sn.Metadata,
				Endpoints: sn.Endpoints,
				Nodes:     sn.Nodes,
			}
			services = append(services, service)
		}
	}

	//for k := range nameSet {
	//	service := &Service{}
	//	service.Name = k
	//	services = append(services, service)
	//}

	return services, nil
}


@vaddenz
Copy link

vaddenz commented Jan 21, 2019

Any updates?

@jingyih
Copy link
Contributor

jingyih commented Mar 11, 2019

This is concerning. Is this still happening? Maybe check what is the bbolt version?

@clearcodecn
Copy link
Author

🤕 I did not care about that . May be it was ok .

@WIZARD-CXY
Copy link
Contributor

I did some investigating, same symptom as etcd-io/bbolt#152.

@WIZARD-CXY
Copy link
Contributor

WIZARD-CXY commented Mar 12, 2019

we will fix it first in the boltdb or u can just use this patch as kozlovic suggests in etcd-io/bbolt#152 and see if this panic happens or not.

@WIZARD-CXY
Copy link
Contributor

@clearcodecn

@clearcodecn
Copy link
Author

@WIZARD-CXY . OK. thanks. I will test it and reply .

@jingyih
Copy link
Contributor

jingyih commented Mar 12, 2019

@WIZARD-CXY Thanks for digging into this.

@jingyih
Copy link
Contributor

jingyih commented May 21, 2019

Can we try to get etcd-io/bbolt#153 merged?

@xiang90
Copy link
Contributor

xiang90 commented Jul 9, 2019

already merged. and bbolt version is updated in etcd. so closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

6 participants