
io: load closed segment 0000000000064324-0000000000064793: found 469 entries (expected 470) #1639

Closed
mikeccuk2005 opened this issue Apr 14, 2020 · 20 comments

@mikeccuk2005

mikeccuk2005 commented Apr 14, 2020

This is the fifth time this has happened to me.
Random failures after restarting with 'systemctl restart k3s'. It happens after several days of use: I restart with 'systemctl restart k3s' and the server fails to come back up.

Version:
k3s version v1.17.2+k3s1 (cdab19b)
Ubuntu 18.04.3 LTS

K3s arguments:
ExecStart=/usr/local/bin/k3s \
    server \
    '--cluster-init' \
    '--tls-san' \
    '10.7.55.77' \
    'K3S_KUBECONFIG_MODE=664'

What was installed on the k3s cluster:

  • Longhorn
  • Rbac-manager

What was configured:

  • Custom Registries with custom CA

What the log showed:

sudo journalctl -u k3s | tail -n 20
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 18964 (longhorn-manage) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 19021 (csi-node-driver) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 19072 (longhorn-manage) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 19608 (csi-provisioner) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 7526 (pause) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 17808 (npm) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 17858 (node) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 k3s[7257]: time="2020-04-14T16:26:52.507407670+07:00" level=info msg="Starting k3s v1.17.2+k3s1 (cdab19b0)"
Apr 14 16:26:53 devops02 k3s[7257]: time="2020-04-14T16:26:53.197376154+07:00" level=fatal msg="starting kubernetes: preparing server: start cluster and https: raft_start(): io: load closed segment 0000000000064324-0000000000064793: found 469 entries (expected 470)"
Apr 14 16:26:53 devops02 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Apr 14 16:26:53 devops02 systemd[1]: k3s.service: Failed with result 'exit-code'.
Apr 14 16:26:53 devops02 systemd[1]: Failed to start Lightweight Kubernetes.
@SteelCrow

SteelCrow commented Apr 14, 2020

I have the same problem with my cluster on a Raspberry Pi 4B, so it has effectively become single-use: any service restart or node reboot simply destroys the cluster.

It's probably a dependency issue:
k3s-io/kine#29
canonical/dqlite#190

@leolb-aphp

Similar issue here

@mikeccuk2005
Author

I reinstalled everything.
I do not recommend using the built-in dqlite for production, especially with multiple masters; it will break everything quickly.
Instead I hooked k3s up to PostgreSQL, and it works like a charm (for now).
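
For reference, a minimal sketch of pointing k3s at an external PostgreSQL datastore instead of the embedded dqlite; the host, credentials, and database name below are placeholders, not values from this thread:

# stop the service before switching datastores
sudo systemctl stop k3s

# start the server against an external PostgreSQL endpoint
k3s server \
  --datastore-endpoint='postgres://k3s:changeme@10.0.0.10:5432/kubernetes?sslmode=disable' \
  --tls-san 10.7.55.77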

@leolb-aphp

@mikeccuk2005 Don't close the issue, the bug isn't fixed yet.

@leolb-aphp

leolb-aphp commented Apr 15, 2020

I've manually recompiled k3s with go-dqlite v1.4.1 - will report back if it resolves the issue after a little while of uptime.
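
A rough sketch of that kind of dependency bump, assuming a checkout of the rancher/k3s source tree and a working libdqlite/libraft build environment; the exact build target may differ by release:

# bump the Go binding and refresh the vendored copy
go mod edit -require=github.com/canonical/go-dqlite@v1.4.1
go mod tidy
go mod vendor

# rebuild the k3s binary
make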

@mikeccuk2005 mikeccuk2005 reopened this Apr 15, 2020
@mikeccuk2005
Author

In the database directory (/var/lib/rancher/k3s/server/db/) there were snapshot files.
Can I roll back to one of them?
I would recommend that k3s add a backup and restore feature if this issue keeps occurring.
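
A minimal file-level backup sketch along those lines, assuming the embedded database lives under /var/lib/rancher/k3s/server/db/ as noted above; this is a plain copy, not an officially supported restore path:

# stop k3s so the database files are quiescent
sudo systemctl stop k3s

# copy the whole db directory somewhere safe
sudo cp -a /var/lib/rancher/k3s/server/db /root/k3s-db-backup-$(date +%F)

# start the service again
sudo systemctl start k3s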

@leolb-aphp

leolb-aphp commented Apr 15, 2020

Receiving errors like this after a little while:

avril 15 19:22:30 k3s[40012]: time="2020-04-15T19:22:30.268644397+02:00" level=error msg="error while range on /registry/longhorn.io/settings/ /registry/longhorn.io/settings/: context canceled"
avril 15 19:22:30 k3s[40012]: panic: runtime error: index out of range [2127] with length 232
avril 15 19:22:30 k3s[40012]: goroutine 19865 [running]:
avril 15 19:22:30 k3s[40012]: github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol.(*Message).lastByte(...)
avril 15 19:22:30 k3s[40012]: /go/src/github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol/message.go:476
avril 15 19:22:30 k3s[40012]: github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol.(*Rows).Close(0xc00f036fa8, 0xc00c406a20, 0x3)
avril 15 19:22:30 k3s[40012]: /go/src/github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol/message.go:631 +0x14c
avril 15 19:22:30 k3s[40012]: github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/driver.(*Rows).Close(0xc00f036f80, 0x42ff6a, 0x6dc47a0)
avril 15 19:22:30 k3s[40012]: /go/src/github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/driver/driver.go:560 +0x35
avril 15 19:22:30 k3s[40012]: database/sql.(*Rows).close.func1()
avril 15 19:22:30 k3s[40012]: /usr/local/go/src/database/sql/sql.go:3076 +0x3c
avril 15 19:22:30 k3s[40012]: database/sql.withLock(0x4773ee0, 0xc00ef8c000, 0xc00b44e6f8)
avril 15 19:22:30 k3s[40012]: /usr/local/go/src/database/sql/sql.go:3184 +0x6d
avril 15 19:22:30 k3s[40012]: database/sql.(*Rows).close(0xc00f037000, 0x4718940, 0xc0001321e0, 0x0, 0x0)
avril 15 19:22:30 k3s[40012]: /usr/local/go/src/database/sql/sql.go:3075 +0x129
avril 15 19:22:30 k3s[40012]: database/sql.(*Rows).awaitDone(0xc00f037000, 0x47d9320, 0xc00c8b6c00, 0x0, 0x0)
avril 15 19:22:30 k3s[40012]: /usr/local/go/src/database/sql/sql.go:2733 +0x101
avril 15 19:22:30 k3s[40012]: created by database/sql.(*Rows).initContextClose
avril 15 19:22:30 k3s[40012]: /usr/local/go/src/database/sql/sql.go:2717 +0xc1

And @mikeccuk2005 - I think backup of storage and Kubernetes configuration should be done at another level. Storage you can back up through your storage class; if it's local-path, find the path and copy the files away. For configuration, just save/export all the YAMLs.

Backing up and restoring using the raw db files seems unsafe.
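
A quick sketch of the YAML export mentioned above; note that 'kubectl get all' only covers common namespaced resources, so CRDs and other cluster-scoped objects need separate exports:

# export workloads and services from every namespace
kubectl get all --all-namespaces -o yaml > cluster-resources.yaml

# export custom resource definitions as well
kubectl get crd -o yaml > crds.yaml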

@leolb-aphp

And (code=killed, signal=SEGV) on other nodes, so it doesn't solve the problem.

@freeekanayaka

See k3s-io/kine#29 (comment)

@mikeccuk2005
Author

Is there a way to stop k3s gracefully to avoid this problem?
Currently using systemctl restart k3s is a bit aggressive, don't you think?
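
For what it's worth, a sketch of a more controlled shutdown sequence; draining is standard Kubernetes practice, and the k3s install script also ships a k3s-killall.sh helper that cleans up the container processes the service leaves behind (the node name below is taken from the logs above):

# move workloads off the node first
kubectl drain devops02 --ignore-daemonsets --delete-local-data

# stop the service, then clean up leftover container processes
sudo systemctl stop k3s
sudo /usr/local/bin/k3s-killall.sh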

@freeekanayaka

freeekanayaka commented Apr 16, 2020

Receiving errors like this after a little while:

avril 15 19:22:30 bbs-edscpu-p001.bbs.aphp.fr k3s[40012]: time="2020-04-15T19:22:30.268644397+02:00" level=error msg="error while range on /registry/longhorn.io/settings/ /registry/longhorn.io/settings/: context canceled"
avril 15 19:22:30 bbs-edscpu-p001.bbs.aphp.fr k3s[40012]: panic: runtime error: index out of range [2127] with length 232

This specific error should be fixed when using latest go-dqlite, libdqlite and libraft.

@freeekanayaka

Is there a way to stop k3s gracefully to avoid this problem?
Currently using systemctl restart k3s is a bit aggressive, don't you think?

The way to avoid this problem is to upgrade these dependencies. If that still doesn't work for you, please let me know.

@SteelCrow

SteelCrow commented Apr 17, 2020

So I've compiled kine, dqlite-build, and k3s with RAFT_VER=v0.9.18 DQLITE_VER=v1.4.1 DQLITE_VER=v1.5.0 on arm64 and started testing on my Raspberry Pi 4B cluster with 3 master nodes. I didn't see the previous crash for about 4 hours while restarting services or shutting down nodes. But dqlite leader election is not working: if I turn off the first master (deployed with --cluster-init), the whole cluster fails with this error:

Node 1: turned off

Node 2:

Apr 17 18:51:15 rpi-01 k3s[10100]: F0417 18:51:15.721930   10100 controllermanager.go:213] leaderelection lost
Apr 17 18:51:16 rpi-01 systemd[1]: k3s.service: Main process exited, code=exited, status=255/EXCEPTION

Node 3:

Apr 17 18:52:21 rpi-02 k3s[5600]: time="2020-04-17T18:52:21.325072836Z" level=error msg="error while range on /registry/apiextensions.k8s.io/customresourcedefinitions /registry/apiextensions.k8s.io/customresourcedefinitions: failed to create dqlite connection: no available dqlite leader server found"

@freeekanayaka

I'm not totally sure about this one, but it sounds like either a deployment issue of some kind on your side or k3s/kine not forming the dqlite cluster correctly.

Probably something k3s/kine devs should look at and possibly get back to me if they find dqlite bugs or have questions about how to correctly form a cluster.

I'd suggest opening a separate issue for this though, since it seems unrelated to the original one.

@cjellick cjellick added this to the v1.18 - April milestone May 5, 2020
@cjellick
Contributor

cjellick commented May 5, 2020

Thanks for the clarification on the unrelated issue, @freeekanayaka.

Regarding the original issue, we've bumped the necessary dependencies here:
https://github.com/rancher/dqlite-build/pull/13/files

@mikeccuk2005 or anyone else who hit this issue: can you attempt to reproduce with v1.18.2 (either an RC or the real release once it's out)? Since dqlite support is still experimental, this won't be considered a blocker for 1.18.2, and thus we won't hold up the release to test it.
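
For anyone reproducing, a quick sketch of pinning the install to a specific release via the standard install script; the exact tag is an assumption, so check the releases page for the RC or final name:

# install or upgrade to a pinned k3s version
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION='v1.18.2+k3s1' sh -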

@jamesorlakin

@SteelCrow I believe your issue is more tied to #1391

@coldpenguin

coldpenguin commented May 18, 2020

...

Can you attempt to reproduce with v1.18.2 (either an RC or the real release once it's out)? Since dqlite support is still experimental, this won't be considered a blocker for 1.18.2, and thus we won't hold up the release to test it.

Came here because I had the error message: load closed segment 0000000000100224-0000000000100489: found 265 entries (expected 266)

Running on an RPi 4 with Buster,
v1.18.2+k3s1 (698e444)

Installed a fresh 3-node cluster. I was trying to debug why everything stopped when node 0 was turned off (à la #1391), so I was turning the service on and off.
I used systemctl stop k3s / start k3s several times (probably 6-7 times) on the first master without issues before the problem occurred. Total uptime < 3 hours.
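
A sketch of the kind of restart loop that reproduces this; the iteration count and sleep are arbitrary choices, not from the report:

# restart the server repeatedly and stop as soon as it fails to come back
for i in $(seq 1 10); do
  sudo systemctl restart k3s || break
  sleep 60
  systemctl is-active --quiet k3s || break
done
sudo journalctl -u k3s | tail -n 20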

@freeekanayaka

Try removing the file 0000000000100224-0000000000100489 from your k3s data directory (I'm not sure where that is).

I don't know the internals of k3s exactly, so there might be an issue with how it creates the dqlite cluster that makes it not really HA.
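
A sketch of that suggestion, assuming the dqlite files live under the /var/lib/rancher/k3s/server/db/ path mentioned earlier in this thread; moving the segment aside rather than deleting it keeps a way back:

sudo systemctl stop k3s

# locate the incomplete closed segment under the server db directory
sudo find /var/lib/rancher/k3s/server/db -name '0000000000100224-0000000000100489'

# move the file found above out of the data directory (path is a placeholder)
sudo mv /var/lib/rancher/k3s/server/db/<segment-path> /root/

sudo systemctl start k3s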

@jamesorlakin

If it's of interest, dqlite may be replaced by etcd in #1770

@davidnuzik
Contributor

We would like to close out this issue. If you strongly object, you can leave comments - or, if you wish, create a separate GH issue.

The dqlite and raft versions have been bumped. We are having a hard time reproducing this issue now - in fact, we have not been able to reproduce it at all.

Further, as @jamesorlakin mentioned, we are indeed moving to embedded etcd in K3s. You will find that this feature has replaced dqlite in the upcoming v1.19.0 release of K3s. We are planning to release K3s v1.19.0 no later than September 1, 2020 (however, this is subject to change).
