
io: load closed segment 0000000000064324-0000000000064793: found 469 entries (expected 470) #1639

Closed
mikeccuk2005 opened this issue Apr 14, 2020 · 20 comments

@mikeccuk2005

mikeccuk2005 commented Apr 14, 2020

This is the fifth time this has happened to me.
Random failures after restarting with 'systemctl restart k3s'. It happens after several days of use: I restart with 'systemctl restart k3s' and the server fails to come back up.

Version:
k3s version v1.17.2+k3s1 (cdab19b)
Ubuntu 18.04.3 LTS

K3s arguments:
ExecStart=/usr/local/bin/k3s \
    server \
    '--cluster-init' \
    '--tls-san' \
    '10.7.55.77' \
    'K3S_KUBECONFIG_MODE=664'

What was installed on the k3s cluster:

  • Longhorn
  • Rbac-manager

What was configured:

  • Custom Registries with custom CA

What the log showed:

sudo journalctl -u k3s | tail -n 20
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 18964 (longhorn-manage) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 19021 (csi-node-driver) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 19072 (longhorn-manage) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 19608 (csi-provisioner) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 7526 (pause) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 17808 (npm) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 systemd[1]: k3s.service: Found left-over process 17858 (node) in control group while starting unit. Ignoring.
Apr 14 16:26:52 devops02 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 14 16:26:52 devops02 k3s[7257]: time="2020-04-14T16:26:52.507407670+07:00" level=info msg="Starting k3s v1.17.2+k3s1 (cdab19b0)"
Apr 14 16:26:53 devops02 k3s[7257]: time="2020-04-14T16:26:53.197376154+07:00" level=fatal msg="starting kubernetes: preparing server: start cluster and https: raft_start(): io: load closed segment 0000000000064324-0000000000064793: found 469 entries (expected 470)"
Apr 14 16:26:53 devops02 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Apr 14 16:26:53 devops02 systemd[1]: k3s.service: Failed with result 'exit-code'.
Apr 14 16:26:53 devops02 systemd[1]: Failed to start Lightweight Kubernetes.
@SteelCrow

SteelCrow commented Apr 14, 2020

I have the same problem with my cluster on a Raspberry Pi 4B, so it has effectively become single-use: any service restart or node reboot simply destroys the cluster.

It's probably a dependency issue:
k3s-io/kine#29
canonical/dqlite#190

@leolb-aphp

Similar issue here

@mikeccuk2005
Author

I reinstalled everything.
I do not recommend using the built-in dqlite for production, especially with multiple masters; it will break everything quickly.
Instead I hooked k3s up to PostgreSQL, and it works like a charm (for now).
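
For reference, a minimal sketch of pointing k3s at an external PostgreSQL datastore instead of the embedded dqlite; the host, credentials, and database name below are placeholders, not values from this thread:

# stop the service before switching datastores
sudo systemctl stop k3s

# start the server against an external PostgreSQL endpoint
k3s server \
  --datastore-endpoint='postgres://k3s:changeme@10.0.0.10:5432/kubernetes?sslmode=disable' \
  --tls-san 10.7.55.77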

@leolb-aphp

@mikeccuk2005 Don't close the issue, the bug isn't fixed yet.

@leolb-aphp

leolb-aphp commented Apr 15, 2020

I've manually recompiled k3s with go-dqlite v1.4.1 - will report back if it resolves the issue after a little while of uptime.
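
A rough sketch of that kind of dependency bump, assuming a checkout of the rancher/k3s source tree and a working libdqlite/libraft build environment; the exact build target may differ by release:

# bump the Go binding and refresh the vendored copy
go mod edit -require=github.com/canonical/go-dqlite@v1.4.1
go mod tidy
go mod vendor

# rebuild the k3s binary
make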

@mikeccuk2005 mikeccuk2005 reopened this Apr 15, 2020
@mikeccuk2005
Author

In the database directory (/var/lib/rancher/k3s/server/db/) there were snapshot files.
Can I roll back to one of them?
I would recommend that k3s add a backup and restore feature if this issue keeps occurring.
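
A minimal file-level backup sketch along those lines, assuming the embedded database lives under /var/lib/rancher/k3s/server/db/ as noted above; this is a plain copy, not an officially supported restore path:

# stop k3s so the database files are quiescent
sudo systemctl stop k3s

# copy the whole db directory somewhere safe
sudo cp -a /var/lib/rancher/k3s/server/db /root/k3s-db-backup-$(date +%F)

# start the service again
sudo systemctl start k3s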

@leolb-aphp

leolb-aphp commented Apr 15, 2020

Receiving errors like this after a little while:

avril 15 19:22:30 k3s[40012]: time="2020-04-15T19:22:30.268644397+02:00" level=error msg="error while range on /registry/longhorn.io/settings/ /registry/longhorn.io/settings/: context canceled"
avril 15 19:22:30 k3s[40012]: panic: runtime error: index out of range [2127] with length 232
avril 15 19:22:30 k3s[40012]: goroutine 19865 [running]:
avril 15 19:22:30 k3s[40012]: github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol.(*Message).lastByte(...)
avril 15 19:22:30 k3s[40012]: /go/src/github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol/message.go:476
avril 15 19:22:30 k3s[40012]: github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol.(*Rows).Close(0xc00f036fa8, 0xc00c406a20, 0x3)
avril 15 19:22:30 k3s[40012]: /go/src/github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/internal/protocol/message.go:631 +0x14c
avril 15 19:22:30 k3s[40012]: github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/driver.(*Rows).Close(0xc00f036f80, 0x42ff6a, 0x6dc47a0)
avril 15 19:22:30 k3s[40012]: /go/src/github.com/rancher/k3s/vendor/github.com/canonical/go-dqlite/driver/driver.go:560 +0x35
avril 15 19:22:30 k3s[40012]: database/sql.(*Rows).close.func1()
avril 15 19:22:30 k3s[40012]: /usr/local/go/src/database/sql/sql.go:3076 +0x3c
avril 15 19:22:30 k3s[40012]: database/sql.withLock(0x4773ee0, 0xc00ef8c000, 0xc00b44e6f8)
avril 15 19:22:30 k3s[40012]: /usr/local/go/src/database/sql/sql.go:3184 +0x6d
avril 15 19:22:30 k3s[40012]: database/sql.(*Rows).close(0xc00f037000, 0x4718940, 0xc0001321e0, 0x0, 0x0)
avril 15 19:22:30 k3s[40012]: /usr/local/go/src/database/sql/sql.go:3075 +0x129
avril 15 19:22:30 k3s[40012]: database/sql.(*Rows).awaitDone(0xc00f037000, 0x47d9320, 0xc00c8b6c00, 0x0, 0x0)
avril 15 19:22:30 k3s[40012]: /usr/local/go/src/database/sql/sql.go:2733 +0x101
avril 15 19:22:30 k3s[40012]: created by database/sql.(*Rows).initContextClose
avril 15 19:22:30 k3s[40012]: /usr/local/go/src/database/sql/sql.go:2717 +0xc1

And @mikeccuk2005 - I think backup of storage and Kubernetes configuration should be done at another level. Storage you can back up through your storage class; if it's local-path, find the path and copy the files away. For configuration, just save/export all the YAMLs.

Backing up and restoring using the raw db files seems unsafe.
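
A quick sketch of the YAML export mentioned above; note that 'kubectl get all' only covers common namespaced resources, so CRDs and other cluster-scoped objects need separate exports:

# export workloads and services from every namespace
kubectl get all --all-namespaces -o yaml > cluster-resources.yaml

# export custom resource definitions as well
kubectl get crd -o yaml > crds.yaml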

@leolb-aphp

And (code=killed, signal=SEGV) on other nodes, so it doesn't solve the problem.

@freeekanayaka

See k3s-io/kine#29 (comment)

@mikeccuk2005
Author

Is there a way to stop k3s gracefully to avoid this problem?
Currently using systemctl restart k3s is a bit aggressive, don't you think?
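
For what it's worth, a sketch of a more controlled shutdown sequence; draining is standard Kubernetes practice, and the k3s install script also ships a k3s-killall.sh helper that cleans up the container processes the service leaves behind (the node name below is taken from the logs above):

# move workloads off the node first
kubectl drain devops02 --ignore-daemonsets --delete-local-data

# stop the service, then clean up leftover container processes
sudo systemctl stop k3s
sudo /usr/local/bin/k3s-killall.sh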

@freeekanayaka

freeekanayaka commented Apr 16, 2020

Receiving errors like this after a little while:

avril 15 19:22:30 bbs-edscpu-p001.bbs.aphp.fr k3s[40012]: time="2020-04-15T19:22:30.268644397+02:00" level=error msg="error while range on /registry/longhorn.io/settings/ /registry/longhorn.io/settings/: context canceled"
avril 15 19:22:30 bbs-edscpu-p001.bbs.aphp.fr k3s[40012]: panic: runtime error: index out of range [2127] with length 232

This specific error should be fixed when using latest go-dqlite, libdqlite and libraft.

@freeekanayaka

Is there a way to stop k3s gracefully to avoid this problem?
Currently using systemctl restart k3s is a bit aggressive, don't you think?

The way to avoid this problem is to upgrade these dependencies. If that still doesn't work for you, please let me know.

@SteelCrow

SteelCrow commented Apr 17, 2020

So I've compiled kine, dqlite-build, and k3s with RAFT_VER=v0.9.18 DQLITE_VER=v1.4.1 DQLITE_VER=v1.5.0 on arm64 and started testing on my Raspberry Pi 4B cluster with 3 master nodes. I didn't see the previous crash for about 4 hours while restarting services or shutting down nodes. But dqlite leader election is not working: if I turn off the first master (deployed with --cluster-init), the whole cluster fails with this error:

Node 1: turned off

Node 2:

Apr 17 18:51:15 rpi-01 k3s[10100]: F0417 18:51:15.721930   10100 controllermanager.go:213] leaderelection lost
Apr 17 18:51:16 rpi-01 systemd[1]: k3s.service: Main process exited, code=exited, status=255/EXCEPTION

Node 3:

Apr 17 18:52:21 rpi-02 k3s[5600]: time="2020-04-17T18:52:21.325072836Z" level=error msg="error while range on /registry/apiextensions.k8s.io/customresourcedefinitions /registry/apiextensions.k8s.io/customresourcedefinitions: failed to create dqlite connection: no available dqlite leader server found"

@freeekanayaka

I'm not totally sure about this one, but it sounds like either a deployment issue of some kind on your side or k3s/kine not forming the dqlite cluster correctly.

Probably something k3s/kine devs should look at and possibly get back to me if they find dqlite bugs or have questions about how to correctly form a cluster.

I'd suggest opening a separate issue for this though, since it seems unrelated to the original one.

@cjellick cjellick added this to the v1.18 - April milestone May 5, 2020
@cjellick
Contributor

cjellick commented May 5, 2020

Thanks for the clarification on the unrelated issue, @freeekanayaka.

Regarding the original issue, we've bumped the necessary dependencies here:
https://github.com/rancher/dqlite-build/pull/13/files

@mikeccuk2005 or anyone else who hit this issue: can you attempt to reproduce with v1.18.2 (either an RC or the real release once it's out)? Since dqlite support is still experimental, this won't be considered a blocker for 1.18.2, and thus we won't hold up the release to test it.
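
For anyone reproducing, a quick sketch of pinning the install to a specific release via the standard install script; the exact tag is an assumption, so check the releases page for the RC or final name:

# install or upgrade to a pinned k3s version
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION='v1.18.2+k3s1' sh -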

@jamesorlakin

@SteelCrow I believe your issue is more tied to #1391

@coldpenguin

coldpenguin commented May 18, 2020

...

Can you attempt to reproduce with v1.18.2 (either an RC or the real release once it's out)? Since dqlite support is still experimental, this won't be considered a blocker for 1.18.2, and thus we won't hold up the release to test it.

Came here because I had the error message: load closed segment 0000000000100224-0000000000100489: found 265 entries (expected 266)

Running on an RPi 4 with Buster,
v1.18.2+k3s1 (698e444)

Installed a fresh 3-node cluster. I was trying to debug why everything stopped when node 0 was turned off (à la #1391), so I was turning the service on and off.
I used systemctl stop k3s / start k3s several times (probably 6-7 times) on the first master without issues before the problem occurred. Total uptime < 3 hours.
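
A sketch of the kind of restart loop that reproduces this; the iteration count and sleep are arbitrary choices, not from the report:

# restart the server repeatedly and stop as soon as it fails to come back
for i in $(seq 1 10); do
  sudo systemctl restart k3s || break
  sleep 60
  systemctl is-active --quiet k3s || break
done
sudo journalctl -u k3s | tail -n 20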

@freeekanayaka

Try removing the file 0000000000100224-0000000000100489 from your k3s data directory (I'm not sure where that is).

I don't know the internals of k3s exactly, so there might be an issue with how it creates the dqlite cluster that makes it not really HA.
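
A sketch of that suggestion, assuming the dqlite files live under the /var/lib/rancher/k3s/server/db/ path mentioned earlier in this thread; moving the segment aside rather than deleting it keeps a way back:

sudo systemctl stop k3s

# locate the incomplete closed segment under the server db directory
sudo find /var/lib/rancher/k3s/server/db -name '0000000000100224-0000000000100489'

# move the file found above out of the data directory (path is a placeholder)
sudo mv /var/lib/rancher/k3s/server/db/<segment-path> /root/

sudo systemctl start k3s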

@jamesorlakin

If it's of interest, dqlite may be replaced by etcd in #1770

@davidnuzik
Contributor

We would like to close out this issue. If you strongly object, you can leave comments - or, if you wish, create a separate GH issue.

The dqlite and raft versions have been bumped. We are having a hard time reproducing this issue now - in fact, we have not been able to reproduce it at all.

Further, as @jamesorlakin mentioned, we are indeed moving to embedded etcd in K3s. You will find that this feature has replaced dqlite in the upcoming v1.19.0 release of K3s. We are planning to release K3s v1.19.0 no later than September 1, 2020 (however, this is subject to change).
