
CockroachDB unable to start due to gossip cache error after crashes #3883

Closed
knz opened this issue Jan 15, 2016 · 4 comments · Fixed by #3887
knz (Contributor) commented Jan 15, 2016

So I am running this 5-node cluster on AWS and have been running tests that have crashed the 5 servers regularly. For the second time in a week now, I have reached a point where the servers start but never become ready to serve requests:

lubuntu@ip-172-31-63-242:~$ ./cockroach sql --insecure
# Welcome to the cockroach SQL interface.
# All statements must be terminated by a semicolon.
# To exit: CTRL + D.
ip-172-31-63-242:26257> show databases;
I0115 10:49:42.003156 23384 context.go:168  running in insecure mode, this is strongly discouraged. See --insecure and --certs.
query error: 404 Not Found

The startup log then informs:

I0115 10:49:15.268256 23320 cli/start.go:141  build Vers: go1.5.2
I0115 10:49:15.268312 23320 cli/start.go:142  build Tag:  v0.1-alpha-424-g78ae391
I0115 10:49:15.268324 23320 cli/start.go:143  build Time: 2016/01/15 03:40:26
I0115 10:49:15.268340 23320 cli/start.go:144  build Deps: github.com/VividCortex/ewma:c34099b489e4ac33ca8d8c5f9d29d6eeaf69f2ed github.com/biogo/store:3b4c041f52c224ee4a44f5c8b150d003a40643a0 github.com/cockroachdb/c-lz4:c40aaae2fc50293eb8750b34632bc3efe813e23f github.com/cockroachdb/c-protobuf:4feb192131ea08dfbd7253a00868ad69cbb61b81 github.com/cockroachdb/c-rocksdb:b7fb7bddcb55be35eacdf67e9e2c931083ce02c4 github.com/cockroachdb/c-snappy:5c6d0932e0adaffce4bfca7bdf2ac37f79952ccf github.com/cockroachdb/cockroach:78ae391f73d5a275b9eb6e9b95dc054653517933 github.com/codahale/hdrhistogram:954f16e8b9ef0e5d5189456aa4c1202758e04f17 github.com/coreos/etcd:cb3ca4f8fbc58a900e3b606c40b84d137a9b7abf github.com/elazarl/go-bindata-assetfs:57eb5e1fc594ad4b0b1dbea7b286d299e0cb43c2 github.com/gogo/protobuf:7b1331554dbe882cb3613ee8f1824a5583627963 github.com/google/btree:cc6329d4279e3f025a53a83c397d2339b5705c45 github.com/julienschmidt/httprouter:21439ef4d70ba4f3e2a5ed9249e7b03af4019b40 github.com/lib/pq:11fc39a580a008f1f39bb3d11d984fb34ed778d9 github.com/mattn/go-runewidth:d96d1bd051f2bd9e7e43d602782b37b93b1b5666 github.com/montanaflynn/stats:2c10aa99e7ec8c4607d4427ec7a1a60fcdfce85f github.com/olekukonko/tablewriter:48dc4474bcf3e0134e9a64222207b1e020f171e9 github.com/peterh/liner:3f1c20449d1836aa4cbe38731b96f95cdf89634d github.com/rcrowley/go-metrics:7839c01b09d2b1d7068034e5fe6e423f6ac5be22 github.com/spf13/cobra:2a426b5c596880d305d848c6765850e2d1bee95c github.com/spf13/pflag:7f60f83a2c81bc3c3c0d5297f61ddfa68da9d3b7 golang.org/x/crypto:f23ba3a5ee43012fcb4b92e1a2a405a92554f4f2 golang.org/x/net:415f1917e1dbc946ec834288a8a1e5ff6eee2900 golang.org/x/text:cf4986612c83df6c55578ba198316d1684a9a287 gopkg.in/yaml.v1:9f9df34309c04878acc86042b16630b0f696e1de
I0115 10:49:15.268380 23320 server/context.go:186  1 storage engine(s) specified
I0115 10:49:15.268399 23320 cli/start.go:176  starting cockroach node
W0115 10:49:15.271757 23320 server/server.go:102  running in insecure mode, this is strongly discouraged. See --insecure and --certs.
I0115 10:49:15.272768 23320 gossip/resolver/node_lookup.go:83  querying http://cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257/_status/details/local for gossip nodes
I0115 10:49:15.272974 23320 storage/engine/rocksdb.go:106  opening rocksdb instance at "data"
E0115 10:49:15.277232 23320 gossip/gossip.go:711  invalid bootstrap address: &{context:0xc8201d2640 typ:http-lb addr:cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257 exhausted:true httpClient:0xc8201d5f20}, Get http://cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257/_status/details/local: EOF
I0115 10:49:15.317581 23320 server/node.go:293  initialized store store=1:1 ([ssd]=data): {Capacity:8312655872 Available:6661312512 RangeCount:0}
I0115 10:49:15.317633 23320 server/node.go:220  node ID 1 initialized
I0115 10:49:15.317728 23320 gossip/gossip.go:218  setting node descriptor node_id:1 address:<network_field:"tcp" address_field:"ip-172-31-63-242:26257" > attrs:<> 
I0115 10:49:15.317869 23320 gossip/gossip.go:267  read 6 gossip host(s) for bootstrapping from persistent storage
W0115 10:49:15.317927 23320 gossip/gossip.go:300  bad bootstrap address cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257: gossip/resolver/resolver.go:96: unknown address network "http-lb" for cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257
I0115 10:49:15.317952 23320 server/node.go:389  connecting to gossip network to verify cluster ID...
I0115 10:49:16.282043 23320 gossip/gossip.go:916  starting client to ip-172-31-61-70:26257
W0115 10:49:16.283335 23320 rpc/client.go:320  dial tcp 172.31.61.70:26257: getsockopt: connection refused
W0115 10:49:17.176693 23320 rpc/client.go:320  dial tcp 172.31.61.70:26257: getsockopt: connection refused
I0115 10:49:17.286613 23320 gossip/gossip.go:916  starting client to ip-172-31-60-70:26257
W0115 10:49:17.287973 23320 rpc/client.go:320  dial tcp 172.31.60.70:26257: getsockopt: connection refused
W0115 10:49:18.168184 23320 rpc/client.go:320  dial tcp 172.31.60.70:26257: getsockopt: connection refused
I0115 10:49:18.291404 23320 gossip/gossip.go:916  starting client to ip-172-31-59-91:26257
W0115 10:49:18.292573 23320 rpc/client.go:320  dial tcp 172.31.59.91:26257: getsockopt: connection refused
W0115 10:49:19.203451 23320 rpc/client.go:320  dial tcp 172.31.59.91:26257: getsockopt: connection refused
I0115 10:49:19.294408 23320 gossip/gossip.go:916  starting client to ip-172-31-52-229:26257
W0115 10:49:19.296450 23320 rpc/client.go:320  rpc/codec/tls.go:62: unexpected HTTP response: 404 Not Found
W0115 10:49:19.421446 23320 rpc/client.go:320  dial tcp 172.31.61.70:26257: getsockopt: connection refused
E0115 10:49:20.294719 23320 gossip/gossip.go:711  invalid bootstrap address: &{context:0xc8201d2640 typ:http-lb addr:cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257 exhausted:false httpClient:0xc8201d5f20}, gossip/resolver/node_lookup.go:62: skipping temporarily-exhausted resolver
W0115 10:49:20.312869 23320 rpc/client.go:320  dial tcp 172.31.60.70:26257: getsockopt: connection refused
W0115 10:49:20.404995 23320 rpc/client.go:320  rpc/codec/tls.go:62: unexpected HTTP response: 404 Not Found
W0115 10:49:21.126951 23320 rpc/client.go:320  dial tcp 172.31.59.91:26257: getsockopt: connection refused
I0115 10:49:21.297595 23320 gossip/resolver/node_lookup.go:83  querying http://cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257/_status/details/local for gossip nodes
E0115 10:49:21.301657 23320 gossip/gossip.go:711  invalid bootstrap address: &{context:0xc8201d2640 typ:http-lb addr:cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257 exhausted:true httpClient:0xc8201d5f20}, Get http://cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257/_status/details/local: EOF
W0115 10:49:22.224956 23320 rpc/client.go:320  rpc/codec/tls.go:62: unexpected HTTP response: 404 Not Found
E0115 10:49:22.304356 23320 gossip/gossip.go:711  invalid bootstrap address: &{context:0xc8201d2640 typ:http-lb addr:cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257 exhausted:false httpClient:0xc8201d5f20}, gossip/resolver/node_lookup.go:62: skipping temporarily-exhausted resolver
I0115 10:49:23.307207 23320 gossip/resolver/node_lookup.go:83  querying http://cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257/_status/details/local for gossip nodes
E0115 10:49:23.309944 23320 gossip/gossip.go:711  invalid bootstrap address: &{context:0xc8201d2640 typ:http-lb addr:cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257 exhausted:true httpClient:0xc8201d5f20}, Get http://cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257/_status/details/local: EOF
W0115 10:49:23.834328 23320 rpc/client.go:320  rpc/codec/tls.go:62: unexpected HTTP response: 404 Not Found

(the last few messages then repeat forever)

Not sure whether this is a bug in the product or in the documentation. The problem for now is that I do not know how to reset the gossip state to a clean state and get my servers to start again, and I could not find this information in the documentation.

tbg (Member) commented Jan 15, 2016

This is one for @spencerkimball; it was introduced in #3711. But please (always) post the exact invocations: it's not clear how you're starting your server, so it's not clear whether you're passing a gossip bootstrap list, etc.

knz (Contributor, Author) commented Jan 15, 2016

This is with 78ae391.

OK, after further investigation: did the syntax/semantics of --gossip change? I am getting the following:

With --gossip=tcp=self

E0115 11:28:46.174715 24916 gossip/gossip.go:711  invalid bootstrap address: &{typ:tcp addr:self:26257 exhausted:false}, lookup self on 172.31.0.2:53: no such host

With --gossip=tcp=localhost

W0115 11:29:25.447627 24923 rpc/client.go:320  dial tcp 127.0.0.1:26257: getsockopt: connection refused

With --gossip=http-lb=...

E0115 11:30:50.379144 25005 gossip/gossip.go:711  invalid bootstrap address: &{context:0xc8201b6a00 typ:http-lb addr:cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257 exhausted:false httpClient:0xc8201dd110}, gossip/resolver/node_lookup.go:62: skipping temporarily-exhausted resolver
I0115 11:30:51.410366 25005 gossip/resolver/node_lookup.go:83  querying http://cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257/_status/details/local for gossip nodes
E0115 11:30:51.412951 25005 gossip/gossip.go:711  invalid bootstrap address: &{context:0xc8201b6a00 typ:http-lb addr:cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257 exhausted:true httpClient:0xc8201dd110}, Get http://cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257/_status/details/local: EOF
E0115 11:30:52.441588 25005 gossip/gossip.go:711  invalid bootstrap address: &{context:0xc8201b6a00 typ:http-lb addr:cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257 exhausted:false httpClient:0xc8201dd110}, gossip/resolver/node_lookup.go:62: skipping temporarily-exhausted resolver
I0115 11:30:53.472719 25005 gossip/resolver/node_lookup.go:83  querying http://cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257/_status/details/local for gossip nodes
E0115 11:30:53.475259 25005 gossip/gossip.go:711  invalid bootstrap address: &{context:0xc8201b6a00 typ:http-lb addr:cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257 exhausted:true httpClient:0xc8201dd110}, Get http://cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257/_status/details/local: EOF

knz (Contributor, Author) commented Jan 15, 2016

(Note that I am now getting these errors even after erasing the data stores completely.)

tbg (Member) commented Jan 15, 2016

Try --gossip=self= (the help text below is misleading):

      --gossip string
        A comma-separated list of gossip addresses or resolvers for gossip
        bootstrap. Each item in the list has an optional type:
        [type=]<address>. An unspecified type means ip address or dns.
        Type is one of:
        - tcp: (default if type is omitted): plain ip address or hostname,
          or "self" for single-node systems.
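
For reference, a sketch of the two invocations being compared. The hostnames and resolver types are taken from the logs above; the exact flag syntax may differ between builds, so treat this as illustrative rather than authoritative:

    # Single-node bootstrap: the node gossips with itself.
    ./cockroach start --insecure --gossip=self=

    # Bootstrap via a load balancer, as in the original 5-node cluster.
    # The http-lb resolver type is what ended up cached in the persisted
    # gossip bootstrap addresses, triggering the startup failure.
    ./cockroach start --insecure \
      --gossip=http-lb=cockroach-knz-elb-540260865.us-east-1.elb.amazonaws.com:26257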

spencerkimball added a commit that referenced this issue Jan 15, 2016
Instead, we only record addresses gossiped by nodes. The previous
code was using the resolver type for the address network type which
was brain dead. There's no reason to set this list with resolvers
in any case. That was a misguided notion and unnecessary.

Fixes #3883
3 participants