mongodb petsets extended tests: dns record of replica cannot be resolved #12588

mmilata · 2017-01-20T17:18:10Z

Seen here: https://ci.openshift.redhat.com/jenkins/job/origin_extended_image_tests/825/testReport/junit/(root)/Extended/_image_ecosystem__mongodb__Slow__openshift_mongodb_replication__with_petset__creating_from_a_template_should_process_and_create_the__https___raw_githubusercontent_com_sclorg_mongodb_container_master_examples_petset_mongodb_petset_persistent_yaml__template/

The test fails when reading a record from the second replica, mongodb-replicaset-1:

Jan 20 04:39:16.600: INFO: Running 'oc exec --config=/tmp/extended-test-mongodb-petset-replica-jn4bx-blvrl-user.kubeconfig --namespace=extended-test-mongodb-petset-replica-jn4bx-blvrl mongodb-replicaset-1 -- bash -c mongo --quiet "$MONGODB_DATABASE" --username "$MONGODB_USER" --password "$MONGODB_PASSWORD" --eval 'rs.slaveOk(); printjson(db.test.find({}, {_id: 0}).toArray())''
Jan 20 04:39:16.911: INFO: Error running &{/data/src/github.com/openshift/origin/_output/local/bin/linux/amd64/oc [oc exec --config=/tmp/extended-test-mongodb-petset-replica-jn4bx-blvrl-user.kubeconfig --namespace=extended-test-mongodb-petset-replica-jn4bx-blvrl mongodb-replicaset-1 -- bash -c mongo --quiet "$MONGODB_DATABASE" --username "$MONGODB_USER" --password "$MONGODB_PASSWORD" --eval 'rs.slaveOk(); printjson(db.test.find({}, {_id: 0}).toArray())'] []   
2017-01-20T09:39:16.902+0000 E QUERY    [thread1] Error: Authentication failed. :
DB.prototype._authOrThrow@src/mongo/shell/db.js:1441:20
@(auth):6:1
@(auth):1:2

exception: login failed

This, in turn, seems to be caused by the first replica, mongodb-replicaset-0, not being able to resolve mongodb-replicaset-1's hostname:

2017-01-20T09:39:02.354971000Z 2017-01-20T09:39:02.354+0000 W NETWORK  [conn7] getaddrinfo("mongodb-replicaset-1.mongodb-replicaset.extended-test-mongodb-petset-replica-jn4bx-blvrl.svc.cluster.local") failed: Name or service not known
2017-01-20T09:39:02.356061000Z 2017-01-20T09:39:02.355+0000 I NETWORK  [conn7] getaddrinfo("mongodb-replicaset-1.mongodb-replicaset.extended-test-mongodb-petset-replica-jn4bx-blvrl.svc.cluster.local") failed: Name or service not known
2017-01-20T09:39:02.356295000Z 2017-01-20T09:39:02.355+0000 I REPL     [conn7] replSetReconfig config object with 2 members parses ok
2017-01-20T09:39:02.357005000Z 2017-01-20T09:39:02.356+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-replicaset-1.mongodb-replicaset.extended-test-mongodb-petset-replica-jn4bx-blvrl.svc.cluster.local:27017
2017-01-20T09:39:02.358586000Z 2017-01-20T09:39:02.358+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to mongodb-replicaset-1.mongodb-replicaset.extended-test-mongodb-petset-replica-jn4bx-blvrl.svc.cluster.local:27017 - HostUnreachable: HostUnreachable
2017-01-20T09:39:02.358807000Z 2017-01-20T09:39:02.358+0000 W REPL     [ReplicationExecutor] Failed to complete heartbeat request to mongodb-replicaset-1.mongodb-replicaset.extended-test-mongodb-petset-replica-jn4bx-blvrl.svc.cluster.local:27017; HostUnreachable: HostUnreachable

Can't seem to reproduce this locally.

bparees · 2017-01-20T18:14:47Z

Hm. the pods aren't supposed to report ready until they can at least resolve their own hostname. i'd think once they can do that, their hostname could be resolved by other pods too.

but regardless it may mean we need more robust startup logic that waits for the DNS to be resolvable.
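
As a rough illustration of "startup logic that waits for the DNS to be resolvable", here is a minimal Python sketch. It is not the container's actual startup script: `stateful_pod_fqdn` and `wait_for_dns` are hypothetical helper names, and the default timeout and interval are assumptions. The hostname layout follows the names seen in the logs above (`<pod>.<service>.<namespace>.svc.cluster.local`).

```python
import socket
import time


def stateful_pod_fqdn(pod, service, namespace, cluster_domain="cluster.local"):
    """Build the per-pod DNS name in the form seen in the mongod logs."""
    return f"{pod}.{service}.{namespace}.svc.{cluster_domain}"


def wait_for_dns(hostname, timeout=60.0, interval=1.0,
                 resolve=socket.getaddrinfo, sleep=time.sleep,
                 clock=time.monotonic):
    """Poll until hostname resolves; return False once the timeout passes.

    resolve, sleep and clock are injectable so the loop can be exercised
    without a real DNS server (an assumption made for this sketch).
    """
    deadline = clock() + timeout
    while True:
        try:
            resolve(hostname, None)
            return True
        except socket.gaierror:
            # Same failure class as "Name or service not known" in the logs.
            if clock() >= deadline:
                return False
            sleep(interval)
```

A pod could call `wait_for_dns` on its own name and on its peers' names before starting replication setup, which is the ordering discussed in this thread.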

mmilata · 2017-01-23T13:38:41Z

Seen another failure where the test seems to query the mongodb cluster before it is ready: https://ci.openshift.redhat.com/jenkins/job/origin_extended_image_tests/828/consoleText

We currently only wait until the mongodb pods are Running; maybe we need to add some kind of readiness probe and wait for them to be Ready?

bparees · 2017-01-23T16:59:32Z

i think adding a readiness probe will cause problems w/ the replica init since the pods contact each other before they are "up/ready". There is the tolerate unready annotation, but then i think you're back to the original problem.

Solving this may require fixing the test to explicitly wait for a condition.
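
One way to picture "explicitly wait for a condition" is the Python sketch below. It is not the origin test code, which is a separate codebase. It assumes the test can obtain the output of `rs.status()` as a dict whose `members` entries carry `health` and `stateStr`; `replica_set_ready` and `wait_for` are hypothetical names, and how the status is fetched is left out.

```python
import time


def replica_set_ready(status, expected_members):
    """True when a PRIMARY exists and enough members are healthy.

    status is assumed to be the parsed output of rs.status().
    """
    healthy = [
        m for m in status.get("members", [])
        if m.get("health") == 1 and m.get("stateStr") in ("PRIMARY", "SECONDARY")
    ]
    has_primary = any(m.get("stateStr") == "PRIMARY" for m in healthy)
    return has_primary and len(healthy) >= expected_members


def wait_for(condition, timeout=300.0, interval=2.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll condition() until it is true; return False on timeout."""
    deadline = clock() + timeout
    while not condition():
        if clock() >= deadline:
            return False
        sleep(interval)
    return True
```

The test would query the data only after `wait_for` returns True, rather than as soon as the pods are Running.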

jim-minter · 2017-02-02T14:15:42Z

I think https://ci.openshift.redhat.com/jenkins/job/origin_extended_image_tests/837/consoleFull is another example of this issue.

Pod 1 starts at 9.50.34.542, attempts to register against pod 0 at 9.50.34.764, gives up and goes home at 9.50.35.372 because 0 can't find 1 in the DNS. Almost certainly that's not giving enough time for k8s' DNS to settle.

Sleep and retry logic is needed in the container at registration time (e.g. have slaves try to register 10 times, with a 1sec gap between failures).
The test shouldn't be allowed in until the cluster is properly up and running. I think this implies defining two services: an "internal" one with TolerateUnreadyEndpointsAnnotation set and an "external" one without (see "Readiness probe optionally not affecting receiving traffic via service", kubernetes/kubernetes#39207 (comment)), plus a readiness probe that reports a pod as ready only after its init scripts have completed.
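
The sleep-and-retry suggestion (10 attempts with a 1-second gap) could look roughly like the Python sketch below. This is only an illustration of the retry pattern, not the change that went into the container image; `retry` is a hypothetical helper, and `action` stands in for whatever performs the registration against the primary.

```python
import time


def retry(action, attempts=10, delay=1.0, sleep=time.sleep):
    """Call action() up to `attempts` times, sleeping `delay` between failures.

    Returns the first successful result; raises RuntimeError, chained to
    the last error, if every attempt fails.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            last_error = exc
            if attempt < attempts:
                # Give cluster DNS time to settle before the next attempt.
                sleep(delay)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

A fixed gap keeps the behaviour easy to reason about in CI logs; a longer or growing delay would be an equally valid choice if DNS takes more than a few seconds to settle.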

@bparees is this a case of sending a PR to https://github.com/sclorg/mongodb-container or opening an issue there, or something else?

bparees · 2017-02-02T15:12:43Z

@jim-minter my read of the logs is that this pod:

Feb  1 04:51:44.924: INFO: Running 'oc logs --config=/tmp/extended-test-mongodb-petset-replica-pv639-mtv6m-user.kubeconfig --namespace=extended-test-mongodb-petset-replica-pv639-mtv6m mongodb-replicaset-1 --timestamps'

could not find itself:

2017-02-01T09:50:35.367813000Z 	"errmsg" : "Quorum check failed because not enough voting nodes responded; required 2 but only the following 1 voting nodes responded: mongodb-replicaset-0.mongodb-replicaset.extended-test-mongodb-petset-replica-pv639-mtv6m.svc.cluster.local:27017; the following nodes did not respond affirmatively: mongodb-replicaset-1.mongodb-replicaset.extended-test-mongodb-petset-replica-pv639-mtv6m.svc.cluster.local:27017 failed with HostUnreachable",

Which i had hoped to have fixed earlier by forcing mongo to connect to itself via DNS before considering itself started and beginning the process of setting up replication:
sclorg/mongodb-container#217

That said, both of your suggestions sound reasonable, if you are in a position to do so you can just submit the PR and tag me.

jim-minter · 2017-02-02T15:19:49Z

> That said, both of your suggestions sound reasonable, if you are in a position to do so you can just submit the PR and tag me.

Will do.

mmilata added area/tests kind/test-flake labels Jan 20, 2017

bparees assigned mmilata Jan 20, 2017

bparees added priority/P2 component/image labels Jan 20, 2017

jim-minter assigned jim-minter and unassigned mmilata Feb 1, 2017

This was referenced Feb 9, 2017

Take more care in handling statefulset pods in mongodb test #12887

Merged

Replica set fixes sclorg/mongodb-container#227

Merged

openshift-bot closed this as completed in #12887 Feb 11, 2017
