mongodb petsets extended tests: dns record of replica cannot be resolved #12588

Closed
mmilata opened this issue Jan 20, 2017 · 6 comments
Labels: area/tests, component/image, kind/test-flake, priority/P2

Comments

@mmilata
Contributor

mmilata commented Jan 20, 2017

Seen here: https://ci.openshift.redhat.com/jenkins/job/origin_extended_image_tests/825/testReport/junit/(root)/Extended/_image_ecosystem__mongodb__Slow__openshift_mongodb_replication__with_petset__creating_from_a_template_should_process_and_create_the__https___raw_githubusercontent_com_sclorg_mongodb_container_master_examples_petset_mongodb_petset_persistent_yaml__template/

The test fails when reading a record from the second replica, mongodb-replicaset-1:

Jan 20 04:39:16.600: INFO: Running 'oc exec --config=/tmp/extended-test-mongodb-petset-replica-jn4bx-blvrl-user.kubeconfig --namespace=extended-test-mongodb-petset-replica-jn4bx-blvrl mongodb-replicaset-1 -- bash -c mongo --quiet "$MONGODB_DATABASE" --username "$MONGODB_USER" --password "$MONGODB_PASSWORD" --eval 'rs.slaveOk(); printjson(db.test.find({}, {_id: 0}).toArray())''
Jan 20 04:39:16.911: INFO: Error running &{/data/src/github.com/openshift/origin/_output/local/bin/linux/amd64/oc [oc exec --config=/tmp/extended-test-mongodb-petset-replica-jn4bx-blvrl-user.kubeconfig --namespace=extended-test-mongodb-petset-replica-jn4bx-blvrl mongodb-replicaset-1 -- bash -c mongo --quiet "$MONGODB_DATABASE" --username "$MONGODB_USER" --password "$MONGODB_PASSWORD" --eval 'rs.slaveOk(); printjson(db.test.find({}, {_id: 0}).toArray())'] []   
2017-01-20T09:39:16.902+0000 E QUERY    [thread1] Error: Authentication failed. :
DB.prototype._authOrThrow@src/mongo/shell/db.js:1441:20
@(auth):6:1
@(auth):1:2

exception: login failed

This, in turn, seems to be caused by the first replica, mongodb-replicaset-0, not being able to resolve mongodb-replicaset-1's hostname:

2017-01-20T09:39:02.354971000Z 2017-01-20T09:39:02.354+0000 W NETWORK  [conn7] getaddrinfo("mongodb-replicaset-1.mongodb-replicaset.extended-test-mongodb-petset-replica-jn4bx-blvrl.svc.cluster.local") failed: Name or service not known
2017-01-20T09:39:02.356061000Z 2017-01-20T09:39:02.355+0000 I NETWORK  [conn7] getaddrinfo("mongodb-replicaset-1.mongodb-replicaset.extended-test-mongodb-petset-replica-jn4bx-blvrl.svc.cluster.local") failed: Name or service not known
2017-01-20T09:39:02.356295000Z 2017-01-20T09:39:02.355+0000 I REPL     [conn7] replSetReconfig config object with 2 members parses ok
2017-01-20T09:39:02.357005000Z 2017-01-20T09:39:02.356+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-replicaset-1.mongodb-replicaset.extended-test-mongodb-petset-replica-jn4bx-blvrl.svc.cluster.local:27017
2017-01-20T09:39:02.358586000Z 2017-01-20T09:39:02.358+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to mongodb-replicaset-1.mongodb-replicaset.extended-test-mongodb-petset-replica-jn4bx-blvrl.svc.cluster.local:27017 - HostUnreachable: HostUnreachable
2017-01-20T09:39:02.358807000Z 2017-01-20T09:39:02.358+0000 W REPL     [ReplicationExecutor] Failed to complete heartbeat request to mongodb-replicaset-1.mongodb-replicaset.extended-test-mongodb-petset-replica-jn4bx-blvrl.svc.cluster.local:27017; HostUnreachable: HostUnreachable

Can't seem to reproduce this locally.

@mmilata added the area/tests and kind/test-flake labels on Jan 20, 2017
@bparees
Contributor

bparees commented Jan 20, 2017

Hm. The pods aren't supposed to report ready until they can at least resolve their own hostname. I'd think that once they can do that, their hostname could be resolved by other pods too.

But regardless, it may mean we need more robust startup logic that waits for the DNS name to be resolvable.
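
For illustration, a minimal sketch of what "wait for the DNS to be resolvable" could look like. It is written in Python only to show the shape of the loop; the container's real startup scripts are not shown in this thread, and `wait_for_dns`, its parameters, and the default timeout are all hypothetical.

```python
import socket
import time


def wait_for_dns(hostname, timeout=60.0, interval=1.0, resolve=socket.getaddrinfo):
    """Block until `hostname` resolves or `timeout` seconds have passed.

    Returns True on success and False on timeout. `resolve` is injectable
    so the loop can be exercised without a real DNS server.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            resolve(hostname, None)
            return True
        except socket.gaierror:
            # Same failure class as the "Name or service not known" log lines.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A replica would call this with the peer's full name (the `mongodb-replicaset-1.mongodb-replicaset.<namespace>.svc.cluster.local` form seen in the logs) before attempting `replSetReconfig`.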

@mmilata
Contributor Author

mmilata commented Jan 23, 2017

Seen another failure where the test seems to query the mongodb cluster before it is ready: https://ci.openshift.redhat.com/jenkins/job/origin_extended_image_tests/828/consoleText

We currently only wait until the mongodb pods are Running; maybe we need to add some kind of readiness probe and wait for them to be Ready?

@bparees
Contributor

bparees commented Jan 23, 2017

I think adding a readiness probe will cause problems with the replica init, since the pods contact each other before they are "up/ready". There is the tolerate-unready annotation, but then I think you're back to the original problem.

Solving this may require fixing the test to explicitly wait for a condition.
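
One possible shape for such an explicit wait, sketched in Python rather than in the test suite's own language. The readiness rule (exactly one PRIMARY, everything else SECONDARY) is an assumption about what "ready" should mean here, and `fetch_status` is a hypothetical callable that returns an `rs.status()`-style document, for example by wrapping an `oc exec` call.

```python
import time


def replica_set_ready(status, expected_members):
    """Return True if an rs.status()-style document shows a usable replica set."""
    members = status.get("members", [])
    states = [m.get("stateStr") for m in members]
    return (
        len(members) == expected_members
        and states.count("PRIMARY") == 1
        and all(s in ("PRIMARY", "SECONDARY") for s in states)
    )


def wait_for_replica_set(fetch_status, expected_members, timeout=300.0, interval=2.0):
    """Poll fetch_status() until replica_set_ready() holds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if replica_set_ready(fetch_status(), expected_members):
                return True
        except Exception:
            # Errors while mongod is still starting count as "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The test would then only run its `db.test.find(...)` query against mongodb-replicaset-1 after `wait_for_replica_set` has returned True.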

@jim-minter assigned jim-minter and unassigned mmilata on Feb 1, 2017
@jim-minter
Contributor

I think https://ci.openshift.redhat.com/jenkins/job/origin_extended_image_tests/837/consoleFull is another example of this issue.

Pod 1 starts at 09:50:34.542, attempts to register against pod 0 at 09:50:34.764, and gives up at 09:50:35.372 because pod 0 can't find pod 1 in the DNS. That almost certainly doesn't give k8s' DNS enough time to settle. I think two changes are needed:

  1. Sleep-and-retry logic is needed in the container at registration time (e.g. have slaves try to register 10 times, with a 1-second gap between failures).
  2. The test shouldn't be allowed in until the cluster is properly up and running. I think this implies defining two services: an "internal" one with TolerateUnreadyEndpointsAnnotation set and an "external" one without (see kubernetes/kubernetes#39207 (comment), "Readiness probe optionally not affecting receiving traffic via service"), plus a readiness probe that marks a pod ready no earlier than the completion of its init scripts.
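
The retry logic from point 1 could be sketched roughly as follows. This is Python for illustration only, not the container's actual registration code; `retry` and `operation` are hypothetical names, and the defaults mirror the "10 times, 1 second apart" example.

```python
import time


def retry(operation, attempts=10, delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds, at most `attempts` times.

    Sleeps `delay` seconds after each failed attempt and re-raises the
    last error if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay)
```

Here `operation` would stand in for whatever the replica does to register itself with pod 0. A fixed delay keeps the sketch simple; a backoff would work equally well.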

@bparees is this a case of sending a PR to https://github.com/sclorg/mongodb-container or opening an issue there, or something else?

@bparees
Contributor

bparees commented Feb 2, 2017

@jim-minter my read of the logs is that this pod:

Feb  1 04:51:44.924: INFO: Running 'oc logs --config=/tmp/extended-test-mongodb-petset-replica-pv639-mtv6m-user.kubeconfig --namespace=extended-test-mongodb-petset-replica-pv639-mtv6m mongodb-replicaset-1 --timestamps'

could not find itself:

2017-02-01T09:50:35.367813000Z 	"errmsg" : "Quorum check failed because not enough voting nodes responded; required 2 but only the following 1 voting nodes responded: mongodb-replicaset-0.mongodb-replicaset.extended-test-mongodb-petset-replica-pv639-mtv6m.svc.cluster.local:27017; the following nodes did not respond affirmatively: mongodb-replicaset-1.mongodb-replicaset.extended-test-mongodb-petset-replica-pv639-mtv6m.svc.cluster.local:27017 failed with HostUnreachable",

I had hoped to have fixed this earlier by forcing mongo to connect to itself via DNS before considering itself started and beginning to set up replication:
sclorg/mongodb-container#217
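
As a rough illustration of the "connect to itself via DNS" idea (this is not the change made in that PR, whose contents are not shown here), a self-check could look like the Python sketch below. `wait_for_self` and its parameters are hypothetical; port 27017 is taken from the logs above.

```python
import socket
import time


def wait_for_self(host, port=27017, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    `host` should be the pod's own DNS name, so that success implies both
    that the name resolves and that mongod is accepting connections.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers resolution failures (socket.gaierror) and refused connections.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Note that this only proves the pod can resolve its own name; as the failures in this issue suggest, other pods may still be unable to resolve it at that moment.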

That said, both of your suggestions sound reasonable, if you are in a position to do so you can just submit the PR and tag me.

@jim-minter
Contributor

> That said, both of your suggestions sound reasonable, if you are in a position to do so you can just submit the PR and tag me.

Will do.


3 participants