[CI] ConfigurationTests test20HostnameSubstitution failing #109660

Closed
piergm opened this issue Jun 13, 2024 · 5 comments · Fixed by #111216
Labels
:Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts low-risk An open issue or test failure that is a low risk to future releases Team:Delivery Meta label for Delivery team >test-failure Triaged test failures from CI

Comments

@piergm
Member

piergm commented Jun 13, 2024

Build scan:
https://gradle-enterprise.elastic.co/s/rszk57krw5vti/tests/:qa:packaging:destructiveDistroTest.default-linux-archive/org.elasticsearch.packaging.test.ConfigurationTests/test20HostnameSubstitution

Reproduction line:

null

Applicable branches:
main

Reproduces locally?:
Didn't try

Failure history:
Failure dashboard for org.elasticsearch.packaging.test.ConfigurationTests#test20HostnameSubstitution

Failure excerpt:

java.lang.RuntimeException: Request failed:
HTTP/1.1 503 Service Unavailable
{"error":{"root_cause":[{"type":"cluster_block_exception","reason":"blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];"}],"type":"cluster_block_exception","reason":"blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];"},"status":503}

  at __randomizedtesting.SeedInfo.seed([14018179B976132C:BC50ABF8CCA0EA90]:0)
  at org.elasticsearch.packaging.util.ServerUtils.makeRequest(ServerUtils.java:366)
  at org.elasticsearch.packaging.test.ConfigurationTests.lambda$test20HostnameSubstitution$1(ConfigurationTests.java:53)
  at org.elasticsearch.packaging.test.PackagingTestCase.assertWhileRunning(PackagingTestCase.java:298)
  at org.elasticsearch.packaging.test.ConfigurationTests.lambda$test20HostnameSubstitution$3(ConfigurationTests.java:52)
  at org.elasticsearch.packaging.test.PackagingTestCase.withCustomConfig(PackagingTestCase.java:537)
  at org.elasticsearch.packaging.test.ConfigurationTests.test20HostnameSubstitution(ConfigurationTests.java:41)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
  at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1570)

@piergm piergm added :Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts >test-failure Triaged test failures from CI Team:Delivery Meta label for Delivery team labels Jun 13, 2024
@elasticsearchmachine elasticsearchmachine added the needs:risk Requires assignment of a risk label (low, medium, blocker) label Jun 13, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@mark-vieira
Contributor

This looks more like a timeout as the node simply wasn't up yet. This has failed only one time in the last 30 days and hasn't reproduced since. I'm just going to close this for now. We can reopen if this continues to fail.

@mark-vieira
Contributor

I hit this again. Reopening.

https://gradle-enterprise.elastic.co/s/6wi4lbyqjbd2o

@mark-vieira mark-vieira reopened this Jun 18, 2024
@mark-vieira mark-vieira added the low-risk An open issue or test failure that is a low risk to future releases label Jun 18, 2024
@elasticsearchmachine elasticsearchmachine removed the needs:risk Requires assignment of a risk label (low, medium, blocker) label Jun 18, 2024
@mark-vieira
Contributor

I believe the issue here is simply that the node is not yet ready to serve responses. It seems that when security is enabled on a test node, we use a considerably less reliable heuristic for determining whether the node is "up". In ServerUtils we simply look for a listening socket on port 9200, but I don't think that's sufficient. The reason we don't use the health endpoint for this is that the node might be set up with security auto-configuration, in which case we don't know the credentials.
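To illustrate why a listening socket is a weak signal, here is a minimal sketch of that kind of check. The class and method names (`PortProbe`, `isListening`) are hypothetical and are not taken from ServerUtils; the sketch only shows that a successful TCP connect proves the port is bound, not that the cluster state has been recovered.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper for illustration; not the actual ServerUtils code.
public final class PortProbe {

    private PortProbe() {}

    // Returns true if a TCP connection to host:port succeeds within the timeout.
    // A true result only means something accepted the connection. The HTTP layer
    // behind it may still answer 503 while the node finishes initializing.
    public static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused or timed out: nothing is accepting connections yet.
            return false;
        }
    }
}
```

A check like this can return true while a request to the node still fails with the `cluster_block_exception` shown in the failure excerpt.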

For now I've opened #111216 to simply wrap the assertion in an assertBusy but there could be a better solution here as this only fixes this problem for this single test case. @rjernst would using the readiness endpoint be better here? Is that only available in the Docker distribution?
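As a rough sketch of the workaround described above, a retry wrapper in the spirit of assertBusy might look like the following. This is not the Elasticsearch test framework's implementation and not the code from #111216; `RetryAssert`, its `CheckedRunnable` interface, and the backoff values are assumptions made for illustration. It also retries on `RuntimeException`, which is an assumption based on the exception type in the failure excerpt.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for an assertBusy-style helper; names and backoff are assumptions.
public final class RetryAssert {

    private RetryAssert() {}

    @FunctionalInterface
    public interface CheckedRunnable {
        void run() throws Exception;
    }

    // Runs the check repeatedly until it passes or the time budget is spent.
    // The last failure is rethrown once the deadline has passed.
    public static void assertBusy(CheckedRunnable check, long maxWait, TimeUnit unit) throws Exception {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        while (true) {
            try {
                check.run();
                return; // the check passed
            } catch (AssertionError | RuntimeException e) {
                if (System.nanoTime() >= deadline) {
                    throw e; // out of time: surface the most recent failure
                }
            }
            Thread.sleep(sleepMillis);
            // Exponential backoff, capped so the check is still polled regularly.
            sleepMillis = Math.min(sleepMillis * 2, 500);
        }
    }
}
```

With a helper like this, a request that initially gets a 503 is retried until the node can serve it, instead of failing the test on the first attempt. As noted above, this only covers the call sites that are wrapped.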

mark-vieira added a commit to mark-vieira/elasticsearch that referenced this issue Jul 23, 2024
…astic#111216)

This is an attempt to fix occasional test failures where asserting on a
request response fails because the cluster has not finished
initialization and cannot yet serve requests.

Closes elastic#109660
@rjernst
Member

rjernst commented Jul 24, 2024

> would using the readiness endpoint be better here?

Yes, I think so. However, we don't run the readiness service except in cloud environments. Ultimately the node itself should wait on the readiness service (see #102325). Until that work is finished, I think using an assertBusy is an ok workaround.

elasticsearchmachine pushed a commit that referenced this issue Jul 24, 2024
…11216) (#111219)

This is an attempt to fix occasional test failures where asserting on a
request response fails because the cluster has not finished
initialization and cannot yet serve requests.

Closes #109660
elasticsearchmachine pushed a commit that referenced this issue Jul 24, 2024
…11216) (#111218)

This is an attempt to fix occasional test failures where asserting on a
request response fails because the cluster has not finished
initialization and cannot yet serve requests.

Closes #109660

4 participants