[CI] DefaultEndPointsIT testInferDeploysDefaultE5 failing #115361

Open
elasticsearchmachine opened this issue Oct 22, 2024 · 7 comments
Labels

  • low-risk: An open issue or test failure that is a low risk to future releases
  • :ml: Machine learning
  • Team:ML: Meta label for the ML team
  • >test-failure: Triaged test failures from CI

Comments

@elasticsearchmachine
Collaborator

elasticsearchmachine commented Oct 22, 2024

Build Scans:

Reproduction Line:

./gradlew ":x-pack:plugin:inference:qa:inference-service-tests:javaRestTest" --tests "org.elasticsearch.xpack.inference.DefaultEndPointsIT.testInferDeploysDefaultE5" -Dtests.seed=FF3CF2BC792F7F1B -Dtests.locale=my -Dtests.timezone=Asia/Irkutsk -Druntime.java=17 -Dtests.fips.enabled=true

Applicable branches:
8.x

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.net.SocketTimeoutException: ၁၂၀,၀၀၀ milliseconds timeout on connection http-outgoing-1 [ACTIVE]
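
The digits in the message above are Myanmar numerals for 120,000; the same exception is quoted later in this thread as "120,000 milliseconds". They are presumably rendered this way because the reproduction line sets -Dtests.locale=my. When searching logs for this failure, it can help to normalize such digits to ASCII first. A minimal sketch, assuming nothing beyond the JDK (the class and method names are hypothetical, not part of Elasticsearch):

```java
public class LocalizedDigits {

    // Replaces every Unicode decimal digit with its ASCII equivalent.
    // All other characters, including separators, pass through unchanged.
    static String toAsciiDigits(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (Character.isDigit(cp)) {
                out.append((char) ('0' + Character.digit(cp, 10)));
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The failure message, with the Myanmar digits written as Unicode escapes
        // (U+1041 = 1, U+1042 = 2, U+1040 = 0).
        String message = "java.net.SocketTimeoutException: "
            + "\u1041\u1042\u1040,\u1040\u1040\u1040"
            + " milliseconds timeout on connection http-outgoing-1 [ACTIVE]";
        System.out.println(toAsciiDigits(message));
    }
}
```

Normalizing before matching means a single search pattern covers failures from runs with different randomized test locales.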

Issue Reasons:

  • [8.x] 5 failures in test testInferDeploysDefaultE5 (0.7% fail rate in 667 executions)
  • [8.x] 3 failures in step part-2 (3.2% fail rate in 93 executions)
  • [8.x] 3 failures in pipeline elasticsearch-pull-request (3.2% fail rate in 93 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

elasticsearchmachine added the :ml (Machine learning) and >test-failure (Triaged test failures from CI) labels Oct 22, 2024
@elasticsearchmachine
Collaborator Author

This has been muted on branch main

Mute Reasons:

  • [main] 2 failures in test testInferDeploysDefaultE5 (0.9% fail rate in 214 executions)

Build Scans:

elasticsearchmachine added the Team:ML (Meta label for the ML team) and needs:risk (Requires assignment of a risk label: low, medium, blocker) labels Oct 22, 2024
@elasticsearchmachine
Collaborator Author

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Collaborator Author

This has been muted on branch 8.16

Mute Reasons:

  • [8.16] 2 failures in test testInferDeploysDefaultE5 (0.5% fail rate in 441 executions)

Build Scans:

smalyshev pushed a commit to smalyshev/elasticsearch that referenced this issue Oct 23, 2024
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this issue Oct 25, 2024
@elasticsearchmachine
Collaborator Author

This has been muted on branch 8.x

Mute Reasons:

  • [8.x] 5 failures in test testInferDeploysDefaultE5 (0.7% fail rate in 667 executions)
  • [8.x] 3 failures in step part-2 (3.2% fail rate in 93 executions)
  • [8.x] 3 failures in pipeline elasticsearch-pull-request (3.2% fail rate in 93 executions)

Build Scans:

maxhniebergall added the low-risk (An open issue or test failure that is a low risk to future releases) label Oct 29, 2024
elasticsearchmachine removed the needs:risk (Requires assignment of a risk label: low, medium, blocker) label Oct 29, 2024
@maxhniebergall
Member

The feature under test here is behind a feature flag, so despite these tests being muted, I believe this is low-risk overall.

jfreden pushed a commit to jfreden/elasticsearch that referenced this issue Nov 4, 2024
@davidkyle
Member

davidkyle commented Nov 5, 2024

The test has a 30 second timeout but the model download took much longer. The fix is to use a longer timeout.

[o.e.x.m.p.a.TransportLoadTrainedModelPackage] [test-cluster-0] [.multilingual-e5-small_linux-x86_64] finished model import after [85] seconds

@dan-rubinstein
Member

dan-rubinstein commented Nov 12, 2024

The builds above seem to have 2 failures:

  1. The expected and actual max_number_of_allocations differ. This seems to have been fixed as part of this change.

  2. Timeout failures when making an inference call to the default endpoints.

Looking at the timeout failures, it seems that the exception message is:

java.net.SocketTimeoutException: 120,000 milliseconds timeout on connection http-outgoing-1 [ACTIVE]

It seems the test is hitting a 120 second timeout rather than the 30 second timeout mentioned above. The original call to create the endpoint has a 30 second timeout, but that timeout does not stop the deployment process and we do not wait on its response. Instead, we likely need to raise the 120 second timeout here. I'll put out a PR to raise the timeout and unmute the test.
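
The distinction above, that a caller-side timeout expiring does not stop the work it was waiting on, can be sketched with plain JDK futures. This is a scaled-down simulation, not Elasticsearch code: the class, the method names, and the millisecond values are all hypothetical stand-ins for the deployment and the two waits.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DeploymentTimeoutSketch {

    // Hypothetical stand-in for a slow model download and deployment.
    static CompletableFuture<String> startDeployment(long workMillis) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(workMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            return "deployed";
        });
    }

    // Returns true if the caller saw a result within its own timeout.
    static boolean waitFor(CompletableFuture<String> deployment, long timeoutMillis) {
        try {
            deployment.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // The caller gave up, but the future is not cancelled: work continues.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 300 ms of work stands in for a model import that outlasts the first wait.
        CompletableFuture<String> deployment = startDeployment(300);
        System.out.println("short wait saw result: " + waitFor(deployment, 50));
        System.out.println("long wait saw result: " + waitFor(deployment, 2000));
    }
}
```

In this simulation the short wait gives up while the work carries on, and a later wait with a larger budget still observes the result. That mirrors the reasoning for raising the timeout on the inference call rather than on the endpoint creation call.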
