[CI] OpenAiServiceUpgradeIT testOpenAiCompletions {upgradedNodes=2} failing #118163

Open
elasticsearchmachine opened this issue Dec 6, 2024 · 8 comments · Fixed by #118624
Labels
  • medium-risk (An open issue or test failure that is a medium risk to future releases)
  • :ml (Machine learning)
  • Team:ML (Meta label for the ML team)
  • >test-failure (Triaged test failures from CI)

Comments

@elasticsearchmachine
Collaborator

elasticsearchmachine commented Dec 6, 2024

Build Scans:

Reproduction Line:

./gradlew ":x-pack:plugin:inference:qa:rolling-upgrade:v8.14.3#bwcTest" -Dtests.class="org.elasticsearch.xpack.application.OpenAiServiceUpgradeIT" -Dtests.method="testOpenAiCompletions {upgradedNodes=1}" -Dtests.seed=FE841534EA5BAB36 -Dtests.bwc=true -Dtests.locale=xnr-Deva-IN -Dtests.timezone=Pacific/Rarotonga -Druntime.java=23

Applicable branches:
8.17

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.NullPointerException: Cannot invoke "java.util.List.addAll(java.util.Collection)" because "configs" is null

Issue Reasons:

  • [8.17] 2 consecutive failures in step 8.14.3_bwc
  • [8.17] 6 failures in test testOpenAiCompletions {upgradedNodes=2} (3.2% fail rate in 188 executions)
  • [8.17] 6 failures in step 8.14.3_bwc (54.5% fail rate in 11 executions)
  • [8.17] 6 failures in pipeline elasticsearch-periodic (54.5% fail rate in 11 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine elasticsearchmachine added :ml Machine learning >test-failure Triaged test failures from CI labels Dec 6, 2024
@elasticsearchmachine
Collaborator Author

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine elasticsearchmachine added Team:ML Meta label for the ML team needs:risk Requires assignment of a risk label (low, medium, blocker) labels Dec 6, 2024
@maxhniebergall maxhniebergall added medium-risk An open issue or test failure that is a medium risk to future releases and removed needs:risk Requires assignment of a risk label (low, medium, blocker) labels Dec 9, 2024
@maxhniebergall
Member

I'm seeing the following message in the error dashboard:

org.elasticsearch.client.ResponseException: method [GET], host [http://[::1]:40811], URI [_inference/completion/old-cluster-completions], status line [HTTP/1.1 400 Bad Request]
{"error":"no handler found for uri [_inference/completion/old-cluster-completions] and method [GET]"}

@maxhniebergall
Member

maxhniebergall commented Dec 9, 2024

I think the root cause here is that the tests are no longer running in the order we are expecting (i.e., the oldCluster, then the mixedCluster, then the upgradedCluster).

@elasticsearchmachine
Collaborator Author

This has been muted on branch 8.x

Mute Reasons:

  • [8.x] 13 consecutive failures in step 8.14.3_bwc
  • [8.x] 13 failures in test testOpenAiCompletions {upgradedNodes=2} (2.1% fail rate in 606 executions)
  • [8.x] 13 failures in step 8.14.3_bwc (100.0% fail rate in 13 executions)
  • [8.x] 12 failures in pipeline elasticsearch-periodic (100.0% fail rate in 12 executions)

Build Scans:

elasticsearchmachine added a commit that referenced this issue Dec 9, 2024
@maxhniebergall
Member

All of the UpgradeIT tests that started failing around the same time are on 8.x.

@maxhniebergall
Member

Experimental testing shows that these upgrade failures were caused by #118105, but why that PR causes them is still unknown.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Dec 12, 2024
We need to load the two fields from the same response. Otherwise, we can have a sort of race
where we load "endpoints" from pre-8.15 as empty and then load "models" from a post-8.15 node
also empty, resulting in an empty list because we took the wrong info from either response.

closes elastic#118163
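
For readers following along, here is a minimal sketch of the race described in the commit message above. It is not the actual test code: the class `SameResponseSketch`, its methods, and the simulated `mixedCluster()` supplier are hypothetical names made up for illustration. The only details taken from the thread are that pre-8.15 nodes return their configs under "models" while post-8.15 nodes use "endpoints", and that the endpoint is called `old-cluster-completions`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

class SameResponseSketch {

    /** Racy variant: each field is read from a separate request, which may land on different nodes. */
    static List<String> loadRacy(Supplier<Map<String, List<String>>> fetch) {
        List<String> configs = new ArrayList<>();
        configs.addAll(fetch.get().getOrDefault("endpoints", List.of()));
        configs.addAll(fetch.get().getOrDefault("models", List.of()));
        return configs;
    }

    /** Fixed variant: a single request, with both fields read from that same response. */
    static List<String> loadFromSameResponse(Supplier<Map<String, List<String>>> fetch) {
        Map<String, List<String>> response = fetch.get();
        List<String> configs = new ArrayList<>();
        configs.addAll(response.getOrDefault("endpoints", List.of()));
        configs.addAll(response.getOrDefault("models", List.of()));
        return configs;
    }

    /** Simulated mixed cluster (assumption): requests alternate between an old node and a new node. */
    static Supplier<Map<String, List<String>>> mixedCluster() {
        int[] calls = {0};
        return () -> (calls[0]++ % 2 == 0)
            ? Map.of("models", List.of("old-cluster-completions"))     // pre-8.15 style response
            : Map.of("endpoints", List.of("old-cluster-completions")); // post-8.15 style response
    }

    public static void main(String[] args) {
        System.out.println("racy:  " + loadRacy(mixedCluster()));
        System.out.println("fixed: " + loadFromSameResponse(mixedCluster()));
    }
}
```

In this simulation the racy variant asks the old node for "endpoints" and the new node for "models", so both lookups come back empty even though the endpoint exists. Reading both keys from one response avoids that regardless of which node answers.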
elasticsearchmachine pushed a commit that referenced this issue Dec 13, 2024
…ing (#118624) (#118663)

We need to load the two fields from the same response. Otherwise, we can have a sort of race
where we load "endpoints" from pre-8.15 as empty and then load "models" from a post-8.15 node
also empty, resulting in an empty list because we took the wrong info from either response.

closes #118163
elasticsearchmachine pushed a commit that referenced this issue Dec 13, 2024
#118664

```
- class: org.elasticsearch.xpack.application.CohereServiceUpgradeIT
  method: testRerank {upgradedNodes=1}
  issue: #116973
- class: org.elasticsearch.xpack.application.CohereServiceUpgradeIT
  method: testCohereEmbeddings {upgradedNodes=1}
  issue: #116974
- class: org.elasticsearch.xpack.application.CohereServiceUpgradeIT
  method: testCohereEmbeddings {upgradedNodes=2}
  issue: #116975

- class: org.elasticsearch.xpack.application.OpenAiServiceUpgradeIT
  method: testOpenAiEmbeddings {upgradedNodes=1}
  issue: #118156
- class: org.elasticsearch.xpack.application.HuggingFaceServiceUpgradeIT
  method: testElser {upgradedNodes=1}
  issue: #118127
- class: org.elasticsearch.xpack.application.OpenAiServiceUpgradeIT
  method: testOpenAiCompletions {upgradedNodes=1}
  issue: #118162
- class: org.elasticsearch.xpack.application.OpenAiServiceUpgradeIT
  method: testOpenAiCompletions {upgradedNodes=2}
  issue: #118163
- class: org.elasticsearch.xpack.application.OpenAiServiceUpgradeIT
  method: testOpenAiEmbeddings {upgradedNodes=2}
  issue: #118204

- class: org.elasticsearch.xpack.application.HuggingFaceServiceUpgradeIT
  method: testHFEmbeddings {upgradedNodes=1}
  issue: #118197
```
elasticsearchmachine added a commit that referenced this issue Dec 13, 2024
@elasticsearchmachine
Collaborator Author

This has been muted on branch 8.x

Mute Reasons:

  • [8.x] 12 consecutive failures in step 8.14.3_bwc
  • [8.x] 12 failures in test testOpenAiCompletions {upgradedNodes=2} (2.3% fail rate in 522 executions)
  • [8.x] 12 failures in step 8.14.3_bwc (100.0% fail rate in 12 executions)
  • [8.x] 11 failures in pipeline elasticsearch-periodic (100.0% fail rate in 11 executions)

Build Scans:

maxhniebergall added a commit to maxhniebergall/elasticsearch that referenced this issue Dec 16, 2024
elastic#118664

maxhniebergall pushed a commit to maxhniebergall/elasticsearch that referenced this issue Dec 16, 2024
@maxhniebergall
Member

It seems the backport of the fix failed for an unrelated reason. I will get the backport (#118664) merged, then unmute the test and close this issue.

elasticsearchmachine pushed a commit that referenced this issue Dec 16, 2024
…ing (#118624) (#118664)

We need to load the two fields from the same response. Otherwise, we can have a sort of race
where we load "endpoints" from pre-8.15 as empty and then load "models" from a post-8.15 node
also empty, resulting in an empty list because we took the wrong info from either response.

closes #118163

Co-authored-by: Max Hniebergall <137079448+maxhniebergall@users.noreply.github.com>