Investigate random flakiness of our Jenkins pipeline #81457
Comments
@kertal How is the validity of the field determined? Is there a chance of a race condition in the way that information is fetched, validated, or refreshed? The flaky test runner and standard CI both run many things in parallel on the same machine; the only difference is that the amount of parallelism on standard CI jobs is a lot higher, but we also use larger machines for those tests. We use a separate ES instance for every test execution, though, so the parallel executions shouldn't interact with each other, except for the kind of interaction that might happen when ES is busy with other tasks, which might be why users have seen this behavior in production as well. |
Took a look at the code, the validation of the field takes place in the |
It's curious that the tests use different index patterns, but the error seems to be the same in all of them. One common denominator I noticed is that the linked tests all run as part of That group also runs the "index pattern without timefield" test, so could something get messed up there? |
@majagrubic "index pattern without time field" is a good hint. However, it's the last of all these tests, so I can't think of a way it could influence the earlier ones. Still, it's good to rule out possible causes. I've started a PR to add the name of the index pattern to the error message. I would love to add more info, but since this is user-facing, I don't think we can add pretty-printed JSON. |
I'm pretty sure I know where this error is coming from: it's not related to the tests themselves, but some of the tests require additional test data and custom index patterns. Lots of the index pattern data we use in our test cases contains fields with a legacy structure. In this case, our index pattern service fetches the fields using the fields API. But this occasionally doesn't work (I had to run our OSS group6 discover test case 300 times, several times over, to reproduce it). I've been adding logs to find out why the index pattern service or Elasticsearch causes trouble here. This is the latest failure: the field refresh doesn't work, and the fetched fields content still contains legacy fields like
The good news: we should be able to fix this by updating our test data, refreshing the index pattern's fields property 🥳 |
My concern with updating test data is that we could be masking an upgrade problem. If it's an issue that the |
@LeeDr yes, the investigation should continue. The problem is that we currently can't tell when it fails, since there is no error; the fields data just stays the same. I've added more logs to investigate. I also think it's fine to update the test data so that fetching is not necessary; that would resolve 4+ issues and stabilize our Jenkins pipeline. Here's the latest flaky test run, 300 tries: https://kibana-ci.elastic.co/job/kibana+flaky-test-suite-runner/957/ |
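One way to surface a refresh that silently changes nothing would be to compare the field list before and after the refresh and fail loudly when they match. The following is only a rough sketch: `FieldSpec`, `fieldFingerprint`, and `assertRefreshChanged` are hypothetical names, not part of the index pattern service, and the check only makes sense when the caller already knows the stored fields need to be replaced.

```typescript
// Hypothetical, simplified field shape; the real field objects carry more properties.
interface FieldSpec {
  name: string;
  type: string;
}

// Builds an order-independent fingerprint of a field list so that two lists
// with the same fields in a different order compare as equal.
function fieldFingerprint(fields: FieldSpec[]): string {
  return fields
    .map((f) => `${f.name}:${f.type}`)
    .sort()
    .join("|");
}

// Throws if a refresh that was expected to change the field list left it as it was.
// The index pattern title is included so the failure can be traced to a pattern.
function assertRefreshChanged(
  before: FieldSpec[],
  after: FieldSpec[],
  indexPatternTitle: string
): void {
  if (fieldFingerprint(before) === fieldFingerprint(after)) {
    throw new Error(
      `Field refresh for index pattern "${indexPatternTitle}" returned an unchanged field list`
    );
  }
}
```

Calling `assertRefreshChanged(storedFields, fetchedFields, title)` right after the fetch would turn the currently silent case into a visible error in the test logs.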
Closing this: it hasn't failed recently and the flaky runs were OK. It's clear what should be improved, though: there should be retry logic for fetching fields to prevent such flakiness. I've opened an issue for that.
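For illustration, retry logic of the kind proposed here might look roughly like the sketch below. It is not the implementation from the follow-up issue: `fetchFieldsWithRetry`, the `FetchFields` callback, the `isValid` predicate, and the default attempt count and delay are all assumptions made for the example.

```typescript
// Hypothetical, simplified field shape; not the actual index pattern service types.
interface FieldSpec {
  name: string;
  type: string;
}

// Stand-in for whatever call actually fetches the fields (assumption).
type FetchFields = () => Promise<FieldSpec[]>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls fetchFields up to maxAttempts times. A result is accepted only if the
// caller-supplied isValid predicate returns true, so a fetch that "succeeds"
// but still returns stale fields is retried as well.
async function fetchFieldsWithRetry(
  fetchFields: FetchFields,
  isValid: (fields: FieldSpec[]) => boolean,
  maxAttempts = 3,
  delayMs = 500
): Promise<FieldSpec[]> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const fields = await fetchFields();
      if (isValid(fields)) {
        return fields;
      }
      lastError = new Error(`attempt ${attempt}: fetched field list failed validation`);
    } catch (err) {
      lastError = err;
    }
    // Wait between attempts, but not after the final one.
    if (attempt < maxAttempts) {
      await sleep(delayMs);
    }
  }
  throw lastError ?? new Error("no fetch attempts were made");
}
```

Passing the validity check in as a predicate keeps the retry loop independent of how a stale or legacy field list is recognized, which is the part that is still under investigation in this thread.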
Recently our functional test pipeline showed a kind of flakiness that was not reproducible with our flaky test runner. Some tests failed randomly; the only thing they had in common was the error message:
#78689
#39842
Note that this is an older screenshot because the error message was improved to give a bit more information
#80812
#82035
This is an odd error: why should the @timestamp suddenly be invalid? ... unless the underlying index pattern has changed? It could be the case that another test suite running in parallel is changing the index pattern, which would explain why it never fails when the flaky test runner is doing its work. And there's a related post in our Discuss forum which indicates that the index pattern was modified without user interaction, excluding @timestamp from _source
https://discuss.elastic.co/t/what-is-excluded-column-in-kibana-index-pattern/252667/10
The user had the same error because the index pattern excluded @timestamp after 5 minutes, without the user actively interacting.