
Investigate random flakiness of our Jenkins pipeline #81457

Closed
kertal opened this issue Oct 22, 2020 · 9 comments


kertal commented Oct 22, 2020

Recently our functional test pipeline showed a kind of flakiness that was not reproducible with our flaky test runner. Some tests failed randomly; the only thing they had in common was the error message:

#78689 (screenshot of the error message)

#39842 (screenshot of the error message; note this is an older screenshot, the error message was later improved to give a bit more information)

#80812 (screenshot of the error message)

#82035 (screenshot of the error message)

This is an odd error. Why should @timestamp suddenly be invalid? ... unless the underlying index pattern has changed? It could be that another test suite running in parallel is changing the index pattern, which would explain why it never fails when the flaky test runner is doing its work.

There's also a related post on our Discuss forum which indicates that the index pattern was modified without user interaction, excluding @timestamp from _source:

https://discuss.elastic.co/t/what-is-excluded-column-in-kibana-index-pattern/252667/10

The user hit the same error because the index pattern excluded @timestamp after 5 minutes, without the user actively interacting: (screenshot from the Discuss post)


spalger commented Oct 22, 2020

@kertal How is the validity of the field determined? Is there a chance for a race condition in the way that information is fetched, validated, or refreshed? The flaky test runner and standard CI both run a bunch of things in parallel on the same machine; the only difference is that the amount of parallelism on standard CI jobs is a lot higher, though we also use larger machines for those tests. We use separate ES instances for every test execution, so there shouldn't be any interaction between the separate parallel executions, except the kind of interaction that might happen when ES is busy with other tasks, which might also be why users have seen this behavior in production.


kertal commented Nov 4, 2020

Took a look at the code: the validation of the field takes place in the data plugin, and @timestamp no longer seems to be an available field of the currently selected index pattern. I think I'll add the title of the index pattern to the error message. It would also be convenient to get the "See the full error" message, but I don't know if there's an easy way to do this in the test suite.
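For illustration, here is a minimal sketch of the kind of check described above: the configured time field must exist (and be usable) on the currently selected index pattern, and the index pattern title is included in the error so failures can be traced back to a specific test data set. The names (IndexPatternField, validateTimeField) are hypothetical, not the actual data plugin API.

```ts
// Hypothetical sketch; not the real data plugin code.
interface IndexPatternField {
  name: string;
  type: string;
  aggregatable?: boolean;
}

interface IndexPattern {
  title: string;
  timeFieldName?: string;
  fields: IndexPatternField[];
}

function validateTimeField(indexPattern: IndexPattern): void {
  const { timeFieldName, title, fields } = indexPattern;
  if (!timeFieldName) return;
  const field = fields.find((f) => f.name === timeFieldName);
  if (!field || !field.aggregatable) {
    // Including the index pattern title tells us which test data set failed.
    throw new Error(
      `Invalid time field "${timeFieldName}" on index pattern "${title}"`
    );
  }
}
```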

majagrubic commented

It's curious that the tests use different index patterns, but the error seems to be the same in all of them.

One common denominator I noticed is that the linked tests all run as part of ciGroup6:
https://github.com/elastic/kibana/blob/0bae5d62c932c670b9da55575fbf5caaffbc88e5/test/functional/apps/discover/index.js

That group also runs the "index pattern without timefield" test, so could something be getting messed up there?


kertal commented Nov 4, 2020

@majagrubic but this index pattern has a different name, and there are also index patterns named logstash-* with this failure.
@spalger is there a way to add better debugging for this case, e.g. if a test fails, additional logs and screenshots are recorded in an afterEach hook?
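As a rough sketch of what such a hook could look like in a Mocha-based suite (the takeScreenshot and dumpBrowserLogs helpers are hypothetical stand-ins for whatever the functional test services provide):

```ts
// Hypothetical helpers; replace with the services the test framework offers.
declare function takeScreenshot(path: string): Promise<void>;
declare function dumpBrowserLogs(path: string): Promise<void>;

afterEach(async function () {
  const test = this.currentTest; // Mocha sets this on the hook context
  if (test && test.state === 'failed') {
    const name = test.fullTitle().replace(/\s+/g, '_');
    await takeScreenshot(`failure_${name}.png`);
    await dumpBrowserLogs(`failure_${name}.log`);
  }
});
```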


kertal commented Nov 4, 2020

@majagrubic "index pattern without time field" is good a good hint, however it's the last of all this tests so I can't think of a way how it should influence the former one, but it's good to rule out possible causes, I've started a PR to add the name of the index pattern to the error message, would love to add more info, but since this is user - facing, don't think we can add nice pretty json

#82604


kertal commented Nov 9, 2020

I'm pretty sure I know where this error is coming from. It's not related to the test itself, but some of the tests require additional test data and custom index patterns. A lot of the index pattern data we use in our test cases contains fields with the legacy structure. What our index pattern service does in this case is fetch the fields using the fields API. But occasionally this doesn't work (I had to run our OSS group6 discover test case 300 times, several times over, to reproduce it). I've been adding additional logs to find out why the index pattern service or Elasticsearch causes trouble here. This is the latest failure:

https://kibana-ci.elastic.co/job/kibana+flaky-test-suite-runner/956/testReport/junit/Chrome%20UI%20Functional%20Tests/test_functional_apps_discover__field_visualize·ts/Kibana_Pipeline___agent_3___discover_app_discover_field_visualize_button_should_be_able_to_visualize_a_field_and_save_the_visualization/

The field refresh doesn't work: the fetched fields still contain legacy properties like analyzed, and because aggregatable is not set in this case, we get the error message.

[00:06:40] │ debg browser[INFO] http://localhost:6121/37921/bundles/plugin/data/data.plugin.js 0:598131 "fields fetched" "{\"referer\":{\"name\":\"referer\",\"type\":\"string\",\"count\":0,\"scripted\":false,\"indexed\":true,\"analyzed\":false,\"doc_values\":true},\"agent\":{\"name\":\"agent\",\"type\":\"string\",\"count\":0,\"scripted\":false,\"indexed\":true,\"analyzed\":true,\"doc_values\":false},\"relatedContent.og:image:width\":

The good news: we should be able to fix this by updating our test data, refreshing the index pattern's fields property 🥳
The bad news: we need to find out why this is happening, so I've been adding more console.logs in a test PR to find out: #82878 (cc: @mattkime). So, off to another flaky test suite run of 300, hoping it will happen again.
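To make the mechanism concrete, here is a small sketch contrasting the legacy field shape (as it appears in the "fields fetched" log above) with the shape a successful fields refresh would return; the exact property names and values in the refreshed example are illustrative, not taken from the actual failure:

```ts
// Legacy field shape stored in some of our test data (see the log above):
// it carries indexed/analyzed/doc_values but no `aggregatable`.
const legacyTimestamp = {
  name: '@timestamp',
  type: 'date',
  count: 0,
  scripted: false,
  indexed: true,
  analyzed: false,
  doc_values: true,
};

// Shape after a successful fields refresh (illustrative values): the
// validation relies on `aggregatable` being present and true.
const refreshedTimestamp = {
  name: '@timestamp',
  type: 'date',
  searchable: true,
  aggregatable: true,
  readFromDocValues: true,
};

// If the refresh silently fails, the index pattern keeps the legacy shape,
// `aggregatable` stays undefined, and the time field is reported as invalid.
```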


LeeDr commented Nov 9, 2020

My concern with updating the test data is that we could be masking an upgrade problem. If fetching fields using the fields API sometimes fails, I don't know whether there's retry logic around that, or whether there needs to be.
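For reference, a minimal sketch of what retry logic around the field fetch could look like; fetchFields is a placeholder for whatever call actually hits the fields API, not the real client method:

```ts
// Placeholder for the actual fields API call.
declare function fetchFields(indexPatternTitle: string): Promise<unknown[]>;

async function fetchFieldsWithRetry(
  title: string,
  attempts = 3,
  delayMs = 1000
): Promise<unknown[]> {
  let lastError: unknown = new Error(`fields refresh for "${title}" failed`);
  for (let i = 0; i < attempts; i++) {
    try {
      const fields = await fetchFields(title);
      // Treat an empty result as a failed refresh so we retry instead of
      // silently keeping the stale (legacy) fields.
      if (fields.length > 0) return fields;
    } catch (e) {
      lastError = e;
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw lastError;
}
```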


kertal commented Nov 9, 2020

@LeeDr yes, the investigation should continue. The problem is that we currently can't tell when it fails, since there is no error; the fields data just stays the same. I've added more logs to investigate. I also think it's fine to update the test data so that fetching isn't necessary; that would resolve 4+ issues and stabilize our Jenkins pipeline. Here's the latest flaky test run (300 tries):

https://kibana-ci.elastic.co/job/kibana+flaky-test-suite-runner/957/


kertal commented Nov 17, 2020

Closing this. It hasn't failed recently and the flaky runs were OK, but it's clear what should be improved: there should be retry logic for fetching fields to prevent this kind of flakiness. I've opened an issue for that:

#83448
