
Investigate random flakiness of our Jenkins pipeline #81457

Closed
kertal opened this issue Oct 22, 2020 · 9 comments


kertal commented Oct 22, 2020

Recently our functional test pipeline showed a kind of flakiness that was not reproducible with our flaky test runner. Some tests failed randomly; the only thing they had in common was the error message:

#78689 (screenshot of the error message)

#39842 (screenshot of the error message; note this is an older screenshot, the error message was later improved to give a bit more information)

#80812 (screenshot of the error message)

#82035 (screenshot of the error message)

This is an odd error. Why should @timestamp suddenly be invalid? ... unless the underlying index pattern has changed? It could be that another test suite running in parallel is changing the index pattern, which would explain why it never fails when the flaky test runner is doing its work.

There's also a related post on our Discuss forum which indicates that the index pattern was modified without user interaction, excluding @timestamp from _source:

https://discuss.elastic.co/t/what-is-excluded-column-in-kibana-index-pattern/252667/10

The user hit the same error because the index pattern excluded @timestamp after 5 minutes, without the user actively interacting: (screenshot from the Discuss post)


spalger commented Oct 22, 2020

@kertal How is the validity of the field determined? Is there a chance for a race condition in the way that information is fetched, validated, or refreshed? The flaky test runner and standard CI both run a bunch of things in parallel on the same machine; the only difference is that the amount of parallelism on standard CI jobs is a lot higher, though we also use larger machines for those tests. We use separate ES instances for every test execution, so there shouldn't be any interaction between the separate parallel executions, except the kind of interaction that might happen when ES is busy with other tasks, which might also be why users have seen this behavior in production.


kertal commented Nov 4, 2020

Took a look at the code: the validation of the field takes place in the data plugin, and @timestamp no longer seems to be an available field of the currently selected index pattern. I think I'll add the title of the index pattern to the error message. It would also be convenient to get the "See the full error" message, but I don't know if there's an easy way to do this in the test suite.
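For illustration, here is a minimal sketch of the kind of check described above: the configured time field must exist (and be usable) on the currently selected index pattern, and the index pattern title is included in the error so failures can be traced back to a specific test data set. The names (IndexPatternField, validateTimeField) are hypothetical, not the actual data plugin API.

```ts
// Hypothetical sketch; not the real data plugin code.
interface IndexPatternField {
  name: string;
  type: string;
  aggregatable?: boolean;
}

interface IndexPattern {
  title: string;
  timeFieldName?: string;
  fields: IndexPatternField[];
}

function validateTimeField(indexPattern: IndexPattern): void {
  const { timeFieldName, title, fields } = indexPattern;
  if (!timeFieldName) return;
  const field = fields.find((f) => f.name === timeFieldName);
  if (!field || !field.aggregatable) {
    // Including the index pattern title tells us which test data set failed.
    throw new Error(
      `Invalid time field "${timeFieldName}" on index pattern "${title}"`
    );
  }
}
```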

majagrubic commented

It's curious that the tests use different index patterns, but the error seems to be the same in all of them.

One common denominator I noticed is that the linked tests all run as part of ciGroup6:
https://github.com/elastic/kibana/blob/0bae5d62c932c670b9da55575fbf5caaffbc88e5/test/functional/apps/discover/index.js

That group also runs the "index pattern without timefield" test, so could something be getting messed up there?


kertal commented Nov 4, 2020

@majagrubic but this index pattern has a different name, and there are also index patterns named logstash-* with this failure.
@spalger is there a way to add better debugging for this case, e.g. if a test fails, additional logs and screenshots are recorded in an afterEach hook?
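As a rough sketch of what such a hook could look like in a Mocha-based suite (the takeScreenshot and dumpBrowserLogs helpers are hypothetical stand-ins for whatever the functional test services provide):

```ts
// Hypothetical helpers; replace with the services the test framework offers.
declare function takeScreenshot(path: string): Promise<void>;
declare function dumpBrowserLogs(path: string): Promise<void>;

afterEach(async function () {
  const test = this.currentTest; // Mocha sets this on the hook context
  if (test && test.state === 'failed') {
    const name = test.fullTitle().replace(/\s+/g, '_');
    await takeScreenshot(`failure_${name}.png`);
    await dumpBrowserLogs(`failure_${name}.log`);
  }
});
```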


kertal commented Nov 4, 2020

@majagrubic "index pattern without time field" is good a good hint, however it's the last of all this tests so I can't think of a way how it should influence the former one, but it's good to rule out possible causes, I've started a PR to add the name of the index pattern to the error message, would love to add more info, but since this is user - facing, don't think we can add nice pretty json

#82604


kertal commented Nov 9, 2020

I'm pretty sure I know where this error is coming from. It's not related to the test itself, but some of the tests require additional test data and custom index patterns. A lot of the index pattern data we use in our test cases contains fields with the legacy structure. What our index pattern service does in this case is fetch the fields using the fields API. But occasionally this doesn't work (I had to run our OSS group6 discover test case 300 times, several times over, to reproduce it). I've been adding additional logs to find out why the index pattern service or Elasticsearch causes trouble here. This is the latest failure:

https://kibana-ci.elastic.co/job/kibana+flaky-test-suite-runner/956/testReport/junit/Chrome%20UI%20Functional%20Tests/test_functional_apps_discover__field_visualize·ts/Kibana_Pipeline___agent_3___discover_app_discover_field_visualize_button_should_be_able_to_visualize_a_field_and_save_the_visualization/

The field refresh doesn't work: the fetched fields still contain legacy properties like analyzed, and because aggregatable is not set in this case, we get the error message.

[00:06:40] │ debg browser[INFO] http://localhost:6121/37921/bundles/plugin/data/data.plugin.js 0:598131 "fields fetched" "{\"referer\":{\"name\":\"referer\",\"type\":\"string\",\"count\":0,\"scripted\":false,\"indexed\":true,\"analyzed\":false,\"doc_values\":true},\"agent\":{\"name\":\"agent\",\"type\":\"string\",\"count\":0,\"scripted\":false,\"indexed\":true,\"analyzed\":true,\"doc_values\":false},\"relatedContent.og:image:width\":

The good news: we should be able to fix this by updating our test data, refreshing the index pattern's fields property 🥳
The bad news: we need to find out why this is happening, so I've been adding more console.logs in a test PR to find out: #82878 (cc: @mattkime). So, off to another flaky test suite run of 300, hoping it will happen again.
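To make the mechanism concrete, here is a small sketch contrasting the legacy field shape (as it appears in the "fields fetched" log above) with the shape a successful fields refresh would return; the exact property names and values in the refreshed example are illustrative, not taken from the actual failure:

```ts
// Legacy field shape stored in some of our test data (see the log above):
// it carries indexed/analyzed/doc_values but no `aggregatable`.
const legacyTimestamp = {
  name: '@timestamp',
  type: 'date',
  count: 0,
  scripted: false,
  indexed: true,
  analyzed: false,
  doc_values: true,
};

// Shape after a successful fields refresh (illustrative values): the
// validation relies on `aggregatable` being present and true.
const refreshedTimestamp = {
  name: '@timestamp',
  type: 'date',
  searchable: true,
  aggregatable: true,
  readFromDocValues: true,
};

// If the refresh silently fails, the index pattern keeps the legacy shape,
// `aggregatable` stays undefined, and the time field is reported as invalid.
```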


LeeDr commented Nov 9, 2020

My concern with updating the test data is that we could be masking an upgrade problem. If fetching fields using the fields API sometimes fails, I don't know whether there's retry logic around that, or whether there needs to be.
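For reference, a minimal sketch of what retry logic around the field fetch could look like; fetchFields is a placeholder for whatever call actually hits the fields API, not the real client method:

```ts
// Placeholder for the actual fields API call.
declare function fetchFields(indexPatternTitle: string): Promise<unknown[]>;

async function fetchFieldsWithRetry(
  title: string,
  attempts = 3,
  delayMs = 1000
): Promise<unknown[]> {
  let lastError: unknown = new Error(`fields refresh for "${title}" failed`);
  for (let i = 0; i < attempts; i++) {
    try {
      const fields = await fetchFields(title);
      // Treat an empty result as a failed refresh so we retry instead of
      // silently keeping the stale (legacy) fields.
      if (fields.length > 0) return fields;
    } catch (e) {
      lastError = e;
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw lastError;
}
```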


kertal commented Nov 9, 2020

@LeeDr yes, the investigation should continue. The problem is that we currently can't tell when it fails, since there is no error; the fields data just stays the same. I've added more logs to investigate. I also think it's fine to update the test data so that fetching isn't necessary; that would resolve 4+ issues and stabilize our Jenkins pipeline. Here's the latest flaky test run (300 tries):

https://kibana-ci.elastic.co/job/kibana+flaky-test-suite-runner/957/


kertal commented Nov 17, 2020

Closing this. It hasn't failed recently and the flaky runs were OK, but it's clear what should be improved: there should be retry logic for fetching fields to prevent this kind of flakiness. I've opened an issue for that:

#83448
