
Automating Flaky Test Detection #3575

Open
siddhantdange opened this issue Oct 24, 2023 · 15 comments


@siddhantdange

siddhantdange commented Oct 24, 2023

Flaky tests unnecessarily waste developer and CI time, impact product stability, and reduce confidence in tests. It seems that the https://github.com/nodejs/node repository suffers from flaky tests, as shown by the context below.

Context:


BuildPulse is a platform for detecting flaky tests, reporting on their impact, mitigating that impact, and ultimately fixing flaky tests. We rely on open source ourselves, so in an effort to give back, we offer BuildPulse to open-source projects for free (up to a certain volume of test results per day).

I'd like to propose using BuildPulse to replace the current flaky test workflow, which seems to run through https://github.com/nodejs/reliability. We can replace this repository by making it easy to search for known flaky tests and to build context (links to flakiness occurrences, error logs, commits, etc.).

We can also enhance your current workflow through ncu-ci integrations, GitHub Issues integrations, pull request comments, reports, notifications, badges, and more.

I'm willing to do the integration myself, which largely consists of 2 steps:

  1. Configuring the test framework to export test results to JUnit XML
  2. Uploading results to BuildPulse using our buildpulse-action (for GitHub Actions) or test-reporter (Jenkins)
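For step 1, the format BuildPulse ingests is just standard JUnit XML. As an illustration only (the test names and suite layout below are made up, not the real Node.js suites), a report in that shape can be produced with nothing but the Python standard library:

```python
import xml.etree.ElementTree as ET

def to_junit(results):
    """results: list of (name, classname, seconds, failure_message_or_None).
    Returns a <testsuite> document following the common JUnit XML schema."""
    suite = ET.Element("testsuite", name="node-tests", tests=str(len(results)))
    failures = 0
    for name, classname, seconds, failure in results:
        case = ET.SubElement(suite, "testcase", name=name,
                             classname=classname, time=f"{seconds:.3f}")
        if failure is not None:
            failures += 1
            ET.SubElement(case, "failure", message=failure)
    suite.set("failures", str(failures))
    return ET.tostring(suite, encoding="unicode")

# Illustrative results: one pass, one failure
xml = to_junit([("test-http-basic", "parallel", 0.12, None),
                ("test-net-timeout", "parallel", 1.50, "timed out")])
```

A reporter like the one in step 2 would then pick up this file from the build workspace and ship it off.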

If the integration is helpful, BuildPulse would appreciate a testimonial down the road :)

Happy to answer any questions and interested in what next steps might look like!

@benjamingr
Member

We track flaky tests in https://github.com/nodejs/reliability/

@tniessen
Member

FWIW, I believe that detecting flaky tests hasn't been a big issue. "setting them as flaky to make our CI green again", as the referenced tweet put it, is the easy part. Finding people who are willing to fix them, on the other hand, has been rather difficult.

Also, this was merged only recently: #3056

@anonrig
Member

anonrig commented Oct 25, 2023

We track flaky tests in https://github.com/nodejs/reliability/

IMHO, GitHub issues are not a good way of tracking issues that persist across different days. For example, I'd be super interested in learning how long it takes for a flaky test to appear, or to get fixed, once it is introduced into the codebase. I don't know if the proposed service supports that or not, but I believe there is no harm in trying a new solution. Maybe a new product might create the dopamine effect we need to encourage fixing these issues.

@mcollina
Member

I would not replace the current infrastructure with a 3rd party system straight away, but I'm not opposed to adding it in a non-destructive way, evaluating it, and seeing how it performs.

I'm not sure having that data available would give us anything more useful than our current system, but I'd be happy to be proven wrong.

Something important to consider is that sending data to the 3rd party must be disabled during security releases while CI is locked down.

@joyeecheung
Member

joyeecheung commented Oct 26, 2023

  1. Configuring the test framework to export test results to JUnit XML

I think for Jenkins this does not need additional work; at the end of every test run we already run tap2junit -i test.tap -o out/junit/test.xml.
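For anyone unfamiliar with that step, here is a toy sketch of what such a TAP-to-JUnit conversion does. It handles only bare "ok" / "not ok" lines (the real tap2junit covers far more of the TAP format), and the sample input is illustrative:

```python
import re
import xml.etree.ElementTree as ET

# Matches "ok 1 - name" and "not ok 2 - name"; group 1 captures the "not "
TAP_LINE = re.compile(r"^(not )?ok\s+\d+\s+-?\s*(.*)$")

def tap_to_junit(tap_text):
    suite = ET.Element("testsuite", name="tap")
    tests = failures = 0
    for line in tap_text.splitlines():
        m = TAP_LINE.match(line.strip())
        if not m:
            continue  # skip the version line, the plan, diagnostics, etc.
        tests += 1
        case = ET.SubElement(suite, "testcase", name=m.group(2))
        if m.group(1):  # "not ok" -> record a failure element
            failures += 1
            ET.SubElement(case, "failure")
    suite.set("tests", str(tests))
    suite.set("failures", str(failures))
    return ET.tostring(suite, encoding="unicode")

sample = """TAP version 13
1..2
ok 1 - parallel/test-http-basic
not ok 2 - parallel/test-net-timeout
"""
junit = tap_to_junit(sample)
```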

Upload result to BuildPulse using our buildpulse-action (for GitHub Actions) or test-reporter (Jenkins)

I think that can be done in parallel to the reliability report generation, which is done as a daily job scraping data from Jenkins. I agree with @tniessen that the difficult part is actually fixing the flakes. I think we have more than enough open issues about flaky tests that contain whatever data a service like BuildPulse can collect, but from what I've observed that doesn't help much in terms of actually getting them fixed. Some of the flaky test issues in the core repo are years old and still reproduce from time to time, whether or not the tests are marked flaky, and we never know if a flake is caused by a real bug or by platform issues, because we just don't have enough volunteers to actually look into them.

I think when marking tests as flaky, we need to actually spend some brain energy on figuring out why the test is flaky, why an immediate solution is difficult to come up with, and whether the impact is small enough to justify putting on this band-aid for now. We also need to check that it is not actually a regression (e.g. nodejs/node#49852); IMO the best response to a potential regression is to revert ASAP and only reland when the flake is addressed. That part probably can't be accomplished by automation.

@nodejs nodejs deleted a comment from jessbennett Oct 26, 2023
@joyeecheung
Member

joyeecheung commented Oct 27, 2023

Also, I think what would be really useful is having automation that bisects regressions for newly appeared flakes (ideally on the affected platforms only) to find the commit that introduced a flake. (Maybe old ones too; they can be bisected just fine.) Even better if this can somehow associate flakes with infra updates, in case a flake is caused by one. This would be very helpful in addressing flakes like nodejs/node#49852 (for that one it took some human intelligence to find the suspicious commit, and even then we did not verify it via bisect and just got a lucky guess). I worked with similar automation in V8 before and found it amazing. I am not sure if BuildPulse supports that though. If it can only collect information based on past test failures but cannot do actual bisects, that's still not much of an improvement over what we already have.
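One wrinkle such a bisect has to handle is that a flaky test can still pass on the offending commit, so each probe has to run the test repeatedly before trusting a "good" verdict. A rough sketch of the idea (run_test and the simulated commit history are stand-ins, not real CI hooks):

```python
import random

def is_flaky_at(commit, run_test, attempts=30):
    """Probe one commit: flag it flaky if any of `attempts` runs fails.
    More attempts lower the chance of wrongly calling a flaky commit good."""
    return any(not run_test(commit) for _ in range(attempts))

def bisect_flake(commits, run_test):
    """Binary-search for the first commit where the test turns flaky.
    Assumes the flakiness, once introduced, persists in later commits,
    and that the last commit is known flaky while the first is known good."""
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_flaky_at(commits[mid], run_test):
            hi = mid          # flake already present here
        else:
            lo = mid + 1      # still good; flake introduced later
    return commits[lo]

# Simulated history: commits 0-5 are solid, commit 6 introduces a 40% flake.
random.seed(1)
def fake_test(commit):
    return True if commit < 6 else random.random() > 0.4

first_bad = bisect_flake(list(range(10)), fake_test)
```

With 30 attempts per probe and a 40% failure rate, the chance of a probe missing the flake is about 0.6^30, i.e. negligible, which is what makes the binary search trustworthy.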

@siddhantdange
Author

siddhantdange commented Oct 27, 2023

Hi everyone - I've read through the comments so far and would like to address some concerns.

I agree that detection on its own is not sufficient to solve the underlying problem - your GitHub issues already track a subset of known flaky tests. So what additional data can we provide?

We can show, among other things (these directly address the concerns identified above):

  • the disruptiveness of flaky tests
  • the time consumed by flaky tests
  • when a flaky test was first and last observed (with links to commits + builds)
  • the error messages from when the test flaked
  • charts & reporting around the history of these metrics to track trends

These metrics, along with alerting and PR comments, make sure flaky tests are tracked, prioritized correctly, and fixed.

BuildPulse considers a test flaky when we see both a passing and a failing result for that test with no code changes in between. We determine whether the code has changed by comparing git tree SHAs. Because of this continuous monitoring, we can identify flaky tests as they come up.
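That rule can be sketched in a few lines. This is an illustration of the idea rather than BuildPulse's actual implementation, and the data layout is made up:

```python
from collections import defaultdict

def find_flaky(results):
    """results: iterable of (test_name, tree_sha, passed). A test is flagged
    flaky when the same git tree SHA produced both a pass and a failure,
    i.e. the outcome changed while the code did not."""
    outcomes = defaultdict(set)
    for test, sha, passed in results:
        outcomes[(test, sha)].add(passed)
    return sorted({test for (test, sha), seen in outcomes.items()
                   if len(seen) == 2})  # saw both True and False

runs = [
    ("test-net-timeout", "abc123", True),
    ("test-net-timeout", "abc123", False),  # same tree, both outcomes: flaky
    ("test-http-basic",  "abc123", True),
    ("test-http-basic",  "def456", False),  # code changed: not flaky evidence
]
flaky = find_flaky(runs)
```

Comparing tree SHAs rather than commit SHAs means reruns, rebases, and merge commits that leave the source identical still count as "no code change".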

@mhdawson
Member

I would not replace the current infrastructure with a 3rd party system straight away, but I'm not opposed to add it in a non-destructive way, evaluate it, and see how it reacts.

+1 from me on that approach.

I think the main question is whether we have somebody from the build team, or somebody with Jenkins access, who will volunteer to help with the effort, as my guess is that we'll need to update the build jobs to send the data.

@joyeecheung
Member

So far I think no one objects to trying it out in addition to what we already have. So we just need volunteers to integrate it with Jenkins and/or GitHub Actions and see how it goes.

@shnooshnoo

Hey all, I'd like to improve the Windows builds of the Node CI and fix flaky tests where possible, and I'm really curious how BuildPulse can help here. So I was just wondering, what's the current status of this?

@shnooshnoo

@siddhantdange what would we need to have some kind of MVP of this? Anything I can help you with here?

@siddhantdange
Author

siddhantdange commented Jan 23, 2024

@shnooshnoo 3 steps!

  1. Sign up to generate keys
  2. Configure the tests to export a JUnit XML file
  3. Use our GitHub Action or binary to send results to BuildPulse using the keys from Step 1

Additionally, our onboarding docs can be found here.

Happy to help here as well, if you can point me to the correct workflow!

@StefanStojanovic
Contributor

FYI @nodejs/build I will meet with @siddhantdange on Tuesday, February 6 at 16:00 CET. Anyone who wants is welcome to join. I'll post a meeting summary here afterward.

@StefanStojanovic
Contributor

A quick summary after the meeting:

  • @siddhantdange did a quick demo for me, showcasing the tools. It looks good and could potentially be useful to us.
  • Since I'd already tried the tool, we went over some questions I had and cleared them up (e.g. how to keep additional data, like platform, machine, etc., with the results).
  • Since I'm using my local fork for my PoC, I'm missing some things such as the settings panel, which is only available for organization-owned repos. After I set up everything, we can easily switch to nodejs/node and see the additional settings.
  • The plan, for now, is to set up part of the Windows machines to send results to BuildPulse and let it run for some time so we can see what the dashboards will look like. This will require changing node-test-binary-windows-js-suites and node-test-binary-windows-native-suites, but only to add code that checks whether the machine is configured to send results and sends them if it is. After making a PoC we will reevaluate. The way I see it, if this moves forward, we'd probably want to upload results from the jenkins-workspace machines, as that's where all of the results accumulate in the end, and we'd not have to change all of the sub-jobs generating the junit.xml files.
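The guard described above (skip the upload when the machine isn't configured, and never let an upload failure break the test job) could look roughly like this. The BUILDPULSE_UPLOAD_ENABLED variable and upload-tool command are placeholder names for illustration, not the actual job configuration:

```python
import os
import subprocess

def maybe_upload_results(report_path="out/junit/test.xml"):
    """Best-effort upload of a JUnit report: silently skipped on machines
    that are not opted in, and failures never propagate to the test job."""
    if not os.environ.get("BUILDPULSE_UPLOAD_ENABLED"):
        return "skipped"          # machine not configured to send results
    if not os.path.exists(report_path):
        return "skipped"          # nothing to upload this run
    try:
        subprocess.run(["upload-tool", report_path], check=True, timeout=300)
        return "uploaded"
    except (OSError, subprocess.SubprocessError):
        return "failed"           # swallow the error; the job continues

status = maybe_upload_results()
```

Returning a status instead of raising keeps the CI step strictly non-destructive, which matches the "evaluate it alongside the existing setup" approach discussed earlier in the thread.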

StefanStojanovic added a commit to JaneaSystems/build that referenced this issue Mar 18, 2024
This commit adds initial BuildPulse support by optionally uploading
test results there. If the machine is not configured to connect to
BuildPulse, or if any other condition is not met, the data will not be
sent. If the upload fails, the test job will just continue.

For now, this is a PoC and only Windows machines can upload results.
If we decide to move forward with it, other platforms will be added.

Refs: nodejs#3575
StefanStojanovic added a commit to JaneaSystems/build that referenced this issue Mar 20, 2024
@StefanStojanovic
Contributor

@nodejs/build a PR enabling BuildPulse integration on part of Windows machines in PRs and daily builds is open here.

StefanStojanovic added a commit that referenced this issue Mar 27, 2024