Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI runs tests across a disproportionate variety of Windows versions #30367

Closed
fearful-symmetry opened this issue Feb 11, 2022 · 11 comments
Closed
Labels
automation ci Team:Automation Label for the Observability productivity team

Comments

@fearful-symmetry
Copy link
Contributor

CC @cmacknz and I think @elastic/observablt-robots ? Not sure.

I was tracking down some flaky tests yesterday when I noticed that the extended stage of tests in metricbeat runs across a bizarrely wide array of windows releases:

~ grep -B 2 -A 1 "\- \"windows-" Jenkinsfile.yml
        mage: "mage build unitTest"
        platforms:             ## override default labels in this specific stage.
            - "windows-2016"
        stage: extended
--
        mage: "mage build unitTest"
        platforms:             ## override default labels in this specific stage.
            - "windows-2012-r2"
        stage: extended
--
        mage: "mage build unitTest"
        platforms:             ## override default labels in this specific stage.
            - "windows-10"
        stage: extended
--
        mage: "mage build unitTest"
        platforms:             ## override default labels in this specific stage.
            - "windows-8"
        stage: extended
--
        mage: "mage build unitTest"
        platforms:             ## override default labels in this specific stage.
            - "windows-2008-r2"
        stage: extended
--
        mage: "mage build unitTest"
        platforms:             ## override default labels in this specific stage.
            - "windows-7"
        stage: extended
--
        mage: "mage build unitTest"
        platforms:             ## override default labels in this specific stage.
            - "windows-7-32-bit"
        stage: extended

As far as I can tell, these tests are all running. Do we even support 32-bit windows 7? This seems pretty unnecessary to run as part of the normal CI workflow for pull requests, particularly considering that we only run metricbeat's PR CI across one version of Ubuntu and Darwin.

Can we somehow make these extended tests run nightly instead of on every PR? This probably isn't helping CI's overall runtime, not to mention resource usage.

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Feb 11, 2022
@fearful-symmetry fearful-symmetry added the Team:Automation Label for the Observability productivity team label Feb 11, 2022
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Feb 11, 2022
@fearful-symmetry
Copy link
Contributor Author

Just kinda guessing on the teams here, apologies if I ping the wrong people.

@cmacknz
Copy link
Member

cmacknz commented Feb 11, 2022

There is a parallel ask from the robots team on what we actually need from the support matrix: https://github.com/elastic/observability-robots/issues/1038

There are two questions here:

  1. What is the complete list of operating systems we need to support? This is covered by the support matrix (owned by product I believe). We have to test on all these in some capacity.
  2. When should we test each of these operating system? This is where most of the discussion should be.

My personal opinion is that PRs should only test against the single most common or important version of each major operating system: Windows, Linux, and Mac. The full support matrix could be covered by some other scheduled job. This could be nightly, but it can be more frequent then that (every 2-4 hours for example so we get multiple builds during working hours for everyone). Others may have a different opinion, or there may be beats that need exhaustive OS testing that this doesn't work for.

@fearful-symmetry
Copy link
Contributor Author

My personal opinion is that PRs should only test against the single most common or important version of each major operating system: Windows, Linux, and Mac. The full support matrix could be covered by some other scheduled job.

Strong agree here, yah. It looks like windows is the one outlier here, oddly enough. Linux is just running on the 18.04 Ubuntu LTS, and darwin on...whatever version of Darwin we have.

@v1v
Copy link
Member

v1v commented Feb 14, 2022

Old windows versions are now ready to be removed -> #30373

Regarding what to test and when, there were some genuine reasons why those extended stages were added.

We tried different approaches:

  1. Fine granularity with GitHub labels/commands. But was not used much, so people tend to use /test for simplicity. And identifying what to test with GitHub labels was not used at all.

  2. Run everything in parallel, didn't scale well and consume a lot of resources.

  3. Run everything in sequential stages, this tend to be the best of the above scenarios.

In my experience, developers tend to think if their PR passes in the CI then it's completed. That was one of the main reasons to enforce mandatory stages to test different OS versions. Might not be ideal, but it reduces the chances of discovering breaking changes when merging into main which will definitely affect other's PRs. I've been one of those people chasing PR owners, quite a lot, regarding some breaking changes in the main branches related to their changes.

I'd say, this process is owned by the product team, you can decide what to test and when, but the above point is important to be considered as it might require some changes in managing build failures. We can help on the notifications and what data are important to be shown.

Thanks for raising this conversation :)

@ph
Copy link
Contributor

ph commented Feb 16, 2022

If we move the others windows into a scheduled job, we would notify both slack channels?

@v1v
Copy link
Member

v1v commented Feb 21, 2022

If we move the others windows into a scheduled job, we would notify both slack channels?

That's the current approach for the builds running on main, 8.x, but IIRC, @jlind23 was recently tidying up the existing unused slack channels, and the one with the build status was proposed to be deleted since it was not that useful.

@ph
Copy link
Contributor

ph commented Mar 2, 2022

Looking at our matrix we can remove window 7, it's unsupported since 7.14.
Windows 2016 and Windows 2022 can run on every PR.

The remaining Windows can weekly as a regression? WDYT @craig?

@cmacknz
Copy link
Member

cmacknz commented Mar 2, 2022

Running the oldest (2016) and the newest (2022) on every PR sounds reasonable.

I would rather see the entire set of Windows versions run as a nightly regression test to give us a chance to see and find flaky tests.

@ph
Copy link
Contributor

ph commented Mar 2, 2022

Okay to have it nightly/daily.

@cmacknz
Copy link
Member

cmacknz commented Mar 14, 2022

There is some work in progress on this now: #30781

@v1v
Copy link
Member

v1v commented Mar 15, 2022

I think we can close this now:

  • main and 8.1 are now validating on windows-2016 and windows-2022 by default. Other Windows versions only on merged commits. Therefore a slack notification is sent if a build failure.
  • 7.17 and 8.0 are using the previous behaviour.

Feel free to reopen this if needed

@v1v v1v closed this as completed Mar 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automation ci Team:Automation Label for the Observability productivity team
Projects
None yet
Development

No branches or pull requests

4 participants