Test load balancing - parallelize jest tests #58

Closed
CMCDragonkai opened this issue Jun 6, 2022 · 6 comments · Fixed by #65

@CMCDragonkai
Member

CMCDragonkai commented Jun 6, 2022

Specification

Since we are expanding our testing to include other platforms like macOS and Windows, this creates an issue for the very large tests that we have in PK.

In PK we spin up a test job for each subdirectory under tests/; this is not efficient given the 10 minute startup routines for Windows and macOS runners. This is also done via child pipelines, which rely on a script that generates the child pipeline jobs.

The only way to address this is test load balancing instead. This is where we create N Linux, macOS, and Windows runners, and then distribute the tests across the N runners. N is determined by a tradeoff between fastest pipeline completion time and total runner minutes used: we want the fastest possible pipeline completion with the lowest total minutes, and the intersection between the two curves is the sweet spot. This would have to be figured out manually and iteratively (a rough model is sketched below). The test suite here is very small though, so this should be done on PK instead.
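As a rough illustration of the tradeoff (all numbers below are assumptions, not measurements): each extra runner pays its own startup cost, while the test work is split N ways.

```ts
// Rough model of the tradeoff; startup and test times are assumed for illustration.
const startupMinutes = 10;   // assumed per-runner startup time (Windows/macOS)
const totalTestMinutes = 60; // assumed serial test time for the whole suite

for (let n = 1; n <= 8; n++) {
  const completion = startupMinutes + totalTestMinutes / n;   // wall-clock pipeline time
  const totalMinutes = n * startupMinutes + totalTestMinutes; // billed runner minutes
  console.log(`N=${n}: completion ~${completion.toFixed(1)} min, total ~${totalMinutes} min`);
}
```

Completion time flattens out as N grows while total minutes keeps rising linearly, which is why the sweet spot has to be picked empirically.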

However, once we use load balancing, we lose the mapping from individual jobs to test directories. Right now we are able to see that only certain areas fail while other areas succeed. To regain this ability, we have to use the test reporters (JUnit reporting is already integrated). There are jest-junit options that can be utilised to maximise the utility of these reports (a config sketch follows the list below):

  • classNameTemplate: '{classname}' - sets the suite column on gitlab to the name of the top-level describe of the test file
  • titleTemplate: '{title}' - sets the name field on gitlab to the name of the test
  • addFileAttribute: 'true' - adds the filename to the junit reports
  • includeConsoleOutput: 'true' - includes console output in the report
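
Putting those options together, a jest config along these lines is what I have in mind (a sketch; the output directory and the rest of the config are assumptions):

```ts
// jest.config.ts — sketch of the jest-junit reporter options listed above.
export default {
  reporters: [
    'default',
    [
      'jest-junit',
      {
        outputDirectory: './tmp/junit',    // assumed output location
        classNameTemplate: '{classname}',  // suite column on gitlab
        titleTemplate: '{title}',          // name field on gitlab
        addFileAttribute: 'true',          // adds the filename to the junit report
        includeConsoleOutput: 'true',      // includes console output in the report
      },
    ],
  ],
};
```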

Our test reporters can now tell us which test(s) are failing without us having to dig through the logs.

But jest is not capturing STDOUT or STDERR anymore.

The alternative is to change our logger to use console.error instead, which is intercepted by jest, buffered up, and should be shown to the test reporters. The includeConsoleOutput option only affects stdout though, so we need a way to work around this.

The buffering causes an issue during development though; it's nice to be able to see the logs immediately when we are debugging test issues, especially for the networking code where the ordering matters.

If we can get jest to unbuffer the logs, or change back to process.stderr when the --ci flag is not set, then we can see the logs unbuffered when running the tests. Right now the --ci flag is set when running the tests in CI/CD.
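Something along these lines is what I mean (a sketch; detecting CI via the CI environment variable is an assumption, since the --ci flag itself isn't visible to test code):

```ts
// Sketch: route logs through console.error in CI (buffered by jest, so it can
// reach the junit report), but write directly to process.stderr locally so
// logs show up immediately and in order while debugging.
type LogSink = (msg: string) => void;

const logSink: LogSink =
  process.env.CI != null
    ? (msg) => console.error(msg)                // intercepted and buffered by jest
    : (msg) => process.stderr.write(`${msg}\n`); // unbuffered, ordering preserved

logSink('example log line');
```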

Finally, within our test runners, because each runner only has 1 CPU, we can use --runInBand to keep it simple.

Additional context

Tasks

  1. Investigate and implement sharding
  2. Optimise junit reports to include as much useful info as possible on gitlab
@CMCDragonkai CMCDragonkai added the development Standard development label Jun 6, 2022
@CMCDragonkai
Member Author

I believe we can reuse the child pipeline script generator, which would end up dynamically generating N number of parallel jobs.

It's also possible that child pipelines won't be necessary, and instead we can just use the parallel keyword: https://about.gitlab.com/blog/2021/01/20/using-run-parallel-jobs/.

However I quite like the child pipeline script; it should be useful for any kind of dynamic pipeline creation. Perhaps something to quickly calculate/estimate what N should be.
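
For illustration, the generator could emit the N parallel jobs along these lines (the job names, variables, and script contents are hypothetical, not the actual script in this repo):

```ts
// Hypothetical sketch of reusing the child pipeline generator to emit N parallel test jobs.
const n = 4; // N picked from the completion-time vs total-minutes tradeoff

const childPipeline = Array.from({ length: n }, (_, i) => `
check:test ${i + 1}/${n}:
  stage: check
  variables:
    SHARD_INDEX: "${i + 1}"
    SHARD_TOTAL: "${n}"
  script:
    - npm test -- --ci --runInBand
`).join('');

// The generated YAML would be saved as an artifact and triggered as a child pipeline.
process.stdout.write(childPipeline);
```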

@CMCDragonkai
Member Author

Note that there's a commercial service for this called Knapsack (https://hackernoon.com/how-to-run-parallel-rspec-specs-on-gitlab-ci-in-ruby-and-jestcypress-parallel-tests-a913346x). However I believe we don't need to use this. I believe there's a jest scheduler script we can write, inspired by GitLab's prior work, that ultimately takes all the tests and splits them up.

Queuing mode would be more efficient, since each runner can pull the next job as soon as it finishes its current one. That would almost certainly require a child pipeline, with a scheduler handing out jobs like a queue so that each runner can atomically take one off.

@CMCDragonkai
Member Author

Note that resource limits for Windows and macOS runners are different:

  1. macOS has 4 CPUs - https://docs.gitlab.com/ee/ci/runners/saas/macos/environment.html
  2. Windows has 2 CPUs - https://docs.gitlab.com/ee/ci/runners/saas/windows_saas_runner.html

At this time there is only one available machine type offered, gbc-macos-large.

| Instance type   | vCPUs | Memory (GB) |
| --------------- | ----- | ----------- |
| gbc-macos-large | 4     | 10          |

Windows runners on GitLab.com autoscale by launching virtual machines on the Google Cloud Platform. This solution uses an autoscaling driver developed by GitLab for the custom executor. Windows runners execute your CI/CD jobs on n1-standard-2 instances with 2 vCPUs and 7.5 GB RAM. You can find a full list of available Windows packages in the package documentation.

This means we don't need as many of these runners during load balancing. Also --runInBand shouldn't be used here, since these runners have more than 1 CPU.

@CMCDragonkai
Member Author

Jest 28 supports sharding natively:

https://medium.com/@mfreundlich1/speed-up-your-jest-tests-with-shards-776e9f02f637
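
Combined with GitLab's parallel keyword, each job can compute its shard from the CI_NODE_INDEX/CI_NODE_TOTAL variables that GitLab sets for parallel jobs; a wrapper along these lines (the wrapper script itself is hypothetical):

```ts
// Sketch: invoke jest with the native --shard flag (Jest 28+) from a wrapper
// script run inside a GitLab `parallel` job.
import { spawnSync } from 'child_process';

const index = process.env.CI_NODE_INDEX ?? '1'; // set by GitLab for parallel jobs
const total = process.env.CI_NODE_TOTAL ?? '1';

// --runInBand keeps each shard on a single worker, matching the 1 vCPU Linux runners.
const result = spawnSync(
  'npx',
  ['jest', `--shard=${index}/${total}`, '--ci', '--runInBand'],
  { stdio: 'inherit', shell: process.platform === 'win32' },
);
process.exit(result.status ?? 1);
```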

I believe the scheduling ultimately relies on file size as a proxy for the test time.

However I read in the CircleCI docs (https://circleci.com/docs/2.0/parallelism-faster-jobs/) that they actually keep track of test times. If we had something like this, we could reuse the timing data to better optimise how the tests are split later.
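
For example, with recorded timings the split could be a simple greedy bin packing (a sketch; the source of the timing data is assumed):

```ts
// Sketch: split test files across N shards using recorded timings,
// longest-processing-time-first: each file goes to the currently lightest shard.
type Timing = { file: string; seconds: number };

function splitByTiming(timings: Timing[], shards: number): string[][] {
  const buckets = Array.from({ length: shards }, () => ({ total: 0, files: [] as string[] }));
  for (const t of [...timings].sort((a, b) => b.seconds - a.seconds)) {
    const lightest = buckets.reduce((min, b) => (b.total < min.total ? b : min));
    lightest.files.push(t.file);
    lightest.total += t.seconds;
  }
  return buckets.map((b) => b.files);
}
```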

The best would be a work-stealing queue system, but that doesn't exist yet.

Furthermore, once you shard the tests, the coverage reports that get generated have to be put together again. We would use istanbul to merge the coverage reports back together: https://github.com/istanbuljs/nyc, see the medium post too: https://medium.com/@mfreundlich1/speed-up-your-jest-tests-with-shards-776e9f02f637
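
The merge could look something like this with istanbul-lib-coverage (a sketch; it assumes each shard uploads its raw coverage-final.json as an artifact):

```ts
// Sketch: merge per-shard istanbul coverage maps into a single map, which can
// then be fed to the coverage reporters for one combined report.
import * as fs from 'fs';
import { createCoverageMap } from 'istanbul-lib-coverage';

const shardFiles = process.argv.slice(2); // e.g. shard-1.json shard-2.json ...
const merged = createCoverageMap({});
for (const file of shardFiles) {
  merged.merge(JSON.parse(fs.readFileSync(file, 'utf8')));
}
fs.writeFileSync('coverage-final.json', JSON.stringify(merged.toJSON()));
```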

@CMCDragonkai
Member Author

Actually our coverage reports are XML Cobertura reports, so I'm not sure if istanbul is sufficient for that. Low priority though.

@CMCDragonkai
Member Author

It might actually be quite easy: no need to merge, just put all the files into one location and push them out as a report: https://gitlab.com/gitlab-org/gitlab/-/issues/328772#note_769205384
