Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve tests performance #1027

Closed
MattiSG opened this issue Jul 27, 2021 · 9 comments
Closed

Improve tests performance #1027

MattiSG opened this issue Jul 27, 2021 · 9 comments
Assignees
Labels
kind:discovery Issue requires discovery: value, ux and tech

Comments

@MattiSG
Copy link
Member

MattiSG commented Jul 27, 2021

It would be useful for all country packages to improve the speed of tests, and to improve the speed of computations in general. Currently, running all tests in the most complex tax and benefits system (OpenFisca-France) takes 10 to 20 minutes in CI, and users report the same duration locally (#1025).

This issue aims at consolidating potential venues for performance improvements, associated costs, risks and expected benefits, in order to define the best implementation strategy for a team with limited resources that cannot tackle all at once.

Option Implementation Gain Risk
Tests parallelisation #1025 Only locally, depends on cores count Decrease in test output readability
Parameters caching openfisca/openfisca-france#1311 1 load instead of 4 in OF-FR, so about 5 seconds out of 12 minutes Cache mysteries
Test files caching openfisca/openfisca-france#1311 None, as they are loaded once in a test run Cache mysteries
Native YAML parser Already in place for most platforms with libyaml None in most cases Platform-specific incompatibilities
Install libyaml in CI Already in place

Details

Native YAML parser

The “Native YAML parser” option consists of assessing whether libyaml, currently suggested as an optional dependency as it speeds up parameter and tests loading by using a native compiled extension instead of the native Python YAML reader, could not be added as a systematic dependency. The last assessment dates back from a few years and revealed platform-specific incompatibilities. Maybe this has been fixed in upstream. There might be also be an alternative option that fits the same purpose with less incompatibilities.

Strategy & contributions

  • @HAEKADI will consolidate here results of exploring the “Native YAML parser” and “Install libyaml in CI” options.
  • All other suggestions to explore are welcome!
  • Benchmarks for current tests durations in different country packages are welcome!
  • Declarations of interest for benchmarking your own country package against potential options are welcome!
  • Analyses assessing which option(s) might be most impactful are welcome.
@bonjourmauko
Copy link
Member

Great initiative! 🙌

I share with you a couple of learnings in the hope it helps.

I think what's good about the options proposed is that they not mutually exclusive.

Tests parallelisation

Seems to me one of the the most impactful in terms of raw performance, and scales horizontally.

However, I'm still blocked in trying to make it work, as there's some black magic:

  • xdist launches independent processes to collect and run the tests, which are then invisible to the one that launched pytest in the first place, and
  • it replaces openfisca test tests collectors with its own.

To make it work, we need to find a way to tell xdist to not ignore OpenFiscaPlugin.

At this point I'm yet not sure xdist it is compatible with the way we use pytest (pytest.run).

There are other options like https://github.com/browsertron/pytest-parallel that could be worth the test.

Parameters caching

Parameters seem to be loaded just once per test suite.

If that's the case, it could humbly improve performance by a fixed amount, independent of the number of tests to be run, at load time (right after launching openfisca test and before collecting the tests).

If it's not, then it could be actually interesting.

Still, big questions for me would then be:

  • how to cache/expire —for example atomically vs the whole of the parameters each time
  • when to cache/expire —after a commit? when instantiating a tax benefit system?
Native YAML parser

Currently, if you have libyaml installed, it is being used for parameter loading and for test collection.

Performance gain is substantial, however, as with the caching, by an almost fixed amount, as those seem to be operations that are run just once per test suite.

However having a native strategy as a default could indeed simplify things.

Install libyaml in CI

It seems to already be the case, at least for core: we're using Docker's Python 3.7 container, which depends on buildpack-deps's, which installs libyaml.

All other suggestions to explore are welcome!

Out of my head some things we've at some point considered, that could need of course further exploring.

All of them are based on the hypothesis that the most expensive operation is calculate.

  • Optimise calculations
    • Reduce non-native operations (Numpy's optimised for +-*/, with object arrays is not much better that Python)
    • Extract all native and non-native operations to separate functions
    • Opmise native functions —some libraries help here like Numba, Parakreet, Numexpr...
  • Reduce calculations complexity:
    • Create lookup tables or do binary search for all expensive operations, when possible —extending/refactoring Holders
    • Reduce recursion —for example doing Split-Apply-Combine as explored here, here, and here
  • Use a Python interpreter faster than CPython, like PyPy
  • Use a Python compiler faster than CPython, like Cython

Hope it helps! 😃

@HAEKADI
Copy link
Contributor

HAEKADI commented Aug 3, 2021

Native YAML parser

  • I concur with @maukoquiroga, libyaml is already used for parsing in OpenFisca-Core.

  • Currently, if you have pyYAML installed, libYAML bindings are used (CLoader in our case).

  • No warning is generated locally nor on the CI leading me to believe that libYAML is loaded correctly.

  • I've tried running the tests with and without CLoader (for parameter loading and test collection) on the same code base (OpenFisca-France):

Without CLoader:

  1. 11m 23s
  2. 14m 02s
  3. 11m 33s

Average = 12m 19s

With CLoader:

  1. 12m 25s
  2. 12m 38s
  3. 10m 18s

Average = 11m 46s

The outcome is not what I expected, consequently I tested SafeLoader and CLoader on a simpleYAML test file:

from yaml import CLoader, SafeLoader
import yaml
import os
import time


def benchmark_strategy(loader):
    file_size = os.path.getsize("aah.yaml")
    iterations = 1000

    start_time = time.time()

    for rep in range(iterations):
        with open("aah.yaml", 'r') as stream:
            file_parsed = yaml.load(stream, Loader=loader)

    duration = time.time() - start_time

    print('--------------------------------------------------------------------------------------------')
    print('                                         ')
    print(f"******** YAML file size: {file_size / (10 ** 3)} Kb ********")
    print('                                         ')

    print('--------------------------------------------------------------------------------------------')
    print('                                         ')
    print(f"******** %s seconds using {loader}: ******** {duration} seconds ******** ")

    print('                                         ')
    print('--------------------------------------------------------------------------------------------')
    print('                                         ')

    flow = (file_size * iterations) / (duration * 10 ** 3)
    print(f"******** Flow using {loader}: ******** {flow} Kb/seconds ******** ")


benchmark_strategy(CLoader)
benchmark_strategy(SafeLoader)

I got the following :

--------------------------------------------------------------------------------------------

******** YAML file size: 18.002 kb ********

--------------------------------------------------------------------------------------------

******** %s seconds using <class 'yaml.cyaml.CLoader'>: ******** 8.838051080703735 seconds ********

--------------------------------------------------------------------------------------------

******** Flow using <class 'yaml.cyaml.CLoader'>: ******** 2036.8744008850624 kb/seconds ********
--------------------------------------------------------------------------------------------

******** YAML file size: 18.002 kb ********

--------------------------------------------------------------------------------------------

******** %s seconds using <class 'yaml.loader.SafeLoader'>: ******** 80.71783518791199 seconds ********

--------------------------------------------------------------------------------------------

******** Flow using <class 'yaml.loader.SafeLoader'>: ******** 223.0238206722362 kb/seconds ********

CLoader is almost ten times faster than SafeLoader. Obviously this test is far from being statistically significant, however CLoader did show superior performance each and every time.

CircleCi parallelism

  • Currently, with the free plan, we have 4 parallel runs available on CircleCI.

  • In OpenFisca-France, we currently use 2 out of 4 (it used to be 3 out of 4) runs for the build job to speed up the testing and the remaining 2 runs for the other builds.

  • I tested giving 4, 3 and then 2 containers to the build job.

With 4 containers :

  1. 6m 21s
  2. 6m 03s
  3. 6m 12s

Average : 6m 12s

With 3 containers :

  1. 8m 58s
  2. 8m 44s
  3. 8m 53s

Average : 8m 52s

With 2 containers :

  1. 13m 12s
  2. 9m 21s
  3. 9m 41s

Average : 10m 44s

Please note that these tests ran on almost perfect conditions with no other builds running simultaneously.

Obviously the more containers we use, the shorter the test time. However, the bandwidth can only be stretched so far and all the other builds will be waitlisted. A possible high cost low effort solution would be to upgrade the CircleCI plan adding more levels of parallelism

Tests caching:

Inspired by PR #1311.

It might be interesting, as a first step, to try caching the tests before caching the parameters, since the parameter usage is more complex than that of the tests.

To see if has any effect, I tested using Pickle and Marshal to cache the same YAML test file:

from yaml import CLoader, SafeLoader
import yaml
import time
import pickle
import os
import marshal


def benchmark_strategy(loader):
    file_size = os.path.getsize("aah.yaml")
    iterations = 1000

    modules = [pickle, marshal]

    for module in modules:

        with open("aah.yaml", 'r') as stream:
            file_parsed = yaml.load(stream, Loader=loader)

        module_name = 'pickle' if module == pickle else 'marshal'

        stream_file = f"aah.{module_name}"

        with open(stream_file, "wb+") as module_handle:
            module.dump(file_parsed, module_handle)  # protocol=pickle.HIGHEST_PROTOCOL

        start_time = time.time()

        for rep in range(iterations):
            with open(stream_file, "rb") as module_handle:
                file_parsed = module.load(module_handle)

        duration = time.time() - start_time

        print('--------------------------------------------------------------------------------------------')
        print('                                         ')
        print(f"******** YAML file size : {file_size / (10 ** 3)} Kb  ********")
        print('                                         ')

        print('--------------------------------------------------------------------------------------------')
        print('                                         ')
        print(f"******** %s seconds using {module_name} with {loader} : ******** {duration} seconds ******** ")

        print('                                         ')
        print('--------------------------------------------------------------------------------------------')
        print('                                         ')

        flow = (file_size * iterations) / (duration * 10 ** 6)
        print(f"******** Flow using {module_name} with {loader} : ******** {flow} Mb/seconds ******** ")
        print('                                         ')
        print('--------------------------------------------------------------------------------------------')
        print('                                         ')


benchmark_strategy(CLoader)
# benchmark_strategy(SafeLoader)

And I got:

--------------------------------------------------------------------------------------------

******** YAML file size : 18.002 Kb  ********

--------------------------------------------------------------------------------------------

******** %s seconds using pickle with <class 'yaml.cyaml.CLoader'> : ******** 0.19371914863586426 seconds ********

--------------------------------------------------------------------------------------------

******** Flow using pickle with <class 'yaml.cyaml.CLoader'> : ******** 92.92834563215293 Mb/seconds ********

--------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------

******** YAML file size : 18.002 Kb  ********

--------------------------------------------------------------------------------------------

******** %s seconds using marshal with <class 'yaml.cyaml.CLoader'> : ******** 0.8459770679473877 seconds ********

--------------------------------------------------------------------------------------------

******** Flow using marshal with <class 'yaml.cyaml.CLoader'> : ******** 21.279536623467393 Mb/seconds ********

--------------------------------------------------------------------------------------------

Pickle performs better than Marshal and they are both faster than reloading the YAML file every time.

Additional profiling is needed to determine how much testing time is actually spent loading the test files.

@MattiSG
Copy link
Member Author

MattiSG commented Aug 4, 2021

Thanks a lot @maukoquiroga for your insights and references, super useful! And thank you so much @HAEKADI, this is great data!

Conclusions

  1. Libyaml's CLoader is used systematically, so there is no performance gain to be expected.
  2. pyyaml's doc states that CLoader is “available only if you build LibYAML bindings”, so we have to keep the warning and alternative loading code. A way to outsource that would be to use pylibyaml, but our code is already written, let's not add an external dependency for something so small.
  3. We should assess the monetary cost of increasing parallelism, and potentially assess other CI providers.
  4. Caching is an efficient strategy, but will impact tests only if parameters and / or tests are loaded repeatedly in a test session. This has to be confirmed.
  5. Test parallelisation would only improve performance for local testing, not on CI. It would in any case need a major refactor of the way tests are executed, since parallelisation provided by third-party tools depend on standard ways to execute Python tests, which we currently don't; an alternative would be to write our own parallelisation code, but I believe this is a slippery slope.

Uncertainties left to explore

  • How many times parameters are loaded in a test run?
  • How many times test files are loaded in a test run?
  • How much would it cost to add 2 / 4 / 8 / 16 containers from CircleCI?
  • How long do the tests take when executed with GitHub Actions?
  • Explore simplifying the test runner code so that standard parallelisation libraries can be used.
  • What is the performance impact of using a Python interpreter faster than CPython?
  • What is the performance impact of using a Python compiler faster than CPython?

I set aside all the calculations optimisation to focus on Core-level actionable improvements, as I believe such solutions forward complexity onto contributors. It is however clear that at some point this will be the way forward, and providing guidelines for performance would be useful. This would need a strong investment in profiling.

Non-technical options

  • Paying for additional containers.
  • Create separate organisations for each country system. This way, each of these organisations could benefit from CircleCI's free plan for open-source and get 4 containers each. It seems however unlikely that this would impact massively performance, since overlap on active work between countries and Core is rare; what is most observed in CI queues is active teams from one country using resources in a continuous streak. This setup, however, would also simplify governance, as each country would be responsible for its privileges management and team maintenance.

@MattiSG
Copy link
Member Author

MattiSG commented Aug 4, 2021

@HAEKADI could you please have a look at these first 4 items? 🙂
Exploring the test runner code refactor, interpreter and compiler seem too big an investment at this stage, we'll reconvene after this first series of uncertainties are clarified 🙂

@HAEKADI
Copy link
Contributor

HAEKADI commented Aug 10, 2021

How many times parameters are loaded in a test run?

In an OpenFisca France test run, the parameters are loaded 4 times, averaging 1.15 seconds to load each time.
I used a logger to detect each time the parameters are loaded.

 1 WARNING:root:***************************************************Parameters loaded in 1.196788 seconds
 1 WARNING:root:***************************************************Parameters loaded in 1.3190979999999999 seconds
 1 WARNING:root:***************************************************Parameters loaded in 0.9405570000000054 seconds
 1 WARNING:root:***************************************************Parameters loaded in 1.139006000000002 seconds

How many times test files are loaded in a test run?

With openfisca test in both the Country-Template and OpenFisca-France, the test files are loaded once. However, the make test command in Country-Template executes the tests once, whereas the same command in OpenFisca-Franceruns the Python tests twice, and that is due to a second call to pytest (see issue).

How much would it cost to add 2 / 4 / 8 / 16 containers from CircleCI?

With our current free plan, we run 4 concurrent job runs. The next tier plan is the performance plan, which offers 80 concurrent job runs. With a monthly fee of 30$ (minimum), we get 50,000 credits. I have contacted the support team at CircleCI to ask for more details pertaining to our case. I am still awaiting their reply.

How long do the tests take when executed with GitHub Actions?

I've tried running a simple configuration on GitHub Actions to test OpenFisca-France, without any parallelisation. Caching the environment has also sped up the run time.

I only ran the build and test jobs, since the latter is the most problematic and time consuming. I did not add the linting jobs.

The resulting times:

  1. 15m 11s
  2. 10m 35s
  3. 10m 52s

Average: 12min 12s

GitHub Actions does not keep a history of successful runs, so I can only link the last run.

To compare, CircleCi ran the build job in 8m 50s.

With the free GitHub Actions plan, we get 20 concurrent jobs.

GitHub Actions does not offer a parallelism option like CircleCI. However, jobs run in parallel by default. Consequently, in order to optimise performance, we need many jobs running simultaneously.

Since yaml testing is the black sheep of the bunch, I used a strategy matrix to divide the yaml tests into 10 and then 20 jobs (20 being the limit offered by the free plan).

The resulting run times:

With 10 jobs:

  1. 3min 1s
  2. 3min 25s
  3. 2min 51s

Average: 3min 5s

Just to compare, on CircleCI the build run took 8min 15s.

With 20 jobs:

  1. 2min 42s
  2. 2min 35s
  3. 2min 43s

Average: 2min 40s

On CircleCI the build run took 7min 49s.

The downside of this split is that the yaml tests will not be split based on test duration, which means that some instances run longer than the others. CircleCI does store historical data of each tests duration and splits the test files based on running duration.

From this initial benchmark, the results are promising. It does seem that GitHub Actions runs faster than CircleCI. And so there is a real decision to be made as to which one should be chosen.

GitHub Actions hasn't been around as long as CircleCI, consequently the level of documentation and community size is not the same.

Obviously, there is a lot to be said for having the code and the CI on the same platform.

However we should keep in mind that with the free plan we are limited to:

  • 500 MB of storage
  • 2000 min per month
  • I don't know if it could potentially be an issue, but the log files are only kept up to 90 days, and the history of successful runs is not kept, only the last run, contrarily to CircleCI.

@MattiSG
Copy link
Member Author

MattiSG commented Aug 11, 2021

Thank you very much @HAEKADI! I updated the table at the beginning of this discussion.

Caching parameters could thus save us 4 to 5 seconds out of a 10 minutes run in a large codebase such as OpenFisca France.

If I understand correctly, parallelisation seems the most relevant option. GitHub Actions would enable more parallelisation for free, but with a loss of history after 90 days, and a maximum of 2000 free minutes per month. It is also not extendable beyond 20 workers, even if paying. On the other hand, CircleCI has the advantages of years of accumulated experience and practically no limits to expansion as long as we pay accordingly. Both services seem comparable in terms of raw performance.

In order to present the arbitrage properly, we need to know:

  • How much would CircleCI cost for 16 containers.
  • How many minutes tests run for on a typical month.
  • If we store any data, and if so roughly how much.

@HAEKADI can please you collect this information? 🙂

@bonjourmauko bonjourmauko added the kind:discovery Issue requires discovery: value, ux and tech label Aug 11, 2021
@HAEKADI
Copy link
Contributor

HAEKADI commented Aug 19, 2021

    The following table depicts the main differences, that I could gather, between CircleCI and GitHub Actions:
    CircleCI GitHub Actions
    Free Plan Parallelism 1 for free plan

    4 for open source

    20
    Storage No available data. 500 MB

    No limit for public repos

    Computation 2500 credits / week

    (about 10 credits per minute so 250 minutes/week for free plan)

    For open source:

    400 000 credits/ month

    2,000 / month

    => 500 minutes/week

    No limit for public repos

    We are consequently not limited by the minute usage.

    Persisting data

    Artifacts are stored for up to 30 days.

    Workspace for 15 days

    Caches created via the save_cache step are stored for up to 15 days

    See here.

    Artifacts and logs are stored for up to 90 days
    History runs Not clear.

    But there appears to be no limit.

    Not clear on the docs.

    I asked the question on GitHub community.

    The workflow run and status are kept but not the details.

    Example.

    Current average time for a test run With 4 runners:

    8min 15s

    With 10 runners:

    3min 5s

    Paid Plan Name Performance Team
    Price 30$/month

    (I still have not received an answer from the CircleCI support team regarding additional details for a custom plan.)

    4$ per user
    Parallelism Up to 80 concurrency 60
    Computation Starts at 50,000 credits for $30 3,000 minutes/month

    No limit for public repos

    Current

    usage

    Computation 5,669 minutes / month

    57,530 credits/ month in the last 3 months

    0 (as seen with @sandcha)

    Most likely because we have unlimited usage

    Storage For July 2021 : 220 GB (mostly France)

    See the screenshot below.

    0
    • The current free CircleCI plan details are not updated to the open source plan on the plan overview page, and consequently do not show the actual included credits and runners.

    • July 2021 data usage on CircleCI:

    Screenshot 2021-08-19 at 15 36 30

    • Comparing the two free plans, GitHub Actions seems to offer more if we were willing to overlook losing the logs after 90 days.

    @MattiSG
    Copy link
    Member Author

    MattiSG commented Aug 19, 2021

    Thank you so much for this detailed, sourced comparison @HAEKADI! This is excellent work and the best decision basis we could wish for 👏

    Based on it, the only downside I see with GitHub Actions is that we lose logs after 3 months. We regularly reference build runs in issues or pull requests, and some decisions are taken based on them, most of all regarding performance.

    I believe it can be acceptable to have only worflow-level metadata (which is kept indefinitely), and to learn to copy-paste in the issue body the subset of the logs when we make a decision based on them. This would allow us to increase parallelism (and thus test speeds) five fold. This would also increase vendor lock-in with GitHub, but this seems acceptable to me considering we can always revert to some other provider if we ever decide to migrate away from GitHub, and that this is not currently planned.

    I'll open an RFC for switching from CircleCI to GitHub Actions.

    Meanwhile, do you feel like investigating the two remaining elements:

    • What is the performance impact of using a Python interpreter faster than CPython?
    • What is the performance impact of using a Python compiler faster than CPython?

    If this is not obvious and you need more context, I unfortunately don't have enough technical knowledge to provide much more, so we can also wait for @maukoquiroga to handle this at some point in the future 🙂

    @MattiSG
    Copy link
    Member Author

    MattiSG commented May 5, 2022

    The switch to GitHub Actions seems to have alleviated the concern with tests speed on CI. Further possible improvements, i.e. changing Python compiler and interpreter, seem to have way stronger technical consequences. While they might be considered in the future in case they are deemed necessary, the issue of tests performance seems to have been enough under control that no community member complained about tests speed since implementation of #1030. I will thus close this issue 🙂 Thanks to all the people who made this dramatic improvement a reality 👏

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    kind:discovery Issue requires discovery: value, ux and tech
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants