A/B tests with package sync + repeats #355
Conversation
A/B tests evidence: https://github.com/coiled/coiled-runtime/actions/runs/3091876943

This is ready for review.
Most of these changes look reasonable to me, though I haven't gone through in detail. My main question is whether we should just further simplify the tests workflow matrix rather than pushing these lumpy `include` blocks around.
```diff
@@ -22,19 +22,65 @@ defaults:
     shell: bash -l {0}

 jobs:
-  runtime:
-    name: Runtime - ${{ matrix.os }}, Python ${{ matrix.python-version }}, Runtime ${{ matrix.runtime-version }}
+  tests:
```
Over in #279 I proposed doing something similar to this, but without a new `category` matrix item. What would you think about just running the tests as a single job, and letting xdist do the rest?
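As a rough illustration, a single-job setup could look something like the step below, assuming pytest-xdist is installed in the environment (the step name, path, and flags are illustrative, not taken from this repo):

```yaml
# Hypothetical single test step: no category matrix axis; pytest-xdist
# spreads the whole suite across all available CPUs on the runner.
- name: Run all tests
  run: pytest -n auto tests
```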
I'd say it's reasonable, and it would save some time. Some non-trivial engineering is needed - mind if I do it in a follow-up PR?
Sure, I don't think anything here makes that refactor harder.
```yaml
python-version: ["3.9"]
runtime-version: ["upstream", "latest", "0.0.4", "0.1.0"]
category: [runtime, benchmarks, stability]
```
So maybe we just don't do this and all of the extra `include` logic.
Even if you run all tests together, you'll still need a wordy `include` block.
Instead of this:
```yaml
matrix:
  os: [ubuntu-latest]
  python-version: ["3.9"]
  category: [runtime, benchmarks, stability]
  runtime-version: [upstream, latest, "0.0.4", "0.1.0"]
  include:
    # Run stability tests on Python 3.8
    - category: stability
      python-version: "3.8"
      runtime-version: upstream
      os: ubuntu-latest
    ...
```
it will look like this:
```yaml
matrix:
  os: [ubuntu-latest]
  python-version: ["3.9"]
  pytest_args: [tests]
  runtime-version: [upstream, latest, "0.0.4", "0.1.0"]
  include:
    # Run stability tests on Python 3.8
    - pytest_args: tests/stability
      python-version: "3.8"
      runtime-version: upstream
      os: ubuntu-latest
    ...
```
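For context, the `pytest_args` value would presumably be consumed by the test step roughly like this (a sketch under that assumption, not an excerpt from this PR):

```yaml
# Hypothetical consuming step: `tests` runs the whole suite,
# while `tests/stability` runs only the stability subset.
- name: Run tests
  run: pytest ${{ matrix.pytest_args }}
```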
In the next PR, I want to merge ab_tests.yaml into tests.yaml. In that PR I'll dynamically generate the whole matrix with `discover_ab_environments.py` (to be renamed); the matrix for non-A/B tests will be generated from parameters in `ci/config.yaml` (now `AB_environments/config.yaml`).
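A hedged sketch of the standard GitHub Actions pattern for a dynamically generated matrix, which is presumably the shape that next PR would take (job names, the script location, and the JSON output handling are assumptions, not taken from this PR):

```yaml
# Hypothetical two-job layout: a small job runs the discovery script and
# publishes the matrix as JSON; the test job consumes it via fromJson().
jobs:
  discover:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.set-matrix.outputs.matrix }}
    steps:
      - uses: actions/checkout@v3
      - id: set-matrix
        run: echo "matrix=$(python discover_ab_environments.py)" >> $GITHUB_OUTPUT
  tests:
    needs: discover
    strategy:
      matrix: ${{ fromJson(needs.discover.outputs.matrix) }}
    ...
```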
Yeah, I was proposing just running the whole test suite on every matrix value. Perhaps that's overkill, though.
> In the next PR, I want to merge ab_tests.yaml into tests.yaml. In that PR I'll dynamically generate the whole matrix with `discover_ab_environments.py` (to be renamed); the matrix for non-A/B tests will be generated from parameters in `ci/config.yaml` (now `AB_environments/config.yaml`).
I'm a little concerned about the complexity of generating bespoke test matrices, and what it will mean for local testing. I was hoping to make the test matrix way simpler.
I haven't gone through in great detail, but this looks good to me from a high level.
- `repeat: N` setting, which causes every A/B test runtime to be rerun N times
- `test_null_hypothesis: true` setting, which creates a verbatim clone of AB_baseline
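A minimal sketch of how those two settings might look in the config file, assuming plain top-level keys (the exact layout of `AB_environments/config.yaml` is not shown in this excerpt):

```yaml
# Hypothetical AB_environments/config.yaml excerpt
repeat: 5                   # rerun every A/B test runtime 5 times
test_null_hypothesis: true  # add a verbatim clone of AB_baseline to the runs
```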

Out of scope: