
[ci] Remove hardcoded test shards #10743

Merged: 1 commit into apache:main from generate_shard, Apr 8, 2022

Conversation

@driazati (Member) commented Mar 23, 2022

This moves the sharding logic from being inlined in the Jenkinsfile to a template, so we can change just the number of shards and the test allocation in `conftest.py` and the Jenkinsfile will match. This also changes the test allocation from the previous manual balancing to a random split between shards: each shard needs to know only its shard index and the total number of shards, then it hashes each test and skips it unless the hash falls within its allocated range. This breaks up related tests across shards, but has the downside that any change to the number of shards will shuffle where tests end up (ideally this is rare once we settle on a good number of shards). Some long-running tests are also manually allocated round-robin to different shards so that they land on separate shards as much as possible.

For now this applies only to the GPU frontend tests, but eventually we could expand it to more jobs.
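
To make the scheme concrete, here is a minimal sketch of the hash-based allocation described above. This is an illustration, not the code from this PR; the function name and the choice of md5 are assumptions.

```python
import hashlib
import os

def should_run(nodeid: str, shard_index: int, num_shards: int) -> bool:
    # Hash the pytest node id to a stable integer and keep the test only if
    # it falls into this shard's bucket. md5 is used (rather than Python's
    # built-in hash()) so the result is identical across processes and runs.
    digest = int(hashlib.md5(nodeid.encode()).hexdigest(), 16)
    return digest % num_shards == shard_index

# With TVM_NUM_SHARDS=3, each shard keeps roughly a third of the tests,
# chosen pseudo-randomly but deterministically from just these two values.
num_shards = int(os.environ.get("TVM_NUM_SHARDS", "1"))
shard_index = int(os.environ.get("TVM_SHARD_INDEX", "0"))
print(should_run("tests/python/relay/test_op.py::test_add", shard_index, num_shards))
```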

cc @areusch

@driazati driazati force-pushed the generate_shard branch 5 times, most recently from 62aed5c to 9ee0878 on March 23, 2022 21:06
@driazati driazati changed the title from "generate shard" to "[ci] Remove hardcoded test shards" on Mar 23, 2022
@driazati driazati force-pushed the generate_shard branch 16 times, most recently from 1676f79 to 05cf7cc on March 25, 2022 22:59
@driazati driazati marked this pull request as ready for review March 25, 2022 23:00
@github-actions github-actions bot requested a review from areusch March 25, 2022 23:03
@driazati driazati force-pushed the generate_shard branch 2 times, most recently from ea246e1 to e8dcd19 on March 29, 2022 20:25
@driazati driazati marked this pull request as draft March 29, 2022 20:25
@driazati driazati force-pushed the generate_shard branch 3 times, most recently from e44ff6e to a26b0ce on March 29, 2022 20:46
@driazati driazati force-pushed the generate_shard branch 4 times, most recently from ebc594c to 920b6d7 on March 31, 2022 00:21
@areusch (Contributor) left a comment

thanks @driazati, this looks pretty good! just a couple of small suggestions, which we could also defer if you feel strongly against them.


{% macro sharded_test_step(name, num_shards, node, ws) %}
{% for shard_index in range(1, num_shards + 1) %}
  '{{ name }} {{ shard_index }} of {{ num_shards }}': {
@areusch (Contributor):

minor nit: want to 0-pad `shard_index` and `num_shards` here?

@driazati (Member, Author):

meh, this is intended for humans and everything is `text-align: center`-ed anyway, so 0-padding won't make it easier to read IMO (also we are at like 3 shards max, not 10, so we can revisit later)

@areusch (Contributor):

ok sg. the main intent was that Jenkins sorts these alphabetically iiuc.

@driazati (Member, Author):

ah I see, we should still be ok on that front but would probably need to pad if we shard more than n=9
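
The sorting concern is easy to see in plain Python: unpadded shard labels sort lexicographically, so padding would only start to matter at 10+ shards. A hypothetical illustration, not code from this PR:

```python
# Unpadded labels: "10" sorts between "1" and "2" lexicographically.
labels = [f"frontend: GPU {i} of 10" for i in range(1, 11)]
print(sorted(labels)[:3])
# ['frontend: GPU 1 of 10', 'frontend: GPU 10 of 10', 'frontend: GPU 2 of 10']

# Zero-padded labels restore the intended numeric order.
padded = [f"frontend: GPU {i:02d} of 10" for i in range(1, 11)]
print(sorted(padded)[:3])
# ['frontend: GPU 01 of 10', 'frontend: GPU 02 of 10', 'frontend: GPU 03 of 10']
```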

(Two further review threads on jenkins/macros.j2 were resolved; the comments are outdated.)
@driazati driazati force-pushed the generate_shard branch 3 times, most recently from 49ba711 to a69f079 on April 6, 2022 18:34
@driazati driazati marked this pull request as ready for review April 6, 2022 18:40
@driazati driazati requested a review from areusch April 6, 2022 18:40
driazati added a commit to driazati/tvm that referenced this pull request Apr 6, 2022
@driazati driazati force-pushed the generate_shard branch 2 times, most recently from 4b2b890 to 7b79899 on April 6, 2022 21:32
driazati added a commit to driazati/tvm that referenced this pull request Apr 6, 2022
@driazati driazati force-pushed the generate_shard branch 3 times, most recently from 4ef21fa to 4c84028 on April 7, 2022 19:33
@driazati driazati marked this pull request as draft April 7, 2022 19:33
@driazati driazati force-pushed the generate_shard branch 2 times, most recently from aa568d0 to 040be72 on April 7, 2022 22:06
@driazati driazati marked this pull request as ready for review April 8, 2022 00:58
@driazati (Member, Author) commented Apr 8, 2022

I also verified the tests run in the various shards against the tests from a recent main run; the overall set of tests run is the same, so the sharding is working correctly. Code is here https://gist.github.com/f636700cd68b5717350c107a0eaaee4e for the curious.
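
A sketch of the kind of cross-check described (the linked gist has the actual code; the file names here are hypothetical):

```python
# Compare the union of tests run across all shards against a baseline main run.
def read_tests(path: str) -> set:
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

baseline = read_tests("main_run_tests.txt")  # hypothetical: test ids, one per line
sharded = set().union(*(read_tests(f"shard_{i}_tests.txt") for i in range(3)))

print("missing from shards:", baseline - sharded)
print("extra in shards:", sharded - baseline)
assert baseline == sharded, "sharding dropped or added tests"
```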

@areusch (Contributor) left a comment

thanks @driazati, a couple of small comments here and there, but we could defer them as they're less important than reducing CI time


def pytest_collection_modifyitems(config, items):
    if not all(k in os.environ for k in ["CI", "TVM_NUM_SHARDS", "TVM_SHARD_INDEX"]):
        # Only apportion tests if in CI and in a job that is set up for it
        return
@areusch (Contributor):

nit: could log here if CI is present
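
One way the suggested logging could look. A sketch only: the guard clause comes from the snippet above, the log call is the reviewer's suggestion, and the wording and use of the `logging` module are assumptions.

```python
import logging
import os

def pytest_collection_modifyitems(config, items):
    if not all(k in os.environ for k in ["CI", "TVM_NUM_SHARDS", "TVM_SHARD_INDEX"]):
        # Log when a job runs unsharded because the sharding variables are
        # absent, so this is visible in CI output per the review suggestion.
        if "CI" in os.environ:
            logging.info("CI is set but TVM_NUM_SHARDS/TVM_SHARD_INDEX are not; running all tests")
        return
    # ... apportion tests between shards as in the hashing sketch above ...
```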

"tests/python/topi/python/test_topi_conv2d_winograd.py::test_conv2d_nchw",
"tests/python/relay/test_py_converter.py::test_global_recursion",
]
HARDCODED_ALLOCATIONS = {}
@areusch (Contributor):

nit: could do with a dict comprehension: `HARDCODED_ALLOCATIONS = {i: v for i, v in enumerate(_slowest_tests)}`
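
For context, the round-robin allocation of the slowest tests mentioned in the PR description could look roughly like this. `_slowest_tests` and `HARDCODED_ALLOCATIONS` come from the quoted snippet; keying the dict by test name (the reverse of the literal suggestion above) is an assumption that makes node-id lookups direct, and `shard_for` is a hypothetical helper.

```python
import hashlib

_slowest_tests = [
    "tests/python/topi/python/test_topi_conv2d_winograd.py::test_conv2d_nchw",
    "tests/python/relay/test_py_converter.py::test_global_recursion",
]
# Each slow test gets a sequential index, which round-robins onto a shard
# via `index % num_shards` below, keeping slow tests on separate shards.
HARDCODED_ALLOCATIONS = {test: i for i, test in enumerate(_slowest_tests)}

def shard_for(nodeid: str, num_shards: int) -> int:
    if nodeid in HARDCODED_ALLOCATIONS:
        return HARDCODED_ALLOCATIONS[nodeid] % num_shards
    return int(hashlib.md5(nodeid.encode()).hexdigest(), 16) % num_shards
```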

@areusch areusch merged commit 0c17f07 into apache:main Apr 8, 2022
mehrdadh pushed a commit to mehrdadh/tvm that referenced this pull request Apr 11, 2022
Lucien0 pushed a commit to Lucien0/tvm that referenced this pull request Apr 19, 2022
altanh pushed a commit to altanh/tvm that referenced this pull request Apr 28, 2022