[HPC] Proposal: Allow throughput extrapolation to large system size #508
Comments
It is essential to set realistic expectations about the largest scale that can actually be run on a system when reporting extrapolated numbers; otherwise the result may reflect a purely theoretical metric.
Wouldn't this encourage people to just measure on a single node and neglect the network entirely? It seems to encourage 'system scale' cherry-picking...
The comments focus primarily on the technical correctness of the proposed rules. I would like to remind all of us that we set a goal to increase participation, competition, and popularity. Assuming this is the goal we care to optimize for, other aspects of the benchmark will need to be compromised. We are not excited about that either, but we strongly believe that prioritizing participation is worth the cost to benchmark quality in the long run.

@TheKanter correct - this excludes network and IO impact from the measurement, which is a significant part of the score under today's rules. Ideally, this proposal should be combined with proposal #507 (remove data movement from score). Assuming #507 is approved, there is no reason to force submitters to actually do the "Throughput" runs, because the score is predictable and accurate without them. Obviously, this is not ideal, since it excludes the impact of certain system components and results in a theoretical peak performance measurement (similar to HPL). On the other hand, it significantly reduces the investment a potential submitter needs to make in order to have a submission (which was by far the loudest feedback we received).

Given that the primary goal of these proposals is to increase participation, competition, and popularity, proposals #507 and #508 can make a huge difference. FYI, I learned today that one of our partners, when asked if they will submit to MLPerf-HPC v3.0, said they can't do it and cited only budget constraints. #507 along with #508 reduce the cost of submission to its minimum.

Some thoughts about the points raised by @sparticlesteve earlier:
Proposal depends on #507.
Introduction:
After collecting feedback from engineers, clients, and press, NVIDIA presented a list of proposals that aim to improve the popularity of the MLPerf HPC benchmark suite. Please see our slide deck for more information on our feedback gathering process and insights.
Proposal: Allow throughput extrapolation to large system size
Slide 15 in proposals slide deck.
Since the FS is no longer part of the score (per proposal #507), there is no reason to continue running the "Throughput" benchmark (previously "weakly-scaled"; renamed in proposal #511) in the same way; score extrapolation becomes sufficient.
Under this proposal, submitters submit only TTS results (previously "strong scaling"; renamed in proposal #511). They can have multiple TTS submissions at different scales, etc.
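To make the mechanics concrete, here is a minimal sketch of what score extrapolation could look like. The function name and the "models trained per hour" metric are illustrative assumptions, not part of the proposal text; the key simplification is that the system is assumed to run independent training instances side by side with no interference from shared network or filesystem (the components that #507/#508 deliberately exclude from the score).

```python
def extrapolated_throughput(tts_seconds: float, instance_nodes: int,
                            system_nodes: int) -> float:
    """Extrapolate whole-system throughput (models trained per hour)
    from one measured time-to-solution (TTS) at a smaller scale.

    Hypothetical helper for illustration only. Assumes the system can
    run system_nodes // instance_nodes independent training instances
    concurrently with zero interference (ignores shared network and
    filesystem contention, which is exactly the criticism raised above).
    """
    concurrent_instances = system_nodes // instance_nodes
    return concurrent_instances * 3600.0 / tts_seconds

# A 128-node run finishing in 30 minutes, extrapolated to a
# 2048-node system: 16 concurrent instances x 2 runs/hour each.
print(extrapolated_throughput(1800.0, 128, 2048))  # 32.0
```

This also makes the cherry-picking concern visible: the extrapolated number scales linearly with `system_nodes`, so nothing in the formula itself penalizes measuring on a single well-tuned node, which is why a rule capping the claimed scale to what can realistically be run matters.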
This proposal aims to improve the popularity of the MLPerf HPC benchmark suite by improving on the following aspects:
Note: We previously supported the current rule because it is more technically robust, but we now think that was a mistake: we can only ask so much of HPC submitters, and participation is an issue.
Discussion
Pros:
Cons: