Use ReFrame's CPU autodetect in test step #682
Conversation
…ke sure it gets autodetected _for every job we run_
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3
…o auto-detect once per partition. Then, the file is cached and available to be used in the next run!
…sing local spawner. But right now, it doesn't seem to detect anything
bot: build repo:eessi.io-2023.06-software arch:x86_64/generic
Ah, I didn't realize the topology files won't be stored in the homedir of the …
Although that means it needs to autodetect every time, that's actually a plus: we'll always have a topology file that is up to date, even if at some point we change the node types or something. It also means that technically we can skip replacing the partition name, as the detected … Anyway, I checked that …
Lgtm
I've figured out a way we can use the CPU autodetection of ReFrame with the local spawner: we just inject the name of the SLURM partition in which we are currently running into the ReFrame configuration file. This ensures that we get one autodetected topology file per SLURM partition. Note that the autodetection only needs to happen once for each architecture; after that, the result is there "forever" in `.reframe` in the homedir of the bot. It's good to use the autodetection, as it guarantees that all the CPU info we potentially rely on in the EESSI test suite is present. This is preferable to hard-coding it, and is actually recommended according to our own documentation :D
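The injection described above could be sketched roughly as follows; the template file name, variable name, and config content are illustrative stand-ins (only the `__RFM_PARTITION__` placeholder is from this PR), not the actual code from `test_suite.sh`:

```shell
# Hypothetical sketch of the partition-name injection.
# Create a minimal stand-in config template so the sketch is self-contained;
# the real ReFrame config template contains much more than this.
cat > settings.py.tmpl <<'EOF'
partitions = [{'name': '__RFM_PARTITION__', 'scheduler': 'local'}]
EOF

# SLURM sets SLURM_JOB_PARTITION inside a job; fall back to 'default' otherwise.
rfm_partition="${SLURM_JOB_PARTITION:-default}"

# Replace the placeholder to produce the actual config file.
sed "s/__RFM_PARTITION__/${rfm_partition}/g" settings.py.tmpl > settings.py
```

With this in place, the topology cache directory name (which includes the partition name) differs per SLURM partition, so each build-node type gets its own autodetected topology file.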
To explain a bit: prior to this PR, we created the ReFrame config file from a template. The template contained placeholders like `__NUM_CPUS__`, `__NUM_SOCKETS__`, `__NUM_CPUS_PER_CORE__` and `__NUM_CPUS_PER_SOCKET__`. In `test_suite.sh` we detected this information from the output of `lscpu`. The reason we did it this way is that we thought we couldn't rely on CPU autodetection by ReFrame: ReFrame stores the result in `$HOME/.reframe/topology/<system_name>-<partition_name>/processor.json`. Since the partition name would always be the same (`default`, because we use the local spawner), it would autodetect once and never again. That's problematic, since our tests actually run on different partitions (the different types of build nodes: `zen2`, `zen3`, `zen4`, `haswell`, etc.) with different node configurations.

The downside of that approach is that it doesn't guarantee a complete set of processor keywords; automatically detecting the processor config ensures that we have a predictable set of processor keywords defined in the ReFrame config file. E.g. in #585 it turned out that we were missing `num_cores_per_numa_node`. This is the reason our documentation recommends using CPU autodetection. When I hit that issue, I realized we can actually use CPU autodetection if, instead of the processor information, we simply detect which SLURM partition we are running on and use that in a template replacement. This means that when running on e.g. the `x86-64-amd-zen2-node` partition, `__RFM_PARTITION__` gets replaced by `x86-64-amd-zen2-node`. ReFrame's CPU autodetect will then do its detection and put the topology file in `$HOME/.reframe/topology/BotBuildTests-x86-64-amd-zen2-node/topology.json`. Then, when it next runs on e.g. `x86-64-intel-haswell-node`, it will again do CPU autodetection and store the result in `$HOME/.reframe/topology/BotBuildTests-x86-64-intel-haswell-node/topology.json`.

That's perfect: that's exactly how it would work with a non-local spawner. It also means CPU info only needs to be detected once per partition. The next time the bot runs on that partition, the info is already there, and the CPU autodetection step can (and will) be skipped by ReFrame.
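As a rough illustration of the caching behaviour (the path layout follows the description above; the partition name is just an example fallback, and the system name `BotBuildTests` is taken from the paths mentioned in this comment):

```shell
# Illustrative check of ReFrame's topology cache: with the partition name
# injected, each SLURM partition maps to its own directory under
# $HOME/.reframe/topology, and autodetection is skipped when it exists.
partition="${SLURM_JOB_PARTITION:-x86-64-amd-zen2-node}"  # example fallback
topo_dir="${HOME}/.reframe/topology/BotBuildTests-${partition}"
if [ -e "${topo_dir}" ]; then
  echo "cached topology found for ${partition}: autodetection will be skipped"
else
  echo "no cached topology for ${partition}: ReFrame will autodetect first"
fi
```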