Support for ARM Neoverse N1 platform #381

Merged
merged 11 commits into from
Jun 5, 2023


amarathe84
Collaborator

@amarathe84 amarathe84 commented Feb 7, 2023

Description

This WIP PR adds support for the Ampere Neoverse N1 ARM SoC platform. The changes introduce an ARM CPU version check that binds the appropriate low-level functionality to the higher-level API, and they implement the low-level Variorum functions that expose the power features supported by the Neoverse N1 platform. The ARM code base will also be refactored so that it is split by specific ARM platform.

Fixes #378, #379

Task checklist

  • Code contributions to support Ampere Neoverse N1 telemetry
  • Integration/regression testing
    • ARM Neoverse N1 tests
    • ARM Juno r2 telemetry APIs
    • ARM Juno r2 control API
  • Documentation update
  • Build/CI update

Testing

Unit and component testing will be done using the Variorum example programs on the following systems:

  • Neoverse N1 tests on NVHPC1 in powerlab, running the examples that exercise the telemetry functions.
  • Arm Juno r2 tests on Juno in powerlab, rerunning the examples previously used to test the telemetry and capping APIs.

Nvhpc1 system integration tests (Ampere Neoverse N1)

Print power usage

$ examples/variorum-print-power-example
_ARM_POWER Host CPU_mW I/O_mW
_ARM_POWER nvhpc1 9583.00 32763.00
_ARM_POWER nvhpc1 9583.00 32763.00


$ examples/variorum-print-verbose-power-example
_ARM_POWER Host: nvhpc1, CPU: 9988.00 mW, I/O: 32623.00 mW
_ARM_POWER Host: nvhpc1, CPU: 9988.00 mW, I/O: 32623.00 mW
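
For scripted regression checks, the tabular output above can be parsed line by line. The following is a minimal sketch in C, assuming the three-column Neoverse N1 layout shown above (host, CPU mW, I/O mW); the struct and function names are hypothetical and not part of Variorum.

```c
#include <stdio.h>

/* Hypothetical container for one non-verbose Neoverse N1 power sample. */
struct n1_power_sample {
    char host[64];
    double cpu_mw;
    double io_mw;
};

/* Parse a line such as "_ARM_POWER nvhpc1 9583.00 32763.00".
 * Returns 0 on success and -1 for anything else, including the header
 * line, whose column names do not parse as numbers. */
static int parse_n1_power_line(const char *line, struct n1_power_sample *s)
{
    int n = sscanf(line, "_ARM_POWER %63s %lf %lf",
                   s->host, &s->cpu_mw, &s->io_mw);
    return (n == 3) ? 0 : -1;
}
```

The Juno r2 output has four power columns, so it would need a separate format string.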

Print CPU temperature

$ examples/variorum-print-thermals-example
_ARM_TEMPERATURE Host LOC1 SoC
_ARM_TEMPERATURE nvhpc1 38.00 29.00

$ examples/variorum-print-verbose-thermals-example
_ARM_TEMPERATURE Host: nvhpc1,LOC1: 38.00 C, SoC: 29.00 C

Print CPU frequency

$ examples/variorum-print-frequency-example
_ARM_CLOCKS Host Socket Clock_MHz
_ARM_CLOCKS nvhpc1 0 2777

$ examples/variorum-print-verbose-frequency-example
_ARM_CLOCKS Host: nvhpc1, Socket: 0, Clock: 2777 MHz
_ARM_CLOCKS Host: nvhpc1, Socket: 0, Clock: 2777 MHz

Cap CPU frequency

$ examples/variorum-cap-socket-frequency-limit-example -f 1000
Capping CPU 0 to 1000 MHz.
_ARM_CLOCKS Host Socket Clock_MHz
_ARM_CLOCKS nvhpc1 0 1000

Juno system integration and regression tests (Arm Juno r2)

Print power usage

$ examples/variorum-print-power-example
_ARM_POWER Host Sys_mW Big_mW Little_mW GPU_mW
_ARM_POWER genericarmv8 806.95 41.55 233.75 93.36
_ARM_POWER genericarmv8 772.48 43.70 228.30 93.17

$ examples/variorum-print-verbose-power-example
_ARM_POWER Host: genericarmv8, Sys: 805.21 mW, Big: 43.73 mW, Little: 236.41 mW, GPU: 95.63 mW
_ARM_POWER Host: genericarmv8, Sys: 775.39 mW, Big: 41.58 mW, Little: 192.58 mW, GPU: 93.11 mW

Print CPU temperature

$ examples/variorum-print-thermals-example
_ARM_TEMPERATURE Host Sys_C Big_C Little_C GPU_C
_ARM_TEMPERATURE genericarmv8 37.85 23.12 24.16 24.62

$ examples/variorum-print-verbose-thermals-example
_ARM_TEMPERATURE Host: genericarmv8, Sys: 37.85 C, Big: 23.30 C, Little: 24.47 C, GPU: 24.89 C

Print CPU frequency

$ examples/variorum-print-frequency-example
_ARM_CLOCKS Host CPU Socket Clock_MHz
_ARM_CLOCKS genericarmv8 Big 0 950
_ARM_CLOCKS genericarmv8 Little 1 600

$ examples/variorum-print-verbose-frequency-example
_ARM_CLOCKS Host: genericarmv8, CPU: Big, Socket: 0, Clock: 950 MHz
_ARM_CLOCKS Host: genericarmv8, CPU: Little, Socket: 1, Clock: 600 MHz
_ARM_CLOCKS Host: genericarmv8, CPU: Big, Socket: 0, Clock: 950 MHz
_ARM_CLOCKS Host: genericarmv8, CPU: Little, Socket: 1, Clock: 600 MHz

Cap CPU frequency

# examples/variorum-cap-socket-frequency-limit-example -i 1 -f 1000
Capping CPU 1 to 1000 MHz.

_ARM_CLOCKS Host CPU Socket Clock_MHz
_ARM_CLOCKS genericarmv8 Big 0 950
_ARM_CLOCKS genericarmv8 Little 1 1000

@amarathe84 amarathe84 changed the title WIP: Support for ARM Neoverse N1 platform Support for ARM Neoverse N1 platform Mar 28, 2023
@amarathe84 amarathe84 requested a review from slabasan March 28, 2023 18:33
@amarathe84 amarathe84 marked this pull request as ready for review March 28, 2023 18:33
@tpatki tpatki added the status-ready-for-review Formatted, and tested on multiple systems. label Mar 28, 2023
@amarathe84 amarathe84 requested a review from tpatki March 28, 2023 23:43
Member

@tpatki tpatki left a comment

For documentation, some updates are missing:

uint64_t *model = (uint64_t *) malloc(sizeof(uint64_t));
*model = ARMV8;
unsigned long *model = (unsigned long *) malloc(sizeof(uint64_t));
asm volatile(
Member

I haven't tested this change, @amarathe84. I assume you have thoroughly tested the part where it picks up the model. Is there documentation on this that we can add somewhere?

Also, I noticed that for Juno r2 you now have separate models for the big and the LITTLE processors, as opposed to the single model we had before. Can you elaborate on why? Since we have been viewing big.LITTLE as a single entity, should we represent them as the same model?

Collaborator Author

I updated the PR description with the integration/regression tests on nvhpc1 and Juno r2. Please take a look at the test outcomes to see if they look okay.

Let me also post the description for the updated model ID check here shortly.

Collaborator Author

In the existing ARM implementation we assumed a generic ARMV8 (constant) model. To distinguish between ARM implementations, we need to look at the 'Primary part number' field (bits [15:4]) of the MIDR_EL1 (Main ID) register, which is defined per ARM CPU implementation (as opposed to per combined SoC like Juno r2). Here are the links to the MIDR_EL1 register descriptions for the three ARM CPUs we support:

Cortex A53:
https://developer.arm.com/documentation/ddi0500/j/System-Control/AArch64-register-descriptions/Main-ID-Register--EL1

Cortex A72:
https://developer.arm.com/documentation/100095/0003/System-Control/AArch64-register-descriptions/Main-ID-Register--EL1?lang=en

Neoverse N1:
https://developer.arm.com/documentation/100616/0301/register-descriptions/aarch64-system-registers/midr-el1--main-id-register--el1

Based on the model ID we set up the lower-level interfaces.
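
The field extraction described above can be sketched as follows. This is an illustration, not the code in this PR: the Cortex-A72 and Cortex-A53 part numbers match the enum values reviewed in this PR, the Neoverse N1 value (0xd0c) is taken from the Arm documentation linked above, and the enumerator name ARM_NEOVERSE_N1 and both helper functions are assumptions.

```c
#include <stdint.h>

/* Part numbers held in MIDR_EL1 bits [15:4]. */
enum arm_part_number {
    ARM_CORTEX_A53  = 0xd03,
    ARM_CORTEX_A72  = 0xd08,
    ARM_NEOVERSE_N1 = 0xd0c  /* assumed name, value from Arm's N1 documentation */
};

/* Extract the 12-bit 'Primary part number' field from a raw MIDR_EL1 value. */
static unsigned int midr_part_number(uint64_t midr)
{
    return (unsigned int)((midr >> 4) & 0xfffu);
}

/* Hypothetical helper: map a part number to a printable name. */
static const char *part_name(unsigned int part)
{
    switch (part) {
    case ARM_CORTEX_A53:
        return "Cortex-A53";
    case ARM_CORTEX_A72:
        return "Cortex-A72";
    case ARM_NEOVERSE_N1:
        return "Neoverse N1";
    default:
        return "unknown";
    }
}
```

For example, a raw value of 0x413fd0c1 yields part number 0xd0c, which this sketch maps to "Neoverse N1".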

Juno r2 is a big.LITTLE implementation with both Cortex-A72 and Cortex-A53 in a single SoC. There are systems (e.g., revisions of the Raspberry Pi) with only one of these two CPUs but with the same interfaces to the lower-level functionality, so the same code should work on them, although we haven't tested on such systems.

There's also a filesystem interface for MIDR_EL1, but I couldn't confirm that it's always available on an Arm implementation, so I went with the MRS instruction to get the model ID instead.
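
For comparison, reading the register through the filesystem interface could look like the sketch below. It assumes the per-CPU sysfs layout documented for arm64 Linux (/sys/devices/system/cpu/cpuN/regs/identification/midr_el1), which, as noted above, may not be present on every system. The function names are hypothetical.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Build the assumed per-CPU sysfs path for MIDR_EL1.
 * Returns 0 on success, -1 if the buffer is too small. */
static int midr_sysfs_path(char *buf, size_t len, int cpu)
{
    int n = snprintf(buf, len,
                     "/sys/devices/system/cpu/cpu%d/regs/identification/midr_el1",
                     cpu);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Read one hexadecimal MIDR_EL1 value (e.g. 0x00000000413fd0c1) from a
 * text file. Returns 0 on success, -1 if the file is missing or malformed. */
static int read_midr_file(const char *path, uint64_t *midr)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        return -1;
    }
    int matched = fscanf(fp, "%" SCNx64, midr);
    fclose(fp);
    return (matched == 1) ? 0 : -1;
}
```

A caller would treat a -1 return as "interface not available" and fall back to another detection method.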

I'll add a subsection in the ARM Overview about model identification along with the links to ARM documentation.

Member

Thank you for explaining, @amarathe84! This is very helpful for understanding the details of the model/arch_id. And yes, it would be great to document this in the ARM documentation as well, outside of this issue. I didn't know about the MIDR_EL1 register.

Comment on lines +80 to +81
ARM_CORTEX_A72 = 0xd08, //ARM Cortex-A72 MPCore processor
ARM_CORTEX_A53 = 0xd03, //ARM Cortex-A53 MPCore processor
Member

Why are these separate now? Shouldn't we represent the big.LITTLE Juno r2 device as a single entity, as we did in the past?

Collaborator Author

@amarathe84 amarathe84 May 15, 2023

Update: The model check in config_arm.c accepts either 0xd08 or 0xd03 because the mrs %0, MIDR_EL1 instruction (with output operand =r, model) may return either value, depending on which CPU the OS scheduler runs it on at runtime. As long as we pick up one of these model IDs, we know that the platform is not Neoverse N1 and proceed with the sysfs interface.

Specifically for Juno r2, we could check for both the A53 and A72 CPUs in the big.LITTLE SoC rather than for either one of them, but the change may be non-trivial: detect_arm_arch() would need topological information (i.e., a list of the CPUs present on the system) to run on both CPUs sequentially. Should I explore that, or does the existing check suffice?

Member

Yes, I think this would address my concern about checking it as a single entity. Thanks @amarathe84 !

Collaborator Author

I changed the detect_arm_arch() function to check the CPU ID of the big and LITTLE clusters using the sysfs interface for the midr_el1 register. The fix works for both Neoverse N1 and Juno r2. I did notice that the file I/O slows down the architecture check, but that's the only way to simplify the logic. All tests worked as expected.
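
One way to treat the platform as a single entity is to classify it from the MIDR values of all CPUs together. The sketch below is not the detect_arm_arch() implementation from this PR; the enum and function names are hypothetical, and the per-CPU MIDR values are assumed to have been collected already (for instance through sysfs).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical platform classification. */
enum arm_platform {
    PLATFORM_UNKNOWN = 0,
    PLATFORM_JUNO_LIKE,   /* only Cortex-A72 (0xd08) and/or Cortex-A53 (0xd03) */
    PLATFORM_NEOVERSE_N1  /* only Neoverse N1 (0xd0c) */
};

/* Classify a system from the raw MIDR_EL1 value of every CPU.
 * Any unrecognized part number, an empty list, or a mix of the two
 * families yields PLATFORM_UNKNOWN. */
static enum arm_platform classify_platform(const uint64_t *midr, size_t ncpus)
{
    int saw_juno = 0;
    int saw_n1 = 0;

    for (size_t i = 0; i < ncpus; i++) {
        /* Part number lives in bits [15:4]. */
        unsigned int part = (unsigned int)((midr[i] >> 4) & 0xfffu);
        if (part == 0xd08 || part == 0xd03) {
            saw_juno = 1;
        } else if (part == 0xd0c) {
            saw_n1 = 1;
        } else {
            return PLATFORM_UNKNOWN;
        }
    }
    if (saw_n1 && !saw_juno) {
        return PLATFORM_NEOVERSE_N1;
    }
    if (saw_juno && !saw_n1) {
        return PLATFORM_JUNO_LIKE;
    }
    return PLATFORM_UNKNOWN;
}
```

With this approach the result does not depend on which core the scheduler happens to run the check on.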

@tpatki
Member

tpatki commented May 10, 2023

@amarathe84 Can you look at the comments here, and then we can merge?

@tpatki tpatki linked an issue May 11, 2023 that may be closed by this pull request
@amarathe84
Collaborator Author

For documentation, some updates are missing:

Updated both README.md and variorum.h to indicate Neoverse N1 supported APIs.

@tpatki
Member

tpatki commented May 13, 2023

Looks good to me @amarathe84! I am curious about the model description check, but the PR is good to go I think.
@slabasan can merge after she's had time to review.

@slabasan slabasan merged commit 349746f into LLNL:dev Jun 5, 2023
@slabasan slabasan added this to the Production: v0.7.0 Release milestone Jun 5, 2023