Minor Performance Regression Fixes #314

insertinterestingnamehere · 2024-11-07T18:44:33Z

These are some minor performance fixes to help address regressions downstream.

In particular, this removes an unnecessary thread fence in the IO code, and updates the MACHINE_FENCE macro to use a stronger type of memory fence (since this somehow results in better performance in practice).

The trylock patch is just a simplification that I wrote while attempting to debug the performance regressions. That should be performance neutral.
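For readers unfamiliar with the pattern, a trylock written directly against standard C11 atomics might look roughly like the sketch below. This is not the code from this PR: the `sketch_lock_t` type and the `sketch_trylock`/`sketch_unlock` names are hypothetical and exist only for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for illustration; not the qthreads data structure. */
typedef struct {
    atomic_bool locked;
} sketch_lock_t;

/* Try to take the lock once, without spinning.
 * Returns true if this caller acquired it, false if it was already held. */
static bool sketch_trylock(sketch_lock_t *l) {
    bool expected = false;
    /* Acquire ordering on success so later accesses in the critical section
     * cannot move above the lock acquisition; relaxed on failure. */
    return atomic_compare_exchange_strong_explicit(
        &l->locked, &expected, true,
        memory_order_acquire, memory_order_relaxed);
}

/* Release the lock; the release store publishes the critical section's writes. */
static void sketch_unlock(sketch_lock_t *l) {
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

A single compare-exchange expresses the whole operation, which is the kind of simplification that should not change performance on its own.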


janciesko · 2024-11-08T21:13:53Z

include/qthread/qthread.h

@@ -87,7 +87,7 @@ using std::memory_order_relaxed;

 #include "macros.h"

-#define MACHINE_FENCE atomic_thread_fence(memory_order_acq_rel);


Is this change needed for correctness or for performance?

Surprisingly, performance. I'm still confused as to how/why this is happening, but for the Chapel benchmark I'm using to track the regression there's a modest but consistent performance improvement from using sequentially consistent fences instead of acquire-release fences. This matches what the old assembly-based version was doing prior to b122234 anyway, but it's still weird.

I'm not aware of anything in our codebase that ought to actually require sequential consistency for correctness' sake. The fact that sequential consistency is beneficial here seems to indicate that there's something suboptimal going on internally. This patch at least corrects the downstream performance regression though.
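For context on what the two fence flavors actually guarantee, here is a minimal sketch of one side of a store-buffering pattern. It is an illustration only, not code from qthreads: the `FENCE_ACQ_REL`/`FENCE_SEQ_CST` macros, the two flags, and `announce_then_check` are hypothetical names.

```c
#include <stdatomic.h>

/* Hypothetical macro names for illustration; the PR itself only touches MACHINE_FENCE. */
#define FENCE_ACQ_REL() atomic_thread_fence(memory_order_acq_rel)
#define FENCE_SEQ_CST() atomic_thread_fence(memory_order_seq_cst)

static atomic_int flag_a;
static atomic_int flag_b;

/* One side of a store-buffering pattern: publish our flag, fence, then read
 * the other side's flag.
 *
 * If two threads run this concurrently with swapped arguments and both use
 * the seq_cst fence, at least one of them must observe the other's store.
 * With FENCE_ACQ_REL() instead, both may read 0, because an acquire-release
 * fence does not order an earlier store against a later load. */
static int announce_then_check(atomic_int *mine, atomic_int *theirs) {
    atomic_store_explicit(mine, 1, memory_order_relaxed);
    FENCE_SEQ_CST();
    return atomic_load_explicit(theirs, memory_order_relaxed);
}
```

On x86, compilers typically emit no instruction for the acquire-release fence and a full barrier for the sequentially consistent one, so the stronger fence being faster is counterintuitive, as noted above.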

@insertinterestingnamehere

Backporting two more performance fixes from sandialabs/qthreads#314. [Contributed by @insertinterestingnamehere. Reviewed and merged by @jabraham17]

insertinterestingnamehere added 3 commits October 29, 2024 14:35

Remove unnecessary fence in io code.

cd9c77b

Switch back to doing a sequential consistency memory fence for the MACHINE_FENCE macro. Somehow this results in better performance for some chapel benchmarks.

19b4f6f

Use standard atomics more directly in the trylock implementation.

59b2236

insertinterestingnamehere force-pushed the perf branch from 26f1469 to 59b2236 Compare November 7, 2024 18:59

janciesko reviewed Nov 8, 2024


Slight cleanup and restructuring in the hashmap code.

4f0d25c

insertinterestingnamehere merged commit af98f00 into sandialabs:release-1.22-pre Nov 26, 2024
293 of 295 checks passed

insertinterestingnamehere mentioned this pull request Nov 26, 2024

Backport More Qthreads Performance Fixes chapel-lang/chapel#26328

Merged

insertinterestingnamehere deleted the perf branch December 5, 2024 18:38

insertinterestingnamehere mentioned this pull request Dec 17, 2024

Additional Performance Fixes #315

Merged

insertinterestingnamehere added this to the 1.22 Release milestone Dec 18, 2024
