Perl_sv_setsv_flags - IV/NV & cold code optimisation #22725

richardleach · 2024-11-06T23:55:04Z

Perl_sv_setsv_flags is one of the hottest functions in the interpreter,
at least when looking at the coverage from running the test harness.

This PR basically does two things:

The "fast number" code early in the function now handles one SV being
an IV and the other an NV, provided both are "bodyless" types. As a
result, an IV is only upgraded to an NV, not a PVNV, to store an NV
value, and an NV is downgraded to an IV, not upgraded to a PVIV, to
store an IV value. This avoids allocating and freeing actual bodies.

Having done this, subsequent code paths later in the function are
rendered completely unreachable and have been excised.
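
To make the type transitions described above concrete, here is a minimal toy model in C. It is not code from sv.c: the `toy_*` names, the enum and the allocation counter are all hypothetical stand-ins, and the model only covers a bodyless IV or NV destination receiving a bodyless IV or NV source.

```c
#include <assert.h>

/* Toy stand-ins only: NOT the real svtype values or SV layout from sv.h. */
typedef enum { TOY_IV, TOY_NV, TOY_PVIV, TOY_PVNV } toy_type;

typedef struct {
    toy_type type;
    union { long iv; double nv; } val;
} toy_sv;

/* Counts how many (pretend) bodies the model has allocated so far. */
static unsigned long toy_bodies_allocated = 0;

static int toy_is_bodyless(toy_type t)
{
    return t == TOY_IV || t == TOY_NV;
}

/* Assumed model of the behaviour described for blead: a mixed IV/NV
 * assignment upgrades the destination to a type that needs a body. */
static void toy_assign_before(toy_sv *dst, const toy_sv *src)
{
    assert(toy_is_bodyless(dst->type) && toy_is_bodyless(src->type));
    if (dst->type != src->type) {
        dst->type = (src->type == TOY_NV) ? TOY_PVNV : TOY_PVIV;
        toy_bodies_allocated++;
    }
    dst->val = src->val;
}

/* Assumed model of the patched behaviour: the destination is simply
 * retyped in place, so no body is ever allocated. */
static void toy_assign_after(toy_sv *dst, const toy_sv *src)
{
    assert(toy_is_bodyless(dst->type) && toy_is_bodyless(src->type));
    dst->type = src->type;
    dst->val = src->val;
}
```

In this model, assigning an NV to an IV destination with `toy_assign_before` leaves it as `TOY_PVNV` and bumps the counter, whereas `toy_assign_after` leaves it as `TOY_NV` with no allocation.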

These changes should be transparent to most Perl users; only code that
actually inspects SV types - presumably test code - might notice.

The croak-ing code within extremely cold fail-safe code paths has been
moved out into a helper function. (I'd have put this in cold.c if we
had one.) This new function need not be optimised.

This change should be entirely transparent.

This set of changes requires a perldelta entry, and it is included.

richardleach · 2024-11-07T00:01:26Z

Note: I tried to create a fast path for PV->PV assignments, the most popular path after the fast NULL/IV/NV code. It was easy enough to make modifications to reduce the number of instructions and branches, but the number of actual cycles reported by perf either stayed identical or increased!

This seemed to be due to much greater front-end and back-end stalls. That could be down to coincidentally unfavourable alignment, so it might be something to revisit.

Here's a section of the gcov output produced by running make test on a patched gcov build, with comments for context.

264440315: 4304:    case SVt_PV: /* Hit 67% of the time when running Perl's test suite */
264440315: 4305:        if (LIKELY(dtype == SVt_PV)) { /* ~70% of the time in this case */
186543132: 4306:          fast_pv:
245826680: 4307:            sflags = SvFLAGS(ssv);
245826680: 4308:            if ( (sflags & (SVf_ROK|SVp_POK)) == SVp_POK) /* taken >90% of the time */
245201806: 4309:                goto pv_pok;
 77897183: 4310:        } else if (dtype < SVt_PV) { /* ~22% of the time in this case */
 59283548: 4311:            sv_upgrade(dsv, SVt_PV);
 59283548: 4312:            goto fast_pv;
        -: 4313:        }
 19238509: 4314:        break; /* 7% of the time in this case */

tonycoz · 2024-11-07T00:39:36Z

> Note: I tried to create a fast path for PV->PV assignments, the most popular path after the fast NULL/IV/NV code. It was easy enough to make modifications to reduce the number of instructions and branches, but the number of actual cycles reported by perf either stayed identical or increased!

With regard to the performance questions, I believe Intel VTune is intended for answering these kinds of questions and, unlike years ago, is part of the free oneAPI toolkit. I haven't had a need for it myself (yet), but it might be worth trying here if you have the time/inclination.

tonycoz · 2024-11-12T04:18:46Z

One other thing here: this is apparently a performance optimization, so benchmark results would be useful, along with the conditions they were measured under.

richardleach · 2024-11-13T23:51:33Z

Thanks for the reminder, @tonycoz. I measured a few different scenarios using perf stat against blead and with this PR applied ("patched"). Both builds had ivsize=8 and nvsize=8, so both IV and NV types used the "bodyless" mechanisms. There was minor noise between runs, as usual, so multiple runs were done to account for this.

No measurable change was expected or observed on:

my $x = 1; for (1..100_000_000){ $x = 1; } (IV into IV assignment)
my $x = 1.1; for (1..100_000_000){ $x = 1.1; } (NV into NV assignment)

There seemed to be a tiny improvement to the following, but nothing significant:

my $x = ""; for (1..100_000_000){ $x = "Just another Perl Hacker" } (PV into PV)

(The fast-NULL/IV/NV block at the top of the function, plus the PV->PV path, account
for ~70% of calls to Perl_sv_setsv_flags, according to gcov when running the test harness.)

Noticeable changes were expected and observed when blead upgraded to a type
with an allocated body, but the patched version did not.

my $x = 1.1; for (1..100_000_000){ $x = 1.1; $x = 1;} (alternating IV/NV assignments)

blead:

          2,897.24 msec task-clock                       #    0.993 CPUs utilized          
                 4      context-switches                 #    1.381 /sec                   
                 0      cpu-migrations                   #    0.000 /sec                   
               189      page-faults                      #   65.234 /sec                   
    12,670,977,088      cycles                           #    4.373 GHz                    
        43,543,251      stalled-cycles-frontend          #    0.34% frontend cycles idle   
           910,784      stalled-cycles-backend           #    0.01% backend cycles idle    
    42,807,329,051      instructions                     #    3.38  insn per cycle         
                                                  #    0.00  stalled cycles per insn
     9,001,709,241      branches                         #    3.107 G/sec

patched:

          2,575.52 msec task-clock                       #    0.988 CPUs utilized          
                 5      context-switches                 #    1.941 /sec                   
                 0      cpu-migrations                   #    0.000 /sec                   
               186      page-faults                      #   72.218 /sec                   
    11,289,082,929      cycles                           #    4.383 GHz                    
        17,195,387      stalled-cycles-frontend          #    0.15% frontend cycles idle   
         1,761,258      stalled-cycles-backend           #    0.02% backend cycles idle    
    41,206,945,181      instructions                     #    3.65  insn per cycle         
                                                  #    0.00  stalled cycles per insn
     8,501,624,007      branches                         #    3.301 G/sec

for (1..100_000_000) { my @nums = (1); $nums[0] = 1.1 } (setting an IV, assigning an NV, freeing the SV)

blead:

          7,901.93 msec task-clock                       #    0.998 CPUs utilized          
                 5      context-switches                 #    0.633 /sec                   
                 0      cpu-migrations                   #    0.000 /sec                   
               189      page-faults                      #   23.918 /sec                   
    36,490,011,259      cycles                           #    4.618 GHz                    
       200,534,959      stalled-cycles-frontend          #    0.55% frontend cycles idle   
         2,269,549      stalled-cycles-backend           #    0.01% backend cycles idle    
   109,313,808,352      instructions                     #    3.00  insn per cycle         
                                                  #    0.00  stalled cycles per insn
    21,903,161,401      branches                         #    2.772 G/sec

patched:

          6,445.95 msec task-clock                       #    0.993 CPUs utilized          
                43      context-switches                 #    6.671 /sec                   
                 0      cpu-migrations                   #    0.000 /sec                   
               187      page-faults                      #   29.010 /sec                   
    28,437,333,228      cycles                           #    4.412 GHz                    
       282,558,412      stalled-cycles-frontend          #    0.99% frontend cycles idle   
           100,532      stalled-cycles-backend           #    0.00% backend cycles idle    
    94,612,222,136      instructions                     #    3.33  insn per cycle         
                                                  #    0.00  stalled cycles per insn
    18,802,825,054      branches                         #    2.917 G/sec

my @nums = (1) x 100_000_000; for (1..100_000_000) { $nums[$_] = 1.1 } (NVs into a large array of IVs)

blead:

         12,682.24 msec task-clock                       #    0.900 CPUs utilized          
            19,272      context-switches                 #    1.520 K/sec                  
                 0      cpu-migrations                   #    0.000 /sec                   
         1,674,061      page-faults                      #  132.000 K/sec                  
    47,687,795,920      cycles                           #    3.760 GHz                    
       356,547,911      stalled-cycles-frontend          #    0.75% frontend cycles idle   
     1,659,808,928      stalled-cycles-backend           #    3.48% backend cycles idle    
    90,426,683,028      instructions                     #    1.90  insn per cycle         
                                                  #    0.02  stalled cycles per insn
    19,795,775,732      branches                         #    1.561 G/sec

patched:

          5,468.95 msec task-clock                       #    0.995 CPUs utilized          
                67      context-switches                 #   12.251 /sec                   
                 0      cpu-migrations                   #    0.000 /sec                   
           595,020      page-faults                      #  108.800 K/sec                  
    24,397,472,622      cycles                           #    4.461 GHz                    
        73,737,773      stalled-cycles-frontend          #    0.30% frontend cycles idle   
     1,388,156,279      stalled-cycles-backend           #    5.69% backend cycles idle    
    67,612,961,402      instructions                     #    2.77  insn per cycle         
                                                  #    0.02  stalled cycles per insn
    14,661,104,730      branches                         #    2.681 G/sec

/usr/bin/time -v was used to measure the Maximum resident set size (kbytes) for the above large array allocation runs:

blead: 7738900 kbytes
patched: 4871548 kbytes
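
For anyone wanting to turn the raw figures above into relative changes, the arithmetic can be sketched in C as follows. The helper names `pct_reduction` and `print_reductions` are hypothetical; the inputs are the cycle counts and maximum resident set sizes copied from the perf stat and /usr/bin/time output above.

```c
#include <stdio.h>

/* Percentage by which `after` is smaller than `before`. */
static double pct_reduction(double before, double after)
{
    return 100.0 * (before - after) / before;
}

/* Prints the relative reductions for the three benchmarks that changed. */
static void print_reductions(void)
{
    printf("alternating IV/NV, cycles:  %.1f%%\n",
           pct_reduction(12670977088.0, 11289082929.0));
    printf("IV then NV then free, cycles: %.1f%%\n",
           pct_reduction(36490011259.0, 28437333228.0));
    printf("large array, cycles:        %.1f%%\n",
           pct_reduction(47687795920.0, 24397472622.0));
    printf("large array, max RSS:       %.1f%%\n",
           pct_reduction(7738900.0, 4871548.0));
}
```

On these numbers the patched build uses about 11%, 22% and 49% fewer cycles respectively, and about 37% less peak memory in the large array case.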

richardleach · 2024-12-12T22:17:20Z

> Note: I tried to create a fast path for PV->PV assignments, the most popular path after the fast NULL/IV/NV code. It was easy enough to make modifications to reduce the number of instructions and branches, but the number of actual cycles reported by perf either stayed identical or increased!

> With regard to the performance questions, I believe Intel VTune is intended for answering these kinds of questions and, unlike years ago, is part of the free oneAPI toolkit. I haven't had a need for it myself (yet), but it might be worth trying here if you have the time/inclination.

I'll add doing this to my list for 2025. :)

When the fast code at the start of Perl_sv_setsv_flags was modified to also support bodyless NVs, the simplest possible change was made. However, this meant that there was no fast handling when one SV was an IV and the other an NV. Having this seems desirable, since it avoids the need to allocate (and later release) an XPVIV or XPVNV body.

The fast code at the top of Perl_sv_setsv_flags now handles all cases where both SVs are < SVt_NV / SVt_IV, depending on the size of NVs. This means that the subsequent code paths involving those combinations are unreachable and can be removed to streamline the function. Note: Doing this actually made a difference with gcc 12.2.0, which didn't seem to figure out by itself that this was possible. Similarly, sprinkling some ASSUME() statements around didn't help.

Perl_sv_setsv_flags has a number of fail-safe checks which will croak if triggered. However, these code paths are *really* cold - they aren't even hit by the test harness. Since they are so cold and always result in an immediate croak, they can be pulled out into an unoptimized helper function. This leaves Perl_sv_setsv_flags smaller and therefore more cache friendly.
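
The general pattern of moving a never-expected failure path into a separate helper can be sketched in C as below. This is an illustration, not the helper added to sv.c: every `toy_*` name is hypothetical, `longjmp` merely stands in for whatever croak does, and the `cold`/`noreturn` attributes are GCC-style extensions that are skipped on other compilers.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

#if defined(__GNUC__)
#  define TOY_COLD_NORETURN __attribute__((cold, noreturn))
#else
#  define TOY_COLD_NORETURN
#endif

/* Stand-ins for the interpreter's exception machinery (hypothetical). */
static jmp_buf toy_croak_env;
static const char *toy_croak_msg = NULL;

/* Cold helper: only ever reached from fail-safe checks, never returns. */
static void toy_croak_bad_copy(const char *what) TOY_COLD_NORETURN;

static void toy_croak_bad_copy(const char *what)
{
    toy_croak_msg = what;
    longjmp(toy_croak_env, 1);
}

/* Hot function: the fail-safe check costs one branch and one call site,
 * while the error handling itself lives out of line in the helper. */
static int toy_copy_kind(int dst_kind, int src_kind)
{
    if (dst_kind < 0 || src_kind < 0)
        toy_croak_bad_copy("toy: invalid copy");
    return src_kind;
}

/* Returns 1 and stores the result on success, 0 if the helper "croaked". */
static int toy_try_copy(int dst_kind, int src_kind, int *out)
{
    if (setjmp(toy_croak_env) != 0)
        return 0;
    *out = toy_copy_kind(dst_kind, src_kind);
    return 1;
}
```

Calling `toy_try_copy(1, 2, &r)` takes the hot path and returns 1, while `toy_try_copy(1, -1, &r)` ends up in the cold helper and returns 0 with `toy_croak_msg` set.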

richardleach · 2024-12-12T22:29:34Z

Rebased. Please let me know if there are any further changes needed before it is good to merge.

tonycoz reviewed Nov 7, 2024

richardleach force-pushed the hydahy/sv_setsv_flags_unreach branch from 8458306 to 7cf872e Compare November 7, 2024 00:49

bulk88 reviewed Nov 8, 2024

github-actions bot added hasConflicts and removed hasConflicts labels Nov 19, 2024

github-actions bot added the hasConflicts label Nov 28, 2024

richardleach added 4 commits December 12, 2024 22:21

perldelta entry for Perl_sv_setsv_flags changes

c0d317f

richardleach force-pushed the hydahy/sv_setsv_flags_unreach branch 2 times, most recently from a867572 to c0d317f Compare December 12, 2024 22:28

jkeenan removed the hasConflicts label Dec 13, 2024

tonycoz approved these changes Dec 15, 2024

richardleach merged commit 0877c09 into Perl:blead Dec 16, 2024
33 checks passed

richardleach deleted the hydahy/sv_setsv_flags_unreach branch December 16, 2024 21:33

jkeenan mentioned this pull request Dec 19, 2024

BBC: Blead Breaks B::Utils #22866

Closed
