Perl_sv_setsv_flags - IV/NV & cold code optimisation #22725

Merged
merged 4 commits into from
Dec 16, 2024

richardleach
Contributor

Perl_sv_setsv_flags is one of the hottest functions in the interpreter,
at least when looking at the coverage from running the test harness.

This PR basically does two things:

  • The "fast number" code early in the function now handles one SV being
    an IV and the other an NV, if both are "bodyless" types. As a result,
    an IV will only be upgraded to an NV, not a PVNV, to store an NV value,
    and an NV will be downgraded to an IV, not upgraded to a PVIV, to store
    an IV value. This saves having to allocate/free actual bodies.

    Having done this, some code paths later in the function are rendered
    completely unreachable and have been excised.

    These changes should be transparent to most Perl users; only someone who
    is actually looking at SV types, presumably in test code, might notice.

  • The croak-ing code within extremely cold fail-safe code paths has been
    moved out into a helper function. (I'd have put this in cold.c if we
    had one.) This new function need not be optimised.

    This change should be entirely transparent.


  • This set of changes requires a perldelta entry, and it is included.

@richardleach
Contributor Author

Note: I tried to create a fast path for PV->PV assignments, the most popular path after the fast NULL/IV/NV code. It was easy enough to make modifications to reduce the number of instructions and branches, but the number of actual cycles reported by perf either stayed identical or increased!

This seemed to be due to much greater front-end and back-end stalls. That could be down to coincidental unfavourable alignment, so might be something to revisit.

Here's a section of the gcov output produced by running make test on a patched gcov build, with comments for context.

264440315: 4304:    case SVt_PV: /* Hit 67% of the time when running Perl's test suite */
264440315: 4305:        if (LIKELY(dtype == SVt_PV)) { /* ~70% of the time in this _case_ */
186543132: 4306:          fast_pv:
245826680: 4307:            sflags = SvFLAGS(ssv);
245826680: 4308:            if ( (sflags & (SVf_ROK|SVp_POK)) == SVp_POK) /* taken >90% of the time */
245201806: 4309:                goto pv_pok;
 77897183: 4310:        } else if (dtype < SVt_PV) { /* ~22% of the time in this _case_ */
 59283548: 4311:            sv_upgrade(dsv, SVt_PV);
 59283548: 4312:            goto fast_pv;
        -: 4313:        }
 19238509: 4314:        break; /* 7% of the time in this _case_ */

@tonycoz
Contributor

tonycoz commented Nov 7, 2024

> Note: I tried to create a fast path for PV->PV assignments, the most popular path after the fast NULL/IV/NV code. It was easy enough to make modifications to reduce the number of instructions and branches, but the number of actual cycles reported by perf either stayed identical or increased!

With regards to the performance questions, I believe Intel vTune is intended for answering these types of question, and unlike years ago is part of the free oneAPI toolkit. I haven't had a need for it myself (yet) but it might be worth trying here if you have the time/inclination.

@richardleach richardleach force-pushed the hydahy/sv_setsv_flags_unreach branch from 8458306 to 7cf872e Compare November 7, 2024 00:49
@tonycoz
Contributor

tonycoz commented Nov 12, 2024

One other thing here: since this is a performance optimization, benchmark results would be useful, along with the conditions they were measured under.

@richardleach
Contributor Author

Thanks for the reminder, @tonycoz. I measured a few different scenarios using perf stat against blead and with this PR applied ("patched"). Both builds had ivsize=8 and nvsize=8, so both IV and NV types used the "bodyless" mechanisms. There was minor noise between runs, as usual, so multiple runs were done to account for this.

No measurable change was expected or observed on:

  • my $x = 1; for (1..100_000_000){ $x = 1; } (IV into IV assignment)
  • my $x = 1.1; for (1..100_000_000){ $x = 1.1; } (NV into NV assignment)

There seemed to be a tiny improvement to the following, but nothing significant:

  • my $x = ""; for (1..100_000_000){ $x = "Just another Perl Hacker" } (PV into PV)

(The fast-NULL/IV/NV block at the top of the function, plus the PV->PV path, account
for ~70% of calls to Perl_sv_setsv_flags, according to gcov when running the test harness.)

Noticeable changes were expected and observed when blead upgraded to a type
with an allocated body, but the patched version did not.

  • my $x = 1.1; for (1..100_000_000){ $x = 1.1; $x = 1;} (alternating IV/NV assignments)

blead:

          2,897.24 msec task-clock                       #    0.993 CPUs utilized          
                 4      context-switches                 #    1.381 /sec                   
                 0      cpu-migrations                   #    0.000 /sec                   
               189      page-faults                      #   65.234 /sec                   
    12,670,977,088      cycles                           #    4.373 GHz                    
        43,543,251      stalled-cycles-frontend          #    0.34% frontend cycles idle   
           910,784      stalled-cycles-backend           #    0.01% backend cycles idle    
    42,807,329,051      instructions                     #    3.38  insn per cycle         
                                                  #    0.00  stalled cycles per insn
     9,001,709,241      branches                         #    3.107 G/sec        

patched:

          2,575.52 msec task-clock                       #    0.988 CPUs utilized          
                 5      context-switches                 #    1.941 /sec                   
                 0      cpu-migrations                   #    0.000 /sec                   
               186      page-faults                      #   72.218 /sec                   
    11,289,082,929      cycles                           #    4.383 GHz                    
        17,195,387      stalled-cycles-frontend          #    0.15% frontend cycles idle   
         1,761,258      stalled-cycles-backend           #    0.02% backend cycles idle    
    41,206,945,181      instructions                     #    3.65  insn per cycle         
                                                  #    0.00  stalled cycles per insn
     8,501,624,007      branches                         #    3.301 G/sec

  • for (1..100_000_000) { my @nums = (1); $nums[0] = 1.1 } (setting an IV, assigning an NV, freeing the SV)

blead:

          7,901.93 msec task-clock                       #    0.998 CPUs utilized          
                 5      context-switches                 #    0.633 /sec                   
                 0      cpu-migrations                   #    0.000 /sec                   
               189      page-faults                      #   23.918 /sec                   
    36,490,011,259      cycles                           #    4.618 GHz                    
       200,534,959      stalled-cycles-frontend          #    0.55% frontend cycles idle   
         2,269,549      stalled-cycles-backend           #    0.01% backend cycles idle    
   109,313,808,352      instructions                     #    3.00  insn per cycle         
                                                  #    0.00  stalled cycles per insn
    21,903,161,401      branches                         #    2.772 G/sec 

patched:

          6,445.95 msec task-clock                       #    0.993 CPUs utilized          
                43      context-switches                 #    6.671 /sec                   
                 0      cpu-migrations                   #    0.000 /sec                   
               187      page-faults                      #   29.010 /sec                   
    28,437,333,228      cycles                           #    4.412 GHz                    
       282,558,412      stalled-cycles-frontend          #    0.99% frontend cycles idle   
           100,532      stalled-cycles-backend           #    0.00% backend cycles idle    
    94,612,222,136      instructions                     #    3.33  insn per cycle         
                                                  #    0.00  stalled cycles per insn
    18,802,825,054      branches                         #    2.917 G/sec

  • my @nums = (1) x 100_000_000; for (1..100_000_000) { $nums[$_] = 1.1 } (NVs into a large array of IVs)

blead:

         12,682.24 msec task-clock                       #    0.900 CPUs utilized          
            19,272      context-switches                 #    1.520 K/sec                  
                 0      cpu-migrations                   #    0.000 /sec                   
         1,674,061      page-faults                      #  132.000 K/sec                  
    47,687,795,920      cycles                           #    3.760 GHz                    
       356,547,911      stalled-cycles-frontend          #    0.75% frontend cycles idle   
     1,659,808,928      stalled-cycles-backend           #    3.48% backend cycles idle    
    90,426,683,028      instructions                     #    1.90  insn per cycle         
                                                  #    0.02  stalled cycles per insn
    19,795,775,732      branches                         #    1.561 G/sec

patched:

          5,468.95 msec task-clock                       #    0.995 CPUs utilized          
                67      context-switches                 #   12.251 /sec                   
                 0      cpu-migrations                   #    0.000 /sec                   
           595,020      page-faults                      #  108.800 K/sec                  
    24,397,472,622      cycles                           #    4.461 GHz                    
        73,737,773      stalled-cycles-frontend          #    0.30% frontend cycles idle   
     1,388,156,279      stalled-cycles-backend           #    5.69% backend cycles idle    
    67,612,961,402      instructions                     #    2.77  insn per cycle         
                                                  #    0.02  stalled cycles per insn
    14,661,104,730      branches                         #    2.681 G/sec

/usr/bin/time -v was used to measure the Maximum resident set size (kbytes) for the above large array allocation runs:

blead: 7738900 kbytes
patched: 4871548 kbytes

@richardleach
Contributor Author

> > Note: I tried to create a fast path for PV->PV assignments, the most popular path after the fast NULL/IV/NV code. It was easy enough to make modifications to reduce the number of instructions and branches, but the number of actual cycles reported by perf either stayed identical or increased!

> With regards to the performance questions, I believe Intel vTune is intended for answering these types of question, and unlike years ago is part of the free oneAPI toolkit. I haven't had a need for it myself (yet) but it might be worth trying here if you have the time/inclination.

I'll add doing this to my list for 2025. :)

When the fast code at the start of Perl_sv_setsv_flags was modified to
also support bodyless NVs, the simplest possible change was made.
However, this meant that there was no fast handling when one SV was an
IV and the other an NV. Actually having this seems desirable since it
avoids the need to allocate (and later release) an XPVIV or XPVNV body.

The fast code at the top of Perl_sv_setsv_flags now handles all
cases where both SVs are < SVt_NV / SVt_IV, depending on the size of
NVs. This means that the subsequent code paths involving those
combinations are unreachable and can be removed to streamline the
function.

Note: Doing this actually made a difference with gcc 12.2.0, which
didn't seem to figure out that this was possible by itself. Similarly,
sprinkling some ASSUME() statements around didn't help.

Perl_sv_setsv_flags has a number of fail-safe checks which will croak
if triggered. However, these code paths are *really* cold - they aren't
even hit by the test harness. Since they are so cold and always result
in an immediate croak, they can be pulled out into an unoptimized helper
function. This leaves Perl_sv_setsv_flags smaller and therefore more
cache friendly.
@richardleach richardleach force-pushed the hydahy/sv_setsv_flags_unreach branch 2 times, most recently from a867572 to c0d317f Compare December 12, 2024 22:28
@richardleach
Contributor Author

Rebased. Please let me know if there are any further changes needed before it is good to merge.

@richardleach richardleach merged commit 0877c09 into Perl:blead Dec 16, 2024
33 checks passed
@richardleach richardleach deleted the hydahy/sv_setsv_flags_unreach branch December 16, 2024 21:33