Light up String.Manipulation APIs with Vector512 codepath #93043

Merged: 2 commits into dotnet:main on Dec 22, 2023

khushal1996
Contributor

@khushal1996 khushal1996 commented Oct 5, 2023

Optimizing the following String APIs

  1. String.Split --> Optimizing MakeSeparatorListVectorized
  2. String.Replace(char oldChar, char newChar) --> Optimizing a single iteration only. The perf measured for this API therefore reflects one optimized iteration, not all of them.

PERF on ICX


The tables below show the comparison output produced by ResultComparer in the performance repo.

Base = No changes
Diff = With the PR changes

1. Split


A Vector128 code path already exists for this API. We are adding a similar Vector256 and Vector512 code path.
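To make the idea of stacking wider code paths on top of an existing Vector128 one concrete, here is a minimal sketch. It is not the runtime's MakeSeparatorListVectorized: `SeparatorScanSketch` and `FindSeparators` are hypothetical names, and the two-separator limit is an assumption made to keep the example short. It only shows the general pattern of trying the widest accelerated vector first and letting narrower paths handle what is left.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical illustration; not the runtime's MakeSeparatorListVectorized.
static class SeparatorScanSketch
{
    public static List<int> FindSeparators(ReadOnlySpan<char> text, char sep0, char sep1)
    {
        var hits = new List<int>();
        ReadOnlySpan<ushort> data = MemoryMarshal.Cast<char, ushort>(text);
        int i = 0;

        // Widest path first: 32 chars per iteration.
        if (Vector512.IsHardwareAccelerated)
        {
            Vector512<ushort> v0 = Vector512.Create((ushort)sep0);
            Vector512<ushort> v1 = Vector512.Create((ushort)sep1);
            for (; i <= data.Length - Vector512<ushort>.Count; i += Vector512<ushort>.Count)
            {
                Vector512<ushort> chunk = Vector512.Create<ushort>(data.Slice(i, Vector512<ushort>.Count));
                // One Equals per separator, OR-ed into a single match mask.
                ulong mask = (Vector512.Equals(chunk, v0) | Vector512.Equals(chunk, v1)).ExtractMostSignificantBits();
                for (; mask != 0; mask &= mask - 1) // clear the lowest set bit each round
                    hits.Add(i + BitOperations.TrailingZeroCount(mask));
            }
        }

        // 16 chars per iteration; also picks up what the 512-bit loop left over.
        if (Vector256.IsHardwareAccelerated)
        {
            Vector256<ushort> v0 = Vector256.Create((ushort)sep0);
            Vector256<ushort> v1 = Vector256.Create((ushort)sep1);
            for (; i <= data.Length - Vector256<ushort>.Count; i += Vector256<ushort>.Count)
            {
                Vector256<ushort> chunk = Vector256.Create<ushort>(data.Slice(i, Vector256<ushort>.Count));
                uint mask = (Vector256.Equals(chunk, v0) | Vector256.Equals(chunk, v1)).ExtractMostSignificantBits();
                for (; mask != 0; mask &= mask - 1)
                    hits.Add(i + BitOperations.TrailingZeroCount(mask));
            }
        }

        // 8 chars per iteration.
        if (Vector128.IsHardwareAccelerated)
        {
            Vector128<ushort> v0 = Vector128.Create((ushort)sep0);
            Vector128<ushort> v1 = Vector128.Create((ushort)sep1);
            for (; i <= data.Length - Vector128<ushort>.Count; i += Vector128<ushort>.Count)
            {
                Vector128<ushort> chunk = Vector128.Create<ushort>(data.Slice(i, Vector128<ushort>.Count));
                uint mask = (Vector128.Equals(chunk, v0) | Vector128.Equals(chunk, v1)).ExtractMostSignificantBits();
                for (; mask != 0; mask &= mask - 1)
                    hits.Add(i + BitOperations.TrailingZeroCount(mask));
            }
        }

        // Scalar tail (and the only path on hardware without SIMD acceleration).
        for (; i < data.Length; i++)
        {
            if (data[i] == sep0 || data[i] == sep1)
                hits.Add(i);
        }

        return hits;
    }
}
```

Because the position `i` only moves forward, the indices come out in ascending order regardless of which paths run. Each additional separator costs one more Equals per chunk, which is where the cost of multiple Vector512.Equals() calls would show up.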

base = Diff Vector256 code path vs diff = Diff Vector512 code path

| Slower | diff/base | Base Median (ns) | Diff Median (ns) | Modality |
| --- | --- | --- | --- | --- |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.09 | 25083.88 | 27315.47 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.07 | 216.23 | 231.68 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.06 | 527.11 | 561.25 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.06 | 21021.31 | 22223.51 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.04 | 292.20 | 304.67 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.04 | 5308.70 | 5499.70 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.02 | 663.91 | 678.31 | |

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality |
| --- | --- | --- | --- | --- |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.43 | 100.27 | 69.98 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.37 | 99.78 | 72.70 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.06 | 47539.44 | 44884.05 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.03 | 2787.30 | 2701.91 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.03 | 38497.32 | 37374.38 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.02 | 1089.85 | 1073.03 | |
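
For readers unfamiliar with the table layout, the ratio column can be reproduced from the two medians. The sketch below is only the arithmetic implied by the column headers (diff/base for "Slower" rows, base/diff for "Faster" rows); `RatioSketch` is a hypothetical name and this is not ResultComparer's own code.

```csharp
using System;

// Hypothetical helper showing the arithmetic behind the ratio column.
static class RatioSketch
{
    // "Slower" rows report diff/base and "Faster" rows report base/diff,
    // so the reported ratio is the larger median divided by the smaller one.
    public static double Ratio(double baseMedianNs, double diffMedianNs)
    {
        double larger = Math.Max(baseMedianNs, diffMedianNs);
        double smaller = Math.Min(baseMedianNs, diffMedianNs);
        return Math.Round(larger / smaller, 2);
    }

    // True when the diff median is higher than the base median.
    public static bool IsSlower(double baseMedianNs, double diffMedianNs)
        => diffMedianNs > baseMedianNs;
}
```

For the first "Slower" row, `RatioSketch.Ratio(25083.88, 27315.47)` gives 1.09, and for the first "Faster" row, `RatioSketch.Ratio(100.27, 69.98)` gives 1.43, matching the table.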

base = Base Vector128 code path vs diff = Diff Vector256 code path

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality |
| --- | --- | --- | --- | --- |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.77 | 176.76 | 99.78 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.76 | 176.33 | 100.27 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.47 | 56625.27 | 38497.32 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.43 | 67789.59 | 47539.44 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.27 | 1151.90 | 908.21 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.20 | 6348.06 | 5308.70 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.19 | 5407.90 | 4549.70 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.19 | 1293.97 | 1089.85 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.18 | 2753.15 | 2337.50 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.16 | 24393.60 | 21021.31 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.16 | 609.65 | 527.11 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.16 | 28984.70 | 25083.88 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.15 | 3213.50 | 2787.30 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.11 | 239.76 | 216.23 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.10 | 320.35 | 292.20 | |
| System.Tests.Perf_String.Split(s: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab | 1.09 | 721.48 | 663.91 | |

This is one of the cases where Avx512 is not that performant, because of the issue with using multiple Vector512.Equals() calls. I ran a couple of iterations timed with Stopwatch, and the results are below.

[Image: Stopwatch timings per iteration for the Vector256 and Vector512 code paths]

As you can see, for each iteration, Vector512 is almost the same as Vector256. Let me know if there are any suggestions for further optimizing the Vector512 code path. We have to decide whether this can be merged, since a Vector128 code path already exists for both APIs. That said, the Vector256 and Vector512 code paths provide a significant speedup over the Vector128 code path.

2. Replace_Char


A Vector128 code path already exists for this API. We are just adding a single iteration of Vector512 or Vector256.
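
The exact shape of the change is not shown in this conversation, so the following is only a sketch of what one wide replace step could look like. `ReplaceCharSketch` is a hypothetical name, and the scalar loop for the remainder is an assumption made for brevity (the real API already has a Vector128 path for that). The technique shown is the usual compare-and-select: build a mask with Equals, then pick newChar where the mask is set.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical illustration; not the runtime's String.Replace implementation.
static class ReplaceCharSketch
{
    public static string Replace(string text, char oldChar, char newChar)
    {
        char[] buffer = text.ToCharArray();
        Span<ushort> data = MemoryMarshal.Cast<char, ushort>(buffer.AsSpan());
        int i = 0;

        if (Vector512.IsHardwareAccelerated && data.Length >= Vector512<ushort>.Count)
        {
            // A single 512-bit step: 32 chars compared and rewritten at once.
            Vector512<ushort> oldV = Vector512.Create((ushort)oldChar);
            Vector512<ushort> newV = Vector512.Create((ushort)newChar);
            Vector512<ushort> chunk = Vector512.Create<ushort>(data.Slice(0, Vector512<ushort>.Count));
            Vector512<ushort> mask = Vector512.Equals(chunk, oldV);
            // Take newV where the mask is set, keep the original char elsewhere.
            Vector512.ConditionalSelect(mask, newV, chunk).CopyTo(data);
            i = Vector512<ushort>.Count;
        }
        else if (Vector256.IsHardwareAccelerated && data.Length >= Vector256<ushort>.Count)
        {
            // Same step at 256 bits: 16 chars at once.
            Vector256<ushort> oldV = Vector256.Create((ushort)oldChar);
            Vector256<ushort> newV = Vector256.Create((ushort)newChar);
            Vector256<ushort> chunk = Vector256.Create<ushort>(data.Slice(0, Vector256<ushort>.Count));
            Vector256<ushort> mask = Vector256.Equals(chunk, oldV);
            Vector256.ConditionalSelect(mask, newV, chunk).CopyTo(data);
            i = Vector256<ushort>.Count;
        }

        // Assumption for brevity: everything after the single wide step is scalar.
        for (; i < data.Length; i++)
        {
            if (data[i] == oldChar)
                data[i] = newChar;
        }

        return new string(buffer);
    }
}
```

Since only one wide step runs, any gain is a fixed amount per call and its relative effect depends on the input length, so benchmark deltas for such a change would be expected to vary from case to case.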

base = Diff Vector256 code path vs diff = Diff Vector512 code path

| Slower | diff/base | Base Median (ns) | Diff Median (ns) | Modality |
| --- | --- | --- | --- | --- |
| System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgbl | 1.12 | 24.74 | 27.63 | |
| System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgbl | 1.06 | 4003.77 | 4246.22 | |
| System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldC | 1.03 | 18.05 | 18.56 | |

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality |
| --- | --- | --- | --- | --- |
| System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgbl | 1.25 | 173.57 | 139.21 | |
| System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgbl | 1.15 | 1486.21 | 1292.79 | |
| System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgbl | 1.15 | 722.97 | 630.87 | |
| System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldC | 1.09 | 3.90 | 3.57 | |
| System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgbl | 1.09 | 2216.98 | 2029.28 | |

base = Base Vector128 code path vs diff = Diff Vector256 code path

| Slower | diff/base | Base Median (ns) | Diff Median (ns) | Modality |
| --- | --- | --- | --- | --- |
| System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgbl | 1.12 | 24.74 | 27.63 | |
| System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgbl | 1.06 | 4003.77 | 4246.22 | |
| System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldC | 1.03 | 18.05 | 18.56 | |

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality |
| --- | --- | --- | --- | --- |
| System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgbl | 1.25 | 173.57 | 139.21 | |
| System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgbl | 1.15 | 1486.21 | 1292.79 | |
| System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgbl | 1.15 | 722.97 | 630.87 | |
| System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldC | 1.09 | 3.90 | 3.57 | |
| System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgbl | 1.09 | 2216.98 | 2029.28 | |

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Oct 5, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 5, 2023
@khushal1996 khushal1996 marked this pull request as ready for review October 9, 2023 18:41
@khushal1996
Contributor Author

@tannergooding Just sending out a reminder to review this PR.

@adamsitnik adamsitnik added area-System.Runtime and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Nov 3, 2023
@ghost

ghost commented Nov 3, 2023

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.


@adamsitnik adamsitnik added the tenet-performance Performance related issue label Nov 3, 2023
@khushal1996
Contributor Author

@tannergooding just sending out a reminder for this pending review.

@tannergooding
Member

CC. @stephentoub, @GrabYourPitchforks, @adamsitnik

Could one of you give this a secondary review?

@khushal1996
Contributor Author

> CC. @stephentoub, @GrabYourPitchforks, @adamsitnik
>
> Could one of you give this a secondary review?

@stephentoub @GrabYourPitchforks @adamsitnik sending a reminder for review. Could you please review this PR?

Member

@stephentoub stephentoub left a comment

Thanks

@khushal1996
Contributor Author

@tannergooding @kunalspathak can you please help with merging this PR? It has been approved for quite some time now.

Member

@kunalspathak kunalspathak left a comment

LGTM

@kunalspathak
Member

CI seems red, so I will kick off another round to make sure none of the failures are related to this PR.

/azp run runtime

@kunalspathak kunalspathak merged commit 14127ea into dotnet:main Dec 22, 2023
170 of 180 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jan 22, 2024