Merging slices from the labels APIs using more than 1 cpu and k-way #5785

Merged
4 commits merged into cortexproject:master from opmize-slices-merge on Feb 26, 2024

Conversation

alanprot (Member) commented Feb 22, 2024

What this PR does:
This PR changes the strategy we use to merge the sorted slices containing the label values/keys in the GetLabels and GetLabelValues APIs.

Previously, we collected the response from each ingester, added the results to a map, and sorted at the end to deduplicate and sort the response.

Now we use a loser tree for the merge (the same approach used in prometheus/prometheus#12878) and also use more than one core to merge those slices; a simplified sketch of the idea follows.

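The actual change uses a loser tree for the k-way merge; the sketch below is only a simplified stand-in that merges pairwise instead, to illustrate the parallel split-and-merge shape. mergeTwo and mergeSlicesParallel are illustrative names (not the PR's code), and min is the Go 1.21+ built-in:

package main

import (
	"fmt"
	"sync"
)

// mergeTwo merges two sorted, deduplicated string slices into one
// sorted, deduplicated slice.
func mergeTwo(a, b []string) []string {
	out := make([]string, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			out = append(out, a[i])
			i++
		case a[i] > b[j]:
			out = append(out, b[j])
			j++
		default: // equal: keep a single copy
			out = append(out, a[i])
			i++
			j++
		}
	}
	out = append(out, a[i:]...)
	out = append(out, b[j:]...)
	return out
}

// mergeSlicesParallel merges the sorted inputs using up to `parallelism`
// goroutines: each goroutine merges its share of the inputs, and the
// partial results are merged sequentially at the end.
func mergeSlicesParallel(inputs [][]string, parallelism int) []string {
	p := min(parallelism, len(inputs)/2)
	if p < 2 {
		// Small input: merge everything on the calling goroutine.
		var out []string
		for _, s := range inputs {
			out = mergeTwo(out, s)
		}
		return out
	}

	partials := make([][]string, p)
	var wg sync.WaitGroup
	for w := 0; w < p; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			var out []string
			// Each worker merges every p-th input slice.
			for i := w; i < len(inputs); i += p {
				out = mergeTwo(out, inputs[i])
			}
			partials[w] = out
		}(w)
	}
	wg.Wait()

	// Final single-threaded merge of the per-worker results.
	var out []string
	for _, s := range partials {
		out = mergeTwo(out, s)
	}
	return out
}

func main() {
	inputs := [][]string{
		{"a", "c", "e"},
		{"b", "c", "f"},
		{"a", "d", "f"},
		{"c", "e", "g"},
	}
	fmt.Println(mergeSlicesParallel(inputs, 8)) // [a b c d e f g]
}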
Benchmark of the GetLabelsValues API with different duplication factors (e.g. a duplication ratio of 0.67 means the resulting slice contains 33% of the total number of strings across the input slices):

goos: linux
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/distributor
cpu: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
                                                                                                    │   /tmp/old   │              /tmp/new               │
                                                                                                    │    sec/op    │    sec/op     vs base               │
Distributor_GetLabelsValues/numIngesters16,lblValuesPerIngester1000,lblValuesDuplicateRatio0.67-32    1.985m ± ∞ ¹   1.204m ± ∞ ¹  -39.35% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters16,lblValuesPerIngester1000,lblValuesDuplicateRatio0.98-32    825.2µ ± ∞ ¹   572.9µ ± ∞ ¹  -30.57% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters150,lblValuesPerIngester1000,lblValuesDuplicateRatio0.67-32   23.33m ± ∞ ¹   13.31m ± ∞ ¹  -42.96% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters150,lblValuesPerIngester1000,lblValuesDuplicateRatio0.98-32   6.381m ± ∞ ¹   3.264m ± ∞ ¹  -48.84% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters500,lblValuesPerIngester1000,lblValuesDuplicateRatio0.67-32   93.35m ± ∞ ¹   50.86m ± ∞ ¹  -45.52% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters500,lblValuesPerIngester1000,lblValuesDuplicateRatio0.98-32   23.02m ± ∞ ¹   12.47m ± ∞ ¹  -45.82% (p=0.008 n=5)
geomean                                                                                               8.979m         5.166m        -42.47%
¹ need >= 6 samples for confidence interval at level 0.95

                                                                                                    │    /tmp/old    │                /tmp/new                │
                                                                                                    │      B/op      │      B/op       vs base                │
Distributor_GetLabelsValues/numIngesters16,lblValuesPerIngester1000,lblValuesDuplicateRatio0.67-32     439.2Ki ± ∞ ¹   1024.4Ki ± ∞ ¹  +133.26% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters16,lblValuesPerIngester1000,lblValuesDuplicateRatio0.98-32     111.0Ki ± ∞ ¹    429.8Ki ± ∞ ¹  +287.09% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters150,lblValuesPerIngester1000,lblValuesDuplicateRatio0.67-32    3.484Mi ± ∞ ¹   11.340Mi ± ∞ ¹  +225.53% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters150,lblValuesPerIngester1000,lblValuesDuplicateRatio0.98-32    446.0Ki ± ∞ ¹    949.8Ki ± ∞ ¹  +112.96% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters500,lblValuesPerIngester1000,lblValuesDuplicateRatio0.67-32    12.95Mi ± ∞ ¹    35.34Mi ± ∞ ¹  +172.95% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters500,lblValuesPerIngester1000,lblValuesDuplicateRatio0.98-32    995.0Ki ± ∞ ¹   2222.2Ki ± ∞ ¹  +123.33% (p=0.008 n=5)
geomean                                                                                               1003.9Ki          2.640Mi        +169.31%
¹ need >= 6 samples for confidence interval at level 0.95

                                                                                                    │   /tmp/old   │              /tmp/new               │
                                                                                                    │  allocs/op   │  allocs/op    vs base               │
Distributor_GetLabelsValues/numIngesters16,lblValuesPerIngester1000,lblValuesDuplicateRatio0.67-32     221.0 ± ∞ ¹    199.0 ± ∞ ¹   -9.95% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters16,lblValuesPerIngester1000,lblValuesDuplicateRatio0.98-32     124.0 ± ∞ ¹    188.0 ± ∞ ¹  +51.61% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters150,lblValuesPerIngester1000,lblValuesDuplicateRatio0.67-32   2514.0 ± ∞ ¹    907.0 ± ∞ ¹  -63.92% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters150,lblValuesPerIngester1000,lblValuesDuplicateRatio0.98-32    739.0 ± ∞ ¹    856.0 ± ∞ ¹  +15.83% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters500,lblValuesPerIngester1000,lblValuesDuplicateRatio0.67-32   6.985k ± ∞ ¹   2.662k ± ∞ ¹  -61.89% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters500,lblValuesPerIngester1000,lblValuesDuplicateRatio0.98-32   2.266k ± ∞ ¹   2.606k ± ∞ ¹  +15.00% (p=0.008 n=5)
geomean                                                                                                964.7          765.7        -20.63%

Benchmark with the merge implementation in isolation:

goos: linux
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/util
cpu: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
BenchmarkMergeSlicesParallel/usingMap,inputSize:100,stringsPerInput:100,duplicateRatio:0.3-32         	     484	   2308154 ns/op	  791212 B/op	     217 allocs/op
BenchmarkMergeSlicesParallel/parallelism:1,inputSize:100,stringsPerInput:100,duplicateRatio:0.3-32    	     850	   1313718 ns/op	  372744 B/op	     109 allocs/op
BenchmarkMergeSlicesParallel/parallelism:8,inputSize:100,stringsPerInput:100,duplicateRatio:0.3-32    	    1147	    973408 ns/op	  868365 B/op	     209 allocs/op
BenchmarkMergeSlicesParallel/usingMap,inputSize:100,stringsPerInput:100,duplicateRatio:0.8-32         	    1327	    830787 ns/op	  212041 B/op	      61 allocs/op
BenchmarkMergeSlicesParallel/parallelism:1,inputSize:100,stringsPerInput:100,duplicateRatio:0.8-32    	    1024	   1130711 ns/op	   94216 B/op	     106 allocs/op
BenchmarkMergeSlicesParallel/parallelism:8,inputSize:100,stringsPerInput:100,duplicateRatio:0.8-32    	    1842	    620375 ns/op	  383621 B/op	     200 allocs/op
BenchmarkMergeSlicesParallel/usingMap,inputSize:100,stringsPerInput:100,duplicateRatio:0.95-32        	    2479	    450711 ns/op	   52476 B/op	      21 allocs/op
BenchmarkMergeSlicesParallel/parallelism:1,inputSize:100,stringsPerInput:100,duplicateRatio:0.95-32   	    1136	   1020612 ns/op	   45064 B/op	     105 allocs/op
BenchmarkMergeSlicesParallel/parallelism:8,inputSize:100,stringsPerInput:100,duplicateRatio:0.95-32   	    3314	    341552 ns/op	  134249 B/op	     186 allocs/op
BenchmarkMergeSlicesParallel/usingMap,inputSize:150,stringsPerInput:300,duplicateRatio:0.3-32         	      86	  12017615 ns/op	 3187257 B/op	    1043 allocs/op
BenchmarkMergeSlicesParallel/parallelism:1,inputSize:150,stringsPerInput:300,duplicateRatio:0.3-32    	     171	   6776256 ns/op	 2434920 B/op	     161 allocs/op
BenchmarkMergeSlicesParallel/parallelism:8,inputSize:150,stringsPerInput:300,duplicateRatio:0.3-32    	     238	   4916036 ns/op	 4608115 B/op	     273 allocs/op
BenchmarkMergeSlicesParallel/usingMap,inputSize:150,stringsPerInput:300,duplicateRatio:0.8-32         	     283	   4055139 ns/op	  831591 B/op	     213 allocs/op
BenchmarkMergeSlicesParallel/parallelism:1,inputSize:150,stringsPerInput:300,duplicateRatio:0.8-32    	     216	   5447690 ns/op	  354152 B/op	     156 allocs/op
BenchmarkMergeSlicesParallel/parallelism:8,inputSize:150,stringsPerInput:300,duplicateRatio:0.8-32    	     418	   2764875 ns/op	 1953886 B/op	     258 allocs/op
BenchmarkMergeSlicesParallel/usingMap,inputSize:150,stringsPerInput:300,duplicateRatio:0.95-32        	     543	   2135988 ns/op	  212043 B/op	      61 allocs/op
BenchmarkMergeSlicesParallel/parallelism:1,inputSize:150,stringsPerInput:300,duplicateRatio:0.95-32   	     241	   4911898 ns/op	  165736 B/op	     155 allocs/op
BenchmarkMergeSlicesParallel/parallelism:8,inputSize:150,stringsPerInput:300,duplicateRatio:0.95-32   	     835	   1355215 ns/op	  572329 B/op	     239 allocs/op

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

…r tree

Signed-off-by: Alan Protasio <alanprot@gmail.com>
@alanprot alanprot force-pushed the opmize-slices-merge branch from d0143fc to 1342145 on February 22, 2024 22:42
@alanprot alanprot marked this pull request as ready for review February 22, 2024 22:43
Signed-off-by: Alan Protasio <alanprot@gmail.com>
@alanprot alanprot requested a review from yeya24 February 22, 2024 23:08
pkg/util/strings.go (outdated review thread)

// mergeSlicesParallelism is the number of goroutines we should use to merge slices;
// the value was chosen based on empirical observation: see BenchmarkMergeSlicesParallel
mergeSlicesParallelism = 8
yeya24 (Contributor) commented:
I think I am fine with it. Just one question: what if we make the batch size a constant and derive the parallelism from the input size? Then for small inputs we can still use 1 core, and we use more cores for larger batch sizes.
We can still cap the concurrency at 8.

alanprot (Member, Author) commented Feb 24, 2024:
This is kind of done inside the function:

p := min(parallelism, len(a)/2)

So we will only use parallelism if the input is > 4 (and we will use only 2 cores in this case).

Increasing parallelism beyond 8, I think, does not make much difference, as the final merge will be using one core anyway; at the end of the day we would just be increasing the number of slices being merged at the end of the function.

Does that make sense?
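
For concreteness, a small sketch of how that formula caps the worker count (effectiveParallelism is a hypothetical name, not the PR's function; min is the Go 1.21+ built-in):

// effectiveParallelism derives the worker count from the number of
// input slices, capped by mergeSlicesParallelism.
func effectiveParallelism(numInputs, maxParallelism int) int {
	return min(maxParallelism, numInputs/2)
}

// effectiveParallelism(3, 8)  == 1  -> merge on a single goroutine
// effectiveParallelism(4, 8)  == 2
// effectiveParallelism(16, 8) == 8  -> capped by the constant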

yeya24 (Contributor) commented Feb 24, 2024:
I see, thanks for the explanation. The input length is basically the number of ingesters for the user, so if the user has at least 16 ingesters we will use a parallelism of 8?

16 ingesters sounds like a small number to me, and I am worried about having very small batch sizes per goroutine. What about using a larger x below, like 16 or even 32? Is that better than 2, or do you think it doesn't make much difference?

p := min(parallelism, len(a)/x)
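
A rough sketch of that variant, assuming a fixed batch size per goroutine (slicesPerGoroutine and parallelismFor are hypothetical names, not code from the PR; min/max are the Go 1.21+ built-ins):

const slicesPerGoroutine = 16 // hypothetical fixed batch size (the "x" above)

// parallelismFor gives each goroutine roughly slicesPerGoroutine inputs,
// capped at maxParallelism and never below 1.
func parallelismFor(numInputs, maxParallelism int) int {
	return min(maxParallelism, max(1, numInputs/slicesPerGoroutine))
}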

alanprot (Member, Author) commented:

I think in this case both solutions will be pretty fast, but let me create a benchmark.

alanprot (Member, Author) commented:

Description updated with 16 ingesters and a test case added.

WDYT?

Signed-off-by: Alan Protasio <alanprot@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
yeya24 (Contributor) left a comment:
Thanks

@yeya24 yeya24 merged commit 14d9b7b into cortexproject:master Feb 26, 2024
16 checks passed