Merging slices from the labels APIs using more than 1 cpu and k-way #5785

Merged
4 commits merged into cortexproject:master from opmize-slices-merge on Feb 26, 2024

Conversation

alanprot (Member) commented Feb 22, 2024

What this PR does:
This PR changes the strategy we use to merge the sorted slices containing the label values/keys in the GetLabels and GetLabelValues APIs.

Previously, we collected the response from each ingester, added the results to a map, and sorted at the end to deduplicate and sort the response.

Now we use a loser tree for the merge (the same approach used in prometheus/prometheus#12878) and also use more than one core to merge those slices; a simplified sketch of the idea follows.

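The actual change uses a loser tree for the k-way merge; the sketch below is only a simplified stand-in that merges pairwise instead, to illustrate the parallel split-and-merge shape. mergeTwo and mergeSlicesParallel are illustrative names (not the PR's code), and min is the Go 1.21+ built-in:

package main

import (
	"fmt"
	"sync"
)

// mergeTwo merges two sorted, deduplicated string slices into one
// sorted, deduplicated slice.
func mergeTwo(a, b []string) []string {
	out := make([]string, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			out = append(out, a[i])
			i++
		case a[i] > b[j]:
			out = append(out, b[j])
			j++
		default: // equal: keep a single copy
			out = append(out, a[i])
			i++
			j++
		}
	}
	out = append(out, a[i:]...)
	out = append(out, b[j:]...)
	return out
}

// mergeSlicesParallel merges the sorted inputs using up to `parallelism`
// goroutines: each goroutine merges its share of the inputs, and the
// partial results are merged sequentially at the end.
func mergeSlicesParallel(inputs [][]string, parallelism int) []string {
	p := min(parallelism, len(inputs)/2)
	if p < 2 {
		// Small input: merge everything on the calling goroutine.
		var out []string
		for _, s := range inputs {
			out = mergeTwo(out, s)
		}
		return out
	}

	partials := make([][]string, p)
	var wg sync.WaitGroup
	for w := 0; w < p; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			var out []string
			// Each worker merges every p-th input slice.
			for i := w; i < len(inputs); i += p {
				out = mergeTwo(out, inputs[i])
			}
			partials[w] = out
		}(w)
	}
	wg.Wait()

	// Final single-threaded merge of the per-worker results.
	var out []string
	for _, s := range partials {
		out = mergeTwo(out, s)
	}
	return out
}

func main() {
	inputs := [][]string{
		{"a", "c", "e"},
		{"b", "c", "f"},
		{"a", "d", "f"},
		{"c", "e", "g"},
	}
	fmt.Println(mergeSlicesParallel(inputs, 8)) // [a b c d e f g]
}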
Benchmark of the GetLabelsValues API with different duplication factors (e.g. a duplication ratio of 0.67 means the resulting slice contains 33% of the total number of strings across the input slices):

goos: linux
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/distributor
cpu: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
                                                                                                    │   /tmp/old   │              /tmp/new               │
                                                                                                    │    sec/op    │    sec/op     vs base               │
Distributor_GetLabelsValues/numIngesters16,lblValuesPerIngester1000,lblValuesDuplicateRatio0.67-32    1.985m ± ∞ ¹   1.204m ± ∞ ¹  -39.35% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters16,lblValuesPerIngester1000,lblValuesDuplicateRatio0.98-32    825.2µ ± ∞ ¹   572.9µ ± ∞ ¹  -30.57% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters150,lblValuesPerIngester1000,lblValuesDuplicateRatio0.67-32   23.33m ± ∞ ¹   13.31m ± ∞ ¹  -42.96% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters150,lblValuesPerIngester1000,lblValuesDuplicateRatio0.98-32   6.381m ± ∞ ¹   3.264m ± ∞ ¹  -48.84% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters500,lblValuesPerIngester1000,lblValuesDuplicateRatio0.67-32   93.35m ± ∞ ¹   50.86m ± ∞ ¹  -45.52% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters500,lblValuesPerIngester1000,lblValuesDuplicateRatio0.98-32   23.02m ± ∞ ¹   12.47m ± ∞ ¹  -45.82% (p=0.008 n=5)
geomean                                                                                               8.979m         5.166m        -42.47%
¹ need >= 6 samples for confidence interval at level 0.95

                                                                                                    │    /tmp/old    │                /tmp/new                │
                                                                                                    │      B/op      │      B/op       vs base                │
Distributor_GetLabelsValues/numIngesters16,lblValuesPerIngester1000,lblValuesDuplicateRatio0.67-32     439.2Ki ± ∞ ¹   1024.4Ki ± ∞ ¹  +133.26% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters16,lblValuesPerIngester1000,lblValuesDuplicateRatio0.98-32     111.0Ki ± ∞ ¹    429.8Ki ± ∞ ¹  +287.09% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters150,lblValuesPerIngester1000,lblValuesDuplicateRatio0.67-32    3.484Mi ± ∞ ¹   11.340Mi ± ∞ ¹  +225.53% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters150,lblValuesPerIngester1000,lblValuesDuplicateRatio0.98-32    446.0Ki ± ∞ ¹    949.8Ki ± ∞ ¹  +112.96% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters500,lblValuesPerIngester1000,lblValuesDuplicateRatio0.67-32    12.95Mi ± ∞ ¹    35.34Mi ± ∞ ¹  +172.95% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters500,lblValuesPerIngester1000,lblValuesDuplicateRatio0.98-32    995.0Ki ± ∞ ¹   2222.2Ki ± ∞ ¹  +123.33% (p=0.008 n=5)
geomean                                                                                               1003.9Ki          2.640Mi        +169.31%
¹ need >= 6 samples for confidence interval at level 0.95

                                                                                                    │   /tmp/old   │              /tmp/new               │
                                                                                                    │  allocs/op   │  allocs/op    vs base               │
Distributor_GetLabelsValues/numIngesters16,lblValuesPerIngester1000,lblValuesDuplicateRatio0.67-32     221.0 ± ∞ ¹    199.0 ± ∞ ¹   -9.95% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters16,lblValuesPerIngester1000,lblValuesDuplicateRatio0.98-32     124.0 ± ∞ ¹    188.0 ± ∞ ¹  +51.61% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters150,lblValuesPerIngester1000,lblValuesDuplicateRatio0.67-32   2514.0 ± ∞ ¹    907.0 ± ∞ ¹  -63.92% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters150,lblValuesPerIngester1000,lblValuesDuplicateRatio0.98-32    739.0 ± ∞ ¹    856.0 ± ∞ ¹  +15.83% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters500,lblValuesPerIngester1000,lblValuesDuplicateRatio0.67-32   6.985k ± ∞ ¹   2.662k ± ∞ ¹  -61.89% (p=0.008 n=5)
Distributor_GetLabelsValues/numIngesters500,lblValuesPerIngester1000,lblValuesDuplicateRatio0.98-32   2.266k ± ∞ ¹   2.606k ± ∞ ¹  +15.00% (p=0.008 n=5)
geomean                                                                                                964.7          765.7        -20.63%

Benchmark with the merge implementation in isolation:

goos: linux
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/util
cpu: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
BenchmarkMergeSlicesParallel/usingMap,inputSize:100,stringsPerInput:100,duplicateRatio:0.3-32         	     484	   2308154 ns/op	  791212 B/op	     217 allocs/op
BenchmarkMergeSlicesParallel/parallelism:1,inputSize:100,stringsPerInput:100,duplicateRatio:0.3-32    	     850	   1313718 ns/op	  372744 B/op	     109 allocs/op
BenchmarkMergeSlicesParallel/parallelism:8,inputSize:100,stringsPerInput:100,duplicateRatio:0.3-32    	    1147	    973408 ns/op	  868365 B/op	     209 allocs/op
BenchmarkMergeSlicesParallel/usingMap,inputSize:100,stringsPerInput:100,duplicateRatio:0.8-32         	    1327	    830787 ns/op	  212041 B/op	      61 allocs/op
BenchmarkMergeSlicesParallel/parallelism:1,inputSize:100,stringsPerInput:100,duplicateRatio:0.8-32    	    1024	   1130711 ns/op	   94216 B/op	     106 allocs/op
BenchmarkMergeSlicesParallel/parallelism:8,inputSize:100,stringsPerInput:100,duplicateRatio:0.8-32    	    1842	    620375 ns/op	  383621 B/op	     200 allocs/op
BenchmarkMergeSlicesParallel/usingMap,inputSize:100,stringsPerInput:100,duplicateRatio:0.95-32        	    2479	    450711 ns/op	   52476 B/op	      21 allocs/op
BenchmarkMergeSlicesParallel/parallelism:1,inputSize:100,stringsPerInput:100,duplicateRatio:0.95-32   	    1136	   1020612 ns/op	   45064 B/op	     105 allocs/op
BenchmarkMergeSlicesParallel/parallelism:8,inputSize:100,stringsPerInput:100,duplicateRatio:0.95-32   	    3314	    341552 ns/op	  134249 B/op	     186 allocs/op
BenchmarkMergeSlicesParallel/usingMap,inputSize:150,stringsPerInput:300,duplicateRatio:0.3-32         	      86	  12017615 ns/op	 3187257 B/op	    1043 allocs/op
BenchmarkMergeSlicesParallel/parallelism:1,inputSize:150,stringsPerInput:300,duplicateRatio:0.3-32    	     171	   6776256 ns/op	 2434920 B/op	     161 allocs/op
BenchmarkMergeSlicesParallel/parallelism:8,inputSize:150,stringsPerInput:300,duplicateRatio:0.3-32    	     238	   4916036 ns/op	 4608115 B/op	     273 allocs/op
BenchmarkMergeSlicesParallel/usingMap,inputSize:150,stringsPerInput:300,duplicateRatio:0.8-32         	     283	   4055139 ns/op	  831591 B/op	     213 allocs/op
BenchmarkMergeSlicesParallel/parallelism:1,inputSize:150,stringsPerInput:300,duplicateRatio:0.8-32    	     216	   5447690 ns/op	  354152 B/op	     156 allocs/op
BenchmarkMergeSlicesParallel/parallelism:8,inputSize:150,stringsPerInput:300,duplicateRatio:0.8-32    	     418	   2764875 ns/op	 1953886 B/op	     258 allocs/op
BenchmarkMergeSlicesParallel/usingMap,inputSize:150,stringsPerInput:300,duplicateRatio:0.95-32        	     543	   2135988 ns/op	  212043 B/op	      61 allocs/op
BenchmarkMergeSlicesParallel/parallelism:1,inputSize:150,stringsPerInput:300,duplicateRatio:0.95-32   	     241	   4911898 ns/op	  165736 B/op	     155 allocs/op
BenchmarkMergeSlicesParallel/parallelism:8,inputSize:150,stringsPerInput:300,duplicateRatio:0.95-32   	     835	   1355215 ns/op	  572329 B/op	     239 allocs/op

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

…r tree

Signed-off-by: Alan Protasio <alanprot@gmail.com>
@alanprot alanprot force-pushed the opmize-slices-merge branch from d0143fc to 1342145 on February 22, 2024 22:42
@alanprot alanprot marked this pull request as ready for review February 22, 2024 22:43
Signed-off-by: Alan Protasio <alanprot@gmail.com>
@alanprot alanprot requested a review from yeya24 February 22, 2024 23:08
pkg/util/strings.go (outdated review thread)

// mergeSlicesParallelism is the number of goroutines we should use to merge slices;
// the value was chosen based on empirical observation: see BenchmarkMergeSlicesParallel
mergeSlicesParallelism = 8
yeya24 (Contributor) commented:
I think I am fine with it. Just one question: what if we make the batch size a constant and derive the parallelism from the input size? Then for small inputs we can still use 1 core, and we use more cores for larger batch sizes.
We can still cap the concurrency at 8.

alanprot (Member, Author) commented Feb 24, 2024:
This is kind of done inside the function:

p := min(parallelism, len(a)/2)

So we will only use parallelism if the input is > 4 (and we will use only 2 cores in this case).

Increasing parallelism beyond 8, I think, does not make much difference, as the final merge will be using one core anyway; at the end of the day we would just be increasing the number of slices being merged at the end of the function.

Does that make sense?
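
For concreteness, a small sketch of how that formula caps the worker count (effectiveParallelism is a hypothetical name, not the PR's function; min is the Go 1.21+ built-in):

// effectiveParallelism derives the worker count from the number of
// input slices, capped by mergeSlicesParallelism.
func effectiveParallelism(numInputs, maxParallelism int) int {
	return min(maxParallelism, numInputs/2)
}

// effectiveParallelism(3, 8)  == 1  -> merge on a single goroutine
// effectiveParallelism(4, 8)  == 2
// effectiveParallelism(16, 8) == 8  -> capped by the constant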

yeya24 (Contributor) commented Feb 24, 2024:
I see, thanks for the explanation. The input length is basically the number of ingesters for the user, so if the user has at least 16 ingesters we will use a parallelism of 8?

16 ingesters sounds like a small number to me, and I am worried about having very small batch sizes per goroutine. What about using a larger x below, like 16 or even 32? Is that better than 2, or do you think it doesn't make much difference?

p := min(parallelism, len(a)/x)
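
A rough sketch of that variant, assuming a fixed batch size per goroutine (slicesPerGoroutine and parallelismFor are hypothetical names, not code from the PR; min/max are the Go 1.21+ built-ins):

const slicesPerGoroutine = 16 // hypothetical fixed batch size (the "x" above)

// parallelismFor gives each goroutine roughly slicesPerGoroutine inputs,
// capped at maxParallelism and never below 1.
func parallelismFor(numInputs, maxParallelism int) int {
	return min(maxParallelism, max(1, numInputs/slicesPerGoroutine))
}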

alanprot (Member, Author) commented:

I think in this case both solutions will be pretty fast, but let me create a benchmark.

alanprot (Member, Author) commented:

Description updated with 16 ingesters and a test case added.

WDYT?

Signed-off-by: Alan Protasio <alanprot@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
yeya24 (Contributor) left a comment:
Thanks

@yeya24 yeya24 merged commit 14d9b7b into cortexproject:master Feb 26, 2024
16 checks passed