[cherry-pick](branch-2.0) Pick "[Fix](bloom filter) Fix bloom filter memory leak (#34871)" #37824

Closed
wants to merge 3 commits into from

Yukang-Lian
Collaborator

@Yukang-Lian Yukang-Lian commented Jul 15, 2024

Proposed changes

Pick #34871

Issue: Doris occasionally shows exceptionally high memory usage that never decreases. The leaked memory is held by Bloom filters kept in memory.

Reason: The segment cache keeps segment objects read from files in memory. It is an LRU cache that evicts older segments when the number of cached segments exceeds the maximum count, or when the total memory size of the cached segment objects exceeds the maximum usage. The bug lies in the order of operations. The code first reads the segment object into memory (say it occupies size A) and places it into the cache, which records the segment's size as A. Only afterwards does it read the segment's Bloom filter from the file and assign it to the segment's Bloom filter member variable (say the Bloom filter occupies size B). The segment object now actually occupies A+B, but the cache never updates the recorded size, so each cached segment is larger than the cache believes. As the number of cached segments grows, real memory usage surges while the size tracked by the cache stays below the eviction limit, so no segments are evicted. The effect is a memory leak.

Solution: Each segment object reads its Bloom filter only once, so the issue can be resolved by reordering the steps. Instead of reading the segment, placing it into the cache, and then reading the Bloom filter, the code now reads the segment, reads the Bloom filter, and only then places the segment into the cache, so the cache records the full size A+B.
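
The accounting problem described above can be reproduced with a toy model. This is a minimal sketch, not the Doris implementation: `Segment`, `ToySegmentCache`, and `resident_bytes_after` are hypothetical names, and the sizes (A = 100 bytes, B = 900 bytes, capacity 1000 bytes) are made up for illustration. The only assumption carried over from the description is that the cache records an entry's size once, at insert time.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a segment object; not the Doris Segment class.
struct Segment {
    std::size_t base_bytes = 0;      // size A: the segment without its Bloom filter
    std::vector<char> bloom_filter;  // size B once loaded

    void load_bloom_filter(std::size_t bytes) { bloom_filter.assign(bytes, 0); }
    std::size_t memory_bytes() const { return base_bytes + bloom_filter.size(); }
};

// Toy LRU cache that records each entry's charge exactly once, at insert time.
class ToySegmentCache {
public:
    explicit ToySegmentCache(std::size_t capacity_bytes) : _capacity(capacity_bytes) {}

    void insert(std::shared_ptr<Segment> seg) {
        assert(seg != nullptr);
        const std::size_t charge = seg->memory_bytes();  // snapshot of the current size
        _entries.emplace_front(std::move(seg), charge);
        _charged += charge;
        // Evict from the cold end while over capacity, always keeping the newest entry.
        while (_charged > _capacity && _entries.size() > 1) {
            _charged -= _entries.back().second;
            _entries.pop_back();
        }
    }

    // What the cache believes it holds.
    std::size_t charged_bytes() const { return _charged; }

    // What the cached segments really occupy right now.
    std::size_t actual_bytes() const {
        std::size_t total = 0;
        for (const auto& entry : _entries) total += entry.first->memory_bytes();
        return total;
    }

private:
    std::size_t _capacity;
    std::size_t _charged = 0;
    std::list<std::pair<std::shared_ptr<Segment>, std::size_t>> _entries;
};

// Loads `segment_count` segments through a 1000-byte cache and returns the
// memory actually held by the cached segments afterwards.
std::size_t resident_bytes_after(bool load_bloom_before_insert, int segment_count) {
    ToySegmentCache cache(1000);
    for (int i = 0; i < segment_count; ++i) {
        auto seg = std::make_shared<Segment>();
        seg->base_bytes = 100;                                     // A
        if (load_bloom_before_insert) seg->load_bloom_filter(900); // fixed order
        cache.insert(seg);
        if (!load_bloom_before_insert) seg->load_bloom_filter(900); // old order
    }
    return cache.actual_bytes();
}
```

With ten segments, `resident_bytes_after(false, 10)` returns 10000: the cache has charged only 1000 bytes, so it never evicts. `resident_bytes_after(true, 10)` returns 1000, because each insert is charged the full A+B and older segments are evicted as expected.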
@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@Yukang-Lian
Collaborator Author

run buildall

@Yukang-Lian
Collaborator Author

run buildall

Contributor

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

be/src/olap/segment_loader.h
be/test/olap/segment_cache_test.cpp
be/test/olap/segment_cache_test.cpp
be/test/olap/segment_cache_test.cpp
@doris-robot

TPC-H: Total hot run time: 50190 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e4eef77f951087f35bf6184618e82a865ba093a9, data reload: false

------ Round 1 ----------------------------------
q1	17649	4392	4410	4392
q2	2071	162	148	148
q3	10264	1870	1959	1870
q4	10113	1269	1346	1269
q5	8575	3954	3930	3930
q6	261	125	126	125
q7	2106	1608	1632	1608
q8	9542	2766	2725	2725
q9	14291	10520	10560	10520
q10	8640	3531	3510	3510
q11	424	248	255	248
q12	473	307	306	306
q13	18375	3933	4054	3933
q14	363	328	334	328
q15	506	466	463	463
q16	661	573	564	564
q17	1142	980	966	966
q18	7354	6794	6826	6794
q19	1808	1669	1619	1619
q20	519	310	311	310
q21	4431	4124	4130	4124
q22	513	438	439	438
Total cold run time: 120081 ms
Total hot run time: 50190 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4310	4301	4483	4301
q2	321	227	232	227
q3	4194	4150	4130	4130
q4	2759	2745	2752	2745
q5	7195	7160	7144	7144
q6	239	121	120	120
q7	3258	2829	2792	2792
q8	4375	4475	4482	4475
q9	17404	17183	17100	17100
q10	4211	4300	4281	4281
q11	762	703	688	688
q12	1032	850	856	850
q13	7216	3732	3754	3732
q14	464	433	421	421
q15	515	453	457	453
q16	736	694	678	678
q17	3773	3887	3866	3866
q18	8815	8686	8713	8686
q19	1764	1673	1693	1673
q20	2384	2149	2081	2081
q21	8470	8592	8426	8426
q22	1066	1008	992	992
Total cold run time: 85263 ms
Total hot run time: 79861 ms

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.86% (8120/21448)
Line Coverage: 29.53% (66570/225413)
Region Coverage: 29.00% (34310/118290)
Branch Coverage: 24.88% (17629/70856)
Coverage Report: http://coverage.selectdb-in.cc/coverage/e4eef77f951087f35bf6184618e82a865ba093a9_e4eef77f951087f35bf6184618e82a865ba093a9/report/index.html

@doris-robot

TPC-DS: Total hot run time: 204292 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e4eef77f951087f35bf6184618e82a865ba093a9, data reload: false

query1	934	427	380	380
query2	6579	2749	2697	2697
query3	6923	216	203	203
query4	20340	18090	17954	17954
query5	19725	6550	6581	6550
query6	298	223	230	223
query7	4154	300	313	300
query8	404	406	414	406
query9	3146	2662	2597	2597
query10	423	296	291	291
query11	11404	10707	10767	10707
query12	132	78	75	75
query13	5606	706	724	706
query14	17762	13169	13661	13169
query15	361	248	244	244
query16	6465	295	269	269
query17	1682	1468	869	869
query18	2307	419	412	412
query19	199	151	160	151
query20	83	80	81	80
query21	189	96	95	95
query22	5270	5144	5058	5058
query23	32480	31854	31990	31854
query24	6912	6507	6494	6494
query25	517	445	437	437
query26	533	168	161	161
query27	1886	300	307	300
query28	6115	2397	2320	2320
query29	2827	2583	2612	2583
query30	250	165	167	165
query31	915	722	775	722
query32	65	64	63	63
query33	417	259	268	259
query34	846	476	482	476
query35	1153	946	951	946
query36	1303	1274	1364	1274
query37	98	62	64	62
query38	3054	2901	2993	2901
query39	1370	1327	1320	1320
query40	216	96	104	96
query41	46	44	42	42
query42	82	83	79	79
query43	764	655	650	650
query44	1138	718	723	718
query45	249	238	237	237
query46	1222	965	985	965
query47	1884	1887	1660	1660
query48	1033	726	691	691
query49	625	379	378	378
query50	868	624	608	608
query51	4824	4676	4719	4676
query52	97	81	88	81
query53	447	324	317	317
query54	2671	2464	2493	2464
query55	87	90	86	86
query56	241	220	215	215
query57	1344	1172	1088	1088
query58	220	194	206	194
query59	4066	4061	3957	3957
query60	228	200	228	200
query61	103	104	108	104
query62	921	443	533	443
query63	485	354	343	343
query64	2539	1557	1481	1481
query65	3655	3553	3561	3553
query66	842	380	380	380
query67	18201	17463	15894	15894
query68	7915	658	648	648
query69	583	380	339	339
query70	1580	1482	1523	1482
query71	394	306	320	306
query72	6630	3578	3549	3549
query73	745	319	329	319
query74	6259	5825	5850	5825
query75	4596	3706	3680	3680
query76	4462	1140	1186	1140
query77	542	253	264	253
query78	12807	11571	12053	11571
query79	8054	639	654	639
query80	1955	417	398	398
query81	520	237	234	234
query82	1504	100	97	97
query83	168	133	132	132
query84	262	70	71	70
query85	1413	346	344	344
query86	364	298	284	284
query87	3244	2991	3029	2991
query88	5174	2304	2317	2304
query89	391	289	301	289
query90	1776	220	217	217
query91	173	141	143	141
query92	64	57	55	55
query93	4878	540	604	540
query94	897	215	210	210
query95	1105	1069	1050	1050
query96	654	317	324	317
query97	6489	6454	6497	6454
query98	193	186	176	176
query99	2869	865	901	865
Total cold run time: 312885 ms
Total hot run time: 204292 ms

@doris-robot

ClickBench: Total hot run time: 30.92 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit e4eef77f951087f35bf6184618e82a865ba093a9, data reload: false

query1	0.02	0.03	0.02
query2	0.07	0.02	0.02
query3	0.25	0.05	0.04
query4	1.79	0.06	0.06
query5	0.53	0.53	0.51
query6	1.28	0.67	0.63
query7	0.02	0.01	0.01
query8	0.03	0.03	0.02
query9	0.52	0.49	0.48
query10	0.54	0.56	0.53
query11	0.12	0.09	0.09
query12	0.11	0.08	0.08
query13	0.63	0.62	0.60
query14	0.78	0.77	0.78
query15	0.78	0.76	0.75
query16	0.36	0.38	0.37
query17	0.99	1.02	1.01
query18	0.24	0.24	0.24
query19	1.94	1.83	1.76
query20	0.01	0.01	0.02
query21	15.47	0.57	0.56
query22	2.00	2.58	1.60
query23	16.50	1.13	0.93
query24	6.06	1.56	1.40
query25	0.39	0.12	0.05
query26	0.70	0.15	0.16
query27	0.03	0.04	0.05
query28	6.34	0.75	0.72
query29	12.65	2.26	2.13
query30	0.60	0.48	0.54
query31	2.82	0.38	0.38
query32	3.37	0.50	0.49
query33	3.06	3.08	3.07
query34	15.25	4.82	4.84
query35	4.86	4.85	4.85
query36	1.04	1.01	1.02
query37	0.06	0.04	0.04
query38	0.03	0.02	0.02
query39	0.02	0.01	0.01
query40	0.16	0.14	0.14
query41	0.07	0.01	0.02
query42	0.02	0.02	0.01
query43	0.02	0.01	0.02
Total cold run time: 102.53 s
Total hot run time: 30.92 s

@doris-robot

Load test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'

Load test result on commit e4eef77f951087f35bf6184618e82a865ba093a9 with default session variables
Stream load json:         20 seconds loaded 2358488459 Bytes, about 112 MB/s
Stream load orc:          58 seconds loaded 1101869774 Bytes, about 18 MB/s
Stream load parquet:      32 seconds loaded 861443392 Bytes, about 25 MB/s
Insert into select:       21.7 seconds inserted 10000000 Rows, about 460K ops/s
