[fix](load data) decommission replica don't load data when it misses versions #38198

Merged

Conversation

yujun777
Collaborator

@yujun777 yujun777 commented Jul 22, 2024

BUG: a decommissioned replica that is missing versions (last failed version > 0) gets stuck:

  1. It cannot run compaction because it has missing versions;
  2. It cannot clone the missing rowsets because it is decommissioned.

So if it continues to load data, its rowset count keeps growing until loads fail with error -235 (TOO MANY VERSION).

Fix: don't write data to a decommissioned replica if it is missing rowsets.
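
The rule in the fix can be illustrated with a minimal sketch. This is not the Doris code from this PR: `Replica`, `ReplicaState`, `should_write`, and `replicas_to_write` are hypothetical names chosen for illustration, and the only logic taken from the description is "skip a decommissioned replica whose last failed version is greater than 0".

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ReplicaState(Enum):
    # Hypothetical, simplified set of states for this sketch.
    NORMAL = "NORMAL"
    DECOMMISSION = "DECOMMISSION"


@dataclass
class Replica:
    replica_id: int
    state: ReplicaState
    # Assumption: a value > 0 means the replica is missing versions.
    last_failed_version: int = -1


def should_write(replica: Replica) -> bool:
    """Return False only for a decommissioned replica that misses versions."""
    misses_versions = replica.last_failed_version > 0
    is_decommissioned = replica.state is ReplicaState.DECOMMISSION
    return not (is_decommissioned and misses_versions)


def replicas_to_write(replicas: List[Replica]) -> List[Replica]:
    """Keep every replica that a load may still write to."""
    return [r for r in replicas if should_write(r)]


replicas = [
    Replica(1, ReplicaState.NORMAL),
    Replica(2, ReplicaState.NORMAL, last_failed_version=7),
    Replica(3, ReplicaState.DECOMMISSION),
    Replica(4, ReplicaState.DECOMMISSION, last_failed_version=7),
]
written_ids = [r.replica_id for r in replicas_to_write(replicas)]
print(written_ids)
```

In this sketch only replica 4 is skipped. A normal replica with missing versions (replica 2) still receives data because it can clone the missing rowsets later, and a healthy decommissioned replica (replica 3) can still compact.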

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@yujun777
Collaborator Author

run buildall

Contributor

@deardeng deardeng left a comment

LGTM

Contributor

PR approved by anyone and no changes requested.

@doris-robot

TPC-H: Total hot run time: 39874 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit fdb446422972f30b0a75036348a5d9dd57a985ee, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17905	4355	4304	4304
q2	2040	191	198	191
q3	10499	1222	1076	1076
q4	10191	847	907	847
q5	7639	2706	2676	2676
q6	218	133	135	133
q7	948	604	599	599
q8	9216	2061	2089	2061
q9	8916	6530	6563	6530
q10	8845	3781	3802	3781
q11	445	238	236	236
q12	405	219	225	219
q13	17770	2987	2947	2947
q14	279	229	233	229
q15	523	502	495	495
q16	497	378	380	378
q17	955	613	652	613
q18	8086	7490	7397	7397
q19	8171	1350	1308	1308
q20	668	323	315	315
q21	4928	3422	3261	3261
q22	360	287	278	278
Total cold run time: 119504 ms
Total hot run time: 39874 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	4388	4428	4241	4241
q2	383	274	261	261
q3	3080	2903	2900	2900
q4	2011	1773	1744	1744
q5	5528	5513	5445	5445
q6	224	133	131	131
q7	2265	1812	1874	1812
q8	3265	3438	3402	3402
q9	8806	8824	8829	8824
q10	4152	3779	3823	3779
q11	595	517	499	499
q12	830	638	640	638
q13	17254	3166	3145	3145
q14	316	281	300	281
q15	526	485	494	485
q16	488	443	433	433
q17	1826	1541	1506	1506
q18	8280	7927	7799	7799
q19	1727	1484	1600	1484
q20	2114	1872	1844	1844
q21	7670	4772	4733	4733
q22	585	496	510	496
Total cold run time: 76313 ms
Total hot run time: 55882 ms

@doris-robot

TPC-DS: Total hot run time: 174310 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit fdb446422972f30b0a75036348a5d9dd57a985ee, data reload: false

query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
query1	915	374	368	368
query2	6449	1985	1929	1929
query3	6633	207	216	207
query4	26543	17432	17251	17251
query5	3611	471	468	468
query6	260	168	159	159
query7	4575	298	290	290
query8	232	198	193	193
query9	8475	2435	2421	2421
query10	435	279	275	275
query11	11673	10040	10115	10040
query12	119	88	82	82
query13	1647	373	388	373
query14	10233	7814	8155	7814
query15	225	173	175	173
query16	7112	477	471	471
query17	1587	581	546	546
query18	1213	291	287	287
query19	202	152	152	152
query20	91	85	84	84
query21	204	135	128	128
query22	4379	4068	3884	3884
query23	34187	33709	33675	33675
query24	10782	2987	3038	2987
query25	646	407	409	407
query26	1178	158	158	158
query27	2424	340	278	278
query28	7063	2059	2021	2021
query29	877	664	620	620
query30	250	174	166	166
query31	961	748	773	748
query32	95	55	56	55
query33	729	328	330	328
query34	892	512	484	484
query35	856	772	761	761
query36	1161	979	998	979
query37	155	87	80	80
query38	2960	2937	2800	2800
query39	903	836	830	830
query40	207	117	119	117
query41	45	45	45	45
query42	113	99	96	96
query43	509	475	470	470
query44	1197	706	707	706
query45	197	164	163	163
query46	1088	720	729	720
query47	1840	1771	1772	1771
query48	364	288	298	288
query49	844	420	410	410
query50	770	393	388	388
query51	6716	6634	6733	6634
query52	105	93	90	90
query53	358	292	286	286
query54	823	450	450	450
query55	76	72	72	72
query56	287	254	273	254
query57	1125	1077	1036	1036
query58	252	258	247	247
query59	2884	2711	2716	2711
query60	307	274	292	274
query61	100	97	95	95
query62	794	665	665	665
query63	320	286	295	286
query64	9467	2220	1682	1682
query65	3161	3124	3128	3124
query66	779	325	325	325
query67	15450	15120	14977	14977
query68	4550	522	533	522
query69	698	425	352	352
query70	1172	1103	1095	1095
query71	450	276	274	274
query72	8736	5565	5760	5565
query73	770	323	323	323
query74	6218	5700	5650	5650
query75	3969	2636	2679	2636
query76	3110	959	933	933
query77	720	309	304	304
query78	9711	9113	8993	8993
query79	2249	518	508	508
query80	2490	470	470	470
query81	588	223	217	217
query82	858	133	137	133
query83	293	166	162	162
query84	256	89	86	86
query85	2157	320	355	320
query86	466	286	299	286
query87	3322	3118	3187	3118
query88	4182	2450	2443	2443
query89	479	389	385	385
query90	2000	192	191	191
query91	131	100	101	100
query92	60	50	48	48
query93	2686	485	483	483
query94	1373	290	274	274
query95	406	313	317	313
query96	603	270	271	270
query97	3231	3038	3059	3038
query98	229	201	195	195
query99	1544	1269	1282	1269
Total cold run time: 282817 ms
Total hot run time: 174310 ms

@doris-robot

ClickBench: Total hot run time: 30.82 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit fdb446422972f30b0a75036348a5d9dd57a985ee, data reload: false

query	cold (s)	hot run 1 (s)	hot run 2 (s)
query1	0.04	0.04	0.04
query2	0.08	0.03	0.04
query3	0.22	0.05	0.05
query4	1.68	0.07	0.08
query5	0.48	0.48	0.47
query6	1.14	0.73	0.73
query7	0.02	0.02	0.01
query8	0.05	0.04	0.05
query9	0.54	0.49	0.48
query10	0.55	0.56	0.52
query11	0.16	0.11	0.11
query12	0.16	0.12	0.12
query13	0.60	0.58	0.58
query14	0.76	0.77	0.81
query15	0.85	0.82	0.82
query16	0.36	0.37	0.35
query17	0.97	1.01	1.01
query18	0.23	0.22	0.22
query19	1.90	1.73	1.79
query20	0.01	0.00	0.01
query21	15.41	0.74	0.64
query22	4.25	7.27	2.09
query23	18.28	1.39	1.24
query24	2.10	0.22	0.23
query25	0.15	0.08	0.09
query26	0.31	0.22	0.21
query27	0.45	0.24	0.23
query28	13.25	1.01	0.98
query29	12.61	3.38	3.36
query30	0.25	0.06	0.05
query31	2.88	0.38	0.38
query32	3.29	0.48	0.47
query33	2.86	2.96	2.91
query34	16.86	4.35	4.36
query35	4.45	4.40	4.40
query36	0.65	0.47	0.49
query37	0.19	0.15	0.16
query38	0.16	0.16	0.15
query39	0.04	0.04	0.03
query40	0.15	0.12	0.12
query41	0.09	0.06	0.05
query42	0.05	0.05	0.05
query43	0.04	0.05	0.04
Total cold run time: 109.57 s
Total hot run time: 30.82 s

@yujun777
Collaborator Author

run feut

@yujun777
Collaborator Author

run cloud_p1

Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jul 23, 2024
Contributor

PR approved by at least one committer and no changes requested.

@dataroaring dataroaring merged commit 1d315b2 into apache:master Jul 23, 2024
28 of 30 checks passed
dataroaring pushed a commit that referenced this pull request Jul 23, 2024
dataroaring pushed a commit that referenced this pull request Jul 23, 2024
dataroaring pushed a commit that referenced this pull request Jul 24, 2024
…versions (#38198)

BUG: a decommissioned replica that is missing versions (last failed version >
0) gets stuck:
1. It cannot run compaction because it has missing versions;
2. It cannot clone the missing rowsets because it is decommissioned.

So if it continues to load data, its rowset count keeps growing until loads
fail with error -235 (TOO MANY VERSION).

Fix: don't write data to a decommissioned replica if it is missing rowsets.
@wm1581066 wm1581066 added the usercase Important user case type label label Jul 30, 2024
Labels
approved Indicates a PR has been approved by one committer. dev/2.0.14-merged dev/2.1.6-merged dev/3.0.1-merged reviewed usercase Important user case type label
5 participants