[Fix](delete) Fix delete job timeout when executing delete from ... #37363

Merged
merged 3 commits into from
Jul 6, 2024

Conversation

@bobhan1 bobhan1 commented Jul 5, 2024

Proposed changes

Currently, when the FE executes a delete job, it sends a REALTIME_PUSH task to every affected replica and waits until all of these asynchronous tasks report success or the job times out (the timeout is at least 30s for a delete job). If a replica fails and reports an error for its task, the FE retries the task on that replica. For errors such as DELETE_INVALID_CONDITION/DELETE_INVALID_PARAMETERS, however, retrying cannot succeed, so the FE should fail and abort the delete job directly and report the error to the user rather than keep retrying in vain.
This PR makes the FE fail and abort the delete job, and report the error to the user, as soon as it receives one of the above errors from a BE.
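
The behavior described above can be illustrated with a minimal sketch. This is not the Doris FE code: the class `DeleteJobSketch`, its methods, and the `Outcome` enum are hypothetical names chosen for illustration, and error codes are modeled as plain strings. It only shows the idea of waiting on all replicas with a timeout while letting a non-retryable error release the waiter early.

```java
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; names and structure are assumptions, not Doris APIs.
final class DeleteJobSketch {
    enum Outcome { SUCCESS, ABORTED, TIMEOUT }

    // Error codes the PR description names as pointless to retry.
    private static final Set<String> NON_RETRYABLE = Set.of(
            "DELETE_INVALID_CONDITION", "DELETE_INVALID_PARAMETERS");

    private final CountDownLatch pending;   // one count per affected replica
    private volatile String fatalError;     // first non-retryable error, if any

    DeleteJobSketch(int replicaCount) {
        this.pending = new CountDownLatch(replicaCount);
    }

    /** A replica reported success for its push task. */
    void onReplicaSuccess() {
        pending.countDown();
    }

    /**
     * A replica reported an error. Returns true if the task should be retried,
     * false if the whole job must be aborted.
     */
    boolean onReplicaError(String errorCode) {
        if (NON_RETRYABLE.contains(errorCode)) {
            fatalError = errorCode;
            // Drain the latch so the waiting thread wakes up immediately
            // instead of sitting there until the timeout expires.
            while (pending.getCount() > 0) {
                pending.countDown();
            }
            return false;
        }
        return true;
    }

    /** Waits for all replicas, a fatal error, or the timeout, whichever comes first. */
    Outcome awaitResult(long timeoutMillis) {
        boolean finished;
        try {
            finished = pending.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            finished = false; // treated like a timeout in this sketch
        }
        if (fatalError != null) {
            return Outcome.ABORTED;
        }
        return finished ? Outcome.SUCCESS : Outcome.TIMEOUT;
    }

    String fatalError() {
        return fatalError;
    }
}
```

In this sketch, a caller would retry the push task only when `onReplicaError` returns true, and would surface `fatalError()` to the user when `awaitResult` returns `ABORTED`.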

branch-2.1-pick: #37374

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@bobhan1

bobhan1 commented Jul 5, 2024

run buildall

@bobhan1 bobhan1 changed the title [Fix](delete) FIx delete job time when executing delete from ... [Fix](delete) Fix delete job time when executing delete from ... Jul 5, 2024

github-actions bot commented Jul 5, 2024

clang-tidy review says "All clean, LGTM! 👍"

@dataroaring dataroaring left a comment

LGTM

@dataroaring

run buildall

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jul 5, 2024

github-actions bot commented Jul 5, 2024

PR approved by at least one committer and no changes requested.

github-actions bot commented Jul 5, 2024

PR approved by anyone and no changes requested.

@bobhan1 bobhan1 changed the title [Fix](delete) Fix delete job time when executing delete from ... [Fix](delete) Fix delete job timeout when executing delete from ... Jul 5, 2024
@doris-robot

TPC-H: Total hot run time: 40016 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ae828d49acf6b9adb44c4fb6ce52ebc1b1059aaf, data reload: false

------ Round 1 ----------------------------------
q1	12848	4878	4242	4242
q2	1430	199	189	189
q3	2507	1141	1110	1110
q4	5545	759	868	759
q5	3985	2828	2647	2647
q6	223	143	143	143
q7	955	610	603	603
q8	4767	2039	2090	2039
q9	6681	6494	6501	6494
q10	3861	3733	3766	3733
q11	434	245	256	245
q12	444	232	227	227
q13	17428	2998	2975	2975
q14	267	237	228	228
q15	511	490	501	490
q16	504	381	391	381
q17	971	655	732	655
q18	8090	7565	7513	7513
q19	1668	1545	1476	1476
q20	695	318	345	318
q21	6019	3211	3259	3211
q22	394	343	338	338
Total cold run time: 80227 ms
Total hot run time: 40016 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4273	4200	4225	4200
q2	374	268	264	264
q3	2985	2765	2691	2691
q4	1830	1600	1517	1517
q5	5254	5306	5271	5271
q6	217	130	133	130
q7	2127	1750	1704	1704
q8	3191	3342	3272	3272
q9	8325	8322	8292	8292
q10	3929	3678	3667	3667
q11	576	484	484	484
q12	797	588	655	588
q13	6946	3005	2969	2969
q14	307	256	257	256
q15	513	471	477	471
q16	472	426	414	414
q17	1776	1495	1480	1480
q18	7692	7533	7549	7533
q19	1668	1543	1563	1543
q20	1984	1817	1772	1772
q21	4857	4808	4988	4808
q22	631	548	536	536
Total cold run time: 60724 ms
Total hot run time: 53862 ms

@doris-robot

TPC-DS: Total hot run time: 174526 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ae828d49acf6b9adb44c4fb6ce52ebc1b1059aaf, data reload: false

query1	919	393	373	373
query2	6496	2491	2378	2378
query3	6655	217	221	217
query4	28760	17606	17381	17381
query5	4166	470	489	470
query6	280	182	166	166
query7	4607	304	289	289
query8	344	302	285	285
query9	8507	2477	2492	2477
query10	598	303	288	288
query11	12481	10103	10097	10097
query12	133	87	82	82
query13	1634	373	377	373
query14	10205	7687	7542	7542
query15	233	179	184	179
query16	7820	307	321	307
query17	1809	578	533	533
query18	1953	283	275	275
query19	192	148	150	148
query20	90	84	81	81
query21	205	130	125	125
query22	4318	4168	4157	4157
query23	33932	33412	33110	33110
query24	11895	2786	2796	2786
query25	634	369	369	369
query26	1746	156	154	154
query27	2921	328	324	324
query28	7547	2128	2123	2123
query29	1057	637	605	605
query30	280	153	147	147
query31	972	742	740	740
query32	97	54	53	53
query33	782	316	312	312
query34	922	497	506	497
query35	782	639	609	609
query36	1075	910	948	910
query37	145	75	74	74
query38	2871	2713	2711	2711
query39	829	805	806	805
query40	282	124	123	123
query41	56	47	49	47
query42	123	101	108	101
query43	572	548	572	548
query44	1177	752	738	738
query45	195	167	162	162
query46	1077	743	744	743
query47	1870	1780	1795	1780
query48	382	298	299	298
query49	1153	424	401	401
query50	770	410	411	410
query51	6784	6762	6716	6716
query52	105	91	97	91
query53	363	297	300	297
query54	953	463	460	460
query55	78	79	76	76
query56	294	282	275	275
query57	1165	1068	1050	1050
query58	266	249	258	249
query59	3275	3319	3144	3144
query60	340	305	302	302
query61	117	115	116	115
query62	664	503	476	476
query63	321	290	289	289
query64	10507	2197	1641	1641
query65	3216	3082	3088	3082
query66	1375	339	329	329
query67	15509	15023	15205	15023
query68	4913	563	550	550
query69	650	453	346	346
query70	1074	1087	1042	1042
query71	436	273	285	273
query72	7161	5349	5376	5349
query73	774	334	336	334
query74	6069	5550	5542	5542
query75	3626	2706	2665	2665
query76	3549	1100	895	895
query77	680	323	310	310
query78	10213	10932	9786	9786
query79	3073	518	531	518
query80	925	491	486	486
query81	585	226	220	220
query82	288	110	113	110
query83	334	173	174	173
query84	282	89	87	87
query85	738	312	302	302
query86	483	327	320	320
query87	3268	3115	3099	3099
query88	4317	2502	2462	2462
query89	495	387	377	377
query90	1810	199	191	191
query91	136	108	105	105
query92	61	48	48	48
query93	1523	524	515	515
query94	1082	216	212	212
query95	412	333	409	333
query96	580	269	277	269
query97	3232	3057	3048	3048
query98	219	208	199	199
query99	1187	857	842	842
Total cold run time: 288056 ms
Total hot run time: 174526 ms

@doris-robot

ClickBench: Total hot run time: 30.25 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ae828d49acf6b9adb44c4fb6ce52ebc1b1059aaf, data reload: false

query1	0.04	0.03	0.03
query2	0.07	0.04	0.03
query3	0.22	0.05	0.04
query4	1.68	0.07	0.08
query5	0.51	0.49	0.49
query6	1.15	0.74	0.73
query7	0.02	0.01	0.01
query8	0.05	0.04	0.05
query9	0.55	0.47	0.49
query10	0.55	0.55	0.54
query11	0.14	0.11	0.12
query12	0.15	0.12	0.13
query13	0.59	0.59	0.58
query14	0.78	0.79	0.77
query15	0.86	0.82	0.83
query16	0.37	0.37	0.36
query17	0.99	1.01	1.02
query18	0.22	0.26	0.23
query19	1.92	1.81	1.81
query20	0.01	0.00	0.01
query21	15.40	0.74	0.65
query22	3.69	8.27	1.63
query23	18.32	1.34	1.23
query24	2.09	0.22	0.23
query25	0.14	0.08	0.09
query26	0.30	0.20	0.22
query27	0.46	0.24	0.22
query28	13.29	1.02	1.00
query29	12.64	3.27	3.26
query30	0.25	0.06	0.06
query31	2.87	0.39	0.38
query32	3.26	0.48	0.46
query33	2.87	2.92	2.86
query34	17.08	4.32	4.38
query35	4.40	4.45	4.40
query36	0.65	0.46	0.49
query37	0.19	0.16	0.15
query38	0.16	0.14	0.15
query39	0.05	0.04	0.03
query40	0.15	0.12	0.13
query41	0.09	0.05	0.04
query42	0.05	0.05	0.05
query43	0.04	0.03	0.04
Total cold run time: 109.31 s
Total hot run time: 30.25 s

@bobhan1

bobhan1 commented Jul 6, 2024

run external

@bobhan1

bobhan1 commented Jul 6, 2024

run cloud_p1

@bobhan1 bobhan1 force-pushed the fix-delete-timeout branch from ae828d4 to 8fed24e Compare July 6, 2024 03:08
@bobhan1

bobhan1 commented Jul 6, 2024

run buildall

github-actions bot commented Jul 6, 2024

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 39709 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 8fed24e3d8261ecc32a10c80579b319a5dcc15b4, data reload: false

------ Round 1 ----------------------------------
q1	17617	4368	4286	4286
q2	2024	189	182	182
q3	10455	1214	1108	1108
q4	10198	789	714	714
q5	7497	2661	2698	2661
q6	219	137	136	136
q7	947	593	604	593
q8	9227	2082	2061	2061
q9	8633	6464	6473	6464
q10	8923	3674	3693	3674
q11	456	230	237	230
q12	483	234	230	230
q13	17775	2985	3008	2985
q14	262	218	226	218
q15	538	482	472	472
q16	526	370	376	370
q17	957	609	653	609
q18	8012	7404	7440	7404
q19	2897	1506	1510	1506
q20	654	309	320	309
q21	4904	3223	3161	3161
q22	392	338	336	336
Total cold run time: 113596 ms
Total hot run time: 39709 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4451	4204	4227	4204
q2	361	279	278	278
q3	2951	2713	2687	2687
q4	1965	1644	1664	1644
q5	5586	5692	5461	5461
q6	225	137	133	133
q7	2214	1842	1804	1804
q8	3270	3452	3399	3399
q9	8566	8714	8633	8633
q10	4171	3864	3890	3864
q11	607	497	501	497
q12	795	627	578	578
q13	16118	3190	3173	3173
q14	307	280	282	280
q15	539	492	482	482
q16	495	434	425	425
q17	1790	1531	1490	1490
q18	8078	7909	7889	7889
q19	1752	1690	1621	1621
q20	2131	1893	1859	1859
q21	5146	4876	4935	4876
q22	665	537	532	532
Total cold run time: 72183 ms
Total hot run time: 55809 ms

@doris-robot

TPC-DS: Total hot run time: 171308 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 8fed24e3d8261ecc32a10c80579b319a5dcc15b4, data reload: false

query1	916	368	369	368
query2	6463	2433	2349	2349
query3	6639	206	216	206
query4	28197	17547	17206	17206
query5	3651	487	490	487
query6	276	189	175	175
query7	4605	292	293	292
query8	326	298	292	292
query9	8448	2365	2352	2352
query10	579	319	289	289
query11	10425	10082	9936	9936
query12	119	86	83	83
query13	1662	382	387	382
query14	9404	7706	7574	7574
query15	230	184	182	182
query16	7941	304	298	298
query17	1829	529	522	522
query18	2054	270	271	270
query19	185	143	149	143
query20	86	81	79	79
query21	215	141	124	124
query22	4394	4183	4018	4018
query23	33985	33605	33447	33447
query24	9372	2933	2897	2897
query25	607	386	368	368
query26	730	162	163	162
query27	2324	321	324	321
query28	5853	2132	2116	2116
query29	866	640	636	636
query30	251	158	155	155
query31	1004	757	763	757
query32	95	56	58	56
query33	655	301	311	301
query34	929	519	493	493
query35	725	653	612	612
query36	1123	970	995	970
query37	133	81	82	81
query38	2970	2846	2800	2800
query39	890	859	848	848
query40	204	131	122	122
query41	55	49	52	49
query42	120	102	108	102
query43	580	521	547	521
query44	1075	741	732	732
query45	190	157	159	157
query46	1068	720	713	713
query47	1849	1745	1750	1745
query48	370	293	305	293
query49	860	403	406	403
query50	769	384	385	384
query51	6777	6822	6737	6737
query52	105	93	93	93
query53	357	293	295	293
query54	855	454	449	449
query55	72	77	71	71
query56	275	255	258	255
query57	1141	1069	1061	1061
query58	253	263	249	249
query59	3303	3799	3757	3757
query60	297	276	275	275
query61	97	93	97	93
query62	596	438	427	427
query63	315	295	295	295
query64	9140	2240	1661	1661
query65	3157	3099	3120	3099
query66	760	326	330	326
query67	15503	15232	15074	15074
query68	4439	533	532	532
query69	469	314	329	314
query70	1172	1124	1116	1116
query71	347	281	275	275
query72	7042	4244	2630	2630
query73	748	324	322	322
query74	5926	5526	5463	5463
query75	3378	2671	2673	2671
query76	2124	924	941	924
query77	430	301	298	298
query78	9462	8976	9651	8976
query79	2358	527	511	511
query80	1944	470	463	463
query81	606	222	219	219
query82	786	109	103	103
query83	281	170	172	170
query84	268	144	90	90
query85	1880	298	306	298
query86	492	316	275	275
query87	3332	3083	3072	3072
query88	4079	2436	2447	2436
query89	469	397	394	394
query90	1759	185	191	185
query91	135	104	101	101
query92	62	48	52	48
query93	2432	507	504	504
query94	1142	209	212	209
query95	402	316	312	312
query96	595	271	264	264
query97	3225	3021	3027	3021
query98	222	208	194	194
query99	1239	835	849	835
Total cold run time: 274544 ms
Total hot run time: 171308 ms

@doris-robot

ClickBench: Total hot run time: 30.23 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 8fed24e3d8261ecc32a10c80579b319a5dcc15b4, data reload: false

query1	0.04	0.03	0.03
query2	0.09	0.04	0.04
query3	0.23	0.06	0.05
query4	1.68	0.07	0.07
query5	0.51	0.48	0.49
query6	1.13	0.72	0.72
query7	0.02	0.02	0.01
query8	0.05	0.04	0.04
query9	0.55	0.48	0.48
query10	0.54	0.53	0.54
query11	0.15	0.11	0.11
query12	0.16	0.12	0.13
query13	0.60	0.59	0.60
query14	0.75	0.77	0.79
query15	0.85	0.80	0.82
query16	0.36	0.38	0.37
query17	1.03	0.97	1.03
query18	0.23	0.24	0.24
query19	1.82	1.74	1.72
query20	0.01	0.01	0.01
query21	15.41	0.74	0.65
query22	4.29	8.12	1.61
query23	18.28	1.37	1.23
query24	2.17	0.21	0.22
query25	0.15	0.10	0.08
query26	0.28	0.20	0.20
query27	0.47	0.23	0.23
query28	13.25	1.02	0.99
query29	12.59	3.36	3.34
query30	0.25	0.06	0.05
query31	2.86	0.39	0.38
query32	3.28	0.48	0.47
query33	2.95	2.86	2.92
query34	17.01	4.33	4.37
query35	4.40	4.44	4.43
query36	0.65	0.46	0.48
query37	0.17	0.14	0.16
query38	0.15	0.14	0.14
query39	0.04	0.04	0.03
query40	0.14	0.12	0.12
query41	0.09	0.04	0.05
query42	0.06	0.05	0.05
query43	0.05	0.04	0.04
Total cold run time: 109.79 s
Total hot run time: 30.23 s

@bobhan1

bobhan1 commented Jul 6, 2024

run p0

@dataroaring dataroaring merged commit 4fd46e2 into apache:master Jul 6, 2024
26 of 29 checks passed
dataroaring pushed a commit that referenced this pull request Jul 7, 2024
dataroaring pushed a commit that referenced this pull request Jul 17, 2024
…#37363)

## Proposed changes

Currently, when the FE executes a delete job, it sends a `REALTIME_PUSH`
task to every affected replica and **waits until all asynchronous tasks
sent to the backend report success** or the job times out (the timeout
is at least 30s for a delete job). If a replica fails and reports an
error for its task, the FE retries the task on that replica. For errors
such as `DELETE_INVALID_CONDITION`/`DELETE_INVALID_PARAMETERS`, however,
retrying cannot succeed, so the FE should fail and abort the delete job
directly and report the error to the user rather than keep retrying in
vain.
This PR makes the FE fail and abort the delete job, and report the error
to the user, as soon as it receives one of the above errors from a BE.
dataroaring pushed a commit that referenced this pull request Jul 17, 2024
…_INVALID_XXX` (#37834)

## Proposed changes

Fix #37363: the delete job should fail and abort on
DELETE_INVALID_CONDITION/DELETE_INVALID_PARAMETERS and retry on other
failures.
dataroaring pushed a commit that referenced this pull request Jul 19, 2024
…_INVALID_XXX` (#37834)

dataroaring pushed a commit that referenced this pull request Aug 28, 2024

4 participants