[fix](query cancel) Fix query is cancelled when it comes from follower FE #37662

Merged
merged 2 commits into from
Jul 12, 2024

@zhiqiang-hhhh zhiqiang-hhhh commented Jul 11, 2024

In some rare cases, the RPC port of a follower FE is not updated in time, so the heartbeat reports this follower's RPC port as 0 even though the FE is still running. Queries from that follower FE are then cancelled by the BE until the RPC port is updated correctly on the BE.

This PR fixes the problem on the BE side by detecting this situation and not cancelling the query when it occurs.
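
The check described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `FrontendAddr`, `coordinator_is_alive`, and the host names and port numbers in the usage note are hypothetical, and the real BE logic may compare additional state. It assumes the BE holds a list of frontends from the latest heartbeat and wants to decide whether a query's coordinator FE is still running.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified view of a frontend address as known to a BE.
struct FrontendAddr {
    std::string host;
    int32_t rpc_port;
};

// Returns true when the query's coordinator should be treated as alive,
// so the query must not be cancelled.
// `running_fes` stands in for the frontends reported by the latest heartbeat.
bool coordinator_is_alive(const std::vector<FrontendAddr>& running_fes,
                          const FrontendAddr& coordinator) {
    // Normal case: both host and RPC port match a running frontend.
    const bool exact_match = std::any_of(
            running_fes.begin(), running_fes.end(), [&](const FrontendAddr& fe) {
                return fe.host == coordinator.host &&
                       fe.rpc_port == coordinator.rpc_port;
            });
    if (exact_match) {
        return true;
    }
    // Rare case from the description: the follower's RPC port has not been
    // updated yet and is reported as 0, but the host itself is still known.
    // Treat the coordinator as alive instead of cancelling its queries.
    return std::any_of(running_fes.begin(), running_fes.end(),
                       [&](const FrontendAddr& fe) {
                           return fe.host == coordinator.host && fe.rpc_port == 0;
                       });
}
```

With a heartbeat list containing `{"fe-follower", 0}`, a query whose coordinator is `{"fe-follower", 9020}` is kept alive under this sketch, while a coordinator whose host does not appear in the list at all is still reported as not running.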

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@zhiqiang-hhhh
Contributor Author

run buildall

Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 40180 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 9c9cee84134a83646fc2d172e3fa8ef5905cd58e, data reload: false

------ Round 1 ----------------------------------
q1	17590	4434	4295	4295
q2	2002	194	189	189
q3	10476	1196	1085	1085
q4	10193	816	781	781
q5	7531	2834	2725	2725
q6	222	139	143	139
q7	961	606	619	606
q8	9220	2068	2103	2068
q9	8879	6546	6564	6546
q10	8948	3850	3816	3816
q11	472	247	246	246
q12	455	233	240	233
q13	17765	2993	2995	2993
q14	283	225	236	225
q15	515	491	479	479
q16	492	384	378	378
q17	966	629	689	629
q18	8171	7527	7402	7402
q19	7712	1479	1506	1479
q20	662	319	329	319
q21	4961	3240	3216	3216
q22	379	331	339	331
Total cold run time: 118855 ms
Total hot run time: 40180 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4434	4230	4271	4230
q2	370	288	272	272
q3	3066	2933	2981	2933
q4	1986	1767	1716	1716
q5	5607	5542	5490	5490
q6	243	142	147	142
q7	2261	1855	1887	1855
q8	3287	3458	3439	3439
q9	8818	8909	8828	8828
q10	4222	3837	3849	3837
q11	606	524	523	523
q12	848	657	655	655
q13	16603	3174	3193	3174
q14	313	294	282	282
q15	523	500	476	476
q16	506	457	442	442
q17	1821	1540	1504	1504
q18	8288	7937	7864	7864
q19	1806	1665	1705	1665
q20	2530	1861	1871	1861
q21	5028	4815	4920	4815
q22	629	552	567	552
Total cold run time: 73795 ms
Total hot run time: 56555 ms

@doris-robot

TPC-DS: Total hot run time: 175123 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 9c9cee84134a83646fc2d172e3fa8ef5905cd58e, data reload: false

query1	914	364	374	364
query2	6395	2363	2357	2357
query3	6653	211	225	211
query4	28352	17623	17350	17350
query5	3745	483	478	478
query6	287	183	178	178
query7	4590	291	283	283
query8	307	303	296	296
query9	8538	2483	2475	2475
query10	451	285	277	277
query11	11553	10122	10086	10086
query12	121	94	84	84
query13	1655	375	377	375
query14	9981	7030	7606	7030
query15	234	193	189	189
query16	7746	322	324	322
query17	1345	550	545	545
query18	1959	290	291	290
query19	203	150	151	150
query20	91	88	88	88
query21	213	126	132	126
query22	4333	4127	3977	3977
query23	33990	34245	33766	33766
query24	10887	2994	2965	2965
query25	626	406	410	406
query26	726	160	156	156
query27	2273	283	287	283
query28	6259	2181	2176	2176
query29	897	659	650	650
query30	257	156	154	154
query31	974	760	770	760
query32	97	57	60	57
query33	682	320	325	320
query34	901	516	506	506
query35	684	607	636	607
query36	1138	985	1000	985
query37	154	88	96	88
query38	2922	2828	2884	2828
query39	898	814	827	814
query40	211	122	121	121
query41	56	53	58	53
query42	116	99	101	99
query43	585	526	531	526
query44	1103	734	722	722
query45	196	163	162	162
query46	1091	776	726	726
query47	1879	1766	1786	1766
query48	381	303	306	303
query49	854	509	416	416
query50	788	390	404	390
query51	6972	6859	6842	6842
query52	112	96	98	96
query53	363	286	292	286
query54	874	470	464	464
query55	76	75	78	75
query56	291	264	264	264
query57	1127	1049	1047	1047
query58	257	238	240	238
query59	3355	3427	3185	3185
query60	314	275	283	275
query61	102	98	101	98
query62	778	661	651	651
query63	329	292	293	292
query64	9491	2236	1680	1680
query65	3183	3101	3100	3100
query66	705	332	343	332
query67	15567	15065	15148	15065
query68	8584	549	569	549
query69	708	485	371	371
query70	1426	1179	1138	1138
query71	518	289	280	280
query72	8621	5721	5474	5474
query73	2241	331	330	330
query74	5988	5575	5507	5507
query75	5049	2685	2741	2685
query76	4863	945	939	939
query77	794	316	312	312
query78	9983	9113	8988	8988
query79	9633	528	532	528
query80	1290	483	491	483
query81	576	224	227	224
query82	676	139	132	132
query83	336	167	167	167
query84	273	89	87	87
query85	1319	322	301	301
query86	389	304	306	304
query87	3323	3141	3102	3102
query88	4194	2451	2473	2451
query89	525	384	380	380
query90	2166	193	196	193
query91	134	105	103	103
query92	60	49	51	49
query93	6721	519	512	512
query94	1515	222	214	214
query95	418	324	328	324
query96	621	280	276	276
query97	3190	3079	2971	2971
query98	219	204	192	192
query99	1563	1231	1264	1231
Total cold run time: 303103 ms
Total hot run time: 175123 ms

@doris-robot

ClickBench: Total hot run time: 30.87 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 9c9cee84134a83646fc2d172e3fa8ef5905cd58e, data reload: false

query1	0.05	0.03	0.03
query2	0.08	0.04	0.04
query3	0.23	0.05	0.04
query4	1.69	0.07	0.08
query5	0.48	0.49	0.47
query6	1.13	0.72	0.72
query7	0.02	0.01	0.01
query8	0.05	0.04	0.04
query9	0.55	0.49	0.49
query10	0.55	0.54	0.54
query11	0.15	0.12	0.11
query12	0.15	0.12	0.13
query13	0.60	0.60	0.57
query14	0.77	0.78	0.77
query15	0.86	0.81	0.82
query16	0.37	0.38	0.34
query17	0.98	0.96	1.04
query18	0.23	0.22	0.21
query19	1.89	1.68	1.74
query20	0.02	0.01	0.01
query21	15.41	0.76	0.66
query22	3.91	7.51	2.32
query23	18.24	1.34	1.25
query24	2.19	0.22	0.23
query25	0.15	0.09	0.09
query26	0.29	0.21	0.21
query27	0.46	0.23	0.23
query28	13.20	1.02	0.99
query29	12.65	3.30	3.26
query30	0.25	0.07	0.05
query31	2.89	0.39	0.39
query32	3.24	0.48	0.48
query33	2.84	2.93	2.92
query34	17.20	4.34	4.40
query35	4.49	4.43	4.40
query36	0.65	0.46	0.46
query37	0.19	0.16	0.15
query38	0.16	0.15	0.15
query39	0.04	0.04	0.04
query40	0.15	0.11	0.12
query41	0.09	0.04	0.05
query42	0.05	0.04	0.05
query43	0.04	0.05	0.04
Total cold run time: 109.63 s
Total hot run time: 30.87 s

@zhiqiang-hhhh
Contributor Author

run buildall

@zhiqiang-hhhh zhiqiang-hhhh changed the title from "FIX" to "[fix](query cancel) Fix query is cancelled when it comes from follower FE" Jul 11, 2024
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 40186 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 097d51f891a07a955c7b139eed8161f8d67aaf61, data reload: false

------ Round 1 ----------------------------------
q1	17618	4353	4370	4353
q2	2021	195	191	191
q3	10455	1266	1072	1072
q4	10432	741	866	741
q5	7686	2728	2717	2717
q6	230	138	156	138
q7	982	616	616	616
q8	9684	2093	2122	2093
q9	9073	6553	6530	6530
q10	8733	3799	3802	3799
q11	477	247	250	247
q12	402	232	232	232
q13	17786	2993	3000	2993
q14	286	238	246	238
q15	525	489	503	489
q16	530	382	379	379
q17	965	665	698	665
q18	8127	7598	7370	7370
q19	1720	1493	1465	1465
q20	689	320	334	320
q21	5039	3205	3266	3205
q22	395	333	349	333
Total cold run time: 113855 ms
Total hot run time: 40186 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4362	4287	4258	4258
q2	383	280	279	279
q3	2977	2759	2799	2759
q4	1894	1643	1585	1585
q5	5276	5314	5321	5314
q6	221	129	132	129
q7	2134	1732	1760	1732
q8	3266	3411	3352	3352
q9	8463	8420	8429	8420
q10	3900	3674	3630	3630
q11	604	492	496	492
q12	820	616	620	616
q13	16370	3002	2995	2995
q14	307	275	276	275
q15	510	479	487	479
q16	490	429	430	429
q17	1821	1502	1513	1502
q18	7721	7401	7420	7401
q19	1712	1499	1560	1499
q20	2014	1785	1768	1768
q21	4970	4763	4649	4649
q22	632	559	544	544
Total cold run time: 70847 ms
Total hot run time: 54107 ms

@doris-robot

TPC-DS: Total hot run time: 173991 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 097d51f891a07a955c7b139eed8161f8d67aaf61, data reload: false

query1	908	373	370	370
query2	6451	2467	2374	2374
query3	6678	210	229	210
query4	27205	17446	17506	17446
query5	4293	469	481	469
query6	275	169	166	166
query7	4598	307	294	294
query8	322	318	293	293
query9	8446	2490	2491	2490
query10	442	278	271	271
query11	10983	10129	10004	10004
query12	137	84	83	83
query13	1642	376	380	376
query14	10276	7721	7742	7721
query15	241	186	189	186
query16	7844	336	342	336
query17	1770	556	540	540
query18	1943	288	285	285
query19	198	156	153	153
query20	91	81	84	81
query21	214	131	124	124
query22	4324	4041	4048	4041
query23	33673	32905	33225	32905
query24	12138	2852	2795	2795
query25	682	384	389	384
query26	1765	152	155	152
query27	2888	277	344	277
query28	7404	2090	2082	2082
query29	1084	635	629	629
query30	287	145	148	145
query31	962	727	730	727
query32	97	52	53	52
query33	761	296	300	296
query34	1000	493	500	493
query35	669	605	589	589
query36	1098	923	931	923
query37	284	77	78	77
query38	2897	2733	2716	2716
query39	879	822	802	802
query40	277	118	121	118
query41	53	53	53	53
query42	121	97	100	97
query43	583	528	553	528
query44	1235	738	727	727
query45	191	162	162	162
query46	1087	716	715	715
query47	1848	1791	1792	1791
query48	372	300	297	297
query49	1187	422	420	420
query50	780	406	405	405
query51	6947	6733	6821	6733
query52	107	87	92	87
query53	355	292	295	292
query54	1023	452	453	452
query55	77	76	74	74
query56	285	267	272	267
query57	1169	1082	1042	1042
query58	248	250	248	248
query59	3335	3291	3058	3058
query60	331	274	296	274
query61	99	95	92	92
query62	841	625	659	625
query63	323	296	291	291
query64	10416	2223	1677	1677
query65	3200	3086	3095	3086
query66	1411	332	325	325
query67	15530	15019	15028	15019
query68	4806	541	549	541
query69	684	424	341	341
query70	1208	1147	1120	1120
query71	451	297	281	281
query72	8380	5842	5611	5611
query73	762	334	329	329
query74	5921	5483	5434	5434
query75	4298	2737	2643	2643
query76	3730	972	994	972
query77	641	308	310	308
query78	9618	8977	8868	8868
query79	2445	518	514	514
query80	1242	468	470	468
query81	588	216	220	216
query82	1352	134	129	129
query83	304	165	164	164
query84	241	89	88	88
query85	1568	312	297	297
query86	449	320	337	320
query87	3238	3078	3092	3078
query88	4295	2470	2468	2468
query89	466	379	406	379
query90	1919	193	187	187
query91	132	101	102	101
query92	64	52	50	50
query93	2246	499	512	499
query94	1347	215	220	215
query95	401	317	323	317
query96	611	276	271	271
query97	3177	3046	3058	3046
query98	221	202	199	199
query99	1570	1225	1256	1225
Total cold run time: 289655 ms
Total hot run time: 173991 ms

@doris-robot

ClickBench: Total hot run time: 31.7 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 097d51f891a07a955c7b139eed8161f8d67aaf61, data reload: false

query1	0.04	0.03	0.03
query2	0.07	0.04	0.04
query3	0.22	0.06	0.06
query4	1.67	0.08	0.08
query5	0.51	0.48	0.49
query6	1.14	0.73	0.73
query7	0.02	0.02	0.02
query8	0.05	0.04	0.04
query9	0.56	0.48	0.49
query10	0.54	0.54	0.54
query11	0.16	0.11	0.11
query12	0.14	0.13	0.13
query13	0.59	0.59	0.57
query14	0.76	0.78	0.79
query15	0.86	0.82	0.80
query16	0.36	0.35	0.35
query17	0.96	1.02	0.97
query18	0.24	0.23	0.22
query19	1.80	1.71	1.69
query20	0.01	0.00	0.00
query21	15.40	0.72	0.65
query22	3.86	6.59	3.09
query23	18.36	1.43	1.32
query24	2.16	0.21	0.22
query25	0.16	0.08	0.09
query26	0.30	0.21	0.21
query27	0.46	0.24	0.23
query28	13.26	1.02	0.99
query29	12.62	3.32	3.29
query30	0.25	0.06	0.05
query31	2.89	0.39	0.39
query32	3.26	0.47	0.48
query33	2.85	3.01	2.89
query34	16.93	4.31	4.34
query35	4.43	4.40	4.36
query36	0.65	0.46	0.46
query37	0.20	0.16	0.15
query38	0.16	0.15	0.15
query39	0.05	0.03	0.04
query40	0.15	0.13	0.12
query41	0.09	0.05	0.05
query42	0.06	0.05	0.04
query43	0.05	0.04	0.04
Total cold run time: 109.3 s
Total hot run time: 31.7 s

@zhiqiang-hhhh zhiqiang-hhhh marked this pull request as ready for review July 11, 2024 16:53
@zhiqiang-hhhh
Contributor Author

run external

Contributor

@yiguolei yiguolei left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jul 12, 2024
Contributor

PR approved by at least one committer and no changes requested.

Contributor

PR approved by anyone and no changes requested.

Contributor

@xinyiZzz xinyiZzz left a comment

LGTM

@yiguolei yiguolei merged commit bad7f3b into apache:master Jul 12, 2024
28 of 32 checks passed
@zhiqiang-hhhh zhiqiang-hhhh deleted the fix-coordinator-dead branch July 12, 2024 04:15
yiguolei pushed a commit that referenced this pull request Jul 12, 2024
seawinde pushed a commit to seawinde/doris that referenced this pull request Jul 17, 2024
dataroaring pushed a commit that referenced this pull request Jul 17, 2024
Labels
approved Indicates a PR has been approved by one committer. dev/2.1.5-merged dev/3.0.1-merged reviewed
5 participants