[fix](publish) Fix publish failed because "task" is null #37531

Merged
merged 2 commits into from
Jul 11, 2024

Conversation

mymeiyi
Contributor

@mymeiyi mymeiyi commented Jul 9, 2024

Proposed changes

2024-07-08 00:43:27,149 ERROR (PUBLISH_VERSION|33) [PublishVersionDaemon.runAfterCatalogReady():73] errors while publish version to all backends
java.lang.NullPointerException: Cannot invoke "org.apache.doris.task.PublishVersionTask.isFinished()" because "task" is null
at org.apache.doris.transaction.PublishVersionDaemon.lambda$tryFinishTxn$0(PublishVersionDaemon.java:163) ~[doris-fe.jar:1.2-SNAPSHOT]
at java.util.HashMap.forEach(HashMap.java:1421) ~[?:?]
at org.apache.doris.transaction.PublishVersionDaemon.tryFinishTxn(PublishVersionDaemon.java:160) ~[doris-fe.jar:1.2-SNAPSHOT]
at org.apache.doris.transaction.PublishVersionDaemon.publishVersion(PublishVersionDaemon.java:96) ~[doris-fe.jar:1.2-SNAPSHOT]
at org.apache.doris.transaction.PublishVersionDaemon.runAfterCatalogReady(PublishVersionDaemon.java:70) ~[doris-fe.jar:1.2-SNAPSHOT]
at org.apache.doris.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:58) ~[doris-fe.jar:1.2-SNAPSHOT]
at org.apache.doris.common.util.Daemon.run(Daemon.java:116) ~[doris-fe.jar:1.2-SNAPSHOT]
  1. When trying to finish a txn, catch the exception so that one failed txn does not block the other txns.
  2. Previously, committing a txn added a <be_id, null publish task> entry to the publish tasks, and publishing the txn later replaced the null publish task with a new publish task.
    This PR changes that: committing a txn records the involved BE ids, and publishing the txn generates the publish tasks for all involved BEs.
  3. There is also a bug with tableIdToTabletDeltaRows in the transaction state: it records the info of all ready txns, because the variable is declared outside the scope of for (TransactionState transactionState : readyTransactionStates).
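
The change in point 2 can be sketched roughly as follows. This is a minimal illustration, not the Doris code: `SketchPublishTask`, `SketchTxnState`, `onCommit`, `onPublish`, and `allTasksFinished` are hypothetical names, and only `involvedBackends` and `publishTasks` echo fields mentioned in this PR. The point is that commit stores ids only, so the task map never holds a null value that a later `isFinished()` style check could trip over.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a publish task; not the Doris PublishVersionTask class.
final class SketchPublishTask {
    final long backendId;
    final long txnId;
    boolean finished;

    SketchPublishTask(long backendId, long txnId) {
        this.backendId = backendId;
        this.txnId = txnId;
    }
}

// Hypothetical stand-in for the transaction state.
final class SketchTxnState {
    final long txnId;
    // Filled at commit time: only backend ids, no placeholder tasks.
    final Set<Long> involvedBackends = new HashSet<>();
    // Filled at publish time: every value is a real task, never null.
    final Map<Long, SketchPublishTask> publishTasks = new HashMap<>();

    SketchTxnState(long txnId) {
        this.txnId = txnId;
    }

    // Commit: remember which backends take part in this txn.
    void onCommit(Set<Long> backendIds) {
        involvedBackends.addAll(backendIds);
    }

    // Publish: create one task per involved backend, keeping existing tasks.
    void onPublish() {
        for (long beId : involvedBackends) {
            publishTasks.computeIfAbsent(beId, id -> new SketchPublishTask(id, txnId));
        }
    }

    // True only when tasks exist and all of them are finished.
    boolean allTasksFinished() {
        return !publishTasks.isEmpty()
                && publishTasks.values().stream().allMatch(t -> t.finished);
    }
}
```

With this shape, iterating `publishTasks` before publish simply sees an empty map instead of null entries.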

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@mymeiyi
Contributor Author

mymeiyi commented Jul 9, 2024

run buildall

@doris-robot

TPC-H: Total hot run time: 40033 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 15f2620cf1e34f5dd202d75c9fdca16b1b673d4f, data reload: false

------ Round 1 ----------------------------------
q1	17625	4379	4279	4279
q2	2008	197	191	191
q3	10455	1191	1129	1129
q4	10200	778	798	778
q5	7786	2731	2684	2684
q6	225	143	141	141
q7	974	608	619	608
q8	9555	2062	2087	2062
q9	8789	6471	6480	6471
q10	8950	3742	3728	3728
q11	464	243	244	243
q12	427	228	227	227
q13	17856	2966	2971	2966
q14	279	235	219	219
q15	521	490	486	486
q16	504	390	370	370
q17	968	736	738	736
q18	8077	7375	7371	7371
q19	7937	1421	1566	1421
q20	692	323	326	323
q21	4901	3259	3287	3259
q22	403	341	352	341
Total cold run time: 119596 ms
Total hot run time: 40033 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4345	4238	4225	4225
q2	375	261	259	259
q3	2924	2731	2717	2717
q4	1854	1571	1551	1551
q5	5250	5306	5260	5260
q6	215	134	132	132
q7	2133	1721	1783	1721
q8	3185	3350	3350	3350
q9	8382	8285	8425	8285
q10	3876	3640	3585	3585
q11	604	499	504	499
q12	766	627	595	595
q13	16346	3014	3022	3014
q14	296	274	253	253
q15	510	470	483	470
q16	477	408	418	408
q17	1768	1480	1461	1461
q18	7616	7557	7371	7371
q19	2278	1571	1557	1557
q20	2008	1747	1823	1747
q21	4899	4765	4764	4764
q22	623	558	534	534
Total cold run time: 70730 ms
Total hot run time: 53758 ms

@doris-robot

TPC-DS: Total hot run time: 172346 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 15f2620cf1e34f5dd202d75c9fdca16b1b673d4f, data reload: false

query1	913	363	363	363
query2	6461	2574	2357	2357
query3	6650	211	222	211
query4	28538	17312	17197	17197
query5	4192	502	488	488
query6	288	175	177	175
query7	4584	308	287	287
query8	334	311	311	311
query9	8520	2467	2447	2447
query10	475	270	272	270
query11	11561	9838	9905	9838
query12	133	82	86	82
query13	1628	368	371	368
query14	10233	6952	7673	6952
query15	233	187	191	187
query16	7812	318	305	305
query17	1576	549	524	524
query18	1965	271	272	271
query19	193	148	148	148
query20	86	83	79	79
query21	209	128	125	125
query22	4379	4061	3841	3841
query23	33696	33093	33061	33061
query24	12171	2810	2806	2806
query25	669	365	371	365
query26	1798	153	150	150
query27	2940	273	275	273
query28	7765	2089	2083	2083
query29	1126	655	642	642
query30	282	147	147	147
query31	948	758	755	755
query32	93	53	54	53
query33	774	291	303	291
query34	948	478	501	478
query35	706	578	562	562
query36	1099	945	937	937
query37	167	82	79	79
query38	2837	2708	2742	2708
query39	889	804	795	795
query40	273	122	121	121
query41	54	51	51	51
query42	120	101	108	101
query43	590	542	555	542
query44	1172	730	725	725
query45	194	158	159	158
query46	1070	735	701	701
query47	1815	1749	1764	1749
query48	393	303	302	302
query49	1142	411	423	411
query50	781	404	404	404
query51	6809	6783	6736	6736
query52	103	99	97	97
query53	362	299	292	292
query54	1015	457	457	457
query55	77	75	74	74
query56	303	307	282	282
query57	1143	1031	1044	1031
query58	257	240	247	240
query59	3375	3083	3125	3083
query60	327	282	291	282
query61	101	99	98	98
query62	832	658	663	658
query63	325	300	296	296
query64	10489	2218	6341	2218
query65	3184	3109	3101	3101
query66	1363	335	335	335
query67	15439	14828	14900	14828
query68	5311	543	542	542
query69	633	451	338	338
query70	1160	1062	1140	1062
query71	436	275	283	275
query72	7182	5132	5865	5132
query73	765	330	326	326
query74	5912	5486	5490	5486
query75	3719	2657	2699	2657
query76	3219	895	868	868
query77	663	313	317	313
query78	9453	9796	8713	8713
query79	3214	540	524	524
query80	1383	481	530	481
query81	563	224	215	215
query82	1189	138	131	131
query83	198	172	177	172
query84	269	96	89	89
query85	1486	328	306	306
query86	471	336	320	320
query87	3292	3093	3130	3093
query88	3631	2466	2430	2430
query89	517	386	396	386
query90	1852	192	193	192
query91	135	103	106	103
query92	60	51	51	51
query93	4015	510	504	504
query94	1199	221	211	211
query95	410	318	333	318
query96	593	274	270	270
query97	3183	3013	3056	3013
query98	226	208	195	195
query99	1610	1246	1250	1246
Total cold run time: 290928 ms
Total hot run time: 172346 ms

@doris-robot

ClickBench: Total hot run time: 30.61 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 15f2620cf1e34f5dd202d75c9fdca16b1b673d4f, data reload: false

query1	0.04	0.03	0.03
query2	0.08	0.04	0.04
query3	0.22	0.06	0.06
query4	1.66	0.09	0.09
query5	0.51	0.48	0.49
query6	1.14	0.72	0.72
query7	0.01	0.02	0.01
query8	0.05	0.04	0.04
query9	0.56	0.47	0.49
query10	0.53	0.54	0.54
query11	0.16	0.11	0.11
query12	0.15	0.12	0.12
query13	0.61	0.58	0.59
query14	0.76	0.78	0.78
query15	0.85	0.81	0.82
query16	0.38	0.39	0.37
query17	1.00	0.98	1.03
query18	0.23	0.22	0.22
query19	1.82	1.70	1.76
query20	0.01	0.01	0.01
query21	15.40	0.76	0.65
query22	4.33	7.25	1.83
query23	18.27	1.41	1.33
query24	2.05	0.26	0.21
query25	0.17	0.09	0.09
query26	0.28	0.21	0.21
query27	0.45	0.25	0.23
query28	13.26	1.03	1.00
query29	12.62	3.31	3.32
query30	0.24	0.07	0.06
query31	2.86	0.39	0.39
query32	3.29	0.49	0.47
query33	2.94	2.92	2.93
query34	17.14	4.38	4.33
query35	4.42	4.39	4.49
query36	0.66	0.46	0.48
query37	0.19	0.15	0.15
query38	0.15	0.15	0.14
query39	0.04	0.04	0.04
query40	0.15	0.12	0.13
query41	0.09	0.05	0.04
query42	0.05	0.04	0.04
query43	0.05	0.04	0.04
Total cold run time: 109.87 s
Total hot run time: 30.61 s

@@ -822,7 +822,7 @@ public String getErrMsg() {
public void pruneAfterVisible() {
publishVersionTasks.clear();
tableIdToTabletDeltaRows.clear();
// TODO if subTransactionStates can be cleared?
involvedBackends.clear();
Collaborator

Add @SerializedName to involvedBackends?

Collaborator

@yujun777 yujun777 left a comment

LGTM

Contributor

PR approved by anyone and no changes requested.

Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jul 11, 2024
Contributor

PR approved by at least one committer and no changes requested.

@dataroaring dataroaring merged commit 78e1409 into apache:master Jul 11, 2024
33 of 37 checks passed
dataroaring pushed a commit that referenced this pull request Jul 11, 2024
…37546)

## Proposed changes

Pick #37531

This PR catches the exception so that a failed txn does not block the
other txns.
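
A minimal sketch of that idea, assuming a loop over ready txn ids: the try/catch sits inside the loop, so an exception from one txn is recorded and the loop moves on. `FinishLoopSketch`, `tryFinishAll`, and the `tryFinishOne` callback are hypothetical names for illustration, not Doris APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class FinishLoopSketch {
    // Tries to finish every ready txn and returns the ids whose attempt threw.
    static List<Long> tryFinishAll(List<Long> readyTxnIds, Consumer<Long> tryFinishOne) {
        List<Long> failed = new ArrayList<>();
        for (Long txnId : readyTxnIds) {
            try {
                tryFinishOne.accept(txnId);
            } catch (Exception e) {
                // Isolate the failure: remember this txn and continue with the next one.
                // A real daemon would presumably log the error and retry on a later cycle.
                failed.add(txnId);
            }
        }
        return failed;
    }
}
```

If the try/catch wrapped the whole loop instead, the first failing txn would end the pass and every txn after it would wait for the next cycle.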
seawinde pushed a commit to seawinde/doris that referenced this pull request Jul 17, 2024
…che#37531)

## Proposed changes

```
2024-07-08 00:43:27,149 ERROR (PUBLISH_VERSION|33) [PublishVersionDaemon.runAfterCatalogReady():73] errors while publish version to all backends
java.lang.NullPointerException: Cannot invoke "org.apache.doris.task.PublishVersionTask.isFinished()" because "task" is null
at org.apache.doris.transaction.PublishVersionDaemon.lambda$tryFinishTxn$0(PublishVersionDaemon.java:163) ~[doris-fe.jar:1.2-SNAPSHOT]
at java.util.HashMap.forEach(HashMap.java:1421) ~[?:?]
at org.apache.doris.transaction.PublishVersionDaemon.tryFinishTxn(PublishVersionDaemon.java:160) ~[doris-fe.jar:1.2-SNAPSHOT]
at org.apache.doris.transaction.PublishVersionDaemon.publishVersion(PublishVersionDaemon.java:96) ~[doris-fe.jar:1.2-SNAPSHOT]
at org.apache.doris.transaction.PublishVersionDaemon.runAfterCatalogReady(PublishVersionDaemon.java:70) ~[doris-fe.jar:1.2-SNAPSHOT]
at org.apache.doris.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:58) ~[doris-fe.jar:1.2-SNAPSHOT]
at org.apache.doris.common.util.Daemon.run(Daemon.java:116) ~[doris-fe.jar:1.2-SNAPSHOT]
```


1. When trying to finish a txn, catch the exception so that one failed
txn does not block the other txns.
2. Previously, committing a txn added a <be_id, null publish task>
entry to the publish tasks, and publishing the txn later replaced the
null publish task with a new publish task.
This PR changes that: committing a txn records the involved BE ids, and
publishing the txn generates the publish tasks for all involved BEs.
3. There is also a bug with `tableIdToTabletDeltaRows` in the
transaction state: it records the info of all ready txns, because the
variable is declared outside the scope of
`for (TransactionState transactionState : readyTransactionStates)`.
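
The scope problem in point 3 can be illustrated with a simplified sketch. It assumes a flat tablet-id to delta-rows map per txn, which is a simplification of whatever `tableIdToTabletDeltaRows` actually holds; `DeltaRowsScopeSketch`, `sharedMap`, and `freshMap` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DeltaRowsScopeSketch {
    // Buggy shape: the map is declared outside the loop, so every txn
    // ends up pointing at one map that accumulates rows from all txns.
    static List<Map<Long, Long>> sharedMap(List<Map<Long, Long>> perTxnRows) {
        List<Map<Long, Long>> recorded = new ArrayList<>();
        Map<Long, Long> deltaRows = new HashMap<>();
        for (Map<Long, Long> rows : perTxnRows) {
            rows.forEach((tabletId, n) -> deltaRows.merge(tabletId, n, Long::sum));
            recorded.add(deltaRows);
        }
        return recorded;
    }

    // Fixed shape: a fresh map per iteration, so each txn records only its own rows.
    static List<Map<Long, Long>> freshMap(List<Map<Long, Long>> perTxnRows) {
        List<Map<Long, Long>> recorded = new ArrayList<>();
        for (Map<Long, Long> rows : perTxnRows) {
            Map<Long, Long> deltaRows = new HashMap<>();
            rows.forEach((tabletId, n) -> deltaRows.merge(tabletId, n, Long::sum));
            recorded.add(deltaRows);
        }
        return recorded;
    }
}
```

The only difference between the two methods is where `deltaRows` is declared, which matches the description of the bug as a variable scope issue.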
dataroaring pushed a commit that referenced this pull request Jul 17, 2024
)

Labels
approved Indicates a PR has been approved by one committer. dev/2.1.5-merged dev/3.0.1-merged reviewed
4 participants