[fix](csv-reader) fix column split error when there is escape character #34364

Merged
merged 1 commit into from
May 2, 2024

Contributor

@liaoxin01 liaoxin01 commented Apr 30, 2024

Proposed changes

Issue Number: close #xxx

When a row is read in pieces and a read boundary falls immediately after an escape character, the reader forgets the pending escape and mistakes the next character for an enclose character, which splits the columns incorrectly.

For example, given this original data:

1,2,3,"{~"title~": ~"abced~", ~"id~": 1, ~"name~": ~"user1~", ~"email~": ~"user1@example.com~"}"

curl --location-trusted -u root: -T data -H "format:csv" -H "column_separator:," -H 'enclose:"' -H "trim_double_quotes:true" -H 'escape:~' http://127.0.0.1:8040/api/test/load_test_2/_stream_load

The row is truncated right after an escape character when it is read:

1,2,3,"{~"title~": ~"abced~", ~"id~": 1, ~"name~": ~
"user1~", ~"email~": ~"user1@example.com~"}"

The quotation mark before user1 is therefore treated as an enclose character instead of an escaped quote.
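
The failure mode described above can be reproduced with a small, self-contained sketch. This is not the code changed by this PR (the Doris CSV reader lives in the C++ backend); every name here (`SplitterState`, `feed`, `split_chunks`, `keep_escape_state`) is hypothetical, and the sketch assumes the escape character applies both inside and outside an enclosed field. It only illustrates why a splitter fed in chunks needs to remember a pending escape across the chunk boundary.

```python
from dataclasses import dataclass, field

# Row from the PR description; enclose is '"' and escape is '~'.
ROW = '1,2,3,"{~"title~": ~"abced~", ~"id~": 1, ~"name~": ~"user1~", ~"email~": ~"user1@example.com~"}"'


@dataclass
class SplitterState:
    """Parser state that must survive between chunks (hypothetical type)."""
    in_enclose: bool = False       # currently inside an enclosed field
    pending_escape: bool = False   # the previous character was an escape
    current: list = field(default_factory=list)  # characters of the field being built
    fields: list = field(default_factory=list)   # completed fields


def feed(state, chunk, sep=",", enclose='"', escape="~", keep_escape_state=True):
    """Consume one chunk of a row, updating the state in place."""
    if not keep_escape_state:
        # Simulates the bug: the pending escape is forgotten at the chunk boundary.
        state.pending_escape = False
    for ch in chunk:
        if state.pending_escape:
            # An escaped character is always taken literally.
            state.current.append(ch)
            state.pending_escape = False
        elif ch == escape:
            state.pending_escape = True
        elif ch == enclose:
            state.in_enclose = not state.in_enclose
        elif ch == sep and not state.in_enclose:
            state.fields.append("".join(state.current))
            state.current.clear()
        else:
            state.current.append(ch)


def split_chunks(chunks, keep_escape_state=True):
    """Split a row delivered as several chunks into its fields."""
    state = SplitterState()
    for chunk in chunks:
        feed(state, chunk, keep_escape_state=keep_escape_state)
    state.fields.append("".join(state.current))  # flush the last field
    return state.fields


# Cut the row right after the escape that precedes "user1, as in the description.
cut = ROW.index('"user1')
chunks = [ROW[:cut], ROW[cut:]]

good = split_chunks(chunks, keep_escape_state=True)
buggy = split_chunks(chunks, keep_escape_state=False)

if __name__ == "__main__":
    print(len(good), good)
    print(len(buggy), buggy)
```

With the escape state carried over, the row splits into four fields and the fourth is the intact JSON text. With the state dropped at the boundary, the quote before user1 closes the enclosed field early and the row splits into five fields.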


@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@liaoxin01
Contributor Author

run buildall

Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 40349 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b917d7b8266092424396fd751ecd813d1a04754b, data reload: false

------ Round 1 ----------------------------------
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	17695	4279	4269	4269
q2	2018	193	194	193
q3	10530	1286	1220	1220
q4	10263	858	857	857
q5	7675	2665	2663	2663
q6	216	135	135	135
q7	1047	627	605	605
q8	9433	2156	2100	2100
q9	9525	6782	6690	6690
q10	9285	3664	3700	3664
q11	456	235	234	234
q12	418	216	216	216
q13	17764	2960	2952	2952
q14	270	213	227	213
q15	510	468	469	468
q16	510	400	394	394
q17	949	717	722	717
q18	8038	7438	7372	7372
q19	2368	1507	1537	1507
q20	671	305	312	305
q21	5121	3301	4165	3301
q22	352	274	282	274
Total cold run time: 115114 ms
Total hot run time: 40349 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	4289	4182	4192	4182
q2	371	279	272	272
q3	2961	2717	2689	2689
q4	1873	1608	1586	1586
q5	5286	5294	5268	5268
q6	212	125	127	125
q7	2295	1890	1874	1874
q8	3179	3352	3322	3322
q9	8494	8490	8435	8435
q10	3892	3681	3696	3681
q11	589	482	482	482
q12	747	599	595	595
q13	16388	2968	2950	2950
q14	304	251	251	251
q15	525	489	474	474
q16	459	395	402	395
q17	1757	1474	1475	1474
q18	7651	7619	7384	7384
q19	1676	1528	1598	1528
q20	1945	1767	1769	1767
q21	4822	4942	4917	4917
q22	584	503	501	501
Total cold run time: 70299 ms
Total hot run time: 54152 ms

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 35.68% (8952/25092)
Line Coverage: 27.27% (73828/270699)
Region Coverage: 26.46% (38140/144146)
Branch Coverage: 23.22% (19429/83684)
Coverage Report: http://coverage.selectdb-in.cc/coverage/b917d7b8266092424396fd751ecd813d1a04754b_b917d7b8266092424396fd751ecd813d1a04754b/report/index.html

@doris-robot

TPC-DS: Total hot run time: 184805 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b917d7b8266092424396fd751ecd813d1a04754b, data reload: false

query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
query1	916	362	341	341
query2	6458	2312	2316	2312
query3	6657	209	216	209
query4	22833	21259	21162	21162
query5	4134	411	410	410
query6	278	188	178	178
query7	4605	295	286	286
query8	247	204	194	194
query9	8707	2337	2321	2321
query10	426	246	271	246
query11	14749	14144	14141	14141
query12	132	93	88	88
query13	1645	369	371	369
query14	10305	7428	6639	6639
query15	250	157	174	157
query16	8182	259	247	247
query17	1881	568	535	535
query18	2106	272	264	264
query19	204	150	149	149
query20	90	84	85	84
query21	194	131	126	126
query22	5080	4810	4795	4795
query23	33913	33620	33236	33236
query24	11754	2899	2841	2841
query25	652	371	356	356
query26	1718	155	148	148
query27	3040	308	319	308
query28	7401	2055	2024	2024
query29	987	595	595	595
query30	290	148	152	148
query31	996	734	717	717
query32	97	52	52	52
query33	749	243	233	233
query34	1079	472	471	471
query35	810	668	657	657
query36	1096	929	897	897
query37	167	67	64	64
query38	3167	3016	2997	2997
query39	1603	1520	1565	1520
query40	280	123	123	123
query41	42	38	39	38
query42	104	94	102	94
query43	592	535	525	525
query44	1221	721	738	721
query45	269	227	260	227
query46	1083	702	727	702
query47	1981	1847	1873	1847
query48	354	288	296	288
query49	1206	401	389	389
query50	775	371	384	371
query51	6706	6643	6571	6571
query52	105	85	87	85
query53	352	279	280	279
query54	307	234	251	234
query55	75	73	73	73
query56	234	217	216	216
query57	1226	1116	1138	1116
query58	223	196	193	193
query59	3361	3307	2994	2994
query60	251	245	229	229
query61	89	87	88	87
query62	660	443	461	443
query63	301	278	279	278
query64	9663	7233	7160	7160
query65	3157	3043	3058	3043
query66	1373	336	333	333
query67	15235	15013	15239	15013
query68	5253	519	523	519
query69	472	297	304	297
query70	1156	1104	1055	1055
query71	426	265	255	255
query72	7440	2703	2469	2469
query73	710	320	319	319
query74	6539	6133	6168	6133
query75	3537	2698	2644	2644
query76	3490	1053	1016	1016
query77	444	269	266	266
query78	11015	10158	10306	10158
query79	7699	528	512	512
query80	1727	449	439	439
query81	536	219	216	216
query82	1302	95	97	95
query83	291	171	171	171
query84	273	90	90	90
query85	2134	321	318	318
query86	489	312	307	307
query87	3298	3110	3091	3091
query88	5060	2396	2297	2297
query89	506	387	374	374
query90	2072	180	181	180
query91	125	95	104	95
query92	59	47	48	47
query93	5922	509	487	487
query94	1226	183	182	182
query95	385	302	294	294
query96	592	264	258	258
query97	3182	2966	2913	2913
query98	245	222	220	220
query99	1256	899	921	899
Total cold run time: 299684 ms
Total hot run time: 184805 ms

Collaborator

@TangSiyang2001 TangSiyang2001 left a comment

LGTM

Contributor

github-actions bot commented May 1, 2024

PR approved by anyone and no changes requested.

Contributor

@dataroaring dataroaring left a comment

LGTM

@dataroaring dataroaring merged commit be4260c into apache:master May 2, 2024
28 of 32 checks passed
@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label May 2, 2024
Contributor

github-actions bot commented May 2, 2024

PR approved by at least one committer and no changes requested.

yiguolei pushed a commit that referenced this pull request May 7, 2024
yiguolei pushed a commit that referenced this pull request May 7, 2024
liaoxin01 added a commit to liaoxin01/doris that referenced this pull request May 8, 2024
ByteYue pushed a commit to ByteYue/doris that referenced this pull request May 15, 2024
mongo360 pushed a commit to mongo360/doris that referenced this pull request Aug 16, 2024
Labels
approved Indicates a PR has been approved by one committer. dev/2.0.10-merged dev/3.0.0-merged p0_b reviewed