[fix](csv reader) fix csv parser incorrect if enclosing line_delimiter #38347

Merged: 1 commit, Jul 26, 2024

@sollhui sollhui commented Jul 25, 2024

The CSV reader parses data incorrectly when an enclosed field contains the line_delimiter. For example, with line_delimiter set to \n and enclose set to ', given the following data:

'aaaaaaaaaaaa
bbbb'

the reader parses it as two columns, 'aaaaaaaaaaaa and bbbb', rather than as one column:

'aaaaaaaaaaaa
bbbb'

This happens because the CSV reader does not reset its result when the closing enclose character is not matched within the current output_buf_read, so the data is truncated at the wrong position.
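
As a rough illustration of the expected behaviour described above, here is a minimal, self-contained C++ sketch of an enclose-aware line splitter. It is not the Doris implementation: the function name `split_lines` and its signature are hypothetical, it ignores escape characters, and it works on a complete in-memory string rather than on streamed buffers.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical toy helper, not Doris code.
// Splits `data` into lines on `line_delim`, but treats a delimiter that
// appears between a pair of `enclose` characters as ordinary data.
std::vector<std::string> split_lines(std::string_view data, char line_delim, char enclose) {
    std::vector<std::string> lines;
    std::string current;
    bool in_enclose = false;  // true while we are between two enclose characters
    for (char c : data) {
        if (c == enclose) {
            in_enclose = !in_enclose;
            current.push_back(c);
        } else if (c == line_delim && !in_enclose) {
            // A delimiter outside any enclosure ends the current line.
            lines.push_back(current);
            current.clear();
        } else {
            // Ordinary data, including delimiters inside an enclosure.
            current.push_back(c);
        }
    }
    if (!current.empty()) {
        lines.push_back(current);  // trailing line without a final delimiter
    }
    return lines;
}
```

With line_delimiter `\n` and enclose `'`, calling `split_lines` on the sample data from this description yields a single line that still contains the embedded `\n`, which is the behaviour this PR restores.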

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

sollhui commented Jul 25, 2024

run buildall

clang-tidy review says "All clean, LGTM! 👍"

@@ -160,6 +160,7 @@ void EncloseCsvLineReaderContext::_on_pre_match_enclose(const uint8_t* start, si
         if (_idx != _total_len) {
             len = update_reading_bound(start);
         } else {
+            _result = nullptr;

Please add a comment.

dataroaring previously approved these changes Jul 25, 2024

LGTM

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jul 25, 2024
PR approved by anyone and no changes requested.

@sollhui sollhui force-pushed the fix_csv_reader branch 2 times, most recently from cfd1903 to 409e0b5 Compare July 25, 2024 13:30
Co-authored-by: Xin Liao <liaoxinbit@126.com>

@doris-robot

TPC-H: Total hot run time: 39273 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit d8b9acc8290b63f3dce97540e38c24c20a104dfb, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17636	4373	4298	4298
q2	2008	197	193	193
q3	10436	1159	1000	1000
q4	10143	702	682	682
q5	7584	2768	2661	2661
q6	216	140	145	140
q7	953	591	593	591
q8	9217	1908	1917	1908
q9	8796	6571	6574	6571
q10	8901	3800	3749	3749
q11	455	242	240	240
q12	571	234	223	223
q13	17772	2978	2974	2974
q14	287	238	242	238
q15	526	494	490	490
q16	509	399	388	388
q17	972	738	704	704
q18	8099	7433	7383	7383
q19	6032	1081	1083	1081
q20	655	351	336	336
q21	4994	3136	3177	3136
q22	350	292	287	287
Total cold run time: 117112 ms
Total hot run time: 39273 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	4553	4269	4286	4269
q2	363	272	269	269
q3	2993	2822	2884	2822
q4	2024	1712	1694	1694
q5	5640	5559	5540	5540
q6	230	137	131	131
q7	2166	1812	1883	1812
q8	3297	3402	3386	3386
q9	8789	8788	8841	8788
q10	4087	3897	3752	3752
q11	577	475	494	475
q12	781	636	658	636
q13	17069	3174	3143	3143
q14	330	296	274	274
q15	543	495	514	495
q16	497	421	422	421
q17	1853	1529	1505	1505
q18	8140	8288	8197	8197
q19	1700	1504	1605	1504
q20	2074	1874	1858	1858
q21	5102	4924	4752	4752
q22	581	510	497	497
Total cold run time: 73389 ms
Total hot run time: 56220 ms

@doris-robot

TPC-DS: Total hot run time: 173216 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit d8b9acc8290b63f3dce97540e38c24c20a104dfb, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query1	905	369	360	360
query2	6436	1874	1812	1812
query3	6634	205	217	205
query4	23135	17273	17474	17273
query5	3623	480	470	470
query6	272	173	183	173
query7	4580	287	283	283
query8	261	209	195	195
query9	8736	2440	2401	2401
query10	425	269	267	267
query11	11986	9928	10043	9928
query12	118	85	87	85
query13	1646	377	372	372
query14	9511	7532	7624	7532
query15	260	171	168	168
query16	7837	483	530	483
query17	1610	562	540	540
query18	1879	291	285	285
query19	198	144	146	144
query20	96	87	92	87
query21	208	101	100	100
query22	4230	4107	3877	3877
query23	34040	33619	33593	33593
query24	10918	2973	2861	2861
query25	619	399	411	399
query26	704	156	157	156
query27	2164	286	288	286
query28	6189	2089	2068	2068
query29	788	442	452	442
query30	262	160	159	159
query31	978	753	766	753
query32	101	56	58	56
query33	749	346	338	338
query34	889	490	505	490
query35	864	761	767	761
query36	1123	1012	958	958
query37	142	93	85	85
query38	2882	2892	2821	2821
query39	873	858	873	858
query40	304	120	113	113
query41	48	43	44	43
query42	118	101	95	95
query43	493	462	459	459
query44	1157	724	725	724
query45	207	174	173	173
query46	1085	719	742	719
query47	1855	1731	1752	1731
query48	366	285	299	285
query49	828	407	434	407
query50	788	397	397	397
query51	6751	6687	6609	6609
query52	109	89	85	85
query53	256	180	181	180
query54	857	443	429	429
query55	75	72	70	70
query56	302	269	272	269
query57	1101	1025	1063	1025
query58	264	267	254	254
query59	2868	2683	2695	2683
query60	296	274	284	274
query61	117	94	96	94
query62	816	638	649	638
query63	201	173	186	173
query64	9145	2248	1681	1681
query65	3127	3117	3101	3101
query66	751	329	323	323
query67	15471	15040	14898	14898
query68	8495	571	563	563
query69	747	411	314	314
query70	1200	1026	1064	1026
query71	536	275	268	268
query72	9095	5936	5843	5843
query73	1407	324	327	324
query74	6169	5629	5599	5599
query75	5110	2659	2679	2659
query76	5024	980	988	980
query77	786	305	293	293
query78	9674	9245	10245	9245
query79	9697	543	522	522
query80	943	490	485	485
query81	589	220	218	218
query82	753	136	131	131
query83	334	171	173	171
query84	268	79	75	75
query85	1232	300	358	300
query86	394	302	298	298
query87	3212	3116	3051	3051
query88	4733	2369	2396	2369
query89	513	279	279	279
query90	2083	187	187	187
query91	125	97	99	97
query92	65	52	52	52
query93	5927	547	546	546
query94	970	296	292	292
query95	361	257	266	257
query96	613	268	272	268
query97	3186	3061	3013	3013
query98	213	202	200	200
query99	1593	1235	1253	1235
Total cold run time: 294674 ms
Total hot run time: 173216 ms

@doris-robot

ClickBench: Total hot run time: 30.82 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit d8b9acc8290b63f3dce97540e38c24c20a104dfb, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.04	0.04	0.04
query2	0.08	0.04	0.04
query3	0.22	0.04	0.05
query4	1.68	0.08	0.08
query5	0.50	0.52	0.49
query6	1.12	0.73	0.72
query7	0.02	0.02	0.02
query8	0.05	0.04	0.04
query9	0.54	0.47	0.49
query10	0.55	0.56	0.54
query11	0.16	0.11	0.11
query12	0.15	0.12	0.12
query13	0.59	0.58	0.59
query14	0.76	0.79	0.76
query15	0.85	0.80	0.80
query16	0.36	0.37	0.36
query17	0.96	1.03	1.04
query18	0.23	0.23	0.22
query19	1.90	1.74	1.76
query20	0.01	0.01	0.01
query21	15.43	0.76	0.64
query22	4.10	7.13	2.08
query23	18.33	1.42	1.36
query24	2.05	0.24	0.23
query25	0.16	0.09	0.08
query26	0.29	0.20	0.20
query27	0.46	0.23	0.24
query28	13.28	1.01	0.99
query29	12.61	3.37	3.29
query30	0.26	0.06	0.05
query31	2.88	0.40	0.39
query32	3.27	0.47	0.48
query33	2.92	2.88	2.88
query34	16.91	4.36	4.32
query35	4.41	4.43	4.38
query36	0.65	0.47	0.48
query37	0.19	0.15	0.16
query38	0.14	0.14	0.14
query39	0.04	0.04	0.03
query40	0.16	0.12	0.13
query41	0.09	0.04	0.04
query42	0.05	0.04	0.04
query43	0.04	0.04	0.03
Total cold run time: 109.49 s
Total hot run time: 30.82 s

@liaoxin01 liaoxin01 left a comment

LGTM

@dataroaring dataroaring left a comment

LGTM

@dataroaring dataroaring merged commit be3a906 into apache:master Jul 26, 2024
27 of 30 checks passed
liaoxin01 added a commit that referenced this pull request Jul 29, 2024
#38347) (#38446)

liaoxin01 added a commit that referenced this pull request Jul 29, 2024
#38347) (#38445)

dataroaring pushed a commit that referenced this pull request Jul 29, 2024
#38347)

@bludwujiang

Does anybody know how to strip the enclose symbol from the column value itself when using Stream Load with CSV?

sollhui commented Dec 6, 2024

> Does anybody know how to strip the enclose symbol from the column value itself when using Stream Load with CSV?

Can you give an example?

mongo360 pushed a commit to mongo360/doris that referenced this pull request Dec 11, 2024
apache#38347) (apache#38446)
