[Bug] Stream Load splits rows incorrectly when importing a CSV with many rows #35954

Open
2 of 3 tasks
heartdance opened this issue Jun 6, 2024 · 10 comments

Comments

@heartdance

Search before asking

  • I had searched in the issues and found no similar issues.

Version

doris-2.1.2-rc04

What's Wrong?

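The data contains line breaks, but the affected fields are wrapped in quotes and enclose: " is set. With a small number of rows there is no error, but once the row count grows, rows are split incorrectly and the load fails with: Reason: actual column number in csv file is less than schema column number: 97, schema column number: 102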

What You Expected?

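The number of rows should not affect CSV parsing. I suspect a buffer problem, or some other place where the data is not parsed correctly.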

How to Reproduce?

No response

Anything Else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@heartdance
Author

heartdance commented Jun 6, 2024

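The CSV-related request headers are set as follows. f4 is base64-encoded, so it contains \r\n:
columns: f1,f2,f3,f4,f5,f6
column_separator: ,
enclose: '
escape: \
The data looks like this. The \r\n shown here are real line-break characters, not escaped sequences: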

1,2,3,'line1\r\nline2\r\nline3\r\n...',5,6
1,2,3,'line1\r\nline2\r\nline3\r\n...',5,6
...

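With the same data, inserting in batches of 10 rows works fine, but inserting in batches of roughly 10,000 rows (about 64 MB in total) fails, and one of the rows gets truncated at lineN.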
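For anyone trying to reproduce this, here is a minimal sketch that generates data of the shape described above. The helper names `make_row` and `make_csv`, the 76-character wrap width, the payload size, and the use of `\n` as the record terminator are all assumptions for illustration; the original data set is not attached to this issue.

```python
import base64
import os


def make_row(payload: bytes, enclose: str = "'") -> str:
    """Build one record whose 4th field is base64 text wrapped with real CRLFs."""
    b64 = base64.b64encode(payload).decode("ascii")
    # Assumption: the base64 text is wrapped every 76 characters with \r\n,
    # as MIME-style encoders do. The issue only says the field contains \r\n.
    wrapped = "\r\n".join(b64[i:i + 76] for i in range(0, len(b64), 76))
    return f"1,2,3,{enclose}{wrapped}{enclose},5,6"


def make_csv(num_rows: int, payload_size: int = 4096) -> str:
    """Join rows with \n (assumed record terminator) into one CSV body."""
    rows = [make_row(os.urandom(payload_size)) for _ in range(num_rows)]
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    # Roughly 10,000 rows, similar in spirit to the failing batch size.
    body = make_csv(10_000)
    with open("repro.csv", "w", newline="") as f:
        f.write(body)
```

The base64 alphabet contains neither the separator `,`, the enclose character `'`, nor the escape character `\`, so the only special characters inside f4 are the embedded line breaks.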

@liaoxin01
Contributor

PR #34364 should fix this issue. Could you try version 2.1.3?

@heartdance
Author

OK, thank you very much. I'll give it a try.

@heartdance
Author

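PR #34364 should fix this issue. Could you try version 2.1.3?

I don't think my problem is the same as the one in that PR. That problem reproduces every time regardless of data volume, while mine depends on the size of the batch file: if I split the large file (about 64 MB) into several smaller files and load those, no error occurs.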
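The workaround of splitting the large file only helps if every smaller file ends on a complete record. A minimal sketch of such a splitter follows, assuming `'` as the enclose character, `\` as the escape character, and `\n` as the record terminator. The function names `iter_records` and `split_batches` are hypothetical, and treating the escape character the same way inside and outside an enclosed field is a simplification, not a statement about how Doris parses.

```python
from typing import Iterator, List


def iter_records(data: str, enclose: str = "'", escape: str = "\\") -> Iterator[str]:
    """Yield complete records; a \n inside an enclosed field is treated as data."""
    start = 0
    in_enclose = False
    i = 0
    while i < len(data):
        ch = data[i]
        if ch == escape and i + 1 < len(data):
            i += 2  # skip the escape character and the character it escapes
            continue
        if ch == enclose:
            in_enclose = not in_enclose
        elif ch == "\n" and not in_enclose:
            yield data[start:i + 1]  # record including its terminator
            start = i + 1
        i += 1
    if start < len(data):
        yield data[start:]  # trailing record without a terminator


def split_batches(data: str, rows_per_batch: int) -> List[str]:
    """Group complete records into batches of at most rows_per_batch rows."""
    batches: List[str] = []
    current: List[str] = []
    for record in iter_records(data):
        current.append(record)
        if len(current) == rows_per_batch:
            batches.append("".join(current))
            current = []
    if current:
        batches.append("".join(current))
    return batches
```

Each returned batch can then be written to its own file and loaded separately, so no batch begins or ends in the middle of an enclosed field.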

@liaoxin01
Contributor

@heartdance Does it still fail on the new version? That bug is easily triggered by large files.

@heartdance
Author

@heartdance Does it still fail on the new version? That bug is easily triggered by large files.

Yes, the same problem still occurs.

@heartdance
Author

I checked the BE log and found the following errors:

stream_load.cpp:349] append body content failed, errmsg=[INTERNAL_ERROR]cancelled: closed, id=...
stream_load_executor.cpp:100] fragment execute failed, err_msg=[DATA_QUALITY_ERROR]too many filtered rows, id=...

@ixzc
Contributor

ixzc commented Jun 6, 2024

please add my wechat: Faith_xzc

@mark-triker

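I wonder whether Doris splits the file during loading so that it can load in parallel, and the split does not take the enclose character into account. I have run into a similar problem as well.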
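To illustrate the kind of failure this hypothesis would produce, the sketch below compares a splitter that breaks on every `\n` with a parser that honors the enclose character. It uses Python's standard `csv` module purely as a stand-in and does not reflect Doris's actual implementation. The sample record and the function names `naive_rows` and `enclose_aware_rows` are made up for illustration.

```python
import csv
import io

# Three records with 6 columns each; the 4th field contains real \r\n line breaks.
record = "1,2,3,'line1\r\nline2\r\nline3',5,6\n"
data = record * 3


def naive_rows(text: str):
    """Split on every \n and ignore the enclose character entirely."""
    return [line.rstrip("\r").split(",") for line in text.split("\n") if line]


def enclose_aware_rows(text: str):
    """Parse with quotechar set, so line breaks inside '...' stay in the field."""
    reader = csv.reader(
        io.StringIO(text, newline=""),  # newline="" keeps embedded \r\n intact
        delimiter=",",
        quotechar="'",
        escapechar="\\",
    )
    return list(reader)


if __name__ == "__main__":
    print([len(r) for r in naive_rows(data)])
    print([len(r) for r in enclose_aware_rows(data)])
```

With the naive splitter every fragment has fewer than 6 columns, which is the same kind of mismatch as the "actual column number in csv file is less than schema column number" error, while the enclose-aware parser returns 6 columns for each record.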

@liaoxin01
Contributor

liaoxin01 commented Aug 8, 2024

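There is indeed a bug that can cause incorrect splitting when the data inside the enclose characters contains line breaks. PR #38347 fixed it recently.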

4 participants