-
Notifications
You must be signed in to change notification settings - Fork 24.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CSV processor #49509
CSV processor #49509
Conversation
This change adds new ingest processor that breaks line from CSV file into separate fields. By default it conforms to RFC 4180 but can be tweaked. Closes elastic#49113
Pinging @elastic/es-core-features (:Core/Features/Ingest) |
@probakowski Can you add some docs with this pull request? Some REST tests would be nice too. |
@jasontedor sure, I'll add both on Monday I've also run JMH benchmark to compare version above with https://github.com/johtani/elasticsearch-ingest-csv mentioned in #49113 (the less the better)
(I've used version 7.4.2 of @johtani library, which is thread safe but suffers from synchronization) |
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvProcessor.java
Outdated
Show resolved
Hide resolved
@elasticmachine update branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left two more comments, otherwise LGTM.
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvProcessor.java
Outdated
Show resolved
Hide resolved
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvProcessor.java
Outdated
Show resolved
Hide resolved
@elasticmachine update branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small things and a question regarding the trim
setting when parsing quoted values.
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvParser.java
Outdated
Show resolved
Hide resolved
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvParser.java
Outdated
Show resolved
Hide resolved
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvParser.java
Outdated
Show resolved
Hide resolved
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvParser.java
Outdated
Show resolved
Hide resolved
boolean shouldSetField = true; | ||
for (; currentIndex < length; currentIndex++) { | ||
c = currentChar(); | ||
if (isWhitespace(c)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the value we are parsing is a quoted string, are the spaces around it always trimmed?
I've taken a look at the RFC but I'm still not sure what the right path here is. It states "Spaces are considered part of a field and should not be ignored." In the implementation we have a trim
option which is a fine extension, but if reading a quoted string, the spaces are trimmed regardless of if the trim
option is set to true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spaces around quotes are always trimmed since it's well defined where start and end of data is. It's deviation from RFC, where no spaces are allowed both before and after quotes at all. But it's common for CSV files in the wild to have them, so this change make it easier for a user with no ambiguity introduced.
For unquoted fields situation is different, there's no way to tell automatically where the data starts so a user must decide here if he wants to trim whitespaces or not. That's why it's parametrized.
Now when I think of it, my gut feeling is most use cases would trim leading/trailing whitespaces so I would even consider default trim to true. What do you think? @martijnvg you opinion would be appreciated as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to agree with trimming leading/trailing whitespaces by default, because it seems more practical to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@droberts195 convinced me otherwise: #49509 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, me too, reverted back to false
…ommon/CsvParser.java Co-Authored-By: James Baiera <james.baiera@gmail.com>
Thanks for changing the trim functionality 👍 |
@elasticmachine update branch |
Some reasons not to trim CSV by default are:
I agree it's quite common to find non-standard CSV that needs spaces trimming, so it's good to have the option, but it seems more defensible to me to default to the standard and make people with non-standard data customise an option. |
@droberts195 these are valid points, I'll revert default value for trim back to false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good points about the clear start and end for quoted content. LGTM!
@elasticmachine update branch |
@elasticmachine run elasticsearch-ci/2 |
* CSV Processor for Ingest This change adds new ingest processor that breaks line from CSV file into separate fields. By default it conforms to RFC 4180 but can be tweaked. Closes elastic#49113
* CSV Processor for Ingest This change adds new ingest processor that breaks line from CSV file into separate fields. By default it conforms to RFC 4180 but can be tweaked. Closes elastic#49113
@probakowski What's the intended behavior of this processor for quoted fields containing line-breaks? It seems that the processor is interpreting this as a new message, which is in contrast to how Excel and Numbers are dealing with line breaks in quoted strings. |
Hi @dschneiter, this processor handles only single line from CSV so it doesn't care for line breaks at all (unless you set it as separator) and it doesn't have any notion of "new message" |
This change adds new ingest processor that breaks line from CSV file into separate fields.
By default it conforms to RFC 4180 but can be tweaked.
Closes #49113