
CSV processor #49509

Merged
merged 18 commits into elastic:master from csv-processor on Dec 11, 2019

Conversation

probakowski
Contributor

This change adds a new ingest processor that breaks a line from a CSV file into separate fields.
By default it conforms to RFC 4180, but this can be tweaked.

Closes #49113
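
For readers new to the format, the following is a minimal sketch of RFC 4180-style splitting of a single line. It is not the processor's actual implementation; the `separator`, `quote`, and `trim` parameters are assumed here purely for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative RFC 4180-style splitter for one CSV line (not the processor's code).
public class CsvLineSketch {
    public static List<String> split(String line, char separator, char quote, boolean trim) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == quote) {
                    if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                        current.append(quote); // doubled quote ("") escapes a quote
                        i++;
                    } else {
                        inQuotes = false;      // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == quote && current.length() == 0) {
                inQuotes = true;               // opening quote at field start
            } else if (c == separator) {
                fields.add(trim ? current.toString().trim() : current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(trim ? current.toString().trim() : current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Quoted second field keeps its embedded separator.
        System.out.println(split("a,\"b,c\",d", ',', '"', false)); // [a, b,c, d]
    }
}
```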

@probakowski probakowski added >feature :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v8.0.0 v7.6.0 labels Nov 22, 2019
@probakowski probakowski self-assigned this Nov 22, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Ingest)

@jasontedor
Member

@probakowski Can you add some docs with this pull request? Some REST tests would be nice too.

@probakowski
Contributor Author

probakowski commented Nov 23, 2019

@jasontedor sure, I'll add both on Monday

I've also run a JMH benchmark to compare the version above with https://github.com/johtani/elasticsearch-ingest-csv mentioned in #49113 (lower is better):

Benchmark                 Mode   Score    Error   Units
1Thread johtani           avgt   2.502  ± 0.166  us/op
1Thread probakowski       avgt   1.730  ± 0.123  us/op
8Threads johtani          avgt  24.180  ± 0.563  us/op
8Threads probakowski      avgt   3.104  ± 0.294  us/op

(I've used version 7.4.2 of @johtani's library, which is thread-safe but suffers from synchronization overhead)
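
The benchmark source wasn't posted in this thread; a JMH harness along these general lines would produce a report in roughly the format shown above. The class, method, and input names are hypothetical, and the call to `CsvLineSketch.split` from the sketch earlier is only a stand-in for the real parsers that were compared:

```java
import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@Fork(1)
public class CsvParseBenchmark {
    // Hypothetical input line; the data actually benchmarked wasn't shared in the PR.
    private static final String LINE = "aaa,\"b,bb\",ccc,ddd,eee";

    @Benchmark
    @Threads(1)
    public Object parseSingleThread() {
        return CsvLineSketch.split(LINE, ',', '"', false);
    }

    @Benchmark
    @Threads(8)
    public Object parseEightThreads() {
        return CsvLineSketch.split(LINE, ',', '"', false);
    }
}
```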

@probakowski
Contributor Author

@elasticmachine update branch

Member

@martijnvg martijnvg left a comment


I left two more comments, otherwise LGTM.

@probakowski
Contributor Author

@elasticmachine update branch

Member

@jbaiera jbaiera left a comment


Some small things and a question regarding the trim setting when parsing quoted values.

boolean shouldSetField = true;
for (; currentIndex < length; currentIndex++) {
    c = currentChar();
    if (isWhitespace(c)) {
Member

If the value we are parsing is a quoted string, are the spaces around it always trimmed?

I've taken a look at the RFC but I'm still not sure what the right path here is. It states "Spaces are considered part of a field and should not be ignored." In the implementation we have a trim option, which is a fine extension, but when reading a quoted string the spaces are trimmed regardless of whether the trim option is set to true.

Contributor Author

Spaces around quotes are always trimmed since it's well defined where the data starts and ends. It's a deviation from the RFC, which doesn't allow any spaces before or after the quotes at all, but it's common for CSV files in the wild to have them, so this change makes things easier for users without introducing any ambiguity.
For unquoted fields the situation is different: there's no way to tell automatically where the data starts, so the user must decide whether to trim whitespace or not. That's why it's parametrized.
Now that I think of it, my gut feeling is that most use cases would trim leading/trailing whitespace, so I would even consider defaulting trim to true. What do you think? @martijnvg your opinion would be appreciated as well.
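
To make the unquoted case concrete, here is a rough demo using the illustrative `CsvLineSketch` from above. Note that the sketch does not model the always-trim-around-quotes rule being discussed; it only shows the configurable trim for unquoted fields:

```java
public class TrimBehaviourDemo {
    public static void main(String[] args) {
        // Unquoted field with leading/trailing spaces: kept verbatim unless trim is enabled.
        System.out.println(CsvLineSketch.split("a,  b b  ,c", ',', '"', false)); // [a,   b b  , c]
        System.out.println(CsvLineSketch.split("a,  b b  ,c", ',', '"', true));  // [a, b b, c]
    }
}
```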

Member

I tend to agree with trimming leading/trailing whitespace by default, because it seems more practical to me.

Member

@droberts195 convinced me otherwise: #49509 (comment)

Contributor Author

yep, me too, reverted back to false

@droberts195
Contributor

Thanks for changing the trim functionality 👍

@probakowski
Contributor Author

@elasticmachine update branch

@droberts195
Contributor

Some reasons not to trim CSV by default are:

  1. It's not in the standard
  2. Excel doesn't quote values that start or end with spaces when saving CSV
  3. Google Sheets doesn't quote values that start or end with spaces when saving CSV
  4. macOS Numbers doesn't quote values that start or end with spaces when saving CSV

I agree it's quite common to find non-standard CSV that needs spaces trimming, so it's good to have the option, but it seems more defensible to me to default to the standard and make people with non-standard data customise an option.

@probakowski
Contributor Author

@droberts195 these are valid points, I'll revert the default value for trim back to false

Member

@jbaiera jbaiera left a comment


Good points about the clear start and end for quoted content. LGTM!

@probakowski
Contributor Author

@elasticmachine update branch

@probakowski
Contributor Author

@elasticmachine run elasticsearch-ci/2

@probakowski probakowski merged commit 64e1a77 into elastic:master Dec 11, 2019
@probakowski probakowski deleted the csv-processor branch December 11, 2019 13:52
probakowski added a commit to probakowski/elasticsearch that referenced this pull request Dec 11, 2019
* CSV Processor for Ingest

This change adds new ingest processor that breaks line from CSV file into separate fields.
By default it conforms to RFC 4180 but can be tweaked.

Closes elastic#49113
probakowski added a commit that referenced this pull request Dec 11, 2019
* CSV ingest processor (#49509)

This change adds new ingest processor that breaks line from CSV file into separate fields.
By default it conforms to RFC 4180 but can be tweaked.

Closes #49113
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
* CSV Processor for Ingest

This change adds new ingest processor that breaks line from CSV file into separate fields.
By default it conforms to RFC 4180 but can be tweaked.

Closes elastic#49113
@dschneiter
Contributor

@probakowski What's the intended behavior of this processor for quoted fields containing line breaks? It seems that the processor interprets this as a new message, in contrast to how Excel and Numbers deal with line breaks in quoted strings.

@probakowski
Contributor Author

Hi @dschneiter, this processor handles only a single line from a CSV file, so it doesn't care about line breaks at all (unless you set one as the separator), and it has no notion of a "new message".
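
In other words, a record whose quoted field contains a line break is still a single record, so the whole string, embedded newline included, must already be present in the field the processor reads; the processor will not reassemble a record that arrives split across separate lines or documents. A hypothetical illustration (not code from the processor):

```java
public class MultilineRecordNote {
    public static void main(String[] args) {
        // One CSV record; the quoted second field contains a newline.
        // The entire string below would need to be in the configured field already.
        String record = "1,\"first line\nsecond line\",3";
        System.out.println(record);
    }
}
```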

Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >feature v7.6.0 v8.0.0-alpha1
8 participants