Second implementation of parsing for space-delimited data #36
base: master
Space-delimited parsing is mostly working, except that trailing whitespace is not dropped correctly.
It's hard to strip whitespace correctly because (a) it is a valid part of a field in CSV, so "a,b,c " -> ["a","b","c "], and (b) if we use spaces as the delimiter, we get a spurious empty field at the end of the line: "a b c " -> ["a","b","c",""]. The only reliable way to strip it is to read the whole line, strip the trailing spaces, and parse the stripped line.
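The two cases can be illustrated with a minimal sketch. It uses plain `String` and only `base`, not the library's actual ByteString parser; `splitOn` and `splitSpaceDelimited` are hypothetical helper names introduced here for illustration only.

```haskell
import Data.List (dropWhileEnd)

-- | Naive split on a single delimiter character, keeping empty fields.
-- Hypothetical helper; it ignores quoting and is not the library's parser.
splitOn :: Char -> String -> [String]
splitOn d s = case break (== d) s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitOn d rest

-- | Strip trailing spaces from the whole line first, then split on spaces.
-- Assumption: fields are separated by exactly one space and the line has
-- no leading spaces.
splitSpaceDelimited :: String -> [String]
splitSpaceDelimited = splitOn ' ' . dropWhileEnd (== ' ')
```

With these definitions, `splitOn ',' "a,b,c "` keeps the trailing space inside the last field, `splitOn ' ' "a b c "` produces the spurious empty field, and `splitSpaceDelimited "a b c "` yields `["a","b","c"]`.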
It's necessary for implementing delimiters which are not a single character
Now all tests are passing.
Now it depends on a rather complex data structure and needs to be specialized by the compiler
The change looks good from a brief look. I will have to take a deeper look and run the benchmarks later next week (I'm very busy this week, sorry!)
Data/Csv/Encoding.hs
encodeRecord :: Word8 -> Record -> Builder
encodeRecord delim = mconcat . intersperse (fromWord8 delim)
                     . map fromByteString . map escape . V.toList
encodeRecord :: EncodeOptions -> Record -> Builder
I don't see why passing the whole record here is necessary (since we only use the delimiter). It might be slower or it might not, but I'm in favor of not changing things without reason (unless there is a reason I missed).
I took a deeper look and things look fine except for the small comments above. Could you please post the criterion benchmarks for before and after your change (you can use
Always escape the delimiter. It is now passed to the escape function as a parameter rather than hardcoded as a comma. Note that correct encoding of space-delimited data depends on escaping the space character.
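A rough sketch of the idea, assuming that "escaping" means wrapping a field in double quotes when it contains the delimiter, a quote, or a line break. It works on `String` for brevity; `escapeWith` and `encodeRecordWith` are hypothetical names, not the functions in Data/Csv/Encoding.hs.

```haskell
import Data.List (intercalate)

-- | Quote a field if it contains the delimiter, a double quote, or a line
-- break; embedded double quotes are doubled. Hypothetical helper.
escapeWith :: Char -> String -> String
escapeWith delim field
  | any needsQuoting field = "\"" ++ concatMap double field ++ "\""
  | otherwise              = field
  where
    needsQuoting c = c == delim || c == '"' || c == '\n' || c == '\r'
    double '"' = "\"\""
    double c   = [c]

-- | Join escaped fields with the delimiter. Hypothetical helper.
encodeRecordWith :: Char -> [String] -> String
encodeRecordWith delim = intercalate [delim] . map (escapeWith delim)
```

With a space delimiter, the field `a b` is quoted, so the record `["a b","c"]` encodes to `"a b" c` and still splits into two fields on decoding. With a comma delimiter the same record encodes to `a b,c` without quotes.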
All tests are passing
I changed EncodeOptions a few times, and when I settled on the simple option I didn't roll back all the changes. I've pushed amended commits. Here is a table with benchmark results: current HEAD, the space-delimited implementation, and the ratio space/HEAD
Streaming suffered badly and named fields see performance impact as well,
The reason I haven't looked into this yet is I was hoping to find some time to figure out why the streaming decode got 70% slower.
I've merged current master and updated the benchmarks. Streaming performance still suffers, although less.
The previous discussion is in pull request #30.
Everything works, although encoding of space-delimited data is a bit fragile and depends on the fact that the space is always escaped.
There is a performance regression with respect to current HEAD. Benchmarks show that streaming has a slowdown of around 1.3x, and decoding of named CSV has a similar slowdown.