
Second implementation of parsing for space-delimited data #36

Open · wants to merge 14 commits into master
Conversation

@Shimuuar (Contributor) commented Mar 8, 2013

Previous discussion is in pull request #30.

Everything works, although encoding of space-delimited data is a bit fragile: it depends on the fact that the space character is always escaped.

There is a performance regression with respect to the current HEAD. Benchmarks show that streaming has around a 1.3× slowdown, and decoding of named CSV has a similar slowdown.

tibbe and others added 9 commits February 28, 2013 20:32
Space-delimited parsing is mostly working, except that trailing
whitespace is not dropped correctly.
It's hard to strip whitespace correctly because
 a) it's a valid part of a field in CSV, so
    "a,b,c " -> ["a","b","c "]
 b) if we're using spaces as the delimiter, we get a spurious empty
    field at the end of the line:
    "a b c " -> ["a","b","c",""]

The only reliable way to strip them is to read the whole line, strip
the spaces, and parse the stripped line.
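The stripping problem described above can be sketched as follows (a hypothetical illustration using Data.ByteString.Char8, not the actual patch code):

```haskell
import qualified Data.ByteString.Char8 as B

-- Naive split on the space delimiter keeps a spurious trailing
-- empty field when the line ends in a space:
naiveSplit :: B.ByteString -> [B.ByteString]
naiveSplit = B.split ' '

-- Stripping trailing spaces from the whole line first, then
-- splitting, avoids the empty field:
strippedSplit :: B.ByteString -> [B.ByteString]
strippedSplit = B.split ' ' . fst . B.spanEnd (== ' ')

main :: IO ()
main = do
  print (naiveSplit    (B.pack "a b c "))  -- ["a","b","c",""]
  print (strippedSplit (B.pack "a b c "))  -- ["a","b","c"]
```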
It's necessary for implementing delimiters which are not a single
character.
Now all tests are passing.
Now it depends on a rather complex data structure and needs to be
specialized by the compiler.
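To illustrate the specialization point (the names here are assumptions for the sketch, not cassava's actual API): a parser that takes an options structure can be marked INLINABLE so GHC can specialize it at call sites where the options are statically known:

```haskell
-- Hypothetical sketch: a field splitter parameterised by an options
-- record (the real patch's data structure is more complex).
data ParseOptions = ParseOptions { optDelim :: Char }

-- INLINABLE exposes the unfolding so GHC can specialise the function
-- for a statically known ParseOptions, removing the record lookup
-- from the inner loop.
{-# INLINABLE splitFields #-}
splitFields :: ParseOptions -> String -> [String]
splitFields opts = go
  where
    d = optDelim opts
    go s = case break (== d) s of
      (field, [])       -> [field]
      (field, _ : rest) -> field : go rest

main :: IO ()
main = print (splitFields (ParseOptions ',') "a,b,c")  -- ["a","b","c"]
```

Without the pragma (or a SPECIALIZE hint), the generic version pays for the indirection through the options record on every field, which is one plausible source of the slowdown discussed below.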
@tibbe (Collaborator) commented Mar 11, 2013

The change looks good from a brief look. I will have to take a deeper look and run the benchmarks later next week (I'm very busy this week, sorry!)

encodeRecord :: Word8 -> Record -> Builder
encodeRecord delim = mconcat . intersperse (fromWord8 delim)
                   . map fromByteString . map escape . V.toList

encodeRecord :: EncodeOptions -> Record -> Builder
Review comment (Collaborator):

I don't see why passing the whole record here is necessary (since we only use the delimiter). It might be slower or it might not, but I'm in favor of not changing things without reason (unless there is a reason I missed).

@tibbe (Collaborator) commented Mar 21, 2013

I took a deeper look and things look fine except for the small comments above.

Could you please post the criterion benchmarks from before and after your change? (You can use cabal bench to run the benchmarks.)

Always escape the delimiter. Now it's passed to the escape function
as a parameter rather than hardcoded as a comma.

Note that correct encoding of space-delimited data depends on the
space character being escaped.
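The escaping rule could look roughly like this (a hedged sketch with RFC 4180-style quoting, not the patch's exact code): a field is quoted whenever it contains the delimiter, a quote, CR, or LF, which is why a space delimiter forces fields containing spaces to be escaped:

```haskell
import qualified Data.ByteString.Char8 as B

-- Hypothetical delimiter-aware escaping: quote the field if it
-- contains the delimiter, a double quote, CR, or LF; embedded
-- quotes are doubled (RFC 4180 style).
escapeWith :: Char -> B.ByteString -> B.ByteString
escapeWith delim field
  | B.any needsQuoting field =
      B.concat [B.pack "\"", B.concatMap doubleQuote field, B.pack "\""]
  | otherwise = field
  where
    needsQuoting c = c == delim || c == '"' || c == '\n' || c == '\r'
    doubleQuote '"' = B.pack "\"\""
    doubleQuote c   = B.singleton c

main :: IO ()
main = do
  print (escapeWith ' ' (B.pack "a b"))  -- quoted: "\"a b\""
  print (escapeWith ',' (B.pack "a b"))  -- unchanged: "a b"
```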
@Shimuuar (Contributor, Author) commented:

I changed EncodeOptions a few times, and when I settled on the simple option I didn't roll back all the changes. I've pushed amended commits.

Here is a table with the benchmark results: the current HEAD, the space-delimited implementation, and the ratio space/HEAD.

                                                       Name       old       new     ratio
1           positional/decode/presidents/without conversion  60.86681  57.76073 0.9489693
2              positional/decode/presidents/with conversion 134.72521 123.18220 0.9143218
3 positional/decode/streaming/presidents/without conversion  98.87020 171.77240 1.7373526
4    positional/decode/streaming/presidents/with conversion 137.70761 191.79104 1.3927410
5              positional/encode/presidents/with conversion 162.29042 180.51010 1.1122659
6                named/decode/presidents/without conversion 282.89621 355.71811 1.2574156
7                   named/decode/presidents/with conversion 213.22024 275.15457 1.2904712
8                   named/encode/presidents/with conversion 264.76437 267.29542 1.0095596
9                                       comparison/lazy-csv 150.31214 147.47115 0.9810994

Streaming suffered badly, and named fields see a performance impact as well.

@tibbe (Collaborator) commented Apr 11, 2013

The reason I haven't looked into this yet is I was hoping to find some time to figure out why the streaming decode got 70% slower.

@Shimuuar (Contributor, Author) commented:

I've merged current master and updated the benchmarks. Streaming performance still suffers, although less.

                                                       name       old      new     ratio
1           positional/decode/presidents/without conversion  59.47160  63.4873 1.0675229
2              positional/decode/presidents/with conversion 107.10150 112.4752 1.0501736
3 positional/decode/streaming/presidents/without conversion  89.97198 126.9854 1.4113888
4    positional/decode/streaming/presidents/with conversion 107.13403 149.0713 1.3914473
5              positional/encode/presidents/with conversion 134.93142 128.5330 0.9525799
6                named/decode/presidents/without conversion 247.20380 289.9606 1.1729619
7                   named/decode/presidents/with conversion 197.89479 246.9330 1.2477996
8                   named/encode/presidents/with conversion 195.77406 193.9283 0.9905722
9                                       comparison/lazy-csv 116.55858 117.6775 1.0095993
