Added support for whitespace-delimited input to readtable() #443

davidavdav · 2013-12-12T07:47:52Z

This patch makes the option separator=' ' to readtable() behave like the input is whitespace delimited. Whitespace-delimited text files are very common in unix processing, they're an intermediate between human readable and machine readable representations. I would want to make the case that "a b" and "a b" should be treated similarly, I think it is generally a bad idea to add semantics to varying amounts of whitespace (unlike the designers of fortran and python...).

Fields are separated by any combinations of ' ' and '\t'. '\n' and '\r' still are line separators.

With skip_white::Bool we keep track of leading spaces, trailing spaces are covered by inspection of nextchr. Intermediate spaces are covered by inspecting nextchr, and simply not storing anytything if nextchr and chr are both either ' ' or '\t'.

Keno · 2013-12-12T07:50:41Z

👍

davidavdav · 2013-12-12T11:40:02Z

I just noticed a difference in behaviour of in() between julia v0.2.0 and v0.3.0-prerelease. In v0.2.0

in(uint8(' '), " asd")

says false whereas in v0.3.0-prerelease it says true. The patch assumes the latter. Could be fixed by wrapping "nextchr" in the in() call by char().

johnmyleswhite · 2013-12-12T16:08:00Z

Does this skip whitespace? Or does it include it in the field you're parsing?

It would be really good to have a test file with string data, which is where this is most likely to break.

My only concern is to do a little bit of benchmarking. I don't see any way to make your code faster, so we're going to take some (hopefully very small) performance hit to support this functionality. I'd only like to know how much of a hit we're taking.

davidavdav · 2013-12-12T18:29:53Z

Hello,

On Thu, Dec 12, 2013 at 5:08 PM, John Myles White
notifications@git.luolix.topwrote:

Does this skip whitespace? Or does it include it in the field you're
parsing?

It simply skips whitespace (not storing it anywhere in the data structure)
if

the following character (nextchr) is a space or tab, or

it is at the start of a line

It would be really good to have a test file with string data, which is
where this is most likely to break.

Agreed, I didn't test that. Of course, if your data has strings containing
whitespace, a whitespace delimited file is an awkward choice. But it
should work, anyway. I'll check.

My only concern is to do a little bit of benchmarking. I don't see any way
to make your code faster, so we're going to take some (hopefully very
small) performance hit to support this functionality. I'd only like to know
how much of a hit we're taking.

Sure. For separator != ' ' is it one extra character compare per character
read. This might be taken out of the loop, if testing for a boolean is
faster. I wouldn't know---needs to be tested.

---david

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/443#issuecomment-30435614
.

David van Leeuwen

davidavdav · 2013-12-12T18:42:01Z

I inspired the test for quoted whitespace on the test for quoted commas, and threw some leading / trailing whitespace into the bargain. It seems to work. I also moved the original test out of the list, since the filename doesn't suggest separator=' '. Most whitespace delimited files I know are called ".txt". But that may be a personal convention...

---david

johnmyleswhite · 2013-12-13T05:51:59Z

Let's merge this, then see if it there's any tricks to get better performance.

Added support for whitespace-delimited input to readtable()

davidavdav · 2013-12-13T08:27:06Z

Thanks, did you measure any performance degradation by the introduction of the extra comparison? It is trivial to get the "o.separator == ' '" out of the loop, but there would still be a conditional.

johnmyleswhite · 2013-12-14T17:48:24Z

I didn't have a chance to measure this yet. I'm just acutely aware of how hard we have to fight for performance in this reader: even adding a second NA symbol makes us slower than R.

Added support for whitespace-delimited input to readtable()

be28331

char(nextchr) for julia.in() v0.2.0 compatibility

622d331

Added test for quoted whitespace (works)

591c130

johnmyleswhite added a commit that referenced this pull request Dec 13, 2013

Merge pull request #443 from davidavdav/master

afb0d61

Added support for whitespace-delimited input to readtable()

johnmyleswhite merged commit afb0d61 into JuliaData:master Dec 13, 2013

garborg mentioned this pull request Mar 11, 2014

multiple spaces delimiter #396

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added support for whitespace-delimited input to readtable() #443

Added support for whitespace-delimited input to readtable() #443

davidavdav commented Dec 12, 2013

Keno commented Dec 12, 2013

davidavdav commented Dec 12, 2013

johnmyleswhite commented Dec 12, 2013

davidavdav commented Dec 12, 2013

davidavdav commented Dec 12, 2013

johnmyleswhite commented Dec 13, 2013

davidavdav commented Dec 13, 2013

johnmyleswhite commented Dec 14, 2013

Added support for whitespace-delimited input to readtable() #443

Added support for whitespace-delimited input to readtable() #443

Conversation

davidavdav commented Dec 12, 2013

Keno commented Dec 12, 2013

davidavdav commented Dec 12, 2013

johnmyleswhite commented Dec 12, 2013

davidavdav commented Dec 12, 2013

davidavdav commented Dec 12, 2013

johnmyleswhite commented Dec 13, 2013

davidavdav commented Dec 13, 2013

johnmyleswhite commented Dec 14, 2013