Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for whitespace-delimited input to readtable() #443

Merged
merged 3 commits into from
Dec 13, 2013

Conversation

davidavdav
Copy link
Contributor

This patch makes the option separator=' ' to readtable() behave like the input is whitespace delimited. Whitespace-delimited text files are very common in unix processing, they're an intermediate between human readable and machine readable representations. I would want to make the case that "a b" and "a b" should be treated similarly, I think it is generally a bad idea to add semantics to varying amounts of whitespace (unlike the designers of fortran and python...).

Fields are separated by any combinations of ' ' and '\t'. '\n' and '\r' still are line separators.

With skip_white::Bool we keep track of leading spaces, trailing spaces are covered by inspection of nextchr. Intermediate spaces are covered by inspecting nextchr, and simply not storing anytything if nextchr and chr are both either ' ' or '\t'.

@Keno
Copy link
Contributor

Keno commented Dec 12, 2013

👍

@davidavdav
Copy link
Contributor Author

I just noticed a difference in behaviour of in() between julia v0.2.0 and v0.3.0-prerelease. In v0.2.0

in(uint8(' '), " asd")

says false whereas in v0.3.0-prerelease it says true. The patch assumes the latter. Could be fixed by wrapping "nextchr" in the in() call by char().

@johnmyleswhite
Copy link
Contributor

Does this skip whitespace? Or does it include it in the field you're parsing?

It would be really good to have a test file with string data, which is where this is most likely to break.

My only concern is to do a little bit of benchmarking. I don't see any way to make your code faster, so we're going to take some (hopefully very small) performance hit to support this functionality. I'd only like to know how much of a hit we're taking.

@davidavdav
Copy link
Contributor Author

Hello,

On Thu, Dec 12, 2013 at 5:08 PM, John Myles White
notifications@git.luolix.topwrote:

Does this skip whitespace? Or does it include it in the field you're
parsing?

It simply skips whitespace (not storing it anywhere in the data structure)
if

  • the following character (nextchr) is a space or tab, or
  • it is at the start of a line

It would be really good to have a test file with string data, which is
where this is most likely to break.

Agreed, I didn't test that. Of course, if your data has strings containing
whitespace, a whitespace delimited file is an awkward choice. But it
should work, anyway. I'll check.

My only concern is to do a little bit of benchmarking. I don't see any way
to make your code faster, so we're going to take some (hopefully very
small) performance hit to support this functionality. I'd only like to know
how much of a hit we're taking.

Sure. For separator != ' ' is it one extra character compare per character
read. This might be taken out of the loop, if testing for a boolean is
faster. I wouldn't know---needs to be tested.

---david


Reply to this email directly or view it on GitHubhttps://github.com//pull/443#issuecomment-30435614
.

David van Leeuwen

@davidavdav
Copy link
Contributor Author

I inspired the test for quoted whitespace on the test for quoted commas, and threw some leading / trailing whitespace into the bargain. It seems to work. I also moved the original test out of the list, since the filename doesn't suggest separator=' '. Most whitespace delimited files I know are called ".txt". But that may be a personal convention...

---david

@johnmyleswhite
Copy link
Contributor

Let's merge this, then see if it there's any tricks to get better performance.

johnmyleswhite added a commit that referenced this pull request Dec 13, 2013
Added support for whitespace-delimited input to readtable()
@johnmyleswhite johnmyleswhite merged commit afb0d61 into JuliaData:master Dec 13, 2013
@davidavdav
Copy link
Contributor Author

Thanks, did you measure any performance degradation by the introduction of the extra comparison? It is trivial to get the "o.separator == ' '" out of the loop, but there would still be a conditional.

@johnmyleswhite
Copy link
Contributor

I didn't have a chance to measure this yet. I'm just acutely aware of how hard we have to fight for performance in this reader: even adding a second NA symbol makes us slower than R.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants