-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added support for whitespace-delimited input to readtable() #443
Conversation
👍 |
I just noticed a difference in behaviour of in() between julia v0.2.0 and v0.3.0-prerelease. In v0.2.0 in(uint8(' '), " asd") says false whereas in v0.3.0-prerelease it says true. The patch assumes the latter. Could be fixed by wrapping "nextchr" in the in() call by char(). |
Does this skip whitespace? Or does it include it in the field you're parsing? It would be really good to have a test file with string data, which is where this is most likely to break. My only concern is to do a little bit of benchmarking. I don't see any way to make your code faster, so we're going to take some (hopefully very small) performance hit to support this functionality. I'd only like to know how much of a hit we're taking. |
Hello, On Thu, Dec 12, 2013 at 5:08 PM, John Myles White
---david
David van Leeuwen |
I inspired the test for quoted whitespace on the test for quoted commas, and threw some leading / trailing whitespace into the bargain. It seems to work. I also moved the original test out of the list, since the filename doesn't suggest separator=' '. Most whitespace delimited files I know are called ".txt". But that may be a personal convention... ---david |
Let's merge this, then see if it there's any tricks to get better performance. |
Added support for whitespace-delimited input to readtable()
Thanks, did you measure any performance degradation by the introduction of the extra comparison? It is trivial to get the "o.separator == ' '" out of the loop, but there would still be a conditional. |
I didn't have a chance to measure this yet. I'm just acutely aware of how hard we have to fight for performance in this reader: even adding a second |
This patch makes the option separator=' ' to readtable() behave like the input is whitespace delimited. Whitespace-delimited text files are very common in unix processing, they're an intermediate between human readable and machine readable representations. I would want to make the case that "a b" and "a b" should be treated similarly, I think it is generally a bad idea to add semantics to varying amounts of whitespace (unlike the designers of fortran and python...).
Fields are separated by any combinations of ' ' and '\t'. '\n' and '\r' still are line separators.
With skip_white::Bool we keep track of leading spaces, trailing spaces are covered by inspection of nextchr. Intermediate spaces are covered by inspecting nextchr, and simply not storing anytything if nextchr and chr are both either ' ' or '\t'.