-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-16480: [R] Update read_csv_arrow and open_dataset parse_options, read_options, and convert_options to take lists #15270
Conversation
|
r/tests/testthat/test-csv.R
Outdated
tf <- tempfile() | ||
on.exit(unlink(tf)) | ||
|
||
writeLines('"x"\nNA\nNA\n"NULL"\n\n"foo"\n', tf, ) | ||
readLines(tf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nit: we make a lot of temp files in these tests, when we could just do this in-memory. Don't worry about the rest of the file, but maybe simplify to:
tf <- tempfile() | |
on.exit(unlink(tf)) | |
writeLines('"x"\nNA\nNA\n"NULL"\n\n"foo"\n', tf, ) | |
readLines(tf) | |
tf <- buffer(charToRaw('"x"\nNA\nNA\n"NULL"\n\n"foo"\n')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although, is this the right output? I'm not sure the null values are being parsed correctly or ignore_empty_lines
is working, but skip_rows
does seem to be working.
> library(arrow)
Attaching package: ‘arrow’
The following object is masked from ‘package:utils’:
timestamp
> buf <- buffer(charToRaw('"x"\nNA\nNA\n"NULL"\n\n"foo"\n'))
> tab1 <- read_csv_arrow(
+ buf,
+ convert_options = list(null_values = c("NA", "NULL")),
+ parse_options = list(ignore_empty_lines = FALSE),
+ read_options = list(skip_rows = 1L)
+ )
> tab1
NA
1 NA
2 NULL
3
4 foo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good catch! null_values
requires strings_can_be_null
to be passed in as well to work there, will update. ignore_empty_lines
works as intended though - compare the results with it set to TRUE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the impact of creating temporary files? We could looks to make it more efficient (having a single tempfile we overwrite each time perhaps?) if it's slowing things down, though I would prefer to have them keep using files given that this is the topic of these tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just slowing things down. The C++ tests generally write to buffers, not files, for single file tests. But if people would prefer to keep as files that's fine too.
though I would prefer to have them keep using files given that this is the topic of these tests.
If there is a test that is specifically about the interaction of CSV data and the filesystem, I agree that makes sense. For example, writing a dataset and verifying the directory structure on disk is expected. But most of these don't really care about the filesystem; writing CSV data to a buffer is just as valid as to a temp file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, that makes sense, thanks for explaining. If the difference is significant, we should definitely look into it. Any idea how much impact it has?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idk I can look into it later. Don't worry about it for now.
…as lists in dataset CSV reader
Benchmark runs are scheduled for baseline = 85b167c and contender = f7b18c4. f7b18c4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
No description provided.