ARROW-16480: [R] Update read_csv_arrow and open_dataset parse_options, read_options, and convert_options to take lists #15270

thisisnic · 2023-01-09T12:44:48Z

No description provided.

github-actions · 2023-01-09T12:45:13Z

https://issues.apache.org/jira/browse/ARROW-16480

github-actions · 2023-01-09T12:45:14Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

wjones127 · 2023-01-09T17:44:03Z

r/tests/testthat/test-csv.R

+  tf <- tempfile()
+  on.exit(unlink(tf))
+
+  writeLines('"x"\nNA\nNA\n"NULL"\n\n"foo"\n', tf, )
+  readLines(tf)


Small nit: we make a lot of temp files in these tests, when we could just do this in-memory. Don't worry about the rest of the file, but maybe simplify to:

Suggested change

-  tf <- tempfile()
-  on.exit(unlink(tf))
-
-  writeLines('"x"\nNA\nNA\n"NULL"\n\n"foo"\n', tf, )
-  readLines(tf)
+  tf <- buffer(charToRaw('"x"\nNA\nNA\n"NULL"\n\n"foo"\n'))

Although, is this the right output? I'm not sure the null values are being parsed correctly or ignore_empty_lines is working, but skip_rows does seem to be working.

> library(arrow)

Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

    timestamp

> buf <- buffer(charToRaw('"x"\nNA\nNA\n"NULL"\n\n"foo"\n'))
> tab1 <- read_csv_arrow(
+   buf,
+   convert_options = list(null_values = c("NA", "NULL")),
+   parse_options = list(ignore_empty_lines = FALSE),
+   read_options = list(skip_rows = 1L)
+ )
> tab1
    NA
1   NA
2 NULL
3
4  foo

Ah, good catch! null_values requires strings_can_be_null to be passed in as well to work there, will update. ignore_empty_lines works as intended though - compare the results with it set to TRUE.
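For reference, a minimal sketch of the comparison discussed here, assuming an arrow version that includes this PR's list support. The names `csv_text`, `new_buf`, `tab_list`, `tab_obj`, and `tab_skip_empty` are illustrative and not taken from the PR's tests. It passes the options once as plain lists and once as explicit option objects, with `strings_can_be_null = TRUE` added, and also reads the data with `ignore_empty_lines = TRUE` so the row counts can be compared.

```r
library(arrow)

# Same CSV text as in the review comment; new_buf() is a hypothetical helper
# that returns a fresh in-memory buffer for each read.
csv_text <- '"x"\nNA\nNA\n"NULL"\n\n"foo"\n'
new_buf <- function() buffer(charToRaw(csv_text))

# Options passed as plain lists (the form this PR adds).
tab_list <- read_csv_arrow(
  new_buf(),
  convert_options = list(
    null_values = c("NA", "NULL"),
    strings_can_be_null = TRUE
  ),
  parse_options = list(ignore_empty_lines = FALSE),
  read_options = list(skip_rows = 1L)
)

# The same options passed as explicit option objects.
tab_obj <- read_csv_arrow(
  new_buf(),
  convert_options = CsvConvertOptions$create(
    null_values = c("NA", "NULL"),
    strings_can_be_null = TRUE
  ),
  parse_options = CsvParseOptions$create(ignore_empty_lines = FALSE),
  read_options = CsvReadOptions$create(skip_rows = 1L)
)

# Both forms should produce the same data.
same <- isTRUE(all.equal(as.data.frame(tab_list), as.data.frame(tab_obj)))
print(same)

# For comparison: the same read with empty lines ignored.
tab_skip_empty <- read_csv_arrow(
  new_buf(),
  parse_options = list(ignore_empty_lines = TRUE),
  read_options = list(skip_rows = 1L)
)
print(nrow(tab_list))
print(nrow(tab_skip_empty))
```

Printing `tab_list` next to the output in the comment above should show whether the null values are now being picked up.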

What's the impact of creating temporary files? We could look to make it more efficient (having a single tempfile we overwrite each time, perhaps?) if it's slowing things down, though I would prefer to have them keep using files given that this is the topic of these tests.

Just slowing things down. The C++ tests generally write to buffers, not files, for single file tests. But if people would prefer to keep as files that's fine too.

> though I would prefer to have them keep using files given that this is the topic of these tests.

If there is a test that is specifically about the interaction of CSV data and the filesystem, I agree that makes sense. For example, writing a dataset and verifying the directory structure on disk is expected. But most of these don't really care about the filesystem; writing CSV data to a buffer is just as valid as to a temp file.
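A rough sketch of that point, assuming default reader options; the variable names are illustrative and not from the test suite. The same bytes are read once from a temp file and once from an in-memory buffer, and the two results are compared.

```r
library(arrow)

csv_text <- '"x"\nNA\nNA\n"NULL"\n\n"foo"\n'

# File-based: write the exact bytes to a temp file, read them back, clean up.
tf <- tempfile(fileext = ".csv")
writeBin(charToRaw(csv_text), tf)
from_file <- read_csv_arrow(tf)
unlink(tf)

# In-memory: wrap the same bytes in an Arrow buffer; nothing is written to disk.
from_buffer <- read_csv_arrow(buffer(charToRaw(csv_text)))

# The reader should not care where the bytes came from.
buffer_matches_file <- isTRUE(
  all.equal(as.data.frame(from_file), as.data.frame(from_buffer))
)
print(buffer_matches_file)
```

`writeBin()` is used rather than `writeLines()` so that the file holds exactly the same bytes as the buffer, with no extra trailing newline.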

OK, that makes sense, thanks for explaining. If the difference is significant, we should definitely look into it. Any idea how much impact it has?

idk I can look into it later. Don't worry about it for now.

r/tests/testthat/test-dataset-csv.R

ursabot · 2023-01-11T01:53:19Z

Benchmark runs are scheduled for baseline = 85b167c and contender = f7b18c4. f7b18c4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.66% ⬆️0.06%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.06% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] f7b18c42 ec2-t3-xlarge-us-east-2
[Failed] f7b18c42 test-mac-arm
[Finished] f7b18c42 ursa-i9-9960x
[Finished] f7b18c42 ursa-thinkcentre-m75q
[Finished] 85b167c0 ec2-t3-xlarge-us-east-2
[Failed] 85b167c0 test-mac-arm
[Finished] 85b167c0 ursa-i9-9960x
[Finished] 85b167c0 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Allow passing in options as lists

d1b2a93

github-actions bot added the Component: R label Jan 9, 2023

thisisnic requested a review from wjones127 January 9, 2023 13:39

wjones127 reviewed Jan 9, 2023

thisisnic added 6 commits January 10, 2023 15:00

Remove unnecessary call to readLines

cead917

Add test for open_dataset

a09ade7

Add in strings_can_be_null to make tests clearer

44665d0

Run styler

620289f

Remove hanging empty param

1d64283

Allow read_options, convert_options, and parse_options to be read in as lists in dataset CSV reader

ef02e43

wjones127 reviewed Jan 10, 2023

r/tests/testthat/test-dataset-csv.R (outdated, resolved)

Update r/tests/testthat/test-dataset-csv.R

0301ecc

wjones127 merged commit f7b18c4 into apache:master Jan 10, 2023

raulcd mentioned this pull request Jan 11, 2023

[R] Update read_csv_arrow and open_dataset parse_options, read_options, and convert_options to take lists #31848

Closed

thisisnic mentioned this pull request Jan 12, 2023

[R] CsvConvertOptions include_columns and col_select should give better error message when used in open_dataset #31355

Closed
