Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-16480: [R] Update read_csv_arrow and open_dataset parse_options, read_options, and convert_options to take lists #15270

Merged
merged 8 commits into from
Jan 10, 2023

Conversation

thisisnic
Copy link
Member

No description provided.

@github-actions
Copy link

github-actions bot commented Jan 9, 2023

@github-actions
Copy link

github-actions bot commented Jan 9, 2023

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

Comment on lines 669 to 673
tf <- tempfile()
on.exit(unlink(tf))

writeLines('"x"\nNA\nNA\n"NULL"\n\n"foo"\n', tf, )
readLines(tf)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit: we make a lot of temp files in these tests, when we could just do this in-memory. Don't worry about the rest of the file, but maybe simplify to:

Suggested change
tf <- tempfile()
on.exit(unlink(tf))
writeLines('"x"\nNA\nNA\n"NULL"\n\n"foo"\n', tf, )
readLines(tf)
tf <- buffer(charToRaw('"x"\nNA\nNA\n"NULL"\n\n"foo"\n'))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although, is this the right output? I'm not sure the null values are being parsed correctly or ignore_empty_lines is working, but skip_rows does seem to be working.

> library(arrow)

Attaching package:arrowThe following object is masked frompackage:utils:

    timestamp

> buf <- buffer(charToRaw('"x"\nNA\nNA\n"NULL"\n\n"foo"\n'))
> tab1 <- read_csv_arrow(
+     buf,
+     convert_options = list(null_values = c("NA", "NULL")),
+     parse_options = list(ignore_empty_lines = FALSE),
+     read_options = list(skip_rows = 1L)
+   )
> tab1
    NA
1   NA
2 NULL
3     
4  foo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch! null_values requires strings_can_be_null to be passed in as well to work there, will update. ignore_empty_lines works as intended though - compare the results with it set to TRUE.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the impact of creating temporary files? We could looks to make it more efficient (having a single tempfile we overwrite each time perhaps?) if it's slowing things down, though I would prefer to have them keep using files given that this is the topic of these tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just slowing things down. The C++ tests generally write to buffers, not files, for single file tests. But if people would prefer to keep as files that's fine too.

though I would prefer to have them keep using files given that this is the topic of these tests.

If there is a test that is specifically about the interaction of CSV data and the filesystem, I agree that makes sense. For example, writing a dataset and verifying the directory structure on disk is expected. But most of these don't really care about the filesystem; writing CSV data to a buffer is just as valid as to a temp file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, that makes sense, thanks for explaining. If the difference is significant, we should definitely look into it. Any idea how much impact it has?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idk I can look into it later. Don't worry about it for now.

@wjones127 wjones127 merged commit f7b18c4 into apache:master Jan 10, 2023
@ursabot
Copy link

ursabot commented Jan 11, 2023

Benchmark runs are scheduled for baseline = 85b167c and contender = f7b18c4. f7b18c4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.66% ⬆️0.06%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.06% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] f7b18c42 ec2-t3-xlarge-us-east-2
[Failed] f7b18c42 test-mac-arm
[Finished] f7b18c42 ursa-i9-9960x
[Finished] f7b18c42 ursa-thinkcentre-m75q
[Finished] 85b167c0 ec2-t3-xlarge-us-east-2
[Failed] 85b167c0 test-mac-arm
[Finished] 85b167c0 ursa-i9-9960x
[Finished] 85b167c0 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants