feat: use pyarrow for string functions #2616
The The The only remaining function that needs to be broadcasted (like By comparison, Two of the containment tests take an array-like second argument: None of the remaining functions look sufficiently different from the ones that have been implemented so far as to be a problem. |
Jim, this is a phenomenal piece of work. Looking forward to reviewing it. P.S. I touched your PR description to fix the links. Correct anything if you feel I made a mistake. |
461eb83 to 258dab9
I've added As such, I added a new It's not clear to me whether we're passing a nullable type, or whether Arrow is just deciding to return a nullable type. My hunch was that if the input I haven't added tests here yet, but they should be fairly easy. |
When working on these functions, two questions come to mind:
On both of these points, I think the answer is yes. There are some Awkward functions for which being permissive is perhaps a good thing, e.g. |
if layout.content.is_option:
    assert isinstance(layout.content, UnmaskedArray)
    return layout.copy(content=layout.content.content)
Since we only ever send non-missing strings to Arrow, it's fair to assume that the only outputs are non-missing as well, even if Arrow says that the type is potentially nullable.
Actually, I think that the non-nullable type is part of the information that's lost by setting extensionarray=False. But we want that because Arrow Compute only applies its string operations if it recognizes the type, and it doesn't recognize the type if it's an extension array.
It's not clear to me whether we're passing a nullable type, or whether Arrow is just deciding to return a nullable type.
I'm pretty sure that Arrow is just deciding to return a nullable type. Arrow's type system does not allow it to express non-nullability (the default is nullable) in the type objects themselves; non-nullability can only be expressed in fields. There's a two-level structure under each struct and Table: field (with name, nullability, and some other things), which contains type. If you have a raw list, not in a struct or Table, I don't think there's a place to put the non-nullability information.
These corner-cases are the reason we use extension arrays, to carry more information through Arrow and Parquet. But we can't use one here if we want Arrow Compute to recognize the input as strings.
c78a499 to 6e39bf1
@jpivarski I think this is ready to go! I see the merit in being permissive for our string operations, and I think upon reflection that most of our functions are like this. I have one outstanding API question that we need to figure out — whether to drop the |
The You're right that Maybe all of the modules should be prefixed by something neutral, like I don't think this needs to be decided now. It's great that all of these functions are done! We should tell @martindurant that he can try them out on some log-file data whenever he gets a chance. I'll merge this (unclear who should review it) after testing against all other outstanding merges into main. |
If we don't sort this now, we'll have to change the names in a breaking way down the road (only breaking for imports, which resolve file paths under the hood). That's already true for ak.operations. I don't think we should make the ak.operations module private, though; I think we want public APIs to live in public modules. I'm still thinking that through.
Technically true, but should people really be accessing |
I suppose this means there's no longer any need for the same functionality in awkward-pandas, but I do think it would like to keep a How is the performance? Yes, agree that nice log datasets would be perfect; https://github.com/logpai/loghub ? |
Performance should be whatever Arrow provides, which I expect to be good. (All of those functions are C++, exposed through Cython.) It looks like the log files in that GitHub repo are not the full datasets. Some of them are supposedly 20‒30 GB, but the file itself looks like a few kB. Besides, if they really put many GB in a git repo, it would be very hard to clone. We'd also need something to preprocess it, to get it into an Awkward format. That text processing would take some significant time. Are any of them JSON? We already have an optimized JSON parser. |
The full data files are at https://zenodo.org/record/8196385 (zip and tar.gz files). The repo contains all the data descriptions as READMEs. |
I renamed the modules in I'm guessing this is not controversial, so I'll enable auto-merge, but you can stop it and we'll discuss if you have another idea. |
docs/prepare_docstrings.py
Outdated
shortname = re.sub(r"\.operations\.str\.ak_\w+", ".str", shortname)
shortname = re.sub(r"\.operations\.str\.akstr_\w+", ".str", shortname)
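To make the effect of the two patterns concrete, here is a small standalone sketch. The input name is hypothetical, made up for this illustration; the real names handled by docs/prepare_docstrings.py may look different.

```python
import re

# Hypothetical fully qualified name, invented for this example.
qualified = "ak.operations.str.akstr_upper.upper"

# The akstr_ pattern collapses the implementation module into the public namespace.
with_akstr = re.sub(r"\.operations\.str\.akstr_\w+", ".str", qualified)
print(with_akstr)  # ak.str.upper

# The ak_ pattern does not match a module named with the akstr_ prefix,
# so the name is returned unchanged.
with_ak = re.sub(r"\.operations\.str\.ak_\w+", ".str", qualified)
print(with_ak)  # ak.operations.str.akstr_upper.upper
```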
Thanks!
This PR adds support for vectorised string operations using PyArrow's pyarrow.compute module. It implements all of the listed string predicates, string transforms, string padding, string trimming, string splitting, string component extraction, string joining, string slicing, and string containment tests.
String predicates
String transforms
String padding
String trimming
String splitting
String component extraction
String joining
String slicing
Containment tests