feat: use pyarrow for string functions #2616

jpivarski · 2023-08-04T21:38:02Z

This PR adds support for vectorised string operations using PyArrow's pyarrow.compute module.
It implements all of the listed string predicates, string transforms, string padding, string trimming, string splitting, string component extraction, string joining, string slicing, and string containment tests.

String predicates

String transforms

String padding

center Center strings by padding with a given character.
lpad Right-align strings by padding with a given character.
rpad Left-align strings by padding with a given character.
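
The naming can look inverted at first: lpad pads on the left, which right-aligns the text, and rpad does the opposite. As a rough analogy only, here is the same idea with Python's built-in str.rjust and str.ljust; these are not the Arrow kernels, and the width and fill character are arbitrary choices for illustration.

```python
# Plain-Python analogy only: built-in str methods, not the Arrow kernels.
words = ["a", "bc", "def"]

lpad = [w.rjust(5, "*") for w in words]  # pad on the left  -> right-aligned
rpad = [w.ljust(5, "*") for w in words]  # pad on the right -> left-aligned

print(lpad)  # ['****a', '***bc', '**def']
print(rpad)  # ['a****', 'bc***', 'def**']
```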

String trimming

ltrim Trim leading characters.
ltrim_whitespace Trim leading whitespace characters.
rtrim Trim trailing characters.
rtrim_whitespace Trim trailing whitespace characters.
trim Trim leading and trailing characters.
trim_whitespace Trim leading and trailing whitespace characters.

String splitting

split_pattern Split string according to separator.
split_whitespace Split string according to any ASCII whitespace.
split_pattern_regex Split string according to regex pattern.

String component extraction

extract_regex Extract substrings captured by a regex pattern.

String joining

join Join a list of strings together with a separator.
join_element_wise Join string arguments together, with the last argument as separator.

String slicing

slice Slice string.

Containment tests

codecov · 2023-08-04T21:57:35Z

Codecov Report

❗ No coverage uploaded for pull request base (main@85ca6d5).
The diff coverage is 98.93%.

Additional details and impacted files

Files Changed	Coverage Δ
src/awkward/_connect/pyarrow.py	`91.15% <90.00%> (ø)`
src/awkward/operations/str/akstr_join.py	`92.68% <92.68%> (ø)`
src/awkward/operations/str/akstr_repeat.py	`94.44% <94.44%> (ø)`
.../awkward/operations/str/akstr_join_element_wise.py	`95.65% <95.65%> (ø)`
src/awkward/operations/str/akstr_index_in.py	`96.42% <96.42%> (ø)`
src/awkward/operations/str/akstr_is_in.py	`96.42% <96.42%> (ø)`
src/awkward/operations/str/__init__.py	`98.80% <98.80%> (ø)`
src/awkward/contents/unmaskedarray.py	`73.62% <100.00%> (ø)`
src/awkward/operations/__init__.py	`100.00% <100.00%> (ø)`
src/awkward/operations/str/akstr_capitalize.py	`100.00% <100.00%> (ø)`
... and 43 more

jpivarski · 2023-08-05T02:28:59Z

The split_* functions don't look fundamentally hard. They have to pass parameters (constants, not broadcasted arrays) and they output lists of strings instead of strings. If the bytestring_to_string=True option is needed for bytestrings (very unlikely!), then the string → bytestring repacking will need to be applied one level deeper (but I don't think that bytestring_to_string=True will be needed).

The extract_regex function is straightforward to call, but it returns an Arrow struct. If the record field names are formulaic ("\1", "\2", etc.), then we'd probably want to set the fields of the output to None so that it becomes an Awkward tuple. (We can assign to the output in place because it was just created by ak.from_arrow, and the fields are not part of the zero-copy data.)

The only remaining function that needs to be broadcasted (like binary_repeat) is join_element_wise. Unlike binary_repeat, there's no choice about whether to broadcast or not, and there can be more than two arrays to broadcast. (All of the *strings arrays documented here would have to be broadcasted.) This will probably be the hardest one.

By comparison, join (documented here) does not broadcast, but it needs to be applied at the level of lists of strings, not just strings (one level deep). It's a reducer.
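
To make the contrast between the two joining functions concrete, here is a pure-Python model of their semantics. It is only a sketch: it assumes the inputs to join_element_wise have already been broadcast to equal-length flat lists, and it is not how the PR or Arrow implements either function.

```python
def join_element_wise(*args):
    """Model: the last argument supplies the separator for each element."""
    *columns, separators = args
    # Each output element combines one string from every column.
    return [sep.join(parts) for parts, sep in zip(zip(*columns), separators)]


def join(lists, separator):
    """Model: a reducer, one output string per input list of strings."""
    return [separator.join(strings) for strings in lists]


first = ["a", "b"]
second = ["x", "y"]
seps = ["-", "+"]

print(join_element_wise(first, second, seps))   # ['a-x', 'b+y']
print(join([["a", "b", "c"], [], ["d"]], "-"))  # ['a-b-c', '', 'd']
```

The element-wise version keeps the depth of its inputs, while join removes one level of nesting, which is why it has to be applied at the level of lists of strings.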

Two of the containment tests take an array-like second argument: index_in and is_in. This argument is value_set (see index_in and is_in), which is a set of strings that each string in the values is compared against, but it is not broadcasted to values. So it's much easier than join_element_wise.
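
A pure-Python model of why value_set needs no broadcasting: it stays one flat collection while the lookup is applied at every string, however deeply the values are nested. This is a sketch of the semantics using nested Python lists, not the Arrow kernels or the code in this PR; None stands in for a missing value.

```python
def is_in(values, value_set):
    """Model: True where a string appears in the flat value_set."""
    if isinstance(values, list):
        return [is_in(v, value_set) for v in values]
    return values in value_set


def index_in(values, value_set):
    """Model: position of each string in value_set, or None if absent."""
    if isinstance(values, list):
        return [index_in(v, value_set) for v in values]
    return value_set.index(values) if values in value_set else None


value_set = ["apple", "pear"]                # never reshaped to match values
values = [["apple", "kiwi"], [], ["pear"]]   # arbitrarily nested strings

print(is_in(values, value_set))     # [[True, False], [], [True]]
print(index_in(values, value_set))  # [[0, None], [], [1]]
```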

None of the remaining functions look sufficiently different from the ones that have been implemented so far as to be a problem.

agoose77 · 2023-08-06T17:00:01Z

Jim, this is a phenomenal piece of work. Looking forward to reviewing it.

P.S. I touched your PR description to fix the links. Correct anything if you feel I made a mistake.

agoose77 · 2023-08-07T11:45:15Z

I've added split_whitespace, split_pattern, and split_pattern_regex. I've noticed that these operations return nullable types, so the round-trip logic in _get_action doesn't correctly re-interpret the result as bytestrings if the input is a bytestring (and bytestring_to_string=True).

As such, I added a new _get_split_action that handles the case where a returned list contains a nullable string. I don't *think* we should ever get anything besides an UnmaskedArray here, as we never pass any null data to Arrow.

It's not clear to me whether we're passing a nullable type, or whether Arrow is just deciding to return a nullable type. My hunch was that if the input StringArray is nullable, then the result is nullable, but I can't check whether the type we build is considered nullable by Arrow; large_string and large_binary don't seem to expose this information.

I haven't added tests here yet, but they should be fairly easy.

agoose77 · 2023-08-07T13:06:39Z

binary_join is not too tricky, but it looks like the Arrow kernel only supports nullable, small strings. A first-pass is here, although it should be cleaned up. I just played around to get it working, rather than thinking through robustly what the recursion function should look like.

When working on these functions, two questions come to mind:

Should we perform validation logic in Awkward, rather than relying on Arrow to error if an invalid type is passed?
Should we treat operations on incompatible types (e.g. string operations on non-strings, or on record arrays) as errors, rather than silently returning the original array?

On both of these points, I think the answer is yes. There are some Awkward functions for which being permissive is perhaps a good thing, e.g. ak.values_astype. However, in general we have type information available and I think it would be useful for users to be able to detect operations upon the "wrong" types as early as possible. Do you agree, @jpivarski?
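
The two behaviours under discussion can be sketched with a toy recursive function over nested Python lists. The upper function and its strict flag are hypothetical names invented for this illustration; they are not part of Awkward Array or of this PR.

```python
def upper(data, strict=False):
    """Toy string operation applied through nested lists (hypothetical)."""
    if isinstance(data, list):
        return [upper(item, strict) for item in data]
    if isinstance(data, str):
        return data.upper()
    if strict:
        # Strict: operating on the "wrong" type is reported immediately.
        raise TypeError(f"expected a string, got {type(data).__name__}")
    return data  # Permissive: non-strings pass through unchanged.


mixed = [["a", "b"], [1, 2], ["c"]]

print(upper(mixed))  # [['A', 'B'], [1, 2], ['C']]

try:
    upper(mixed, strict=True)
except TypeError as err:
    print(err)  # expected a string, got int
```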

src/awkward/contents/unmaskedarray.py

jpivarski · 2023-08-07T16:18:30Z

src/awkward/operations/str/__init__.py

+    if layout.content.is_option:
+        assert isinstance(layout.content, UnmaskedArray)
+        return layout.copy(content=layout.content.content)


Since we only ever send non-missing strings to Arrow, it's fair to assume that the only outputs are non-missing as well, even if Arrow says that the type is potentially nullable.

Actually, I think that the non-nullable type is part of the information that's lost by setting extensionarray=False. But we want that because Arrow Compute only applies its string operations if it recognizes the type, and it doesn't recognize the type if it's an extension array.

> It's not clear to me whether we're passing a nullable type, or whether Arrow is just deciding to return a nullable type.

I'm pretty sure that Arrow is just deciding to return a nullable type. Arrow's type system does not allow it to express non-nullability (the default is nullable) in the type objects themselves; non-nullability can only be expressed in fields. There's a two-level structure under each struct and Table: field (with name, nullability, and some other things), which contains type. If you have a raw list, not in a struct or Table, I don't think there's a place to put the non-nullability information.

These corner-cases are the reason we use extension arrays, to carry more information through Arrow and Parquet. But we can't use one here if we want Arrow Compute to recognize the input as strings.

agoose77 · 2023-08-08T11:09:42Z

@jpivarski I think this is ready to go! I see the merit in being permissive for our string operations, and I think upon reflection that most of our functions are like this.

I have one outstanding API question that we need to figure out — whether to drop the ak_ prefix for string function module names? I think we should, and it suggests that the module names in ak.operations should also be changed. I don't know the motivation for the ak_ prefix; it helps with indicating that a module contains a high-level function, but that's not a strong motivation for keeping it.

jpivarski · 2023-08-08T14:13:55Z

The ak_ prefix was to distinguish between modules and the high-level functions they contain. Some of the high-level functions have the same names as Python built-ins, so I wanted something to not spread that degeneracy into the module names as well.

You're right that ak_ makes a lot less sense for these functions, which aren't ak.* but ak.str.*. I wasn't sure what to do about that when starting this PR, but I just went with the flow. We could use an ak_str_ prefix, adding boilerplate, but it would be less confusing. If someone has a nested, indented file-browser, digging more deeply into the file structure would indent to the right because these functions are in a directory named str, but also because their names have longer prefixes.

Maybe all of the modules should be prefixed by something neutral, like m_ for module (not to be confused with "private" in C++), h_ for high-level function, or o_ for operation.

I don't think this needs to be decided now.

It's great that all of these functions are done! We should tell @martindurant that he can try them out on some log-file data whenever he gets a chance. I'll merge this (unclear who should review it) after testing against all other outstanding merges into main.

agoose77 · 2023-08-08T14:21:50Z

If we don't sort this now, we'll have to change the names in a breaking way down the road (only breaking for imports, which resolve file paths under the hood). That's already true for ak.operations.

I don't think we should make the ak.operations module private though - I think we want public APIs to live in public modules. I'm still thinking that through, though.

jpivarski · 2023-08-08T14:27:57Z

Technically true, but should people really be accessing ak.XYZ through ak.operations.XYZ or ak.operations.ak_XYZ.XYZ anyway? It seems like there should be only one way to access these functions; putting them inside directories is for our own organization. I'd be happy with operations → _operations.

martindurant · 2023-08-08T14:34:57Z

I suppose this means there's no longer any need for the same functionality in awkward-pandas, but I do think awkward-pandas would want to keep a str accessor whatever you decide here, since that's the pandas pattern.

How is the performance?

Yes, agree that nice log datasets would be perfect; https://github.com/logpai/loghub ?

jpivarski · 2023-08-08T14:44:38Z

Performance should be whatever Arrow provides, which I expect to be good. (All of those functions are C++, exposed through Cython.)

It looks like the log files in that GitHub repo are not the full datasets. Some of them are supposedly 20‒30 GB, but the file itself looks like a few kB. Besides, if they really put many GB in a git repo, it would be very hard to clone.

We'd also need something to preprocess it, to get it into an Awkward format. That text processing would take some significant time.

Are any of them JSON? We already have an optimized JSON parser.

martindurant · 2023-08-08T14:46:26Z

The full data files are at https://zenodo.org/record/8196385 (zip and tar.gz files). The repo contains all the data descriptions as READMEs.

jpivarski · 2023-08-08T15:08:15Z

I renamed the modules in ak.operations.str as akstr_* to acknowledge that they're not in the ak.* namespace, but the ak.str.* namespace. I think this will at least be a non-confusing choice (and I thought an extra underscore between "ak" and "str" would be gratuitous). Eventually, I think ak.operations should be private (L3), but that can come later and it will require a deprecation cycle.

I'm guessing this is not controversial, so I'll enable auto-merge, but you can stop it and we'll discuss if you have another idea.

jpivarski · 2023-08-08T15:14:19Z

docs/prepare_docstrings.py

-    shortname = re.sub(r"\.operations\.str\.ak_\w+", ".str", shortname)
+    shortname = re.sub(r"\.operations\.str\.akstr_\w+", ".str", shortname)
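
To illustrate the effect of the updated pattern, here is a small standard-library sketch. The fully qualified name is a hypothetical example built from the module naming discussed in this PR; the two patterns are copied from the diff above.

```python
import re

old_pattern = r"\.operations\.str\.ak_\w+"
new_pattern = r"\.operations\.str\.akstr_\w+"

# Hypothetical fully qualified name, assumed for illustration.
name = "ak.operations.str.akstr_capitalize.capitalize"

# The new pattern collapses the module path into the public ak.str namespace.
print(re.sub(new_pattern, ".str", name))  # ak.str.capitalize

# The old pattern no longer matches the renamed modules, so nothing changes.
print(re.sub(old_pattern, ".str", name))  # ak.operations.str.akstr_capitalize.capitalize
```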


agoose77 force-pushed the jpivarski/use-pyarrow-for-strings branch from 461eb83 to 258dab9 Compare August 7, 2023 09:44

agoose77 and others added 5 commits August 8, 2023 11:55

fix: pass module to str high_level_function

307a3ea

docs: homogenize docstrings

51a5c5c

docs: add see also

447cde7

docs: include ak.str in toctree

cbba554

chore: update pre-commit hooks (#2619)

6e39bf1

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

agoose77 force-pushed the jpivarski/use-pyarrow-for-strings branch from c78a499 to 6e39bf1 Compare August 8, 2023 10:56

refactor: cleanup error handling

9fee3fc

agoose77 marked this pull request as ready for review August 8, 2023 11:06

jpivarski and others added 2 commits August 8, 2023 09:58

Merge branch 'main' into jpivarski/use-pyarrow-for-strings

a2ca690

Rename ak_*.py modules -> akstr_*.py.

c5f5cb7

jpivarski enabled auto-merge (squash) August 8, 2023 15:08

docs: be explicit about ak_str_

7bcb12c

Merge branch 'main' into jpivarski/use-pyarrow-for-strings

34d0184

jpivarski merged commit 1cfea2f into main Aug 8, 2023

jpivarski deleted the jpivarski/use-pyarrow-for-strings branch August 8, 2023 18:00

This was referenced Aug 8, 2023

feat: enable use of ak.str namespace intake/akimbo#34

Merged

feat: support ak.str as dak.str dask-contrib/dask-awkward#339

Merged
