Split `std.mem.split` and `tokenize` into `sequence`, `any`, and `scalar` versions #15579

squeek502 · 2023-05-05T01:23:56Z

This allows users to choose which version they need for their particular use case, as the previous default (now the 'sequence' version for split, and the 'any' version for tokenize) was (1) not always the desired type of delimiter and (2) performed worse than the scalar version if the delimiter was a single item.
std.mem.split is now an alias for std.mem.splitSequence and std.mem.tokenize is now an alias for std.mem.tokenizeAny with a deprecation doc comment on each of split and tokenize
Everywhere that can now use the 'scalar' version should get a nice little performance boost. EDIT: The performance boost might only apply to split, didn't check the performance difference of tokenizeAny versus tokenizeScalar

Any suggestions for a better name for the Full version are welcome. EDIT: Maybe Sequence? EDIT#2: Went with Sequence for now

wooster0 · 2023-05-05T02:01:53Z

I was curious myself so I investigated in Godbolt https://zig.godbolt.org/z/84qYfvejq (see export fn xy at top), and indeed, the code size is a lot less. The compiler is able to figure out splitAny and splitScalar with constant values though and will generate the same code.

How about splitAll instead of splitFull? I think "all" complements the "any" well.
And then instead of "scalar" we could say "one". It sounds very simple... but that's kind of the point:
splitOne, splitAny, splitAll. Now the names are also the same length. Not sure though.

Alternatively, or in addition, what if we have a sensible default which would be simply called split? Maybe we can make splitFull the default so it becomes split as before... I do think that would be okay because that behavior is what most other split functions from other languages use too, such as JavaScript or Python. And from a logical perspective it makes the most sense too.

And then for any advanced uses one can use one of the two new functions you added here.

We already do this for other functions too. There is std.mem.lastIndexOf, and std.mem.lastIndexOfAny and std.mem.lastIndexOfScalar.

xdBronch · 2023-05-05T02:07:30Z

I agree with keeping the name of the new splitFull as just split. The name makes enough sense and imo its best to avoid any type of breaking change if its not necessary, even something as small as a name change.

squeek502 · 2023-05-05T02:15:36Z

The reason for deprecating split and tokenize in favor of only the explicitly suffixed versions is that the default behavior is not consistent between them: the default for split is what-is-now splitFull, but the default for tokenize is what-is-now tokenizeAny. Not sure how to square that circle without changing the default behavior of tokenize (doesn't sound good) or forcing at least tokenize to be explicit (i.e. deprecrating the un-suffixed version). I just went with deprecating both since at some point it will cause people to need to explicitly choose which to update their code to use (similar to the ensureCapacity -> ensureTotalCapacity/ensureUnusedCapacity split that happened previously).

Other than that,

I think All as a suffix is a bit too ambiguous since it sounds like it could apply to the buffer param or the delimiter param (EDIT: Full suffers from this too)
Scalar matches indexOfScalar and other std.mem functions, and same with Any (indexOfAny)

EDIT: Also, just for clarity, in this PR std.mem.split is an alias for splitFull and std.mem.tokenize is an alias for tokenizeAny with a deprecation doc comment on each. Added this bit to the OP since it's easy to miss it in the diff.

…ter type: full, any, and scalar This allows users to choose which version they need for their particular use case, as the previous default (now the 'full' version) was (1) not always the desired type of delimiter and (2) performed worse than the scalar version if the delimiter was a single item.

…y, and scalar This allows users to choose which version they need for their particular use case, as the previous default (now the 'any' version) was (1) not always the desired type of delimiter and (2) performed worse than the scalar version if the delimiter was a single item.

Everywhere that can now use `tokenizeScalar` should get a nice little performance boost.

Everywhere that can now use `splitScalar` should get a nice little performance boost.

I think this makes the name less ambiguous and more obvious that the suffix applies to the `delimiter`.

squeek502 requested a review from kristoff-it as a code owner May 5, 2023 01:23

squeek502 force-pushed the mem-delimiters branch from 96d50b5 to fefb5cc Compare May 5, 2023 01:35

squeek502 force-pushed the mem-delimiters branch from fefb5cc to 982f5af Compare May 6, 2023 04:32

squeek502 changed the title ~~Split std.mem.split and tokenize into full, any, and scalar versions~~ Split std.mem.split and tokenize into sequence, any, and scalar versions May 7, 2023

squeek502 added 6 commits May 13, 2023 13:43

Fix SplitIterator and TokenIterator type instantiation

ce9f3ec

Update all std.mem.tokenize calls to their appropriate function

815e53b

Everywhere that can now use `tokenizeScalar` should get a nice little performance boost.

Update all std.mem.split calls to their appropriate function

2129f28

Everywhere that can now use `splitScalar` should get a nice little performance boost.

std.mem: Rename splitFull/tokenizeFull to splitSequence/tokenizeSequence

104f405

I think this makes the name less ambiguous and more obvious that the suffix applies to the `delimiter`.

squeek502 force-pushed the mem-delimiters branch from c2acd95 to 104f405 Compare May 13, 2023 20:45

andrewrk merged commit 629f0d2 into ziglang:master Jun 3, 2023

squeek502 mentioned this pull request Jun 3, 2023

std: add mem.SplitIterator.peek() #15670

Merged

xdBronch mentioned this pull request Jun 4, 2023

core/gpu: start on usingnamespace issue and new zig split usage hexops/mach#797

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Split `std.mem.split` and `tokenize` into `sequence`, `any`, and `scalar` versions #15579

Split `std.mem.split` and `tokenize` into `sequence`, `any`, and `scalar` versions #15579

squeek502 commented May 5, 2023 •

edited

Loading

wooster0 commented May 5, 2023 •

edited

Loading

xdBronch commented May 5, 2023

squeek502 commented May 5, 2023 •

edited

Loading

Split std.mem.split and tokenize into sequence, any, and scalar versions #15579

Split std.mem.split and tokenize into sequence, any, and scalar versions #15579

Conversation

squeek502 commented May 5, 2023 • edited Loading

wooster0 commented May 5, 2023 • edited Loading

xdBronch commented May 5, 2023

squeek502 commented May 5, 2023 • edited Loading

Split `std.mem.split` and `tokenize` into `sequence`, `any`, and `scalar` versions #15579

Split `std.mem.split` and `tokenize` into `sequence`, `any`, and `scalar` versions #15579

squeek502 commented May 5, 2023 •

edited

Loading

wooster0 commented May 5, 2023 •

edited

Loading

squeek502 commented May 5, 2023 •

edited

Loading