Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split std.mem.split and tokenize into sequence, any, and scalar versions #15579

Merged
merged 6 commits into from
Jun 3, 2023

Conversation

squeek502
Copy link
Collaborator

@squeek502 squeek502 commented May 5, 2023

  • This allows users to choose which version they need for their particular use case, as the previous default (now the 'sequence' version for split, and the 'any' version for tokenize) was (1) not always the desired type of delimiter and (2) performed worse than the scalar version if the delimiter was a single item.
  • std.mem.split is now an alias for std.mem.splitSequence and std.mem.tokenize is now an alias for std.mem.tokenizeAny with a deprecation doc comment on each of split and tokenize
  • Everywhere that can now use the 'scalar' version should get a nice little performance boost. EDIT: The performance boost might only apply to split, didn't check the performance difference of tokenizeAny versus tokenizeScalar

Any suggestions for a better name for the Full version are welcome. EDIT: Maybe Sequence? EDIT#2: Went with Sequence for now

@wooster0
Copy link
Contributor

wooster0 commented May 5, 2023

I was curious myself so I investigated in Godbolt https://zig.godbolt.org/z/84qYfvejq (see export fn xy at top), and indeed, the code size is a lot less. The compiler is able to figure out splitAny and splitScalar with constant values though and will generate the same code.

How about splitAll instead of splitFull? I think "all" complements the "any" well.
And then instead of "scalar" we could say "one". It sounds very simple... but that's kind of the point:
splitOne, splitAny, splitAll. Now the names are also the same length. Not sure though.

Alternatively, or in addition, what if we have a sensible default which would be simply called split? Maybe we can make splitFull the default so it becomes split as before... I do think that would be okay because that behavior is what most other split functions from other languages use too, such as JavaScript or Python. And from a logical perspective it makes the most sense too.

And then for any advanced uses one can use one of the two new functions you added here.

We already do this for other functions too. There is std.mem.lastIndexOf, and std.mem.lastIndexOfAny and std.mem.lastIndexOfScalar.

@xdBronch
Copy link
Contributor

xdBronch commented May 5, 2023

I agree with keeping the name of the new splitFull as just split. The name makes enough sense and imo its best to avoid any type of breaking change if its not necessary, even something as small as a name change.

@squeek502
Copy link
Collaborator Author

squeek502 commented May 5, 2023

The reason for deprecating split and tokenize in favor of only the explicitly suffixed versions is that the default behavior is not consistent between them: the default for split is what-is-now splitFull, but the default for tokenize is what-is-now tokenizeAny. Not sure how to square that circle without changing the default behavior of tokenize (doesn't sound good) or forcing at least tokenize to be explicit (i.e. deprecrating the un-suffixed version). I just went with deprecating both since at some point it will cause people to need to explicitly choose which to update their code to use (similar to the ensureCapacity -> ensureTotalCapacity/ensureUnusedCapacity split that happened previously).

Other than that,

  • I think All as a suffix is a bit too ambiguous since it sounds like it could apply to the buffer param or the delimiter param (EDIT: Full suffers from this too)
  • Scalar matches indexOfScalar and other std.mem functions, and same with Any (indexOfAny)

EDIT: Also, just for clarity, in this PR std.mem.split is an alias for splitFull and std.mem.tokenize is an alias for tokenizeAny with a deprecation doc comment on each. Added this bit to the OP since it's easy to miss it in the diff.

@squeek502 squeek502 changed the title Split std.mem.split and tokenize into full, any, and scalar versions Split std.mem.split and tokenize into sequence, any, and scalar versions May 7, 2023
…ter type: full, any, and scalar

This allows users to choose which version they need for their particular use case, as the previous default (now the 'full' version) was (1) not always the desired type of delimiter and (2) performed worse than the scalar version if the delimiter was a single item.
…y, and scalar

This allows users to choose which version they need for their particular use case, as the previous default (now the 'any' version) was (1) not always the desired type of delimiter and (2) performed worse than the scalar version if the delimiter was a single item.
Everywhere that can now use `tokenizeScalar` should get a nice little performance boost.
Everywhere that can now use `splitScalar` should get a nice little performance boost.
I think this makes the name less ambiguous and more obvious that the suffix applies to the `delimiter`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants