⚡️ Faster tokenizer of strings #5387

dubzzz · 2024-10-31T18:58:43Z

Description

Our `string` arbitrary starts its initialization by tokenizing known vulnerable strings into a set of units (chars). The idea behind this tokenization is to let the arbitrary produce these vulnerable strings later, when it generates entries.

The process is the following:

- for each string known to be vulnerable, try to tokenize it with respect to the provided constraints on length and the unit arbitrary
- for each tokenizable string, add it to the bucket of strings that may be generated

The original tokenizer was able to abide by constraints on length: the computed tokens depended both on the provided length constraints and on the arbitrary being considered. But this flexibility had a runtime cost we don't want to pay anymore. The tokenizer will stop trying to optimize for lengths and will just tokenize for the requested arbitrary.
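The two steps above can be sketched as follows. This is a minimal illustration and not the code of this PR: `IsUnit`, `tokenizeSketch` and `collectTokenizable` are hypothetical names, and the plain predicate is an assumed stand-in for asking the unit arbitrary whether it could have produced a given chunk.

```typescript
// Hypothetical stand-in for "could the unit arbitrary have produced this chunk?"
type IsUnit = (chunk: string) => boolean;

// Depth-first search for a split of `value` into chunks accepted by `isUnit`.
// Returns the first split found (not necessarily the shortest), or undefined.
// No length constraint is taken into account, mirroring the simplification
// described above.
function tokenizeSketch(isUnit: IsUnit, value: string): string[] | undefined {
  // Start indexes from which no valid split of the remaining suffix exists
  const deadEnds = new Set<number>();
  const walk = (start: number): string[] | undefined => {
    if (start === value.length) return [];
    if (deadEnds.has(start)) return undefined;
    for (let end = start + 1; end <= value.length; ++end) {
      const chunk = value.slice(start, end);
      if (!isUnit(chunk)) continue;
      const rest = walk(end);
      if (rest !== undefined) return [chunk, ...rest];
    }
    deadEnds.add(start);
    return undefined;
  };
  return walk(0);
}

// Step 1: try to tokenize each candidate. Step 2: keep the tokenizable ones.
function collectTokenizable(isUnit: IsUnit, candidates: string[]): string[][] {
  const bucket: string[][] = [];
  for (const candidate of candidates) {
    const tokens = tokenizeSketch(isUnit, candidate);
    if (tokens !== undefined) bucket.push(tokens);
  }
  return bucket;
}

// Illustrative unit: a single ASCII character
const isSingleAsciiChar: IsUnit = (chunk) => chunk.length === 1 && chunk.charCodeAt(0) < 128;
// "é" cannot be built from ASCII units, so only the first two entries are kept
const bucket = collectTokenizable(isSingleAsciiChar, ["__proto__", "constructor", "é"]);
```

Because the sketch ignores lengths, the result of tokenizing a candidate depends only on the unit predicate and the candidate itself.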

Checklist — Don't delete this checklist and make sure you do the following before opening the PR

- The name of my PR follows gitmoji specification
- My PR references one of several related issues (if any)
  - New features or breaking changes must come with an associated Issue or Discussion
  - My PR does not add any new dependency without an associated Issue or Discussion
- My PR includes bumps details, please run `yarn bump` and flag the impacts properly
- My PR adds relevant tests and they would have failed without my PR (when applicable)

Advanced

Category: ⚡️ Improve performance
Impacts: Slight performance uplift, but way more to come thanks to this change


changeset-bot · 2024-10-31T18:58:47Z

🦋 Changeset detected

Latest commit: 76da3a8

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
fast-check	Minor



codesandbox-ci · 2024-10-31T19:00:55Z

This pull request is automatically built and testable in CodeSandbox.

To see build info of the built libraries, click here or the icon next to each commit SHA.

Latest deployment of this branch, based on commit e009720:

Sandbox	Source
@fast-check/examples	Configuration

codesandbox-ci · 2024-10-31T19:02:04Z

This pull request is automatically built and testable in CodeSandbox.

To see build info of the built libraries, click here or the icon next to each commit SHA.

Latest deployment of this branch, based on commit 76da3a8:

Sandbox	Source
@fast-check/examples	Configuration

dubzzz · 2024-10-31T19:03:11Z

Benchmark stats:

┌───────────────────────────────────────────────────────────────────────┬──────────┬────────────────────┬───────────┬─────────┐
│ Task Name                                                             │ ops/sec  │ Average Time (ns)  │ Margin    │ Samples │
├───────────────────────────────────────────────────────────────────────┼──────────┼────────────────────┼───────────┼─────────┤
│ fc.char() on 3.21.0                                                   │ '9,979'  │ 100201.50000025751 │ '±8.28%'  │ 100     │
│ fc.char() on 3.22.0                                                   │ '9,198'  │ 108718.22000277461 │ '±9.35%'  │ 100     │
│ fc.char() on main                                                     │ '9,001'  │ 111086.55000018189 │ '±9.93%'  │ 100     │
│ fc.char() on extra                                                    │ '10,826' │ 92365.17000128515  │ '±7.28%'  │ 100     │
│ '—'                                                                   │ '—'      │ '—'                │ '—'       │ '—'     │
│ fc.string() on 3.21.0                                                 │ '4,882'  │ 204795.64999812283 │ '±8.12%'  │ 100     │
│ fc.string() on 3.22.0                                                 │ '3,736'  │ 267655.0999999745  │ '±10.56%' │ 100     │
│ fc.string() on main                                                   │ '3,959'  │ 252560.6999997399  │ '±7.91%'  │ 100     │
│ fc.string() on extra                                                  │ '3,761'  │ 265849.1199996206  │ '±9.45%'  │ 100     │
│ '—'                                                                   │ '—'      │ '—'                │ '—'       │ '—'     │
│ fc.constant('').chain(() => fc.string()) on 3.21.0                    │ '1,367'  │ 731179.5900005382  │ '±5.92%'  │ 100     │
│ fc.constant('').chain(() => fc.string()) on 3.22.0                    │ '332'    │ 3007857.6800005976 │ '±3.53%'  │ 100     │
│ fc.constant('').chain(() => fc.string()) on main                      │ '306'    │ 3265226.5699984855 │ '±4.25%'  │ 100     │
│ fc.constant('').chain(() => fc.string()) on extra                     │ '318'    │ 3140647.440001485  │ '±3.92%'  │ 100     │
│ '—'                                                                   │ '—'      │ '—'                │ '—'       │ '—'     │
│ fc.string({ minLength: 0, maxLength: 500, size: 'max' }) on 3.21.0    │ '364'    │ 2740566.6000011843 │ '±2.19%'  │ 100     │
│ fc.string({ minLength: 0, maxLength: 500, size: 'max' }) on 3.22.0    │ '337'    │ 2964428.4999990487 │ '±4.32%'  │ 100     │
│ fc.string({ minLength: 0, maxLength: 500, size: 'max' }) on main      │ '310'    │ 3217269.920000399  │ '±2.96%'  │ 100     │
│ fc.string({ minLength: 0, maxLength: 500, size: 'max' }) on extra     │ '316'    │ 3162695.5100000487 │ '±2.53%'  │ 100     │
│ '—'                                                                   │ '—'      │ '—'                │ '—'       │ '—'     │
│ fc.string({ minLength: 0, maxLength: 25_000, size: 'max' }) on 3.21.0 │ '5'      │ 183261603.60999987 │ '±2.57%'  │ 100     │
│ fc.string({ minLength: 0, maxLength: 25_000, size: 'max' }) on 3.22.0 │ '5'      │ 184139708.5599997  │ '±2.36%'  │ 100     │
│ fc.string({ minLength: 0, maxLength: 25_000, size: 'max' }) on main   │ '5'      │ 197418345.70999897 │ '±2.15%'  │ 100     │
│ fc.string({ minLength: 0, maxLength: 25_000, size: 'max' }) on extra  │ '4'      │ 202038732.63999936 │ '±2.08%'  │ 100     │
└───────────────────────────────────────────────────────────────────────┴──────────┴────────────────────┴───────────┴─────────┘

Detailed stats at: https://github.com/dubzzz/fast-check-benchmarks/actions/runs/11618811228

dubzzz · 2024-10-31T19:08:37Z

packages/fast-check/src/arbitrary/_internals/helpers/TokenizeString.ts

+ * Split a string into valid tokens of patternsArb
+ * @internal
+ */
+export function tokenizeString(patternsArb: Arbitrary<string>, value: string): string[] | undefined {


We might want to add an early exit when the number of chunks grows beyond what the current context can possibly need. But that can be an optimization for later: it's probably far less critical than the optimization we are about to unlock.
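For illustration only, such an early exit could look like the sketch below. It is not part of this PR: the `maxChunks` parameter, the `IsUnit` predicate and the name `tokenizeWithBudget` are all assumptions made for the example.

```typescript
// Hypothetical stand-in for "could the unit arbitrary have produced this chunk?"
type IsUnit = (chunk: string) => boolean;

// Same first-match search as a plain tokenizer, except that a branch is
// abandoned as soon as it has already consumed `maxChunks` chunks without
// reaching the end of `value`.
function tokenizeWithBudget(isUnit: IsUnit, value: string, maxChunks: number): string[] | undefined {
  const walk = (start: number, used: number): string[] | undefined => {
    if (start === value.length) return [];
    if (used >= maxChunks) return undefined; // early exit: budget exhausted
    for (let end = start + 1; end <= value.length; ++end) {
      const chunk = value.slice(start, end);
      if (!isUnit(chunk)) continue;
      const rest = walk(end, used + 1);
      if (rest !== undefined) return [chunk, ...rest];
    }
    return undefined;
  };
  return walk(0, 0);
}

// Illustrative dictionary of units
const abUnits = new Set(["a", "b", "ab"]);
const isAbUnit: IsUnit = (chunk) => abUnits.has(chunk);

const unbounded = tokenizeWithBudget(isAbUnit, "abab", Number.POSITIVE_INFINITY); // four chunks
const atMostTwo = tokenizeWithBudget(isAbUnit, "abab", 2); // two chunks
const atMostOne = tokenizeWithBudget(isAbUnit, "abab", 1); // no valid split
```

The budget makes the outcome of a sub-search depend on how many chunks were already used, so a memo keyed only on the start index would no longer be valid. That extra bookkeeping is one reason to treat it as a later optimization.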

github-actions · 2024-10-31T19:09:15Z

👋 A preview of the new documentation is available at: http://6723d5da5aedeb1cb80f032b--dubzzz-fast-check.netlify.app


github-actions · 2024-10-31T19:20:01Z

👋 A preview of the new documentation is available at: http://6723d86164c8cd1e8493cfb0--dubzzz-fast-check.netlify.app

codecov · 2024-10-31T19:28:38Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.28%. Comparing base (640157f) to head (76da3a8).
Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #5387   +/-   ##
=======================================
  Coverage   95.28%   95.28%           
=======================================
  Files         234      235    +1     
  Lines       10497    10504    +7     
  Branches     2799     2802    +3     
=======================================
+ Hits        10002    10009    +7     
  Misses        495      495

Flag	Coverage Δ
unit-tests	`95.28% <100.00%> (+<0.01%)`	⬆️
unit-tests-18.x-Linux	`95.28% <100.00%> (+<0.01%)`	⬆️
unit-tests-20.x-Linux	`95.28% <100.00%> (+<0.01%)`	⬆️
unit-tests-22.x-Linux	`95.28% <100.00%> (+<0.01%)`	⬆️
unit-tests-latest-Linux	`95.28% <100.00%> (+<0.01%)`	⬆️



**Description**

Reverting part of the capability drop introduced by #5387.

When working on #5387, we dropped some unmapping capabilities on `string`. In the past, the `string` arbitrary was able to shrink even custom values, even when it used a pretty uncommon dictionary of units. With #5387 we restricted the unmapping capabilities a bit by dropping some of this very advanced support. Now that we know we can add this support back while preserving our performance uplift, we are adding part of it back.

In other words, before #5387 the following arbitrary could have shrunk `...___...`:

```ts
fc.string({
  maxLength: 3,
  unit: fc.constantFrom(
    '._', '_...', '_._.', '_..', '.', '.._.', '__.', '....', '..', '.___', '._..', '__', '_.',
    '___', '.__.', '__._', '._.', '...', '_', '.._', '..._', '.__', '_.._', '_.__', '__..'
  ),
})
```

After PR #5387, it was no longer able to. But we are adding back such support! This will be mostly invisible to all our users 😂

**Checklist**
_Don't delete this checklist and make sure you do the following before opening the PR_

- [x] The name of my PR follows [gitmoji](https://gitmoji.dev/) specification
- [x] My PR references one of several related issues (if any)
- [x] New features or breaking changes must come with an associated Issue or Discussion
- [x] My PR does not add any new dependency without an associated Issue or Discussion
- [x] My PR includes bumps details, please run `yarn bump` and flag the impacts properly
- [x] My PR adds relevant tests and they would have failed without my PR (when applicable)

**Advanced**

- [x] Category: ✨ Introduce new features
- [x] Impacts: New and not new at the same time, it never disappeared from officially published versions
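To see why length matters for this example, the sketch below computes a split of `...___...` that uses the fewest units from the dictionary quoted in the description. It is an illustration under assumptions, not the mechanism used by either PR (the text does not spell it out): `fewestUnits` and `IsUnit` are hypothetical names, and a plain `Set` lookup stands in for the unit arbitrary.

```typescript
// Hypothetical stand-in for "could the unit arbitrary have produced this chunk?"
type IsUnit = (chunk: string) => boolean;

// Dynamic programming over prefixes: best[i] is the smallest number of units
// whose concatenation equals value.slice(0, i), or Infinity if none exists.
function fewestUnits(isUnit: IsUnit, value: string): string[] | undefined {
  const n = value.length;
  const best: number[] = Array.from({ length: n + 1 }, () => Infinity);
  const prev: number[] = Array.from({ length: n + 1 }, () => -1);
  best[0] = 0;
  for (let i = 1; i <= n; ++i) {
    for (let j = 0; j < i; ++j) {
      if (best[j] + 1 < best[i] && isUnit(value.slice(j, i))) {
        best[i] = best[j] + 1;
        prev[i] = j; // remember where the last unit of the best split starts
      }
    }
  }
  if (best[n] === Infinity) return undefined;
  // Walk the back-pointers from the end to rebuild the split
  const chunks: string[] = [];
  for (let i = n; i > 0; i = prev[i]) chunks.unshift(value.slice(prev[i], i));
  return chunks;
}

// The dictionary of units quoted in the description above
const morseUnits = new Set([
  "._", "_...", "_._.", "_..", ".", ".._.", "__.", "....", "..", ".___", "._..", "__", "_.",
  "___", ".__.", "__._", "._.", "...", "_", ".._", "..._", ".__", "_.._", "_.__", "__..",
]);
const isMorseUnit: IsUnit = (chunk) => morseUnits.has(chunk);

const morseValue = "...___...";
// Splitting character by character is valid ("." and "_" are both units)
// but needs nine units, which exceeds maxLength: 3.
const perCharCount = [...morseValue].filter((c) => isMorseUnit(c)).length;
// The shortest split needs only three units, which fits maxLength: 3.
const shortest = fewestUnits(isMorseUnit, morseValue);
```

Since the longest unit has four characters, no split of these nine characters can use fewer than three units, so a tokenizer that stops at the first split it finds can miss the only splits compatible with `maxLength: 3`.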


dubzzz added 2 commits October 31, 2024 19:59

Update packages/fast-check/src/arbitrary/_internals/helpers/TokenizeS…

e009720

…tring.ts

Create polite-pumpkins-turn.md

ce3521c

fix lint

87c33fc


Update packages/fast-check/src/arbitrary/_internals/helpers/TokenizeS…

76da3a8

…tring.ts

dubzzz merged commit d336e2e into main Oct 31, 2024
73 checks passed

dubzzz deleted the faster-tokenizer-on-random-strings branch October 31, 2024 20:02

dubzzz mentioned this pull request Nov 1, 2024

✨ Add back strong unmapping capabilities to string #5390

Merged

8 tasks
