⚡️ Faster tokenizer of strings #5387

Merged: dubzzz merged 5 commits into main from faster-tokenizer-on-random-strings on Oct 31, 2024

dubzzz (Owner) commented Oct 31, 2024

Description

Our string arbitrary starts its initialization by tokenizing known vulnerable strings into a set of units (chars). The goal of this tokenization is to let the arbitrary later produce these vulnerable strings among the entries it generates.

The process is the following:

  • for each string known to be vulnerable, try to tokenize it with respect to the provided constraints on length and the unit arbitrary
  • for each tokenizable string, add it to the bucket of strings that may be generated

The original tokenizer was able to abide by constraints on length: the computed tokens depended on both the provided length constraints and the arbitrary being considered. But this flexibility had a runtime cost we don't want to pay anymore. The tokenizer will stop trying to optimize for length and will just tokenize for the requested arbitrary.
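
As an illustration only, the simplified process (tokenize for the unit, ignore lengths) could be sketched as follows. This is not the code from the PR: `tokenizeSketch`, `buildBucket` and the `isUnit` predicate are hypothetical names, and `isUnit` merely stands in for "the unit arbitrary can produce this chunk".

```typescript
// Hypothetical predicate: "is this chunk a value the unit arbitrary can produce?"
type IsUnit = (chunk: string) => boolean;

/**
 * Split `value` into chunks that are all valid units, or return undefined
 * when no such split exists. Length constraints are deliberately ignored.
 */
function tokenizeSketch(isUnit: IsUnit, value: string): string[] | undefined {
  // failed[i] becomes true once we know value.slice(i) cannot be tokenized
  const failed: boolean[] = new Array(value.length + 1).fill(false);
  const chunks: string[] = [];

  function visit(start: number): boolean {
    if (start === value.length) return true; // everything consumed
    if (failed[start]) return false; // already explored without success
    for (let end = start + 1; end <= value.length; ++end) {
      const chunk = value.slice(start, end);
      if (!isUnit(chunk)) continue;
      chunks.push(chunk);
      if (visit(end)) return true;
      chunks.pop(); // backtrack and try a longer chunk
    }
    failed[start] = true;
    return false;
  }

  return visit(0) ? chunks : undefined;
}

// Keep only the known strings that can be tokenized for the requested units.
function buildBucket(isUnit: IsUnit, knownStrings: string[]): string[][] {
  const bucket: string[][] = [];
  for (const s of knownStrings) {
    const tokens = tokenizeSketch(isUnit, s);
    if (tokens !== undefined) bucket.push(tokens);
  }
  return bucket;
}

// Example units: single lowercase ASCII letters (an arbitrary choice for this sketch)
const isLowerLetter: IsUnit = (c) => /^[a-z]$/.test(c);
const bucket = buildBucket(isLowerLetter, ['constructor', '__proto__', 'toString']);
```

With these example units, only `'constructor'` can be tokenized: `'__proto__'` contains underscores and `'toString'` an uppercase letter, so neither lands in the bucket.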

Checklist: Don't delete this checklist and make sure you do the following before opening the PR

  • The name of my PR follows gitmoji specification
  • My PR references one of several related issues (if any)
    • New features or breaking changes must come with an associated Issue or Discussion
    • My PR does not add any new dependency without an associated Issue or Discussion
  • My PR includes bumps details, please run yarn bump and flag the impacts properly
  • My PR adds relevant tests and they would have failed without my PR (when applicable)

Advanced

  • Category: ⚡️ Improve performance
  • Impacts: Slight performance uplift, but way more to come thanks to this change

changeset-bot bot commented Oct 31, 2024

🦋 Changeset detected

Latest commit: 76da3a8

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
fast-check Minor

codesandbox-ci bot commented Oct 31, 2024

This pull request is automatically built and testable in CodeSandbox.

Latest deployment of this branch, based on commit 76da3a8:

Sandbox Source
@fast-check/examples Configuration

dubzzz (Owner, Author) commented Oct 31, 2024

Benchmark stats:

┌───────────────────────────────────────────────────────────────────────┬──────────┬────────────────────┬───────────┬─────────┐
│ Task Name                                                             │ ops/sec  │ Average Time (ns)  │ Margin    │ Samples │
├───────────────────────────────────────────────────────────────────────┼──────────┼────────────────────┼───────────┼─────────┤
│ fc.char() on 3.21.0                                                   │ '9,979'  │ 100201.50000025751 │ '±8.28%'  │ 100     │
│ fc.char() on 3.22.0                                                   │ '9,198'  │ 108718.22000277461 │ '±9.35%'  │ 100     │
│ fc.char() on main                                                     │ '9,001'  │ 111086.55000018189 │ '±9.93%'  │ 100     │
│ fc.char() on extra                                                    │ '10,826' │ 92365.17000128515  │ '±7.28%'  │ 100     │
│ '—'                                                                   │ '—'      │ '—'                │ '—'       │ '—'     │
│ fc.string() on 3.21.0                                                 │ '4,882'  │ 204795.64999812283 │ '±8.12%'  │ 100     │
│ fc.string() on 3.22.0                                                 │ '3,736'  │ 267655.0999999745  │ '±10.56%' │ 100     │
│ fc.string() on main                                                   │ '3,959'  │ 252560.6999997399  │ '±7.91%'  │ 100     │
│ fc.string() on extra                                                  │ '3,761'  │ 265849.1199996206  │ '±9.45%'  │ 100     │
│ '—'                                                                   │ '—'      │ '—'                │ '—'       │ '—'     │
│ fc.constant('').chain(() => fc.string()) on 3.21.0                    │ '1,367'  │ 731179.5900005382  │ '±5.92%'  │ 100     │
│ fc.constant('').chain(() => fc.string()) on 3.22.0                    │ '332'    │ 3007857.6800005976 │ '±3.53%'  │ 100     │
│ fc.constant('').chain(() => fc.string()) on main                      │ '306'    │ 3265226.5699984855 │ '±4.25%'  │ 100     │
│ fc.constant('').chain(() => fc.string()) on extra                     │ '318'    │ 3140647.440001485  │ '±3.92%'  │ 100     │
│ '—'                                                                   │ '—'      │ '—'                │ '—'       │ '—'     │
│ fc.string({ minLength: 0, maxLength: 500, size: 'max' }) on 3.21.0    │ '364'    │ 2740566.6000011843 │ '±2.19%'  │ 100     │
│ fc.string({ minLength: 0, maxLength: 500, size: 'max' }) on 3.22.0    │ '337'    │ 2964428.4999990487 │ '±4.32%'  │ 100     │
│ fc.string({ minLength: 0, maxLength: 500, size: 'max' }) on main      │ '310'    │ 3217269.920000399  │ '±2.96%'  │ 100     │
│ fc.string({ minLength: 0, maxLength: 500, size: 'max' }) on extra     │ '316'    │ 3162695.5100000487 │ '±2.53%'  │ 100     │
│ '—'                                                                   │ '—'      │ '—'                │ '—'       │ '—'     │
│ fc.string({ minLength: 0, maxLength: 25_000, size: 'max' }) on 3.21.0 │ '5'      │ 183261603.60999987 │ '±2.57%'  │ 100     │
│ fc.string({ minLength: 0, maxLength: 25_000, size: 'max' }) on 3.22.0 │ '5'      │ 184139708.5599997  │ '±2.36%'  │ 100     │
│ fc.string({ minLength: 0, maxLength: 25_000, size: 'max' }) on main   │ '5'      │ 197418345.70999897 │ '±2.15%'  │ 100     │
│ fc.string({ minLength: 0, maxLength: 25_000, size: 'max' }) on extra  │ '4'      │ 202038732.63999936 │ '±2.08%'  │ 100     │
└───────────────────────────────────────────────────────────────────────┴──────────┴────────────────────┴───────────┴─────────┘

Detailed stats at: https://github.com/dubzzz/fast-check-benchmarks/actions/runs/11618811228
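
As a reading aid (not part of the PR), the two numeric columns of the table are related: ops/sec is approximately 1e9 divided by the average time in nanoseconds. The small sketch below uses figures copied from the `fc.constant('').chain(() => fc.string())` rows; the helper names `toOpsPerSec` and `slowdown` are made up for this example.

```typescript
// Average times in nanoseconds, copied from the table rows for
// fc.constant('').chain(() => fc.string())
const averageTimeNs = {
  '3.21.0': 731179.5900005382,
  '3.22.0': 3007857.6800005976,
  main: 3265226.5699984855,
  extra: 3140647.440001485,
};

// One second is 1e9 nanoseconds, so ops/sec = 1e9 / (average time in ns)
const toOpsPerSec = (ns: number): number => 1e9 / ns;

// How many times slower the candidate is compared to the baseline
function slowdown(baselineNs: number, candidateNs: number): number {
  return candidateNs / baselineNs;
}

const opsOn3210 = toOpsPerSec(averageTimeNs['3.21.0']);
const ratio = slowdown(averageTimeNs['3.21.0'], averageTimeNs['3.22.0']);
console.log(`3.21.0: ${opsOn3210.toFixed(0)} ops/sec, 3.22.0 is ${ratio.toFixed(2)}x slower`);
```

On this row, 3.22.0 takes roughly four times as long per run as 3.21.0, which is the kind of initialization cost this change targets. With margins of several percent on 100 samples, small differences between the other rows should be read with caution.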

/**
 * Split a string into valid tokens of patternsArb
 * @internal
 */
export function tokenizeString(patternsArb: Arbitrary<string>, value: string): string[] | undefined {

dubzzz (Owner, Author) commented on this line:

We might want to add an early exit when the number of chunks grows too large and we know for sure that, in the current context, there is no need to go that far. But that could be an optimization for later: it is probably far less critical than the optimization we are about to unlock.
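
A minimal sketch of what such an early exit might look like, assuming a hypothetical `maxChunks` budget passed by the caller. Neither `tokenizeWithBudget` nor `maxChunks` exists in the PR, and `isUnit` stands in for checking a chunk against `patternsArb`.

```typescript
/**
 * Backtracking tokenizer that gives up on a branch as soon as it would need
 * more than `maxChunks` chunks (hypothetical parameter, for illustration).
 */
function tokenizeWithBudget(
  isUnit: (chunk: string) => boolean,
  value: string,
  maxChunks: number,
): string[] | undefined {
  const chunks: string[] = [];

  function visit(start: number): boolean {
    if (start === value.length) return true;
    // Early exit: characters remain but the chunk budget is already spent
    if (chunks.length >= maxChunks) return false;
    // Try the longest candidates first so fewer chunks are needed
    for (let end = value.length; end > start; --end) {
      const chunk = value.slice(start, end);
      if (!isUnit(chunk)) continue;
      chunks.push(chunk);
      if (visit(end)) return true;
      chunks.pop();
    }
    return false;
  }

  return visit(0) ? chunks : undefined;
}

// With single-character units, 'abcd' needs exactly 4 chunks
const isSingleChar = (chunk: string): boolean => chunk.length === 1;
const overBudget = tokenizeWithBudget(isSingleChar, 'abcd', 3);
const withinBudget = tokenizeWithBudget(isSingleChar, 'abcd', 4);
```

Here `overBudget` is `undefined` because the search stops once three chunks are used, while `withinBudget` holds the four single-character chunks. Note that this sketch does not memoize failures: whether a suffix can be tokenized now depends on the remaining budget, so a cache keyed on the start index alone would be incorrect.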

Contributor

👋 A preview of the new documentation is available at: http://6723d5da5aedeb1cb80f032b--dubzzz-fast-check.netlify.app

Contributor

👋 A preview of the new documentation is available at: http://6723d86164c8cd1e8493cfb0--dubzzz-fast-check.netlify.app

codecov bot commented Oct 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.28%. Comparing base (640157f) to head (76da3a8).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5387   +/-   ##
=======================================
  Coverage   95.28%   95.28%           
=======================================
  Files         234      235    +1     
  Lines       10497    10504    +7     
  Branches     2799     2802    +3     
=======================================
+ Hits        10002    10009    +7     
  Misses        495      495           
Flag Coverage Δ
unit-tests 95.28% <100.00%> (+<0.01%) ⬆️
unit-tests-18.x-Linux 95.28% <100.00%> (+<0.01%) ⬆️
unit-tests-20.x-Linux 95.28% <100.00%> (+<0.01%) ⬆️
unit-tests-22.x-Linux 95.28% <100.00%> (+<0.01%) ⬆️
unit-tests-latest-Linux 95.28% <100.00%> (+<0.01%) ⬆️

@dubzzz dubzzz merged commit d336e2e into main Oct 31, 2024
73 checks passed
@dubzzz dubzzz deleted the faster-tokenizer-on-random-strings branch October 31, 2024 20:02
dubzzz added a commit that referenced this pull request Nov 1, 2024
Reverting part of the capability-drop introduced by #5387.

When working on #5387, we dropped some unmapping capabilities on `string`. In the past, the `string` arbitrary was able to shrink even custom values, even when it used a pretty uncommon dictionary of units. With #5387 we restricted these unmapping capabilities a bit by dropping support for some of these very advanced cases.

Now that we know we can add this support back while preserving our performance uplift, we are adding part of it back.

In other words, before #5387 the following arbitrary could have shrunk `...___...`:

```ts
fc.string({ maxLength: 3, unit: fc.constantFrom('._', '_...', '_._.', '_..', '.', '.._.', '__.', '....', '..', '.___', '._..', '__', '_.', '___', '.__.', '__._', '._.', '...', '_', '.._', '..._', '.__', '_.._', '_.__', '__..') })
```

After PR #5387, it no longer could.

But we are adding back such support! Which will be mostly invisible for all our users 😂
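
To make the example concrete: `...___...` is 9 characters long, so splitting it one character at a time would yield 9 units, far above `maxLength: 3`. It only fits when split into longer units, for instance `'...'`, `'___'`, `'...'`. The sketch below is an illustration and not the library's implementation (`fewestChunks` is a made-up helper); it searches for a split using as few units as possible.

```typescript
// Units copied from the fc.constantFrom(...) call above
const morseUnits: ReadonlySet<string> = new Set([
  '._', '_...', '_._.', '_..', '.', '.._.', '__.', '....', '..', '.___',
  '._..', '__', '_.', '___', '.__.', '__._', '._.', '...', '_', '.._',
  '..._', '.__', '_.._', '_.__', '__..',
]);

/**
 * Find a split of `value` into known units using as few units as possible,
 * or undefined when no split exists (dynamic programming over prefixes).
 */
function fewestChunks(units: ReadonlySet<string>, value: string): string[] | undefined {
  // best[i] = shortest list of units covering value.slice(0, i), if any
  const best: (string[] | undefined)[] = new Array(value.length + 1).fill(undefined);
  best[0] = [];
  for (let end = 1; end <= value.length; ++end) {
    for (let start = 0; start < end; ++start) {
      const prefix = best[start];
      const chunk = value.slice(start, end);
      if (prefix === undefined || !units.has(chunk)) continue;
      const current = best[end];
      if (current === undefined || prefix.length + 1 < current.length) {
        best[end] = [...prefix, chunk];
      }
    }
  }
  return best[value.length];
}

const maxLength = 3;
const tokens = fewestChunks(morseUnits, '...___...');
const fitsConstraints = tokens !== undefined && tokens.length <= maxLength;
```

Since the longest unit has 4 characters, at least 3 units are needed for 9 characters, and a 3-unit split exists, so `fitsConstraints` is `true` here. A tokenizer that ignores lengths entirely may instead settle on a split with more than 3 units, which is one way to see why the value stopped being accepted for shrinking after #5387.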
dubzzz added a commit that referenced this pull request Nov 1, 2024
**Checklist**: _Don't delete this checklist and make sure you do the following before opening the PR_

- [x] The name of my PR follows [gitmoji](https://gitmoji.dev/) specification
- [x] My PR references one of several related issues (if any)
  - [x] New features or breaking changes must come with an associated Issue or Discussion
  - [x] My PR does not add any new dependency without an associated Issue or Discussion
- [x] My PR includes bumps details, please run `yarn bump` and flag the impacts properly
- [x] My PR adds relevant tests and they would have failed without my PR (when applicable)

**Advanced**

- [x] Category: ✨ Introduce new features
- [x] Impacts: New and not new at the same time, it never disappeared from officially published versions
