⚡️ Faster tokenizer of strings #5387
Conversation
Our `string` arbitrary starts its initialization by tokenizing known vulnerable strings into sets of units (chars). The idea behind this tokenization process is to later be able to produce these vulnerable strings while generating entries with this arbitrary. The process is the following:

- for each string known to be vulnerable, try to tokenize it with respect to the provided constraints on length and the unit arbitrary
- for each tokenizable string, add it to the bucket of strings that may potentially be generated

The original tokenizer was able to abide by the constraints on length: computed tokens depended on the set of provided length constraints and on the arbitrary being considered. But this flexibility had a runtime cost we don't want to pay anymore. The tokenizer will stop trying to optimize on lengths and will just tokenize for the requested arbitrary.
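For illustration, here is a minimal sketch of the tokenization idea (not the actual fast-check source; the `isUnit` predicate and the backtracking strategy are assumptions made for this sketch):

```ts
// A greedy, backtracking tokenizer: split `value` into chunks that the unit
// arbitrary would accept. `isUnit` stands in for whatever membership check
// the unit arbitrary performs on a candidate chunk.
function tokenize(value: string, isUnit: (chunk: string) => boolean): string[] | undefined {
  if (value.length === 0) return [];
  // Try the longest prefix first, then shorter ones, backtracking on failure.
  for (let len = value.length; len >= 1; --len) {
    const head = value.slice(0, len);
    if (isUnit(head)) {
      const tail = tokenize(value.slice(len), isUnit);
      if (tail !== undefined) return [head, ...tail];
    }
  }
  return undefined; // value cannot be expressed as a sequence of units
}

// tokenize('abc', (c) => c.length === 1) -> ['a', 'b', 'c']
```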
🦋 Changeset detected

Latest commit: 76da3a8

The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
This pull request is automatically built and testable in CodeSandbox. The latest deployment of this branch is based on commit 76da3a8.
Benchmark stats:

| Task Name | ops/sec | Average Time (ns) | Margin | Samples |
|---|---|---|---|---|
| `fc.char()` on 3.21.0 | 9,979 | 100201.50000025751 | ±8.28% | 100 |
| `fc.char()` on 3.22.0 | 9,198 | 108718.22000277461 | ±9.35% | 100 |
| `fc.char()` on main | 9,001 | 111086.55000018189 | ±9.93% | 100 |
| `fc.char()` on extra | 10,826 | 92365.17000128515 | ±7.28% | 100 |
| `fc.string()` on 3.21.0 | 4,882 | 204795.64999812283 | ±8.12% | 100 |
| `fc.string()` on 3.22.0 | 3,736 | 267655.0999999745 | ±10.56% | 100 |
| `fc.string()` on main | 3,959 | 252560.6999997399 | ±7.91% | 100 |
| `fc.string()` on extra | 3,761 | 265849.1199996206 | ±9.45% | 100 |
| `fc.constant('').chain(() => fc.string())` on 3.21.0 | 1,367 | 731179.5900005382 | ±5.92% | 100 |
| `fc.constant('').chain(() => fc.string())` on 3.22.0 | 332 | 3007857.6800005976 | ±3.53% | 100 |
| `fc.constant('').chain(() => fc.string())` on main | 306 | 3265226.5699984855 | ±4.25% | 100 |
| `fc.constant('').chain(() => fc.string())` on extra | 318 | 3140647.440001485 | ±3.92% | 100 |
| `fc.string({ minLength: 0, maxLength: 500, size: 'max' })` on 3.21.0 | 364 | 2740566.6000011843 | ±2.19% | 100 |
| `fc.string({ minLength: 0, maxLength: 500, size: 'max' })` on 3.22.0 | 337 | 2964428.4999990487 | ±4.32% | 100 |
| `fc.string({ minLength: 0, maxLength: 500, size: 'max' })` on main | 310 | 3217269.920000399 | ±2.96% | 100 |
| `fc.string({ minLength: 0, maxLength: 500, size: 'max' })` on extra | 316 | 3162695.5100000487 | ±2.53% | 100 |
| `fc.string({ minLength: 0, maxLength: 25_000, size: 'max' })` on 3.21.0 | 5 | 183261603.60999987 | ±2.57% | 100 |
| `fc.string({ minLength: 0, maxLength: 25_000, size: 'max' })` on 3.22.0 | 5 | 184139708.5599997 | ±2.36% | 100 |
| `fc.string({ minLength: 0, maxLength: 25_000, size: 'max' })` on main | 5 | 197418345.70999897 | ±2.15% | 100 |
| `fc.string({ minLength: 0, maxLength: 25_000, size: 'max' })` on extra | 4 | 202038732.63999936 | ±2.08% | 100 |

Detailed stats at: https://github.com/dubzzz/fast-check-benchmarks/actions/runs/11618811228
```ts
/**
 * Split a string into valid tokens of patternsArb
 * @internal
 */
export function tokenizeString(patternsArb: Arbitrary<string>, value: string): string[] | undefined {
```
We might want to add an early exit when we are going too far in the number of chunks while we know for sure that, in the current context, there is no need to go that far. But that could be an optimization for later. It's probably way less critical than the one we are going to unlock now.
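To make that suggestion concrete, a hypothetical variant of the sketch above could thread an upper bound on the number of chunks through the recursion (the bound and its plumbing are assumptions, not code from this PR):

```ts
// Hypothetical early-exit variant: give up as soon as the number of chunks
// would exceed a bound we know the current context can never require.
function tokenizeBounded(
  value: string,
  isUnit: (chunk: string) => boolean,
  maxTokens: number,
): string[] | undefined {
  if (value.length === 0) return [];
  if (maxTokens <= 0) return undefined; // early exit: too many chunks already
  for (let len = value.length; len >= 1; --len) {
    const head = value.slice(0, len);
    if (isUnit(head)) {
      const tail = tokenizeBounded(value.slice(len), isUnit, maxTokens - 1);
      if (tail !== undefined) return [head, ...tail];
    }
  }
  return undefined;
}
```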
👋 A preview of the new documentation is available at: http://6723d86164c8cd1e8493cfb0--dubzzz-fast-check.netlify.app
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main    #5387   +/-   ##
=======================================
  Coverage   95.28%   95.28%
=======================================
  Files         234      235     +1
  Lines       10497    10504     +7
  Branches     2799     2802     +3
=======================================
+ Hits        10002    10009     +7
  Misses        495      495
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Reverting part of the capability-drop introduced by #5387.

When working on #5387, we dropped some unmapping capabilities on `string`. In the past, the `string` arbitrary was able to shrink even custom values, even when built on a pretty uncommon dictionary of units. With #5387 we restricted the unmapping capabilities a bit by dropping some of these very advanced supports. Now that we know we can add this support back while preserving our performance uplift, we are adding part of it back.

In other words, before #5387 the following arbitrary could have shrunk `...___...`:

```ts
fc.string({
  maxLength: 3,
  unit: fc.constantFrom(
    '._', '_...', '_._.', '_..', '.', '.._.', '__.', '....', '..', '.___',
    '._..', '__', '_.', '___', '.__.', '__._', '._.', '...', '_', '.._',
    '..._', '.__', '_.._', '_.__', '__..'
  ),
})
```

After #5387, it was no longer able to. But we are adding such support back! Which will be mostly invisible for all our users 😂
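As a usage illustration relying only on fast-check's public API (the deliberately failing predicate below is a made-up example, not from this PR):

```ts
import fc from 'fast-check';

// A property that fails for every non-empty string, so fast-check reports a
// counterexample and shrinks it, which exercises the unmapping of
// unit-based strings described above.
const morse = fc.string({
  maxLength: 3,
  unit: fc.constantFrom('.', '_', '._', '_.', '...', '___'),
});
fc.assert(fc.property(morse, (s) => s.length === 0));
```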