-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize performance of blocklist filtering and checking by using Regex #17
base: main
Are you sure you want to change the base?
Conversation
9559ba4
to
8b1d826
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! @4kimov will take a look who is the blocklist expert.
if ($id == $word) { | ||
return true; | ||
} | ||
} elseif (preg_match('/~[0-9]+~/', (string) $word)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This regex is not matching anything as the words never contain the tilde ~
char.
src/Sqids.php
Outdated
protected MathInterface $math; | ||
|
||
protected ?string $blocklist = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changing the type of a protected property (from array
to ?string
is a BC break. I don't know what is the Backward Compatiblity policy of this project (I got here because of your toot about performance improvements, by curiosity) but it might make sense to be care about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I reverted to use an other property name and leave this one. Even if I don't see any reason this class would be extended. It should be final, it has an interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right @stof, we don't want to introduce any breaking changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If anyone extends the class and updates $this->blocklist
after calling the constructor, then this change will not be used later in the id generator.
But this is already a misuse, as it bypasses the filter.
@GromNaN I gotta admit, I'm a bit confused by this PR. A few questions:
|
dc0198b
to
67d951e
Compare
Removing the words invalid chars from the blocklist is an optimization, not a feature. Even after optimization (step 1), this task is more costly during class initialization, for an insignificant benefit when an ID is generated.
If I've missed anything, it's that it's not covered by the tests. My understanding is that you block any generated short ID that contains a blocked word. Whether the id "starts with", "ends with" or "is equals to" the word is covered by the "contains" verification done using the regex.
That was expected. I rebased the PR. |
You're right, the tests for blocklist logic should cover more scenarios. I'm not sure just the contains logic would mimic what the spec does. Here's one recent discussion about this: sqids/sqids-javascript#30 Another example is: only if the word is bigger than 3 chars and it contains numbers and it starts with or ends with blocked word, then we block it. So id like |
Currently, there is a Lines 818 to 819 in 9390a85
The regex match is always Lines 814 to 816 in 9390a85
This condition on id or word smaller that 3 is not tested. If you find a test, I could fix the code. |
Thanks for pointing out the PS: Updating tests for PHP is also not as straightforward since they're the same across all implementations. |
A single call to
preg_match
can replace a lot of lines of code, and is executed in optimized C code instead of PHP.1.
Blocklist filterThe regex/^[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]+$/i
tell if all the characters of the blocked work are in the alphabet.(cancelled because I removed the blocklist filter in 3).
2. Check if the ID contains a blocked word
A regex is generated with all the blocked words
/(1d10t|b0ob)/i
and run with in case-insensitive mode.3. Apply leet transformation to blocked word
The blocklist if full of leet variations of the same words. Using regex, we can check directly for alternative way of writting the same word.
/(ahole)/i
becomes/(ah[oO][l1]e)/i
to checkah0le
,aho1e
,ah01e
and all other case variations.Benchmark
PHPBench code
In
phpbench.json
In
tests/SqidsBench.php
Before
After 2
After 3
After rebase on #18