Optimize performance of blocklist filtering and checking by using Regex #17

Open
wants to merge 4 commits into
base: main

Contributor

@GromNaN GromNaN commented Nov 28, 2024

A single call to preg_match can replace a lot of lines of code, and is executed in optimized C code instead of PHP.

1. Blocklist filter

The regex /^[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]+$/i tells whether all the characters of the blocked word are in the alphabet.

(cancelled because I removed the blocklist filter in 3).
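
For reference, the filter described in step 1 could look roughly like the sketch below. The function name `filterBlocklist` is hypothetical and this is not the code from the PR; it only assumes that the alphabet is a plain ASCII string and that each blocked word is a string.

```php
<?php

// Hypothetical helper (not the PR's code): keep only the blocked words
// whose characters all belong to the alphabet, ignoring case.
function filterBlocklist(array $blocklist, string $alphabet): array
{
    // preg_quote() keeps characters such as "-" or "]" literal inside the class.
    $pattern = '/^[' . preg_quote($alphabet, '/') . ']+$/i';

    return array_values(array_filter(
        $blocklist,
        static fn (string $word): bool => preg_match($pattern, $word) === 1
    ));
}

$kept = filterBlocklist(['abc', 'a-b', 'XYZ', 'x1'], 'abcxyz');
```

With the alphabet `abcxyz`, only `abc` and `XYZ` are kept, because `-` and `1` are not part of it.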

2. Check if the ID contains a blocked word

A regex is generated from all the blocked words, e.g. /(1d10t|b0ob)/i, and run in case-insensitive mode.
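
A minimal sketch of this idea, assuming the blocklist is an array of strings. The names `buildBlocklistRegex` and `containsBlockedWord` are made up for illustration and are not taken from the PR.

```php
<?php

// Hypothetical helper: build one alternation regex from all blocked words.
function buildBlocklistRegex(array $words): ?string
{
    if ($words === []) {
        return null; // nothing to block
    }

    // Quote each word so regex metacharacters in it are matched literally.
    $quoted = array_map(
        static fn (string $word): string => preg_quote($word, '/'),
        $words
    );

    return '/(' . implode('|', $quoted) . ')/i';
}

// Hypothetical helper: a single preg_match() call checks every word at once.
function containsBlockedWord(string $id, ?string $regex): bool
{
    return $regex !== null && preg_match($regex, $id) === 1;
}

$regex = buildBlocklistRegex(['1d10t', 'b0ob']);
var_dump(containsBlockedWord('xB0OBy', $regex)); // bool(true)
var_dump(containsBlockedWord('abcdef', $regex)); // bool(false)
```

Building the pattern once, for example in the constructor, and reusing it for every generated ID is what moves the per-word loop out of PHP and into the regex engine.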

3. Apply leet transformation to blocked word

The blocklist is full of leet variations of the same words. Using a regex, we can check directly for alternative ways of writing the same word: /(ahole)/i becomes /(ah[o0][l1]e)/i to match ah0le, aho1e, ah01e and all other case variations.
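
The transformation could be sketched as follows. The substitution table `LEET_CLASSES` and the function `leetPattern` are hypothetical; the mapping actually used in the PR may cover more characters or differ entirely.

```php
<?php

// Hypothetical substitution table: each character maps to a class
// that also accepts its leet look-alike.
const LEET_CLASSES = [
    'o' => '[o0]',
    '0' => '[o0]',
    'l' => '[l1]',
    '1' => '[l1]',
];

// Hypothetical helper: turn a blocked word into a pattern fragment
// that matches the word and its leet variations.
function leetPattern(string $word): string
{
    $pattern = '';
    foreach (str_split(strtolower($word)) as $char) {
        // Characters without a leet variant are quoted and matched literally.
        $pattern .= LEET_CLASSES[$char] ?? preg_quote($char, '/');
    }

    return $pattern;
}

$regex = '/(' . leetPattern('ahole') . ')/i';
// $regex is now '/(ah[o0][l1]e)/i'
```

One pattern then stands in for several blocklist entries, so the blocklist itself only needs one spelling of each word.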

Benchmark

PHPBench code
composer req --dev phpbench/phpbench

In phpbench.json

{
    "$schema": "./vendor/phpbench/phpbench/phpbench.schema.json",
    "runner.bootstrap": "vendor/autoload.php",
    "runner.file_pattern": "*Bench.php",
    "runner.path": "tests",
    "runner.iterations": 3
}

In tests/SqidsBench.php

<?php

namespace Sqids\Tests;

use PhpBench\Attributes\ParamProviders;
use PhpBench\Attributes\Revs;
use PhpBench\Attributes\Warmup;
use Sqids\Sqids;

#[Warmup(1)]
final class SqidsBench
{
    #[Revs(1_000)]
    public static function benchInit(): void
    {
        new Sqids();
    }

    #[Revs(1_000)]
    #[ParamProviders('provideSqids')]
    public static function benchEncode(array $params): void
    {
        $params[0]->encode([1_000_000, 2_000_000]);
    }

    public static function provideSqids(): \Generator
    {
        yield 'default' => [
            new Sqids()
        ];
        yield 'custom blocklist' => [
            new Sqids(blocklist: [
                'JSwXFaosAN',
                'OCjV9JK64o',
                'rBHf',
                '79SM',
                '7tE6',
            ])
        ];
    }
}

Before

    benchInit...............................I2 - Mo1.617ms (±0.23%)
    benchEncode # default...................I2 - Mo280.402μs (±0.16%)
    benchEncode # custom blocklist..........I2 - Mo194.539μs (±0.34%)

After 2

    benchInit...............................I2 - Mo164.020μs (±0.44%)
    benchEncode # default...................I2 - Mo34.207μs (±0.46%)
    benchEncode # custom blocklist..........I2 - Mo183.985μs (±0.22%)

After 3

    benchInit...............................I2 - Mo77.698μs (±0.70%)
    benchEncode # default...................I2 - Mo32.270μs (±0.44%)
    benchEncode # custom blocklist..........I2 - Mo183.796μs (±0.63%)

After rebase on #18

    benchInit...............................I2 - Mo67.920μs (±0.22%)
    benchEncode # default...................I2 - Mo23.862μs (±0.52%)
    benchEncode # custom blocklist..........I2 - Mo136.598μs (±0.70%)
    benchEncode # no blocklist..............I2 - Mo22.825μs (±0.39%)

@GromNaN GromNaN force-pushed the optim-blocklist branch 2 times, most recently from 9559ba4 to 8b1d826, November 29, 2024 00:18
Collaborator

@vinkla vinkla left a comment

LGTM! @4kimov, who is the blocklist expert, will take a look.

        if ($id == $word) {
            return true;
        }
    } elseif (preg_match('/~[0-9]+~/', (string) $word)) {
Contributor Author

@GromNaN GromNaN Nov 29, 2024

This regex is not matching anything as the words never contain the tilde ~ char.

src/Sqids.php Outdated
protected MathInterface $math;

protected ?string $blocklist = null;

Changing the type of a protected property (from array to ?string) is a BC break. I don't know what the backward compatibility policy of this project is (I got here out of curiosity, because of your toot about performance improvements), but it might make sense to be careful about this.

Contributor Author

You are right. I switched to another property name and left this one unchanged, even though I don't see any reason this class would be extended. It should be final, since it has an interface.

Collaborator

You're right @stof, we don't want to introduce any breaking changes.

Contributor Author

If anyone extends the class and updates $this->blocklist after calling the constructor, then this change will not be used later in the id generator.
But this is already a misuse, as it bypasses the filter.

@vinkla vinkla requested a review from 4kimov December 12, 2024 15:28
@4kimov 4kimov mentioned this pull request Dec 22, 2024
Member

4kimov commented Dec 29, 2024

@GromNaN I gotta admit, I'm a bit confused by this PR. A few questions:

  1. You don't worry about checking if blocklist words contain chars that might not be in alphabet?
  2. Does the new isBlockedId take care of the scenario of blocking short ids that match short blocklist words exactly?
  3. Same question as number 2, but for ids that start with or end with blocked word?
  4. Finally, did I mess up this PR by merging Performance optimizations #18 first? Looks like there's a conflict :[

Contributor Author

GromNaN commented Dec 30, 2024

  1. You don't worry about checking if blocklist words contain chars that might not be in alphabet?

Removing words that contain invalid chars from the blocklist is an optimization, not a feature. Even after the optimization in step 1, this filtering costs more during class initialization than the insignificant benefit it brings when an ID is generated.

  2. Does the new isBlockedId take care of the scenario of blocking short ids that match short blocklist words exactly?

  3. Same question as number 2, but for ids that start with or end with blocked word?

If I've missed anything, it's because it's not covered by the tests.

My understanding is that you block any generated short ID that contains a blocked word. Whether the ID "starts with", "ends with" or "is equal to" the word is covered by the "contains" check done using the regex.
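
A small sketch of that claim, using a made-up blocked word (`abcd`) and made-up IDs. It only shows how an unanchored pattern behaves and is not the PR's test suite.

```php
<?php

// Hypothetical blocked word, for illustration only.
$regex = '/(abcd)/i';

$ids = [
    'abcd'    => 'equal to the word',
    'abcdxyz' => 'starts with the word',
    'xyzabcd' => 'ends with the word',
    'xyabcdz' => 'word in the middle',
    'xyzxyz'  => 'word absent',
];

$results = [];
foreach ($ids as $id => $case) {
    // No ^ or $ anchors, so a match at any position counts.
    $results[$id] = preg_match($regex, $id) === 1;
    printf("%-8s %-22s %s\n", $id, $case, $results[$id] ? 'blocked' : 'allowed');
}
```

This only demonstrates that "contains" is a superset of the other three checks. Whether the spec wants anything narrower for short IDs or for words with digits is a separate question.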

  4. Finally, did I mess up this PR by merging Performance optimizations #18 first? Looks like there's a conflict :[

That was expected. I rebased the PR.

Member

4kimov commented Dec 30, 2024

You're right, the tests for blocklist logic should cover more scenarios.

I'm not sure just the contains logic would mimic what the spec does.

Here's one recent discussion about this: sqids/sqids-javascript#30

Another example: we block only if the word is longer than 3 chars, it contains numbers, and it starts with or ends with the blocked word. So an ID like abcd1efgh2ijkl would be blocked if the blocklist contains abcd, but not if it contains efgh.

Contributor Author

GromNaN commented Dec 30, 2024

Currently, there is a str_contains that returns true as soon as the ShortId contains one of the blocked words. The previous conditions are irrelevant.

sqids-php/src/Sqids.php

Lines 818 to 819 in 9390a85

    } elseif (str_contains($id, (string) $word)) {
        return true;

The regex match is always false, it requires the number to be surrounded by ~.

sqids-php/src/Sqids.php

Lines 814 to 816 in 9390a85

    } elseif (preg_match('/~[0-9]+~/', (string) $word)) {
        if (str_starts_with($id, (string) $word) || strrpos($id, (string) $word) === strlen($id) - strlen((string) $word)) {
            return true;
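
To illustrate why this branch is never taken, here is a short sketch with made-up words. The second pattern, without the tildes, is only my assumption about what the check was meant to be, not something confirmed in this thread.

```php
<?php

// Hypothetical blocked words; none of them contains a "~" character.
$words = ['b0ob', '1d10t', 'ahole'];

$tildeMatches = [];
foreach ($words as $word) {
    // "/" is the delimiter here, so "~" is a literal character
    // that must appear in the word on both sides of the digits.
    $tildeMatches[$word] = preg_match('/~[0-9]+~/', $word);
}

// Assumed intent: detect words that contain at least one digit.
$digitMatches = [];
foreach ($words as $word) {
    $digitMatches[$word] = preg_match('/[0-9]+/', $word);
}

var_dump($tildeMatches); // every value is int(0)
var_dump($digitMatches); // int(1) for 'b0ob' and '1d10t', int(0) for 'ahole'
```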

This condition on an ID or word shorter than 3 chars is not tested.

If you find a test, I could fix the code.

Member

4kimov commented Dec 30, 2024

Thanks for pointing out the ~. Looks like a few todo items [for me]. Putting this PR on ice for now.

PS: Updating tests for PHP is also not as straightforward since they're the same across all implementations.
