Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

findspam.py: body_text_repeated(): phrase repeated at beginning of body #7002

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tripleee
Copy link
Member

Second PR, had to back out the broken one I committed yesterday.

test/test_findspam.py: reinstate backed-out test case

Second PR, had to back out the broken one I committed yesterday.

test/test_findspam.py: reinstate backed-out test case
@tripleee
Copy link
Member Author

The forced anchoring to the beginning of the post is probably something I would like to extend at some point. I don't think many posts have repeated text in the middle (though certainly that happens too) but having a lot of repeats at the end is definitely something I would like to catch, too. I'm thinking it should trigger the same rule, perhaps with a tweak to the detailed "why" message.

@tripleee
Copy link
Member Author

I have tested on my laptop for several hours in https://chat.stackexchange.com/rooms/136393/room-for-tripleee-and-ad-oisee and this seems to do what it's supposed to do now.

@tripleee tripleee requested a review from makyen May 19, 2022 12:23
@makyen makyen added area: spamchecks Detections or the process of testing posts. (No space in the label, is because of Hacktoberfest) status: deferred 6-8 longer weeks. labels May 20, 2022
@makyen
Copy link
Contributor

makyen commented May 20, 2022

This looks quite a bit better. Using similar regex101 tests (first, second, third) produce a much better final regex, \A(what[\W_]+is[\W_]+your[\W_]+name[\W_]*){10,}, which doesn't match just anything. :)

I wonder how it will perform on a post where the user has a fairly redundant dataset as the first thing in their post. However, in practice, that probably won't be all that many FP hits.

Unfortunately, however, the meta-concerns and side-effects I expressed elsewhere still seem quite valid. Basically, given how things are, there's quite a bit of stuff which needs to change in MS prior to our actually being able to merge this and use it. Thus, I'm going to, reluctantly, tag it as status: deferred.

I'll see if I can take a better/good look at what needs to change in MS.

if not initial_words:
return False, ""
escaped_initial_words = [regex.escape(x) for x in initial_words.groups()]
period = regex.match(
Copy link
Contributor

@makyen makyen May 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add regex.cache_all(False) prior to this regex.match() and regex.cache_all(True) after it to prevent the single-use regexes compiled here from filling the regex package's internal cache of compiled regular expressions.

Note: this is probably something which we should be doing in a substantial number of places in the code, but I was reminded when creating PR #7012 that it should be used here. Basically, we should be doing that wherever we use a regex which is unique per-post. We should also change things in individual functions/modules to separately store compiled regexes which are used repeatedly in order to reduce the number which we rely on the regex package to cache.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've struck-out my earlier comments here and closed an open PR I had, as regex.cache_all() is the wrong way to handle this here, or anywhere in SD, due to threading. I'll be writing a helper function which can be used for explicit regex compiles to force the implicit regex cache to not be used. I'll update this again once I've created that (soon). I'll also create an issue in the regex package asking to expose an option to be able to not use the implicit cache on a per-compile basis.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened a PR, #7025, which adds regex_compile_no_cache() to helpers.py and imports it to findspam.py. That function can be used to compile a regular expression and not have it placed in the regex package's implicit cache. Using that function, the code here could be something like:

    period_regex = regex_compile_no_cache(
        r"\A%s[\W_]+%s[\W_]+%s[\W_]+(.{1,40}?)%s[\W_]+%s[\W_]+%s(?=$|[\W_])" % (
            tuple(escaped_initial_words * 2))
    period = period_regex.match(s)
    if not period:
        return False, ""
    period_words = regex.split(r"[\W_]+", period.groups(0)[0])
    escaped_words = escaped_initial_words + [
        regex.escape(x) for x in period_words]
    repeats_regex = regex_compile_no_cache(r"\A(" + r"[\W_]+".join(escaped_words) + r"[\W_]*){10,}")
    repeats = repeats_regex.match(s)

escaped_words = escaped_initial_words + [
regex.escape(x) for x in period_words]
repeats_regex = r"\A(" + r"[\W_]+".join(escaped_words) + r"[\W_]*){10,}"
repeats = regex.match(repeats_regex, s)
Copy link
Contributor

@makyen makyen May 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same request here regarding regex.cache_all(False) prior to this regex.match() and regex.cache_all(True) after it.

Copy link
Contributor

@makyen makyen May 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obviously, it would be possible to use only a single pair of regex.cache_all(False) and regex.cache_all(True) to cover both post-specific uses of a regular expression.

This also touches on the larger issue of how we use the regex package's cache, particularly within the detections. The larger issue is that we don't want to be using it for regexes which are individually generated specific to each post, but we do want any static regular expressions not to have to be recompiled for every use (e.g. through using the regex package's cache). There are a couple of different approaches which we could take in general. I'll open an Issue to cover/discuss the topic.

@makyen makyen added the status: confirmed Confirmed as something that needs working on. label Jun 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: spamchecks Detections or the process of testing posts. (No space in the label, is because of Hacktoberfest) status: confirmed Confirmed as something that needs working on. status: deferred 6-8 longer weeks.
Development

Successfully merging this pull request may close these issues.

2 participants