Add reason post is likely nonsense #4304

Open
wants to merge 13 commits into
base: master
from

Conversation

user12986714
Contributor

This PR attempts to detect vandalism and gibberish by calculating the information entropy of a given text. The constants used in this PR are conservative.

This is not really a good practice, but too many tests think nonsense is not gibberish...
For example, "I have this number: 111111111111111" and "This asdf should asdf not asdf be asdf matched asdf because asdf the asdf words do not asdf follow on each asdf other".
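For readers unfamiliar with the idea, here is a minimal sketch of one common way to measure "entropy per character": the Shannon entropy of the character frequency distribution. This is an illustration only. The function name `entropy_per_char` is hypothetical, and the formula and thresholds actually used in this PR are not shown in this conversation and may differ.

```python
import math
from collections import Counter


def entropy_per_char(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the observed character frequencies.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Compare a highly repetitive string with an ordinary sentence.
for sample in ("111111111111111", "I have this number: 111111111111111"):
    print(f"{entropy_per_char(sample):.4f}  {sample!r}")
```

A string made of a single repeated character scores 0 under this measure, and a string of n equally frequent distinct characters scores log2(n), so any threshold depends heavily on how the text is normalized first.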
@ArtOfCode-
Member

What's the standard deviation of entropy-per-char? 3.0 seems quite close to 2.6 to me, but it depends - that might be super unlikely or relatively likely; the standard deviation will reveal which it is.

@user12986714
Contributor Author

Well, entropy per char for English + space IIRC is ~21...

@ArtOfCode-
Member

That's... not what I'm asking. You have a comment in here that says "Average entropy per char in English is 2.6". If that's the average, what's the stddev?

@ghost

ghost commented Aug 8, 2020

The entropy values here are broken: every post gets caught.

Legit posts:

  • “I have seen the discussion about the Turkish Airlines COVID Cabin policy which makes little sense. Regardless though, does anyone know if they are enforcing it? I, like many, will be transferring at a European Airport on two tickets issued separately. I can't check my luggage all the way through from Istanbul to Malaga and can't exit customs to collect the luggage in Brussels (without a forced quarantine or denied entry)” - entropy per char of 0.2332

  • “Why all Indian rupee notes are accepted in Nepal and Bhutan, except 500 Rs and 1000 Rs? Why spare those two notes?” - entropy per char of 0.2485

Gibberish posts:

  • “this this this this spamd dshdshdshdshhds” - entropy per char of 0.4045

  • “test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test” - entropy per char of 0.4898

  • “gibberish dshdshdsaasdlaf,afdasfkkdafkafdkdfkfdskdsfksdkfksd.fk.sdfksfdkfk.fsdk.sdfksfdkfsdkfsdk” - entropy per char of 0.3817

  • “dfahdfhdsfsfkdjjsldfksdflfsdlkjfldskjlkfdsjklsfd/jldsfjlsdfjsdfakjsdafjkfsjkldfsaklfdklsalkfdsaklsdadsaklasfdkldfsaljkdfslkaldsfkdslflfsddfskjsdfllsfdaladsflfdsalsdalfksdklafkdsafdsafdsaklkldfsakfdkslaflklfdsakldfsalfdldsaflkasfldjkdfslsdklaflkdsfakjlfsadkljfsakljafsdlkjsdfjjfsdljasdladsfljkfdsjldfsjldsfjsdf” - entropy per char of 0.4025

I think the entropy values need to be adjusted according to these results.

@user12986714
Contributor Author

user12986714 commented Aug 8, 2020

A stat with 12405 fp posts on MS

>>> statistics.mean(result)
0.20483261275004847
>>> statistics.median(result)
0.20223865427238322
>>> statistics.stdev(result)
0.031230117152319384

So yes, I managed to mess up the decimal point.
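To put the standard-deviation question in concrete terms, the sketch below expresses the per-post values reported earlier in this thread as z-scores against the mean and standard deviation above. It assumes those earlier values are on the same scale as these statistics, which the thread does not confirm; the helper name `z_score` is hypothetical.

```python
# Summary statistics reported above for 12405 fp posts on MS.
MEAN = 0.20483261275004847
STDEV = 0.031230117152319384


def z_score(value: float, mean: float = MEAN, stdev: float = STDEV) -> float:
    """Number of standard deviations that `value` lies from `mean`."""
    return (value - mean) / stdev


# Entropy-per-char values quoted earlier in this conversation.
legit = [0.2332, 0.2485]
gibberish = [0.4045, 0.4898, 0.3817, 0.4025]

for label, values in (("legit", legit), ("gibberish", gibberish)):
    for v in values:
        print(f"{label:9s} {v:.4f} -> z = {z_score(v):+.2f}")
```

Under that assumption, the quoted legit posts fall within two standard deviations of the mean, while the quoted gibberish samples sit more than five standard deviations above it.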

Note: fp is defined as:

>>> def is_fp(post):
...     fp_count = 0
...     tp_count = 0
...     for fb in post['feedback']:
...         if fb[1].startswith("f"):
...             fp_count += 1
...         elif fb[1].startswith("t"):
...             tp_count += 1
...     return (fp_count - tp_count > 1) or ((fp_count > 0) and (tp_count == 0))

Too much very-compact code that looks like nonsense but is not actually
@stale stale bot added the status: stale label Sep 9, 2020
@stale

stale bot commented Sep 11, 2020

This issue has been closed because it has had no recent activity. If this is still important, please add another comment and find someone with write permissions to reopen the issue. Thank you for your contributions.

@stale stale bot closed this Sep 11, 2020
@makyen makyen reopened this Sep 11, 2020
@stale stale bot removed the status: stale label Sep 11, 2020
@makyen makyen added the status: confirmed Confirmed as something that needs working on. label Sep 11, 2020
@NobodyNada
Member

A stat with 12405 fp posts on MS

That's a lot...unless I'm misunderstanding something, it means we're catching one out of every six non-spam posts.

@user12986714
Contributor Author

A stat with 12405 fp posts on MS

That's a lot...unless I'm misunderstanding something, it means we're catching one out of every six non-spam posts.

Well, I took many existing fp posts from the MS records and analyzed them; it is not that this reason would produce those fps.

@NobodyNada
Member

NobodyNada commented Sep 12, 2020

@user12986714 Ah, gotcha. Do you happen to have any stats on how many tps/fps this will result in over the MS corpus?

@user12986714
Contributor Author

@user12986714 Ah, gotcha. Do you happen to have any stats on how many tps/fps this will result in over the MS corpus?

On the metasmoke dataset, the fp rate is very low. However, since the samples on MS are biased, we cannot really conclude anything.

That said, I believe some test sessions have been run, and the fp rate was low.

@NobodyNada
Member

NobodyNada commented Sep 12, 2020 via email

Member

@NobodyNada NobodyNada left a comment


So I ran some tests on this today. It's really cool, but it has a couple problems and therefore currently catches a ton of FPs. See review comments for details.

findspam.py
"french.stackexchange.com", "spanish.stackexchange.com",
"portuguese.stackexchange.com", "korean.stackexchange.com",
"ukrainian.stackexchange.com", "italian.stackexchange.com"],
max_rep=10000, max_score=10000)
Member


We might want to strip code blocks, for instance https://askubuntu.com/a/623972. Or at least collapse repeated whitespace characters.

@NobodyNada
Member

I ran some more tests today. It's looking a lot better, but we still have problems with:

  1. Code. We probably should strip code blocks, but then we'll still have a lot of fp's due to posts with unformatted code.
  2. Posts with lots of un-rendered whitespace. IMO we really should collapse repeated whitespace characters.
  3. Those constants don't seem to be conservative enough; e.g. https://english.stackexchange.com/a/408724/106362 and https://hermeneutics.stackexchange.com/a/51104 are both caught, with entropies of 4.0233 and 5.6742 respectively.
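Points 1 and 2 could be handled by a preprocessing step along these lines. This is a rough sketch, assuming the post body is HTML in which code blocks are wrapped in `<pre>` elements; the function name `normalize_for_entropy` is hypothetical and is not part of findspam.py. It does nothing about unformatted code pasted directly into a post, which would still need separate handling.

```python
import re

# Matches a whole <pre>...</pre> element, including attributes and newlines.
_PRE_BLOCK = re.compile(r"<pre\b[^>]*>.*?</pre>", re.DOTALL | re.IGNORECASE)
# Matches any run of whitespace (spaces, tabs, newlines).
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_for_entropy(body: str) -> str:
    """Drop <pre> code blocks and collapse whitespace runs to one space.

    Hypothetical helper: assumes `body` is an HTML post body.
    """
    without_code = _PRE_BLOCK.sub(" ", body)
    return _WHITESPACE_RUN.sub(" ", without_code).strip()


body = "<p>Run this:</p>\n<pre><code>ls   -la\n</code></pre>\n<p>Done.    OK</p>"
print(normalize_for_entropy(body))
```

A regex is enough for a sketch like this; a real implementation would more likely reuse whatever HTML handling the surrounding code already has.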

Labels
status: confirmed Confirmed as something that needs working on.
4 participants