Add new bots #36

Merged
merged 7 commits into from
Jul 22, 2020
alanorth
Contributor

This adds patterns to match the following robot user agents:

  • Citoid
  • Typhoeus
  • 7siters
  • sqlmap
  • Pattern
  • OgScrper
  • Turnitin
  • Drupal

Relevant URLs for each user agent are included in the patterns file.

Note: I added several of these in #34, but only a few were merged, so I am re-submitting them here.

alanorth added 7 commits July 20, 2020 14:26
Typhoeus wraps libcurl in order to make fast and reliable requests.

See: https://github.com/typhoeus/typhoeus
The citoid node.js service generates citation data given a URL, DOI,
ISBN, PMID, PMCID or QID. It has a companion extension, Citoid, which
aims to provide use of the citoid service to VisualEditor.

See: https://www.mediawiki.org/wiki/Citoid
7siters is some kind of link and domain analysis database operating
their own spider.

See: https://7ooo.ru/siters/
sqlmap is an open source penetration testing tool that automates the
process of detecting and exploiting SQL injection flaws and taking over
database servers.

---

This is definitely not a human user agent. In fact, you would be
well advised to ban any IP address that makes requests declaring this
user agent!

See: https://github.com/sqlmapproject/sqlmap
Pattern is a web mining module for the Python programming language.

---

This seems to be an academic spider. The spider's user agent looks
like this in my logs:

    Pattern/2.6 +http://www.clips.ua.ac.be/pattern

Because the word "Pattern" is not very distinctive and could appear in
a legitimate human user's user agent, I suggest anchoring the string
to the beginning of the line and matching at least one digit for the
version.

See: https://www.clips.uantwerpen.be/pattern
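The anchoring suggested above can be sketched in Python. The regular expression `^Pattern/\d` and the helper name `is_pattern_bot` are assumptions made for illustration; the exact pattern added to the patterns file is not quoted in this message, and the second user agent below is made up.

```python
import re

# Hypothetical pattern: anchored to the start of the string and
# requiring at least one digit of the version after the slash.
PATTERN_BOT = re.compile(r"^Pattern/\d")


def is_pattern_bot(user_agent: str) -> bool:
    """Return True if the user agent starts with 'Pattern/<digit>'."""
    return PATTERN_BOT.search(user_agent) is not None


# The user agent seen in the logs.
print(is_pattern_bot("Pattern/2.6 +http://www.clips.ua.ac.be/pattern"))
# A made-up browser-like string that merely contains the word "Pattern".
print(is_pattern_bot("Mozilla/5.0 (compatible; PatternLock 1.0)"))
```

Without the `^` anchor and the digit, the second string would be flagged as a robot as well.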
Apparently the Turnitin.com plagiarism scanning service uses both
the TurnitinBot and Turnitin user agents. Right now COUNTER-Robots
does not block the second one.

See: https://turnitin.com/robot/crawlerinfo.html
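A minimal sketch of the gap described above, assuming that patterns are applied as case-insensitive regular expression searches against the user agent. The pattern lists, the helper `matches_any`, and both sample user agent strings are illustrative assumptions, not copies of the real patterns file or of real log entries.

```python
import re


def matches_any(user_agent: str, patterns: list[str]) -> bool:
    """Return True if any pattern is found in the user agent (case-insensitive)."""
    return any(re.search(p, user_agent, re.IGNORECASE) for p in patterns)


# Hypothetical pattern lists: before and after adding "Turnitin".
old_patterns = ["TurnitinBot"]
new_patterns = ["TurnitinBot", "Turnitin"]

# Illustrative user agents, one for each name the service uses.
ua_bot = "TurnitinBot/3.0"
ua_plain = "Turnitin"

print(matches_any(ua_bot, old_patterns))    # covered by the existing pattern
print(matches_any(ua_plain, old_patterns))  # the gap: shorter name is missed
print(matches_any(ua_plain, new_patterns))  # covered once "Turnitin" is added
```

Because "Turnitin" is a prefix of "TurnitinBot", the shorter pattern alone would already cover both names under this matching assumption.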
Drupal is a content management system, not a user. Sometimes people
write plugins that perform harvesting of content into the CMS so we
should ignore these requests. Drupal uses the following user agent:

    Drupal (+http://drupal.org/)
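As a sketch of what matching this user agent involves: the string contains regular expression metacharacters (`(`, `+`, `.`, `)`), so it cannot be pasted into a patterns file as-is. The prefix pattern `^Drupal` and the helper `compiles` below are assumptions for illustration, not necessarily what this commit adds.

```python
import re

DRUPAL_UA = "Drupal (+http://drupal.org/)"


def compiles(pattern: str) -> bool:
    """Return True if the string is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# The raw user agent is not a valid regex: '+' directly after '('
# has nothing to repeat.
raw_ok = compiles(DRUPAL_UA)

# Escaping the metacharacters yields a pattern for the exact string.
escaped = re.escape(DRUPAL_UA)
literal_match = re.fullmatch(escaped, DRUPAL_UA) is not None

# Hypothetical shorter alternative: match only the anchored product name.
prefix_match = re.search(r"^Drupal", DRUPAL_UA) is not None

print(raw_ok, literal_match, prefix_match)
```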
davidatmire merged commit bf6d432 into atmire:master Jul 22, 2020
alanorth deleted the new-bots-2 branch July 22, 2020 07:29