GitHub - callmeGoldenboy/web-scraping-detector

Prototype for the detection of web scraping

A prototype for detecting web scraping that uses attributes found in the header of a request. Done in collaboration with @amarhod for a bachelor's thesis project.

The aim of the thesis

The aim of the thesis was to explore the landscape of web scraping and determine whether web scraping can be detected. To that end, we also developed a small prototype that uses the information stored in an HTTP request to draw a conclusion about the entity that sent the request.
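As a rough illustration of drawing a conclusion from the information in a request, the sketch below inspects a few header fields. It is not the prototype's actual logic: the function name `header_signals`, the marker list `KNOWN_TOOL_MARKERS`, and the specific checks are all assumptions made for this example.

```python
# Hypothetical substrings that often appear in the User-Agent of HTTP tools.
# The list the prototype really uses is not shown in this README.
KNOWN_TOOL_MARKERS = ("python-requests", "curl", "scrapy", "wget")


def header_signals(headers):
    """Return a list of reasons why a request's headers look automated."""
    # HTTP header names are case-insensitive, so normalise them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    signals = []

    user_agent = normalized.get("user-agent", "").lower()
    if not user_agent:
        signals.append("missing user-agent")
    elif any(marker in user_agent for marker in KNOWN_TOOL_MARKERS):
        signals.append("tool user-agent")

    # Browsers normally send Accept-Language; many scripts do not.
    if "accept-language" not in normalized:
        signals.append("missing accept-language")

    return signals
```

An empty list means none of these (assumed) checks fired; it does not prove that the request came from a human.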

The different modules

Analyzer - Takes a list of unique IPs (from the batch of logs) and assigns each IP a score from 1 to 4. This is the main method.
Log reader - Reads the logs from a text file.
Detection - Used for testing the prototype. It calls the log reader to process the batch, then calls the analyzer with the list of unique IPs, and finally prints the results for each IP.
Database handler - Handles all communication with the local SQLite DB.
Client info - Returns useful info for a given IP if that IP has made any prior requests, such as its request rate and the number of different user agents it has used.
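The flow described by the modules above (read a batch, collect per-IP info, score each IP) could be sketched as follows. This is a simplified stand-in, not the repository's code: the tab-separated log format, the function names, the thresholds, and the assumption that a higher score means more suspicious are all hypothetical, and the SQLite layer is left out.

```python
from collections import defaultdict


def read_log(lines):
    """Parse hypothetical 'ip<TAB>unix_timestamp<TAB>user_agent' lines."""
    requests = []
    for line in lines:
        ip, timestamp, user_agent = line.rstrip("\n").split("\t")
        requests.append((ip, float(timestamp), user_agent))
    return requests


def client_info(requests):
    """Compute request rate (per second) and distinct user agents per IP."""
    by_ip = defaultdict(list)
    for ip, timestamp, user_agent in requests:
        by_ip[ip].append((timestamp, user_agent))

    info = {}
    for ip, entries in by_ip.items():
        times = [t for t, _ in entries]
        span = max(times) - min(times)
        # A single request (or identical timestamps) has no measurable rate.
        rate = len(entries) / span if span > 0 else 0.0
        info[ip] = {
            "rate": rate,
            "user_agents": len({ua for _, ua in entries}),
        }
    return info


def score(stats, rate_limit=1.0, agent_limit=3):
    """Map per-IP stats to a score from 1 to 4 (thresholds are made up)."""
    result = 1
    if stats["rate"] > rate_limit:
        result += 2
    if stats["user_agents"] > agent_limit:
        result += 1
    return result


def analyze(lines):
    """Run the whole sketch: read the batch, then score every unique IP."""
    return {ip: score(stats) for ip, stats in client_info(read_log(lines)).items()}
```

Calling `analyze` on a list of log lines returns a dictionary that maps each IP to its score, which a test script could then print per IP.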

For privacy reasons, the logs used for testing the prototype are not included in this repository.

Folders and files

scraper
LogReader.py
README.md
__init__.py
analyzer.py
clientinfo.py
databasehandler.py
detection.py
req.db
request.py
user_agents.txt
