🚧 Project in development
Distributed robots.txt parser and rule checker accessible through an API. If you are working on a distributed web crawler and you want to be polite in your actions, you will find this project very useful. This project can also be integrated into any SEO tool to check if content is being indexed correctly by robots.
For this first version, we are trying to comply with the specification used by Google to analyze websites. You can see it here. Expect support for other robots specifications soon!
If you are building a distributed web crawler, you know that managing robots.txt rules from websites is a hard task
and can be complicated to maintain in a scalable way. You need to focus on your business requirements. robots.txt
can help by acting as a service that checks whether a given URL resource can be crawled by a specified user agent
(or robot name). It can be easily integrated into existing software through a web API, and it starts working in less than a second!
In order to build this project on your machine, you will need the following installed on your system:
- Java 11 and Kotlin
- Docker
- docker-compose
- make
If you want to test this project locally, you will need Docker, docker-compose, and Make installed on your
system. When done, execute the following command to compile all projects, build the Docker images, and run
them:
👉 Be patient!
$ make start-all
You can execute make logs to see how things have gone.
Now you can send some URLs to the crawler system to download the rules found in the robots.txt file and persist them in the database. For example, you can invoke the crawl API using this command:
$ curl -X POST http://localhost:9081/v1/send \
-d 'url=https://news.ycombinator.com/newcomments' \
-H 'Content-Type: application/x-www-form-urlencoded'
There is also another method in the API to make a crawl request using a GET method. If you want to check all the methods this application exposes, import this Postman collection.
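For illustration, such a GET request might look like the command below. The route and query parameter are assumptions made for this sketch, so check the Postman collection for the actual definition:

# Hypothetical example: the GET route and its query parameter are assumptions
$ curl -G http://localhost:9081/v1/send \
  --data-urlencode 'url=https://news.ycombinator.com/newcomments'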
This command will send the URL to the streaming service; when it is received, the robots.txt file will be downloaded, parsed, and saved into the database.
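For reference, the kind of content the parser extracts rules from looks like the snippet below. It uses the directives described in Google's specification; the file itself is purely illustrative and does not belong to any real site:

# Illustrative robots.txt, not taken from a real website
User-agent: AwesomeBot
Disallow: /private/
Allow: /public/

User-agent: *
Disallow: /search

Sitemap: https://example.com/sitemap.xml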
The next step is to check whether you can access a resource of a known host using a user-agent
directive. For this purpose, you will need to use the checker API. Imagine that you need to check whether your
crawler can access the newest resource from Hacker News. You would execute:
$ curl -X POST http://localhost:9080/v1/allowed \
-d '{"url": "https://news.ycombinator.com/newest","agent": "AwesomeBot"}' \
-H 'Content-Type: application/json'
The response will be:
{
"url":"https://news.ycombinator.com/newest",
"agent":"AwesomeBot",
"allowed":true
}
This is like saying: Hey! You can crawl content from https://news.ycombinator.com/newest
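If you want to call the checker from your own crawler code, a minimal Kotlin sketch using Java 11's built-in HttpClient could look like this. The endpoint and payload match the curl example above; the function name and the naive JSON handling are just illustrations, not part of this project:

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Ask the checker API whether `url` may be crawled with the given agent.
// JSON is built and read by hand here to keep the sketch dependency-free.
fun isAllowed(url: String, agent: String): Boolean {
    val client = HttpClient.newHttpClient()
    val body = """{"url": "$url", "agent": "$agent"}"""
    val request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9080/v1/allowed"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    // The checker answers with {"url": ..., "agent": ..., "allowed": true|false}
    return response.body().contains("\"allowed\":true")
}

fun main() {
    if (isAllowed("https://news.ycombinator.com/newest", "AwesomeBot")) {
        println("Polite to crawl: go ahead")
    } else {
        println("Disallowed by robots.txt: skip this URL")
    }
}

In a real crawler you would reuse a single HttpClient instance and parse the response with a proper JSON library, but the request and response shapes stay the same as in the curl example.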
When you finish your tests, execute the following command to stop and remove all Docker containers:
$ make stop-all
🔥 Happy Hacking! 🔥