🚧 Project in development
Distributed robots.txt parser and rule checker accessible through an API. If you are working on a distributed web crawler and you want to be polite in your actions, you will find this project very useful. This project can also be integrated into any SEO tool to check if content is being indexed correctly by robots.
For this first version, we are trying to comply with the specification used by Google to analyze websites. You can see it here. Expect support for other robots specifications soon!
If you are building a distributed web crawler, you know that managing robots.txt rules from websites is a hard task
and can be complicated to maintain in a scalable way. You need to focus on your business requirements. robots.txt
can help by acting as a service that checks whether a given URL resource can be crawled by a specified user agent
(or robot name). It can be easily integrated into existing software through a web API, and it starts working in less than a second!
In order to build this project on your machine, you will need the following installed on your system:
- Java 11 and Kotlin
- Docker
- docker-compose
- make
If you want to test this project locally, you will need Docker, docker-compose, and Make installed on your
system. When done, execute the following command to compile all projects, build the Docker images, and run
them:
👉 Be patient!
$ make start-all
You can execute make logs to see how things have gone.
Now you can send some URLs to the crawler system to download the rules found in the robots.txt file and persist them in the database. For example, you can invoke the crawl API using this command:
$ curl -X POST http://localhost:9081/v1/send \
-d 'url=https://news.ycombinator.com/newcomments' \
-H 'Content-Type: application/x-www-form-urlencoded'
There is also another method in the API to make a crawl request using a GET method. If you want to check all the methods this application exposes, import this Postman collection.
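For illustration, such a GET request might look like the command below. The route and query parameter are assumptions made for this sketch, so check the Postman collection for the actual definition:

# Hypothetical example: the GET route and its query parameter are assumptions
$ curl -G http://localhost:9081/v1/send \
  --data-urlencode 'url=https://news.ycombinator.com/newcomments'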
This command will send the URL to the streaming service; when it is received, the robots.txt file will be downloaded, parsed, and saved into the database.
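For reference, the kind of content the parser extracts rules from looks like the snippet below. It uses the directives described in Google's specification; the file itself is purely illustrative and does not belong to any real site:

# Illustrative robots.txt, not taken from a real website
User-agent: AwesomeBot
Disallow: /private/
Allow: /public/

User-agent: *
Disallow: /search

Sitemap: https://example.com/sitemap.xml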
The next step is to check whether you can access a resource of a known host using a user-agent
directive. For this purpose, you will need to use the checker API. Imagine that you need to check whether your
crawler can access the newest resource from Hacker News. You would execute:
$ curl -X POST http://localhost:9080/v1/allowed \
-d '{"url": "https://news.ycombinator.com/newest","agent": "AwesomeBot"}' \
-H 'Content-Type: application/json'
The response will be:
{
"url":"https://news.ycombinator.com/newest",
"agent":"AwesomeBot",
"allowed":true
}
This is like saying: Hey! You can crawl content from https://news.ycombinator.com/newest
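If you want to call the checker from your own crawler code, a minimal Kotlin sketch using Java 11's built-in HttpClient could look like this. The endpoint and payload match the curl example above; the function name and the naive JSON handling are just illustrations, not part of this project:

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Ask the checker API whether `url` may be crawled with the given agent.
// JSON is built and read by hand here to keep the sketch dependency-free.
fun isAllowed(url: String, agent: String): Boolean {
    val client = HttpClient.newHttpClient()
    val body = """{"url": "$url", "agent": "$agent"}"""
    val request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9080/v1/allowed"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    // The checker answers with {"url": ..., "agent": ..., "allowed": true|false}
    return response.body().contains("\"allowed\":true")
}

fun main() {
    if (isAllowed("https://news.ycombinator.com/newest", "AwesomeBot")) {
        println("Polite to crawl: go ahead")
    } else {
        println("Disallowed by robots.txt: skip this URL")
    }
}

In a real crawler you would reuse a single HttpClient instance and parse the response with a proper JSON library, but the request and response shapes stay the same as in the curl example.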
When you finish your tests, execute the following command to stop and remove all Docker containers:
$ make stop-all
🔥 Happy Hacking! 🔥