Clone this project with Git.
Then go into the project folder and type:
mvn clean install
The project can be run in the following ways.
To crawl a website and gather every link on it, type:
java -jar target/JohnnyDoop.jar -crawl 'http://mywebsite.com' 'myResults.txt' 'depth'
(Replace 'depth' with the crawl depth as a number; the recommended depth is 2.)
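To illustrate what the depth argument controls, here is a minimal sketch of a depth-limited, breadth-first crawl. It is not the project's actual crawler: the class name DepthLimitedCrawlSketch, the crawl method and the in-memory link map (which stands in for downloading pages and extracting their links) are all assumptions made for the example. It also assumes that depth 0 means the start page only and that each extra level follows one more hop of links.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DepthLimitedCrawlSketch {

    // Hypothetical helper: 'links' maps a page to the links found on it and
    // replaces real HTTP fetching, so the example runs without a network.
    static Set<String> crawl(Map<String, List<String>> links, String start, int maxDepth) {
        Set<String> seen = new LinkedHashSet<>(); // keeps discovery order, drops duplicates
        Deque<String> frontier = new ArrayDeque<>();
        seen.add(start);
        frontier.add(start);

        // Each pass of this loop follows the links one hop further from the start page.
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            Deque<String> next = new ArrayDeque<>();
            for (String page : frontier) {
                for (String target : links.getOrDefault(page, List.of())) {
                    if (seen.add(target)) { // true only the first time a link is seen
                        next.add(target);
                    }
                }
            }
            frontier = next;
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/c"),
                "/b", List.of("/"),
                "/c", List.of("/d"));
        // With depth 2, "/d" is three hops away and is not reached.
        System.out.println(crawl(site, "/", 2)); // prints [/, /a, /b, /c]
    }
}
```

Each extra level can multiply the number of pages visited, which is why a small depth keeps a crawl manageable.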
To rank the links with the PageRank algorithm using Hadoop, type:
java -jar target/JohnnyDoop.jar -rank myUrls.txt
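For readers new to PageRank, the idea behind the ranking step can be sketched as a small in-memory power iteration. This is not the project's Hadoop job: the class name PageRankSketch, the rank method, the damping factor of 0.85 (a conventional choice) and the even redistribution of rank from pages without outgoing links are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class PageRankSketch {

    // Hypothetical helper: 'links' maps each page to the pages it links to.
    static Map<String, Double> rank(Map<String, List<String>> links, int iterations, double damping) {
        // Collect every page that appears as a source or as a target.
        Set<String> pages = new TreeSet<>(links.keySet());
        for (List<String> targets : links.values()) {
            pages.addAll(targets);
        }
        int n = pages.size();

        // Start from a uniform distribution.
        Map<String, Double> ranks = new HashMap<>();
        for (String page : pages) {
            ranks.put(page, 1.0 / n);
        }

        for (int i = 0; i < iterations; i++) {
            Map<String, Double> received = new HashMap<>();
            for (String page : pages) {
                received.put(page, 0.0);
            }
            double dangling = 0.0; // rank held by pages with no outgoing links
            for (String page : pages) {
                List<String> out = links.getOrDefault(page, List.of());
                if (out.isEmpty()) {
                    dangling += ranks.get(page);
                    continue;
                }
                // A page splits its current rank evenly among its outgoing links.
                double share = ranks.get(page) / out.size();
                for (String target : out) {
                    received.merge(target, share, Double::sum);
                }
            }
            Map<String, Double> next = new HashMap<>();
            for (String page : pages) {
                // Damped update; dangling rank is spread evenly over all pages.
                next.put(page, (1 - damping) / n + damping * (received.get(page) + dangling / n));
            }
            ranks = next;
        }
        return ranks;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "A", List.of("C"),
                "B", List.of("C"),
                "C", List.of("A"));
        // "B" has no incoming links, so it ends up with the lowest rank.
        System.out.println(rank(links, 50, 0.85));
    }
}
```

A MapReduce version would typically split the same update into a map step that emits each page's shares and a reduce step that sums them per target page.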
To rank the links using Pig, type:
pig -x local -p FILE=myResults.txt pig/SpiderPig.pl