A prototype system to integrate nutch 2.2 with elasticsearch 1.1 and hbase 0.90.
-
Fire up vagrant:
vagrant up
vagrant ssh
cd /vagrant
-
Download ant, nutch and hbase:
bin/wget-deps.bash
-
Check elasticsearch is running:
curl http://localhost:9200
-
Create an index in elasticsearch:
curl -XPUT http://localhost:9200/nutch/
-
Run build-nutch.bash to build using ant/ivy and install the config file:
/vagrant/build-nutch.bash
-
Start hbase:
/opt/hbase-0.90.4/bin/start-hbase.sh
-
(Optional) Install BigDesk:
- Download: https://github.com/lukas-vlcek/bigdesk/tarball/master
- Extract BigDesk into /var/www/html/bigdesk
- Visit the app: http://localhost:8080/bigdesk/
-
Run the nutch crawler:
cd /opt/apache-nutch-2.2.1/runtime/local
/vagrant/bin/index-url.bash /vagrant/conf/urls.txt
-
Test:
bin/nutch readdb -url `cat urls/urls.txt`
-
Index into elasticsearch:
bin/nutch elasticindex elasticsearch -all
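The indexing step can only succeed once elasticsearch answers on port 9200. A minimal sketch of a wait loop follows. wait_for_http is a hypothetical helper, not part of this repository, and the attempt count and delay are arbitrary defaults.

```shell
# Sketch: poll a URL until it answers or the attempts run out.
# wait_for_http is a hypothetical helper, not part of this repository.
wait_for_http() {
  url="$1"
  attempts="${2:-30}"   # assumed default: 30 tries
  delay="${3:-2}"       # assumed default: 2 seconds between tries
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s: silent, -f: exit non-zero on HTTP errors (status >= 400)
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

One possible use is to gate the indexing step on it: wait_for_http http://localhost:9200 && bin/nutch elasticindex elasticsearch -all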
- the crawldb is stored in hbase.
Simplest crawling:
cd runtime/local
echo "http://www.kusiri.com" > urls/urls.txt
bin/nutch inject urls/
bin/nutch generate -topN 1
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb
bin/nutch elasticindex elasticsearch -all
Search the index:
GET /nutch/_search
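The crawl steps above can be wrapped in a small function so that a failing step stops the round. This is only a sketch: crawl_once and the NUTCH variable are names made up for this example, and it assumes the current directory is runtime/local with a seed file already in urls/.

```shell
# Sketch of one crawl round; run from runtime/local.
# NUTCH and crawl_once are assumptions for this example, not part of the repo.
NUTCH="${NUTCH:-bin/nutch}"

crawl_once() {
  seed_dir="${1:-urls/}"
  top_n="${2:-1}"
  # Chain with && so the round stops at the first failing step.
  "$NUTCH" inject "$seed_dir" &&
  "$NUTCH" generate -topN "$top_n" &&
  "$NUTCH" fetch -all &&
  "$NUTCH" parse -all &&
  "$NUTCH" updatedb &&
  "$NUTCH" elasticindex elasticsearch -all
}
```

Calling crawl_once urls/ 1 runs the same six commands as listed above. The search can then be issued from the shell with curl 'http://localhost:9200/nutch/_search?pretty'.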
Open a shell:
~/hbase/bin/hbase shell
Get help (from inside the shell):
hbase(main):001:0> help
List all tables:
hbase(main):001:0> list
Delete (i.e. disable then drop) the 'webpage' table:
hbase(main):002:0> disable 'webpage'
hbase(main):004:0> drop 'webpage'
Leave the shell:
hbase(main):002:0> exit
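The same disable and drop sequence can probably be scripted by piping commands into the shell on stdin, which is handy when resetting the crawl between runs. This is a sketch under that assumption: drop_table_commands, drop_table and the HBASE variable are hypothetical names, and the default path simply mirrors the one used above.

```shell
# Sketch: drive the HBase shell non-interactively by piping commands on stdin.
# HBASE and both function names are assumptions for this example.
HBASE="${HBASE:-$HOME/hbase/bin/hbase}"

drop_table_commands() {
  # Emit the same disable/drop/exit sequence typed by hand above.
  printf "disable '%s'\ndrop '%s'\nexit\n" "$1" "$1"
}

drop_table() {
  drop_table_commands "$1" | "$HBASE" shell
}
```

For example, drop_table webpage would send the three commands for the 'webpage' table. Running drop_table_commands webpage on its own prints them without touching HBase.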
Create an index named twitter:
curl -XPUT 'http://localhost:9200/twitter/'
Get node statistics:
curl -XGET 'http://localhost:9200/_nodes/stats'
[vagrant@localhost local]$ bin/nutch elasticindex elasticsearch -all
Exception in thread "elasticsearch[Caiera][generic][T#2]" org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:138)
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:128)
at org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:197)
at org.elasticsearch.action.bulk.TransportBulkAction.access$000(TransportBulkAction.java:65)
at org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(TransportBulkAction.java:143)
at org.elasticsearch.action.support.TransportAction$ThreadedActionListener$2.run(TransportAction.java:117)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
If indexing fails with the ClusterBlockException above, check the following:
-
Your elasticsearch configuration is correct.
-
Your firewall is disabled:
sudo service iptables stop
sudo chkconfig iptables off
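Since the exception reports "no master" and "state not recovered", it may also help to look at the cluster health before re-running the indexer. The sketch below reads the status field from the _cluster/health endpoint. health_status, cluster_status and ES_URL are names invented for this example, and the sed expression is a rough stand-in for a real JSON parser.

```shell
# Sketch: read the cluster status from the _cluster/health endpoint.
# ES_URL and both function names are assumptions for this example.
ES_URL="${ES_URL:-http://localhost:9200}"

# Pull the "status" field out of a health response read from stdin.
# sed keeps the example dependency-free; a JSON parser would be more robust.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

cluster_status() {
  curl -s "$ES_URL/_cluster/health" | health_status
}
```

If cluster_status prints nothing, or prints red, the cluster is probably not ready to accept bulk requests yet.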
- NutchTutorial
- Nutch2Tutorial
- Nutch 2 and ElasticSearch - helpful blog post
- Integrating Nutch 1.7 with ElasticSearch
- NUTCH-1745 - Upgrade to ElasticSearch 1.1.0
- 1.2. Quick Start - hbase user manual
- Hbase/Shell - from the Hadoop wiki