nutch-elasticsearch

A prototype system that integrates nutch 2.2.1 with elasticsearch 1.1 and hbase 0.90.4.

Installation

  1. Fire up vagrant:

     vagrant up
     vagrant ssh
     cd /vagrant
    
  2. Download ant, nutch and hbase:

     bin/wget-deps.bash
    
  3. Check that elasticsearch is running:

     curl http://localhost:9200
    
  4. Create the nutch index in elasticsearch:

     curl -XPUT http://localhost:9200/nutch/
    
  5. Run build-nutch.bash to build nutch with ant/ivy and install the config file:

     /vagrant/build-nutch.bash
    
  6. Start hbase:

     /opt/hbase-0.90.4/bin/start-hbase.sh
    
  7. (Optional) Install BigDesk:

    1. Download: https://github.com/lukas-vlcek/bigdesk/tarball/master
    2. Extract BigDesk into /var/www/html/bigdesk
    3. Visit the app: http://localhost:8080/bigdesk/

  8. Run the nutch crawler:

     cd /opt/apache-nutch-2.2.1/runtime/local
     /vagrant/bin/index-url.bash /vagrant/conf/urls.txt
    
  9. Test that the crawled URL is in the crawldb:

     bin/nutch readdb -url `cat urls/urls.txt`
    
  10. Index into elasticsearch:

     bin/nutch elasticindex elasticsearch -all
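
Steps 3 and 4 can also be scripted so that the index is created only once elasticsearch answers, and only if it does not exist yet. This is a minimal sketch and not part of the repository: the function names, the retry count and the ES_WAIT_SECS variable are assumptions made for illustration, while the URL and index name come from the steps above.

```shell
# Sketch: wait for elasticsearch, then create the index only if it is missing.
ES_URL="${ES_URL:-http://localhost:9200}"
INDEX="${INDEX:-nutch}"

# Print the URL of the index, e.g. http://localhost:9200/nutch/
index_url() {
  printf '%s/%s/' "$ES_URL" "$INDEX"
}

# Print the HTTP status code of a request ("000" if the connection fails).
http_status() {
  curl -s --max-time 5 -o /dev/null -w '%{http_code}' "$@" || true
}

# Poll the root endpoint until it returns 200; give up after $1 tries (default 30).
wait_for_es() {
  tries="${1:-30}"
  while [ "$tries" -gt 0 ]; do
    [ "$(http_status "$ES_URL")" = "200" ] && return 0
    sleep "${ES_WAIT_SECS:-2}"
    tries=$((tries - 1))
  done
  return 1
}

# A HEAD request on the index returns 200 if it exists and 404 if it does not.
ensure_index() {
  if [ "$(http_status -I "$(index_url)")" = "200" ]; then
    echo "index $INDEX already exists"
  else
    curl -s -XPUT "$(index_url)" && echo
  fi
}
```

After sourcing the snippet, `wait_for_es && ensure_index` replaces the two manual curl calls.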
    

Helpful Information

  • The crawldb is stored in hbase, in the 'webpage' table.

nutch commands

Simplest crawling:

cd runtime/local
echo "http://www.kusiri.com" > urls/urls.txt
bin/nutch inject urls/
bin/nutch generate -topN 1
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb

bin/nutch elasticindex elasticsearch -all
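
To crawl more than one level deep, the generate/fetch/parse/updatedb cycle above can be repeated before indexing. The following is a hedged sketch: the crawl_rounds function and the NUTCH variable are hypothetical names, and by default the script only prints the commands it would run. Set NUTCH=bin/nutch (from runtime/local) to execute them.

```shell
# Sketch: repeat the crawl cycle, then index everything into elasticsearch.
# NUTCH defaults to a dry run that echoes each command instead of running it.
NUTCH="${NUTCH:-echo bin/nutch}"

# usage: crawl_rounds <rounds> <topN>
crawl_rounds() {
  rounds="$1"
  top_n="$2"
  # Seed the crawldb from the urls/ directory.
  $NUTCH inject urls/ || return 1
  i=1
  while [ "$i" -le "$rounds" ]; do
    # Each round selects up to topN URLs, fetches and parses them,
    # then writes newly discovered links back to the crawldb.
    $NUTCH generate -topN "$top_n" || return 1
    $NUTCH fetch -all || return 1
    $NUTCH parse -all || return 1
    $NUTCH updatedb || return 1
    i=$((i + 1))
  done
  $NUTCH elasticindex elasticsearch -all
}
```

For example, `crawl_rounds 2 50` prints (or runs) one inject, two full cycles with `-topN 50`, and a final elasticindex.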

elasticsearch sense commands

GET /nutch/_search
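
If Sense is not available, the same search can be sent with curl. This is a small sketch under assumptions: the function names are made up for illustration, and the match_all query with a size limit is just one possible request body.

```shell
# Sketch: run the Sense query above from the command line.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build a match_all search body that returns at most $1 hits.
search_body() {
  printf '{"query":{"match_all":{}},"size":%d}' "$1"
}

# Search the nutch index and pretty-print the response (default: 5 hits).
search_nutch() {
  curl -s -XGET "$ES_URL/nutch/_search?pretty" -d "$(search_body "${1:-5}")"
}
```

Calling `search_nutch 3` should return up to three indexed pages once the elasticindex step has run.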

hbase commands

Open a shell (the installation steps above put hbase under /opt/hbase-0.90.4):

/opt/hbase-0.90.4/bin/hbase shell

Get help (from inside the shell):

hbase(main):001:0> help

List all tables:

hbase(main):001:0> list

Delete (i.e. disable then drop) the 'webpage' table:

hbase(main):002:0> disable 'webpage'
hbase(main):003:0> drop 'webpage'

Leave the shell:

hbase(main):002:0> exit
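
The shell also reads commands from standard input, which is handy for a quick check without an interactive session. A minimal sketch, assuming hbase lives at the path used in the installation steps; the HBASE variable and the function names are hypothetical.

```shell
# Sketch: list tables and count the rows in 'webpage' non-interactively.
HBASE="${HBASE:-/opt/hbase-0.90.4/bin/hbase}"

# Emit the shell commands, one per line.
hbase_commands() {
  printf '%s\n' "list" "count 'webpage'" "exit"
}

# Pipe the commands into the hbase shell.
inspect_webpage() {
  hbase_commands | "$HBASE" shell
}
```

Run `inspect_webpage` after a crawl; a non-zero count suggests that nutch has written pages to the table.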

elasticsearch commands

Create an index:

curl -XPUT 'http://localhost:9200/twitter/'

Node stats:

curl -XGET 'http://localhost:9200/_nodes/stats'

Troubleshooting

ClusterBlockException

[vagrant@localhost local]$ bin/nutch elasticindex elasticsearch -all
Exception in thread "elasticsearch[Caiera][generic][T#2]" org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:138)
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:128)
    at org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:197)
    at org.elasticsearch.action.bulk.TransportBulkAction.access$000(TransportBulkAction.java:65)
    at org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(TransportBulkAction.java:143)
    at org.elasticsearch.action.support.TransportAction$ThreadedActionListener$2.run(TransportAction.java:117)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

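The exception means that the indexing client found no master node and could not read the cluster state. Check the following:

  • Your elasticsearch configuration is correct (for example, the cluster name passed to elasticindex matches the name of the running cluster).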

  • Your firewall is disabled:

      sudo service iptables stop
      sudo chkconfig iptables off
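
To see which cluster is actually running and what state it is in, the cluster health endpoint can be queried. The sketch below is illustrative only: json_field and check_cluster are hypothetical helper names, and the sed expression is a shortcut that works for the flat health response, not a general JSON parser.

```shell
# Sketch: print the cluster name and health status reported by elasticsearch.
ES_URL="${ES_URL:-http://localhost:9200}"

# Extract the string value of JSON key $1 from the text on stdin.
json_field() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Fetch _cluster/health and print the two fields that matter here.
check_cluster() {
  health="$(curl -s --max-time 5 "$ES_URL/_cluster/health")" || return 1
  echo "cluster_name: $(printf '%s' "$health" | json_field cluster_name)"
  echo "status: $(printf '%s' "$health" | json_field status)"
}
```

Compare the printed cluster_name with the name given to `bin/nutch elasticindex` (elasticsearch in the commands above). If `check_cluster` fails or times out, elasticsearch itself is not reachable.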
    
