This demo is part of a Search webinar.
The webinar recording and slides are available at http://hortonworks.com/partners/learn
- Apache Solr + Lucidworks Connector Document Search
- Apache Solr provides a REST-like interface for searching indexed data.
- The search syntax follows the pattern of 'field:search query'. Those fields correspond to the schema defined in the Apache Solr Core.
- Lucidworks provides a connector allowing users to index multi-structured document data such as PDF, Docx, VSD, and XLSX.
- The sample documents used in this demo are the PDFs from docs.hortonworks.com, but instructions are provided so that users can index and search their own documents.
- Authors: Paul Codding, Joseph Niemiec, Piotr Pruski. Automation of setup via Ambari services and views: Ali Bajwa
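The 'field:search query' pattern described above can be sketched as a small shell helper. This is a minimal sketch, not part of the demo scripts: the function names are made up for illustration, and the field name `title` in the usage line is a placeholder that may not exist in the rawdocs schema, so substitute a field defined in your Solr core. It assumes the phrase itself contains no double quotes.

```shell
# Build a Solr 'field:"phrase"' query string.
# Quoting the phrase keeps a multi-word search together as one phrase.
# $1 = field name (hypothetical, must exist in the core's schema), $2 = phrase
solr_field_query() {
  printf '%s:"%s"' "$1" "$2"
}

# Send the query to the rawdocs core and ask for indented JSON.
# curl -G appends the --data-urlencode values to the URL as encoded parameters.
solr_search() {
  curl -sG 'http://sandbox.hortonworks.com:8983/solr/rawdocs/select' \
    --data-urlencode "q=$(solr_field_query "$1" "$2")" \
    --data-urlencode 'wt=json' \
    --data-urlencode 'indent=true'
}

# Example (requires the sandbox to be running; 'title' is a placeholder field):
# solr_search title 'Getting Started'
```

Letting curl do the URL encoding avoids hand-escaping the colon, quotes, and spaces in the query.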
These setup steps are only needed the first time.
- Download the HDP 2.2 sandbox VM image (Sandbox_HDP_2.2_VMware.ova) from the Hortonworks website
- Import Sandbox_HDP_2.2_VMware.ova into VMware and allocate at least 8 GB of RAM to the VM
- Find the IP address of the VM and add an entry to your machine's hosts file, e.g.
192.168.191.241 sandbox.hortonworks.com sandbox
- Connect to the VM via SSH (password: hadoop)
ssh root@sandbox.hortonworks.com
- Pull the latest code and sample documents, and set up the Solr and 'Doc Crawler' Ambari stacks and the 'Doc Crawler' view
cd /root
git clone https://github.com/abajwa-hw/search-demo.git
~/search-demo/run_demo.sh
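The hosts-file step in the setup list above can be made repeatable, so that running it twice does not add a duplicate entry. The helper below is a sketch under a few assumptions: the function name is invented for illustration, the file defaults to /etc/hosts (which needs root privileges to modify), and any existing line mentioning sandbox.hortonworks.com, even a commented-out one, is treated as already present.

```shell
# Append a sandbox entry to a hosts file unless one is already there.
# $1 = IP address of the VM, $2 = hosts file path (defaults to /etc/hosts)
add_hosts_entry() {
  hosts_file="${2:-/etc/hosts}"
  if grep -q 'sandbox\.hortonworks\.com' "$hosts_file" 2>/dev/null; then
    echo "entry already present in $hosts_file"
  else
    printf '%s sandbox.hortonworks.com sandbox\n' "$1" >> "$hosts_file"
  fi
}

# Example, using the sample IP from the step above (run as root):
# add_hosts_entry 192.168.191.241
```

If the VM's IP address changes later, the old line has to be edited by hand, since the helper only checks for the hostname.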
- After the script completes, log in to Ambari (http://sandbox.hortonworks.com:8080) and add the Solr service from the 'Actions' dropdown menu in the bottom left of the Ambari dashboard.
- Next, add the "Document Crawler" service the same way
- This will install and start the Document Crawler
- Tail the log file to get detailed status. When you see 'Binding to /0.0.0.0:9090', the app is up
tail -f /var/log/doc-crawler.log
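Instead of watching the tail output by hand, the wait for the 'Binding to /0.0.0.0:9090' line can be scripted. The function below is a sketch with an invented name, and it polls the file once per second up to a timeout. One caveat: if the log still contains the line from an earlier start of the service, the function returns immediately.

```shell
# Poll a log file until it contains a fixed string, or give up after a timeout.
# $1 = string to look for, $2 = log file, $3 = timeout in seconds (default 300)
# Returns 0 when the string is found, 1 on timeout.
wait_for_line() {
  elapsed=0
  limit="${3:-300}"
  until grep -qF "$1" "$2" 2>/dev/null; do
    [ "$elapsed" -ge "$limit" ] && return 1
    sleep 1
    elapsed=$((elapsed + 1))
  done
}

# Example:
# wait_for_line 'Binding to /0.0.0.0:9090' /var/log/doc-crawler.log && echo 'app is up'
```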
- Once the service is up, you can access the demo from within Ambari via the "Document Crawler" view or by opening http://sandbox.hortonworks.com:9090
- You can also access the Solr web app at the URL below and try some queries: http://sandbox.hortonworks.com:8983/solr/#/rawdocs
- Notice that the metadata of each document appears in the result of the query and that the "body_s" field contains the entire document
- You can also run the same query in the browser or via a programmatic HTTP request. Try changing the output format by altering the wt parameter in the URL, e.g. &wt=json or &wt=xml: http://sandbox.hortonworks.com:8983/solr/rawdocs/select?q=Getting+Started&wt=json&indent=true
- To see code snippets of how the javascript calls are made to query Solr, refer to:
- Once the demo has been set up the first time, restarting it (e.g. after a VM reboot) only requires starting the "Document Crawler" service from Ambari and opening the "Document Crawler" view
- After the Document Crawler is working, you can upload your own zip of documents to the VM (e.g. via FTP) and run the commands below to clear the HDFS directory and build a new index from the new zip
/bin/rm -rf ~/search-demo/search-docs/*
unzip /path/to/my/docs.zip -d ~/search-demo/search-docs/
~/search-demo/regenerate_solr.sh clean
Now go back to the Document Crawler view and run some queries
- Alternatively, you can create an HDFS mount on your Mac and drag and drop the documents directly into the /user/solr/data/rfi_raw directory in HDFS. In this scenario, run the script without the 'clean' argument so that it only runs the MapReduce job and does not clean HDFS
~/search-demo/regenerate_solr.sh
Now go back to the Document Crawler view and run some queries
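Because the re-indexing commands above delete the existing sample documents before unpacking the new zip, a mistyped zip path would leave the directory empty. The wrapper below is a sketch that checks the archive first. The function name and the optional second argument are invented for illustration, and it assumes the `unzip` tool on the sandbox supports the `-t` (test archive) option, as Info-ZIP's unzip does.

```shell
# Replace the demo's documents with the contents of a zip, then rebuild the index.
# $1 = path to the zip, $2 = demo directory (defaults to ~/search-demo)
reindex_from_zip() {
  zip_path="$1"
  demo_dir="${2:-$HOME/search-demo}"
  if [ ! -f "$zip_path" ]; then
    echo "zip not found: $zip_path" >&2
    return 1
  fi
  # Check archive integrity before deleting anything.
  unzip -tq "$zip_path" >/dev/null || return 1
  # ${demo_dir:?} aborts if the variable is empty, so rm never runs against /search-docs.
  rm -rf "${demo_dir:?}/search-docs/"*
  unzip -q "$zip_path" -d "$demo_dir/search-docs/"
  "$demo_dir/regenerate_solr.sh" clean
}

# Example:
# reindex_from_zip /path/to/my/docs.zip
```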
- Create a Banana dashboard webapp. Banana should be accessible at the URL below, showing the default starting page: http://sandbox.hortonworks.com:8983/solr/banana/src/index.html#/dashboard
- As an example, you can refer to the Twitter Banana dashboard here
- If you need to remove the Solr and Document Crawler stacks from Ambari in the future, run the commands below and then restart Ambari:
curl -u admin:admin -i -H 'X-Requested-By: ambari' -X DELETE http://sandbox.hortonworks.com:8080/api/v1/clusters/Sandbox/services/SOLR
curl -u admin:admin -i -H 'X-Requested-By: ambari' -X DELETE http://sandbox.hortonworks.com:8080/api/v1/clusters/Sandbox/services/DOCCRAWLER
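The two DELETE calls above differ only in the service name, so they can be wrapped in a loop that also prints the HTTP status of each request. This is a sketch with invented function names. It reuses the host, cluster name (Sandbox), and admin:admin credentials from the commands above, so adjust those if your environment differs.

```shell
# Build the Ambari REST URL for a service in the Sandbox cluster.
# $1 = service name as registered in Ambari, e.g. SOLR
ambari_service_url() {
  printf 'http://sandbox.hortonworks.com:8080/api/v1/clusters/Sandbox/services/%s' "$1"
}

# Issue a DELETE for each named service and print the HTTP status code returned.
remove_services() {
  for svc in "$@"; do
    curl -u admin:admin -s -o /dev/null -w "$svc: HTTP %{http_code}\n" \
      -H 'X-Requested-By: ambari' -X DELETE "$(ambari_service_url "$svc")"
  done
}

# Example (then restart Ambari, as noted above):
# remove_services SOLR DOCCRAWLER
```

Printing the status code makes a failed deletion visible without reading the full response headers that `-i` would show.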