This demo is part of a Search webinar.
The webinar recording and slides are available at
- Apache Solr + Lucidworks Connector Document Search
- Apache Solr provides a REST-like interface for searching indexed data.
- The search syntax follows the pattern of 'field:search query'. Those fields correspond to the schema defined in the Apache Solr Core.
- Lucidworks provides a connector allowing users to index multi-structured document data such as PDF, Docx, VSD, and XLSX.
- The sample documents used in this demo are PDFs, but instructions are provided below so that users can index and search their own documents.
- Authors: Paul Codding, Joseph Niemiec, Piotr Pruski. Automation of setup via Ambari services and views: Ali Bajwa
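The 'field:search query' pattern described above can be sketched as a small shell helper. This is a minimal illustration, not code from the demo: the function name solr_term is hypothetical, and the field body_s is used only because it appears in the demo's query results.

```shell
# Build a Solr query term of the form field:"search query".
# Backslashes and double quotes in the search text are backslash-escaped so
# that a multi-word phrase stays a single quoted term.
# solr_term is a hypothetical helper name, not part of Solr or the demo.
solr_term() {
  # $1 = field name from the Solr schema, $2 = search text
  escaped=$(printf '%s' "$2" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '%s:"%s"\n' "$1" "$escaped"
}

solr_term body_s "hadoop security"
```

The quoting matters because an unquoted multi-word query would apply the field name only to the first word.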
These setup steps are only needed the first time
- Download HDP 2.2 sandbox VM image (Sandbox_HDP_2.2_VMware.ova) from Hortonworks website
- Import Sandbox_HDP_2.2_VMware.ova into VMware and allocate at least 8 GB of RAM to the VM
- Find the IP address of the VM and add an entry to your machine's hosts file, e.g. sandbox
- Connect to the VM via SSH (password hadoop)
- Pull the latest code/sample documents and set up the Solr and 'Doc Crawler' Ambari stacks and the 'Doc Crawler' view
cd /root
git clone
After the script completes, log in to Ambari and add the Solr service from the 'Actions' dropdown menu in the bottom left of the Ambari dashboard:
Next, add the "Document Crawler" service the same way
This will install and start the Document Crawler
Tail the log file to get detailed status. When you see 'Binding to /', the app is up:
tail -f /var/log/doc-crawler.log
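If you would rather script the wait than watch the tail output, a polling loop along these lines may help. It is a sketch only: the function name wait_for_log_line and the timeout values are assumptions, while the log path and the 'Binding to /' marker come from the step above.

```shell
# Poll a log file until a fixed string appears or the timeout expires.
# wait_for_log_line is a hypothetical helper, not part of the demo scripts.
# $1 = log file, $2 = text to look for, $3 = timeout in seconds (default 300)
wait_for_log_line() {
  remaining=${3:-300}
  while [ "$remaining" -gt 0 ]; do
    # -F treats the text as a literal string, -q suppresses output
    if grep -qF -- "$2" "$1" 2>/dev/null; then
      return 0
    fi
    sleep 1
    remaining=$((remaining - 1))
  done
  return 1
}

# Example (run on the sandbox VM):
# wait_for_log_line /var/log/doc-crawler.log 'Binding to /' 600 && echo "app is up"
```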
Once the service is up, you can access the demo from within Ambari via the "Document Crawler" view or by opening
You can also access the Solr webapp at the URL below and try some queries.
- Notice that the metadata of each document appears in the result of the query and that the "body_s" field contains the entire document
- You can also run the same query in the browser or via a programmatic HTTP request. Try changing the output format by altering the wt param in the URL, e.g. &wt=json or &wt=xml
- To see code snippets of how the JavaScript calls are made to query Solr, refer to:
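A programmatic request with a chosen wt value might look like the sketch below. Several things here are assumptions rather than details of this demo: the helper name solr_select, the /select handler path (Solr's usual search handler), port 8983 (Solr's usual default), and the core name mycore, which is a placeholder. curl's -G and --data-urlencode options send the parameters as a URL-encoded query string.

```shell
# Query a Solr core and choose the response format with the wt parameter.
# solr_select is a hypothetical helper, not part of the demo.
# $1 = base URL of the core (assumption: something like http://sandbox:8983/solr/mycore)
# $2 = query in field:search form
# $3 = response format, json or xml (defaults to json)
solr_select() {
  curl -sG "$1/select" \
    --data-urlencode "q=$2" \
    --data-urlencode "wt=${3:-json}"
}

# Example (needs a running Solr; the core name is a placeholder):
# solr_select http://sandbox:8983/solr/mycore 'body_s:hadoop' xml
```

Letting curl encode the parameters avoids hand-escaping spaces and quotes in the query.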
- Once the demo has been set up the first time, to restart it (e.g. after a VM reboot), simply start the "Document Crawler" service from Ambari and open the "Document Crawler" view
- After the Document Crawler is working, you can FTP your own zip of documents to the VM and run the commands below to clear the HDFS directory and build a new index from the new zip
/bin/rm -rf ~/search-demo/search-docs/*
unzip /path/to/my/ -d ~/search-demo/search-docs/
~/search-demo/ clean
Now go back to the Document Crawler view and run some queries
- Alternatively, you can create an HDFS mount on your Mac and drag and drop the documents directly into the /user/solr/data/rfi_raw directory in HDFS. In this scenario, run the script without the 'clean' argument so that it only runs the MapReduce job and does not clean HDFS
Now go back to the Document Crawler view and run some queries
Create a Banana dashboard webapp. Banana should be accessible at the URL below, showing the default starting page
As an example, you can refer to the Twitter Banana dashboard here
- If you need to remove the Solr/Document Crawler stacks from Ambari in the future, run the commands below and then restart Ambari:
curl -u admin:admin -i -H 'X-Requested-By: ambari' -X DELETE
curl -u admin:admin -i -H 'X-Requested-By: ambari' -X DELETE