Quick Start
You will need git, Maven 3 (Maven 2 will not work), and a Java JDK, version 8 or later. Oracle Java or OpenJDK should work, but a full installation is needed, as the headless versions may fail depending on the content being processed and the run-time configuration (see #295).
NOTE: Despite the name, this is not a very quick start, as the Maven build is quite large.
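Before starting the build, it can help to confirm that the tools this guide relies on are installed. The following is a minimal sketch, not part of the project: the function names check_tool and check_prereqs are hypothetical, and the tool list is simply the commands used on this page.

```shell
#!/bin/sh
# Print whether a single command is available on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Check every tool used in this guide; returns non-zero if any is missing.
check_prereqs() {
    missing=0
    for tool in git mvn java curl unzip; do
        check_tool "$tool" || missing=1
    done
    return "$missing"
}

check_prereqs || echo "Install the missing tools before continuing."
```

This only checks that the commands exist. To confirm the versions, run java -version and mvn --version and compare them with the requirements above.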
Check out this repository,
$ git clone git@github.com:ukwa/webarchive-discovery.git
or, if ssh is a problem, via
$ git clone https://github.com/ukwa/webarchive-discovery.git
then change into the root folder:
$ cd webarchive-discovery
In principle, this should all work with the most recent versions of Solr, but as we are currently only able to run Java 7 at the UK Web Archive, this project is built against Solr 5. To avoid any compatibility issues, download version 5.5.4 from the Apache archive, for example from a shell:
$ curl -O http://archive.apache.org/dist/lucene/solr/5.5.4/solr-5.5.4.zip
Then you can unpack it and start Solr running:
$ unzip solr-5.5.4.zip
$ cd solr-5.5.4/
$ ./bin/solr start
This will fire up a suitable Solr instance, with a UI at http://localhost:8983/.
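If these steps are being scripted, it may be useful to wait until Solr actually answers HTTP requests before creating the core. The helper below is a hypothetical sketch (wait_until is not part of Solr or this project); it retries any command once per second until it succeeds.

```shell
#!/bin/sh
# Retry a command once per second until it succeeds or attempts run out.
# Usage: wait_until ATTEMPTS COMMAND [ARGS...]
wait_until() {
    attempts=$1
    shift
    while [ "$attempts" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        # Only pause if another attempt will follow.
        if [ "$attempts" -gt 0 ]; then
            sleep 1
        fi
    done
    return 1
}

# Example (assumes the Solr web application is served under /solr/):
#   wait_until 30 curl -s -f http://localhost:8983/solr/
```

The /solr/ path in the example is an assumption based on the default Solr 5 layout; adjust it if your instance is configured differently.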
To create a Solr core based on the current schema, you can use:
$ ./bin/solr create_core -c discovery -d ../warc-indexer/src/main/solr/solr/discovery
The core should now be accessible at http://localhost:8983/#/discovery.
Note that if there have been schema changes, you will need to wipe and re-create the core, e.g.
$ ./bin/solr stop
$ rm -fr server/solr/discovery
$ ./bin/solr start
$ ./bin/solr create_core -c discovery -d ../warc-indexer/src/main/solr/solr/discovery
For configuring a front-end client, the Solr endpoint is http://localhost:8983/solr/discovery/select, e.g. a match-all query (q=*:*) with wt=json should return all results in JSON format. Of course, right now there will be no results, as we have not indexed anything. Let's change that.
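As a rough illustration of querying the select endpoint from a shell, the sketch below builds a query URL and fetches it with curl. It assumes the core lives under the same /solr/discovery base URL that the indexer is pointed at; the names SOLR_CORE_URL, select_url and run_query are hypothetical.

```shell
#!/bin/sh
# Base URL of the core (assumption: same base URL as passed to the indexer).
SOLR_CORE_URL="${SOLR_CORE_URL:-http://localhost:8983/solr/discovery}"

# Build a select URL for a query string that is already URL-safe.
select_url() {
    echo "${SOLR_CORE_URL}/select?q=$1&wt=json&indent=true"
}

# Fetch the JSON results for a query (needs a running Solr instance).
run_query() {
    curl -s "$(select_url "$1")"
}

# Show the command for a match-all query without contacting Solr.
echo "Try: curl -s '$(select_url '*:*')'"
```

Calling run_query '*:*' against an empty core should return a JSON response with no documents. Query strings containing spaces or other special characters would need URL-encoding first, which this sketch does not attempt.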
In webarchive-discovery, run:
$ mvn clean install
If this is taking too long due to the tests, you can use
$ mvn clean install -DskipTests
instead.
Occasional snapshots of the production code are available here. However, the last snapshot is rather out of date, so pre-built binaries are not available right now. We'll make binaries available when things have settled down a bit.
To index a test WARC, run
$ cd warc-indexer
$ java -jar target/warc-indexer-*-jar-with-dependencies.jar -s http://localhost:8983/solr/discovery/ src/test/resources/wikipedia-mona-lisa/flashfrozen-jwat-recompressed.warc.gz
This will populate the Solr index with a few resources from a snapshot of the English Wikipedia page about the Mona Lisa.
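One way to confirm from the command line that documents were indexed is to ask Solr for a match-all query with no rows and read the numFound field of the JSON response. The sketch below is an assumption-laden example rather than a project tool: num_found and count_docs are hypothetical helpers, and the sed pattern is a simple text match, not a real JSON parser.

```shell
#!/bin/sh
# Extract the numFound value from a Solr JSON response read on stdin.
num_found() {
    sed -n 's/.*"numFound" *: *\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Run a match-all query with rows=0 against a core and print the hit count.
# Usage: count_docs CORE_BASE_URL
count_docs() {
    curl -s "$1/select?q=*:*&rows=0&wt=json" | num_found
}

# Example (needs the running Solr instance from the steps above):
#   count_docs http://localhost:8983/solr/discovery
```

A count greater than zero suggests the indexer reached the core; the exact number depends on the contents of the test WARC.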
At this point your Solr service should be running on port 8983 and the Mona Lisa data should have been indexed. The Solr UI at http://localhost:8983/#/discovery should look like the image below.
By selecting the Query action in the left-hand column (highlighted in the image below) and then clicking the blue 'Execute Query' button, you can see the indexed data. Also highlighted are the number of documents found and the start position of the results. Finally, at the top of the image, the URL of the executed query can be seen, showing the settings that are used by default.
See Using the Solr query UI for more information.