
SOLR_External_Fulltext_Search

Brad Bebee edited this page Feb 13, 2020 · 1 revision

This is an example of setting up Blazegraph ExternalFullTextSearch with SOLR 6.1.0.

Install Java 8

SOLR 6.1.0 requires Java 8.

apt-get install python-software-properties
add-apt-repository ppa:webupd8team/java
apt-get update
apt-get install oracle-java8-installer

Download Solr 6.1.0

Download SOLR 6.1.0 and unpack it under /opt. Run the following as root:

cd /opt
wget http://mirrors.ocf.berkeley.edu/apache/lucene/solr/6.1.0/solr-6.1.0.tgz
tar zxf solr-6.1.0.tgz
cd solr-6.1.0

Start SOLR.

root@blazegraph:/opt/solr-6.1.0# ./bin/solr start
Waiting up to 30 seconds to see Solr running on port 8983 [/]  
Started Solr server on port 8983 (pid=16296). Happy searching!

SOLR Setup

Create a directory for the Core

Create the directories for the configuration and data using the default basic_configs.

cd /opt/solr-6.1.0
mkdir -p server/solr/blazegraph/conf
mkdir -p server/solr/blazegraph/data
cp -rf server/solr/configsets/basic_configs/conf/* server/solr/blazegraph/conf/

Create the Core

Now create the new CORE named blazegraph to index the data:

curl -F action=CREATE \
-F name=blazegraph \
-F instanceDir=/opt/solr-6.1.0/server/solr/blazegraph \
-F config=solrconfig.xml \
-F dataDir=data \
http://localhost:8983/solr/admin/cores
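
To confirm that the core was registered, the CoreAdmin STATUS action can be queried. The following is a minimal sketch that assumes SOLR is still listening on localhost:8983; the SOLR_BASE variable and the function names are made up for illustration and are not part of SOLR or Blazegraph.

```shell
#!/bin/bash
# Base URL of the SOLR instance started above (assumption: default port 8983).
SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"

# Build the CoreAdmin STATUS URL for a given core name.
core_status_url() {
  printf '%s/admin/cores?action=STATUS&core=%s&wt=json\n' "$SOLR_BASE" "$1"
}

# Fetch the status document; requires curl and a running SOLR.
core_status() {
  curl -s "$(core_status_url "$1")"
}
```

Calling `core_status blazegraph` should return a JSON document describing the core, including its instanceDir and dataDir. If the entry for blazegraph comes back empty, the CREATE call probably did not succeed.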

SOLR Indexing

The next step is to load the data. In this example, we have written a small shell script that uses a Perl regex to extract the rdfs:label values from data in the N-Triples format and format them as JSON documents to be indexed into SOLR. JSON was chosen because SOLR's loader proved more robust to special characters with JSON than with the CSV representation.

The URI (Subject) is stored in the id field and the text of the English label is stored in the label_t field. The _t suffix means that it is a dynamic SOLR schema field of type text.
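
As a rough illustration of the selection step, the sketch below runs a few sample triples through grep filters that keep only rdfs:label triples tagged exactly @en. The example.org URIs, the sample labels, and the function names are placeholders invented for this example.

```shell
#!/bin/bash
# Keep only rdfs:label triples whose language tag is exactly @en.
# Reads N-Triples on stdin and writes the surviving lines to stdout.
filter_english_labels() {
  grep 'rdf-schema#label' | grep '@en' | grep -v '@en-'
}

# Hypothetical sample data: one @en label, one @en-gb label,
# one @de label, and one triple that is not a label at all.
sample_triples() {
  cat <<'EOF'
<http://example.org/a> <http://www.w3.org/2000/01/rdf-schema#label> "Alpha"@en .
<http://example.org/a> <http://www.w3.org/2000/01/rdf-schema#label> "Alpha (UK)"@en-gb .
<http://example.org/a> <http://www.w3.org/2000/01/rdf-schema#label> "Alpha (DE)"@de .
<http://example.org/a> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Thing> .
EOF
}

# Only the first sample line survives the filters.
sample_triples | filter_english_labels
```

For the surviving line, the indexed document would carry http://example.org/a in the id field and Alpha in the label_t field.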

label2JSON.sh

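#!/bin/bash
# Convert English rdfs:label triples (N-Triples from file $1, or stdin) into a JSON array for SOLR.

echo "[ "
# Keep rdfs:label triples tagged exactly @en; drop regional tags such as @en-gb.
grep "rdf-schema#label" "${1:-/dev/stdin}" | grep "@en" | grep -v "@en-" | \
# $1 is the subject URI and $3 the label text; the comma is only set once a document has been printed.
perl -n -e 'if (/<([^>]+)>[^<]+<([^>]+)>.*\"(.*)\"@.*$/) { printf("%s { \"id\" : \"%s\", \"label_t\":  \"%s\" }\n", $comma, $1, $3); $comma = " , " }'
echo " ]"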

Load the data using the label2JSON.sh script

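Use the SOLR post tool. The relative paths below assume that you run the command from /opt/solr-6.1.0 and that label2JSON.sh has been placed in that directory.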

zcat /data/rdf/rdfdata.nt.gz | \
./label2JSON.sh  | \
./bin/post -type application/json -c blazegraph -out yes -d 

Wait for this to complete. For large data sets, you may want to run it with nohup or wrap it in a script.
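
Once the load has finished, one quick check is to ask the core how many documents it holds. This is a minimal sketch that assumes the blazegraph core at localhost:8983; the variable and helper names are hypothetical, and the count is extracted with grep rather than a JSON parser, so treat it as a rough check only.

```shell
#!/bin/bash
# URL of the core created earlier (assumption: default host and port).
SOLR_CORE_URL="${SOLR_CORE_URL:-http://localhost:8983/solr/blazegraph}"

# Extract the numFound value from a SOLR JSON response read on stdin.
num_found() {
  grep -o '"numFound":[[:space:]]*[0-9][0-9]*' | grep -o '[0-9][0-9]*$'
}

# Match all documents but request zero rows; only the count is of interest.
indexed_count() {
  curl -s "$SOLR_CORE_URL/select?q=*:*&rows=0&wt=json" | num_found
}
```

Running `indexed_count` should print a number close to the count of English labels in the input. A value of 0 suggests that the post step failed or that no lines matched the filters.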

Example Queries

You can now use the ExternalFullTextSearch within your SPARQL queries.

PREFIX fts: <http://www.bigdata.com/rdf/fts#>
SELECT ?res ?score ?snippet WHERE {
  ?res fts:search "Blazegraph" .
  ?res fts:endpoint "http://localhost:8983/solr/blazegraph/select" .
  ?res fts:endpointType  "SOLR" .
  ?res fts:timeout "100000" .
  ?res fts:score ?score .
  ?res fts:snippet ?snippet . 
  ?res fts:params "fl=id,label_t" .
  ?res fts:searchField "id" .
  ?res fts:fieldToSearch "label_t" .
  ?res fts:snippetField "label_t" .
  ?res fts:searchResultType "URI" .
}
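
If the SPARQL query returns no results, it can help to query SOLR directly with the same endpoint and fields, which separates indexing problems from Blazegraph configuration problems. The sketch below is an assumption-laden helper, not part of Blazegraph: the function names are invented, and the search term is placed in the URL without encoding, so it is only suitable for simple single-word terms.

```shell
#!/bin/bash
# Same select endpoint as the fts:endpoint value in the query above.
SOLR_SELECT="${SOLR_SELECT:-http://localhost:8983/solr/blazegraph/select}"

# Build a select URL that searches label_t and returns id and label_t.
label_query_url() {
  printf '%s?q=label_t:%s&fl=id,label_t&wt=json\n' "$SOLR_SELECT" "$1"
}

# Run the query; requires curl and a running SOLR with the blazegraph core.
search_labels() {
  curl -s "$(label_query_url "$1")"
}
```

For example, `search_labels Blazegraph` should list the same subjects that the SPARQL query binds to ?res.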