SOLR_External_Fulltext_Search
This is an example of setting up Blazegraph ExternalFullTextSearch with SOLR 6.1.0.
SOLR 6.1.0 requires Java 8.
apt-get install python-software-properties
add-apt-repository ppa:webupd8team/java
apt-get update
apt-get install oracle-java8-installer
Download SOLR, unpack it, and start it. Run the following as root:
cd /opt
wget http://mirrors.ocf.berkeley.edu/apache/lucene/solr/6.1.0/solr-6.1.0.tgz
tar zxf solr-6.1.0.tgz
cd solr-6.1.0
root@blazegraph:/opt/solr-6.1.0# ./bin/solr start
Waiting up to 30 seconds to see Solr running on port 8983 [/]
Started Solr server on port 8983 (pid=16296). Happy searching!
Create a directory for the Core
Create the directories for the configuration and data using the default basic_configs.
cd /opt/solr-6.1.0
mkdir -p server/solr/blazegraph/conf
mkdir -p server/solr/blazegraph/data
cp -rf server/solr/configsets/basic_configs/conf/* server/solr/blazegraph/conf/
Now create the new CORE named blazegraph to index the data:
curl -F action=CREATE \
-F name=blazegraph \
-F instanceDir=/opt/solr-6.1.0/server/solr/blazegraph \
-F config=solrconfig.xml \
-F dataDir=data \
http://localhost:8983/solr/admin/cores
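To confirm that the core was registered, the CoreAdmin STATUS action can be queried. The following is a minimal sketch, assuming the default host and port shown above; the `SOLR_BASE` variable and the `core_status_url` helper are hypothetical names introduced only for illustration.

```shell
#!/bin/bash
# Base URL of the SOLR instance started above (default port 8983).
SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"

# Hypothetical helper: build the CoreAdmin STATUS URL for a given core name.
core_status_url() {
  printf '%s/admin/cores?action=STATUS&core=%s&wt=json' "$SOLR_BASE" "$1"
}

# Print the URL. To see the core's status, pass it to curl, for example:
#   curl -s "$(core_status_url blazegraph)"
core_status_url blazegraph
echo
```

If the core exists, the response should include an entry for blazegraph with the instanceDir and dataDir given in the CREATE call.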
The next step is to load the data. In this example, we have written a small shell script that uses a Perl regex to extract the rdfs:label values from data in the N-Triples format and format them as JSON documents to be indexed into SOLR. JSON was chosen because SOLR's JSON loader proved more robust to special characters than the CSV representation.
The URI (subject) is stored in the id field and the text of the English label is stored in the label_t field. The _t suffix means that it is a dynamic SOLR schema field of type text.
label2JSON.sh
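#!/bin/bash
# Convert English rdfs:label triples (N-Triples, from a file or stdin) into a JSON array for SOLR.
echo "[ "
# Keep rdfs:label triples tagged exactly @en and drop regional variants such as @en-US.
# In the Perl regex, $1 is the subject URI and $3 the label text; the comma separator is set only after a match.
grep "rdf-schema#label" "${1:-/dev/stdin}" | grep '@en' | grep -v '@en-' | \
perl -n -e '/<([^>]+)>[^<]+<([^>]+)>.*\"(.*)\"@.*$/ && do { printf("%s { \"id\" : \"%s\", \"label_t\": \"%s\" }\n", $comma, $1, $3); $comma = " , " }'
echo " ]"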
Use the SOLR post tool.
zcat /data/rdf/rdfdata.nt.gz | \
./label2JSON.sh | \
./bin/post -type application/json -c blazegraph -out yes -d
Wait for this to complete. For a large data set, you may want to run it under nohup or wrap it in a script.
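One way to keep a long load running after the terminal closes is sketched below. The `run_detached` helper and the log file location are hypothetical, and a harmless placeholder command stands in for the real pipeline.

```shell
#!/bin/bash
# Hypothetical helper: run a command line in the background under nohup,
# collecting stdout and stderr in a log file.
run_detached() {
  nohup bash -c "$1" > "$2" 2>&1 &
}

# Placeholder command for illustration; for the real load, substitute the
# zcat | label2JSON.sh | bin/post pipeline shown above.
log_file="$(mktemp)"
run_detached 'echo load finished' "$log_file"
load_pid=$!

# Block until the background job ends, then show what it logged.
wait "$load_pid"
cat "$log_file"
```

For a real load you would normally skip the `wait` and check the log file later instead.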
Now you can use the ExternalFullTextSearch within your SPARQL queries.
PREFIX fts: <http://www.bigdata.com/rdf/fts#>
SELECT ?res ?score ?snippet WHERE {
?res fts:search "Blazegraph" .
?res fts:endpoint "http://localhost:8983/solr/blazegraph/select" .
?res fts:endpointType "SOLR" .
?res fts:timeout "100000" .
?res fts:score ?score .
?res fts:snippet ?snippet .
?res fts:params "fl=id,label_t" .
?res fts:searchField "id" .
?res fts:fieldToSearch "label_t" .
?res fts:snippetField "label_t" .
?res fts:searchResultType "URI" .
}
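If the SPARQL query returns no results, it can help to query SOLR directly and confirm that the labels were indexed. The following is a minimal sketch that builds a select URL for the endpoint and fields used in the query above; the `SOLR_SELECT` variable and the `label_query_url` helper are hypothetical names introduced only for illustration.

```shell
#!/bin/bash
# The select endpoint used in the fts:endpoint triple above.
SOLR_SELECT="${SOLR_SELECT:-http://localhost:8983/solr/blazegraph/select}"

# Hypothetical helper: build a select URL that searches label_t for one term.
# The term is inserted as-is, so it should be a single word without special characters.
label_query_url() {
  printf '%s?q=label_t:%s&fl=id,label_t&wt=json' "$SOLR_SELECT" "$1"
}

# Print the URL. To run the search, pass it to curl, for example:
#   curl -s "$(label_query_url Blazegraph)"
label_query_url Blazegraph
echo
```

An empty result here points to a problem with the load step, not with the SPARQL query.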