Blazegraph_FAQ
If you understand Sesame, then you are no doubt familiar with the concept of a SAIL (Storage and Inference Layer). Well, we have implemented a SAIL over bigdata. So all you have to do is take the code you’ve written for the Sesame API and instantiate a different SAIL class, specifically:
com.bigdata.rdf.sail.BigdataSail
You can get this Sesame implementation either by downloading the source tree from SVN or by downloading the binary and/or source release from the bigdata SourceForge download page.
We’ve created some configuration files that represent various common “modes” with which you might want to run bigdata:
- Full Feature Mode. This turns on all of bigdata’s goodies - statement identifiers, free-text index, incremental inference and truth maintenance. This is how you would use bigdata in a system that requires statement-level provenance, free-text search, and incremental load and retraction.
- RDF-Only Mode. This turns off all inference and truth maintenance for when you just need to store triples.
- Fast Load Mode. This is how we run bigdata when we are evaluating load and query performance, for example with the Lehigh University Benchmark (LUBM) harness. This turns off some features that are unnecessary for this type of evaluation (statement identifiers and the free text index), which increases throughput. This mode still does inference, but it is database-at-once instead of incremental. It also turns off the recording of justification chains, meaning it is an extremely inefficient mode if you need to retract statements (all inferences would have to be wiped and re-computed). This is a highly specialized mode for highly specialized problem sets.
You can find these and other modes in the form of property files in the bigdata source tree, in the “bigdata-sails” module, at:
bigdata-sails/src/samples/com/bigdata/samples [1]
Or let us help you devise the mode that is right for your particular problem. We offer development support, production support, and custom services around the platform.
[1] https://github.com/blazegraph/database/tree/master/bigdata-sails/src/samples/com/bigdata/samples
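As a rough illustration of how one of these mode files might be consumed, the sketch below reads a property file from a stream with plain java.util.Properties and then sets the journal location on a copy. The class name ModeLoader, both method names, and the literal key com.bigdata.journal.AbstractJournal.file are assumptions made for this example; in real code you would reference the BigdataSail.Options.FILE constant rather than a string literal.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration; not part of the bigdata code base.
final class ModeLoader {

    // Assumed key for the journal file; prefer BigdataSail.Options.FILE in real code.
    static final String JOURNAL_FILE_KEY = "com.bigdata.journal.AbstractJournal.file";

    // Read a mode property file (for example fullfeature.properties) from a stream.
    static Properties load(InputStream in) {
        Properties properties = new Properties();
        try {
            properties.load(in);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return properties;
    }

    // Return a copy of the mode with the journal location set; the input is not modified.
    static Properties withJournal(Properties mode, String journalPath) {
        Properties copy = new Properties();
        copy.putAll(mode);
        copy.setProperty(JOURNAL_FILE_KEY, journalPath);
        return copy;
    }
}
```

Copying the properties before overriding them keeps the shared mode definition reusable for several journals.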
We've set up three modes for bigdata that configure the store properly for triples, triples with provenance, and quads. Look for the TRIPLES_MODE, TRIPLES_MODE_WITH_PROVENANCE, and QUADS_MODE on AbstractTripleStore.Options and BigdataSail.Options.
Currently bigdata does not support inference or provenance for quads, so those features are automatically turned off in QUADS_MODE.
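For readers who prefer to see the individual switches, here is a minimal sketch of what a quads configuration without inference or provenance might look like when spelled out by hand. The class and method names are hypothetical, and the property keys are written from memory of the sample property files, so treat them as assumptions and check them against AbstractTripleStore.Options and BigdataSail.Options before relying on them.

```java
import java.util.Properties;

// Hypothetical helper for illustration; the keys below are assumptions to verify
// against AbstractTripleStore.Options and BigdataSail.Options.
final class QuadsModeSketch {

    static Properties quadsMode() {
        Properties p = new Properties();
        // store quads (named graphs) instead of triples
        p.setProperty("com.bigdata.rdf.store.AbstractTripleStore.quads", "true");
        // no statement identifiers, so no statement-level provenance
        p.setProperty("com.bigdata.rdf.store.AbstractTripleStore.statementIdentifiers", "false");
        // no axioms and no truth maintenance, so no inference
        p.setProperty("com.bigdata.rdf.store.AbstractTripleStore.axiomsClass",
                "com.bigdata.rdf.axioms.NoAxioms");
        p.setProperty("com.bigdata.rdf.sail.truthMaintenance", "false");
        return p;
    }
}
```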
Bigdata does not support quad mode inference out of the box. The basic issue for inference with quads is that there is no standard concerning which named graph data should be combined (data and ontologies) when performing quad mode inference and where to write the new entailments (inferences).
People often ask about "quads plus inference." Our question in return is always, "what are you trying to accomplish?" Sometimes people use quads to support provenance - bigdata has a dedicated mode for this. Sometimes people use quads to have multiple graphs in the same database, but you have an effectively unlimited number of distinct triple or quad stores in each bigdata instance.
Here are some possible approaches to problems that either appear to require quads plus inference or that actually do require quads plus inference:
- Use property paths for runtime inference. You can cover an interesting subset of inference through property path expansions. If you combine property paths with inference, then you can explicitly trade off eager materialization against runtime evaluation.
- Use triple mode with inference, but store multiple triple store instances in the same journal. We have customers with 15,000 triple stores in a single journal. This option works well if you are using quad mode to circumvent a limit in the number of triple mode instances you can use with some platforms. You can also query across those triple store instances using SPARQL federated query.
- If you are using quad mode to track provenance, then the statement identifiers (SIDs) mode allows you to track statement level provenance without the overhead of quad mode indices.
- You can use an external process to explicitly manage the inference process by combining various named graphs within a journal or temporary store and applying the delta for the update. The output delta can be recovered using a change log listener and then conveyed as a simple update to the appropriate named graphs and target journals. This pattern is used by several large customers to manage updates, sometimes as part of a map/reduce job which collects, transforms, and organizes the update process. This can also be used to decouple the inference workload from the query workload, making it possible to scale both processes independently for high data volume and data rate systems. (You can scale the query workload linearly using the highly available replication cluster.)
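To make the property path approach from the list above more concrete, here is a hedged sketch of a SPARQL 1.1 query that follows rdfs:subClassOf transitively at query time, so instances of subclasses are found without materializing the entailments. The class name, the method name, and the example class IRI used to call it are hypothetical.

```java
// Hypothetical helper for illustration; builds a SPARQL 1.1 property path query.
final class PropertyPathSketch {

    // classIri must be a trusted, well-formed IRI; it is inserted into the query as-is.
    static String instancesOf(String classIri) {
        return "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
             + "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
             + "SELECT DISTINCT ?x WHERE {\n"
             // rdf:type followed by zero or more rdfs:subClassOf steps
             + "  ?x rdf:type/rdfs:subClassOf* <" + classIri + "> .\n"
             + "}";
    }
}
```

The resulting string can be evaluated like any other SPARQL select query. It covers only the class hierarchy part of RDFS reasoning, which is the kind of subset the first approach refers to.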
I’ve picked the Blazegraph configuration setting I want to work with and I want to use Sesame. Help me write some code.
It’s easy. For the most part it’s the same as any Sesame 2 repository. This code is taken from SampleCode.java:
// use one of our pre-configured option-sets or "modes"
Properties properties =
    sampleCode.loadProperties("fullfeature.properties");
// create a backing file for the database
File journal = File.createTempFile("bigdata", ".jnl");
properties.setProperty(
    BigdataSail.Options.FILE,
    journal.getAbsolutePath()
);
// instantiate a sail and a Sesame repository
BigdataSail sail = new BigdataSail(properties);
Repository repo = new BigdataSailRepository(sail);
repo.initialize();
We now have a Sesame repository that is ready to use. Any time we want to “do” anything (load data, query, delete, etc.), we need to obtain a connection to the repository. This is how the Sesame API is usually used:
RepositoryConnection cxn = repo.getConnection();
cxn.setAutoCommit(false);
try {
    ... // do something interesting
    cxn.commit();
} catch (Exception ex) {
    cxn.rollback();
    throw ex;
} finally {
    // close the repository connection
    cxn.close();
}
Make sure to always use autoCommit=false! Otherwise the SAIL automatically commits after every single operation, which severely degrades performance and causes the bigdata journal to grow very large.
Inside that “do something interesting” section you might want to add a statement like the following:
Resource s = new URIImpl("http://www.bigdata.com/rdf#Mike");
URI p = new URIImpl("http://www.bigdata.com/rdf#loves");
Value o = new URIImpl("http://www.bigdata.com/rdf#RDF");
Statement stmt = new StatementImpl(s, p, o);
cxn.add(stmt);
Or maybe you’d like to load an entire RDF document:
String baseURL = ... // the base URL for the document
InputStream is = ... // input stream to the document
Reader reader = new InputStreamReader(new BufferedInputStream(is));
cxn.add(reader, baseURL, RDFFormat.RDFXML);
Once you have data loaded you might want to read some data from your database. Note that by casting the statement to a “BigdataStatement”, you can get at additional information like the statement type (Explicit, Axiom, or Inferred):
URI uri = ... // a Resource that you’d like to know more about
RepositoryResult<Statement> stmts =
    cxn.getStatements(uri, null, null, true /* includeInferred */);
while (stmts.hasNext()) {
    Statement stmt = stmts.next();
    Resource s = stmt.getSubject();
    URI p = stmt.getPredicate();
    Value o = stmt.getObject();
    // do something with the statement
    // cast to BigdataStatement to get at additional information
    BigdataStatement bdStmt = (BigdataStatement) stmt;
    if (bdStmt.isExplicit()) {
        // do one thing
    } else if (bdStmt.isInferred()) {
        // do another thing
    } else { // bdStmt.isAxiom()
        // do something else
    }
}
Of course, one of the most interesting things you can do is run high-level queries against the database. Sesame 2 repositories support the open-standard query language SPARQL[1] and a native Sesame query language, SeRQL[2]. Formulating high-level queries is outside the scope of this document, but assuming you have formulated your query you can execute it as follows:
final QueryLanguage ql = ... // the query language
final String query = ... // a “select” query
TupleQuery tupleQuery = cxn.prepareTupleQuery(ql, query);
tupleQuery.setIncludeInferred(true /* includeInferred */);
TupleQueryResult result = tupleQuery.evaluate();
// do something with the results
Some find “construct” queries more useful, since they allow you to grab a real subgraph from your database:
// silly construct queries, can't guarantee distinct results
final Set<Statement> results = new LinkedHashSet<Statement>();
final GraphQuery graphQuery = cxn.prepareGraphQuery(ql, query);
graphQuery.setIncludeInferred(true /* includeInferred */);
graphQuery.evaluate(new StatementCollector(results));
// do something with the results
for (Statement stmt : results) {
    ...
}
While we’re at it, using the bigdata free-text index is as simple as writing a high-level query. Bigdata uses a magic predicate to indicate that the free-text index should be used to find bindings for a particular variable in a high-level query. The free-text index is a Lucene-style index that matches whole words or prefixes.
RepositoryConnection cxn = repo.getConnection();
cxn.setAutoCommit(false);
try {
    cxn.add(new URIImpl("http://www.bigdata.com/A"), RDFS.LABEL,
        new LiteralImpl("Yellow Rose"));
    cxn.add(new URIImpl("http://www.bigdata.com/B"), RDFS.LABEL,
        new LiteralImpl("Red Rose"));
    cxn.add(new URIImpl("http://www.bigdata.com/C"), RDFS.LABEL,
        new LiteralImpl("Old Yellow House"));
    cxn.add(new URIImpl("http://www.bigdata.com/D"), RDFS.LABEL,
        new LiteralImpl("Loud Yell"));
    cxn.commit();
} catch (Exception ex) {
    cxn.rollback();
    throw ex;
} finally {
    // close the repository connection
    cxn.close();
}
String query = "select ?x where { ?x <"+BNS.SEARCH+"> \"Yell\" . }";
executeSelectQuery(repo, query, QueryLanguage.SPARQL);
// will match A, C, and D
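To see why the search for "Yell" picks up A, C, and D but not B, the following sketch mimics the "whole words or prefixes" rule with plain string operations. It is only an illustration of the matching rule and is not taken from the bigdata source tree; the real index relies on its own tokenization and analysis, and the class name, method name, and case-insensitive comparison are assumptions.

```java
import java.util.Locale;

// Hypothetical illustration of word-or-prefix matching; not the bigdata implementation.
final class PrefixMatchSketch {

    // True if any whitespace-separated word of the label starts with the search term.
    static boolean matches(String label, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        for (String token : label.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.startsWith(needle)) {
                return true;
            }
        }
        return false;
    }
}
```

Under this rule "Yellow Rose", "Old Yellow House", and "Loud Yell" each contain a word that starts with "yell", while "Red Rose" does not.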
You can find all of this code and more in the source tree at bigdata-sails/src/samples/com/bigdata/samples.[3]
[1] http://www.w3.org/TR/rdf-sparql-query/
[2] http://www.openrdf.org/doc/sesame/users/ch06.html
[3] https://github.com/blazegraph/database/tree/master/bigdata-sails/src/samples/com/bigdata/samples
You claim that you've "solved" the provenance problem for RDF with statement identifiers. Can you show me how that works?
Please see the Reification Done Right page (A guide to using efficient statements about statements in bigdata. Aka RDF* and SPARQL*).
Note: The older RDF/XML interchange for the statement identifiers (SIDs) mode is no longer available.