big-data-europe/mu-query-rewriter

Mu Query Rewriter

The Query Rewriter is a proxy service for enriching and constraining SPARQL queries before they are sent to the database. It functions as an authorization service in the mu-semtech microservice architecture, enabling in-database access control and authorization-aware caching.

A sandbox interface for writing constraints is provided by https://github.com/big-data-europe/mu-query-rewriter-sandbox

A basic working example and testing environment is provided by https://github.com/big-data-europe/graph-acl-basics/

Introduction

A constraint is expressed as a standard SPARQL CONSTRUCT query, which conceptually represents an intermediate 'constraint' graph. An incoming query is optimally rewritten to a form which, when run against the full database, is equivalent to the original query being run against the constraint graph. Constraining queries in this way allows shared logic to be abstracted almost to the database level, simplifying the logic handled by each microservice.

[Figure: rewriter diagram]

The main use case is modeling access rights directly in the data, so that an incoming query is effectively run against the subset of data which the current user has permission to query or update. Using Annotations (see below), the Rewriter can return authorization-aware cache-keys and clear-keys to the mu-cache. When access rights can be fully resolved at rewrite-time (using functional properties and intermediate queries, see below), the rewriter can return an error signaling no access. When access can only be resolved in the database, an unauthorized query will return the empty set.

There are also simpler use cases, such as using multiple graphs to model data so that individual microservices do not need to be aware of the rules determining which triples are stored in which graph.

Examples

In the following example, the constraint defines a model where bikes and cars are stored in separate graphs, and users can be authorized to see one or both of the types.

When a microservice in the mu-semtech architecture makes the query (so that the identifier has assigned a mu-session-id), the rewriter sends the rewritten query to the database.

The three blocks below show, in order, the constraint (preceded by its settings), the incoming query, and the rewritten query.
Functional properties: rdf:type
Unique variables: ?user

CONSTRUCT {
  ?a ?b ?c
}
WHERE {
  GRAPH ?graph {
   ?a ?b ?c;
      a ?type
  }
  GRAPH <auth> {
   <SESSION> muauth:account ?user.
   ?user muauth:authorizedFor ?type
  }
  VALUES (?graph ?type){
    (<cars> <Car>)
    (<bikes> <Bike>)
  }
}

SELECT *
WHERE {
  ?s a <Bike>;
     <hasColor> ?color.
}

SELECT ?s ?color
WHERE {
  GRAPH ?graph23694 {
    ?s a <Bike>;
       <hasColor> ?color.
  }
  GRAPH <auth> {
   <session123456> muauth:account ?user.
   ?user muauth:authorizedFor <Bike>
  }
  VALUES (?graph23694) { (<bikes>) }
}

If we want to query the database for ?user at rewrite time, we declare muauth:account to be a transient functional property. ("Transient" means it is not cached between calls.) If muauth:authorizedFor is also functional and the user is not authorized to see <Bike>s, this will be queried and known at rewrite time, and the query will fail before being sent to the database.

The three blocks below show, in order, the constraint (preceded by its settings), the incoming query, and the rewritten query.
Functional properties: rdf:type, muauth:authorizedFor
Transient functional properties: muauth:account
Unique variables: ?user

CONSTRUCT {
  ?a ?b ?c
}
WHERE {
  GRAPH ?graph {
   ?a ?b ?c;
      a ?type
  }
  GRAPH <auth> {
   <SESSION> muauth:account ?user.
   ?user muauth:authorizedFor ?type
  }
  VALUES (?graph ?type){
    (<cars> <Car>)
    (<bikes> <Bike>)
  }
}

SELECT *
WHERE {
  ?s a <Bike>;
     <hasColor> ?color.
}

SELECT ?s ?color
WHERE {
  GRAPH ?graph23694 {
    ?s a <Bike>;
       <hasColor> ?color.
  }
  GRAPH <auth> {
   <session123456> muauth:account <user4532>.
   <user4532> muauth:authorizedFor <Bike>
  }
  VALUES (?graph23694) { (<bikes>) }
}

Running the Proxy Service

The Query Rewriter runs as a proxy service between the application and the database. It exposes a SPARQL endpoint /sparql that accepts GET and POST requests following the SPARQL specifications, and passes on all received headers to the database.
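
As a rough client-side illustration, the Python sketch below builds (but does not send) a form-encoded SPARQL POST request for the /sparql endpoint. The host port 4027 is taken from the example docker-compose file in this README. The session URI and the function name build_sparql_request are hypothetical. In a deployed mu-semtech stack the identifier normally assigns mu-session-id, so setting it by hand is only meant for local testing.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint, query, session_id):
    """Build (but do not send) a SPARQL-protocol POST request for the rewriter."""
    body = urlencode({"query": query}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
            # Header named in the default *headers-replacements* setting.
            "mu-session-id": session_id,
        },
    )


request = build_sparql_request(
    "http://localhost:4027/sparql",  # host port from the example docker-compose file
    "SELECT * WHERE { ?s a <Bike>; <hasColor> ?color. }",
    "http://example.org/sessions/123456",  # hypothetical session URI
)
# To actually send it: urllib.request.urlopen(request)
```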

Configuration

The Query Rewriter supports the following environment variables:

  • MU_SPARQL_ENDPOINT: SPARQL read endpoint URL. Default: http://database:8890/sparql in Docker, and http://localhost:8890/sparql outside Docker.
  • MU_SPARQL_UPDATE_ENDPOINT: SPARQL update endpoint URL. Same defaults as MU_SPARQL_ENDPOINT.
  • PORT: the port to run the application on, defaults to 8890.
  • PLUGIN: plugin filename (without '.scm' extension), to be loaded from /config in Docker and ./config/rewriter locally.
  • CACHE_QUERY_FORMS: when "true" (default), will cache query forms. This feature is experimental (see below).
  • QUERY_FUNCTIONAL_PROPERTIES: when "true" (default), query the database for functional properties for known subjects.
  • CALCULATE_ANNOTATIONS: when "true" (default), annotations will be calculated and returned in the headers.
  • QUERY_ANNOTATIONS: when "true" (default), variable annotations will be queried in the database.
  • SEND_DELTAS: when "true" and a subscribers.json file is provided, will send deltas.
  • DEBUG: when "true", run Scheme code interpreted.
  • DEBUG_LOGGING: when "true", turn on verbose debug logging (mostly timing and performance).
  • MESSAGE_LOGGING: turns basic logging on or off.
  • PRINT_SPARQL_QUERIES: when "true", print all SPARQL queries.

These can also be set in the plugin file using the Scheme API below.

Example docker-compose file

version: "2"
services:
  db:
    image: tenforce/virtuoso:1.0.0-virtuoso7.2.4
    environment:
      SPARQL_UPDATE: "true"
      DEFAULT_GRAPH: "http://mu.semte.ch/application"
    ports:
      - "8890:8890"
    volumes:
      - ./data/db:/data
  rewriter:
    image: nathanielrb/mu-graph-rewriter
    links:
      - db:database
      - laq:laq
    environment:
      DEBUG_LOGGING: "true"
      PLUGIN: "authorization"
    volumes:
      - ./config/rewriter:/config
    ports:
      - "4027:8890"
  my-service:
    image: my/service
    links:
      - rewriter:
  laq: # to test deltas
    image: flowofcontrol/list-all-requests

Basic Logic

A constraint is expressed as a SPARQL CONSTRUCT statement of one triple, called the "matched triple". The matched triple is matched against each triple in the incoming query, and the constraint's WHERE clause is rewritten with each match substitution, calculating minimal dependencies between the constrained variables to simplify the rewritten query.
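
The matching step can be pictured with a toy Python sketch. This is only an illustration of matching one triple and substituting into the WHERE clause, not the Rewriter's actual algorithm: it does not rename unbound constraint variables (such as ?graph becoming ?graph23694 in the examples above) and does not compute dependencies. All function names are hypothetical.

```python
def is_var(term):
    """SPARQL variables are written with a leading '?'."""
    return term.startswith("?")


def match_triple(pattern, triple):
    """Return a substitution mapping pattern variables to the terms of
    `triple`, or None if the two triples cannot be matched."""
    subst = {}
    for p, t in zip(pattern, triple):
        if is_var(p):
            if subst.setdefault(p, t) != t:
                return None  # same variable would need two different terms
        elif p != t:
            return None  # constant terms must be identical
    return subst


def substitute(triples, subst):
    """Apply a substitution to every term of every triple."""
    return [tuple(subst.get(term, term) for term in tr) for tr in triples]


matched_triple = ("?a", "?b", "?c")
constraint_where = [("?a", "?b", "?c"), ("?a", "rdf:type", "?type")]
query_triple = ("?s", "<hasColor>", "?color")

subst = match_triple(matched_triple, query_triple)
rewritten = substitute(constraint_where, subst)
```

Here subst maps ?a, ?b and ?c to ?s, <hasColor> and ?color, so the constraint's type triple becomes a statement about ?s.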

Unique variables are only rewritten once for the whole query, regardless of dependency relationships between variables.

Functional properties are unique, and if a subject has two values for a functional property in the same block, an error is signaled. When QUERY_FUNCTIONAL_PROPERTIES is "true", functional properties are queried in the database for known subjects.
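
One way to read the uniqueness rule is sketched below in Python. It is a toy model under stated assumptions: when the same subject has a constant and a variable as values of a functional property, the variable is bound to the constant (this is inferred from the examples above, where ?type ends up as <Bike>), and two different constants raise an error. The function name resolve_functional is hypothetical, and chains of variables are not followed.

```python
def resolve_functional(triples, functional_properties):
    """Unify the values of functional properties per subject.

    Returns a dict of variable bindings; raises ValueError when a subject
    has two different constant values for the same functional property.
    """
    values = {}    # (subject, predicate) -> value seen so far
    bindings = {}  # variable -> value it has to equal
    for s, p, o in triples:
        if p not in functional_properties:
            continue
        key = (s, p)
        if key not in values:
            values[key] = o
            continue
        known = values[key]
        if known == o:
            continue
        if o.startswith("?"):
            bindings[o] = known
        elif known.startswith("?"):
            bindings[known] = o
            values[key] = o  # prefer the constant as the known value
        else:
            raise ValueError(f"{s} has two values for {p}: {known} and {o}")
    return bindings


block = [("?s", "rdf:type", "<Bike>"), ("?s", "rdf:type", "?type")]
bindings = resolve_functional(block, {"rdf:type"})
```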

Queried properties are like functional properties but without the uniqueness restriction. For known subjects and objects, the triple is verified against the database:

<person123> ex:queriedProp <someval>

Annotations

Annotations are used to define application-specific cache-keys and clear-keys for the mu-cache. They are defined as an extension to the SPARQL 1.1 standard and take two forms: constant annotations, written @access Label, and variable annotations, written @access Label(?var).

{
 ?a ?b ?c.
 ?a rdf:type ext:Comment.
 {
  @access adminComment
  ?user muauth:hasRole <http://ex/admin>.
 }
}
UNION
{
 ?a ?b ?c.
 ?a rdf:type ?type.
 VALUES ?type { ext:Route ext:Hotel }
 {
  @access adminObject(?type)
  ?user muauth:hasRole <http://ex/admin>.
 }
}

Two headers are returned. mu-cache-annotations reports constant annotations, and variable annotations along with all possible values as known at rewrite time (not querying the database). mu-queried-cache-annotations reports actual values of variable annotations in the database.

Mu-Cache-Annotations: "adminComment,adminObject <http://mu.semte.ch/vocabularies/ext/Route> <http://mu.semte.ch/vocabularies/ext/Hotel>"
Mu-Queried-Cache-Annotations: "adminObject <http://mu.semte.ch/vocabularies/ext/Route>,adminObject <http://mu.semte.ch/vocabularies/ext/Hotel>"
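
A consumer of these headers might split them as in the following Python sketch. The format is inferred only from the two sample values above (comma-separated annotations, each a label followed by space-separated URIs in angle brackets), so this is a best guess rather than a specification. Values containing commas or spaces would need a real parser. The function name parse_annotations is hypothetical.

```python
def parse_annotations(header_value):
    """Parse 'label1,label2 <v1> <v2>' into a list of (label, [values]) pairs."""
    annotations = []
    # The sample header values are wrapped in double quotes; drop them if present.
    for item in header_value.strip().strip('"').split(","):
        parts = item.split()
        if not parts:
            continue
        label = parts[0]
        values = [v.strip("<>") for v in parts[1:]]
        annotations.append((label, values))
    return annotations


sample = (
    '"adminComment,adminObject '
    "<http://mu.semte.ch/vocabularies/ext/Route> "
    '<http://mu.semte.ch/vocabularies/ext/Hotel>"'
)
parsed = parse_annotations(sample)
```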

Deltas

When the SEND_DELTAS parameter is "true" and a subscribers.json file is provided (see the example in ./config/rewriter/subscribers.json), deltas are sent on all update queries.

The deltas are sent as JSON, following the format established by the mu-delta-service:

[
 {
  "graph":"http://mu.semte.ch/application",
  "delta": {
   "inserts":[
     {
      "s":"http://data-hub.toerismevlaamsbrabant.be/hotels/5B0C1AA33C7DF9000C000003",
      "p":"http://mu.semte.ch/vocabularies/ext/addedBy",
      "o":"http://data-hub.toerismevlaamsbrabant.be/users/5B0C193C3C7DF9000C000001"
     }
   ]
  }
 }
]
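
A subscriber could flatten such a message as in the Python sketch below. Only the "inserts" key appears in the sample above. The "deletes" key is an assumption about the mu-delta-service format and is read defensively. The function name iter_delta_triples is hypothetical.

```python
import json


def iter_delta_triples(payload):
    """Yield (graph, change, s, p, o) for every triple in a delta message."""
    for entry in json.loads(payload):
        graph = entry["graph"]
        delta = entry.get("delta", {})
        # "deletes" is assumed; the sample message only shows "inserts".
        for change in ("inserts", "deletes"):
            for triple in delta.get(change, []):
                yield graph, change, triple["s"], triple["p"], triple["o"]


message = """
[
 {
  "graph": "http://mu.semte.ch/application",
  "delta": {
   "inserts": [
     {
      "s": "http://data-hub.toerismevlaamsbrabant.be/hotels/5B0C1AA33C7DF9000C000003",
      "p": "http://mu.semte.ch/vocabularies/ext/addedBy",
      "o": "http://data-hub.toerismevlaamsbrabant.be/users/5B0C193C3C7DF9000C000001"
     }
   ]
  }
 }
]
"""
changes = list(iter_delta_triples(message))
```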

Cache Forms

The Rewriter comes with an experimental cache that caches the rewritten form of queries that are equivalent modulo full URIs and literal strings. The current implementation is fairly risky and cannot stand up to pathological (or even slightly weird) cases. However, the speedup is considerable, and with proper precautions it is usable. A correct implementation is planned as a next step.

Limitations and Exceptions

Due to the complexity of the SPARQL 1.1 grammar, not all SPARQL queries are fully supported.

The property paths *, + and ? are constrained identically to the corresponding single-jump triple, e.g., ?s ?p* ?o is considered subject to the same constraints as ?s ?p ?o.

Writing Plugins

The mu-query-rewriter-sandbox provides a UI for writing and testing Query Rewriter plugins. The graph-acl-basics repository provides a full working example for experimentation.

This section describes how to write plugins directly in Chicken Scheme.

Procedures

[procedure] (define-constraint mode constraint)

mode is a symbol, and can take the values 'read/write, 'read or 'write.

constraint can be a string or a procedure of zero arguments ("thunk") returning a string.

Parameters

Most of the parameters can be set as environment variables and in the sandbox, as described above (see ./framework/settings.scm). A few, however, can only be set in the Scheme code.

[parameter] *headers-replacements*

List of template forms in the constraint query that are replaced dynamically with the value of the matching header. Each element takes the form ("<TEMPLATE>" header-name string) or ("<TEMPLATE>" header-name uri). Defaults to '(("<SESSION>" mu-session-id uri)).
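
The effect of a replacement can be pictured with a short Python sketch. This is a toy model, not the Rewriter's code: it assumes that a uri replacement wraps the header value in angle brackets (as <SESSION> becomes <session123456> in the examples above), that a string replacement produces a double-quoted literal, and that a template is left untouched when the header is missing. The function name apply_header_replacements is hypothetical.

```python
import json


def apply_header_replacements(constraint, replacements, headers):
    """Replace each template in the constraint text with its header value."""
    for template, header_name, kind in replacements:
        value = headers.get(header_name)
        if value is None:
            continue  # assumption: leave the template in place
        if kind == "uri":
            rendered = f"<{value}>"
        else:
            rendered = json.dumps(value)  # assumption: quoted string literal
        constraint = constraint.replace(template, rendered)
    return constraint


resolved = apply_header_replacements(
    "<SESSION> mu:account ?user",
    [("<SESSION>", "mu-session-id", "uri")],
    {"mu-session-id": "session123456"},
)
```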

[parameter] *optimize-constraint-cache-headers*

List of headers that determine how long a resolved constraint stays cached. Resolving the constraint can be time-consuming when there are many header replacements and functional properties, so this setting can matter for performance. Defaults to '(mu-session-id mu-call-id), which means that all calls with the same mu-session-id and mu-call-id share the cached value.
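
The sharing rule can be illustrated with a small Python sketch, assuming the cached constraint is simply keyed on the values of the listed headers. That keying scheme and the function name constraint_cache_key are assumptions made for illustration, not the Rewriter's implementation.

```python
def constraint_cache_key(headers, cache_headers=("mu-session-id", "mu-call-id")):
    """Build a cache key from the headers that scope the cached constraint."""
    return tuple(headers.get(name) for name in cache_headers)


first = constraint_cache_key(
    {"mu-session-id": "session123456", "mu-call-id": "call1", "accept": "*/*"}
)
second = constraint_cache_key(
    {"mu-session-id": "session123456", "mu-call-id": "call1"}
)
other_call = constraint_cache_key(
    {"mu-session-id": "session123456", "mu-call-id": "call2"}
)
# first == second, so both requests would share one cached constraint;
# other_call differs, so that request would resolve the constraint again.
```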

Example

(*functional-properties* '(rdf:type))

(*query-functional-properties?* #t)

(*unique-variables* '(?user))

(define-constraint  
  'read/write 
  (lambda ()    "
PREFIX mu: <http://mu.semte.ch/vocabularies/core/>
PREFIX graphs: <http://mu.semte.ch/school/graphs/>
PREFIX school: <http://mu.semte.ch/vocabularies/school/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

CONSTRUCT {
 ?a ?b ?c.
}
WHERE {
 GRAPH <authorization> {
  <SESSION> mu:account ?user
 }
 GRAPH ?graph {
  ?a ?b ?c.
  ?a rdf:type ?type.
 }
 VALUES (?graph ?type) { 
  (graphs:grades school:Grade) 
  (graphs:subjects school:Subject) 
  (graphs:classes school:Class) 
  (graphs:people foaf:Person) 
 }
}  "))
