discretize-sparql
discretize-sparql is a command-line tool to discretize numeric values in RDF datasets via SPARQL Update operations. Discretization (also known as binning) converts continuous numeric values into discrete intervals. This is typically useful for data mining tools that operate on categorical data. For example, discretization is required for association rule mining with EasyMiner, outlier detection with FPM, or tensor factorization with RESCAL.
This tool wraps the EasyMiner-Discretization library.
Use a released executable or compile using Leiningen and lein-binplus:
git clone https://github.com/jindrichmynarz/discretize-sparql.git
cd discretize-sparql
lein bin
You can run the created executable file and observe the command-line parameters:
target/discretize_sparql --help
The tool supports the following parameters:
-e, --endpoint: URL of the SPARQL endpoint to retrieve data from. The endpoint must allow SPARQL Update operations.
-a, --auth: Endpoint's authorization written as username:password. The tool currently supports HTTP Digest authentication, which is used by Virtuoso.
-u, --update: Path to SPARQL Update operation. See more about this below.
-m, --method: Method of discretization to use. The supported methods are equidistance, equifrequency, and equisize. Equidistant discretization creates intervals of the same size. Equifrequent discretization creates intervals with approximately the same number of members. Equisize discretization creates intervals based on minimum support.
-b, --bins: Number of bins (intervals) to generate. Required for equidistance and equifrequency methods.
-s, --min-support: Minimum support required for a generated interval. Required for equisize method.
-g, --graph: IRI or URN of the named graph to which intervals will be loaded.
-p, --page-size (default = 10000): Number of results to fetch in one request.
--parallel (default = false): Execute SPARQL queries in parallel.
--strict (default = false): Fail if not all discretized values are numeric.
-h, --help (default = false): Display help information.
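To make the difference between the equidistance and equifrequency methods concrete, here is a minimal Python sketch. It is not the EasyMiner-Discretization implementation that the tool wraps; the function names (`equidistant_cuts`, `equifrequent_cuts`, `assign`) and the sample data are hypothetical and only illustrate the idea. Intervals are assumed to be left-closed and right-open.

```python
from bisect import bisect_right


def equidistant_cuts(values, bins):
    """Return the inner cut points of `bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]


def equifrequent_cuts(values, bins):
    """Return inner cut points so that each interval holds roughly len(values) / bins members."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // bins] for i in range(1, bins)]


def assign(value, cuts):
    """Return the index of the left-closed, right-open interval containing `value`."""
    return bisect_right(cuts, value)


# Hypothetical sample: eight small prices and one outlier.
prices = [1, 2, 3, 4, 5, 6, 7, 8, 100]

print(equidistant_cuts(prices, 3))   # [34.0, 67.0]
print(equifrequent_cuts(prices, 3))  # [4, 7]
```

With skewed data such as this sample, equal-width intervals put eight of the nine values into the first interval, while equal-frequency intervals hold three values each.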
The most important parameter is --update, which provides a path to a SPARQL Update operation that defines the input and output data for a discretization task. This operation employs wishful thinking: it uses the variable ?interval as if the intervals already existed. The operation must contain the variable ?value, bound in its WHERE clause to the numeric values to discretize, and the variable ?interval, which the tool binds to the generated intervals. You are free to use these variables however you like in the operation. For instance, you can delete ?value and insert ?interval in its place, or insert ?interval alongside the original values. Here is an example of such an operation:
PREFIX pc:     <http://purl.org/procurement/public-contracts#>
PREFIX schema: <http://schema.org/>

WITH <http://linked.opendata.cz/resource/dataset/isvz.cz>
DELETE {
  ?resource schema:price ?value .
}
INSERT {
  ?resource schema:price ?interval .
}
WHERE {
  [] pc:estimatedPrice ?resource .
  ?resource schema:priceCurrency "CZK" ;
    schema:price ?value .
}
The WHERE clause in this operation selects estimated prices (pc:estimatedPrice) in Czech crowns (schema:priceCurrency "CZK"). The INSERT clause inserts the generated ?interval, while the DELETE clause deletes the original ?value.
Under the hood, this operation is rewritten to SELECT queries that retrieve the values to discretize and a SPARQL Update operation that converts the intervals.
The generated intervals are represented as instances of schema:QuantitativeValue. The bounds of the intervals are described using schema:minValue for the lower bound and schema:maxValue for the upper bound. Classes from the SemanticScience Integrated Ontology are used to determine whether the bounds are open or closed. The intervals are identified with UUID-based URNs. Here's an example:
@prefix schema: <http://schema.org/> .
@prefix sio:    <http://semanticscience.org/resource/SIO_> .

<urn:uuid:4E98F3EE-2861-4A4B-A39C-487A7018165E> a schema:QuantitativeValue,
    sio:001254, # Left-closed interval
    sio:001252 ; # Right-open interval
  schema:minValue 0 ;
  schema:maxValue 1000 .
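The shape of such an interval description can be reproduced with a few lines of Python. This is only an illustrative sketch, not the tool's own serialization code: the function name `interval_to_turtle` is hypothetical, and it covers only the left-closed, right-open case, using the two SIO classes that appear in the example above.

```python
import uuid

# SIO class IRIs taken from the example above; other bound types are not covered here.
SIO_LEFT_CLOSED = "http://semanticscience.org/resource/SIO_001254"
SIO_RIGHT_OPEN = "http://semanticscience.org/resource/SIO_001252"


def interval_to_turtle(min_value, max_value):
    """Serialize a left-closed, right-open interval as a Turtle snippet."""
    iri = f"urn:uuid:{uuid.uuid4()}"  # a fresh UUID-based URN identifies each interval
    return (
        f"<{iri}> a <http://schema.org/QuantitativeValue>,\n"
        f"    <{SIO_LEFT_CLOSED}>,\n"
        f"    <{SIO_RIGHT_OPEN}> ;\n"
        f"  <http://schema.org/minValue> {min_value} ;\n"
        f"  <http://schema.org/maxValue> {max_value} .\n"
    )


print(interval_to_turtle(0, 1000))
```

Full IRIs are written out instead of prefixed names so that the snippet stays valid without any prefix declarations.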
The intervals are loaded into the named graph provided via the --graph parameter. If this parameter is missing, the tool attempts to guess the named graph to load the intervals into. It uses the graph specified by WITH, USING, or in the INSERT clause. If no graph is found, the tool asks you to provide it explicitly via --graph.
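The fallback described above can be approximated as follows. This is a rough sketch under stated assumptions, not the tool's actual logic: the function name `guess_graph` is hypothetical, the lookup order (WITH, then USING, then GRAPH inside INSERT) is assumed, and the regular expressions only handle graphs written as full IRIs in angle brackets. A real implementation would parse the operation instead of pattern-matching it.

```python
import re

# A full IRI in angle brackets; prefixed names are deliberately not handled.
IRI = r"<([^<>\s]+)>"


def guess_graph(update):
    """Return the first graph IRI named by WITH, USING, or GRAPH inside INSERT, else None."""
    patterns = [
        rf"\bWITH\s+{IRI}",
        rf"\bUSING\s+{IRI}",
        rf"\bINSERT\s*\{{[^}}]*?\bGRAPH\s+{IRI}",
    ]
    for pattern in patterns:
        match = re.search(pattern, update, flags=re.IGNORECASE)
        if match:
            return match.group(1)
    return None  # the caller would then ask for an explicit --graph


update = """
WITH <http://linked.opendata.cz/resource/dataset/isvz.cz>
DELETE { ?resource <http://schema.org/price> ?value . }
INSERT { ?resource <http://schema.org/price> ?interval . }
WHERE  { ?resource <http://schema.org/price> ?value . }
"""
print(guess_graph(update))  # http://linked.opendata.cz/resource/dataset/isvz.cz
```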
Virtuoso has an issue with keeping the precision of xsd:decimal. As a result of the precision loss, some decimal numbers may end up not being discretized. In order to avoid this issue, the tool rounds the bounds of intervals to the maximum decimal precision supported by Virtuoso in xsd:decimal.
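Rounding a bound to a fixed precision can be sketched with Python's decimal module. The function name `round_bound` is hypothetical, and the digit limit is left as a parameter because the precision Virtuoso actually supports is not stated here. The sketch also assumes that "precision" means the number of significant digits and that ties are rounded half to even; the tool may round differently.

```python
from decimal import Context, ROUND_HALF_EVEN


def round_bound(value, max_digits):
    """Round an interval bound to at most `max_digits` significant digits."""
    ctx = Context(prec=max_digits, rounding=ROUND_HALF_EVEN)
    # create_decimal applies the context's precision while converting the literal.
    return ctx.create_decimal(str(value))


print(round_bound("1234.567891234", 8))  # 1234.5679
```

Passing the bound as a string keeps the literal exact; converting a float first would bake in its binary rounding error.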
Development of this tool was supported by the H2020 project no. 645833 (OpenBudgets.eu).
Copyright © 2017 Jindřich Mynarz
Distributed under the Eclipse Public License version 1.0.