Examples of spark-lucenerdd.
The following pairs of datasets are used here to demonstrate the accuracy/quality of the record linkage methods. Note that the goal here is to demonstrate the user-friendliness of the spark-lucenerdd library and no optimization is attempted.
Dataset | Domain | Attributes | Accuracy (top-1) | References |
---|---|---|---|---|
DBLP vs ACM article | Bibliographic | title, authors, venue, year | 0.98 | Benchmark datasets for entity resolution |
DBLP vs Scholar article | Bibliographic | title, authors, venue, year | 0.953 | Benchmark datasets for entity resolution |
Amazon vs Google products | E-commerce | name, description, manufacturer, price | 0.58 | Benchmark datasets for entity resolution |
Abt vs Buy products | E-commerce | name, description, manufacturer, price | 0.64 | Benchmark datasets for entity resolution |
The reported accuracy above is by selecting as the linked entity: the first result from the top-K list of results.
All datasets are available in Spark friendly Parquet format here; original datasets are available here.
This example loads all countries from a parquet file containing fields "name" and "shape" (shape is mostly polygons in WKT)
val allCountries = spark.read.parquet("data/spatial/countries-poly.parquet")
then, it load all capitals from a parquet file containing fields "name" and "shape" (shape is mostly points in WKT)
val capitals = spark.read.parquet("data/spatial/capitals.parquet")
A ShapeLuceneRDD instance is created on the countries and a linkageByRadius
is performed on the capitals. The output is presented in the logs.
Install Java, SBT and clone the project
git clone https://github.com/zouzias/spark-lucenerdd-examples.git
cd spark-lucenerdd-examples
sbt compile assembly
Download and extract apache spark under your home directory, update the spark-submit.sh
script accordingly and run
./spark-linkage-*.sh
to run the record linkage examples and ./spark-search-capitalts.sh
to run a search example.
Setup docker and assuming that you have a docker machine named default
, type
./startZeppelin.sh
To start an Apache Zeppelin with preloaded notebooks.