. . . .
,`,`,`,`,
. . . . `\`\`\`\;
`\`\`\`\`, ~|;!;!;\!
~\;\;\;\|\ (--,!!!~`! .
(--,\\\===~\ (--,|||~`! ./
(--,\\\===~\ `,-,~,=,:. _,//
(--,\\\==~`\ ~-=~-.---|\;/J, Welcome to the Unicorn Database
(--,\\\((```==. ~'`~/ a | BigTable, Document and Graph
(-,.\\('('(`\\. ~'=~| \_. \ Full Text Search
(,--(,(,(,'\\. ~'=| \\_;>
(,-( ,(,(,;\\ ~=/ \ Haifeng Li
(,-/ (.(.(,;\\,/ ) ADP Innovation Lab
(,--/,;,;,;,\\ ./------.
(==,-;-'`;' /_,----`. \
,.--_,__.-' `--. ` \
(='~-_,--/ , ,!,___--. \ \_)
(-/~( | \ ,_- | ) /_|
(~/((\ )\._, |-' _,/ /
\\)))) / ./~. | \_\;
,__///// / / ) /
'===~' | | (, <.
/ / \. \
_/ / \_\
/_!/ >_\
Welcome to Unicorn Shell; enter ':help<RETURN>' for the list of commands.
Type ":quit<RETURN>" to leave the Unicorn Shell
Version 2.1.0, Scala 2.11.7, SBT 0.13.8, Built at 2016-07-04 03:42:14.238
===============================================================================
unicorn>
Unicorn is a simple and flexible abstraction of BigTable-like databases such as Cassandra, HBase, Accumulo, and RocksDB. Beyond a unified interface to various database systems, Unicorn provides an easy-to-use document data model and a MongoDB-like API. Moreover, Unicorn supports directed property multigraphs, and documents can simply be vertices in a graph.
Agility, flexibility, and ease of use are our key design goals. With different storage engines, we can achieve different strategies for consistency, replication, etc.
With the built-in document and graph data models, developers can focus on the business logic rather than tedious key-value pair manipulation. Of course, developers are still free to use key-value pairs directly for flexibility in special cases.
Unicorn is implemented in Scala and can be used as a client-side library
without overhead. Unicorn also provides a shell for quick access to the database.
The code snippets in this document can be run directly in the shell.
An HTTP API, in the module Rhino, is also provided for non-Scala users.
For analytics, Unicorn data can be exported as RDDs in Spark. These RDDs can also be converted to DataFrames or Datasets, which support SQL queries. Unicorn graphs can be analyzed by Spark GraphX too.
To use Unicorn as a library, add the following to your SBT build file.
libraryDependencies += "com.github.haifengl" % "unicorn-unibase_2.11" % "2.1.1"
If you need additional HBase-only features, please link to the module Narwhal.
libraryDependencies += "com.github.haifengl" % "unicorn-narwhal_2.11" % "2.1.1"
With the module Narwhal, which is specialized for HBase, advanced features such as time travel, rollback, counters, server-side filters, Spark integration, etc. are available.
To support the document model, Unicorn has a rich and advanced JSON library. With it, users can operate on JSON data just like in JavaScript. Moreover, it supports JSONPath to flexibly analyze, transform, and selectively extract data out of JSON objects. Meanwhile, it is type safe and can catch many errors at compile time. To use only the JSON library,
libraryDependencies += "com.github.haifengl" % "unicorn-json_2.11" % "2.1.1"
Get the pre-packaged Unicorn universal tarball from the releases page.
If you would like to build Unicorn from source, please first install Java, Scala and SBT. Then clone the repo and build the package:
git clone https://github.com/haifengl/unicorn.git
cd unicorn
./unicorn.sh
which also starts the Shell.
Unicorn runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS).
All you need is to have Java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.
If you want to work on the Unicorn code with IntelliJ, please first install the SBT plugin. To import the project, on the main menu select File | New | Project from Existing Sources.
Unicorn comes with an interactive shell. In the home directory of Unicorn, type
./bin/unicorn
to enter the shell, which is based on the Scala interpreter. So you can run
any valid Scala expression in the shell. In the simplest case, you can
use it as a calculator. Besides, all high-level Unicorn operators are
predefined in the shell. By default, the shell uses up to 4GB of memory.
If you need more memory to handle large data, use the option -J-Xmx. For example,
./bin/unicorn -J-Xmx8192M
You can also modify the configuration file ./conf/unicorn.ini for the memory and other JVM settings.
In the shell, type :help to print Scala interpreter help information. To exit the shell, type :quit.
Suppose that hbase-site.xml and hbase-default.xml can be found on the CLASSPATH, one can connect to HBase as simply as
val db = HBase()
The user may also pass a Configuration object to HBase() if the HBase configuration files are not in the CLASSPATH.
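If the configuration files are elsewhere, a minimal sketch looks like the following (the ZooKeeper quorum and client port below are placeholders for your own cluster settings):
import org.apache.hadoop.hbase.HBaseConfiguration

val config = HBaseConfiguration.create()
config.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com")
config.set("hbase.zookeeper.property.clientPort", "2181")
val db = HBase(config)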
To connect to Cassandra, please enable the Thrift API by setting start_rpc to true in the cassandra.yaml file; it is false by default after Cassandra 2.0.
val db = Cassandra("127.0.0.1", 9160)
To connect to Accumulo,
val db = Accumulo("instance", "zookeeper", "user", "password")
where the first parameter is the Accumulo instance name and the second is the ZooKeeper connection string.
An interesting feature of Accumulo is the ability to create a mock instance that holds all data in memory and does not retain any data or settings between runs. It presently does not enforce users, logins, permissions, etc. This is very convenient for testing and doesn't even require an installation of Accumulo. To create a mock instance, simply do
val db = Accumulo()
RocksDB is an embeddable persistent key-value store for fast storage. RocksDB builds on LevelDB to be scalable to run on servers with many CPU cores, to efficiently use fast storage, to support IO-bound, in-memory and write-once workloads.
If your data can fit on one machine and low latency (in microseconds) is important to your applications, RocksDB is a great choice. This is especially true for graph database use cases: a graph traversal may touch thousands or even millions of vertices, and the cost of IPC to a distributed database would be too high. In this case, RocksDB will easily outperform distributed storage engines.
There is no concept of tables in RocksDB. In fact, a RocksDB database is like a table in HBase. Therefore, we create a higher level concept of database that contains multiple RocksDB databases in a directory. Each RocksDB database is actually a subdirectory, which is encapsulated in RocksTable. To create a RocksDB database, simply provide a directory path.
val db = Unibase(RocksDB.create("/tmp/unicorn-twitter"))
To use an existing database,
val db = Unibase(RocksDB("/tmp/unicorn-twitter"))
With a database instance, we can create, drop, truncate, and compact a table. We can also test if a table exists.
db.createTable("test_table", "cf1", "cf2")
val table = db("test_table")
table("row1", "cf1", "c1") = "data"
db.dropTable("test_table")
Here we first create a table with two column families.
We can also pass a Properties object for additional table configuration, which usually depends on the backend system, for example sharding and replication strategies, compression, time-to-live (TTL), etc.
Then we get the table object, put some data, and finally drop the table.
In what follows, we will go through the data manipulation APIs.
In BigTable-like databases, data are stored in tables, which are made of rows and columns. Columns are grouped into column families. A column name is made of its column family prefix and a qualifier. The column family prefix must be composed of printable characters. The column qualifiers can be made of any arbitrary bytes. In Cassandra and HBase, column families must be declared up front at schema definition time, whereas new columns can be added to any column family without pre-announcing them. In contrast, column families are not static in Accumulo and can be created on the fly. The only way to get a complete set of columns that exist for a column family is to process all the rows.
A cell’s content is an uninterpreted array of bytes, and table cells are versioned. A (row, column, version) tuple exactly specifies a cell. The version is specified using a long integer. Typically this long contains timestamps.
The trait unicorn.bigtable.BigTable defines basic operations on a table such as Get, Put, and Delete. The corresponding implementations are in unicorn.bigtable.cassandra.CassandraTable, unicorn.bigtable.hbase.HBaseTable, and unicorn.bigtable.accumulo.AccumuloTable.
/** Get one or more columns of a column family. If columns is empty, get all columns in the column family. */
def get(row: ByteArray, family: String, columns: ByteArray*): Seq[Column]
/** Get all columns in one or more column families. If families is empty, get all column families. */
def get(row: ByteArray, families: Seq[(String, Seq[ByteArray])] = Seq.empty): Seq[ColumnFamily]
/** Upsert a value. */
def put(row: ByteArray, family: String, column: ByteArray, value: ByteArray, timestamp: Long): Unit
/** Upsert values. */
def put(row: ByteArray, family: String, columns: Column*): Unit
/** Upsert values. */
def put(row: ByteArray, families: Seq[ColumnFamily] = Seq.empty): Unit
/** Delete the columns of a row. If columns is empty, delete all columns in the family. */
def delete(row: ByteArray, family: String, columns: ByteArray*): Unit
/** Delete the columns of a row. If families is empty, delete the whole row. */
def delete(row: ByteArray, families: Seq[(String, Seq[ByteArray])] = Seq.empty): Unit
Setting timestamp as 0L, Put creates a new version of a cell with the server’s currentTimeMillis (for Cassandra, it is the caller's machine time). But the user may specify the version on a per-column level. The user-provided version may be a time in the past or the future, or a non-time purpose long value.
def update(row: ByteArray, family: String, column: ByteArray, value: ByteArray): Unit
The helper function update is provided as syntactic sugar so that the user can put a value in the way
table(row, family, column) = value
In this case, the timestamp is always set to the current machine time.
Delete can happen on a specific version of a cell or on all versions.
Deletes work by creating tombstone markers. Once a tombstone marker
is set, the "deleted" cells become effectively invisible to Get
operations but are not immediately removed from store files.
Note that there is a snag with the tombstone approach, namely
"deletes mask puts". Once a tombstone marker is set, even puts
after the delete will be masked by the delete tombstone.
Performing the put will not fail. However, when you do a Get,
the Put has no effect until after the major compaction, which
actually removes the deleted data and tombstone markers.
For these APIs, there are also corresponding batch mode operations that work on multiple rows. However, the implementation may or may not optimize the batch operations. In particular, Accumulo optimizes them with parallel execution.
Besides the basic operations, advanced features that are not available on all backend systems are organized into various traits.
By default, when doing a Get, the cell whose version has the largest value is returned. It is possible to return more than one version or to return versions other than the latest. These special methods are defined in the trait TimeTravel, which is supported by HBase.
HBase and Accumulo support the Scan operation that fetches zero or more rows of a table. In fact, a Get is simply a Scan limited by the API to one row in these systems. The trait RowScan provides operations to scan the whole table, a range specified by the start and stop row keys, or rows with a given prefix.
HBase provides advanced filtering functionality when reading data using Get or Scan operations, which return a subset of results to the client. While this does not reduce server-side IO, it does reduce network bandwidth and the amount of data the client needs to process.
The filter operators (Equal, NotEqual, Greater, GreaterOrEqual, Less, LessOrEqual) and logic operators (And, Or) are defined in the trait ScanFilter. The enhanced Get and Scan operators with a filter parameter are defined in FilterScan.
In BigTable, a row could have millions of columns. Cassandra actually supports up to 2 billion columns. In such a wide columnar environment, intra-row scan is a useful operation in some use cases. Both HBase and Cassandra support the IntraRowScan trait, which can scan columns of a given range in a row (inside a column family).
HBaseTable implements the Rollback trait that defines the methods to roll back cell(s) to the previous version.
A counter is a special column used to store a 64 bit integer that is changed in increments. Both HBase and Cassandra support counters.
To use counters in Cassandra, the user has to define a column family whose columns will act as counters.
HBase supports the Append operation that appends data to a cell. Note that this operation is not atomic to readers. Appends are done under a single row lock, so write operations to a row are synchronized, but readers do not take row locks, so get and scan operations may see this operation partially completed.
Accumulo and HBase support cell level security that provides fine grained access control. Cells can have visibility labels, which are used to determine whether a given user meets the security requirements to read the value. This enables data of various security levels to be stored within the same row, and users of varying degrees of access to query the same table, while preserving data confidentiality.
Security labels consist of a set of user-defined tokens that are required to read the value the label is associated with. The set of tokens required can be specified using syntax that supports logical AND and OR combinations of tokens, as well as nesting groups of tokens together.
Although Unicorn provides a modular and unified interface to various BigTable-like systems, it is still a very low level API for manipulating data. A more productive way to use Unicorn is through the rich, flexible, and easy-to-use document data model. A document is essentially a JSON object with a unique key (corresponding to the row key). With the document data model, application developers can focus on the business logic while Unicorn efficiently maps documents to key-value pairs in BigTable. In this section, we first introduce Unicorn's JSON data types and APIs. In the next section, we will discuss the document API, which is compatible with MongoDB.
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C family of languages.
JSON has only the types string, number, boolean, object, array, and null. Unicorn includes additional types such as date, int, long, double, counter, binary, UUID, ObjectId (as in BSON), etc.
In Unicorn, it is very easy to parse a JSON object:
val doc =
json"""
{
"store": {
"book": [
{
"category": "reference",
"author": "Nigel Rees",
"title": "Sayings of the Century",
"price": 8.95
},
{
"category": "fiction",
"author": "Evelyn Waugh",
"title": "Sword of Honour",
"price": 12.99
},
{
"category": "fiction",
"author": "Herman Melville",
"title": "Moby Dick",
"isbn": "0-553-21311-3",
"price": 8.99
},
{
"category": "fiction",
"author": "J. R. R. Tolkien",
"title": "The Lord of the Rings",
"isbn": "0-395-19395-8",
"price": 22.99
}
],
"bicycle": {
"color": "red",
"price": 19.95
}
}
}
"""
The interpolator json parses a string into a JsObject. It is also okay to embed variable references directly in processed string literals.
val x = 1
json"""
{
"x": $x
}
"""
If the string is not a JSON object but any other valid JSON expression, one may use the parseJson method to convert the string to a JsValue.
"1".parseJson
The json interpolator can only be applied to string literals. If you want to parse a string variable, the parseJson method can always be employed. If you know the string contains a JSON object, you may also use the method parseJsObject.
val s = """{"x":1}"""
s.parseJsObject
To serialize a JSON value (of type JsValue) in compact mode, you can just use toString. To pretty print, use the method prettyPrint.
doc.toString
doc.prettyPrint
With a JsObject or JsArray, you can refer to the individual elements with a variation of array syntax, like this:
doc("store")("bicycle")("color")
// Use a symbol instead of a string
doc('store)('bicycle)('color)
Note that we follow Scala's array access convention with () rather than [] as in JavaScript.
Besides, you can use the dot notation to access its fields/elements just like in JavaScript:
doc.store.bicycle.color
doc.store.book(0).author
It is worth noting that we didn't define the type/schema of the document even though Scala is a strongly typed language. In other words, Unicorn's JSON library gives us both the type safety of a strongly typed language and the flexibility of a dynamic language.
If you try to access a non-existent field, JsUndefined is returned.
unicorn> doc.book
res11: unicorn.json.JsValue = undefined
Although there are already several nice JSON libraries for Scala, their JSON objects are immutable by design, which is a natural choice for a functional language. However, Unicorn is designed for databases, where data mutation is necessary. Therefore, JsObject and JsArray are mutable data structures in Unicorn. You can set/add a field just like in JavaScript:
doc.store.bicycle.color = "green"
To delete a field from a JsObject, use the remove method:
doc.store.book(0) remove "price"
It is the same as setting it to JsUndefined:
doc.store.book(0).price = JsUndefined
To delete an element from a JsArray, the remove method will effectively remove it from the array. However, setting an element to JsUndefined doesn't reduce the array size.
// delete the first element and array size is smaller
doc.store.book.remove(0)
// set the first element to undefined but array size keeps same
doc.store.book(0) = JsUndefined
It is also possible to append an element or another array to a JsArray:
val a = JsArray(1, 2, 3, 4)
a += 5
a ++= JsArray(5, 6)
Common iterative operations such as foreach, map, and reduce can be applied to JsArray too.
doc.store.book.asInstanceOf[JsArray].foreach { book =>
println(book.price)
}
Because Scala is a static language, it is impossible to know at compile time that doc.store.book is an array. So it is typed as the generic JsValue, which is the parent type of the specific JSON data types. Therefore, we use asInstanceOf[JsArray] to convert it to JsArray in order to use foreach.
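For instance, a small sketch that collects the book titles with map (assuming map on JsArray behaves like the standard Scala collection method):
val titles = doc.store.book.asInstanceOf[JsArray].map { book =>
  book.title   // a JsValue holding the title
}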
With Unicorn, we can also look up a field in the current object and all descendants:
unicorn> doc \\ "price"
res29: unicorn.json.JsArray = [8.95,12.99,8.99,22.99,19.95]
For more advanced query operations, JSONPath can be employed.
JSONPath is a means of using XPath-like syntax to query JSON structures. JSONPath expressions always refer to a JSON structure in the same way as XPath expressions are used in combination with an XML document.
val jspath = JsonPath(doc)
Since a JSON structure is usually anonymous and doesn't necessarily have a "root member object", JSONPath assumes the abstract name $ assigned to the outer level object. Besides, @ refers to the current object/element.
// the authors of all books in the store
jspath("$.store.book[*].author")
// all authors
jspath("$..author")
// all things in store
jspath("$.store.*")
// the price of everything in the store
jspath("$.store..price")
// the third book
jspath("$..book[2]")
// the last book in order
jspath("$..book[-1:]")
// the first two books
jspath("$..book[0,1]")
jspath("$..book[:2]")
// filter all books with isbn number
jspath("$..book[?(@.isbn)]")
//filter all books cheaper than 10
jspath("$..book[?(@.price<10)]")
// all members of JSON structure
jspath("$..*")
Our JSONPath parser supports all queries except for those that rely on expressions of the underlying language, like $..book[(@.length-1)]. However, there’s usually a ready workaround as you can execute the same query using $..book[-1:].
Another deviation from JSONPath is that we always flatten the results of a recursive query. Using the bookstore example, typically a query of $..book will return an array with one element, the array of books. If there were another book array somewhere in the document, then $..book would return an array with two elements, both arrays of books. However, if you were to query $..book[2] for our example, you would get the third book in the first array, which assumes that the $..book result has been flattened. In Unicorn, we always flatten the result of recursive queries regardless of the context.
It is also possible to update fields with JSONPath. Currently, we support only child and array slice operators for update.
jspath("$['store']['book'][1:3]['price']") = 30.0
Unicorn's document APIs are defined in the package unicorn.unibase, which is independent of the backend storage engine. Simply pass a reference to the storage engine to Unibase(), which provides the interface of the document model.
In relational database, a table is a set of tuples that have the same attributes. In Unibase, a table contains documents, which are JSON objects and may have nested objects and/or arrays. Besides, the tables in Unibase do not enforce a schema. Documents in a table may have different fields. Typically, all documents in a collection are of similar or related purpose. Moreover, documents in a table must have unique IDs/keys, which is not necessary in relational databases.
In MongoDB, such a group of documents is called a collection. In order to avoid confusion with Java/Scala's collection data structures, Unibase simply calls it a table.
val db = Unibase(Accumulo())
db.createTable("worker")
val workers = db("worker")
In the above, we create a table worker in Unibase. Then we create a JSON object and insert it into the worker table as follows. Note that we explicitly create the JsObject by specifying the fields and values instead of parsing it from a JSON string. This approach provides fine control over the data types of the fields.
val joe = JsObject(
"name" -> "Joe",
"gender" -> "Male",
"salary" -> 50000.0,
"address" -> JsObject(
"street" -> "1 ADP Blvd",
"city" -> "Roseland",
"state" -> "NJ",
"zip" -> "07068"
),
"project" -> JsArray("HCM", "NoSQL", "Analytics")
)
val key = workers.upsert(joe)
Each document should have a field _id as the primary key. If it is missing, the upsert operation will generate a random UUID as _id, which is returned and also added into the input JSON object:
unicorn> joe.prettyPrint
res1: String =
{
"address": {
"city": "Roseland",
"state": "NJ",
"zip": "07068",
"street": "1 ADP Blvd"
},
"name": "Joe",
"gender": "Male",
"salary": 50000.0,
"project": ["HCM", "NoSQL", "Analytics"],
"_id": "cc8a6a6d-4305-4b77-a776-6f70f2306d06"
}
If the input object includes _id and the table already contains a document with the same key, the document will be overwritten by upsert. If this is not the preferred behavior, one may use insert, which checks whether the document already exists and throws an exception if so.
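For example, a small sketch: since joe now carries the generated _id, inserting it again would raise an exception.
workers.insert(joe)   // throws because a document with this _id already exists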
To make the code future proof, it is recommended to use the constant value $id, defined in the Unibase package object, instead of _id in the code.
Besides UUID, one may also use Int, Long, Date, String, and BSON's ObjectId (12 bytes: a 4-byte timestamp, 3-byte machine id, 2-byte process id, and 3-byte incrementer) as the primary key. One may even use a complex JSON data type such as an object or an array as the primary key. However, this is NOT recommended because primary keys cannot be updated once a document is inserted. To achieve similar effects, the old document has to be deleted and inserted again with the new key. However, the old document will be permanently deleted after the major compaction, which may not be desired. Even before the document is permanently deleted, the time travel functionality is broken for this document.
To get the document back, simply treat the table as a map and use _id as the key:
unicorn> workers(key).get.prettyPrint
res3: String =
{
"address": {
"city": "Roseland",
"state": "NJ",
"zip": "07068",
"street": "1 ADP Blvd"
},
"name": "Joe",
"gender": "Male",
"salary": 50000.0,
"project": ["HCM", "NoSQL", "Analytics"],
"_id": "cc8a6a6d-4305-4b77-a776-6f70f2306d06"
}
If the document doesn't exist, None is returned.
To update a document, we use a MongoDB-like API:
val update = JsObject(
"$id" -> key,
"$set" -> JsObject(
"salary" -> 100000.0,
"address.street" -> "5 ADP Blvd"
),
"$unset" -> JsObject(
"gender" -> JsTrue
)
)
workers.update(update)
The $set operator replaces the value of a field with the specified value, provided that the new field does not violate a type constraint (and the document key _id should not be set). If the field does not exist, $set will add a new field with the specified value.
To specify a field in an embedded object or in an array, use dot notation. To be compatible with MongoDB, we concatenate the array name with the dot (.) and the zero-based index position to specify an element of an array, which is different from the [] convention in JavaScript and JSONPath.
In MongoDB, $set will create the embedded objects as needed to fulfill the dotted path to the field. For example, for $set {"a.b.c" : "abc"}, MongoDB will create the embedded object "a.b" if it doesn't exist. However, we don't support this behavior for performance considerations. We suggest the alternative syntax {"a.b" : {"c" : "abc"}}, which has the equivalent effect.
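For example, a minimal sketch of the suggested form (the field names a, b, and c are just illustrative):
val update = JsObject(
  "_id" -> key,
  "$set" -> JsObject(
    "a.b" -> JsObject("c" -> "abc")
  )
)
workers.update(update)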
If you want to use json string interpolation to create a JSON object for update, remember to escape $set and $unset with a double dollar sign, e.g. $$set and $$unset.
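For example, a sketch of an update written with the json interpolator (note the doubled dollar signs):
val update = json"""
{
  "$$set": {
    "salary": 100000.0,
    "address.street": "5 ADP Blvd"
  },
  "$$unset": {
    "gender": true
  }
}
"""
update("_id") = key
workers.update(update)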
To delete a document, use the method delete with the document key:
workers.delete(key)
When creating a table, we may declare it append only. Such a table is write-once (i.e. the same document is never updated). Any updates to existing documents will throw exceptions. Deletes are not allowed either.
db.createTable("stock", appendOnly = true)
val prices = db("stock")
val trade = json"""
{
"ticker": "GOOG",
"price": 700.0,
"timestamp": ${System.currentTimeMillis}
}
"""
val key = prices.upsert(trade)
prices.update(JsObject(
"_id" -> key,
"$set" -> JsObject(
"price" -> JsDouble(800.0)
)
))
java.lang.UnsupportedOperationException
at unicorn.unibase.Table.update(Table.scala:320)
... 52 elided
Multi-tenant tables are regular tables that enable views to be created over the table across different tenants. This option is useful to share the same physical BigTable table across many different tenants.
To use a multi-tenant table, the user must first set the tenant id, which cannot be undefined, null, boolean, counter, date, or double. Tenants only see their own data in such tables.
val workers = db("worker")
workers.tenant = "IBM"
val ibmer = workers.upsert(json"""
{
"name": "Tom",
"age": 40
}
""")
workers.tenant = "Google"
val googler = workers.upsert(json"""
{
"name": "Tom",
"age": 30
}
""")
Because the tenant is "Google" now, the data of tenant "IBM" are not visible.
unicorn> workers(ibmer)
res5: Option[unicorn.json.JsObject] = None
unicorn> workers(googler)
res6: Option[unicorn.json.JsObject] = Some({"name":"Tom","age":30,"_id":"545ed4d1-280c-4b6a-a3cc-e0a3c5fc5b43"})
Switch back to "IBM", the view is different:
unicorn> workers.tenant = "IBM"
workers.tenant: Option[unicorn.json.JsValue] = Some(IBM)
unicorn> workers(ibmer)
res8: Option[unicorn.json.JsObject] = Some({"name":"Tom","age":40,"_id":"2b7fb69f-810f-4ca7-a70f-5db767bc8e49"})
unicorn> workers(googler)
res9: Option[unicorn.json.JsObject] = None
As a client-side solution, Unicorn does not enforce security on multi-tenant tables. In fact, there are no concepts of user and/or role. It is the application's responsibility to ensure authorization and authentication for access to multi-tenant tables.
When Unicorn creates a document table, it creates only one column family by default.
db.createTable("worker",
families = Seq(Unibase.DocumentColumnFamily),
locality = Map().withDefaultValue(Unibase.DocumentColumnFamily))
where the parameter families is the list of column families, and the parameter locality, a map, tells Unicorn how to map the data to different column families. Because we have only one column family here, we simply set the default value of the map to the only column family.
This schema can be customized. When documents in a table have a lot of fields and only a few fields are needed in many situations, it is a good idea to organize them into different column families based on business logic and access patterns. Such a design is more efficient because the storage engine needs to scan only the necessary column family. It also limits network data transmission.
db.createTable("worker",
families = Seq(
"id",
"person",
"address",
"project"),
locality = Map(
$id -> "id",
"address" -> "address",
"project" -> "project"
).withDefaultValue("person"))
For simplicity, Unicorn uses only the top level fields of documents to determine the locality mapping.
val joe = JsObject(
"name" -> "Joe",
"gender" -> "Male",
"salary" -> 50000.0,
"address" -> JsObject(
"street" -> "1 ADP Blvd",
"city" -> "Roseland",
"state" -> "NJ",
"zip" -> "07068"
),
"project" -> JsArray("HCM", "NoSQL", "Analytics")
)
val key = workers.upsert(joe)
We can retrieve partial documents as follows, which is known as "projection" in relational databases and MongoDB.
unicorn> workers(key)
res17: Option[unicorn.json.JsObject] = Some({"address":{"city":"Roseland","state":"NJ","zip":"07068","street":"1 ADP Blvd"},"project":["HCM","NoSQL","Analytics"],"_id":"6df63cf3-e4dd-4381-8276-ac0c0626dc78"})
unicorn> workers(key, "address")
res19: Option[unicorn.json.JsObject] = Some({"address":{"city":"Roseland","state":"NJ","zip":"07068","street":"1 ADP Blvd"},"_id":"6df63cf3-e4dd-4381-8276-ac0c0626dc78"})
unicorn> workers(key, "project")
res20: Option[unicorn.json.JsObject] = Some({"project":["HCM","NoSQL","Analytics"],"_id":"6df63cf3-e4dd-4381-8276-ac0c0626dc78"})
You can retrieve data from multiple column families too.
unicorn> workers(key, "name", "address")
res21: Option[unicorn.json.JsObject] = Some({"address":{"city":"Roseland","state":"NJ","zip":"07068","street":"1 ADP Blvd"},"name":"Joe","gender":"Male","salary":50000.0,"_id":"6df63cf3-e4dd-4381-8276-ac0c0626dc78"})
However, there is a semantic difference from regular projection, as you may notice above. Even though the user asks for only name, all other fields in the same column family are returned. This is due to the design of BigTable and the mapping from document fields to columns.
For example, if a specified field is a nested object, there is no easy way to read only the specified object in BigTable. Intra-row scan may help, but not all BigTable implementations support it. And if there are multiple nested objects in a request, we would have to send multiple Get requests, which is not efficient. Instead, we return the whole object of a column family if some of its fields are requested. This is usually good enough for the hot-cold data scenario. For instance, in a table of events, each event may have its header in one column family and the event body in another column family. In many reads, we only need to access the header (the hot data). Only when the user is interested in the event details do we read the event body (the cold data). Such a design is simple and efficient. Another difference from MongoDB is that we do not support excluded fields.
We may also run Unicorn code as a shell script or batch command. The following bash script can be run directly from the command shell:
#!/bin/bash
exec unicorn -nc "$0" "$@"
!#
import unicorn.json._
import unicorn.bigtable._
import unicorn.bigtable.accumulo._
import unicorn.bigtable.hbase._
import unicorn.unibase._
val db = Unibase(Accumulo())
db.createTable("worker")
val workers = db("worker")
val joe = JsObject(
"name" -> "Joe",
"gender" -> "Male",
"salary" -> 50000.0,
"address" -> JsObject(
"street" -> "1 ADP Blvd",
"city" -> "Roseland",
"state" -> "NJ",
"zip" -> "07068"
),
"project" -> JsArray("HCM", "NoSQL", "Analytics")
)
workers.upsert(joe)
Advanced document API with HBase is available in the package unicorn.narwhal. To use these features, the user should use the classes Narwhal and HTable, which are subclasses of Unibase and Table, respectively.
val db = new Narwhal(HBase())
db.createTable("narwhal")
val rich = json"""
{
"owner": "Rich",
"phone": "123-456-7890",
"address": {
"street": "1 ADP Blvd.",
"city": "Roseland",
"state": "NJ"
},
"children": 2C
}
"""
val bucket = db("narwhal")
val key = bucket.upsert(rich)
The usage of Narwhal and HTable is similar to their parent classes. Besides, additional operators are available by taking advantage of HBase's features. Note that in the above the field children takes the value 2C. The suffix C indicates that it is a counter. You may use JsCounter to create a counter directly too. With counters, we may update them in an atomic operation. For a regular integer, we would instead have to read, update, and write back, which may cause consistency problems.
val increment = json"""
{
"$$inc": {
"children": 1
}
}
"""
increment("_id") = key
bucket.update(increment)
val doc = bucket(key).get
println(doc.children)
This example will print out the new value of children, i.e. 3. It is also possible to use a negative value in $inc operations, which effectively decreases the counter.
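For example, a small sketch that decreases the counter by one, following the same pattern:
val decrement = json"""
{
  "$$inc": {
    "children": -1
  }
}
"""
decrement("_id") = key
bucket.update(decrement)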
Another interesting feature is that we can roll document fields back to their previous values. If a user accidentally changes a value, don't worry. The old values are still in HBase and we can easily roll back.
val update = json"""
{
"$$set": {
"phone": "212-456-7890",
"gender": "M",
"address.street": "5 ADP Blvd."
}
}
"""
update("_id") = key
bucket.update(update)
In the above example, we update three fields: phone, gender, and address.street. Note that gender is actually a new field. Let's verify that these fields have been updated.
unicorn> bucket(key).get
res11: unicorn.json.JsObject = {"children":2,"address":{"city":"Roseland","state":"NJ","street":"5 ADP Blvd."},"owner":"Rich","gender":"M","_id":"0c354c07-6e25-4b30-9cf3-9a508b6868fd","phone":"212-456-7890"}
Now we roll back phone and gender.
val rollback = json"""
{
"$$rollback": {
"phone": 1,
"gender": 1
}
}
"""
rollback("_id") = key
bucket.update(rollback)
Printing out the document, we can see that phone has its previous value and gender is gone now. Of course, address.street still keeps the latest value.
unicorn> bucket(key).get
res15: unicorn.json.JsObject = {"children":4,"address":{"city":"Roseland","state":"NJ","street":"5 ADP Blvd."},"owner":"Rich","_id":"0c354c07-6e25-4b30-9cf3-9a508b6868fd","phone":"123-456-7890"}
This example shows how we can roll back $set operations. We can also roll back $unset operations.
val update = json"""
{
"$$unset": {
"owner": 1,
"address": 1
}
}
"""
update("_id") = key
bucket.update(update)
val rollback = json"""
{
"$$rollback": {
"owner": 1,
"address": 1
}
}
"""
rollback("_id") = key
bucket.update(rollback)
Another cool feature of Narwhal is time travel. Because HBase stores multiple timestamped values, we can query the snapshot of a document at a given time point.
val asOfDate = new Date
val update = json"""
{
"$$set": {
"phone": "212-456-7890",
"gender": "M",
"address.street": "5 ADP Blvd."
},
"$$inc": {
"children": 1
}
}
"""
update("_id") = key
bucket.update(update)
bucket(asOfDate, key)
Besides the plain Get, we can supply an as-of-date parameter in a time travel Get, which will retrieve the document's value at that time point.
So far, we have fetched documents by their keys. With HBase, we can also query documents with the method find by filtering field values. The find method returns an iterator over the documents that match the query criteria.
bucket.upsert(json"""{"name":"Tom","age":30,"state":"NY"}""")
bucket.upsert(json"""{"name":"Mike","age":40,"state":"NJ"}""")
bucket.upsert(json"""{"name":"Chris","age":30,"state":"NJ"}""")
val it = bucket.find(json"""{"name": "Tom"}""")
it.foreach(println(_))
Optionally, the find method takes a second parameter for projection, which is an object that specifies the fields to return. The find method with no parameters returns all documents from a table and all fields of the documents.
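For example, a sketch that projects only the name field of the matching documents (assuming a MongoDB-style projection object with included fields mapped to 1):
val it = bucket.find(json"""{"state": "NJ"}""", json"""{"name": 1}""")
it.foreach(println(_))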
The syntax of the filter object is similar to MongoDB. Supported operators include $and, $or, $eq, $ne, $gt, $gte (or $ge), $lt, $lte (or $le).
bucket.find(json"""
{
"$$or": [
{
"age": {"$$gt": 30}
},
{
"state": "NY"
}
]
}
""")
If you prefer a SQL-like query, just do it as follows:
bucket.find("""age > 30 OR state = "NY"""")
Note that the query operation is based on server-side filters. Although it minimizes network transmission, it is still a costly full table scan. However, if the table is multi-tenanted and each tenant does not have too much data (e.g. SaaS solutions for small businesses), the scan will usually be localized to one or a few nodes and is often quite fast. Compared to a secondary index, this approach does not penalize write performance and still provides fairly good query performance in such a situation.
For general purpose queries, a secondary index should be built to accelerate frequent queries. We will discuss our secondary index design below.
In fact, you can do a SQL-like query in the code or shell, which returns a DataFrame.
db.sql("""SELECT address.state,
COUNT(address.state),
MAX(age),
AVG(salary) as avg_salary
FROM worker
GROUP BY address.state
ORDER BY avg_salary""")
Although filtering is done on the server side, all other heavy computation such as SUM, GROUP BY, ORDER BY, etc. is done on the client side. So this is only useful for small operational queries that involve a handful of rows. For large scale analytics queries, we should use Spark as discussed below.
For large scale analytics, Narwhal supports Spark. A table can be exported to Spark as an RDD[JsObject].
import org.apache.spark._
import org.apache.spark.rdd.RDD
val conf = new SparkConf().setAppName("unicorn").setMaster("local[4]")
val sc = new SparkContext(conf)
val db = new Narwhal(HBase())
val table = db("worker")
table.tenant = "IBM"
val rdd = table.rdd(sc)
rdd.count()
In the above example, we first create a SparkContext. To export a table to Spark, simply pass the SparkContext object to the rdd method of a table object. In this example, we only export the data of the tenant IBM.
Although Spark has filter functions on RDDs, it is better to use HBase's server-side filter at the beginning to reduce network transmission. The rdd method can take a second, optional parameter for filtering, with the same syntax as in find.
val table = db("narwhal")
val rdd = table.rdd(sc, json"""
{
"$$or": [
{
"age": {"$$gt": 30}
},
{
"state": "NJ"
}
]
}
""")
rdd.count()
For analytics, SQL is still the best language. We can easily convert RDD[JsObject] to a strongly typed DataFrame to be analyzed in SparkSQL.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Worker(name: String, age: Int)
val workers = rdd.map { js => Worker(js.name, js.age) }
val df = sqlContext.createDataFrame(workers)
df.cache
df.show
df.registerTempTable("worker")
sqlContext.sql("SELECT * FROM worker WHERE age > 30").show
In the above example, we create a SQLContext on top of the SparkContext to use SparkSQL features. Then we create a DataFrame of the case class Worker with type information. Since we will use this DataFrame many times in the analysis, we also cache it in an in-memory columnar format. To do SQL queries, we also register this DataFrame as a temporary table.
Graphs are mathematical structures used to model pairwise relations between objects. A graph is made up of vertices (nodes) which are connected by edges (lines). A graph may be undirected, meaning that there is no distinction between the two vertices associated with each edge, or its edges may be directed from one vertex to another. Directed graphs are also called digraphs and directed edges are also called arcs or arrows.
A multigraph is a graph which is permitted to have multiple edges (also called parallel edges), that is, edges that have the same end nodes. The ability to support parallel edges simplifies modeling scenarios where there can be multiple relationships (e.g., co-worker and friend) between the same vertices.
In a property graph, the generic mathematical graph is often extended to support user defined objects attached to each vertex and edge. The edges also have associated labels denoting the relationships, which are important in a multigraph.
Unicorn supports directed property multigraphs. Documents from different tables can be added as vertices to a multigraph. It is also okay to add vertices that do not correspond to documents. Each relationship/edge has a label and optional data (any valid JsValue, default value JsInt(1)).
Unicorn stores graphs in adjacency lists. That is, a graph is stored as a BigTable whose rows are vertices with their adjacency list. The adjacency list of a vertex contains all of the vertex’s incident edges (incoming and outgoing edges are in different column families).
Because large graphs are usually very sparse, an adjacency list is significantly more space-efficient than an adjacency matrix. Besides, the neighbors of each vertex may be listed efficiently with an adjacency list, which is important in graph traversals. With our design, it is also possible to test whether two vertices are adjacent to each other for a given relationship in constant time.
In what follows, we create a graph of gods, an example from Titan graph database.
val db = Unibase(Accumulo())
db.createGraph("gods")
val gods = db.graph("gods", new Snowflake(0))
Because each vertex in the graph must have a unique 64-bit ID, we should provide a (distributed) ID generator when we go to mutate a graph. We currently provide the Snowflake ID generator, designed by Twitter. Each Snowflake worker should have a unique worker ID so that multiple workers won't generate duplicate IDs in parallel. In a small system, these worker IDs may be hand picked. For a large system, it is better to coordinate them with ZooKeeper. In this demo, we simply use 0 as the worker ID in the shell. For production, you may do things like
db.graph("gods", "zookeeper connection string")
If no ID generator is provided, db.graph("gods") returns a read-only instance for graph traversal or analytics.
Next, we will add several vertices with properties stored in a JsObject. The function addVertex returns the ID (of type Long) of the new vertex.
val saturn = gods.addVertex(json"""{"label": "titan", "name": "saturn", "age": 10000}""")
val sky = gods.addVertex(json"""{"label": "location", "name": "sky"}""")
val sea = gods.addVertex(json"""{"label": "location", "name": "sea"}""")
val jupiter = gods.addVertex(json"""{"label": "god", "name": "jupiter", "age": 5000}""")
val neptune = gods.addVertex(json"""{"label": "god", "name": "neptune", "age": 4500}""")
val hercules = gods.addVertex(json"""{"label": "demigod", "name": "hercules", "age": 30}""")
val alcmene = gods.addVertex(json"""{"label": "human", "name": "alcmene", "age": 45}""")
val pluto = gods.addVertex(json"""{"label": "god", "name": "pluto", "age": 4000}""")
val nemean = gods.addVertex(json"""{"label": "monster", "name": "nemean"}""")
val hydra = gods.addVertex(json"""{"label": "monster", "name": "hydra"}""")
val cerberus = gods.addVertex(json"""{"label": "monster", "name": "cerberus"}""")
val tartarus = gods.addVertex(json"""{"label": "location", "name": "tartarus"}""")
Of course, we will also add edges between vertices.
It is important that edges have direction. With addEdge(jupiter, "father", saturn), we add an edge from jupiter to saturn with the label father. For this edge, jupiter is called the out vertex while saturn is the in vertex. From the point of view of a vertex, this edge is an outgoing edge of jupiter and an incoming edge of saturn.
Besides, we may also associate any JsValue with an edge. If no value is provided, the default value is 1.
gods.addEdge(jupiter, "father", saturn)
gods.addEdge(jupiter, "lives", sky, json"""{"reason": "loves fresh breezes"}""")
gods.addEdge(jupiter, "brother", neptune)
gods.addEdge(jupiter, "brother", pluto)
gods.addEdge(neptune, "lives", sea, json"""{"reason": "loves waves"}""")
gods.addEdge(neptune, "brother", jupiter)
gods.addEdge(neptune, "brother", pluto)
gods.addEdge(hercules, "father", jupiter)
gods.addEdge(hercules, "mother", alcmene)
gods.addEdge(hercules, "battled", nemean, json"""{"time": 1, "place": {"latitude": 38.1, "longitude": 23.7}}""")
gods.addEdge(hercules, "battled", hydra, json"""{"time": 2, "place": {"latitude": 37.7, "longitude": 23.9}}""")
gods.addEdge(hercules, "battled", cerberus, json"""{"time": 12, "place": {"latitude": 39.0, "longitude": 22.0}}""")
gods.addEdge(pluto, "brother", jupiter)
gods.addEdge(pluto, "brother", neptune)
gods.addEdge(pluto, "lives", tartarus, json"""{"reason": "no fear of death"}""")
gods.addEdge(pluto, "pet", cerberus)
gods.addEdge(cerberus, "lives", tartarus)
Correspondingly, deleteEdge removes an edge and deleteVertex removes a vertex and all its associated edges.
To retrieve a vertex and its edges, gods(jupiter) returns a Vertex object containing its ID, properties, and associated edges. One can also directly access the data of an edge with gods(pluto, "brother", jupiter). If the edge doesn't exist, None is returned.
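For example, a small sketch with the vertices created above (assuming deleteEdge takes the same (out vertex, label, in vertex) arguments as addEdge):
gods(pluto, "brother", jupiter)          // data of the brother edge, or None if absent
gods(jupiter)                            // Vertex with ID, properties, and edges
gods.deleteEdge(pluto, "pet", cerberus)  // remove pluto's pet edge
gods.deleteVertex(cerberus)              // remove cerberus and all its edges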
Although some graphs (e.g. the Twitter graph) natively use Long integers as vertex IDs, many graphs use strings as vertex IDs. In RDF (Resource Description Framework), for example, vertices are encoded as URIs. In Unicorn, we can use strings as vertex keys too. Internally, Unicorn translates them to the corresponding Long vertex IDs. However, it is recommended to use Long vertex IDs directly if possible because they provide better performance.
It is also possible to add a document (in another table) as a vertex to a graph.
// key is the document key in the table "person"
val id = gods.addVertex("person", key)
If the person table is multi-tenanted, remember to use the tenant id as the third argument. To access the vertex, simply use gods("person", key).
For graph traversal, we support a Gremlin-like API. To start a traversal,
val g = gods.traversal
Then we can start with one or more vertices by the method v,
g.v(saturn)
A Traversal is essentially an Iterator of vertices or edges.
On a vertex, we can call outE() and inE() to access its outgoing edges or incoming edges, respectively. On an edge, inV() and outV() return its in vertex and out vertex, respectively. For a vertex, out() is a shortcut for outE().inV() and similarly in() is a shortcut for inE().outV(). All these functions may take a set of labels to filter relationships.
The following example shows how to get saturn's grandchildren's name.
g.v(saturn).in("father").in("father").name
For detailed information on Gremlin, please refer to its website.
Beyond simple graph traversal, Unicorn supports DFS, BFS, A* search, Dijkstra's algorithm, etc.
The example below searches for the shortest path between jupiter and cerberus with Dijkstra's algorithm.
val path = GraphOps.dijkstra(jupiter, cerberus, new SimpleTraveler(gods)).map { edge =>
(edge.from, edge.label, edge.to)
}
path.foreach(println(_))
Note that this search is performed by a single machine. For very large graphs, it is better to use a distributed graph computing engine such as Spark GraphX.
With HBase/Narwhal, we can export a graph to Spark GraphX for advanced analytics such as PageRank, triangle count, SVD++, etc.
import org.apache.spark._
val conf = new SparkConf().setAppName("unicorn").setMaster("local[4]")
val sc = new SparkContext(conf)
val graph = db.graph("gods")
val graphx = graph.graphx(sc)
// Run PageRank
val ranks = graphx.pageRank(0.0001).vertices
So far we access Unicorn through its Scala APIs. For other programming language users, we can manipulate documents with the HTTP API, which is provided by the Rhino module.
In the configuration file conf/rhino.conf, the underlying BigTable database engine should be configured in the section unicorn.rhino. The configuration file is in the format of Typesafe Config.
For example,
unicorn.rhino {
bigtable = "hbase"
accumulo {
instance = "local-poc"
zookeeper = "127.0.0.1:2181"
user = "root"
password = "secret"
}
cassandra {
host = "127.0.0.1"
port = 9160
}
}
In this example, we use HBase as the BigTable engine. Note that the configuration of HBase is in its own configuration files, which should be on the CLASSPATH of Rhino. Sample configurations of Accumulo and Cassandra are provided in the example for demonstration.
Currently, Rhino provides only data manipulation operations. Other operations such as table creation/drop should be done in the Unicorn Shell.
Method | URL | Operation |
---|---|---|
PUT | /table/[table name] | Insert |
POST | /table/[table name] | Upsert |
PATCH | /table/[table name] | Update |
DELETE | /table/[table name] | Delete |
DELETE | /table/[table name]/key | Delete |
GET | /table/[table name] | Get |
GET | /table/[table name]/key | Get |
The API is simple and easy to use. To insert a document, use the PUT method with the JSON object as the entity-body.
curl -X PUT -H "Content-Type: application/json" -d '{"_id":"dude","username":"xyz","password":"xyz"}' http://localhost:8080/table/rhino_test_table
To read it back, simply use the GET method. If the key is a string, the key can be part of the URI. Otherwise, the key should be set in the entity-body. The same rule applies to the DELETE method.
curl -X GET http://localhost:8080/table/unicorn_rhino_test/dude
To update a document, use the PATCH method.
curl -X PATCH -H "Content-Type: application/json" -d '{"_id":"dude","$set":{"password":"abc"}}' http://localhost:8080/table/rhino_test_table
In case of multi-tenancy, the tenant id should be set in the header.
curl -X PUT -H "Content-Type: application/json" --header 'tenant: "IBM"' -d '{"_id":"dude","username":"xyz","password":"xyz"}' http://localhost:8080/table/rhino_test_table
curl -X GET --header 'tenant: "IBM"' http://localhost:8080/table/rhino_test_table/dude
curl -X GET --header 'tenant: "MSFT"' http://localhost:8080/table/rhino_test_table/dude
curl -X DELETE --header 'tenant: "IBM"' http://localhost:8080/table/rhino_test_table/dude