Skip to content

Cassandra

Eric Robertson edited this page May 11, 2016 · 8 revisions

Cassandra is a wide-row store NoSQL database.

Cassandra uses partitioners to distribute data across nodes within database nodes. Nodes are organized into hierarchy of data centers contained by clusters.

Cassandra separates applications by key space. This can have an equivalence to GeoWave namespace. A keyspace contains one or more column families.

Cassandra performs best when data needed to satisfy a given query is located in the same partition key.

Partitioning

The default partitioner is the Murmur3Partitioner partitioner. It may look tempting to use the ByteOrderedPartitioner, to align with Hbase or Accumulo behaviors for the types of range scans, fundamental to GeoWave's design. The ByteOrderedPartitioner is not recommended as it presents difficulties in load balancing and hot spotting on writes.

Compaction

Choosing the correct compaction strategy impacts GeoWave performance. Compaction consolitates SSTables, removing deleted rows and producing single versions of each updated and inserted row. SSTables are organized by sorted partition id. By default, LeveledCompactionStrategy should be used since GeoWave targets read performance. LeveledCompactionStrategy levels the size of SSTables, consolidating into higher level tables of non-overlapping data. Each level increases the permitted table size by a factor of 10. In low write and high read volume applications, typically one higher level SSTable is only processed. In high write volume applications, there are many level 0 SSTables requiring more disk operations. Thus, for streaming applications, the DateTieredCompactionStrategy strategy may be more appropriate, as it compacts based on age of data.

Primary Indexing

The primary index for a Cassandra table is the row key. Secondary indices in Cassandra use the same definition as GeoWave, maintained as separate B-trees the reference (point to) the data in a table sorted by a primary key.

Cassandra design is akin to a nested sorted map. Row keys reference a sorted map of column key/values. Both are unbounded, supporting wide rows. Columns can be organized into super columns (columns within columns) and composite columns (two keys combined into a single column). Cassandra allows 2 billion columns per row. Column key names should be short, as the name is replicated for each cell. It is permissible to place the value in the key, leaving the column value null.

Wide rows have two caveats. First, a row's columns are stored together on a node. A wide row can result in a hotspot. Second, wide rows cannot fit into memory. Row caching cannot be used.

Each row contains a Bloom Filter of the column names in the row. Thus, scans across a row should be efficient. Furthermore, rows contain a column index for columns that collectively expand beyond a configured size (a page size). During query processing, the column index is when determining an column offset into the row. This occurs when a query requests a start column (position) or descending order.

The primary search criteria for GeoWave is a set of ranges over a set of bytes. Dividing bytes into two parts is the recommended design. The first part, the higher order bytes, is used as the row key and the adapter id. The second part, the lower order bytes, are used as part of the column name. The column name is a composite key, which includes the data id.

 CREATE TABLE IDX (
  horowid bigint,
  adapterid text,
  lowrowid bigint,
  dataid text,
  data blob
  PRIMARY KEY ((horowid , adapterid), lowrowid, dataid)
 ) WITH CLUSTERING ORDER BY (lowrowid ,dataid ASC);

A concern with this approach is the amount size of the low and higher order columns. Too few bits in higher order columns results in too little rows and an overflow of columns. Too many bits in the higher order column results in queries with long 'in' clauses:

select * from IDC where (horowid,adapterid) in ((434343434,'123'),(04934034,'123'),(304932,'123'),(4340394349,'123'),...) and lowrowid >= 4554 and lowrowid <= 574;

Each howrowid, adapterid pair needs to be constructed (combinatorially) over the entire query range. Once a combination has reached a certain capacity (or estimated to reach), it is more efficient to perform an unrestricted row scan; the column scan is still valid.

In the table description, all the datatypes are stored in a single blob. It is certainly possible attemp to store some basic Java types for each attribute OR use Json to store the data. For user defined types, they may be stored as Blob. Note: Geometry is one such type that is not present on Cassandra. More on this later.

Secondary Indexing

Cassandra has a rich set of secondary capabilities. SSTable attached secondary indices are an effective architecture for handling text indexing.

Unlike the GeoWave API, Cassandra's query optimizer can choose the appropriate secondary index. When using Cassandra, it may be tempting to turn off GeoWave secondary indexing. This largely depends on whether a features attributes are exposed as part of the column name.

Visibility

Cassandra does not have native cell level security. Custom filters would be needed. There are a few approaches that could used to handle security. (1) Include security within the column name. (2) Include security in a column value describing security for all attributes (the default behavior used to identify security for attributes in the GeoSever plugin).

Server Side filtering

GeoWave has compensated for the lack of support for query side filtering and aggregation by adding custom filters and aggregators. Cassandra supports custom functions and aggregators (UDFs). The only issue is that they must be defined in terms of scripts provided to Cassandra using Cassandra's built in types (e.g. standard Java).

  CREATE OR REPLACE FUNCTION fLog (input double) CALLED ON NULL INPUT RETURNS double LANGUAGE java AS 'return Double.valueOf(Math.log(input.doubleValue()));';

One solution is to place GeoWave related components on the classpath of each Cassandra server. This is not recommended since new nodes would need to be provisioned identically.

GeoWave JAR file on Cassandra classpath.

In this case, two UDFs are needed:

   CREATE OR REPLACE FUNCTION geoWaveFilter(adapterid text, dataid text, input blob, filterexp text) CALLED ON NULL INPUT RETURNS Boolean LANGUAGE java AS 'return GeoWaveFilter.runFilter(adapterid, dataid, input, filterexp)';
   CREATE OR REPLACE FUNCTION geoWaveAggregate(adapterid text, dataid text, input blob, aggregateexp text) CALLED ON NULL INPUT  RETURNS blob LANGUAGE java AS 'return GeoWaveFilter.runFilter(adapterid, dataid, input, filterexp)';

The last argument of each function is a text or blob represention of the filter or aggregate, serialized by the PersistenUtils. Ugly, but flexible.

Without GeoWave JAR file on Cassandra classpath

An effort to identify each type of filter, data type and function can be used identify a proper translation to Cassandra's existing set of types, filters and aggregates. This leads a more readable and natural presentation of GeoWave data in Cassandra. For example, the CQLFilter can be translated into a CQL expression for Cassandra. For those types not present in Cassandra, client filtering is still needed.

Mix of the two approaches

Placing some GeoWave functionality on the Cassandra classpath to support Geometry allows some of the basic filters to be translated into Cassandra UDFs, such as geometry range checks (Polygons).

Recommended Approach

Support both. An initial offering performs client side filtering and the ugly UDF approach. As the product matures, improve the translation of filters to Cassandra CQL.