
How to use compression with Hawkular Metrics


Sending inserts to different HWKMETRICS nodes

version.gorilla.compression

  • Distribute inserts to nodes with the same hashing using Infinispan
    • Each node processes the insert and appends it to the memory-mapped file of the compressed series
      • This is to reduce memory usage and to maximize the amount of data kept in memory for the next compaction
      • But persistence is needed if a lot of timestamps are sent during that time or memory is tight
        • Investigate Netty or Chronicle's ByteBuffer enhancements and some sort of lookup table
    • JGroups is used by the EhCache replication (or was; Hazelcast is probably the same company these days), so this is also one option
    • One more option could be to use AtomicMap, which allows safe reading (a snapshot)
      • However, we accept rewriting the same timestamp, replacing the old value, while AtomicMap is based on immutable objects (see the sketch after this list)
      • Can we somehow verify that the map did not change during reading, so that we don't delete data that changed?
      • Performance? State machines are usually not performant enough
  • After the time series block is full (the 2-hour block time has passed), write the block to Cassandra (or eventually ditch Cassandra)
    • This is technically what Facebook's Gorilla does as well, although with different memory boundaries
    • Can we force this in Infinispan to happen on the key-owning node?
  • When compacting, look for out-of-order writes and combine them into the compressed series (this can also be done later)
    • Alternatively, if we use an approach where we don't compress right away, we could write the series to its final location later
  • One option is to write directly to Cassandra's data table (as is done today) and read the rows afterwards.
  • We also need to store the tags somewhere (per-datapoint tags are currently supported)
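
A minimal sketch of the per-node, in-memory side of this, assuming the gorilla-tsc API from around this time (Compressor, ByteBufferBitOutput); the class and method names below are made up for illustration only. A write that reuses an existing timestamp simply replaces the old value, and the series is compressed once its 2-hour block closes:

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

import fi.iki.yak.ts.compression.gorilla.ByteBufferBitOutput;
import fi.iki.yak.ts.compression.gorilla.Compressor;

/**
 * Per-node buffer for the currently open 2-hour block. A write that reuses an
 * existing timestamp simply replaces the old value, as accepted above.
 */
public class OpenBlockBuffer {

    private static final long BLOCK_SIZE_MS = 2 * 60 * 60 * 1000L;

    // metric key -> (timestamp -> value) for the open block
    private final Map<String, ConcurrentSkipListMap<Long, Double>> openSeries =
            new ConcurrentHashMap<>();

    public void addDataPoint(String metricKey, long timestamp, double value) {
        openSeries.computeIfAbsent(metricKey, k -> new ConcurrentSkipListMap<>())
                .put(timestamp, value); // same timestamp -> the old value is replaced
    }

    /** Compress one series of a finished block with gorilla-tsc. */
    public ByteBuffer closeSeries(String metricKey, long blockStart) {
        ConcurrentSkipListMap<Long, Double> series = openSeries.remove(metricKey);
        if (series == null || series.isEmpty()) {
            return null;
        }
        ByteBufferBitOutput out = new ByteBufferBitOutput();
        Compressor compressor = new Compressor(blockStart, out);
        series.forEach((ts, value) -> compressor.addValue(ts, value)); // already sorted by timestamp
        compressor.close();
        ByteBuffer block = out.getByteBuffer();
        block.flip(); // ready to be persisted (Cassandra or a memory-mapped file)
        return block;
    }

    public static long blockStart(long timestamp) {
        return timestamp - (timestamp % BLOCK_SIZE_MS);
    }
}
```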

First phase: modifications

Job scheduler

  • Create a job that is run every two hours
    • Timestamp block starts in even hours (00, 02, 04, ..)
    • Job runs on odd hours (01, 03, 05, ..)
  • It reads all the rows of the previous timestamp block (for example, the 03 job reads everything written between 00 and 02; see the block boundary sketch below)
    • Creates a compressed block
    • Writes it to the compressed table
    • Deletes the originals from data table
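
A rough sketch of the block-window arithmetic such a job would need (the class name is hypothetical); a job triggered at 03:00 resolves the closed block 00:00-02:00:

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;

public class CompressionBlockWindow {

    private static final Duration BLOCK = Duration.ofHours(2);

    /**
     * For a job triggered at an odd hour (01, 03, 05, ...), return the start of the
     * block that just closed. Example: a trigger at 03:00 returns 00:00, so the job
     * compresses everything written between 00:00 and 02:00.
     */
    public static Instant blockStartFor(Instant jobTrigger) {
        ZonedDateTime hour = jobTrigger.atZone(ZoneOffset.UTC).truncatedTo(ChronoUnit.HOURS);
        // Step back to the previous even hour (start of the still-open block),
        // then back one more block to reach the block that just closed.
        ZonedDateTime openBlockStart = hour.minusHours(hour.getHour() % 2);
        return openBlockStart.minus(BLOCK).toInstant();
    }

    public static Instant blockEndFor(Instant jobTrigger) {
        return blockStartFor(jobTrigger).plus(BLOCK);
    }
}
```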

Write path

  • Timestamp precision changes for the write path; this can be done in the compression phase
    • Such as second precision instead of millisecond precision
      • Depending on the implementation, this needs to be reversed in the read path - or we just set the milliseconds to zero and accept that (in the case of second precision)
  • Create a wrapper for the gorilla-tsc library to add compression headers and to read them (see the header sketch after this list):
    • Type of compression used (1 byte, possibly a bit mask), allowing different timestamp + value compression bundles
  • Create modifications to DataAccessImpl to write to the data_compressed table and delete the selected keys from data
  • Since the writes and deletes always go to the same partition, the delete and insert can be done in a single batch -> performance improvement and no data loss
  • Out-of-order writes happen as before
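
A sketch of what the 1-byte header wrapper could look like; the constant and layout here are placeholders rather than a decided format:

```java
import java.nio.ByteBuffer;

/** Prepends a 1-byte compression type header to a compressed block and reads it back. */
public final class CompressionHeader {

    // Hypothetical id for the default "timestamps + values with Gorilla" bundle
    public static final byte GORILLA = 0x01;

    public static ByteBuffer wrap(byte compressionType, ByteBuffer compressedBlock) {
        ByteBuffer wrapped = ByteBuffer.allocate(1 + compressedBlock.remaining());
        wrapped.put(compressionType);
        wrapped.put(compressedBlock);
        wrapped.flip();
        return wrapped;
    }

    /** Reads the header byte; the remainder of the buffer is the compressed payload. */
    public static byte compressionType(ByteBuffer wrapped) {
        return wrapped.get();
    }
}
```

The wrapped buffer would then be bound to the c_value column of the data_compressed insert, and that insert plus the delete of the original data rows would go into the single batch mentioned above.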

Reading path

  • Add the c_value column to the data table queries in DataAccessImpl
    • Truncate the dates to the previous 2-hour block when reading from a known compressed time range
    • Out-of-order writes are correctly processed when doing the sorting in the reading path
    • And even if the compression job dies, the parsing will still be correct
  • Create modifications to findDataPoints in the MetricsServiceImpl
    • Sort each time if descending order is requested for compressed datapoints, or if there were out-of-order writes that need to be included
  • Create new Row -> List<DataPoint> functions (and the same for the other types) with enough filtering information to return only the requested datapoints - or an Observable - but we need to read all the points in any case (see the sketch after this list)
    • Add the ability to also attach the matching tags to the DataPoint
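
A minimal sketch of the Row -> List<DataPoint> direction, assuming the gorilla-tsc Decompressor/Pair API and the DataStax driver's Row; the DataPoint class below is just a stand-in for the real Hawkular model class:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

import com.datastax.driver.core.Row;

import fi.iki.yak.ts.compression.gorilla.ByteBufferBitInput;
import fi.iki.yak.ts.compression.gorilla.Decompressor;
import fi.iki.yak.ts.compression.gorilla.Pair;

public class CompressedRowReader {

    /** Stand-in for the real Hawkular DataPoint model class. */
    public static class DataPoint {
        public final long timestamp;
        public final double value;

        DataPoint(long timestamp, double value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /**
     * Decompresses the c_value blob of one row and keeps only the datapoints inside the
     * requested range. The whole block has to be decompressed either way; the filtering
     * only decides what gets returned.
     */
    public List<DataPoint> toDataPoints(Row row, long rangeStart, long rangeEnd) {
        ByteBuffer compressed = row.getBytes("c_value");
        // If the 1-byte compression header from the write path is used, read it here first.
        Decompressor decompressor = new Decompressor(new ByteBufferBitInput(compressed));

        List<DataPoint> dataPoints = new ArrayList<>();
        Pair pair;
        while ((pair = decompressor.readPair()) != null) {
            long ts = pair.getTimestamp();
            if (ts >= rangeStart && ts < rangeEnd) {
                dataPoints.add(new DataPoint(ts, pair.getDoubleValue()));
            }
        }
        return dataPoints;
    }
}
```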

For other types than Gauge

  • Create modifications to the gorilla compressor to also allow long compression, not just double
    • This needs to be known when decompressing - set a bit in the header? An EnumSet?
  • Create an Availability -> compression path (the enum literal number is the value) -> long compression
    • Although efficient even without changes, XOR is unnecessary here, as storing the leadingZeros and trailingZeros takes more space than the whole value would
      • Recommendation: store a 0 bit if unchanged, otherwise a set bit (1) + 4 bits for the possible values? (See the sketch below)
  • Need something in gorilla-tsc to allow reading only the timestamp or only the value, and the same for compressing.
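
A self-contained sketch of the proposed availability encoding: a 0 bit when the value repeats, otherwise a 1 bit followed by 4 value bits. BitSet is used here only to keep the example runnable; in practice the bits would go through the gorilla-tsc bit output:

```java
import java.util.BitSet;

/** Encodes availability values (small enum ordinals) with the 1 + 4 bit scheme above. */
public class AvailabilityEncoder {

    private final BitSet bits = new BitSet();
    private int position;
    private int lastValue = -1;

    /** availabilityOrdinal is the enum literal number, e.g. UP/DOWN/ADMIN/UNKNOWN mapped to 0..n. */
    public void add(int availabilityOrdinal) {
        if (availabilityOrdinal == lastValue) {
            position++;                    // control bit 0: value unchanged
        } else {
            bits.set(position++);          // control bit 1: value changes
            for (int i = 3; i >= 0; i--) { // 4 value bits, most significant first
                if ((availabilityOrdinal & (1 << i)) != 0) {
                    bits.set(position);
                }
                position++;
            }
            lastValue = availabilityOrdinal;
        }
    }

    public BitSet encoded() {
        return bits;
    }

    public int lengthInBits() {
        return position;
    }
}
```

The value type (double, long, or availability) would then be signalled with a bit or a small field in the block header described in the write path.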