How to organize your data correctly
If you want to organize your data efficiently, you have to understand how the search works internally.
All indexes are sorted ascending by index value, which allows a binary search on the index files to find results quickly. The index ranges of the index files are kept in memory to speed up the search while keeping the memory footprint low, so the database does not need to hold all indexes in memory. An index range is defined by where the index values in a file start and end, in other words the first and last index value of an index file.
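The two-level lookup described above can be sketched as follows. This is a simplified illustration, not JumboDB's actual implementation: the catalog of per-file ranges, the file names, and the `(index_value, offset)` entry layout are all assumptions made for the example.

```python
from bisect import bisect_left

# Hypothetical in-memory catalog: (first value, last value) per index file.
index_ranges = [("a", "f"), ("g", "m"), ("n", "s"), ("t", "z")]

def candidate_files(value):
    # Only files whose [first, last] range covers the value need to be opened;
    # all other index files are skipped without touching the disk.
    return [i for i, (lo, hi) in enumerate(index_ranges) if lo <= value <= hi]

def binary_search(entries, value):
    # entries: list of (index_value, record_offset) sorted ascending by value,
    # as loaded from one index file. Binary search finds a match in O(log n).
    keys = [k for k, _ in entries]
    pos = bisect_left(keys, value)
    if pos < len(keys) and keys[pos] == value:
        return entries[pos][1]
    return None
```

Because the ranges live in memory, only the one or two files that can actually contain the value are ever read and searched.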
It is possible to create an unlimited number of indexes, and you can query several of them at once. Keep in mind, however, that when you query multiple indexes, the full result set of each matching index gets loaded and merged. Keeping the match set of each queried index small decreases the merge time massively. If you cannot keep the match sets small, you should use compound indexes instead.
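The trade-off between merging several single-field indexes and using one compound index can be sketched like this. The field names, offsets, and the dictionary standing in for a compound index are illustrative assumptions, not JumboDB's data structures.

```python
# Querying two single-field indexes: each returns its full match set of
# record offsets, which must then be merged (intersected). The merge cost
# grows with the size of both sets.
matches_color = {101, 102, 205, 309, 412}   # offsets where color == "red"
matches_size = {102, 205, 550}              # offsets where size == "L"
merged = matches_color & matches_size

# A compound index keys on both fields at once, so the lookup itself
# already yields the intersection and there is nothing left to merge.
compound_index = {("red", "L"): [102, 205], ("red", "M"): [101]}
direct = compound_index[("red", "L")]
```

With small match sets the intersection is cheap; with millions of matches per index, the compound lookup avoids loading and merging them entirely.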
All the data is stored as JSON and compressed with Snappy. Decompression has a low impact on processing power, but it massively increases the effective read rate from the disks: if a disk has a read rate of 100 MB/s, you can reach an effective read rate of up to 500 MB/s. Furthermore, you can store more data on the same disk.
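The effect can be demonstrated with a small sketch. Since Snappy is not in the Python standard library, `zlib` stands in for it here; the sample records and the assumed disk rate are made up for the example, but the principle is the same: repetitive JSON compresses well, so each physical megabyte read from disk yields several megabytes of logical data.

```python
import json
import zlib

# Repetitive JSON records, as typically stored in a data file.
records = [{"id": i, "color": "red", "size": "L"} for i in range(1000)]
raw = json.dumps(records).encode()
packed = zlib.compress(raw)  # stand-in for Snappy compression

ratio = len(raw) / len(packed)
disk_rate_mb_s = 100                       # physical read rate of the disk
effective_rate = disk_rate_mb_s * ratio    # logical MB/s seen by the query
```

Snappy compresses less aggressively than zlib but decompresses far faster, which is why it suits this read-heavy workload.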
One more way to improve the read rate is sorting the data. JumboDB tries to group read actions at nearby offsets in the same file into one read action. This causes fewer seeks and allows bigger blocks to be read in one go. If you sort your data by the most frequently queried index, you can reach a significant speed-up. An example: suppose you have geographical data and want to search it by a bounding box, which is the normal viewport of a maps API. All the points in the result are within the same geographical region, so the chance of reading nearby geographical points and data is very high. To speed up your search you could sort the data by latitude and longitude with the geohash sorter provided by JumboDB.
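Why geohash order groups nearby points can be illustrated with a minimal encoder. This is a sketch of the standard geohash algorithm (interleaved latitude/longitude bisection bits, base-32 encoded), not JumboDB's sorter: points that are geographically close share a hash prefix, so sorting records by their geohash places them near each other in the file.

```python
def geohash(lat, lon, precision=8):
    """Encode a coordinate as a standard base-32 geohash string."""
    base32 = "0123456789bcdefghjkmnpqrstuvwxyz"
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    even = True  # geohash interleaves bits, starting with longitude
    while len(bits) < precision * 5:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        even = not even
    chars = []
    for i in range(0, len(bits), 5):
        value = 0
        for b in bits[i:i + 5]:
            value = (value << 1) | b
        chars.append(base32[value])
    return "".join(chars)

# Two points in Denmark and one in New York: sorting by geohash keeps the
# Danish points adjacent, so a bounding-box query reads one contiguous area.
points = [(57.64911, 10.40744), (40.7, -74.0), (57.650, 10.410)]
ordered = sorted(points, key=lambda p: geohash(*p))
```

Sorted this way, a viewport query over Denmark touches one run of nearby file offsets, which JumboDB can serve with a few large reads instead of many scattered ones.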