Heuristic for Choosing Index Spatial and Spatial Temporal Indices

This page discussions heuristics for choosing indices. Along with discussions focused on secondary indexing, this discussion will result in solutions integrated into a cost based optimizer.

The first requirement is determining a low-cost (computative resource) solution for choosing between a spatial index and a spatial temporal index without using statistics. Each index has an assigned set of dimensions: latitude, longitude and time. Within a bin, time is bounded. For example, a default bin is one year. The time bounds depend on the index strategy configuration.

Heuristic without Statistics

Ideally, an an effective index strategy reduces the size and number of ranges of cells (cubes) to search, indexed along the Hilbert space filling curve. A query with a temporal constraint may represent an elongated hyper-rectangle, resulting an many small ranges along of the Hilbert curve. In this case, a spatial index combined with distributed filtering is an optimal. In this scenario, time is the least constraining dimension.

In general, the idea is to make a judgement by comparing the least constraining dimensions. Statistics are not used. This means, when comparing a spatial index to a geo-spatial index, no judgement can be made about the area covered by time constraints in the spacial index. In the temporal index, the area can be ascertained from the query range and the dimension's constraints.

The number of cells that cover the range values for a specific dimension d of index i (d is in {latitude, longitude, time}) is 2^bits(d,i)/(SFCmax(d) - SFCmin(d)).

For example, along the longitudinal axis, at 20 bits of precision and a range of 360 degrees, the number of cells per degree is roughly 2913. A range Multiplied by the query range yields the number of cells to be scanned by the query scanner. For example, a query over a range 46-48 degrees 2913*2 requires a search of 5826 cells.

CELLS(d,i) = ceiling(QR(d)/(SFCmax(d) - SFCmin(d)) * 2^bits(d,i))

The maximum of the total possible cells for each dimension yields the least constraining dimension for a query and a given index. The problem is that the bits of precision per each strategy are not necessarily the same. For example, a geo-temporal index may assign 20 bits per dimension and a geospatial may use 30 bits per dimension. Furthermore, the absence of dimension definition in one strategy makes it difficult to compare another computation where the dimension is defined.

Log2(CELLS(d,i)) obtains the number of bits to represent each query using defined precision for that dimension. Dividing by the the bits of the that dimension provides a ratio of bits used by the dimension. We find the minimum ratio ( the most constraining dimension ).

 RATIO(d,i) = Log2(CELLS(d,i))

Finding the maximum always fails to select a strategy the defines MORE dimensions with lower precision. Consider that a query range QR in a dimension shared across two strategies A and B. The second strategy B also defines another dimension not shared by the first. Assume the 'unshared' dimension of strategy B to be the most constraining (e.g. narrow range) for the given QR. It will never be a maximum. Thus, comparing the first and second strategy only considers the shared dimension. But the precision of that dimension is lower in the second strategy B, so QR will translate to a region of space (spanned by cells) that is the same as or GREATER than that as defined by strategy A.

Simply comparing the minimum is insufficient. As a heuristic, we want to approximate the cost such that the bounding hyper cubes for a query from strategy with more dimensions and less precision is close but exceeds the actual cost.

Chosen Approach

The range, PBR(i,d), of a single bit for index i and dimension d is (SFCmax(d) - SFCmin(d))/2^bits(d,i). The number of bits used by a query is LOG2(QR(d)/PBR(i)) The number of bits from fixed from the left, FL(d,i), is bits(d,i) - LOG2(QR(d)/PBR(i)) The least constraining dimension uses the **least **amount of bits of fixed bits from the left. The right most used bits represent the variability over which the query range must cover. For example, half of the world latitude is 1 bit, 1/4 of the world is 2 bits etc. Assume a single byte, the left half the world is a range from 00000000 01111111. The leftmost bit (0 for the left halt of the world) does not change. Using the example of a quarter of the world, suppose the quarter is 11. The left most bits will be 11. The right six bits vary, giving a range of 11000000 to 11111111. The use of the bits per dimension are interleaved. Thus, the least constraining is the one that produces query ranges with the most least fixed bits from the left, defined the number of dimensions multiplied by FL(d,i). FLT(i) = |d in i| * FL(d,i) We want the query that maximizes FLT(i), the index that produces a range with most fixed bits from the left.

Other techniques explored

One possible implementation is comparing between each strategy the minimum ratio multiplied by the number of dimensions.

 COST(i) for QR = (Min RATIO(d,i) for all d of i) * |d|

Another approach is to compute area. The problem is that area is unknown for the missing dimension (e.g. time in a geospatial index). It cannot be assumed by any given precision. Thus, the heuristic is constrained to only compare what is known. The most constraining dimension is the easiest most likely candidate.

The adjusted query region per dimension is computed as follows:

AQR(d,i) = CELLS(d,i) * (SFCmax(d) - SFCmin(d)) / 2^bits(d, i).

In computation of the Hilbert curve, bits across each dimension are interleaved. Thus, the least constraining dimension occupies the most significant bit. Each bit splits the search space. Ideally, we want the search space split at the right most bit. Thus, we want to maximum the determining bit from the left.

The next step is determine the number of bits from the left. First, lets determine a ratio of the number of bits used per dimension. log( CELLS(d,i) ) is the number of bits used to represent a cell number.

  MB(d) = log( CELLS(d,i) ) / bits(d,i)
        = log(QR(d,i) * 2^bits(d,i)/(SFCmax(d) - SFCmin(d)))/bits(d,i)
        = (bits(d,i) - log((SFCmax(d) - SFCmin(d))/QR(d,i))) / bits(d,i)

...more to come...

Statistics Approach

Determine the most constraining dimension using a histogram. First determine the number of items that fall within the adjusted query range (from above AQR(d,i)) per dimension. Choose the index i which contains the smallest most value.

Brute Force Approach with Statistics

Compute the query ranges for each index. Then intersect with the histograms over row ranges for index. Choose the set of ranges for the smallest range set.

Provide feedback

Saved searches