DBScan
For a summary of DBScan, see https://en.wikipedia.org/wiki/DBSCAN.
The DBScan implementation in GeoWave is incrementally evolving. As of 0.9.0, DBScan is implemented using Map-Reduce jobs. Ideally, the algorithm will eventually be moved to a Spark implementation.
The approach chosen meets the following requirements:

- Geo-spatial (orthodromic) distance is part of the distance measurement. Two objects cannot be close if they are not geo-spatially close.
- Distance measures can be further refined to include other attributes. Two objects can be geo-spatially close and yet far apart along other dimensions.
- The solution is scalable. The implementation is compatible with Hadoop 2.0 infrastructure.
- Support for Points, Polygons, Lines, and Line Strings.
- Ideally, the solution is not restricted to Simple Features.
- The output is a set of polygons representing dense regions, along with an approximate count of the number of items in each dense region.
- Leverage GeoWave's components where possible. This is intended to be a GeoWave solution.
The last requirement is worth emphasizing. There are a number of general-purpose DBScan implementations available, and some perform quite well. However, by assuming geo-spatial distances, GeoWave can offer short-cuts that provide performance and scale advantages.
Just prior to developing DBScan, an implementation for a subclass of near-neighbors algorithms was introduced. With the intent of reuse, the near-neighbors implementation lays the groundwork for DBScan. The supported near-neighbors problem is summarized as follows:
    For each item in the data set, find all other items in the same data set
    that are considered neighbors by a provided distance function.
The distance function assumes geo-spatial proximity as one of the primary measures, meeting requirements 1 and 2. Incrementally applied, the infrastructure could be used to build a nearest-neighbor graph, clusters, etc. An example use case is finding those items whose set of neighbors exceeds some minimum cardinality.
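As a concrete illustration of the geo-spatial component of such a distance function, the sketch below computes orthodromic (great-circle) distance with the haversine formula. The class name and the choice of haversine are assumptions for illustration; GeoWave's distance functions are pluggable and may compute this differently.

```java
/**
 * Illustrative sketch only: orthodromic (great-circle) distance via the
 * haversine formula. Not GeoWave's actual distance-function API.
 */
public final class HaversineDistance {
  private static final double EARTH_RADIUS_M = 6_371_000.0; // mean Earth radius

  /** Distance in meters between two (lon, lat) points given in degrees. */
  public static double distanceMeters(double lon1, double lat1,
                                      double lon2, double lat2) {
    double phi1 = Math.toRadians(lat1);
    double phi2 = Math.toRadians(lat2);
    double dPhi = Math.toRadians(lat2 - lat1);
    double dLambda = Math.toRadians(lon2 - lon1);
    double h = Math.sin(dPhi / 2) * Math.sin(dPhi / 2)
        + Math.cos(phi1) * Math.cos(phi2)
          * Math.sin(dLambda / 2) * Math.sin(dLambda / 2);
    return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h));
  }
}
```

A refined measure (requirement 2) would combine this with per-attribute distances, declaring two items neighbors only when every dimension is within its threshold.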
The solution for near-neighbors search involves partitioning the data such that neighboring items land in the same partition. The GeoWave indexing strategy creates geo-hashes as partition identifiers (Mapper keys). The width of a grid cell, set manually by a parameter, determines the dimensions of the grid (and thus the upper bound on the number of partitions). The width of a cell must exceed twice the maximum distance threshold for neighbors. Choosing a fine-grained grid reduces the size of the partitions and increases the number of partitions. Using the index strategy allows support of any mix of dimensions, including time.
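The sketch below illustrates the partitioning idea under simplifying assumptions: a flat longitude/latitude grid with string cell keys. GeoWave's actual index strategy produces geo-hash identifiers and supports additional dimensions; the names here are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch of grid-based partitioning (not GeoWave's actual
 * index strategy). Because the cell width exceeds twice the maximum
 * neighbor distance, an item's neighbors can only lie in its own cell or
 * in one of the 8 surrounding cells.
 */
public final class GridPartitioner {
  private final double cellWidthDegrees; // must exceed 2x the neighbor threshold

  public GridPartitioner(double cellWidthDegrees) {
    this.cellWidthDegrees = cellWidthDegrees;
  }

  /** Primary partition key: the cell containing the point. */
  public String primaryKey(double lon, double lat) {
    long col = (long) Math.floor((lon + 180.0) / cellWidthDegrees);
    long row = (long) Math.floor((lat + 90.0) / cellWidthDegrees);
    return col + ":" + row;
  }

  /**
   * Keys of the 8 surrounding cells, to which the item is also submitted so
   * that cross-cell neighbor pairs are not missed. A real implementation
   * would emit only the neighbor cells within the threshold distance.
   */
  public List<String> neighborKeys(double lon, double lat) {
    long col = (long) Math.floor((lon + 180.0) / cellWidthDegrees);
    long row = (long) Math.floor((lat + 90.0) / cellWidthDegrees);
    List<String> keys = new ArrayList<>();
    for (long dc = -1; dc <= 1; dc++) {
      for (long dr = -1; dr <= 1; dr++) {
        if (dc != 0 || dr != 0) {
          keys.add((col + dc) + ":" + (row + dr));
        }
      }
    }
    return keys;
  }
}
```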
In each partition, the items are compared to each other using a distance function (i.e., in the Reducer). A pair of items are neighbors if they have a distance less than a provided maximum distance threshold. For the items in a partition p, the pairwise search has an upper bound of O(|p|^2); thus, the total run time is bounded by the sum of O(|p|^2) over all partitions.
The chosen partition for each item is called the primary partition of that item. Geometries other than points may be placed in more than one primary partition. Items are rarely in the exact center of a partition; thus, to form the full set of neighbors, items are also submitted to neighboring partitions (within the threshold distance). Items included in a neighbor partition are labeled secondary. The distinction between primary and secondary is used in the search step within the partition: secondary items are passive participants in the search. The basic algorithm is as follows:
    primaries <- items assigned to the partition as primary
    secondaries <- items assigned to the partition as secondary
    neighborSet <- {}
    for each item P_i in primaries
        remove P_i from primaries
        for each item P_j in primaries
            dist = distance(P_i, P_j)
            if (dist < delta) then add tuple (P_i, P_j, dist) to neighborSet
        for each item O_j in secondaries
            dist = distance(P_i, O_j)
            if (dist < delta) then add tuple (P_i, O_j, dist) to neighborSet
    return neighborSet
There may be duplicate outputs across partitions. This occurs in two cases. First, geometries other than points can be assigned to multiple partitions as primary when they cross cell boundaries. Second, an item that is secondary in one partition is always primary in another; thus, the same pair can appear in two tuples with the roles reversed (e.g., (item1, item2, distance) and (item2, item1, distance)). A single pass of the algorithm does not de-duplicate these outputs.
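In Java, the partition-local search above might look roughly like the following; the types, tuple class, and method signatures are assumptions for illustration, not GeoWave's actual Reducer code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/**
 * Illustrative reducer-side neighbor search for one partition. The class
 * and method names are assumptions, not GeoWave's actual MapReduce API.
 */
public final class NeighborSearch {

  /** A (first, second, distance) neighbor tuple. */
  public static final class NeighborTuple {
    public final String first;
    public final String second;
    public final double dist;
    public NeighborTuple(String first, String second, double dist) {
      this.first = first;
      this.second = second;
      this.dist = dist;
    }
  }

  public interface DistanceFn<T> {
    double distance(T a, T b);
  }

  /**
   * Each primary is compared against the remaining primaries (so each
   * primary pair is emitted once within this partition) and against all
   * secondaries; secondaries never drive the outer loop.
   */
  public static <T> List<NeighborTuple> search(List<T> primaries,
      List<T> secondaries, DistanceFn<T> fn, double delta,
      Function<T, String> id) {
    List<NeighborTuple> neighborSet = new ArrayList<>();
    for (int i = 0; i < primaries.size(); i++) {
      T pi = primaries.get(i);
      for (int j = i + 1; j < primaries.size(); j++) {
        T pj = primaries.get(j);
        double dist = fn.distance(pi, pj);
        if (dist < delta) {
          neighborSet.add(new NeighborTuple(id.apply(pi), id.apply(pj), dist));
        }
      }
      for (T oj : secondaries) {
        double dist = fn.distance(pi, oj);
        if (dist < delta) {
          neighborSet.add(new NeighborTuple(id.apply(pi), id.apply(oj), dist));
        }
      }
    }
    return neighborSet;
  }
}
```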
The algorithm supports a form of secondary partitioning. The computing infrastructure may not support MANY small partitions in the initial partitioning step (the Map phase); this is evident when the shuffle phase takes an exceedingly long time. One remedy is to increase the grid cell size; however, this increases the cardinality of items in each partition p, adversely affecting run time. Secondary partitioning addresses both concerns.
Secondary partitioning uses the SAME index strategy with a smaller grid size (as determined by a 'factor'). Secondary partitioning further divides the items into smaller groups within the neighbor search phase. Thus, the search algorithm is altered slightly:
    for each item P_j in primaries within secondary partition S_p
        ...
    for each item O_j in secondaries within secondary partition S_p
        ...
NOTE: Do not confuse secondary partitioning with an item's secondary (passive) assignment to a partition.
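A minimal sketch of the bucketing step, assuming a caller-supplied secondary-key function; the names are illustrative. Each bucket's primaries and secondaries are then searched independently with the routine shown earlier.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/**
 * Illustrative sketch: within one primary partition, bucket items by a
 * finer secondary key before running the pairwise neighbor search.
 */
public final class SecondaryPartitioning {
  public static <T> Map<String, List<T>> bucket(List<T> items,
      Function<T, String> secondaryKey) {
    Map<String, List<T>> buckets = new HashMap<>();
    for (T item : items) {
      buckets.computeIfAbsent(secondaryKey.apply(item), k -> new ArrayList<>())
             .add(item);
    }
    return buckets; // run the neighbor search independently per bucket
  }
}
```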
# DBScan
DBScan is implemented on top of the aforementioned algorithm. DBScan is iterative. The first iteration creates polygons over dense sets of items. Subsequent iterations consume the polygons from prior iterations to find and union close polygons (intersecting or within the threshold distance). The grid cell size is increased for each iteration as the polygons grow in size (i.e., the precision factor is decreased). The algorithm iterates until one of three conditions occurs:
- A maximum number of iterations is reached.
- The number of resulting polygons from the current iteration does not change from the prior iteration, indicating no additional unions occurred.
- A single partition (e.g. the world) is created and processed in the current iteration (i.e. precision factor ~ 0).
Secondary partitioning is not used in subsequent iterations, as the population of polygons is considerably smaller and there is a risk of creating a large number of secondary partitions for each polygon.
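The iteration logic and its three stopping conditions can be sketched as a driver loop; the method names are hypothetical, since the real implementation chains Map-Reduce jobs per iteration.

```java
/**
 * Illustrative driver for the iterative DBScan. runClusterIteration is a
 * hypothetical stand-in for one Map-Reduce clustering pass; it returns the
 * number of resulting polygons.
 */
public final class DBScanDriver {
  public static void run(double precisionFactor, double decrement,
      int maxIterations) {
    int previousCount = -1;
    for (int i = 0; i < maxIterations; i++) {       // stop 1: max iterations
      int count = runClusterIteration(precisionFactor);
      if (count == previousCount) {
        break;                                      // stop 2: no additional unions
      }
      if (precisionFactor <= 0.0) {
        break;                                      // stop 3: single (world) partition
      }
      previousCount = count;
      // coarsen the grid for the next pass as the polygons grow
      precisionFactor = Math.max(0.0, precisionFactor - decrement);
    }
  }

  private static int runClusterIteration(double precisionFactor) {
    throw new UnsupportedOperationException("sketch only");
  }
}
```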
Each geometry is counted as one when included in a cluster. As a geometry may span multiple regions, only the closest point between two geometries is considered when forming the polygons. The closest point may be on a line segment. For lines and line segments, only dense points along those segments are considered. For polygons that intersect, the closest point may be within the geometry.
For dense regions, the number of items may be large. Two techniques handle this situation so that the upper bound on time and space is never reached. First, sets of points whose size exceeds min(100, max(100, Minimum Neighbor Threshold)) are compressed into a single polygon. This reduces space requirements. Second, a pre-processing step, using the same approach, compresses dense regions into single polygons before proceeding with the algorithm. This effectively replaces many items with a single item. In dense areas within a secondary partition, this compressed polygon often encompasses all items within the secondary partition.
NOTE: An option under consideration is to simply treat all items in a secondary partition as close, without computing individual distances.
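A minimal sketch of the compression step described above, assuming JTS geometries (modern org.locationtech package names) and a convex hull as the single stand-in polygon; GeoWave's actual hull construction and count bookkeeping may differ.

```java
import java.util.List;

import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.util.GeometryCombiner;

/**
 * Illustrative sketch: replace a dense set of geometries with one polygon.
 * The caller is assumed to carry the member count alongside the hull so the
 * approximate item count of the region is preserved.
 */
public final class DenseRegionCompressor {
  public static Geometry compress(List<Geometry> denseMembers) {
    // union the members, then keep only the convex hull as the stand-in
    Geometry combined = GeometryCombiner.combine(denseMembers);
    return combined.convexHull();
  }
}
```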
The DBScan run is configured by the following parameters:

- Distance Threshold
- Distances per dimension for primary and secondary partitioning
- Precision factor for primary indexing, from 0 to 1, where 1 is most granular. It is effective to start around 0.9 to 1.0. The algorithm decrements the factor by 0.15 (default) for each iteration.
- Maximum number of iterations
- Distance unit for geometry (e.g. meters)
- Partitioner class (extends Partitioner)
- Secondary Partitioner class
- Distance Function
- Cluster item extraction function
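For reference, these options might be gathered into a simple holder such as the following; all field names, and every default except the documented 0.15 decrement, are hypothetical rather than GeoWave's actual option keys.

```java
/**
 * Hypothetical holder for the DBScan options listed above. Field names are
 * illustrative, not GeoWave's actual configuration keys.
 */
public final class DBScanOptions {
  public double distanceThreshold;          // maximum neighbor distance
  public double[] dimensionDistances;       // per-dimension partitioning distances
  public double precisionFactor = 0.95;     // 0..1; docs suggest starting near 0.9-1.0
  public double precisionDecrement = 0.15;  // documented default per iteration
  public int maxIterations;                 // upper bound on iterations
  public String distanceUnit = "meters";    // unit for geometry distances
  public String partitionerClass;           // extends Partitioner
  public String secondaryPartitionerClass;  // optional secondary partitioner
  public String distanceFunctionClass;      // pluggable distance function
  public String clusterItemExtractionClass; // cluster item extraction function
}
```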