-
Notifications
You must be signed in to change notification settings - Fork 30
A data structure for efficient nearest-neighbor queries.
License
erdavila/M-Tree
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
An m-tree is a data structure which indexes objects according to their relative distances. It is efficient for nearest-neighbor queries. This implementation follows the content of the article http://www.vldb.org/conf/1997/P426.PDF, with the following highlights: * The data structure is the same described in the article. * The algorithm for adding data had adaptations to handle with some corner cases. * The algorithm for removing data was designed from scratch, since it was not described on the article. * Both query algorithms (by-range and k-nearest-neighbor) were merged into one. The query algorithm can have as criteria either the range or the maximum number of resulting items, or both. The algorithm processes the results as needed, as the resulting items are fetched. The results are returned in non-decreasing distance from the query data parameter. MTree is the core class that implements an m-tree. The data objects can be any object that are understood by the distance function. Some examples: * Data objects could be N-dimensional-space coordinates and the distance function could return the euclidean distance between two data objects. * Data objects could be regular strings and the distance function could calculate the edit (Levenshtein) distance between two strings. The distance function must be provided by the user. Two equal objects should not be added to the same MTree instance. There is no validation regarding this and, if done, the behavior of the tree is undefined. Given a data object type and a distance function, some aspects can directly impact the performance of an m-tree: the constraints on the number of children for the nodes and the support functions (split, promotion, and partition functions). An m-tree is implemented as a tree (really!? :-P) and the minimum and maximum number of children on each node can be customized. When the maximum capacity of a node is exceeded, a split function is used to split the node, which is replaced by two new nodes. The split function can be seen as being composed by a promotion function and a partition function. The promotion function must choose (promote) two children from the node that will be split. The partition function must partition the set of children into two sets corresponding to the promoted children. The two promoted children and their corresponding partitions will be used to create the two nodes that will replace the exceeding node. There are some pre-defined implementations for promotion and partition functions, and the user can implement new ones at will. The yet-to-be-implemented MTreeDict class wraps an MTree and a mapping structure, so it helps on keeping track of objects already added. The data objects work as keys for associated value objects, so MTreeDict can be used as a dictionary. The yet-to-be-implemented MTreeMultiDict works like the MTreeDict, but allows associating more than one value object to the same key. The number of values associated to the same key are taken into account when performing queries constrained by the number of results. See the README file specific for each language: * README-py for Python. * README-cpp for C++. For more information see: http://en.wikipedia.org/wiki/M-tree http://www.vldb.org/conf/1997/P426.PDF
About
A data structure for efficient nearest-neighbor queries.
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published