Example GSoC proposal: Add Table indexing
The Table class is a central data structure in Astropy and is useful for manipulating tabular data in general. As of now, the Table class offers functionality for certain database operations analogous to those in SQL, such as grouping (splitting the table into chunks by common fields) and joining (combining tables based on matching values). It would be useful to add more advanced database operations to the Table class; in particular, adding indexing capability to Table would allow for faster searches for individual records, and primary key indexing (such as by time) would improve the functionality of Table in dealing with time series data. My proposal involves implementing indexing functionality for Table while ensuring that adding and removing records works correctly with the internal indexing system and that addition/removal operations suffer minimal slowdown due to indexing.
In addition to serving as a general-purpose container for tabular data, the Table class in Astropy has useful functionality such as metadata and missing data handling, and is designed to interface with other packages in the Astropy project; for example, tables can be input or output with astropy.io.fits and astropy.io.ascii. The Table class also supports some of the database operations present in SQL and Pandas, such as grouping, joining, and stacking tables together. However, Table currently lacks the capability to perform indexing on columns to improve lookup performance, which other database software systems provide. My proposal aims to bring this functionality to Astropy and thereby increase the speed of table lookups, as well as allow for a system of unique row identifiers in an Astropy Table.
In an ordinary table without indexing, searching for a row based on one of its column values takes O(n), or linear, time. For example, in a table of 100,000 astronomical observations where one column records the exact time at which each observation was taken, it might take up to 100,000 operations to find an observation taken at 9:00 AM. The principle behind indexing is to maintain a sorted order for a given column, which reduces the time complexity of a search for a given record. For example, if the entries are sorted by observation time, a binary search on the table is possible, so the number of operations required is at most log2(100,000) = 17, rounded up. This is a clear computational speedup for large data sets, and the principle can be applied either to individual columns or to a "primary key" that uniquely identifies each table row. One drawback of table indexing is that insertions and removals of table records become more expensive, since the sorted index order must be maintained. Therefore, one challenge in designing an indexing system is to minimize this unavoidable slowdown.
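To make the comparison concrete, here is a small standalone sketch (plain Python, not Astropy code) contrasting a linear scan with a binary search over a sorted column; the observation times are simulated values, not real data:

```python
# Sketch: linear scan vs. binary search over a sorted column of row keys.
import bisect
import random

# Simulated "column" of 100,000 distinct, sorted observation times.
times = sorted(random.sample(range(1_000_000), 100_000))
target = times[42_000]

def linear_search(values, key):
    """O(n): scan every row until the key is found."""
    for i, v in enumerate(values):
        if v == key:
            return i
    return -1

def binary_search(values, key):
    """O(log n): ~17 comparisons for n = 100,000 on a sorted column."""
    i = bisect.bisect_left(values, key)
    if i < len(values) and values[i] == key:
        return i
    return -1

# Both strategies find the same row; only the cost differs.
assert linear_search(times, target) == binary_search(times, target) == 42_000
```

The same bisection idea underlies real index structures; they additionally keep the sorted order up to date as rows change.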
I will begin by assessing the current performance of database operations in Table using the open source asv benchmarking tool (https://github.com/spacetelescope/asv), which can be used for comparison later to determine the effectiveness of column indexing. After doing so and then writing high-level code to dispatch the task of column indexing to a separate data structure, I will implement this data structure and test it extensively. One possible data structure, used in MySQL, is a B-tree. A B-tree is similar to a binary search tree except that each node can have more than two children, and each operation on a B-tree (searching, sequential access, insertion, and deletion) runs in O(log n) time. This is in contrast with a non-indexed table system, in which insertion and deletion are effectively constant-time operations and searching is linear, or O(n). Thus, a B-tree is advantageous when the number of insertions or deletions is small relative to the number of expected searches. Another potential data structure is a hash table, which functions as an associative array mapping ID values to records.
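As an illustration of this tradeoff (not an implementation proposal), the following sketch contrasts a hash index with an ordered index over a toy column; the sorted list of (value, row) pairs stands in for the ordered keys a B-tree would maintain:

```python
# Two candidate index structures for a column: a hash index gives O(1)
# exact-match lookup but no ordered traversal; an ordered index (here a sorted
# list, standing in for a B-tree's keys) also supports range queries.
import bisect
from collections import defaultdict

column = [30, 10, 20, 10, 40]  # toy column values, in stored row order

# Hash index: value -> list of row numbers (handles duplicate values).
hash_index = defaultdict(list)
for row, value in enumerate(column):
    hash_index[value].append(row)

# Ordered index: (value, row) pairs kept sorted by value.
ordered_index = sorted((value, row) for row, value in enumerate(column))

def rows_equal(value):
    """Exact-match lookup via the hash index."""
    return hash_index.get(value, [])

def rows_in_range(lo, hi):
    """Rows with column value in [lo, hi]; only the ordered index can do
    this in O(log n + k) rather than a full scan."""
    start = bisect.bisect_left(ordered_index, (lo, -1))
    stop = bisect.bisect_right(ordered_index, (hi, len(column)))
    return [row for _, row in ordered_index[start:stop]]
```

For example, rows_equal(10) returns rows [1, 3], while rows_in_range(10, 20) also picks up row 2; a hash table alone cannot answer the range query efficiently.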
Once this is done, I plan to integrate this new data structure with the underlying data implementation in Table to complete a bare-bones indexing system for Table. At this stage, I will organize the column indexes in a non-clustered fashion, meaning that the sorting information for a column will be kept separate from the internal structure of the column itself. The next step will be to ensure that the indexing order remains consistent as the columns or properties of the overall table change, and I plan to research algorithms that might be used to minimize the time/memory impact of these changes. Another useful addition will be to designate a single column as the "primary key" of a table, meaning that each row can be uniquely identified by its value in that column, in the same way that the Series class in Pandas associates each data point with a unique ID. Ideally, the primary key will be a "clustered" index, so that the data itself is sorted internally based on the primary key values. A significant application of this primary key functionality will be the ability to create a Table for time series data and designate a time column as the primary key, so that there is an efficient API for retrieving table rows by time value.
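The clustered/non-clustered distinction can be sketched with plain lists standing in for Table columns (the names and layout here are illustrative, not Astropy API):

```python
# Hypothetical two-column table: observation time (the key) and a measured flux.
time_col = [9.5, 8.0, 10.25]   # hours; rows stored in insertion order
flux_col = [1.2, 3.4, 5.6]

# Non-clustered index: the data stays in insertion order; a separate structure
# records the row positions in key-sorted order.
nonclustered = sorted(range(len(time_col)), key=time_col.__getitem__)
assert nonclustered == [1, 0, 2]  # visiting rows in this order yields sorted times

# Clustered index: the rows themselves are physically reordered by the key,
# so the key column is sorted in memory and no side structure is needed.
time_col = [time_col[i] for i in nonclustered]
flux_col = [flux_col[i] for i in nonclustered]
assert time_col == [8.0, 9.5, 10.25]
```

Note that clustering must reorder every column consistently (flux_col moves with time_col), which is why only one clustered index per table is possible.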
After further optimization of the structures previously implemented, I will alter the database operations in Table (as well as any other relevant Astropy code) to take advantage of table indexing wherever possible. Finally, if there is time left over at the end of the summer after I have implemented the ideas described above, I will implement additional features for the Table class suggested by my mentors, such as allowing for nested tables or the creation of dynamically computed columns. While it is possible that there will be insufficient time to begin work on any extra features, such additions to the Table class could definitely be useful.
Community bonding period (April 27 — May 24):
Become more familiar with the existing database functionality in Table, look at the Pandas codebase to understand how the Series class handles indexing, and read about B-trees (which might be used as an indexing data structure).
May 25 — May 31 (1 week):
Using asv, create benchmarks for the current Table database operations and for row retrieval by column value, which should scale linearly with the number of rows. Write high-level code to keep a (not yet implemented) indexing data structure in memory and pass data from the Table class into it. This will serve as a temporary means of creating an index by passing control to the indexing data structure through new methods (e.g. create_index(col)).
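A rough sketch of this dispatch layer, using hypothetical names (ToyTable, SortedListIndex) that only illustrate the intended separation of concerns between the Table-facing API and a pluggable index backend:

```python
# The table only forwards column data; the index backend owns the layout, so a
# B-tree or hash table could later be swapped in without changing create_index.
class SortedListIndex:
    """Placeholder index backend keeping (value, row) pairs sorted by value."""
    def __init__(self, values):
        self.pairs = sorted((v, row) for row, v in enumerate(values))

class ToyTable:
    def __init__(self, columns):
        self.columns = columns   # dict: column name -> list of values
        self.indexes = {}        # column name -> index backend instance

    def create_index(self, col):
        # Dispatch: hand the column's data to the (replaceable) index structure.
        self.indexes[col] = SortedListIndex(self.columns[col])

t = ToyTable({"time": [3, 1, 2]})
t.create_index("time")
```

The point of this stage is only the plumbing: a working create_index entry point that a real data structure can slot into later.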
June 1 — June 14 (2 weeks):
After discussing with my mentors which data structure would make the most sense to use for indexing in Table, such as a B-tree or a hash table, implement this structure. Include a testing suite to ensure that the structure works properly, and take initial benchmarks of the structure's performance using asv.
June 15 — June 21 (1 week):
Without worrying about maintaining the integrity of the index as column or table properties change, incorporate the new indexing data structure into the Table class by writing a basic indexing API (e.g. a method create_index(col) and methods to select rows based on column values). Dispatch a separate data structure for each indexed column without affecting the internal data itself, i.e. make the indexes non-clustered.
June 22 — July 5 (2 weeks):
Draw on indexing implementations in Pandas and MySQL, as well as any other possible algorithms, to decide how to maintain the integrity of the index data structure upon row insertion/removal or any other change of table properties. Implement this functionality and test with ordinary data as well as corner cases.
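As a toy illustration of the bookkeeping involved (assuming a sorted-list index of (value, row) pairs as in the earlier sketches; none of these names are proposed API), insertion and removal might look like:

```python
# Maintaining a sorted index of (value, row) pairs as rows come and go.
import bisect

column = [30, 10, 20]
index = sorted((v, row) for row, v in enumerate(column))

def insert_row(value):
    """Append a row; splice its key into the index in sorted position."""
    row = len(column)
    column.append(value)
    bisect.insort(index, (value, row))

def remove_row(row):
    """Delete a row; row numbers after it must then be shifted down."""
    value = column.pop(row)
    index.remove((value, row))
    # Renumbering every following row is part of the maintenance cost that the
    # proposal aims to minimize (real implementations avoid this full sweep).
    for i, (v, r) in enumerate(index):
        if r > row:
            index[i] = (v, r - 1)

insert_row(25)
remove_row(0)   # drop the row holding 30
```

After these two operations the column is [10, 20, 25] and the index is again exactly the sorted (value, row) pairs, which is the invariant the integrity tests would check.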
July 6 — July 12 (1 week):
For the special case where a unique identifier for each row is useful, implement a method to designate a single column as a so-called "primary key". If feasible, make the primary key a clustered index, i.e. make the indexing information part of the Table itself. For example, this might mean sorting rows in memory based on their primary key values.
July 13 — July 19 (1 week):
Take another set of benchmarks and work on optimization of the indexing data structure, the algorithm(s) used to preserve the sorted order of the index, and any other time-critical operations. Rewrite performance bottlenecks in Cython.
July 20 — August 2 (2 weeks):
Go through the existing code in Table, particularly the basic database operations currently supported, and adjust these methods to make use of table indexing wherever doing so would improve performance. Also modify functions outside the astropy.table package that would benefit from the new indexing capability of Table. Document the new API for table indexing and write tests for all new changes to the Astropy code base.
August 3 — August 16 (2 weeks):
These two weeks will serve as a buffer period in case earlier steps of the proposal require more time, unforeseen difficulties arise, etc. If everything runs smoothly and the buffer period is unnecessary, then work on implementing additional features for the Table class. For example, it would be nice to have nested tables or Excel-style dynamically computed columns with dependencies.
August 17 (suggested pencils down date) — August 21 (final evaluation deadline):
Write documentation for all previous code and add any tests I may have missed during the main coding period. Look for potential bugs and fix any that arise.