cleanup taxonomy code after refactor (#2446)
## Taxonomy Refactor Overview

In an attempt to allow usage of NCBI taxid (motivation: CAMI benchmarking) and alternate hierarchical taxonomic ranks (motivation: LINS), I ended up refactoring the taxonomy code in a four-PR series. Taxonomic summarization results should not change.

Minor caveat: I was previously obtaining `query_bp` in a hacky manner to allow gather <4.4 results. The class methods are more robust, and I'd like to stop supporting gather <4.4 results. To allow this, I had to add the `query_bp`, `ksize`, and `scaled` columns to some testing results to keep tests functioning.

1. #2437 modifies `LineagePair` from a two-item `collections.namedtuple` to a three-item `typing.NamedTuple` containing an additional field, `taxid`, for storing NCBI taxid information. It also introduces classes (`BaseLineageInfo`, `RankLineageInfo`) that move lineage manipulation (from `lca_utils.py`) to class methods in order to support robust summarization across compatible lineages (lineages with the same hierarchical ranks). These classes are frozen so that they can be used as dictionary keys.
2. #2439 introduces classes that facilitate reading, summarization, and writing of gather results. First, it converts three prior `collections.namedtuple`s to `dataclasses`, used for storing information about the gather query (`QueryInfo`), summarized gather information for metagenome queries (`SummarizedGatherResult`), and classification information for genome queries (`ClassificationResult`). It then introduces three new classes for reading and manipulating gather results. `GatherRow` is used for reading each row from a gather file and automatically checking for required columns. `TaxResult` is used for storing a single row from a gather file, optionally (and ideally) with taxonomic information, stored as a `LineageInfo` class from PR 1. `QueryTaxResult` is used for storing all `TaxResult`s associated with a single query. `QueryTaxResult` adds methods to replicate the summarization previously done within `summarize_gather_at` in `tax_utils.py` and the classification thresholding in `genome` within `__main__.py`.
3. #2443 replaces the actual taxonomic summarization code in `tax/__main__.py` with code that uses the new classes. It also modifies the gather loading code to read using `GatherRow`, `TaxResult`, and `QueryTaxResult`.
4. #2446 removes old, unused functions that are rendered redundant by the new classes. It also removes the associated tests.

## Additional details for this PR (#2446)

- Delete old functions that aren't used outside of taxonomic summarization + associated tests
  - Including old `namedtuple`s: `QueryInf`, `SumGathInf`, `ClassInf`
- Make sure any old comments/documentation make it into new code
- Don't use unnecessary empty `()` for dataclasses
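
For readers unfamiliar with the pattern, here is a minimal sketch of the ideas from item 1 of the overview and the last bullet above: a three-item `typing.NamedTuple`, a frozen dataclass that can serve as a dictionary key, and a plain dataclass declared without empty `()`. It is an illustration only, not the code from these PRs. `ToyLineageInfo` and `ToyQueryInfo` are hypothetical names, and every field name other than `taxid` (including `rank` and `name`, plus the defaults and types) is an assumption.

```python
from dataclasses import dataclass
from typing import NamedTuple, Optional, Tuple


class LineagePair(NamedTuple):
    """One lineage entry. Only `taxid` is named in the PR text;
    `rank`, `name`, and the defaults are assumptions for this sketch."""
    rank: str
    name: Optional[str] = None
    taxid: Optional[int] = None


@dataclass(frozen=True)  # frozen + default eq=True => instances are hashable
class ToyLineageInfo:
    """Hypothetical stand-in for a lineage class; not the real API."""
    lineage: Tuple[LineagePair, ...]

    def display(self) -> str:
        # Join the names of all entries, skipping entries without a name.
        return ";".join(p.name for p in self.lineage if p.name)


@dataclass  # no arguments, so no empty "()" is needed
class ToyQueryInfo:
    """Hypothetical query record; field names are assumptions."""
    query_name: str
    ksize: int
    scaled: int


# Two equal lineages hash identically, so they collapse to one dict key.
lin_a = ToyLineageInfo((LineagePair("superkingdom", "d__Bacteria"),
                        LineagePair("phylum", "p__Bacteroidota")))
lin_b = ToyLineageInfo((LineagePair("superkingdom", "d__Bacteria"),
                        LineagePair("phylum", "p__Bacteroidota")))

bp_by_lineage = {lin_a: 1000}
bp_by_lineage[lin_b] += 500  # updates the same entry as lin_a
```

Freezing the class matters here because a regular dataclass with the default `eq=True` sets `__hash__` to `None`, which makes its instances unusable as dictionary keys.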