Releases: vanheeringen-lab/genomepy
[0.16.1] - 2023-06-14
[0.16.0] - 2023-05-31
Added
- genomepy search now accepts the --exact flag
- genomepy.Annotation.attributes() returns a list of all attributes from the GTF attributes column.
  - e.g. gene_name, gene_version
  - nice to use with genomepy.Annotation.from_attributes() or genomepy.Annotation.gtf_dict()
- When installing assemblies from older Ensembl release versions, a clearer error message is given if assembly cannot be found:
  - if the release does not exist, options will be given
  - if the assembly does not exist on the release version, all available options are given
  - if the URL to the genome or annotation files is incorrect, the error message stays the same
- new config option: ucsc_mirror, options: eu or us.
  - the mirror should only affect download speed
  - can be nice if the other mirror is down!
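To show what "attributes from the GTF attributes column" refers to, here is a minimal sketch in plain Python. It does not use genomepy: the helper name list_gtf_attributes and the sample records are made up for illustration, and genomepy.Annotation.attributes() may well be implemented differently.

```python
import re

# Hypothetical helper, not part of genomepy: collects the attribute keys
# found in the 9th (attributes) column of GTF-formatted lines.
ATTR_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"')


def list_gtf_attributes(lines):
    """Return the sorted attribute names seen in the GTF attributes column."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete GTF record
        for key, _value in ATTR_PATTERN.findall(columns[8]):
            names.add(key)
    return sorted(names)


# Made-up example records
example = [
    "#!genome-build example",
    'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "abc"; gene_version "2";',
    'chr1\tsrc\texon\t1\t50\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
]
print(list_gtf_attributes(example))
```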
Changed
- function get_division is now a class method of EnsemblProvider
- EnsemblProvider class methods get_division and get_version now require an assembly name.
- UCSC data is now downloaded over HTTPS instead of HTTP
Fixed
- genomepy.install() now returns a Genome instance with updated annotation attributes.
- now ignoring ~1600 assemblies from the Ensembl database with incorrect metadata
  - no easy way to retrieve this data
[0.15.0] - 2023-02-28
Added
- you can now tune the cache expiration time in the config
  - create a config with genomepy config generate, then tweak the values as desired.
- support for biopython >=1.80 with pyfaidx update
- raise an informative error when UCSC tools are missing
  - this should only happen in Pip installations
Fixed
- disabling already disabled plugins no longer throws an error
- bgzipping fixes:
  - bgzip works again with python>3.7 (openssl shenanigans. tabix was deprecated for htslib)
  - genome index works with genomepy install --bgzip (a 2nd is created with the correct naming format)
  - export file works with genomepy install --bgzip
  - genomepy.install_genome(bgzip=True) returns a Genome class instance with correct paths
[0.14.0] - 2022-08-01
Added
- now using filelock for improved thread safety
- now checking if every API/FTP/HTTP(S) is accessible before proceeding
- genomepy search improvements:
  - text search now accepts regex, and multiple substrings (space separated) are unordered.
  - taxonomy search now returns all hits that start with the given number.
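The two search behaviours above can be sketched in a few lines of plain Python. This is an illustration only, not genomepy's code: the function names are hypothetical, and case-insensitive matching is an assumption.

```python
import re


def text_matches(query, description):
    """True if every space-separated term (treated as a regex) is found
    in the description, in any order. Case-insensitivity is an assumption."""
    terms = query.split()
    return all(re.search(term, description, re.IGNORECASE) for term in terms)


def taxonomy_matches(query, tax_id):
    """True if the taxonomy ID starts with the given number."""
    return str(tax_id).startswith(str(query))


description = "Homo sapiens GRCh38"
print(text_matches("sapiens homo", description))  # terms in any order
print(text_matches("GRCh3[78]", description))     # regex term
print(taxonomy_matches(96, 9606))                 # prefix match on the ID
```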
Changed
- switched to pyproject.toml + hatchling for packaging
Fixed
- updated the README and CLI documentation to mention the Local provider
[0.13.1] - 2022-06-21
Changed
- removed unused keys from Ensembl and UCSC databases to reduce their size
Fixed
- added a retry for initializing the diskcache (seq2science/issues/887)
- can now find ensembl urls for genomes not using url_names properly (#205)
[0.13.0] - 2022-06-02
Added
- genomepy search and genomepy genomes can now return the (unfiltered) absolute genome size with argument --size
Changed
- changed caching backend to diskcache (thread safe)
- reduced the local cache size of NCBI (by about half)
  - by only storing assembly summary columns actually used by genomepy
[0.12.0] - 2022-03-28
Added
- genomepy.Annotation.lengths() to retrieve the gene/transcript lengths.
- genomepy.Annotation.from_attributes() can extract any sub-column from that pesky attributes column
Changed
- updated Boyle-lab blacklists
- genomepy.Annotation.genes() default changed from bed (commonly containing transcript names) to gtf (gene names)
Fixed
- blacklists now work with GENCODE
- query_mygene no longer filters input.
- genomepy install with local provider now understands you want the annotation if you pass a path to an annotation
[0.11.1] - 2022-01-06
Added
- quiet flag for genomepy.Annotation
- genomepy -v flag
Changed
- genomepy.Annotation raises a FileNotFoundError instead of a ValueError where appropriate.
- download_assembly_report refactored. Now downloads the report for the exact same assembly accession (and not the nearest NCBI assembly).
- broader unit tests for UCSC assembly accession scraping
Fixed
[0.11.0] - 2021-11-18
Added
- extended docstrings
- GENCODE support (GENCODE gene annotations with UCSC genomes)
  - only contains the main chromosomes, no scaffolds or alternate haplotypes.
  - only contains 4 assemblies (2 mouse, 2 human)
  - excellent annotations for these regions & species though!
- Ensembl's GRCh37 can now be downloaded through genomepy
- Local fasta/gtf/gff(3)/bed file support
  - you can install a local genome and/or annotation by providing local path(s) to genomepy install
    - if annotation downloading is requested, but no annotation path is provided, a gtf/gff(3) annotation will be sought in the genome's source directory.
- Annotation.gtf_dict creates a dictionary for any key-value pair in the GTF columns or attribute fields!
  - e.g. Annotation.gtf_dict("seqname", "gene_name")
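As an illustration of the key-value idea behind Annotation.gtf_dict, here is a rough sketch in plain Python that maps one GTF field to another. It is not genomepy's implementation: the helper names, the sample records, and the choice to return lists of unique values are all assumptions.

```python
import re

# The eight fixed GTF columns that precede the attributes column
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end", "score", "strand", "frame"]
ATTR_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"')


def gtf_fields(line):
    """Hypothetical helper: one GTF record as a flat dict of columns and attributes."""
    columns = line.rstrip("\n").split("\t")
    fields = dict(zip(GTF_COLUMNS, columns[:8]))
    fields.update(ATTR_PATTERN.findall(columns[8]))
    return fields


def build_gtf_dict(lines, key, value):
    """Map each `key` field to the list of unique `value` fields seen with it."""
    mapping = {}
    for line in lines:
        if line.startswith("#") or line.count("\t") < 8:
            continue  # skip comments and incomplete records
        fields = gtf_fields(line)
        if key in fields and value in fields:
            values = mapping.setdefault(fields[key], [])
            if fields[value] not in values:
                values.append(fields[value])
    return mapping


# Made-up example records
example = [
    'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "abc";',
    'chr1\tsrc\tgene\t200\t300\t.\t-\t.\tgene_id "G2"; gene_name "def";',
    'chr2\tsrc\tgene\t1\t90\t.\t+\t.\tgene_id "G3"; gene_name "xyz";',
]
print(build_gtf_dict(example, "seqname", "gene_name"))
```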
Changed
- Genome.track2fasta can now ignore comment lines (starting with #)
- Genome.track2fasta will skip header lines (a warning will be printed)
- Genome.track2fasta will ignore regions that cannot be parsed (a warning will be printed)
  - these fixes should improve gimme scan performance and feedback
- UCSC annotation conversion tool settings tweaked. Better results with source gff files.
- Ensembl now uses HTTP instead of FTP (in some cases). This improves stability on some servers.
- tweaked search result alignment for clarity
- explained UCSC annotations in the README
- better file path handling (relative paths, user home and variables are expanded)
- Annotation now accepts a file/directory/genomepy name as first argument.
  - this merges 2 arguments into one.
- Annotation.map_genes now works without a README file
- you can now set Annotation.tax_id manually.
Fixed
- Ensembl annotations from previous releases can now be downloaded as intended.
- Genome.track2fasta will skip regions that clearly don't make sense (start > end, or start < 0)
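The track2fasta input handling described in this release (ignore comments, skip headers and unparsable lines with a warning, drop regions with start > end or start < 0) could look roughly like the sketch below. This is plain Python on BED-style lines, not the actual Genome.track2fasta code; the function name and warning texts are made up.

```python
def parse_bed_regions(lines):
    """Return (regions, warnings) for tab-separated BED-style lines.

    Comment and blank lines are ignored silently. Lines that cannot be
    parsed (including header lines) and impossible regions are skipped,
    each with a warning message.
    """
    regions, warnings = [], []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comment or blank line
        fields = line.split("\t")
        try:
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        except (IndexError, ValueError):
            warnings.append(f"line {number}: skipping unparsable line")
            continue
        if start < 0 or start > end:
            warnings.append(f"line {number}: skipping invalid region {chrom}:{start}-{end}")
            continue
        regions.append((chrom, start, end))
    return regions, warnings


# Made-up example input
example = [
    "# comment",
    "chrom\tstart\tend",  # header line
    "chr1\t10\t20",
    "chr1\t30\t5",        # start > end
    "chr2\t-4\t8",        # start < 0
    "chr2\tfoo\tbar",     # cannot be parsed
]
regions, warnings = parse_bed_regions(example)
print(regions)
for message in warnings:
    print(message)
```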
[0.10.0] - 2021-07-30
Added
- Annotation class, containing
  - regex filter (genomepy.Annotation.filter_regex())
  - sanitize functions (genomepy.Annotation.sanitize())
    - option to skip filtering and/or matching the annotation to the genome (also on CLI)
  - gene name remapping to various formats (genomepy.Annotation.map_genes())
    - using MyGene.info. Can be queried separately (genomepy.annotation.query_mygene())
  - contig name remapping to other provider formats (genomepy.Annotation.map_locations())
  - get the annotations, or gene locations, as dataframes (genomepy.Annotation.gtf, bed or gene_coords() respectively)
  - get the gene names as a list (genomepy.Annotation.genes("gtf") or genomepy.Annotation.genes("bed"))
- genomepy install now attempts to install the NCBI assembly report
- NCBI provider also indexes the NCBI genbank_historical summary
- genomepy search now shows if the genome has an annotation
  - this slows down the results a bit
  - to compensate, results are now shown as soon as they are found
  - for UCSC, availability of any of the 4 annotations is shown
- genomepy annotation shows the first line(s) of each gene annotation.gtf
- for developers:
  - pre-commit-hooks for linting
  - formatting/linting script tests/format.sh (optional argument lint)
  - isort & autoflake formatters
Changed
- provider module split per provider
- ProviderBase overhauled, now called Provider
- regex filtering separated from Provider.download_genome
- utils module split into utils, files and online
- now using loguru for pretty logging
- accession search improved
  - now finds GCA and GCF accessions
  - now ignores patch levels
- genomepy install automatic provider selection refactored
  - Provider.online_providers returns a generator (faster!)
- genomepy install uses a combined filter function (faster!)
- genomepy install only zips annotation files if the genome is zipped (with the bgzip flag) (faster!)
- NCBI provider should be parsed faster (faster!)
- new dependency: pandas
- tests no longer format code
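One way to read the accession search change above (finding both GCA and GCF accessions while ignoring patch levels) is to compare accessions on their numeric core only. The sketch below assumes that "patch level" means the version suffix after the dot; that interpretation, and both function names, are assumptions rather than genomepy's code.

```python
def accession_core(accession):
    """Strip the GCA_/GCF_ prefix and the version suffix from an accession.

    Assumption: the "patch level" is the part after the dot.
    """
    core = accession.strip().upper()
    if core.startswith(("GCA_", "GCF_")):
        core = core[4:]
    return core.split(".")[0]


def same_assembly(first, second):
    """True if two accessions share the same numeric core."""
    return accession_core(first) == accession_core(second)


print(accession_core("GCA_000001405.15"))
print(same_assembly("GCA_000001405.15", "GCF_000001405.26"))
```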
Fixed
- broken URLs should keep genomepy occupied for a shorter time (check_url will immediately return on "Not Found" errors 404/450) (faster!)
- the Genome class now passes arguments to the parent Fasta class
- the Genome class now regenerates the sizes and gaps files similarly to the Fasta class and its index (when the genome is younger) (faster!)
- somewhat more pythonic tests
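The check_url fix above boils down to not retrying when the server already said the file is not there. A minimal sketch of that idea follows. It is not genomepy's check_url: the function name, its parameters, and the injected fetch_status callable (which stands in for a real network request) are hypothetical.

```python
import time

# Status codes treated as "Not Found", as listed in the changelog entry
NOT_FOUND = {404, 450}


def url_is_reachable(fetch_status, url, max_tries=3, wait=0.0):
    """Retry a URL check, but give up immediately on "Not Found".

    fetch_status is a hypothetical callable taking a URL and returning
    a numeric status code.
    """
    for attempt in range(max_tries):
        status = fetch_status(url)
        if status in NOT_FOUND:
            return False  # retrying will not make the file appear
        if 200 <= status < 400:
            return True
        if attempt < max_tries - 1:
            time.sleep(wait)  # pause before the next attempt
    return False


# Usage with a fake fetcher that records how often it is called
calls = []


def always_missing(url):
    calls.append(url)
    return 404


print(url_is_reachable(always_missing, "https://example.org/genome.fa.gz"))
print(len(calls))  # a single attempt, no retries
```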