From ecd326170cd1ef6c52447a8b4a7038385d8288ab Mon Sep 17 00:00:00 2001
From: Leandro Ishi
Date: Thu, 17 Aug 2023 20:10:11 +0100
Subject: [PATCH 1/9] Updating Changelog

---
 CHANGELOG.md | 46 +++++++++++++++++++++++++++++++---------------
 1 file changed, 31 insertions(+), 15 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 6b2280dc..85a977c9 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,10 +7,12 @@ this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
 
 ## [Unreleased]
 
-## [0.10.0-alpha.1]
+## [0.11.0-alpha.0]
 
-### Changed
+This version is a major release that breaks backwards compatibility with previous versions of `pandora`.
+It improves `pandora` runtime performance by 15x and RAM usage by 20x;
 
+### Changed
 - The `pandora` index changed from a set of files in a directory structure to a single, compressible and indexable `zip`
 file (`pandora` indexes now have the suffix `.panidx.zip`). This is now the single file that is produced by the
 `pandora index` command and is required as argument to all the other `pandora` commands. This index is self contained in
@@ -18,26 +20,40 @@ the sense that it encodes all the information and metadata about it (e.g. which
 kmer size, etc). This new index provide the infrastructure for the next features and simplifies working with large
 reference pangenome collections, with a few million PRGs. This new index breaks backwards compatibility with previous
 `pandora` versions. The structure of this zip archive is as follows:
- * `_prgs`: The PRGs themselves used as input to create this index;
- * `_prg_names`: The names of the PRGs;
- * `_prg_min_path_lengths`: the length of the shortest path through each PRG;
+ * `_prg_names`: The names of the PRGs used as input to create this index;
+ * `_prg_max_path_lengths`: the length of the longest path through each PRG;
+ * `_prg_lengths`: the length of the string representation of each PRG;
  * `_minhash`: the minimizer hash data structure;
- * `_metadata`: metadata about the index (first line is window size, second is kmer size);
+ * `_metadata`: metadata about the index;
  * `*.gfa`: the several GFA files describing the minimizing kmer graph for each PRG;
+ * `*.fa`: the string representation of each PRG;
 - Minimum C++ standard upgraded from `C++11` to `C++14`;
-- We now test whether the genotype confidence of a variant is greater than or equal to the threshold provided by `--gt-conf`. Previously we only tested if it was greater than. [[#320][320]]
+- We now test whether the genotype confidence of a variant is greater than or equal to the threshold provided by
+`--gt-conf`. Previously we only tested if it was greater than;
 
 ### Removed
-- Removed CLI parameters `-w` and `-k` from the following `pandora` subcommands: `compare`, `discover`, `map`,
+- Removed CLI parameters `-w`, `-k` and `--clean` from the following `pandora` subcommands: `compare`, `discover`, `map`,
 `seq2path`;
 - Removed `merge_index` subcommand;
-
+- Removed gene-DBG and noise-filtering modules;
+-
 ### Fixed
-- Several refactoring to the `pandora` index implementation;
-
+- Fixed a major bug on finding the longest path through PRGs;
+- Several refactorings to the `pandora` index implementation;
+- Optimisation of the `pandora` index data structure;
+
 ### Added
-- A memory-efficient way to load PRGs when indexing, where we don't need to load all PRGs at once to index them, but
-just load on demand;
+- A memory-efficient way to load PRGs when indexing and mapping, where we don't need to load all PRGs at once to process
+them, but just load on demand (also known as lazy loading). This is particularly useful when working with very large
+PanRGs;
+- Random multimapping of reads if they map equally well to several graphs, reducing mapping bias. Added parameter
+`--rng-seed` to `pandora map/compare/discover` commands to make multimapping deterministic, if required;
+- A new parameter to deal with auto-updating error rate and kmer model (see `--auto-update-params` parameter in
+`pandora map/compare/discover` commands);
+- Three new parameters to control when a gene should be filtered out due to too low or too high coverage (see
+`--min-abs-gene-coverage`, `--min-rel-gene-coverage` and `--max-rel-gene-coverage` parameters in
+`pandora map/compare/discover` commands);
+
 
 ## [0.10.0-alpha.0]
 
@@ -170,8 +186,8 @@ their changes meticulously documented here.
 - k-mer coverage underflow bug in `LocalPRG` [[#183][183]]
 
 
-[Unreleased]: https://github.com/rmcolq/pandora/compare/0.10.0-alpha.1...HEAD
-[0.10.0-alpha.1]: https://github.com/rmcolq/pandora/compare/0.10.0-alpha.1...0.10.0-alpha.0
+[Unreleased]: https://github.com/rmcolq/pandora/compare/0.11.0-alpha.0...HEAD
+[0.11.0-alpha.0]: https://github.com/rmcolq/pandora/compare/0.11.0-alpha.0...0.10.0-alpha.0
 [0.10.0-alpha.0]: https://github.com/rmcolq/pandora/compare/0.10.0-alpha.0...0.9.2
 [0.9.2]: https://github.com/rmcolq/pandora/compare/0.9.2...0.9.1
 [0.9.1]: https://github.com/rmcolq/pandora/releases/tag/0.9.1

From b4e4e68e26348e096a775f177b2d0be030a482ab Mon Sep 17 00:00:00 2001
From: Leandro Ishi
Date: Thu, 17 Aug 2023 20:19:14 +0100
Subject: [PATCH 2/9] Updating version in CMakeLists.txt

---
 CMakeLists.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 110b9a4a..05dc0118 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -12,7 +12,7 @@ HunterGate(
 
 # project configuration
 set(PROJECT_NAME_STR pandora)
-project(${PROJECT_NAME_STR} VERSION "0.10.0.1" LANGUAGES C CXX)
+project(${PROJECT_NAME_STR} VERSION "0.11.0" LANGUAGES C CXX)
 set(ADDITIONAL_VERSION_LABELS "")
 configure_file( include/version.h.in ${CMAKE_BINARY_DIR}/include/version.h )
 

From e5179b56fc665ad4521d1c825b2715537748eb6d Mon Sep 17 00:00:00 2001
From: Leandro Ishi
Date: Thu, 17 Aug 2023 20:19:45 +0100
Subject: [PATCH 3/9] Updating version in README.md

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index fa23236b..4a2953b9 100644
--- a/README.md
+++ b/README.md
@@ -76,13 +76,13 @@ In this binary, all libraries are linked statically.
 
 * **Download**:
 ```
- wget https://github.com/rmcolq/pandora/releases/download/0.10.0-alpha.1/pandora-linux-precompiled-v0.10.0-alpha.1
+ wget https://github.com/rmcolq/pandora/releases/download/0.11.0-alpha.0/pandora-linux-precompiled-v0.11.0-alpha.0
 ```
 
 * **Running**:
 ```
-chmod +x pandora-linux-precompiled-v0.10.0-alpha.1
-./pandora-linux-precompiled-v0.10.0-alpha.1 -h
+chmod +x pandora-linux-precompiled-v0.11.0-alpha.0
+./pandora-linux-precompiled-v0.11.0-alpha.0 -h
 ```
 
 * **Notes**:

From 24d7ad442d918d59e2169e139ae192ece9701ca8 Mon Sep 17 00:00:00 2001
From: Leandro Ishi
Date: Thu, 17 Aug 2023 20:21:05 +0100
Subject: [PATCH 4/9] Updating example/run_pandora.sh

---
 example/run_pandora.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/example/run_pandora.sh b/example/run_pandora.sh
index aca23398..40aa7e56 100755
--- a/example/run_pandora.sh
+++ b/example/run_pandora.sh
@@ -3,9 +3,9 @@ set -eu
 
 ########################################################################################################################
 # configs
-pandora_version="0.10.0-alpha.1"
+pandora_version="0.11.0-alpha.0"
 pandora_URL="https://github.com/rmcolq/pandora/releases/download/${pandora_version}/pandora_${pandora_version}"
-make_prg_version="0.4.0"
+make_prg_version="0.5.0"
 make_prg_URL="https://github.com/iqbal-lab-org/make_prg/releases/download/${make_prg_version}/make_prg_${make_prg_version}"
 ########################################################################################################################
 

From cf1206c08e0a6d3c2ec51d26b5c7998bbed38fb8 Mon Sep 17 00:00:00 2001
From: Leandro Ishi
Date: Thu, 17 Aug 2023 20:43:52 +0100
Subject: [PATCH 5/9] Updating Changelog

---
 CHANGELOG.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 85a977c9..e0896c1e 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -36,7 +36,7 @@ reference pangenome collections, with a few million PRGs. This new index breaks
 `seq2path`;
 - Removed `merge_index` subcommand;
 - Removed gene-DBG and noise-filtering modules;
--
+
 ### Fixed
 - Fixed a major bug on finding the longest path through PRGs;
 - Several refactorings to the `pandora` index implementation;

From b5bfb3fe32b2f06fa95f935ed492b228cd6f7bdb Mon Sep 17 00:00:00 2001
From: Leandro Ishi
Date: Thu, 17 Aug 2023 21:48:05 +0100
Subject: [PATCH 6/9] Small update to changelog

---
 CHANGELOG.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index e0896c1e..67e47afd 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -48,7 +48,7 @@ them, but just load on demand (also known as lazy loading). This is particularly
 PanRGs;
 - Random multimapping of reads if they map equally well to several graphs, reducing mapping bias. Added parameter
 `--rng-seed` to `pandora map/compare/discover` commands to make multimapping deterministic, if required;
-- A new parameter to deal with auto-updating error rate and kmer model (see `--auto-update-params` parameter in
+- A new parameter to deal with auto-updating error rate and kmer model (see `--dont-auto-update-params` parameter in
 `pandora map/compare/discover` commands);
 - Three new parameters to control when a gene should be filtered out due to too low or too high coverage (see
 `--min-abs-gene-coverage`, `--min-rel-gene-coverage` and `--max-rel-gene-coverage` parameters in

From f69f19f7018b4d762ad23891d0842e05c2b33c61 Mon Sep 17 00:00:00 2001
From: Leandro Ishi
Date: Thu, 17 Aug 2023 21:54:42 +0100
Subject: [PATCH 7/9] Updating pandora URL in the example

---
 example/run_pandora.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/example/run_pandora.sh b/example/run_pandora.sh
index 40aa7e56..c8c1b9c3 100755
--- a/example/run_pandora.sh
+++ b/example/run_pandora.sh
@@ -4,7 +4,7 @@ set -eu
 ########################################################################################################################
 # configs
 pandora_version="0.11.0-alpha.0"
-pandora_URL="https://github.com/rmcolq/pandora/releases/download/${pandora_version}/pandora_${pandora_version}"
+pandora_URL="https://github.com/rmcolq/pandora/releases/download/${pandora_version}/pandora-linux-precompiled-v${pandora_version}"
 make_prg_version="0.5.0"
 make_prg_URL="https://github.com/iqbal-lab-org/make_prg/releases/download/${make_prg_version}/make_prg_${make_prg_version}"
 ########################################################################################################################

From 676c7b09962cc76eab26edf280e9631bb6a1138d Mon Sep 17 00:00:00 2001
From: Leandro Ishi
Date: Thu, 17 Aug 2023 21:57:18 +0100
Subject: [PATCH 8/9] Updating example/run_pandora.sh

---
 example/run_pandora.sh | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/example/run_pandora.sh b/example/run_pandora.sh
index c8c1b9c3..ed9ec09e 100755
--- a/example/run_pandora.sh
+++ b/example/run_pandora.sh
@@ -55,20 +55,20 @@ ${make_prg_executable} from_msa --threads 1 --input msas/ --output-prefix out/pr
 echo "Running ${pandora_executable} index"
 "${pandora_executable}" index --threads 1 out/prgs/pangenome.prg.fa
 echo "Running ${pandora_executable} map"
-"${pandora_executable}" map --threads 1 --genotype -o out/map_toy_sample_1 out/prgs/pangenome.prg.fa.panidx.zip reads/toy_sample_1/toy_sample_1.100x.random.illumina.fastq
+"${pandora_executable}" map --threads 1 --genotype -o out/map_toy_sample_1 --min-abs-gene-coverage 0 --min-rel-gene-coverage 0 --max-rel-gene-coverage 1000 out/prgs/pangenome.prg.fa.panidx.zip reads/toy_sample_1/toy_sample_1.100x.random.illumina.fastq
 echo "Running ${pandora_executable} compare"
-"${pandora_executable}" compare --threads 1 --genotype -o out/output_toy_example_no_denovo out/prgs/pangenome.prg.fa.panidx.zip reads/read_index.tsv
+"${pandora_executable}" compare --threads 1 --genotype -o out/output_toy_example_no_denovo --min-abs-gene-coverage 0 --min-rel-gene-coverage 0 --max-rel-gene-coverage 1000 out/prgs/pangenome.prg.fa.panidx.zip reads/read_index.tsv
 echo "Running pandora without denovo - done!"
 
 echo "Running pandora with denovo..."
 echo "Running ${pandora_executable} discover"
-"${pandora_executable}" discover --threads 1 --outdir out/pandora_discover_out out/prgs/pangenome.prg.fa.panidx.zip reads/read_index.tsv
+"${pandora_executable}" discover --threads 1 --outdir out/pandora_discover_out --min-abs-gene-coverage 0 --min-rel-gene-coverage 0 --max-rel-gene-coverage 1000 out/prgs/pangenome.prg.fa.panidx.zip reads/read_index.tsv
 echo "Running ${make_prg_executable} update"
 ${make_prg_executable} update --threads 1 --update-DS out/prgs/pangenome.update_DS.zip --denovo-paths out/pandora_discover_out/denovo_paths.txt --output-prefix out/updated_prgs/pangenome_updated
 echo "Running ${pandora_executable} index on updated PRGs"
 "${pandora_executable}" index --threads 1 out/updated_prgs/pangenome_updated.prg.fa
 echo "Running ${pandora_executable} compare"
-"${pandora_executable}" compare --threads 1 --genotype -o out/output_toy_example_with_denovo out/updated_prgs/pangenome_updated.prg.fa.panidx.zip reads/read_index.tsv
+"${pandora_executable}" compare --threads 1 --genotype -o out/output_toy_example_with_denovo --min-abs-gene-coverage 0 --min-rel-gene-coverage 0 --max-rel-gene-coverage 1000 out/updated_prgs/pangenome_updated.prg.fa.panidx.zip reads/read_index.tsv
 echo "Running pandora with denovo - done!"
 
 # first compare non-zip files

From 487fcbbe822e153523eb21cef68d48c120728d0c Mon Sep 17 00:00:00 2001
From: Leandro Ishi
Date: Wed, 6 Sep 2023 18:12:16 +0100
Subject: [PATCH 9/9] Adding --genome-size to pandora map/discover/compare commands in example

---
 example/run_pandora.sh | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/example/run_pandora.sh b/example/run_pandora.sh
index ed9ec09e..cccca48c 100755
--- a/example/run_pandora.sh
+++ b/example/run_pandora.sh
@@ -55,20 +55,20 @@ ${make_prg_executable} from_msa --threads 1 --input msas/ --output-prefix out/pr
 echo "Running ${pandora_executable} index"
 "${pandora_executable}" index --threads 1 out/prgs/pangenome.prg.fa
 echo "Running ${pandora_executable} map"
-"${pandora_executable}" map --threads 1 --genotype -o out/map_toy_sample_1 --min-abs-gene-coverage 0 --min-rel-gene-coverage 0 --max-rel-gene-coverage 1000 out/prgs/pangenome.prg.fa.panidx.zip reads/toy_sample_1/toy_sample_1.100x.random.illumina.fastq
+"${pandora_executable}" map --threads 1 --genotype -o out/map_toy_sample_1 --genome-size 700 --min-abs-gene-coverage 0 --min-rel-gene-coverage 0 --max-rel-gene-coverage 1000 out/prgs/pangenome.prg.fa.panidx.zip reads/toy_sample_1/toy_sample_1.100x.random.illumina.fastq
 echo "Running ${pandora_executable} compare"
-"${pandora_executable}" compare --threads 1 --genotype -o out/output_toy_example_no_denovo --min-abs-gene-coverage 0 --min-rel-gene-coverage 0 --max-rel-gene-coverage 1000 out/prgs/pangenome.prg.fa.panidx.zip reads/read_index.tsv
+"${pandora_executable}" compare --threads 1 --genotype -o out/output_toy_example_no_denovo --genome-size 700 --min-abs-gene-coverage 0 --min-rel-gene-coverage 0 --max-rel-gene-coverage 1000 out/prgs/pangenome.prg.fa.panidx.zip reads/read_index.tsv
 echo "Running pandora without denovo - done!"
 
 echo "Running pandora with denovo..."
 echo "Running ${pandora_executable} discover"
-"${pandora_executable}" discover --threads 1 --outdir out/pandora_discover_out --min-abs-gene-coverage 0 --min-rel-gene-coverage 0 --max-rel-gene-coverage 1000 out/prgs/pangenome.prg.fa.panidx.zip reads/read_index.tsv
+"${pandora_executable}" discover --threads 1 --outdir out/pandora_discover_out --genome-size 700 --min-abs-gene-coverage 0 --min-rel-gene-coverage 0 --max-rel-gene-coverage 1000 out/prgs/pangenome.prg.fa.panidx.zip reads/read_index.tsv
 echo "Running ${make_prg_executable} update"
 ${make_prg_executable} update --threads 1 --update-DS out/prgs/pangenome.update_DS.zip --denovo-paths out/pandora_discover_out/denovo_paths.txt --output-prefix out/updated_prgs/pangenome_updated
 echo "Running ${pandora_executable} index on updated PRGs"
 "${pandora_executable}" index --threads 1 out/updated_prgs/pangenome_updated.prg.fa
 echo "Running ${pandora_executable} compare"
-"${pandora_executable}" compare --threads 1 --genotype -o out/output_toy_example_with_denovo --min-abs-gene-coverage 0 --min-rel-gene-coverage 0 --max-rel-gene-coverage 1000 out/updated_prgs/pangenome_updated.prg.fa.panidx.zip reads/read_index.tsv
+"${pandora_executable}" compare --threads 1 --genotype -o out/output_toy_example_with_denovo --genome-size 700 --min-abs-gene-coverage 0 --min-rel-gene-coverage 0 --max-rel-gene-coverage 1000 out/updated_prgs/pangenome_updated.prg.fa.panidx.zip reads/read_index.tsv
 echo "Running pandora with denovo - done!"
 
 # first compare non-zip files