metagenomics · ttubb · May 14, 2024 · Apr 10, 2024 · May 3, 2024 · May 6, 2024
diff --git a/.github/workflows/container.yml b/.github/workflows/container.yml
@@ -3,8 +3,6 @@ name: Build and Push Docker Image
 on:
   push:
     branches: [ "main" ]
-  pull_request:
-    branches: [ "main" ]
 
 jobs:
   build-and-push:

diff --git a/README.md b/README.md
@@ -2,11 +2,11 @@
 <picture>
   <source media="(prefers-color-scheme: dark)" srcset="submg/img/logo_dark.png">
   <source media="(prefers-color-scheme: light)" srcset="submg/img/logo_light.png">
-  <img align="left" alt="submg Logo" src="submg/img/logo_light.png" width=350>
+  <img align="left" alt="submg Logo" src="submg/img/logo_light.png" width=400>
 </picture>
 
 
-submg aids in the submission of metagenomic study data to the European Nucleotide Archive. It can be used to submit various combinations of samples, reads, (co-)assemblies, bins and MAGs. After you enter your (meta)data in a configuration form, submg derives additional information where required, creates samplesheets and manifests and uploads everything to your ENA account. You can use a combination of manual and submg steps to submit your data (e.g. submitting samples and reads through the ENA web interface, then using the tool to submit the assembly and bins).
+subMG aids in the submission of metagenomic study data to the European Nucleotide Archive. It can be used to submit various combinations of samples, reads, (co-)assemblies, bins and MAGs. After you enter your (meta)data in a configuration form, subMG derives additional information where required, creates samplesheets and manifests and uploads everything to your ENA account. You can use a combination of manual and subMG steps to submit your data (e.g. submitting samples and reads through the ENA web interface, then using the tool to submit the assembly and bins).
 
 
 
@@ -54,10 +54,9 @@ Please Note
 
 # Installation
 
-## Container
 A container based on the main branch is available [through DockerHub](https://hub.docker.com/r/ttubb/submg): `docker pull ttubb/submg`
 
-## Local Installation
+If you want to install the tool locally, follow these steps:
 - Make sure Python 3.8 or higher is installed
 - Make sure Java 1.8 or higher is installed
 - Make sure [wheel](https://pypi.org/project/wheel/) is installed
@@ -130,13 +129,21 @@ Using the table below, MAG `m1` will be submitted as a medium quality contig ass
 A submission can take several hours to complete. We recommend using [nohup](https://en.wikipedia.org/wiki/Nohup), [tmux](https://github.com/tmux/tmux/wiki) or something similar to prevent the process from being interrupted. 
 
 # Taxonomy Assignment
-Assemblies and bins need a valid NCBI taxonomy (scientific name and taxonomic identifier) for submission. If you did taxonomic annotation of bins based on [GTDB](https://gtdb.ecogenomic.org/), you can use the `gtdb_to_ncbi_majority_vote.py` script of the [GTDB-Toolkit](https://github.com/Ecogenomics/GTDBTk) to translate your results to NCBI taxonomy.
+Assemblies and bins need a valid NCBI taxonomy (scientific name and taxonomic identifier) for submission. While in most cases the assignment works automatically, it is important to note that [environmental organism-level taxonomy](https://ena-docs.readthedocs.io/en/latest/faq/taxonomy.html#environmental-organism-level-taxonomy) has to be used for metagenome submissions. For example: Consider a bin that was classified only on the class level and was determined to belong to class `Clostridia`. The taxonomy id of the class `Clostridia` is `186801`. However, the correct environmental organism-level taxonomy for the bin is `uncultured Clostridia bacterium` with the taxid `244328`.
 
+## GTDB-Toolkit Taxonomy
+If you did taxonomic annotation of bins based on [GTDB](https://gtdb.ecogenomic.org/), you can use the `gtdb_to_ncbi_majority_vote.py` script of the [GTDB-Toolkit](https://github.com/Ecogenomics/GTDBTk) to translate your results to NCBI taxonomy.
+
+## NCBI-Taxonomy
 You can provide tables with NCBI taxonomy information for each bin (see `./tests/bacteria_taxonomy.tsv` for an example - the output of `gtdb_to_ncbi_majority_vote.py` has the correct format already). submg will use ENA's [suggest-for-submission endpoint](https://ena-docs.readthedocs.io/en/latest/retrieval/programmatic-access/taxon-api.html) to derive taxids that follow the [rules for bin taxonomy](https://ena-docs.readthedocs.io/en/latest/faq/taxonomy.html).
 
+## Manually Specified Taxonomy
 Either in addition to those files, or as an alternative you can provide a `MANUAL_TAXONOMY` table. This should specify the correct taxids and scientific names for bins. An example of such a document can be found in `./examples/data/taxonomy/manual_taxonomy_3bins.tsv`. If a bin is present in this document, the taxonomic data from the NCBI taxonomy tables will be ignored.
 
-In some cases submg will be unable to assign a valid taxonomy to a bin. The submission will be aborted and you will be informed which bins are causing problems. In such cases you have to determine the correct scientific name and taxid for the bin and specify it in the `MANUAL_TAXONOMY` field of your config file. Sometimes the reason for a failed taxonomic assignment is that no proper taxid exists yet. You can [create a taxon request](https://ena-docs.readthedocs.io/en/latest/faq/taxonomy_requests.html) in the ENA Webin Portal to register the taxon.
+## Taxonomy Assignment Failure
+In some cases submg will be unable to assign a valid taxonomy to a bin. The submission will be aborted and you will be informed which bins are causing problems. In such cases you have to determine the correct scientific name and taxid for the bin and specify it in the `MANUAL_TAXONOMY` field of your config file. 
+
+A possible reason for a failed taxonomic assignment is that no proper taxid exists yet. This happens more often than one might expect. You can [create a taxon request](https://ena-docs.readthedocs.io/en/latest/faq/taxonomy_requests.html) in the ENA Webin Portal to register the taxon.
 
 ## NCBI Taxonomy File
 This file contains the NCBI taxonomy for bins. You can provide multiple taxonomy files covering different bins. If you created it with `gtdb_to_ncbi_majority_vote.py` of the [GTDB-Toolkit](https://github.com/Ecogenomics/GTDBTk) it will have the following, compatible format already. Alternatively, provide a .tsv file with the columns 'Bin_id' and 'NCBI_taxonomy'. The string in the 'NCBI_taxonomy' column has to adhere to the format shown below. Taxonomic ranks are separated by semicolons. On each rank, a letter indicating the rank is followed by two underscores and the classification at that rank. The ranks have to be in the order 'domain', 'phylum', 'class', 'order', 'family', 'genus', 'species'. If a classification at a certain rank is unavailable, the rank itself still needs to be present in the string (e.g. "s__").
@@ -157,7 +164,8 @@ ENA provides a [guideline for choosing taxonomy](https://ena-docs.readthedocs.io
 If your bins are the result of dereplicating data from a single assembly you can use submg as described above. If your bins are the result of dereplicating data from multiple different assemblies, you need to split them based on which assembly they belong to. You then run submg separately for each assembly (together with the corresponding set of bins).
 
 # Bin Contamination above 100 percent
-When calculating completeness and contamination of a bin with tools like [CheckM](https://github.com/Ecogenomics/CheckM), contamination values above 100% can occur. [Usually, this is not an error](https://github.com/Ecogenomics/CheckM/issues/107). However, the ENA API will refuse to accept bins with contamination values above 100%. This issue is unrelated to submg, but to avoid partial submissions submg will refuse to work if such a bin is present in the dataset. If you have bins with contamination values above 100% you can either leave them out by removing them from your dataset or manually set the contamination value to 100% in the `BINS_QUALITY_FILE` file you provide to submg.
+When calculating completeness and contamination of a bin with tools like [CheckM](https://github.com/Ecogenomics/CheckM), contamination values above 100% can occur. [Usually, this is not an error](https://github.com/Ecogenomics/CheckM/issues/107). However, the ENA API will refuse to accept bins with contamination values above 100%. submg will automatically exclude bins with contamination values above 100% from the submission.
+If you _need_ to submit such (presumably low quality) bins, you need to manually set the contamination value to 100 in the 'QUALITY_FILE' you provide under the bins section.
 
 # Support
 submg is being actively developed. Please use the github [issue tracker](https://github.com/ttubb/submg/issues) to report problems. A [discussions page](https://github.com/ttubb/submg/discussions) is available for questions, comments and suggestions. 
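
Note on the README changes above: the "NCBI Taxonomy File" section describes the `NCBI_taxonomy` string as seven semicolon-separated ranks, each written as a rank letter, two underscores, and the classification. The following is a minimal sketch of how such a string could be parsed and checked. It is not submg's own code: the function names are hypothetical, the rank letters (`d`, `p`, `c`, `o`, `f`, `g`, `s`) are assumed to be the first letter of each rank name, and the example string is illustrative only.

```python
from typing import Dict, Optional, Tuple

# Rank order as required by the README; the one-letter prefixes are an
# assumption (first letter of each rank name).
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]
LETTERS = ["d", "p", "c", "o", "f", "g", "s"]


def parse_ncbi_taxonomy(taxonomy: str) -> Dict[str, str]:
    """Split a taxonomy string into a rank -> name mapping (hypothetical helper).

    Unclassified ranks such as "s__" map to an empty string.
    """
    fields = [field.strip() for field in taxonomy.split(";")]
    if len(fields) != len(RANKS):
        raise ValueError(f"expected {len(RANKS)} ranks, found {len(fields)}")
    parsed = {}
    for rank, letter, field in zip(RANKS, LETTERS, fields):
        prefix = letter + "__"
        if not field.startswith(prefix):
            raise ValueError(f"rank '{rank}' should start with '{prefix}': {field!r}")
        parsed[rank] = field[len(prefix):]
    return parsed


def lowest_classified_rank(parsed: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return the most specific rank that carries a classification, if any."""
    for rank in reversed(RANKS):
        if parsed[rank]:
            return rank, parsed[rank]
    return None


# Illustrative string for a bin classified only down to class level.
example = "d__Bacteria;p__Bacillota;c__Clostridia;o__;f__;g__;s__"
print(lowest_classified_rank(parse_ncbi_taxonomy(example)))
```

For a bin like this one, the lowest classified rank is the class `Clostridia`, which is the situation the README's environmental organism-level taxonomy example refers to.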
diff --git a/docker/Dockerfile b/docker/Dockerfile
@@ -1,7 +1,6 @@
 # Base Image
 FROM openjdk:slim
 
-
 # Set up environment
 RUN apt-get update && \
     apt-get upgrade -y

diff --git a/examples/02_samples_reads_assembly_bins.yaml b/examples/02_samples_reads_assembly_bins.yaml
@@ -47,7 +47,9 @@ BINS:
   QUALITY_FILE: "data/checkm_quality_2bins.tsv"                                                       # tsv file containing quality values of each bin. Header must include 'Bin_id', 'Completeness', 'Contamination'. A CheckM output table will work here. >>EXAMPLE: "/mnt/data/checkm_quality.tsv"
   NCBI_TAXONOMY_FILES: ["data/taxonomy/archaea_taxonomy.tsv", "data/taxonomy/bacteria_taxonomy.tsv"]  # A list of files with NCBI taxonomy information about the bins. Consult the README to see how they should be structured. >>EXAMPLE: ["/mnt/data/bacteria_tax.tsv","/mnt/data/archaea_tax.tsv"]
   MANUAL_TAXONOMY_FILE:                                                                               # Scientific names and taxids for bins. See example file for the structure. Columns must be 'Bin_id', 'Tax_id' and 'Scientific_name'. Consult the README for more information. >>EXAMPLE: "/mnt/data/manual_tax.tsv"
-  BINNING_SOFTWARE: 'VAMB'                                                                            # The program that was used for binning. >>EXAMPLE: "metabat2"
+  BINNING_SOFTWARE: 'VAMB'         
+  MIN_COMPLETENESS: 50                                                        # Bins with smaller completeness value will be discarded (values in percent, 0-100). Remove this row to ignore bin completeness. >>EXAMPLE: "90"
+  MAX_CONTAMINATION: 10                                                       # Bins with larger contamination value will be discarded (values in percent, 0-100). Remove this row to ignore bin contamination (>100% contamination bins will still be discarded). >>EXAMPLE: "5"
   ADDITIONAL_SAMPLESHEET_FIELDS:                                                                      # Please add more fields from the ENA samplesheet that most closely matches your experiment
   ADDITIONAL_MANIFEST_FIELDS:                                                                         # You can add additional fields that will be written to the manifest
 BAM_FILES:

diff --git a/examples/11_bins.yaml b/examples/11_bins.yaml
@@ -11,7 +11,7 @@ SEQUENCING_PLATFORMS: ["ILLUMINA"]                                            #
 PROJECT_NAME: "Project ex11 idx00"                                            # Name of the project within which the sequencing was organized >>EXAMPLE: "AgRFex 2 Biogas Survey"
 SAMPLE_ACCESSIONS: ["SAMEA113417017"]                                         # These samples exist in ENA. Your assembly is based on them. >>EXAMPLE: ["ERS15898933","ERS15898932"]
 ASSEMBLY:                                         
-  ASSEMBLY_NAME: "idx00_ex11_asm"                                        # Choose a name, even if your assembly has been uploaded already. Will only be used for naming assembly and bins/MAGs. >>EXAMPLE: "Northern Germany biogas digester metagenome"
+  ASSEMBLY_NAME: "idx00_ex11_asm"                                             # Choose a name, even if your assembly has been uploaded already. Will only be used for naming assembly and bins/MAGs. >>EXAMPLE: "Northern Germany biogas digester metagenome"
   EXISTING_ASSEMBLY_ANALYSIS_ACCESSION: "ERZ21942150"                         # The accession of the assembly analysis that all bins/MAGs originate from >>EXAMPLE: "GCA_012552665"
   EXISTING_CO_ASSEMBLY_SAMPLE_ACCESSION:                                      # The accession of the virtual sample of the co-assembly which all bins/MAGs originate from >>EXAMPLE: "ERZ21942150"
   ASSEMBLY_SOFTWARE: "MEGAHIT"                                                # Software used to generate the assembly >>EXAMPLE: "MEGAHIT"
@@ -25,6 +25,7 @@ BINS:
   NCBI_TAXONOMY_FILES: "data/taxonomy/eukaryota_taxonomy.tsv"                 # A list of files with NCBI taxonomy information about the bins. Consult the README to see how they should be structured. >>EXAMPLE: ["/mnt/data/bacteria_tax.tsv","/mnt/data/archaea_tax.tsv"]  
   MANUAL_TAXONOMY_FILE: "data/taxonomy/manual_taxonomy_eukaryota.tsv"         # Scientific names and taxids for bins. See example file for the structure. Columns must be 'Bin_id', 'Tax_id' and 'Scientific_name'. Consult the README for more information. >>EXAMPLE: "/mnt/data/manual_tax.tsv"
   BINNING_SOFTWARE: "metabat2"                                                # The program that was used for binning. >>EXAMPLE: "metabat2"
+  MAX_CONTAMINATION: 5                                                        # Bins with larger contamination value will be discarded (values in percent, 0-100). Remove this row to ignore bin contamination (>100% contamination bins will still be discarded). >>EXAMPLE: "5"
   ADDITIONAL_SAMPLESHEET_FIELDS:                                              # You can add more fields from the ENA samplesheet that most closely matches your experiment
   ADDITIONAL_MANIFEST_FIELDS:                                                 # You can add additional fields that will be written to the manifest
   COVERAGE_FILE: "data/bin_coverage.tsv"                                      # .tsv file containing the coverage values of each bin. Columns must be 'Bin_id' and 'Coverage'.

diff --git a/examples/data/checkm_quality_3bins.tsv b/examples/data/checkm_quality_3bins.tsv
@@ -1,4 +1,4 @@
 Bin Id	Marker lineage	# genomes	# markers	# marker sets	0	1	2	3	4	5+	Completeness	Contamination	Strain heterogeneity
 bin1	k__Bacteria (UID2570)	433	273	183	101	172	0	0	0	0	62.22	0.10	0.10
-bin2	root (UID1)	5656	56	24	55	1	0	0	0	0	4.17	0.00	0.00
-bin3	k__Bacteria (UID203)	5449	104	58	84	20	0	0	0	0	23.90	0.00	0.00
+bin2	root (UID1)	5656	56	24	55	1	0	0	0	0	94.17	17.11	0.00
+bin3	k__Bacteria (UID203)	5449	104	58	84	20	0	0	0	0	23.90	1.20	0.00
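
Note on the new `MIN_COMPLETENESS` and `MAX_CONTAMINATION` keys: the sketch below shows how the thresholds described in the YAML comments could interact with the updated quality values in `checkm_quality_3bins.tsv`. It is an illustration under stated assumptions, not submg's implementation. The helper `filter_bins` is hypothetical, and the comparisons (strictly smaller completeness and strictly larger contamination are discarded) are read off the config comments.

```python
from typing import Dict, List, Optional, Tuple

# Values copied from the updated checkm_quality_3bins.tsv in this diff:
# bin id -> (completeness in percent, contamination in percent)
QUALITY: Dict[str, Tuple[float, float]] = {
    "bin1": (62.22, 0.10),
    "bin2": (94.17, 17.11),
    "bin3": (23.90, 1.20),
}


def filter_bins(
    quality: Dict[str, Tuple[float, float]],
    min_completeness: Optional[float] = None,
    max_contamination: Optional[float] = None,
) -> List[str]:
    """Return the ids of bins that pass the thresholds (hypothetical helper).

    A threshold of None mimics removing the row from the config. Bins with
    more than 100% contamination are dropped regardless of the thresholds.
    """
    kept = []
    for bin_id, (completeness, contamination) in quality.items():
        if contamination > 100:
            continue  # ENA rejects these, so they are always excluded
        if min_completeness is not None and completeness < min_completeness:
            continue
        if max_contamination is not None and contamination > max_contamination:
            continue
        kept.append(bin_id)
    return kept


# Thresholds as set in examples/02_samples_reads_assembly_bins.yaml
print(filter_bins(QUALITY, min_completeness=50, max_contamination=10))
```

With thresholds of 50 and 10, only `bin1` remains: `bin2` exceeds the contamination limit and `bin3` falls below the completeness limit, so the modified test data covers both discard paths.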
diff --git a/examples/localtest.yaml b/examples/localtest.yaml
@@ -11,7 +11,7 @@ METAGENOME_TAXID: "718289"                                                    #
 SEQUENCING_PLATFORMS: ["ILLUMINA"]                                            # One of https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#platform >>EXAMPLE: ["ILLUMINA","OXFORD_NANOPORE"]
 SAMPLE_ACCESSIONS: ['SAMEA113417017', 'SAMEA113417018']                       # These samples exist in ENA. Your assembly is based on them. >>EXAMPLE: ["ERS15898933","ERS15898932"]
 PAIRED_END_READS:                                 
-- NAME: "3rIQA_ex05_rp1"                                                            # Choose a unique name >>EXAMPLE: "Bioreactor_2_replicate_1"
+- NAME: "AKQ4G_ex05_rp1"                                                            # Choose a unique name >>EXAMPLE: "Bioreactor_2_replicate_1"
   SEQUENCING_INSTRUMENT: "Illumina HiSeq 1500"                                # One of https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#instrument >>EXAMPLE: ["Illumina HiSeq 1500", "GridION"]
   LIBRARY_SOURCE: "METAGENOMIC"                                               # One of https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#permitted-values-for-library-source >>EXAMPLE: "GENOMIC"
   LIBRARY_SELECTION: "RANDOM"                                                 # One of https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#permitted-values-for-library-source >>EXAMPLE: "RANDOM"
@@ -21,7 +21,7 @@ PAIRED_END_READS:
   FASTQ2_FILE: "data/reads/rev1.fastq"                                        # Path to the fastq file with reverse reads >>EXAMPLE: "/mnt/data/reads_R2.fastq.gz"
   RELATED_SAMPLE_ACCESSION: 'SAMEA113417017'                                  # The accession of the sample that these reads originate from >>EXAMPLE: "ERS15898933"
   ADDITIONAL_MANIFEST_FIELDS:                                                 # You can add additional fields that will be written to the manifest
-- NAME: "3rIQA_ex05_rp2"                                                            # Choose a unique name >>EXAMPLE: "Bioreactor_2_replicate_1"
+- NAME: "AKQ4G_ex05_rp2"                                                            # Choose a unique name >>EXAMPLE: "Bioreactor_2_replicate_1"
   SEQUENCING_INSTRUMENT: "Illumina HiSeq 1500"                                # One of https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#instrument >>EXAMPLE: ["Illumina HiSeq 1500", "GridION"]
   LIBRARY_SOURCE: "METAGENOMIC"                                               # One of https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#permitted-values-for-library-source >>EXAMPLE: "GENOMIC"
   LIBRARY_SELECTION: "RANDOM"                                                 # One of https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#permitted-values-for-library-source >>EXAMPLE: "RANDOM"
@@ -32,7 +32,7 @@ PAIRED_END_READS:
   RELATED_SAMPLE_ACCESSION: 'SAMEA113417018'                                  # The accession of the sample that these reads originate from >>EXAMPLE: "ERS15898933"
   ADDITIONAL_MANIFEST_FIELDS:                                                 # You can add additional fields that will be written to the manifest
 ASSEMBLY:                                         
-  ASSEMBLY_NAME: "3rIQA_e05_coasm"                                       # Choose a name, even if your assembly has been uploaded already. Will only be used for naming assembly and bins/MAGs. >>EXAMPLE: "SGMA project mg"
+  ASSEMBLY_NAME: "AKQ4G_e05_coasm"                                       # Choose a name, even if your assembly has been uploaded already. Will only be used for naming assembly and bins/MAGs. >>EXAMPLE: "SGMA project mg"
   ASSEMBLY_SOFTWARE: "MEGAHIT"                                                # Software used to generate the assembly >>EXAMPLE: "MEGAHIT"
   ISOLATION_SOURCE: "biogas plant anaerobic digester"                         # Describe where your sample was taken from >>EXAMPLE: "biogas plant anaerobic digester"
   FASTA_FILE: "data/assembly.fasta"                                           # Path to the fasta file >>EXAMPLE: "/mnt/data/assembly.fasta.gz"