0.9.2 #1 #9

Merged
merged 12 commits into from
May 14, 2024
2 changes: 0 additions & 2 deletions .github/workflows/container.yml
@@ -3,8 +3,6 @@ name: Build and Push Docker Image
on:
push:
branches: [ "main" ]
pull_request:
branches: [ "main" ]

jobs:
build-and-push:
22 changes: 15 additions & 7 deletions README.md
@@ -2,11 +2,11 @@
<picture>
<source media="(prefers-color-scheme: dark)" srcset="submg/img/logo_dark.png">
<source media="(prefers-color-scheme: light)" srcset="submg/img/logo_light.png">
<img align="left" alt="submg Logo" src="submg/img/logo_light.png" width=350>
<img align="left" alt="submg Logo" src="submg/img/logo_light.png" width=400>
</picture>


submg aids in the submission of metagenomic study data to the European Nucleotide Archive. It can be used to submit various combinations of samples, reads, (co-)assemblies, bins and MAGs. After you enter your (meta)data in a configuration form, submg derives additional information where required, creates samplesheets and manifests and uploads everything to your ENA account. You can use a combination of manual and submg steps to submit your data (e.g. submitting samples and reads through the ENA web interface, then using the tool to submit the assembly and bins).
subMG aids in the submission of metagenomic study data to the European Nucleotide Archive. It can be used to submit various combinations of samples, reads, (co-)assemblies, bins and MAGs. After you enter your (meta)data in a configuration form, subMG derives additional information where required, creates samplesheets and manifests and uploads everything to your ENA account. You can use a combination of manual and subMG steps to submit your data (e.g. submitting samples and reads through the ENA web interface, then using the tool to submit the assembly and bins).



@@ -54,10 +54,9 @@ Please Note

# Installation

## Container
A container based on the main branch is available [through DockerHub](https://hub.docker.com/r/ttubb/submg): `docker pull ttubb/submg`

## Local Installation
If you want to install the tool locally, follow these steps:
- Make sure Python 3.8 or higher is installed
- Make sure Java 1.8 or higher is installed
- Make sure [wheel](https://pypi.org/project/wheel/) is installed
@@ -130,13 +129,21 @@ Using the table below, MAG `m1` will be submitted as a medium quality contig ass
A submission can take several hours to complete. We recommend using [nohup](https://en.wikipedia.org/wiki/Nohup), [tmux](https://github.com/tmux/tmux/wiki) or something similar to prevent the process from being interrupted.

# Taxonomy Assignment
Assemblies and bins need a valid NCBI taxonomy (scientific name and taxonomic identifier) for submission. If you did taxonomic annotation of bins based on [GTDB](https://gtdb.ecogenomic.org/), you can use the `gtdb_to_ncbi_majority_vote.py` script of the [GTDB-Toolkit](https://github.com/Ecogenomics/GTDBTk) to translate your results to NCBI taxonomy.
Assemblies and bins need a valid NCBI taxonomy (scientific name and taxonomic identifier) for submission. While in most cases the assignment works automatically, it is important to note that [environmental organism-level taxonomy](https://ena-docs.readthedocs.io/en/latest/faq/taxonomy.html#environmental-organism-level-taxonomy) has to be used for metagenome submissions. For example: Consider a bin that was classified only on the class level and was determined to belong to class `Clostridia`. The taxonomy id of the class `Clostridia` is `186801`. However, the correct environmental organism-level taxonomy for the bin is `uncultured Clostridia bacterium` with the taxid `244328`.

## GTDB-Toolkit Taxonomy
If you did taxonomic annotation of bins based on [GTDB](https://gtdb.ecogenomic.org/), you can use the `gtdb_to_ncbi_majority_vote.py` script of the [GTDB-Toolkit](https://github.com/Ecogenomics/GTDBTk) to translate your results to NCBI taxonomy.

## NCBI-Taxonomy
You can provide tables with NCBI taxonomy information for each bin (see `./tests/bacteria_taxonomy.tsv` for an example - the output of `gtdb_to_ncbi_majority_vote.py` has the correct format already). submg will use ENA's [suggest-for-submission endpoint](https://ena-docs.readthedocs.io/en/latest/retrieval/programmatic-access/taxon-api.html) to derive taxids that follow the [rules for bin taxonomy](https://ena-docs.readthedocs.io/en/latest/faq/taxonomy.html).

## Manually Specified Taxonomy
Either in addition to those files, or as an alternative you can provide a `MANUAL_TAXONOMY` table. This should specify the correct taxids and scientific names for bins. An example of such a document can be found in `./examples/data/taxonomy/manual_taxonomy_3bins.tsv`. If a bin is present in this document, the taxonomic data from the NCBI taxonomy tables will be ignored.

In some cases submg will be unable to assign a valid taxonomy to a bin. The submission will be aborted and you will be informed which bins are causing problems. In such cases you have to determine the correct scientific name and taxid for the bin and specify it in the `MANUAL_TAXONOMY` field of your config file. Sometimes the reason for a failed taxonomic assignment is that no proper taxid exists yet. You can [create a taxon request](https://ena-docs.readthedocs.io/en/latest/faq/taxonomy_requests.html) in the ENA Webin Portal to register the taxon.
## Taxonomy Assignment Failure
In some cases submg will be unable to assign a valid taxonomy to a bin. The submission will be aborted and you will be informed which bins are causing problems. In such cases you have to determine the correct scientific name and taxid for the bin and specify it in the `MANUAL_TAXONOMY` field of your config file.

A possible reason for a failed taxonomic assignment is that no proper taxid exists yet. This happens more often than one might expect. You can [create a taxon request](https://ena-docs.readthedocs.io/en/latest/faq/taxonomy_requests.html) in the ENA Webin Portal to register the taxon.

## NCBI Taxonomy File
This file contains the NCBI taxonomy for bins. You can provide multiple taxonomy files covering different bins. If you created it with `gtdb_to_ncbi_majority_vote.py` of the [GTDB-Toolkit](https://github.com/Ecogenomics/GTDBTk) it will have the following, compatible format already. Alternatively, provide a .tsv file with the columns 'Bin_id' and 'NCBI_taxonomy'. The string in the 'NCBI_taxonomy' column has to adhere to the format shown below. Taxonomic ranks are separated by semicolons. On each rank, a letter indicating the rank is followed by two underscores and the classification at that rank. The ranks have to be in the order 'domain', 'phylum', 'class', 'order', 'family', 'genus', 'species'. If a classification at a certain rank is unavailable, the rank itself still needs to be present in the string (e.g. "s__").
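
As a rough illustration of the string format described above, the following Python sketch splits an 'NCBI_taxonomy' value into its ranks. It is not submg's own parser: the function name `parse_ncbi_taxonomy`, the one-letter prefixes (assumed here to be the first letter of each rank name) and the sample string are all hypothetical.

```python
# Hypothetical helper, not part of submg: splits an 'NCBI_taxonomy' string
# into a rank -> classification mapping and checks the rank order.
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]
# Assumption: each rank is marked by its first letter followed by "__".
PREFIXES = ["d", "p", "c", "o", "f", "g", "s"]


def parse_ncbi_taxonomy(taxonomy):
    """Return a dict mapping each rank to its classification (or None)."""
    fields = taxonomy.strip().split(";")
    if len(fields) != len(RANKS):
        raise ValueError(f"expected {len(RANKS)} ranks, got {len(fields)}")
    parsed = {}
    for rank, prefix, field in zip(RANKS, PREFIXES, fields):
        marker = prefix + "__"
        if not field.startswith(marker):
            raise ValueError(f"rank '{rank}' must start with '{marker}': {field!r}")
        # A rank without classification (e.g. "s__") is stored as None.
        parsed[rank] = field[len(marker):] or None
    return parsed


# Made-up example string: classified down to class level only.
example = "d__Bacteria;p__Firmicutes;c__Clostridia;o__;f__;g__;s__"
print(parse_ncbi_taxonomy(example))
```

The sketch rejects strings with missing ranks, which mirrors the requirement that every rank be present even when its classification is empty.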
@@ -157,7 +164,8 @@ ENA provides a [guideline for choosing taxonomy](https://ena-docs.readthedocs.io
If your bins are the result of dereplicating data from a single assembly you can use submg as described above. If your bins are the result of dereplicating data from multiple different assemblies, you need to split them based on which assembly they belong to. You then run submg separately for each assembly (together with the corresponding set of bins).

# Bin Contamination above 100 percent
When calculating completeness and contamination of a bin with tools like [CheckM](https://github.com/Ecogenomics/CheckM), contamination values above 100% can occur. [Usually, this is not an error](https://github.com/Ecogenomics/CheckM/issues/107). However, the ENA API will refuse to accept bins with contamination values above 100%. This issue is unrelated to submg, but to avoid partial submissions submg will refuse to work if such a bin is present in the dataset. If you have bins with contamination values above 100% you can either leave them out by removing them from your dataset or manually set the contamination value to 100% in the `BINS_QUALITY_FILE` file you provide to submg.
When calculating completeness and contamination of a bin with tools like [CheckM](https://github.com/Ecogenomics/CheckM), contamination values above 100% can occur. [Usually, this is not an error](https://github.com/Ecogenomics/CheckM/issues/107). However, the ENA API will refuse to accept bins with contamination values above 100%. submg will automatically exclude bins with contamination values above 100% from the submission.
If you _need_ to submit such (presumably low quality) bins, you need to manually set the contamination value to 100 in the 'QUALITY_FILE' you provide under the bins section.
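
The exclusion rule above, combined with the optional `MIN_COMPLETENESS` and `MAX_CONTAMINATION` settings used in the example configs, could be sketched as follows. This is an illustration under assumptions, not submg's actual code: the function `keep_bin` is hypothetical, and the quality values are simply borrowed from the example CheckM table.

```python
# Hypothetical sketch of the bin filtering rules, not submg's implementation.
def keep_bin(completeness, contamination,
             min_completeness=None, max_contamination=None):
    """Decide whether a bin would be kept for submission (values in percent)."""
    # Bins above 100% contamination are always excluded.
    if contamination > 100:
        return False
    # Optional threshold: discard bins with smaller completeness.
    if min_completeness is not None and completeness < min_completeness:
        return False
    # Optional threshold: discard bins with larger contamination.
    if max_contamination is not None and contamination > max_contamination:
        return False
    return True


# (completeness, contamination) pairs taken from the example quality table.
bins = {"bin1": (62.22, 0.10), "bin2": (94.17, 17.11), "bin3": (23.90, 1.20)}
kept = [name for name, (comp, cont) in bins.items()
        if keep_bin(comp, cont, min_completeness=50, max_contamination=10)]
print(kept)
```

Passing `None` for a threshold stands in for removing the corresponding row from the config, in which case only the 100% contamination rule applies.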

# Support
submg is being actively developed. Please use the GitHub [issue tracker](https://github.com/ttubb/submg/issues) to report problems. A [discussions page](https://github.com/ttubb/submg/discussions) is available for questions, comments and suggestions.
1 change: 0 additions & 1 deletion docker/Dockerfile
@@ -1,7 +1,6 @@
# Base Image
FROM openjdk:slim


# Set up environment
RUN apt-get update && \
apt-get upgrade -y
4 changes: 3 additions & 1 deletion examples/02_samples_reads_assembly_bins.yaml
@@ -47,7 +47,9 @@ BINS:
QUALITY_FILE: "data/checkm_quality_2bins.tsv" # tsv file containing quality values of each bin. Header must include 'Bin_id', 'Completeness', 'Contamination'. A CheckM output table will work here. >>EXAMPLE: "/mnt/data/checkm_quality.tsv"
NCBI_TAXONOMY_FILES: ["data/taxonomy/archaea_taxonomy.tsv", "data/taxonomy/bacteria_taxonomy.tsv"] # A list of files with NCBI taxonomy information about the bins. Consult the README to see how they should be structured. >>EXAMPLE: ["/mnt/data/bacteria_tax.tsv","/mnt/data/archaea_tax.tsv"]
MANUAL_TAXONOMY_FILE: # Scientific names and taxids for bins. See example file for the structure. Columns must be 'Bin_id', 'Tax_id' and 'Scientific_name'. Consult the README for more information. >>EXAMPLE: "/mnt/data/manual_tax.tsv"
BINNING_SOFTWARE: 'VAMB' # The program that was used for binning. >>EXAMPLE: "metabat2"
BINNING_SOFTWARE: 'VAMB'
MIN_COMPLETENESS: 50 # Bins with smaller completeness value will be discarded (values in percent, 0-100). Remove this row to ignore bin completeness. >>EXAMPLE: "90"
MAX_CONTAMINATION: 10 # Bins with larger contamination value will be discarded (values in percent, 0-100). Remove this row to ignore bin contamination (>100% contamination bins will still be discarded). >>EXAMPLE: "5"
ADDITIONAL_SAMPLESHEET_FIELDS: # Please add more fields from the ENA samplesheet that most closely matches your experiment
ADDITIONAL_MANIFEST_FIELDS: # You can add additional fields that will be written to the manifest
BAM_FILES:
3 changes: 2 additions & 1 deletion examples/11_bins.yaml
@@ -11,7 +11,7 @@ SEQUENCING_PLATFORMS: ["ILLUMINA"] #
PROJECT_NAME: "Project ex11 idx00" # Name of the project within which the sequencing was organized >>EXAMPLE: "AgRFex 2 Biogas Survey"
SAMPLE_ACCESSIONS: ["SAMEA113417017"] # These samples exist in ENA. Your assembly is based on them. >>EXAMPLE: ["ERS15898933","ERS15898932"]
ASSEMBLY:
ASSEMBLY_NAME: "idx00_ex11_asm" # Choose a name, even if your assembly has been uploaded already. Will only be used for naming assembly and bins/MAGs. >>EXAMPLE: "Northern Germany biogas digester metagenome"
ASSEMBLY_NAME: "idx00_ex11_asm" # Choose a name, even if your assembly has been uploaded already. Will only be used for naming assembly and bins/MAGs. >>EXAMPLE: "Northern Germany biogas digester metagenome"
EXISTING_ASSEMBLY_ANALYSIS_ACCESSION: "ERZ21942150" # The accession of the assembly analysis that all bins/MAGs originate from >>EXAMPLE: "GCA_012552665"
EXISTING_CO_ASSEMBLY_SAMPLE_ACCESSION: # The accession of the virtual sample of the co-assembly which all bins/MAGs originate from >>EXAMPLE: "ERZ21942150"
ASSEMBLY_SOFTWARE: "MEGAHIT" # Software used to generate the assembly >>EXAMPLE: "MEGAHIT"
@@ -25,6 +25,7 @@ BINS:
NCBI_TAXONOMY_FILES: "data/taxonomy/eukaryota_taxonomy.tsv" # A list of files with NCBI taxonomy information about the bins. Consult the README to see how they should be structured. >>EXAMPLE: ["/mnt/data/bacteria_tax.tsv","/mnt/data/archaea_tax.tsv"]
MANUAL_TAXONOMY_FILE: "data/taxonomy/manual_taxonomy_eukaryota.tsv" # Scientific names and taxids for bins. See example file for the structure. Columns must be 'Bin_id', 'Tax_id' and 'Scientific_name'. Consult the README for more information. >>EXAMPLE: "/mnt/data/manual_tax.tsv"
BINNING_SOFTWARE: "metabat2" # The program that was used for binning. >>EXAMPLE: "metabat2"
MAX_CONTAMINATION: 5 # Bins with larger contamination value will be discarded (values in percent, 0-100). Remove this row to ignore bin contamination (>100% contamination bins will still be discarded). >>EXAMPLE: "5"
ADDITIONAL_SAMPLESHEET_FIELDS: # You can add more fields from the ENA samplesheet that most closely matches your experiment
ADDITIONAL_MANIFEST_FIELDS: # You can add additional fields that will be written to the manifest
COVERAGE_FILE: "data/bin_coverage.tsv" # .tsv file containing the coverage values of each bin. Columns must be 'Bin_id' and 'Coverage'.
4 changes: 2 additions & 2 deletions examples/data/checkm_quality_3bins.tsv
@@ -1,4 +1,4 @@
Bin Id Marker lineage # genomes # markers # marker sets 0 1 2 3 4 5+ Completeness Contamination Strain heterogeneity
bin1 k__Bacteria (UID2570) 433 273 183 101 172 0 0 0 0 62.22 0.10 0.10
bin2 root (UID1) 5656 56 24 55 1 0 0 0 0 4.17 0.00 0.00
bin3 k__Bacteria (UID203) 5449 104 58 84 20 0 0 0 0 23.90 0.00 0.00
bin2 root (UID1) 5656 56 24 55 1 0 0 0 0 94.17 17.11 0.00
bin3 k__Bacteria (UID203) 5449 104 58 84 20 0 0 0 0 23.90 1.20 0.00
6 changes: 3 additions & 3 deletions examples/localtest.yaml
@@ -11,7 +11,7 @@ METAGENOME_TAXID: "718289" #
SEQUENCING_PLATFORMS: ["ILLUMINA"] # One of https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#platform >>EXAMPLE: ["ILLUMINA","OXFORD_NANOPORE"]
SAMPLE_ACCESSIONS: ['SAMEA113417017', 'SAMEA113417018'] # These samples exist in ENA. Your assembly is based on them. >>EXAMPLE: ["ERS15898933","ERS15898932"]
PAIRED_END_READS:
- NAME: "3rIQA_ex05_rp1" # Choose a unique name >>EXAMPLE: "Bioreactor_2_replicate_1"
- NAME: "AKQ4G_ex05_rp1" # Choose a unique name >>EXAMPLE: "Bioreactor_2_replicate_1"
SEQUENCING_INSTRUMENT: "Illumina HiSeq 1500" # One of https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#instrument >>EXAMPLE: ["Illumina HiSeq 1500", "GridION"]
LIBRARY_SOURCE: "METAGENOMIC" # One of https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#permitted-values-for-library-source >>EXAMPLE: "GENOMIC"
LIBRARY_SELECTION: "RANDOM" # One of https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#permitted-values-for-library-source >>EXAMPLE: "RANDOM"
@@ -21,7 +21,7 @@ PAIRED_END_READS:
FASTQ2_FILE: "data/reads/rev1.fastq" # Path to the fastq file with reverse reads >>EXAMPLE: "/mnt/data/reads_R2.fastq.gz"
RELATED_SAMPLE_ACCESSION: 'SAMEA113417017' # The accession of the sample that these reads originate from >>EXAMPLE: "ERS15898933"
ADDITIONAL_MANIFEST_FIELDS: # You can add additional fields that will be written to the manifest
- NAME: "3rIQA_ex05_rp2" # Choose a unique name >>EXAMPLE: "Bioreactor_2_replicate_1"
- NAME: "AKQ4G_ex05_rp2" # Choose a unique name >>EXAMPLE: "Bioreactor_2_replicate_1"
SEQUENCING_INSTRUMENT: "Illumina HiSeq 1500" # One of https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#instrument >>EXAMPLE: ["Illumina HiSeq 1500", "GridION"]
LIBRARY_SOURCE: "METAGENOMIC" # One of https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#permitted-values-for-library-source >>EXAMPLE: "GENOMIC"
LIBRARY_SELECTION: "RANDOM" # One of https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#permitted-values-for-library-source >>EXAMPLE: "RANDOM"
@@ -32,7 +32,7 @@ PAIRED_END_READS:
RELATED_SAMPLE_ACCESSION: 'SAMEA113417018' # The accession of the sample that these reads originate from >>EXAMPLE: "ERS15898933"
ADDITIONAL_MANIFEST_FIELDS: # You can add additional fields that will be written to the manifest
ASSEMBLY:
ASSEMBLY_NAME: "3rIQA_e05_coasm" # Choose a name, even if your assembly has been uploaded already. Will only be used for naming assembly and bins/MAGs. >>EXAMPLE: "SGMA project mg"
ASSEMBLY_NAME: "AKQ4G_e05_coasm" # Choose a name, even if your assembly has been uploaded already. Will only be used for naming assembly and bins/MAGs. >>EXAMPLE: "SGMA project mg"
ASSEMBLY_SOFTWARE: "MEGAHIT" # Software used to generate the assembly >>EXAMPLE: "MEGAHIT"
ISOLATION_SOURCE: "biogas plant anaerobic digester" # Describe where your sample was taken from >>EXAMPLE: "biogas plant anaerobic digester"
FASTA_FILE: "data/assembly.fasta" # Path to the fasta file >>EXAMPLE: "/mnt/data/assembly.fasta.gz"