Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Host read removal with Bowtie 2 #49

Merged
merged 16 commits into from
Jun 18, 2020
Merged

Conversation

skrakau
Copy link
Member

@skrakau skrakau commented May 31, 2020

Here is a suggestion for host read removal using bowtie2. For the host reference sequence either iGenomes or a user specified Fasta reference file can be used.

I also added a MultiQC section for this to display which fraction of reads maps against the host reference and is thus filtered out using a custom content file as suggested by @ewels and @drpatelh (MultiQC/MultiQC#1199), instead of using the standard Bowtie 2 MultiQC format (with its multiple different mapping categories). Moreover, I separated the MultiQC FastQC for before and after preprocessing.

If using this bowtie2 based strategy would be fine for you, I can add a test.

PR checklist

  • This comment contains a description of changes (with reason)
  • If you've fixed a bug or added code that should be tested, add tests!
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repo
  • Ensure the test suite passes (nextflow run . -profile test,docker).
  • Make sure your code lints (nf-core lint .).
  • Documentation in docs is updated
  • CHANGELOG.md is updated
  • README.md is updated

Learn more about contributing: https://github.com/nf-core/mag/tree/master/.github/CONTRIBUTING.md

@skrakau skrakau requested review from HadrienG and d4straub May 31, 2020 12:00
@d4straub
Copy link
Collaborator

d4straub commented Jun 2, 2020

Bowtie2 host removal is fine.
A test and documentation would be great.

@skrakau
Copy link
Member Author

skrakau commented Jun 2, 2020

Ok thanks, will do!

main.nf Outdated
Comment on lines 559 to 560
zcat ${reads[0]} | echo "Read pairs before removal: \$((`wc -l`/4))" >>${name}_remove_host.log
zcat ${name}_host_unmapped_1.fastq.gz | echo "Read pairs after removal: \$((`wc -l`/4))" >>${name}_remove_host.log
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just for debugging? I can't see where ${name}_remove_host.log is used elsewhere.

Might be a bit of an expensive operation if so, zcat on a big FastQ file can take quite a while. And I guess we get this info from your bowtie logs MultiQC module anyway?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, I originally copied it from the remove_phix process, but it can be removed here.

bin/multiqc_to_custom_tsv.py Outdated Show resolved Hide resolved

script:
def sensitivity = params.host_removal_verysensitive ? "--very-sensitive" : "--sensitive"
if ( !params.single_end ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for avoiding to duplicate most of the process code (sin gle end and paired end), you can define an input parameter, like at https://github.com/nf-core/mag/blob/master/main.nf#L672

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, but in this case there are multiple parts in the process code affected:

  1. -1 "${reads[0]}" -2 "${reads[1]}"
  2. --un-conc-gz ${name}_host_unmapped_%.fastq.gz \
    --al-conc-gz ${name}_host_mapped_%.fastq.gz \
  3. zcat ${name}_host_mapped_1.fastq.gz | awk '{if(NR%4==1) print substr(\$0, 2)}' > ${name}_host_mapped_1.read_ids.txt
    zcat ${name}_host_mapped_2.fastq.gz | awk '{if(NR%4==1) print substr(\$0, 2)}' > ${name}_host_mapped_2.read_ids.txt

Maybe in this case it doesn't necessarily get cleaner if solved like this?

@skrakau skrakau force-pushed the add_host_removal branch from 071ae48 to 86ee5c2 Compare June 3, 2020 15:40
@skrakau
Copy link
Member Author

skrakau commented Jun 3, 2020

OK, I now added a test using the --host_fasta option and some brief documentation.

Currently this host removal only works for short reads. If --host_fasta or --host_genome is specified in combination with --manifest, an error with a corresponding message is returned. Is that OK?

@skrakau skrakau requested a review from HadrienG June 4, 2020 12:46
@skrakau
Copy link
Member Author

skrakau commented Jun 5, 2020

In the MAG test.config file temp_dir is set to a folder within outdir. So far I didn't do that for test_host_rm.config. Is this necessary for this pipeline, and if yes why?

@skrakau
Copy link
Member Author

skrakau commented Jun 8, 2020

Hi @ewels or @apeltzer, could one of you tell me by any chance what the purpose of the line temp_dir = "./tests/tmp_dir" in the test.config files is (it's also used in the ampliseq pipline, but didn't find any documentation)? Still wondering if it is ok to just omit it in the new conf/test_host_rm.config file for this PR.

@ewels
Copy link
Member

ewels commented Jun 8, 2020

Link to line in question:

params.temp_dir = "./tests/tmp_dir"

Nothing to do with me I'm afraid. Looks like it was added by @HadrienG

Phil

@ewels
Copy link
Member

ewels commented Jun 8, 2020

(but I agree, I can't see anything obvious that it is doing, and I suspect that it can be removed from both config files)

@skrakau
Copy link
Member Author

skrakau commented Jun 8, 2020

OK, thanks @ewels !

@d4straub
Copy link
Collaborator

d4straub commented Jun 9, 2020

Currently this host removal only works for short reads. If --host_fasta or --host_genome is specified in combination with --manifest, an error with a corresponding message is returned. Is that OK?

I think that's fine. Using proper settings for ONT qc (e.g. --longreads_length_weight 1) will remove those that have no matches with Illumina reads, so ONT reads can be filtered that way as well. And SPAdes anyway is using ONT reads only for resolving uncertainties in the assembly graph including gap closure and therefore ONT host reads are relatively unlikely to mess up the assembly, I think.

@skrakau
Copy link
Member Author

skrakau commented Jun 9, 2020

Ok, will change it then so that the already filtered short reads will be used to filter the long reads.

@skrakau skrakau force-pushed the add_host_removal branch from 560b7c3 to e3db180 Compare June 10, 2020 19:02
@skrakau
Copy link
Member Author

skrakau commented Jun 11, 2020

Hi @d4straub,

thanks for your input. The following points were added/changed now:

  • filtlong uses now the already filtered short reads as a reference (to indirectly filter out long host reads), for this I changed the channel files_all_raw to files_long_rawand joined this with the filtered short reads
  • accordingly I also changed the order of the processes in the code just for the sake of clarity: first short reads preprocessing, then long reads preprocessing
  • if host removal is turned on in combination with long reads, warnings are given out to inform the user that long reads are only filtered indirectly, and informing about the effect of --longreads_length_weight (if > 1)
  • added some information about this to the docs

I tested this locally for a very basic example, and with --longreads_length_weight 1 long reads mapping to host sequences were discarded as expected, reads mapping to the 'minigut' sequences not.

  • added missing parameters to the summary

Best,
Sabrina

@d4straub
Copy link
Collaborator

d4straub commented Jun 15, 2020

If I am not mistaken there are 3 tests now:

  • Illumina data
  • Illumina data + host removal
  • Illumina + ONT data

wouldn't it be good to test also

  • Illumina + ONT data + host removal

My reasoning is that there are now some channels that are not tested otherwise. And further changes breaking processes that use these channels might go undetected when the test is missing. What do you think?

edit: layout

@skrakau
Copy link
Member Author

skrakau commented Jun 16, 2020

Ok, I added a corresponding test, but it is more for channel testing, as I didn't add host reads to the long read dataset.

Copy link
Collaborator

@d4straub d4straub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks! Looks good!

@skrakau
Copy link
Member Author

skrakau commented Jun 18, 2020

I made the saving of the host read ids optional and added some sorting, since the order of the Bowtie 2 results is not reproducible.

Copy link
Collaborator

@d4straub d4straub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@d4straub d4straub dismissed HadrienG’s stale review June 18, 2020 14:06

Requested changes were either addressed or reasonably dismissed

@skrakau skrakau merged commit 22464a2 into nf-core:dev Jun 18, 2020
@skrakau skrakau deleted the add_host_removal branch May 31, 2021 13:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants