Subsampling with fastp - document and test #1096

Closed
ewallace opened this issue Oct 17, 2023 · 4 comments · Fixed by #1309

Comments

@ewallace

Description of feature

When debugging a pipeline, it is very helpful to run it on only a subset of reads, so that it fails fast.

@drpatelh suggested

Another option is to run with --trimmer fastp; I think you can provide the parameter below via a custom config to the pipeline to restrict the number of reads that are processed (UNTESTED):
--reads_to_process specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0])

This has advantages:

  • requires no new steps in pipeline
  • will probably work

Disadvantages:

  • won't work if Trim Galore! is used for preprocessing
  • processes only the first --reads_to_process reads, so if the input .fastq file is sorted in some unusual way, the subset could be unrepresentative
  • needs testing and user-facing documentation.
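
To illustrate the second disadvantage, here is a minimal sketch of why taking reads from the top of a sorted file can mislead. The read lengths are made up for illustration, and `first_n` is a hypothetical helper, not part of fastp or the pipeline; it only mirrors the quoted convention that 0 means "process everything".

```python
from statistics import mean


def first_n(items, n):
    """Return the first n items; n == 0 means all items (as in the quoted default)."""
    items = list(items)
    return items if n == 0 else items[:n]


# Hypothetical read lengths from an input sorted shortest-first: 100 reads, 50..149 bp.
sorted_lengths = list(range(50, 150))

subset = first_n(sorted_lengths, 10)

# The top-of-file subset only contains the shortest reads,
# so its mean length is far below that of the whole file.
print("subset mean:", mean(subset))
print("full mean:  ", mean(sorted_lengths))
```

On unsorted input the two means would usually be much closer, which is why this only matters when the file has some systematic ordering.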

I'm going to create a separate issue ticket about including an explicit subsampling step in the pipeline, because I think it's a separate issue and something that I would routinely use (and tell all my students to use).

@ewallace
Author

ewallace commented Oct 17, 2023

Also @drpatelh notes:

I exposed a parameter to append trimming options. Something like this might work without needing to use a custom config: --trimmer fastp --extra_fastp_args '--reads_to_process 10000'. Just be aware that you might need to tweak this number for SE / PE reads because the value will be different.
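
A possible way to assemble the suggested command, sketched in Python so that the nested quoting is explicit. Only `--trimmer fastp` and `--extra_fastp_args '--reads_to_process 10000'` come from the comment above; the pipeline name, profile and output directory are assumptions chosen for illustration and should be replaced with your own.

```python
import shlex

# Assumptions (not stated in this thread): pipeline name, profile and --outdir value.
argv = [
    "nextflow", "run", "nf-core/rnaseq",
    "-profile", "test,docker",
    "--outdir", "results",
    # From the suggestion above: use fastp and forward an extra option to it.
    "--trimmer", "fastp",
    "--extra_fastp_args", "--reads_to_process 10000",
]

# shlex.join quotes the last element because it contains a space,
# so fastp's option and its value travel together as one argument.
command = shlex.join(argv)
print(command)
```

Keeping the arguments as a list means the same `argv` could be handed to `subprocess.run` directly, without any shell quoting at all.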

@drpatelh
Member

We need to test whether this really works and, if so, document it.

@pinin4fjords
Member

This works for me, when applied to the test profile, and I see it suggested in various places as an approach to down-sampling. I don't think paired-end should be an issue.

Down-sampling might mean something slightly different to some people than what fastp does, which is simply to take the specified number of reads off the top of the FASTQ file(s), but I'll add a line to the docs anyway.
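
The behaviour described here can be sketched roughly as follows. This is not fastp's implementation, just an illustration of "taking reads off the top", assuming plain four-line FASTQ records; `head_fastq` and the toy reads are hypothetical.

```python
from itertools import islice


def head_fastq(lines, n_reads):
    """Yield the first n_reads FASTQ records as 4-line tuples; 0 means all records."""
    it = iter(lines)
    taken = 0
    while n_reads == 0 or taken < n_reads:
        record = tuple(islice(it, 4))
        if len(record) < 4:  # end of input (or a truncated final record)
            break
        yield record
        taken += 1


# Toy paired-end input: three read pairs, same order in both files.
r1 = ["@read1/1", "ACGT", "+", "IIII",
      "@read2/1", "TTGA", "+", "IIII",
      "@read3/1", "GGCC", "+", "IIII"]
r2 = ["@read1/2", "TGCA", "+", "IIII",
      "@read2/2", "CAAT", "+", "IIII",
      "@read3/2", "CCGG", "+", "IIII"]

# Taking the same number of records from the top of each file keeps mates paired.
top_r1 = list(head_fastq(r1, 2))
top_r2 = list(head_fastq(r2, 2))
for left, right in zip(top_r1, top_r2):
    print(left[0], right[0])
```

Because both files are cut at the same record count, a limit of N on paired-end data corresponds to N pairs, that is 2N individual reads, which may be what the earlier note about SE / PE values refers to.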

pinin4fjords linked a pull request on May 30, 2024 that will close this issue
@drpatelh
Member

Fixed in #1309
