Subsampling with fastp - document and test #1096

ewallace · 2023-10-17T14:04:36Z

Description of feature

When debugging pipelines, it is very helpful to run on only a subset of reads so that any failure surfaces quickly.

Another option is to run with --trimmer fastp. I think you can then pass the fastp parameter below to the pipeline via a custom config to restrict the number of reads that are processed (UNTESTED :alert:):
--reads_to_process specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0])

This has advantages:

- requires no new steps in the pipeline
- will probably work

Disadvantages:

- won't work if Trim Galore! is used for preprocessing
- processes only the first --reads_to_process reads (or pairs), so if the input .fastq file is sorted in some unusual way the subset could be unrepresentative
- needs testing and user-facing documentation

I'm going to create a separate issue ticket about including an explicit subsampling step in the pipeline, because I think it's a separate issue and something that I would routinely use (and tell all my students to use).


ewallace · 2023-10-17T14:13:02Z

Also @drpatelh notes:

I exposed a parameter to append trimming options. Something like this might work without needing to use a custom config: --trimmer fastp --extra_fastp_args '--reads_to_process 10000'. Just be aware that you might need to tweak this number depending on whether the data are SE or PE, because fastp counts reads for SE input and read pairs for PE input.
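For anyone wanting to try this, a full invocation might look like the sketch below. Only --trimmer fastp and --extra_fastp_args come from the suggestion above; the test,docker profile and the results_subsample output directory are placeholder assumptions, so swap in your own. The snippet only prints the command, so it can be checked before launching.

```shell
# Number of reads (SE) or read pairs (PE) that fastp should process per sample.
reads_to_process=10000

# Assemble the command line. The profile and output directory are
# placeholders (assumptions), not values taken from this issue.
cmd="nextflow run nf-core/rnaseq -profile test,docker --outdir results_subsample --trimmer fastp --extra_fastp_args '--reads_to_process ${reads_to_process}'"

# Print the command for inspection; run it with: eval "$cmd"
echo "$cmd"
```

The inner single quotes keep --reads_to_process and its value together as one argument to --extra_fastp_args when the command is eventually evaluated.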

drpatelh · 2024-05-29T10:56:57Z

We need to test if this really works and document if so.

pinin4fjords · 2024-05-30T14:53:15Z

This works for me, when applied to the test profile, and I see it suggested in various places as an approach to down-sampling. I don't think paired-end should be an issue.

Down-sampling might mean something slightly different to some people than what FASTP is doing, which is just taking the specified number of reads off the top of the FASTQ file(s), but I'll add a line to the docs anyway.
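To make the "off the top" behaviour concrete, here is a rough illustration using plain shell tools rather than fastp itself. It assumes simple four-line FASTQ records and uses synthetic reads, so treat it as a sketch of the idea and not as fastp's implementation.

```shell
# Build a tiny 5-record FASTQ in a temporary file (synthetic data).
tmp=$(mktemp)
i=1
while [ "$i" -le 5 ]; do
  printf '@read%s\nACGT\n+\nIIII\n' "$i"
  i=$((i + 1))
done > "$tmp"

# Take the first n records. Each record is assumed to span exactly
# 4 lines, so the first n records are the first 4*n lines.
n=2
first_ids=$(head -n $((4 * n)) "$tmp" | awk 'NR % 4 == 1')

# Print the headers of the reads that were kept.
echo "$first_ids"
rm -f "$tmp"
```

The kept reads are always the earliest ones in the file, which is why an input sorted in some unusual way can give an unrepresentative subset, unlike random subsampling.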

drpatelh · 2024-06-19T09:07:36Z

Fixed in #1309

ewallace added the enhancement label Oct 17, 2023

ewallace mentioned this issue Oct 17, 2023

Explicit subsampling step in rnaseq pipeline #1097

Open

drpatelh added this to the 3.15.0 milestone May 13, 2024

drpatelh added the awaiting-response-developers label May 29, 2024

drpatelh added the needs-testing label May 29, 2024

pinin4fjords linked a pull request May 30, 2024 that will close this issue

Document FASTP sampling #1309

Merged

11 tasks

pinin4fjords added Ready for review and removed awaiting-response-developers labels May 30, 2024

drpatelh closed this as completed Jun 19, 2024
