output fastq file(without gzip) when process a fastq.gz file #18

Closed
wangyugui opened this issue Mar 27, 2016 · 13 comments

Comments

@wangyugui

Can NxTrim write uncompressed fastq output (without gzip) when the input is a fastq.gz file?

If the downstream process can read an uncompressed fastq file, it will run faster.

@jaredo
Contributor

jaredo commented Mar 27, 2016

Sorry, I don't think I will implement this. It is not a common use case.

If you have a program that does not read gzipped fastq, you can always create a named pipe with mkfifo and decompress into it:

mkfifo r1
zcat example/MP_R1.fastq.gz > r1 &
cat r1
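A self-contained sketch of the same named-pipe idea, for anyone who wants to try it without the example data. The file names (`toy.fastq`, `reads.fifo`) are made up for illustration, and `gzip -dc` is used in place of `zcat` only because it behaves the same way on more systems:

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Build a tiny two-read FASTQ file (hypothetical data) and compress it.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > toy.fastq
gzip -c toy.fastq > toy.fastq.gz

# Create the named pipe and decompress into it in the background.
# The writer blocks until some program opens the pipe for reading.
mkfifo reads.fifo
gzip -dc toy.fastq.gz > reads.fifo &

# Any program that expects a plain FASTQ path can read the pipe (once).
lines=$(wc -l < reads.fifo | tr -d ' ')
wait
echo "lines read through the pipe: $lines"
rm -f reads.fifo
```

Note that a pipe can only be read once and cannot be seeked, so this only suits programs that stream through their input a single time.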

@jaredo jaredo closed this as completed Mar 27, 2016
@wangyugui
Author

The programs I am using, such as trimmomatic and bwa, do support gzipped fastq, but reading gzipped fastq is slower than reading plain fastq because gunzip is not multithreaded.

A fastq.gz file saves disk space, but the NxTrim output is not the final result file, so we can delete it after it has been used.

Also, it is not the final fastq result, is it, since NxTrim lacks a function to remove low-quality bases?

@jaredo
Contributor

jaredo commented Mar 28, 2016

I am not convinced of the need to trim low quality bases. Aligners can split reads and modern assemblers use error correction as a pre-processing step. It is not clear a trimming heuristic does a better job than these sophisticated algorithms. I get very nice assemblies directly from the nxtrim output. See here:

https://github.com/sequencing/NxTrim/wiki/Bacterial-assembles-using-Nextera-Mate-pairs

As for performance, decompression is not a bottleneck for any serious bioinformatics task.

When aligning with bwa, I see negligible differences in compute time for uncompressed versus gzipped fastq:

bwa mem EcMG.fna -p EcMG1.mp.fastq.gz > /dev/null 
#40.838 seconds

zcat EcMG1.mp.fastq.gz > tmp.fastq
bwa mem EcMG.fna -p tmp.fastq > /dev/null 
#41.147 seconds

@sklages

sklages commented Mar 29, 2016

gzip will become a bottleneck with very fast I/O and very large files. And it's not always the case that you directly run bwa after trimming ...

So I'd go for (optional) uncompressed output as well :-)

@jaredo
Contributor

jaredo commented Mar 29, 2016

Yes, if you can afford to store uncompressed fastq on your SSD then this might save you some time. On the other hand, on my system, it is actually slightly slower to pull uncompressed fastq from a network disk (probably because it is i/o bound and you have to read more data).

I am not convinced, but I do take pull requests ;)

@jaredo jaredo reopened this Mar 29, 2016
@sklages

sklages commented Mar 29, 2016

At least for the MP fraction we could use --stdout to prevent output compression. But this contains both mp and unknown libraries, correct?

@jaredo
Contributor

jaredo commented Mar 29, 2016

That is correct.

I use --stdout to pipe the output to bwa. The aligner will then flag whether the reads were FR/RF, so I can tell whether reads were true mate-pairs or not, i.e. you don't need to rely on the presence of the Nextera adapter to tell if a read is a true MP when performing alignment.

@sklages

sklages commented Mar 29, 2016

ok, makes sense (for direct alignment).
I usually do some de novo assembly of my data (1-2Gbp genome size) and I did see differences in scaffolding when using mp with or without the unknown data.

Maybe you could separate both by writing mp to stdout and unknown to stderr?

@jaredo
Contributor

jaredo commented Mar 29, 2016

Yes, for scaffolding you probably only want to use mp (and hence the --justmp flag). Unfortunately your proposed solution is about as complicated as implementing plain text output.

I would really like to see a realistic use case for unzipped fastq (with actual timings) before I consider implementing it. It would have to be at least twice as fast as using the gzipped input.

@sklages

sklages commented Mar 29, 2016

I am not sure if --justmp only dumps mp or additionally unknown. Your suggestion implies an mp-only output, but the nxtrim short help tells me "--justmp - just creates a the mp/unknown libraries" .. so I am a bit confused ;-)

Concerning the speed issue ... I played around a bit. Not really a full-blown benchmark, but just enough to get an idea (if I am not completely wrong):

-rw-r--r-- 1 klages klages 85G 2016.03.29 16:49:27 athCun.raw.il.fq
-rw-r--r-- 1 klages klages 32G 2016.03.29 16:33:37 athCun.raw.il.fq.gz

I wrote a little perl script that simply opens the fastq files (the .gz one via open(my $fh1, "-|", "gzip -dc athCun.raw.il.fq.gz")), iterates over each line, and counts the lines.

The uncompressed file is read at roughly 260MiB/sec (as seen in htop), which takes ~330sec. The compressed file is read at about 37-44MiB/sec, which takes ~800sec.

I also used another tool, just for checking the reading rates, https://github.com/ADAC-UoN/fqcounter

This tool reads both fastq files at about the same rate as the simple perl script.

This has been tested on a local filesystem (xfs) of my workstation (HP Z800).
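The same kind of line-count comparison can be sketched in plain shell. This is not the perl script behind the numbers above, just a rough equivalent on a generated toy file; the file names and read count are invented, and `date +%s` only gives whole seconds, so the timings become meaningful only on multi-gigabyte inputs:

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Generate a toy FASTQ file (hypothetical name) with 20000 four-line records.
awk 'BEGIN { for (i = 1; i <= 20000; i++) printf "@read%d\nACGTACGTACGT\n+\nIIIIIIIIIIII\n", i }' > toy.fq
gzip -c toy.fq > toy.fq.gz

# Count lines straight from the uncompressed file.
start=$(date +%s)
plain_lines=$(wc -l < toy.fq | tr -d ' ')
end=$(date +%s)
echo "plain:   $plain_lines lines in $((end - start)) s"

# Count lines through a gzip -dc pipe, as the perl open(..., "-|", ...) does.
start=$(date +%s)
gz_lines=$(gzip -dc toy.fq.gz | wc -l | tr -d ' ')
end=$(date +%s)
echo "gzipped: $gz_lines lines in $((end - start)) s"
```

On real data, swap in the actual fastq and fastq.gz paths; the gap between the two timings is then the cost of single-threaded decompression on that filesystem.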

edit: as I alter the fastq headers after nxtrim, I could perfectly live with --stdout if I had the option to decide whether the stdout stream consists of mp-only data or mp/unknown data.

@jaredo
Contributor

jaredo commented Mar 29, 2016

> I am not sure if --justmp only dumps mp or additionally unknown. Your suggestion implies a mp-only output, nxtrim short help tells me --justmp - just creates a the mp/unknown libraries .. so I am a bit confused ;-)

Sorry, this isn't clear and the behaviour should be changed. If you run with --justmp and no --stdout you will get sample.mp.fastq.gz and sample.unknown.fastq.gz, the former being the one that is useful for your scaffolding. When you add --stdout they are all mixed together. I think I should just remove the --justmp requirement for --stdout.

In your example, you are just reading the files, but my point is that if you have to process the data in some way (i.e. align it for scaffolding), the decompression won't be a bottleneck. I guess conceivably if you are piping it to another trimmer that is very fast, the compression will be a bottleneck.

> edit: as I alter the fastq-headers after nxtrim I can perfectly live with --stdout if I had the option to decide whether stdout stream consists of mp-only data or mp/unknown data.

This makes sense for a few different reasons. I will add this.

@sklages

sklages commented Mar 30, 2016

I just wanted to show that (de)compression in general may become a bottleneck with large data volumes and fast I/O. NFS mounts cannot deliver data that fast ... that's OK. Compression is slower but may be sped up with multithreading.
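As a rough illustration of the multithreading point, here is a sketch that compresses with pigz when it is installed and falls back to gzip otherwise. pigz is a separate tool, not something NxTrim uses; the thread count of 4 and the file names are arbitrary assumptions. pigz only parallelizes compression, while gzip-format decompression remains essentially single-threaded:

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Pick a multithreaded compressor if available (-p sets the thread count).
if command -v pigz >/dev/null 2>&1; then
    compressor="pigz -p 4"
else
    compressor="gzip"
fi

# One-read toy FASTQ file (hypothetical data).
printf '@r1\nACGT\n+\nIIII\n' > toy.fastq

# $compressor is left unquoted on purpose so "pigz -p 4" splits into words.
$compressor -c toy.fastq > toy.fastq.gz

# pigz writes standard gzip format, so plain gzip can read it back.
roundtrip_lines=$(gzip -dc toy.fastq.gz | wc -l | tr -d ' ')
echo "compressed with: $compressor; lines after round trip: $roundtrip_lines"
```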

@jaredo
Contributor

jaredo commented Apr 10, 2016

I have added --stdout-mp and --stdout-un in #22 which I think largely resolves this.

@jaredo jaredo closed this as completed Apr 10, 2016