Home
Bio::Faster is a BioRuby gem that implements a fast and simple parser for FastQ files. The new version drops support for plain FastA files to focus on the more resource-demanding FastQ parsing. It is a rewrite of the old one: the C extension has been written from scratch, and the parser now also checks for formatting problems in FastQ files. A full RSpec suite has been defined, based on the test files available in the official FastQ paper.
The Bio::Faster class is instantiated with the file name, and the each_record method is then used to parse the whole file. For each record it yields an array with the sequence header (ID and comment), the sequence itself and an array with the quality values. The default quality encoding is expected to be Sanger (Phred33).
fastq = Bio::Faster.new("sequences.fastq")
fastq.each_record do |sequence_header, sequence, quality|
  puts sequence_header, sequence, quality
end
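To make the shape of a record concrete, here is a minimal pure-Ruby sketch of a four-line FastQ reader. It is not the gem's C implementation: the method name `each_fastq_record`, the specific checks, and the choice to strip the leading `@` from the header are all assumptions made for illustration, and it assumes each sequence and quality string sits on a single line.

```ruby
require "stringio"

# Hypothetical pure-Ruby reader, for illustration only (not the gem's C code).
# Yields header, sequence and decoded qualities for each four-line record.
def each_fastq_record(io, offset = 33)
  until io.eof?
    header, sequence, separator, quality = Array.new(4) { io.gets&.chomp }
    raise ArgumentError, "truncated record" if quality.nil?
    raise ArgumentError, "header must start with '@'" unless header.start_with?("@")
    raise ArgumentError, "separator must start with '+'" unless separator.start_with?("+")
    unless sequence.length == quality.length
      raise ArgumentError, "sequence/quality length mismatch"
    end

    # Assumption: the header is passed on without its leading '@'.
    yield header[1..-1], sequence, quality.each_byte.map { |byte| byte - offset }
  end
end

data = StringIO.new("@read1 first read\nACGT\n+\nII5!\n")
each_fastq_record(data) do |header, sequence, quality|
  p [header, sequence, quality]
end
```

A sketch like this is far slower than a C extension, but it shows the kind of formatting problems a FastQ parser can detect while reading.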
If the quality encoding is Phred64 (i.e. Solexa), you need to specify it:
fastq_solexa = Bio::Faster.new("sequences.fastq", :solexa)
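The difference between the two encodings comes down to the ASCII offset subtracted from each quality character. The sketch below illustrates this in plain Ruby; `QUALITY_OFFSETS` and `decode_quality` are hypothetical names used only for this example and are not part of the gem.

```ruby
# Hypothetical helper table: ASCII offset for each encoding name.
QUALITY_OFFSETS = { sanger: 33, solexa: 64 }.freeze

# Turn a quality string into an array of integer quality values.
def decode_quality(quality_string, encoding = :sanger)
  offset = QUALITY_OFFSETS.fetch(encoding)
  quality_string.each_byte.map { |byte| byte - offset }
end

p decode_quality("II5!")           # Phred33 => [40, 40, 20, 0]
p decode_quality("hhT@", :solexa)  # Phred64 => [40, 40, 20, 0]
```

Decoding a Phred64 file with the Phred33 offset would shift every value up by 31, which is why the encoding has to be stated explicitly when it is not the default.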
The each_record method can also read directly from STDIN, which is useful when dealing with compressed FastQ files.
Just specify :stdin as the input:
Bio::Faster.new(:stdin).each_record do |seq|
...
and you can call the Ruby script with pipes in a standard Unix terminal:
zcat sequences.fastq.gz | ruby my_parser.rb
This way you can read gzipped files without any drop in parser performance.
This is a comparison of the time needed to parse a 5.4 Gb Illumina 1.8+ FastQ file.
Using BioFaster:
Bio::Faster.new("test_file.fastq").each_record {|sequence_header, sequence, quality|}
real 3m55.870s
user 3m51.767s
sys 0m4.055s
Using standard BioRuby parser:
Bio::FlatFile.open(Bio::Fastq, File.open("test_file.fastq")).each_entry {|seq|}
real 11m35.946s
user 11m26.762s
sys 0m7.764s
BioFaster is about 3X faster than the standard object-oriented BioRuby FastQ parser (roughly 3m56s versus 11m36s of wall-clock time).
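The speed-up can be checked directly from the two `real` timings reported above. The helper `to_seconds` below is a hypothetical name introduced for this small calculation.

```ruby
# Convert a `time`-style duration such as "3m55.870s" into seconds.
def to_seconds(duration)
  match = duration.match(/\A(\d+)m([\d.]+)s\z/)
  raise ArgumentError, "unexpected duration: #{duration}" unless match

  match[1].to_i * 60 + match[2].to_f
end

bio_faster = to_seconds("3m55.870s")   # Bio::Faster, real time
bioruby    = to_seconds("11m35.946s")  # Bio::FlatFile, real time

speedup = (bioruby / bio_faster).round(2)
puts speedup  # prints 2.95
```

On these numbers the wall-clock ratio works out to just under 3.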