Home
Bio::Faster is a BioRuby plugin that implements a fast and simple parser for FastA and FastQ files. It is based on the Kseq library written in C by Heng Li (http://lh3lh3.users.sourceforge.net/parsefastq.shtml).
Bio::Faster currently implements a single method, parse, which opens a plain or compressed (gzipped) FastA/FastQ file. The method takes a block and passes it an array for each sequence in the dataset. FastA and FastQ files are accessed in exactly the same way.
Bio::Faster.parse("sequences.fastq") do |sequence_id, comment, sequence, quality|
  puts "#{sequence_id}, #{comment}, #{sequence}, #{quality}"
end
Quality values are converted on the fly from the ASCII codes present in FastQ files. The method is tested against the BioRuby Bio::Fastq methods as a reference and is fully compatible with the current standard for both Illumina 1.8+ and SFF 454 FastQ files (Sanger/Phred format). It is not compatible with older Illumina FastQ versions (i.e. Solexa format).
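To illustrate what converting quality values from ASCII means in the Sanger/Phred format, here is a minimal Ruby sketch. It assumes the usual Phred+33 convention, where the score is the character's ASCII code minus 33. The names PHRED_OFFSET, decode_phred33 and error_probability are illustrative only and are not part of Bio::Faster, which does this work inside its C extension.

```ruby
# Phred+33 (Sanger) convention: score = ASCII code of the character - 33.
# PHRED_OFFSET and both helpers are hypothetical names used for illustration.
PHRED_OFFSET = 33

# Turn a FastQ quality string into an array of integer Phred scores.
def decode_phred33(quality_string)
  quality_string.each_byte.map { |byte| byte - PHRED_OFFSET }
end

# Convert a Phred score into the estimated probability of a wrong base call.
def error_probability(phred_score)
  10.0**(-phred_score / 10.0)
end

# "!" is ASCII 33, "5" is ASCII 53 and "I" is ASCII 73.
scores = decode_phred33("!5I")
puts scores.inspect
```

Older Solexa-style files use a different offset and scale, which is why a decoder like this one would give wrong scores for them.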
Bio::Faster is very fast at reading huge FastA and FastQ files, such as those generated by Next Generation Sequencing. This is due to the internal C library (Kseq) and to a non-object-oriented approach: for each sequence the method does not generate any complex object, just a simple array. If you need to read a huge file, for example to store sequence data somewhere else (e.g. a database) or to extract quality or sequence information, Bio::Faster could be a good choice. Compared to the standard BioRuby Bio::Fastq method, Bio::Faster is 4-5X faster.
Here are some tests parsing a 1 Gb FastQ file with BioRuby Bio::Fastq and Bio::Faster on the same Linux server, with Ruby 1.9.3 and BioRuby 1.4.2:
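As a sketch of the "extract quality or sequence information" use case, the following Ruby class accumulates simple statistics one record at a time, so nothing but a few counters stays in memory. ReadStats is a hypothetical helper, not part of Bio::Faster, and it assumes that the quality value passed to the block is an array of integer Phred scores.

```ruby
# Hypothetical streaming accumulator; not part of Bio::Faster.
class ReadStats
  attr_reader :reads, :bases

  def initialize
    @reads = 0
    @bases = 0
    @quality_sum = 0
    @quality_count = 0
  end

  # sequence: String of bases; quality: assumed Array of integer Phred scores.
  def add(sequence, quality)
    @reads += 1
    @bases += sequence.length
    @quality_sum += quality.inject(0, :+)
    @quality_count += quality.length
  end

  # Mean Phred score over all bases seen so far (0.0 when no scores were added).
  def mean_quality
    @quality_count.zero? ? 0.0 : @quality_sum.to_f / @quality_count
  end
end

stats = ReadStats.new
# With the gem installed, records would come from the parser instead, e.g.:
#   Bio::Faster.parse("sequences.fastq") { |_id, _comment, seq, qual| stats.add(seq, qual) }
stats.add("ACGT", [40, 40, 30, 30])
stats.add("GG", [20, 20])
puts "#{stats.reads} reads, #{stats.bases} bases, mean quality #{stats.mean_quality}"
```

Feeding the accumulator from inside the block keeps the per-record cost low, which fits the array-based design described above.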
Bio::Fastq (standard BioRuby)
Bio::FlatFile.open(Bio::Fastq, File.open("sample.fastq")).each_entry {|seq|}
# Time
# 139.30s user 0.73s system 99% cpu 2:20.06 total
Bio::Faster
Bio::Faster.parse("sample.fastq") {|seq|}
# Time
# 34.71s user 0.59s system 99% cpu 35.320 total
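The timings above look like output from the shell's time command. A similar comparison could be sketched from within Ruby using the standard Benchmark module. The wall_time helper is a hypothetical name, and the parser calls are guarded so that they only run when both libraries are loaded and a sample.fastq file exists in the working directory.

```ruby
require "benchmark"

# Hypothetical helper: run the block once and return elapsed wall-clock seconds.
def wall_time
  Benchmark.realtime { yield }
end

# Only attempt the comparison when both parsers and the input file are available.
if defined?(Bio::Faster) && defined?(Bio::FlatFile) && File.exist?("sample.fastq")
  reference = wall_time do
    Bio::FlatFile.open(Bio::Fastq, File.open("sample.fastq")).each_entry { |seq| }
  end
  faster = wall_time do
    Bio::Faster.parse("sample.fastq") { |seq| }
  end
  puts format("Bio::Fastq: %.2fs, Bio::Faster: %.2fs, speed-up: %.1fx",
              reference, faster, reference / faster)
end
```

A single run like this is only indicative; disk caching and system load can change the numbers noticeably between runs.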
The current version of Bio::Faster has been tested successfully with Ruby 1.9. If you want to run the tests or work on the code, you will need to compile the C extension manually. A Rake task is already available for this purpose; just run
rake ext:build
and you are ready to go.