fstrozzi edited this page Jan 4, 2012 · 24 revisions

Intro

Bio::Faster is a BioRuby plugin that implements a fast and simple parser for FastA and FastQ files. It is based on the Kseq library written in C by Heng Li (http://lh3lh3.users.sourceforge.net/parsefastq.shtml).

So far Bio::Faster implements a single method, parse, which opens a plain or compressed (gzipped) FastA/FastQ file. The method takes a block and yields an array to it for each sequence in the dataset. FastA and FastQ files are accessed in exactly the same way.

Bio::Faster.parse("sequences.fastq") do |sequence_id, comment, sequence, quality|
  puts "#{sequence_id}, #{comment}, #{sequence}, #{quality}"
end

Note on quality values for FastQ files

Quality values are converted on the fly from the ASCII codes present in FastQ files. The conversion is tested against the BioRuby Bio::Fastq methods as a reference, and it is fully compatible with the current standards for both Illumina 1.8+ and SFF 454 FastQ files (Sanger/Phred format). It is not compatible with older Illumina FastQ versions (i.e. Solexa format).
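
To illustrate what this conversion means, here is a minimal pure-Ruby sketch of Sanger/Phred decoding, where each quality character encodes a score as its ASCII code minus 33. This is not the Bio::Faster implementation (the gem does this in C); the constant and method names are made up for the example.

```ruby
# Illustrative only: decode a Sanger/Phred+33 quality string into integer scores.
# PHRED_OFFSET and decode_phred33 are hypothetical names, not part of Bio::Faster.
PHRED_OFFSET = 33

def decode_phred33(quality_string)
  # Each byte is an ASCII code; subtracting the offset gives the Phred score.
  quality_string.each_byte.map { |byte| byte - PHRED_OFFSET }
end

puts decode_phred33("II5!").inspect
# prints [40, 40, 20, 0]
```

Older Illumina and Solexa files use an offset of 64 instead, which is why a decoder fixed at 33 would report wrong scores for them.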

Benchmark

Bio::Faster is very fast at reading huge FastA and FastQ files, like the ones generated by Next Generation Sequencing. This is due to the internal C library (Kseq) and to a non-object-oriented approach: for each sequence the method does not build any complex object, just a simple array. If you need to read a huge file, for example to store sequence data somewhere else (e.g. a database) or to extract quality or sequence information, Bio::Faster could be a good choice. Compared to the standard BioRuby Bio::Fastq method, Bio::Faster is 4-5 times faster.

Here are some tests that parse a 1 Gb FastQ file using BioRuby Bio::Fastq and Bio::Faster on the same Linux server with Ruby 1.9.3 and BioRuby 1.4.2:
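
As a rough sketch of the "extract quality or sequence information" use case, the helper below accumulates a few counters from records shaped like the arrays described above. It assumes each record is [sequence_id, comment, sequence, quality] and that quality is an array of integer scores (nil or empty for FastA); the method name and the sample data are hypothetical. With Bio::Faster, the same counters could be updated inside the parse block instead of iterating over an in-memory array.

```ruby
# Hypothetical helper: gather simple statistics from parsed records.
# Assumes each record looks like [sequence_id, comment, sequence, quality],
# with quality given as an array of integers (nil or empty for FastA).
def summarize(records)
  stats = { :reads => 0, :bases => 0, :quality_sum => 0, :quality_count => 0 }
  records.each do |_id, _comment, sequence, quality|
    stats[:reads] += 1
    stats[:bases] += sequence.length
    next if quality.nil? || quality.empty?
    stats[:quality_sum] += quality.inject(0, :+)
    stats[:quality_count] += quality.length
  end
  stats
end

# Made-up records standing in for what a parser would yield.
sample = [
  ["read1", "", "ACGT", [40, 40, 30, 30]],
  ["read2", "", "GG", [20, 20]]
]
stats = summarize(sample)
puts "#{stats[:reads]} reads, #{stats[:bases]} bases"
# prints 2 reads, 6 bases
```

Only plain arrays, strings and integers are involved, which matches the lightweight, object-free style described above.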

Bio::Fastq (standard BioRuby)

Bio::FlatFile.open(Bio::Fastq, File.open("sample.fastq")).each_entry {|seq|}

# Time
# 139.30s user 0.73s system 99% cpu 2:20.06 total

Bio::Faster

Bio::Faster.parse("sample.fastq") {|seq|}

# Time
# 34.71s user 0.59s system 99% cpu 35.320 total
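
For reference, the speedup implied by these two runs can be worked out directly from the reported totals. The snippet below is only arithmetic on the numbers shown above; the method name is made up for the example.

```ruby
# Ratio between two timings, in seconds (hypothetical helper name).
def speedup(slow_seconds, fast_seconds)
  slow_seconds / fast_seconds
end

# Wall-clock totals from the runs above: 2:20.06 versus 35.320 seconds.
wall = speedup(2 * 60 + 20.06, 35.320)
# User CPU time from the runs above: 139.30s versus 34.71s.
user = speedup(139.30, 34.71)

puts format("wall-clock: %.2fx, user CPU: %.2fx", wall, user)
# prints wall-clock: 3.97x, user CPU: 4.01x
```

On this particular file the gain is therefore about 4x, at the lower end of the range quoted above.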

For developers

The current version of Bio::Faster has been tested successfully with Ruby 1.9. If you want to run the tests or work on the code, you will need to compile the C extension manually. A Rake task is already available for this purpose, just run

rake ext:build

and you are ready to go.
