fstrozzi edited this page Jan 4, 2012 · 24 revisions

Intro

Bio::Faster is a BioRuby plugin that implements a fast and simple parser for FastA and FastQ files. It is based on the Kseq library written in C by Heng Li (http://lh3lh3.users.sourceforge.net/parsefastq.shtml).

Bio::Faster currently implements a single method, parse, which opens a plain or gzip-compressed FastA/FastQ file. The method takes a block and yields one array per sequence in the dataset. FastA and FastQ files are accessed in exactly the same way.

Bio::Faster.parse("sequences.fastq") do |seq|
  seq[0] # sequence ID
  seq[1] # comments
  seq[2] # sequence
  seq[3] # array of quality values (FastQ only)
end
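Because each record is a plain array, it can be handy to wrap the positional fields in something more readable inside the block. The following is a minimal sketch, not part of Bio::Faster: `record_to_hash` is a hypothetical helper, and it assumes that the quality field is either nil or empty for FastA input. A hand-written sample record is used so the sketch runs without the gem installed.

```ruby
# Hypothetical helper (not part of Bio::Faster): converts one record,
# laid out as [id, comment, sequence, quality], into a hash.
def record_to_hash(record)
  id, comment, sequence, quality = record
  has_quality = quality && !quality.empty?
  {
    id: id,
    comment: comment,
    sequence: sequence,
    length: sequence.length,
    # Assumption: FastA records carry no quality values (nil or empty array)
    mean_quality: has_quality ? quality.inject(:+).to_f / quality.size : nil
  }
end

# Stand-in record with the same layout that the parse block receives
sample = ["read1", "lane=1", "ACGT", [40, 40, 30, 10]]
summary = record_to_hash(sample)
puts summary[:length]        # 4
puts summary[:mean_quality]  # 30.0
```

With the gem installed, the same helper could be called from inside the block, e.g. `Bio::Faster.parse("sequences.fastq") { |seq| record_to_hash(seq) }`.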

Note on quality values for FastQ files

Quality values are converted on the fly from the ASCII codes present in FastQ files. The method has been tested against the BioRuby Bio::Fastq methods as a reference, and it is fully compatible with the current standards for both Illumina 1.8+ and SFF 454 FastQ files (Sanger/Phred format). It is not compatible with older Illumina FastQ versions (i.e. Solexa format).
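To illustrate what this conversion means, here is a pure-Ruby sketch of Sanger/Phred decoding, in which each quality character encodes a score as its ASCII code minus 33. This is only an illustration of the encoding, not the code Bio::Faster runs (the real conversion happens in the C extension); `decode_phred33` and `PHRED33_OFFSET` are names made up for this example.

```ruby
# Sanger / Illumina 1.8+ FastQ files store each Phred score as ASCII code - 33
PHRED33_OFFSET = 33

# Hypothetical helper: decodes a FastQ quality string into an array of integers
def decode_phred33(quality_string)
  quality_string.each_byte.map { |byte| byte - PHRED33_OFFSET }
end

scores = decode_phred33("II5+!")
p scores  # [40, 40, 20, 10, 0]
```

Older Illumina and Solexa files use an offset of 64 (and, for Solexa, a different score definition), so decoding them with the offset above would give wrong values.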

Benchmark

Bio::Faster is very fast at reading huge FastA and FastQ files, such as those generated by Next Generation Sequencing. This is due to the internal C library (Kseq) and to a non-object-oriented approach: for each sequence the method does not build a complex object, just a simple array. If you need to read a huge file, for example to store sequence data somewhere else (e.g. a database) or to extract quality or sequence information, Bio::Faster could be a good choice. Compared to the standard BioRuby Bio::Fastq method, Bio::Faster is roughly 4-5X faster.

Here are the results of parsing a 1 Gb FastQ file with BioRuby Bio::Fastq and with Bio::Faster on the same Linux server, using Ruby 1.9.3 and BioRuby 1.4.2:

Bio::Fastq (standard BioRuby)

Bio::FlatFile.open(Bio::Fastq, File.open("sample.fastq")).each_entry {|seq|}

# Time
# 139.30s user 0.73s system 99% cpu 2:20.06 total

Bio::Faster

Bio::Faster.parse("sample.fastq") {|seq|}

# Time
# 34.71s user 0.59s system 99% cpu 35.320 total
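The figures above are in the style of the shell's `time` output. To repeat the comparison from inside Ruby, a small harness built on the standard library's Benchmark module could look like the sketch below. `time_block` is a hypothetical helper, and the timed workload here is a stand-in loop so that the example runs without either parser installed.

```ruby
require 'benchmark'

# Hypothetical helper: runs the given block once and reports wall-clock time
def time_block(label)
  seconds = Benchmark.realtime { yield }
  puts format("%s: %.2fs", label, seconds)
  seconds
end

# Stand-in workload; replace the block body with a real parser call
elapsed = time_block("stand-in loop") { 100_000.times { |i| i * 2 } }
```

With both libraries installed, the blocks to compare would be `Bio::Faster.parse("sample.fastq") { |seq| }` and the `Bio::FlatFile` call shown above, each passed to `time_block` with its own label.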

For developers

The current version of Bio::Faster has been tested successfully with Ruby 1.9. If you want to run the tests or work on the code, you will need to compile the C extension manually. A Rake task is already available for this purpose; just run

rake ext:build

and you are ready to go.
