Thursday, August 4, 2011

Bio::SeqIO very very slow

While it may be a convenient and flexible solution when reading in tens, hundreds, or even thousands of sequences, BioPerl's Bio::SeqIO module is just not a workable solution for reading in Next Gen Sequencing files.

 

Consider reading in 100,000 short reads from a FASTQ file.

It takes 30 seconds to do it the BioPerl way:

use strict;
use warnings;
use Bio::SeqIO;
my %seqhash;    # sequence string => number of times it was seen
my $seqin = Bio::SeqIO->new(-format => "fastq", -file => "infilename");
while (my $seq = $seqin->next_seq()) { $seqhash{ $seq->seq() }++; }

 

But it takes only half a second to do it the home-made way (using a basic file I/O module, File::Util):

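use strict;
use warnings;
use File::Util;

my $infilename = "infilename";    # placeholder path, as in the Bio::SeqIO example
my (%seqhash, $counter);
my $f   = File::Util->new();
my $ifh = $f->open_handle('file' => $infilename, 'mode' => 'read');
while (my $line = <$ifh>)
{
    # Assumes 4 lines per FASTQ record; the sequence is line 2 of each record.
    # Matching on '@' alone is unsafe, because quality lines can contain '@' too.
    next unless $. % 4 == 2;
    $counter++;
    if ($line =~ /(\w+)/) { $seqhash{$1}++; }
}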
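
For anyone who wants to try the home-made approach without installing File::Util, here is a minimal sketch using only core Perl. The helper name count_fastq_sequences and the three sample reads are made up for illustration (they are not from the timings above), and the code assumes every record is exactly four lines, which is typical for short-read FASTQ files but not guaranteed by the format. Time::HiRes is used only to show one way of measuring elapsed time.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper (not from the original post): tally sequences read from
# an open FASTQ handle, assuming exactly 4 lines per record.
sub count_fastq_sequences {
    my ($fh) = @_;
    my %count;
    my $line_no = 0;
    while (my $line = <$fh>) {
        $line_no++;
        next unless $line_no % 4 == 2;    # line 2 of each record is the sequence
        chomp $line;
        $count{$line}++;
    }
    return \%count;
}

# Tiny in-memory FASTQ sample (made-up reads) so the sketch runs without a file.
# The second quality line starts with '@', which a header-matching regex
# would mistake for the start of a new record.
my $fastq = join("\n",
    '@read1', 'ACGT', '+', 'IIII',
    '@read2', 'ACGT', '+', '@III',
    '@read3', 'TTGA', '+', 'IIII',
) . "\n";
open(my $fh, '<', \$fastq) or die "Cannot open in-memory FASTQ: $!";

my $t0      = [gettimeofday];
my $counts  = count_fastq_sequences($fh);
my $elapsed = tv_interval($t0);
close $fh;

printf "%s\t%d\n", $_, $counts->{$_} for sort keys %$counts;
printf "elapsed: %.6f s\n", $elapsed;
```

On the sample data this prints ACGT with a count of 2 and TTGA with a count of 1, followed by the elapsed time. To run it on a real file, replace the in-memory open with an ordinary open on the file path.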

Not as elegant, and it keeps only the sequences (no read IDs or quality scores), but it works much faster!
