Tuesday, September 13, 2011

Pull desired sequences out of a multiple FASTA file with regex pattern match

This simple problem stumped me for a while.

Say you want just the human sequences ("hsa") from the following multiple FASTA file:



>cel-mir-90 MI0000059 Caenorhabditis elegans miR-90 stem-loop
GGGCGCCAUUUCGAGCGGCUUUCAACGACGAUAUCAACCGACAACUCACACUUUUGCGUG
UUGAUAUGUUGUUUGAAUGCCCCUUGAAUUGGAUGCCA
>hsa-let-7a-1 MI0000060 Homo sapiens let-7a-1 stem-loop
UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCACUGGGAGAUAACUAU
ACAAUCUACUGUCUUUCCUA
>hsa-let-7a-2 MI0000061 Homo sapiens let-7a-2 stem-loop
AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGAUAACUGUACAGCCU
CCUAGCUUUCCU
>dme-mir-13b-2 MI0000135 Drosophila melanogaster miR-13b-2 stem-loop
UAUUAACGCGUCAAAAUGACUGUGAGCUAUGUGGAUUUGACUUCAUAUCACAGCCAUUUU
GACGAGUUUG
>dme-mir-14 MI0000136 Drosophila melanogaster miR-14 stem-loop
UGUGGGAGCGAGACGGGGACUCACUGUGCUUAUUAAAUAGUCAGUCUUUUUCUCUCUCCU
AUA
>mmu-let-7g MI0000137 Mus musculus let-7g stem-loop
CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGA
UAACUGUACAGGCCACUGCCUUGCCAGG
>hsa-mir-30d MI0000255 Homo sapiens miR-30d stem-loop
GUUGUUGUAAACAUCCCCGACUGGAAGCUGUAAGACACAGCUAAGCUUUCAGUCAGAUGU
UUGCUGCUAC
>mmu-mir-122 MI0000256 Mus musculus miR-122 stem-loop
AGCUGUGGAGUGUGACAAUGGUGUUUGUGUCCAAACCAUCAAACGCCAUUAUCACACUAA
AUAGCU

In other words, you want just these sequences:
>hsa-let-7a-1 MI0000060 Homo sapiens let-7a-1 stem-loop
UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCACUGGGAGAUAACUAU
ACAAUCUACUGUCUUUCCUA
>hsa-let-7a-2 MI0000061 Homo sapiens let-7a-2 stem-loop
AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGAUAACUGUACAGCCU
CCUAGCUUUCCU
>hsa-mir-30d MI0000255 Homo sapiens miR-30d stem-loop
GUUGUUGUAAACAUCCCCGACUGGAAGCUGUAAGACACAGCUAAGCUUUCAGUCAGAUGU
UUGCUGCUAC

I knew I could do it with a script, using if/then/else testing. I know some people would say I should use BioPerl's Bio::SeqIO module. But I wanted to do it in the bash shell, both for simplicity and to learn some slightly more advanced usage of the amazing text and regex tools available in almost any vanilla Linux distro.

This posting from Austin Matzko's blog looked really useful, but ultimately it was The Grymoire that awed me with its comprehensiveness and clarity on all things sed.

I made a slight modification to The Grymoire's example (see its "Working with Multiple Lines" section), testing for a FASTA header line instead of a blank line, and came up with a one-line solution:
sed -n '/^>/ b para; H; $ b para; b; :para; x; /hsa/ p' inputfile.fasta
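The one-liner puts semicolons directly after the labels (b para; and :para;). GNU sed accepts that, but POSIX does not guarantee it, so other seds may read the rest of the line as part of the label name. Here is a minimal sketch of the same logic written with one -e fragment per command, run against a three-record sample cut from the file above. The temporary file name, the variable names, and the tighter pattern /^>hsa-/ (which only matches at the start of the header) are my own assumptions for illustration, not part of the original command.

```shell
#!/bin/sh
# Build a small sample file: three records copied from the example above.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.fasta" <<'EOF'
>cel-mir-90 MI0000059 Caenorhabditis elegans miR-90 stem-loop
GGGCGCCAUUUCGAGCGGCUUUCAACGACGAUAUCAACCGACAACUCACACUUUUGCGUG
UUGAUAUGUUGUUUGAAUGCCCCUUGAAUUGGAUGCCA
>hsa-let-7a-1 MI0000060 Homo sapiens let-7a-1 stem-loop
UGGGAUGAGGUAGUAGGUUGUAUAGUUUUAGGGUCACACCCACCACUGGGAGAUAACUAU
ACAAUCUACUGUCUUUCCUA
>mmu-mir-122 MI0000256 Mus musculus miR-122 stem-loop
AGCUGUGGAGUGUGACAAUGGUGUUUGUGUCCAAACCAUCAAACGCCAUUAUCACACUAA
AUAGCU
EOF

# Same commands as the one-liner, one -e per command, so every label
# ends at a fragment boundary instead of at a semicolon.
# The hold space collects the current record; on the next header (or on
# the last line) the record is swapped into the pattern space and tested.
human=$(sed -n \
    -e '/^>/ b para' \
    -e 'H' \
    -e '$ b para' \
    -e 'b' \
    -e ':para' \
    -e 'x' \
    -e '/^>hsa-/ p' \
    "$tmpdir/sample.fasta")

printf '%s\n' "$human"
rm -rf "$tmpdir"
```

Since the swapped-in record always begins with its header line, anchoring the pattern with ^> keeps a stray "hsa" elsewhere in a description from selecting the wrong record.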

But it is probably easier to read as a script, which takes the pattern as its first argument and the input file as its second:
#!/bin/sh
sed -n '
# thanks to http://www.grymoire.com/Unix/Sed.html
#
# if this line is a FASTA header (starts with >), check the paragraph collected so far
/^>/ b para
# else add it to the hold buffer
H
# at end of file, check paragraph
$ b para
# now branch to end of script
b
# this is where a paragraph is checked for the pattern
:para
# return the entire paragraph
# into the pattern space
x
# look for the pattern, if there - print
/'"$1"'/ p
' "$2"
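To show how the pattern and file arguments are meant to be used, here is a hedged sketch that wraps the same sed commands in a shell function and calls it with two different patterns. The function name fasta_grep, the temporary file, and the variable names are hypothetical. The sed program is written with one -e per command, which does the same thing as the script above. Note that the pattern is pasted into a /.../ address, so it is a basic regular expression and must not contain a slash.

```shell
#!/bin/sh
# fasta_grep PATTERN FILE
# Print every FASTA record (header plus sequence lines) matching PATTERN.
# fasta_grep is a hypothetical name; PATTERN must not contain "/".
fasta_grep() {
    sed -n -e '/^>/ b para' -e 'H' -e '$ b para' -e 'b' \
           -e ':para' -e 'x' -e "/$1/ p" "$2"
}

# Sample input: three records copied from the example file above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>hsa-let-7a-2 MI0000061 Homo sapiens let-7a-2 stem-loop
AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGAUAACUGUACAGCCU
CCUAGCUUUCCU
>dme-mir-14 MI0000136 Drosophila melanogaster miR-14 stem-loop
UGUGGGAGCGAGACGGGGACUCACUGUGCUUAUUAAAUAGUCAGUCUUUUUCUCUCUCCU
AUA
>mmu-let-7g MI0000137 Mus musculus let-7g stem-loop
CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGA
UAACUGUACAGGCCACUGCCUUGCCAGG
EOF

# Match on a word from the description line.
fly=$(fasta_grep 'Drosophila' "$sample")
# Match on the species prefix, anchored to the start of the header.
mouse=$(fasta_grep '^>mmu-' "$sample")

printf '%s\n' "$fly" "$mouse"
rm -f "$sample"
```

The mouse record is the last one in the file, so it is printed by the end-of-file branch rather than by the arrival of a following header.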
