regex - How to extract FASTA sequences from a file whose header lines match a list in another file?
I'm a newbie to Perl. I'm trying to extract FASTA sequences from one file whose headers match lines in another file. The two example files are as follows:
file1.fasta:
>gene_44|105_nt|+|47540|47644
gtgcgccggcgcgtcgcgatcgcgaaccggcccgtgcgaatcctgccgcatgcgcgccgcatctcgccacgccgcgcatttcatttcgacatccataacgtctga
>gene_69|111_nt|+|75846|75956
atgccgttgccgtcgcgcatcgcggcggccgtgcgcggcgcgcatgcatacgccggcacggccgatgcgcgcgcgacgcgcaaactgcacgcggcgcgggatttgtgttga
>gene_88|177_nt|-|97993|98169
atgcgccagccgacgcacgcccattccgggcgaaacgttccccttatccattcgatcatccgtgccgcactgcgcgaagcggccaccgccgacacgtaccaaaccgcgctcgatgcgaccggcgcggcactcgtcgccatcgcggcgctcgtgcgcgcggaggtgcggcatggctga
>gene_90|141_nt|-|99016|99156
ttggaagggcgctttccgcgtgcgagtcgtctgacgcagcgttgcacggtctggtcgaatcgcgagcttcatcgctggatggccgatccgttgaactatcgcgctgtcgacgcggcgaaccagacgacggagggcgcgtaa
file2.list:
somewordsinfront, >gene_44|somewordsattheback
blablabla, >gene_88|blablablablabla
The output I expect is as follows:
>gene_44|105_nt|+|47540|47644
gtgcgccggcgcgtcgcgatcgcgaaccggcccgtgcgaatcctgccgcatgcgcgccgcatctcgccacgccgcgcatttcatttcgacatccataacgtctga
>gene_88|177_nt|-|97993|98169
atgcgccagccgacgcacgcccattccgggcgaaacgttccccttatccattcgatcatccgtgccgcactgcgcgaagcggccaccgccgacacgtaccaaaccgcgctcgatgcgaccggcgcggcactcgtcgccatcgcggcgctcgtgcgcgcggaggtgcggcatggctga
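How can I achieve that? Thanks in advance! :)
Next time you ask a question, please show the code you have tried. For example:
use strict;
use warnings;

# Collect the gene IDs found between '>' and '|' on each line of the list.
my @genes;
open my $list, '<', 'file2.list' or die "Cannot open file2.list: $!";
while (my $line = <$list>) {
    push @genes, $1 if $line =~ /[^>]+>([^|]+)/;
}
close $list;

# Slurp the whole FASTA file into a single string.
my $input;
{
    local $/ = undef;
    open my $fasta, '<', 'file1.fasta' or die "Cannot open file1.fasta: $!";
    $input = <$fasta>;
    close $fasta;
}

# Split into records on '>' and print each record whose header starts with a listed ID.
# Anchoring on '^' and the trailing '|' stops gene_44 from also matching gene_440.
my @lines = split />/, $input;
foreach my $l (@lines) {
    foreach my $reg (@genes) {
        print ">$l" if $l =~ /^\Q$reg\E\|/;
    }
}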
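The nested loop above tests every record against every ID. As a rough sketch of an alternative, assuming each ID in file2.list sits between '>' and the first '|' and that every FASTA header starts with that same ID, the IDs can go into a hash and the FASTA file can be read one record at a time. The sub names read_ids and filter_fasta are made up for illustration, and the file names are taken from the command line instead of being hard-coded.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the IDs found between '>' and the first '|'
# on each line of the list, returned as a hash reference for fast lookup.
sub read_ids {
    my ($fh) = @_;
    my %ids;
    while (my $line = <$fh>) {
        $ids{$1} = 1 if $line =~ />([^|\s]+)/;
    }
    return \%ids;
}

# Hypothetical helper: read FASTA records one at a time and return, as a
# single string, those whose header ID (text before the first '|') is listed.
sub filter_fasta {
    my ($fh, $ids) = @_;
    local $/ = "\n>";          # each read returns one record
    my $out = '';
    while (my $rec = <$fh>) {
        chomp $rec;            # drop the trailing "\n>" separator, if any
        $rec =~ s/^>//;        # the first record still carries its '>'
        $rec =~ s/\s+\z//;     # trim trailing whitespace and newlines
        my ($id) = $rec =~ /^([^|\s]+)/;
        $out .= ">$rec\n" if defined $id && $ids->{$id};
    }
    return $out;
}

# Run as a script only when both file names are given on the command line.
if (@ARGV == 2) {
    my ($list_file, $fasta_file) = @ARGV;
    open my $list,  '<', $list_file  or die "Cannot open $list_file: $!";
    open my $fasta, '<', $fasta_file or die "Cannot open $fasta_file: $!";
    print filter_fasta($fasta, read_ids($list));
}
```

Saved under a name of your choice (extract.pl here is only a placeholder), it could be run as: perl extract.pl file2.list file1.fasta > output.fasta. Because the lookup is an exact hash match on the ID, gene_44 does not pick up gene_440, and records are printed in the order they appear in the FASTA file.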