bioinformatics - Splicing through a line of a text file using Python

I am trying to create genetic signatures. I have a text file full of DNA sequences. I want to read in each line from the text file and add its 4-mers (substrings of 4 bases) to a dictionary. For example, take this sample sequence:

atgatatatctatcat

What I want is to add atga, tgat, gata, etc. to a dictionary, with IDs that increment by 1 as the 4-mers are added.

So the dictionary will hold...

    genetic signature, id
    atga, 1
    tgat, 2
    gata, 3

Here is what I have so far...

    import sys

    def main ():
        readingfile = open("signatures.txt", "r")
        my_dna=""

        dnaseq = {} #creates dictionary

        for char in readingfile:
            my_dna = my_dna+char

        for char in my_dna:
            index = 0
            dnaid=1
            seq = my_dna[index:index+4]

            if (dnaseq.has_key(seq)): #checks if key in dictionary
                index= index +1
            else :
                dnaseq[seq] = dnaid
                index = index+1
                dnaid= dnaid+1

        readingfile.close()

    if __name__ == '__main__':
        main()

Here is my output:

actc actc actc actc actc actc

This output suggests that it is not iterating through each character in the string... please help!

You need to move the index and dnaid declarations before the loop; otherwise they are reset on every loop iteration:

    index = 0
    dnaid=1
    for char in my_dna:
        #... rest of loop here

Once you make that change, you get this output:

    atga 1
    tgat 2
    gata 3
    atat 4
    tata 5
    atat 6
    tatc 6
    atct 7
    tcta 8
    ctat 9
    tatc 10
    atca 10
    tcat 11
    cat 12
    at 13
    t 14

To avoid the last 3 items, which are not the correct length, you can modify the loop:

    for i in range(len(my_dna)-3):
        #... rest of loop here

This doesn't loop through the last 3 characters, making the output:

    atga 1
    tgat 2
    gata 3
    atat 4
    tata 5
    atat 6
    tatc 6
    atct 7
    tcta 8
    ctat 9
    tatc 10
    atca 10
    tcat 11
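Putting both fixes together, here is a minimal Python 3 sketch of the whole thing. It is one possible arrangement, not the original code: the function name `build_signatures` and the parameter `k` are made up for illustration, and it assumes each line should be handled as its own sequence (with the trailing newline stripped) rather than joined into one long string. It also uses `in` for the key check, since `dict.has_key` no longer exists in Python 3.

```python
def build_signatures(lines, k=4):
    """Map each distinct k-mer to an ID, numbered from 1 in order of first appearance.

    `build_signatures` and `k` are hypothetical names used for this sketch.
    """
    signatures = {}
    next_id = 1  # initialised once, outside the loops, so it is never reset
    for line in lines:
        seq = line.strip()  # assumption: one sequence per line; drop the newline
        # Stop k - 1 characters before the end so every slice is exactly k long.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer not in signatures:  # `in` replaces Python 2's dict.has_key
                signatures[kmer] = next_id
                next_id += 1
    return signatures


if __name__ == "__main__":
    # The sample sequence from the question.
    for kmer, kmer_id in build_signatures(["atgatatatctatcat\n"]).items():
        print(kmer, kmer_id)
```

Because iterating over an open file yields its lines, the handle returned by `open("signatures.txt")` can be passed straight in as `lines`. Unlike the printed output above, where repeated 4-mers show up again next to the current counter value, the dictionary stores each 4-mer only once, so the sample sequence gives 11 entries.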
