zoukankan      html  css  js  c++  java
  • 找最大重复字串

    找最大重复字串

    http://www.chinaunix.net/jh/24/464277.html

    http://www.chinaunix.net 作者:bioinfor  发表于:2007-03-27 19:16:25
    【发表评论】【查看原文】【Shell讨论区】【关闭】 

    1) Write a program to identify all the repetitive patterns in a string of
    charaters (INPUT).  The string is only composed of A,C,G,T characters.  The
    maximum length of string is 10000.  The minimum length of repeat is 10
    characters.   Output: position, size, and patterns.  Here is an example:
    1)写一个程序,识别字符串中所有的重复片段(重复模式),字符串由A,C,G,T组成,字符串最长为10000,随机产生。重复的片段最小是10个符串。输出:位置,大小,和片段。如下:
    String:
    TAAAAACTCGGGGT AAAAACTCGGGGAAAA
    Repeat:
    Repeat: AAAAACTCGGGG, Size: 12, Start Positions: 2, 15

    解释:如这就就有两个重复(空格格开了):T  AAAAACTCGGGG  T AAAAACTCGGGG AAAA
    这两个重复位置分别在字符串的2和15位,大小为12

    2) Write a program to identify all the INVERTED repetitive patterns (e.g.
    TAACCG => GCCAAT) in a string of character (INPUT).  The string is only
    composed of A,C,G,T characters.  The maximum length of string is 10000.  The
    minimum length of repeat is 10 characters.   Output: position, size, and
    patterns.   Here is an example:

    写一个程序识别所有的反向重复,如TAACCG => GCCAAT,也就是前面的反过来就是后面的字符串。和上面一样,字符串最长为10000,随机产生,最小反向重复片段为10,输出位置,大小,和片段,如下:
    String:
    CAAAAACGAGGGGTTTGGGGAGCAAAAA
    Inverted Repeat:
    Inverted Repeat: AAAAACGAGGGG, Size: 12, Start Positions: 17, 2

    解释:如上面,C  AAAAACGAGGGG  TTT   GGGGAGCAAAAA
    AAAAACGAGGGG和GGGGAGCAAAAA分别是反向重复,分别在2和17位上,大小为12。
     
     
    哈, 学生物的吧。 去 perl.com 有 total solution, 如 bioperl.

    或用 awk 自己写, 要讲效率的话, 还是 C 好。

     lightspeed 回复于:2004-12-12 12:26:33

    注意: 下面的例子中的 STR_MAX 及 STR_MIN 可根据需要设定。 

    #!/bin/awk -f
    #
    # A script can be used to check any repeat pieces of nucleotide sequences.
    #
    # Design: lighspeed
    # Date: Dec. 12, 2004
    #
    # Usage::  $0 datafile
    #
    # Variables:
    #
    # left - a repeat string to be matched
    # right - the right side string used to match any left in it
    # rev_left - reverse string of left
    # rev_right - the right side string used to match any rev_left in it
    # flag[position] - an array element which will be set if the string in the position is matched
    #

    {
      L=length($0)
      STR_MIN=10
    #  STR_MAX=int(L / 2)
      STR_MAX=20
      print "------------------- Line# "NR" --------------------\n"

      for ( Str_Len=STR_MIN; Str_Len <= STR_MAX; Str_Len ++ ) {

        for ( i in flag )
          delete flag

        for ( Position=1; Position <= L - 2 * Str_Len + 1; Position ++ ) {

          if ( Position in flag )
            continue

          count=0
          pos=Position
          offset=Position
          rev_offset=offset
          rev_count=0
          rev_pos=""
          rev_left=""
          left=substr($0,Position,Str_Len)
          if (index(left,"A")==0 || index(left,"C")==0 || index(left,"G")==0 || index(left,"T")==0 )
            continue

          right=substr($0, Position + 1)

          for ( i=length(left); i>=1; i-- ) {
            rev_left=rev_left""substr(left,i,1)
          }

          rev_right=right

          while ( Str_Len <= length(right) ) {
            i=index(right,left)
            if ( i > 0 ) {
       count ++
              j=offset + i
       pos=pos";"j
              flag[j]=1
              right=substr(right, i + 1)
              offset+=i
            }
            else
              break
          }

          while ( Str_Len <= length(rev_right) ) {
            i=index(rev_right,rev_left)
            if ( i > 0 ) {
       rev_count ++
              j=rev_offset + i
              if ( rev_pos == "" ) {
         rev_pos=j
              }
              else
         rev_pos=rev_pos";"j
              flag[j]=1
              rev_right=substr(rev_right, i + 1)
              rev_offset+=i
            }
            else
              break
          }

          if (count > 0 )
            match_number[Str_Len] ++

          if (rev_count > 0 )
            rev_number[Str_Len] ++

          if (count > 0 || rev_count > 0) {
            print  left, "Length="Str_Len, "Position="pos, (rev_count >0) ? "Rev_Position="rev_pos : ""
          }
         
        }
      }
      print ""
      print "Summary of Line# "NR
      print "------------------"
       for ( i in match_number ) {
        print "String length :: " i "  Matched Strings :: " match_number
        delete match_number
      }
      print ""
      for ( i in rev_number ) {
        print "String length :: " i "  Matched Reverse Strings :: " rev_number
        delete rev_number
      }                       

    }
     


    数据文件来自 DNA  NCBI35 的片段。 处理成 10000 个字符的单行文件。


    # cat datafile

    TAAAATGTGTAATCAACTAATACAAAGCAAGTTTTGTACTTTTTGTTGAATTTATTACTAAGTAT
    TCTTTTTGATGCAATTGTAAGTAGAAATATTTATTTATTAAGAGATAGGGTCTTACTGTGTGGCC
    CAGTATGGCCTTGAACTCCTGGGCTTAAGACATCCTCCTGCTGCAGCCTCCTGAGTAACTGAGAT
    TACAGGTGTGCACCACCTCGCCTGGCTCAGAATGGTTTTCTTAACTTCATTTTTAGATTGTTCAC
    TGTGAATATATCGAATTACAATAGTTTAGGCTGGGCATGGTGGCTCACGCCTGTAATCCTAGCAC
    TTGGGGAGGCTGAGGTGGGTGGATAACTTGAGGCCAGGAGTTTCAGATCAGCCTGGCCATCACAG
    AGAAACCTTGTCTTTACCAAAATCACAACAAATTAATTAGCTGGTTGTGGTGGTGCATGCTTGCA
    ATCCCAGCTACTGGGGAGGCTGAGGTACGAGAATTACCTGAACCCAGGAGGTGGAGGTTGCAGTG
    AACCGAGATAGTTCCACTGCACTCCAGCCTGGGCGACAGAACGGTTTTTGTATGCTTCAACCTTA
    CTGAACTCATTTATTCATTCTGATATTTACTTTAGTGGATTCTGTATGATTTTCTATATGCAAGA
    TGCTGTCATTTGCAAATAGAGATAGTTTTTCTTTTTTTGTTTCCAATCTGAATGTGTTTTATTTC
    ATTTTCTTGCCTAATGCTCCTCATTAGTTTTCAATGTTGCATAGTATTTTATTGCATGGACGCAC
    CATAATTACTTTTACCAATCTCTTATTGATGGACATGAAGGTTATTTCCAAACTCTTGTGGTTAT
    AACAATGCTGTAATAGATAACCTATTACAAAGAACAGTTCTCAACTCTTTTGGTCTTGGGACTAC
    TTTACCTATTTATGTATAAGTTTCAAGTTTGGGCTTAGAAAGAATTTAATAATCATGCTAATTTT
    GTTTTGTTTTCTTTTTTTTTACTCCTGGACCCAAGCGGTCTTCCCACCTCAACCTCCCAAGTAGC
    TGAGACTACAAGGGTGAACCATCACCCTGGGTAATTTTTAAATTGTTGGCTGGGCACAGTGGCTC
    ACGCCTGTAATCCTAGCACTTTGGGAGGCTGAGACAGGCGGATTACCTGAGGTTGGGAGTTCAAG
    ACCAGCCTGGCCAACATGGTGAAACCCTGTCTCTACTAAAAACACAAAAAATGAGCCAGGTGCAG
    TGGTGCGTGCCTGTAATCCCAGCTACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAATTCAGGAG
    GTGGATGTTGCGGTGAGCTGAGATCGTGCCACTGAACTCCAGCCTGGGCGACAGAGCAAGATTTC
    ATTTCAAAAAACAAAAAGAAAAAAATTTTTAAAAATTGTTTTGAAGAGATACGGTTTCCCTATGT
    TGCCTAGGCTGGTCTCATGCGATTCTCCTGCCTTGGCCTCCCAAAGTGTTGGGATTATAGACATG
    AGACACCACAAATTTAAACAAGGACTTTTTTTATTTTTTAAAGAGATTACTTTTTCTGAGTAAAC
    AAGGACTTTTAAAACAAGGTACTAAAAATCTGGCTGGGCGTGGTGGCTCGCTCCTGTAATCCCAG
    CACTTTGGGAGGCTGAGGTGGGCGGATCACGAGGTCAAGAGATCAAGACCATTCTGGCCAACATA
    GTGAAACCCCGTCTCTGCTAAAAATACAAAAATTAGCTAGGTGTGGTGGTGCACGCCTGTAGTCC
    CAGCTGCTCAGGAGGCTAAGGCAGGAGAATCACTTGAACCCGGGAGGCAGAGGTTGTAGTGAGCC
    GAGATCTCACCACTGCACTCCAGCCTGGCAACAGAGTGAGATTCCGTCTCAAAAAAAAAATTTTT
    TTTAATAAATAAATAAATAAAAATCTTGAAATTTTTATTAGGTCCTGGTGTTTCTAATTTTAATA
    TGATTTAGTTCTCAAGTGCTAGTTAATACTTCATTAATCAGCCAGATGGAAGTGGGGATACTATG
    GAAACAGCATAGGCAAAGCTTAAAGATAAATGAGACCATGGTTTGAAAATATAGGGTGGCATGCG
    CTTTGGTTCAAGGCAATTTGATCATCACAACAATTTGGCTTAAACAGCACTTTGGTTGAAAATGA
    ATATCCCCTAGTTATGTGTTTTTCAAGTATTGGTCATTTTGGTATATCATGAGTTGTTTTGCAAA
    CTTTTGTGCCAAAGTTTTCAGGAAAACTTTCTAATATTTGCTTTTGTGTTTCTAACTGATTTTCA
    GAGAAGTTGTAATTTTGATGTTTTTTCCTTTTAGTGAGCATGCTTTAACAAAAAACAATAACAGA
    AACTGTGTCAAAGAAAAGGACCTGTAATCTTCAGGGTTTGTAGTCTTTTTCCTCTTAAAAAACCC
    TTTTCCTAATTAATGGCAGTTACATCTGCATGGCTGGTTTGGGTAAGTCTTCATTTTGTTGTATT
    GCTGAGTAACAGTCAACAAAGGTTTATCAACTCTTGGTTAAGGGTTCCTTTCATGTTGTGAGTAA
    ACATGAACAATATAGGATCTTATCCTTTTAAGCTATCATGCAAGAAACATGTGAGGTCTCTTAAA
    AATTCACTGTGCTGGCCGGGCATGGTGGCTCACGCTTGTAATCCTAGCACTTTGGGAGGCTGAGG
    TGGGTGGATCACTTGAGGTCAGGAGTTCAAGACCAGCCTGGCCAACATGGTGAAACCCCGTCTCT
    ACAAAAATACAAAAATCAGCCGGGCATGATGGCGGGCAGGTGCTTGTAATCCCAGCTACTTGGGA
    GGCTGAGACAGGAGACTCGCTTGAACCCGGGAGGCGGAGGTTGTAGTGAGCCGAGATTGTGCCAC
    TGCACTCCAGCCTGGATGACAGAGCAAGACTCCATCTCAAAAAAGAAAAAAAAAAAAAATTGTGC
    TGGCTGGGCTCAGTGGCTCACACCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGTGGATCAC
    CTGAGGTCAGGAGTTCAAGACCAGCCTGGCCAACATGGTGAAACCCCATCTCTACAAAAATACAA
    AAATTAGCCAGGCATAATGGCGAGTGCCTGTAATCCAAGCTACTTGGGAGGCTGAGGCAGGAGAA
    TCGCTTGAACCCGGGAGTGAGCCGAGATGGCGCCACTGCACTCTAGCTTGGGTGACAACAGCAAG
    ATTCTGTCTCAGAAAAAAAAAAAAAATTAACTGTGCTTATAAATGGGAGCTAAATTAGGAAAAAA
    ATAAAAAGTAAAAAGAAAATGAAAATAAAAATTTAAAAAATATATTAACAAATTACCTGTCCTAA
    GGTAAAATTCTTTTTTTTTTTCTTGAGACGGAGTCTCGCTCTGTCGCCCACTCGGAAAGGAGTGC
    CAATCTCGGCGTGAAAATGTGTCTGATGCGTATGCACCTGAGCTAGAAAGCCCAAAGACTGCTAA
    GAAGCATGTGAGGGCTCAGAAACAAACATGTTTGGGCTTCGAAAGCCTGTTTTTGGAACCACTTT
    CCCTTGTCTGCAAGGCAGAGGGAGGGAGGTACTCTGTTATTTCTAAGTCTCTCTTGAGCTCTTAC
    ACTGTGCAAGCCCATGAACGTATTTAATCGTGCATTAGACAATTGTTTTTAATCTATGCCCTGCC
    TCTCCCAAGATCAACCTTTCCCTGAGATCGGGGCCCCCTCTGGGTGCACAGGGATATTTTTATTT
    TTTGAGTTGGAGTTTTGCTCTTGTCACCCAGGCTGGAGTGCAATGGCATGATCTTGACTCACTGA
    AACCTCTTCCTCCCGGCTTCCAGTGATTCTTCTGCCTCAGCCTCCCAAGCAGCTGAGATTACAGG
    CATGCACCACCACACTTCGGTTAATTTTTGTATTTTTAGGAGAGATGGAGATTCACCATGTTGGC
    CAGGCTGGTCTTGAACTCCTGACCTCAGGTGATCCTCCCGCCTTGGCCTCCCAAAATGCTGGGAT
    TATAGGCGTGAGGCACCGTGCCCAGCCCATAGGGATATTTTTATATACTTTCCTGCCCCATGGGT
    CAACTGTTCTTGAACCAAAGAAACAAGAGGCGGGGAAGTTATAGGAAGCTTTTAAAATATGCTTC
    TGTGCAGCACTGCTCGCAGCGTGTCACAGATGTGCGGTATTGGAAGACGAAGGTGAAACTGCATG
    GAGATGATTGTGTGGGGGATGAGGAGGTGGTGGGTAGGGGACTTGGCTTTCTTCACACAAAGACA
    TCCAGGCAAATGGTAAGTCCAAAAGCCCTGTGACAGATAATGGCCATTGTTCCTGCAGGGTGACT
    CTTTTCTCTTCTTTTTTTTCTTTTTGAGGCGGAGTCTCACTCTGTCATCTATGCTGGAGTGCAAT
    GGTGCGATCTTGGCTCACTGCAACTTCCGCTTCCCGGGTTCAAAGTGATTTTTCTGCCTCAGCCC
    TCCCGAGTAGCTGGGACTACAGGTGCGCGCCACCATGGCCAGCTAATTTTTATATTTTTAGTAGA
    GACGGGGTTTCTCCATGTTAGCCAGGATGGTCTCGATCTCTTGATCTCGTGATCCACCCGCCTCA
    GCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCGGCGCCCGGCCCTATACACATGATTTTG
    AACATACTGACAGATGGAGAAAACCACTTTGGAAAAGATACTTCACATGTTCTAGAGACGATTTA
    AACCATTAAGCATTCTATGAAGCTTCTGAAGGTCTGTCAGATTTTAAATGACAACAGTGAAATTT
    TAAAACAAGAACAGAAGTCAGCACCAAAGCTAGTTTAACATTAATAATAAGTGAGCCAATAAATA
    GGTCTATGTTTGCCCAGGCAGGTTTTGCTTATTATGTCAGTTGGAAAGCCAGAAGGAAACTGGTT
    TTAACTCTTAATATAACCTGTATCATGACACCATCACTTTACCAGAAATGTAGCTGATGTCAGCA
    TAAGACTGAGACAGTTTACATTTAAAACTGTTGTTTCCTTTCCAACTATTTTCATAATTCATTCA
    TGGTATAGGATTGAGACTATTTCCTTAAACAGAAAAAAATGGGTAATTAACATTGAGAACTTTCC
    ATGTGCCAGATACTGTATGAACTGTCTTAATTTTCATAGCCACCCTGCAAGATATTATCCTCATC
    TTTTTAGAGGAAGAAACAAGTTTCAAGAAATGAAGTAGGTTTTCTAAGGCCACAGCTATAGTAAA
    GAGGTGGAGCTGACATTCAAGCTTGGATATGAATTATTATAATTTCCACAGCACTACACAGCTGT
    CATTTTCTCTACCTGCAAAACTAAATAAATACTGTTAAAAATAAAAGATGATCTCCAAGATCTCT
    AAACATTAAAATTTTACAATAAACTGGTTGAGGTGACACATGCCTATATTTTCAGCTACTCAGGA
    GTTTGAGACTGGCCTGGACAACATAGCAAGACCCTGTCTCTAAATTTAAAAAACAAATTACAATG
    AGATAATCTTAGACCAGAGAAAGGAAAGTGAAATAGCTATTTGGATTATAAACTGTTTTAGTAAC
    TCAAATGTAATGTGTGGTGGTGACAATATCTTTGATTCCTGGGAAGGTCATTGTGAAAGGGAATA
    GAAAATGCCTTGAAGTCAAAATATAAGGCTCTCAAATAGAAAAATAAATATAACATTTAAGTATT
    ATCAACAGAGAACCAAGTTAGAAAAAACTAGTTATAGTCTGAAACAATGCTGTTTAAAAGACTGC
    AGTCACCAGTGTAAACTGACTCAGGCAACACTTCCCAGGGTCCATGCCGTGGACAACTGACTAAT
    CTCTCTATAAACAATTCTTGACACTAGATAGGCCTTTACTAAGAGCAACCAGAGACAGAAATTAG
    TATCGACAGTGGAGTTTTAAAATCACACTTAAAAAAATATTATTGGCTGGGCACAGTGGCTCACG
    CCTGTAATCCCAGCACTTTGGGAGGCTGAGGCAGGCAGATCATGAGGTCAGGAGATCAAGGCTAT
    CCTGGCCAACATGGTGAAACCCCGTCTCTATTAAAAATACAAAAATTAGCCGGGCGTGGCGGTGA
    GTGCCTGTAGCCCCAGCTACTTGGGAGGGTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCGGAG
    GCTACAGTAAGCCGAGTTCGTGCCACTGCACTCCAGCCTCGGCGACGGAGCGAGACTCCCTCTCA
    AAAAAAGAAAAAAAAAATGTAGATTATATTCTGTGAATATTACATCACAGAATAAAACTCTGGAT
    ATAATACATGGGAGAGTTAATATCCAGAAAGACATTGTGCATTTTTGGTCTAAGTTTCATGAGAC
    AAAATATTATTTTCTTTTCTGAGACTCAATTCTTTCCCAAAGGGATCAGTTCTCTTAAGTGGACC
    TTTTTACAGCCTTTCAGCTGGCTCAAAAGATGAGTTTTGGCGAACAAGATTATCGATACTCACTG
    AGCAAGTGGTAGTTAGAATCCCTTTCATATTTGAAGGTCAAACGGCCATAGCTGACATGATTTAG
    ATTCTTCAGCCACTCAAAGTAAGATACTGTCACTCCTCCAGCATTCAAGTAGAGATCCTATGCAC
    AAAAATAAGACAAAGAAATTAGAAGATGATGGTTTTCGTAAAAGCTGAAAATGAACCTAAGACCT
    TAATTTCAATACCAAGGTAGACTGGACTTCAAATATCGCAAATATATTTTAGCCAGTATCAGGAA
    TTTCACACTTAATTAACACTCCTTCCCATCCCACCCAATTCCATCTAAGGCTTTTTCTATTTAGG
    AAAAAAAAAAATCATTTTTTGGCTTATTAATCAAGGAAAGTTAATAATCTTTGGTTAGAGCCTCT
    TCTCTACCAGAAGTTAGTTCTCAGACTAAATGGCTTGCCCACCAACCAGTTGGACTGGACTGTCC
    ACAGGGCCTCTCAGAAGACAGGATTCTTTCTCCTATTACCTAAGGGTAGCCCATTTCAGTTACAT
    TAAATGTCTTAAGTGCTTTCAGCAAAGGGGGTTCTTTAAAATATATTCCAAGCCCACATTAATTT
    CTAGTAACTTTTTGGTGTAGACTCATTTTTACTTGCTAAAAAACCTGAGCACGTGTTCCTCATAT
    TAGTTTTCTGAGTAAAGCTGGAAAGGGCACTTGAAATGCATAAGGTTAGGAATCATATAGAAAAT
    CTTAAGAGCTTTAGTTAGAATAGTGTTTCTAACACAGTACACATTTATATAACCAGACTCTTAGA
    AGGCTAAAGACATTCAGTAAAGAGCCTGAAATTGGAATAAATGTTTCGATCAAAGTGAAAATTAA
    CAGGCTTAGAATTAACCATGCTTCTACTATATTTTCTCAAAAGTGAAAAAGATGAATTCACTAGA
    GCTTGGAGACTAATAATTCCTCTCTTCCTCCAAATTCCTTGCAAAAGACTATTATGATTCTAAGT
    ACATATAAAGCCTAATAAATATAGATGACTTACTGGAATAACCATAATGTTTCTCTCCAGGAAGA
    TCTTGTCAGCTTCTGGAGTTGTTGGCCCATTGGCACCTTCAGCAATGATCTGCAAGAGAGTCAGG
    AACATAGAGAAATGCGAACACCACCGTCAAATCCCCTCCACTGAGGGCAAGAGATGTGCATATAT
    GAACAAGGGGCTGTGGGGAGAAGCACAGTTTCAGTTAAAGTTAAATAGAGGTTATTTTTCTCTGC
    CAAGTGTATAAAACTACCTTTCACTTTTCTATTTATCTAGGTTTTTTTTTTGTTTGCTTGTTTTT
    TTTTTTTACAGGAGTGTCAGGCAGATGCGTTTGTTTTGGTAATGGTTGCACAACTCTGTGCCGGT
    AGCTAAAAGCCATTAAATTATGTACCTTAAATGGGGGAACTGTATGGTATGTGCAGTATGTGCCA
    ATAAAGCTGCTAAAAAGAAAGAGAAAACTCAATCAGACTCTTCTATGACCCCCCTAACGTCATTC
    ACATTGATAATGTTGGTTCTGGTTTCTATAATGTTGTCACCTTGGCTTTGACTCTGGGTGCGTTG
    GATTTGGTCAACTGCTTCTCACTGGCAGCTGGGATCAGTATGTCACAGTCGGCCTCCAAGATGCT
    TCCTTCATAGGGCTTTGCCTTGGGGAAGCCCAGAATGGACCCATGTTGCTGCCATTGATTGAAAA
    TCACAATTAATAGCTGCACCAGAGTTTTAAATATTTATATTTAGTGTCTATGCTATAAAAATGTA
    TTAATACCAATTTGAAGTCTTCCAGTTCCTTTGGGTCAATACCATCTGGATTCCATATACTCCCA
    TCAGACTCACCAACAGCAATACATTTAGCACCAAAACGATGTAAATATCTCATAGAGTGCAGGCC
    CACATTACCAAATCCCTGTGAAGAACAATTACCCATAACACAAAAATTAAAGTCCTGGTATAGAC
    AGCAAGAGTCATATTTTGACCAATGTAAATCCATACTCTGTTTATTTAGAAGCAATTCTCAAAAT
    TCTTTTGCCAAAAGAAACAATGTACTAACTGGTTTCTCTTCAACAATAAAATTCTCTGTTTAAGA
    ATGTGATGAGCGGGCATGGTGGCTCACGCTTGTAATTCCAGCACTTTGGGAGGCTGAGGCAGGTG
    GATCACTTGAGGTCAGGAGTTCGAGATCAGCCATGGCCAACATGGTGAAACCCCGTCTCTACTAA
    AAATACAAAAATTAGGCATGGGGTCCGTGCCTGTAATCCCAGCTACTTGGGAGAATGAGGTAGGA
    GAATCACTTGAACCTGGGAGGTGGAGGTTGCAGTGAGCCGAGACTGCACTCCAGCCTGGGCAATA
    GGGTGAGACTCCATCTCAATCAATCAATAAATGGCAGTGGTGTTAAGTACACCACCACTTTTTGC
    TTTTTTTTTTTTTTTTTTTTTGATGGAGTTTTGCTCTTGTTGCCCAGGCTGGAGTGCAATAGCGG
    AATCTCGGCTCACCACAATCTCTGCCTCCCAGGTTCAAGCAATTCTCCTGCCTCAGCCTCCTAAG
    TAGCTGGGATTACAGGCATGCGCCACCTCGCCTGGCTAATTTTGTATTTTTAGTAGAGACAGGGT
    TTCTCCATGTTGGTCAGGCTGGTCTCGAACTCCTGACCTCAGGTGATCCGCCACTTCAGTCTGCT
    AAAGTGCTGGGATTACAGCTGTGAGCCACAGTGCCCGGACTTTCTTTTTTTTTTTTTTTTGCCAA
    TTTGTATTTTATTTTTGCTAATTTAAAAAATAGTTAATAGAATATCAGAAATACTGAACATTATC
    ATTTCCATAAATGCAAGAGTGTATACATTTTCCACACACTGAGTACTAGCTAGATTTTCTATGAT
    AAACTCTGACCACTTCTTCAGGCAATTCATGTACTTACTTCAGCATTATCATTAATGATGAAGGT
    TCTAGAACCATCATGAACAAGGGTCCCATCTTCACACAAGTTACTTAACTGCTGGGAGGCTCTAT
    TTCATCTTATGTAAACTACAGATAATACCTACTCACCTCAAGGGTATATCAAGAGTTTATGTAAG
    CTAAGTTTGTAGAAAGTAGTTAGCACAGTGCCAGGAAGGGTCCAAGAAGAAATGGTACTTACTAT
    GAAATATTTGTACGTATATATGTATGCATGTTAATGAGCTCTTATTAGCTGTGTTCATTAAAGGT
    TTTCTCTATCCTGTGATTTGCTTTTAGATTTTGGAATACATTTCATTGTGCACATTCCATTTGTA
    TTATTAATATAACAATATTTATTACTATTATTATTATCATCATCAATTCAATCACATCTACTGTA
    TCCCTGATGATGACCATGATTCCTTTAATAATCATAAAACTCTCTTCCCTTCATCACGGGGTAAA
    TAACCTATCACAATGCTGTAAGTCTCCATCAGCACCCCAGGCTGCCCCTGCTGACTTACCAGATC
    TGCTACTTCAGCCAGATCAATCATCATGTTAATGTCACACCCACATTCAATAATGGTGAGTCGGA
    GCTTTTTACCTGTAAGTGAAAAAGATAAAAATTTTACTTTAAAAAGACCCTGAAA


    运行:

    # time ./1 datafile > report

    real    1m14.81s
    user    0m50.65s
    sys     0m0.13s

    # wc -l report

        3355 report


    下面是 report 文件中的部分内容:

    # cat report

    ------------------- Line# 1 --------------------

    TAATCAACTA Length=10 Position=10 Rev_Position=12
    TTTATTACTA Length=10 Position=51;9703
    GGAGGCTGAG Length=10 Position=330;470;1129;1265;1633;2655;2793;3102;5936;8564
    AAAAAAAAAA Length=10 Position=1871;2906;2907;2908;2909;2910;3198;3199;3200;3201;3202;6183;6761;6762 Rev_Position=2906;2907;2908;2909;2910;3198;3199;3200;3201;3202;6183;6761;6762
    GTAGTGAGCCGA Length=12 Position=1811;2838
    AATAAATAAATA Length=12 Position=1889;1893 Rev_Position=1890;1894
    CCACTGCACTCCA Length=13 Position=534;1830;2857;6133
    GTTTTTCTTTTTT Length=13 Position=675 Rev_Position=4304
    GTAATCCCAGCTAC Length=14 Position=1248;2776;8678
    CCTGTAATCCTAGCA Length=15 Position=310;1109
    GGATCACTTGAGGTCA Length=16 Position=2671;8580
    CGGGCATGGTGGCTCAC Length=17 Position=2617;8526
    CACTTGAGGTCAGGAGTT Length=18 Position=2675;8584
    CCAGCCTGGCCAACATGGT Length=19 Position=1172;2698;3011
    CTTTGGGAGGCTGAGGCAGG Length=20 Position=5931;8559
    TTTTTTTTTTTTTTTTTTTT Length=20 Position=8841;8842 Rev_Position=8842

    Summary of Line# 1
    ------------------
    String length :: 10  Matched Strings :: 541
    String length :: 11  Matched Strings :: 438
    String length :: 12  Matched Strings :: 372
    String length :: 13  Matched Strings :: 329
    String length :: 14  Matched Strings :: 293
    String length :: 15  Matched Strings :: 258
    String length :: 16  Matched Strings :: 226
    String length :: 17  Matched Strings :: 200
    String length :: 18  Matched Strings :: 171
    String length :: 19  Matched Strings :: 146
    String length :: 20  Matched Strings :: 124

    String length :: 10  Matched Reverse Strings :: 142
    String length :: 11  Matched Reverse Strings :: 69
    String length :: 12  Matched Reverse Strings :: 40
    String length :: 13  Matched Reverse Strings :: 25
    String length :: 14  Matched Reverse Strings :: 15
    String length :: 15  Matched Reverse Strings :: 7
    String length :: 16  Matched Reverse Strings :: 5
    String length :: 17  Matched Reverse Strings :: 2
    String length :: 18  Matched Reverse Strings :: 1
    String length :: 19  Matched Reverse Strings :: 1
    String length :: 20  Matched Reverse Strings :: 1

     lightspeed 回复于:2004-12-16 23:39:29

    这里是 final version.

    1. 增加了二分查找最长匹配串并自动设置 STR_MAX 的代码。
    2. 改进了 reverse string 算法, 大大提高了运行效率。

    # time ./1 datafile > report1

    real    2m57.43s
    user    2m29.33s
    sys     0m0.08s

    $ time ./1 -v r=1 datafile > report2

    real    0m53.28s
    user    0m46.03s
    sys     0m0.03s    

    #!/bin/awk -f
    #
    # A script can be used to check any repeat pieces of nucleotide sequences.
    #
    # Design: lighspeed
    # Date: Dec. 16, 2004
    #
    # Repeat Match Usage::  $0 datafile
    # Reverted Repeat Match Usage::  $0 -v r=1 datafile
    #

    # function is_overlap : check if a string (position=p, length=l) is overlap with
    # matched strings which are stored in array record (position=i, length=record)

    function is_overlap(p, l) {
      e = p + l - 1
      for (i in record) {
       a = i + record - 1
       if (( i >= p && i <= e ) || ( a >= p && a <= e ) || ( p >= i && p <= a ) || ( e >= i && e <= a ))
          return 1
      }
      return 0
    }

    # flag=0 find the longest matched string and set STR_MAX
    # flag=1 find all other matched strings

    function find_string(STR_MIN, STR_MAX, flag) {
     
      for (i in record)
        delete record

      if ( flag == 1 ) {
        if ( r == 1 )
          print "---------------Reverted Repeat Match Line# "NR" -----------------\n"
        else
          print "------------------Repeat Match Line# "NR" --------------------\n"
      }

      for ( Str_Len=STR_MAX; Str_Len >= STR_MIN; Str_Len -- ) {

        for ( Position=1; Position <= L - 2 * Str_Len + 1; Position ++ ) {
          if ( is_overlap(Position,Str_Len) == 1 )
            continue

          count=0
          pos=Position
          offset=Position + Str_Len - 1
          left=substr($0,Position,Str_Len)

          if (index(left,"A")==0 || index(left,"C")==0 || index(left,"G")==0 || index(left,"T")==0 )
            continue

          right=substr($0, Position + Str_Len)

    # Reverse string left. The start position in reverse string is L -Position - Str_Len + 2

          if ( r == 1 ) {
            old_left=left
            left=substr(REVERSE_STRING, L - Position - Str_Len + 2, Str_Len)
          }
          while ( Str_Len <= length(right) ) {
            i=index(right,left)
            if ( i > 0 ) {
              if ( flag == 0 )
                return 1
              j=offset + i
              if ( is_overlap(j,Str_Len) == 0 ) {
         count ++
                record[Position]=Str_Len
                record[j]=Str_Len
         pos=pos","j
              }
              right=substr(right, i + Str_Len)
              offset+=(i + Str_Len - 1)
            }
            else
              break
          }


          if (count > 0 && flag == 1) {
            match_number[Str_Len] ++
            if (r == 1)
              print  "Reverted Repeat: " old_left",", "Size: "Str_Len",", "Start Positions: "pos
            else
              print  "Repeat: " left",", "Size: "Str_Len",", "Start Positions: "pos
          }
         
        }
      }
      if ( flag == 0 )
        return 0
    }

    {

      L=length($0)
      if ( r == 1 ) {
        REVERSE_STRING=""
    # split($0, aaa, "") :  Use null string "" to split every characters into array aaa
        split($0, aaa, "")
        for (i=L; i >= 1; i -- ) {
          REVERSE_STRING=REVERSE_STRING""aaa
          delete aaa
        }
      }

      STR_MIN=10
      STR_MAX=0

    # Find max matched string and set STR_MAX by binary search algorithm
      low=STR_MIN - 1
      high=int(L / 2) + 2
      while ( low < high ) {
        if ( (high - low) == 1 ) {
          if ( low >= STR_MIN )
            STR_MAX=low
          break
        }
        mid=int((high + low) / 2)
        status=find_string(mid, mid, 0)
        if (status)
          low=mid
        else
          high=mid
      }
      if ( STR_MAX >= STR_MIN )
        find_string(STR_MIN, STR_MAX, 1)
    }


    3. data2 (100000 字符) 的结果

    当长度增加到 100000 时, 速度变的很慢, 我只测了正序, 而且没有完全
    完成, 但可以估计时间。

    由于存在的重复匹配串的最大长度很小, 我采用二分查找首先找到重复匹配串的最大长度
    然后再循环至最小长度。 下面结果前面的

    9 50002 0
    9 25005 0
    .

    就是显示查找的过程。 找到重复匹配串的最大长度为 43 用时 67 min.
    然后, 从 43 循环至 10. 每个长度用时约 5 min

    总时间大约为 67 + 34 * 5 = 237 min 也就是大约 4 小时。


    9 50002 0
    9 25005 0
    9 12507 0
    9 6258 0
    9 3133 0
    9 1571 0
    9 790 0
    9 399 0
    9 204 0
    9 106 0
    9 57 1
    33 57 0
    33 45 1
    39 45 1
    42 45 1
    43 45 0


    Repeat: ATTGGATCATTGATCTAATCCAACCACATAACTATAATTACAG, Size: 43, Start Positions: 25161,25204
    Repeat: TCGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCG, Size: 41, Start Positions: 26357,75092
    Repeat: AGTCTTGCTCTGTCGCCCAGGCTGGAGTGCAGTGGCGCG, Size: 39, Start Positions: 31148,69805
    .
    .

    产生随机字符串
    $ cat rand.pl
    #!/usr/bin/perl -w
    @array=("A","T","C","G");
    for ($a=0;$a<1000;$a++) {
            $rand=$array[int rand scalar @array];
            $out.=$rand;
    }
    print "$out\n";

    $ perl rand.pl>datafile
    $ cat datafile
    ATACTCGTTGACTCTACTCATAACAGCATTAAAGACCGTAGGGTAGGCT...
    riverfor 回复于:2004-12-18 03:18:35


    =======================repeat.py===========================
    # !/usr/bin/pyton
    # Just enjoy the programming, share your excellent code
    # report bugs to email/ msn: riverfor@gmail.com
    #
    import string
    import commands
    import sys
    ## data for deamon
    data = "1234567890112345678901234567890 1234567890989901234567890"
    ## call this functions to get the real data which was saved inthe filePath
    def getSrcDataFromFile(filePath):
        data = commands.getoutput("cat %s" % filePath)
    ##
    result = []
    ## get all substrs used to check whether be fit for
    def getAllSubStr(splitCode = " ", minDataLen = 10, maxDataLen = 10000):
        dataLen = len(data) + 1
        for x in range(dataLen):
            for j in range(x, dataLen):
                seg, ilen = data[x:j], j - x
                # check whether to be fit for these conditions
                if seg.count(splitCode) == 0 :
                    if ilen >= minDataLen and ilen <= maxDataLen:
                        if result.count(seg) < 1:
                            result.append(seg)
                       ## process
    def process(filePath = ""):
        if filePath != "":
            getSrcDataFromFile(filePath)
        getAllSubStr()
        print "sub data\t repeat \t indexs"
        for item in result:
            counts, itemlen = data.count(item), len(item)
            if counts > 1:
                indexs, start, seg = [], 0, data
                for x in range(counts):
                    if seg != "":
                        start += seg.index(item)
                    indexs.append(start)
                    start += itemlen
                    seg = seg[start:]
                print item, "\t" ,counts, "\t" ,indexs
                      
    if __name__ == "__main__":
        print "source data: %s " % data
        print "========result:========"
        if len(sys.argv) < 2:
            process()
        else:
            proccess(sys.argv[1])
     
    ================================================
    This script only print the sub data repeats twice or more, which was with the condition of splitCode = " ", minDataLen = 10, maxDataLen = 10000.
    Usage: python repeay.py [data file] [| awk / sed ....]

    第一次发贴,嘻嘻,没时间写注释啦,但python本来就是如此易读(不要笑话我程序中的英文注解的语法错误哦,因为很多系统的中文支持不好,习惯了).很纳闷为什么cu连php版面都有,却连python的论坛都没有呢,我用过php, java, c, python. 我觉得python绝对是值得我们关注的一门语言,它可以用做shell(远比sed awk易读), 网站脚本(mod_py), application语言(网络的twisted框架, tk的GUI, 各种数据库的支持)......

    用awk写了一下第一题,
    用了最笨得办法,效率超低。处理这个datafile用了10s,而且只适用处理短一些的串。。。

    BEGIN {
            _MIN_LEN=10;
    }
    function find_max_str()
    {
            for(i=length($0)/2;i>=_MIN_LEN;i--) {
                    for(j=0;j<length($0)-i;j++) {
                            k=substr($0,j,i);

                            if((index(k," ")>0)||!((index(k,"A")>0 && index(k,"C")>0 && index(k,"G")>0 && index(k,"T")>0))) {
                                    continue;
                            }

                            if((a[k]++)==2) {
                                    printf("%d:%d:%d:%d:%s\n",NR,b[k],j,length(k),k);
                                    return 0;
                            }
                            else {
                                    b[k]=j;
                            }
                    }
                    for(idx1 in a) {
                            delete a[idx1];
                    }
                    for(idx2 in b) {
                            delete b[idx2];
                    }
            }
            return 1;
    }
    {
            find_max_str();
    }

     zhl1979 回复于:2007-03-27 19:16:25

    awk '{ for (l=10;l<50;l++){for(i=1;i<(10001-l);i++){a=substr($0,i,l) ;hash[a]++  ;   t=hashb[a]; t=i" "t; hashb[a]=t }}}END{for(a  in hash ){ if (hash[a]>1){ print a ,hash[a], hashb[a] }}}

  • 相关阅读:
    python调试代码好的方法
    Java保留两位小数的几种写法总结
    SPRING BOOT 项目中使用<SCOPE>PROVIDED</SCOPE>打包成WAR到TOMCAT运行出现的问题总结
    Spring Boot整合Thrift RPC
    Thrift语法参考
    Thrift中enum的一些探究
    Thrift入门及Java实例演示
    xcrun: error: unable to find utility "xctest", not a developer tool or in PATH
    Composer: Command Not Found
    Mac安装thrift因bison报错的解决办法
  • 原文地址:https://www.cnblogs.com/cutepig/p/894206.html
Copyright © 2011-2022 走看看