zoukankan      html  css  js  c++  java
  • 10 Consensus and Profile

    Problem

    matrix is a rectangular table of values divided into rows and columns. An m×nm×n matrix has mm rows and nn columns. Given a matrix AA, we write Ai,jAi,j to indicate the value found at the intersection of row ii and column jj.

    Say that we have a collection of DNA strings, all having the same length nn. Their profile matrix is a 4×n4×n matrix PP in which P1,jP1,j represents the number of times that 'A' occurs in the jjth position of one of the strings, P2,jP2,j represents the number of times that C occurs in the jjth position, and so on (see below).

    consensus string cc is a string of length nn formed from our collection by taking the most common symbol at each position; the jjth symbol of cc therefore corresponds to the symbol having the maximum value in the jj-th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.

      A T C C A G C T
      G G G C A A C T
      A T G G A T C T
    DNA Strings A A G C A A C C
      T T G G A A C T
      A T G C C A T T
      A T G G C A C T

      A   5 1 0 0 5 5 0 0
    Profile C   0 0 1 4 2 0 6 1
      G   1 1 6 3 0 1 0 0
      T   1 5 0 0 0 1 1 6

    Consensus A T G C A A C T

    Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

    Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

    Sample Dataset

    >Rosalind_1
    ATCCAGCT
    >Rosalind_2
    GGGCAACT
    >Rosalind_3
    ATGGATCT
    >Rosalind_4
    AAGCAACC
    >Rosalind_5
    TTGGAACT
    >Rosalind_6
    ATGCCATT
    >Rosalind_7
    ATGGCACT
    

    Sample Output

    ATGCAACT
    A: 5 1 0 0 5 5 0 0
    C: 0 0 1 4 2 0 6 1
    G: 1 1 6 3 0 1 0 0
    T: 1 5 0 0 0 1 1 6

    方法一:
    #-*-coding:UTF-8-*-
    ### 10. Consensus and Profile ###
    
    from collections import Counter
    from collections import OrderedDict
    
    fh = open('consesus_and_profile_output2.txt', 'wt')
    
    rosalind = OrderedDict()
    seqLength = 0
    with open('Sample Dataset.txt') as f:
        for line in f:
            line = line.rstrip()
            if line.startswith('>'):
                rosalindName = line
                rosalind[rosalindName] = ''
                continue
            rosalind[rosalindName] += line
        seqLength = len(rosalind[rosalindName])   #len(ATCCAGCT)
    
    A, C, G, T = [], [], [], []
    consensusSeq = ''
    for i in range(seqLength):
        seq = ''
        for k in rosalind.keys():      # rosalind.keys = Rosalind_1...Rosalind_7
            seq += rosalind[k][i]       # 把 Rosalind_1...Rosalind_7 相同顺序的碱基加起来
        A.append(seq.count('A'))
        C.append(seq.count('C'))
        G.append(seq.count('G'))
        T.append(seq.count('T'))
        counts = Counter(seq)           # 为seq计数
        consensusSeq += counts.most_common()[0][0]  #从多到少返回,是个list啊,只返回第一个
    
    fh.write(consensusSeq + '
    ')
    fh.write('
    '.join(['A:	' + '	'.join(map(str, A)), 'C:	' + '	'.join(map(str, C)),
                        'G:	' + '	'.join(map(str, G)), 'T:	' + '	'.join(map(str, T))]))
                        #.join(map(str,A))  把 A=[5, 1, 0, 0, 5, 5, 0, 0] 格式转化成 A:5 1 0 0 5 5 0 0
    fh.close()
    

      方法二:

    # coding=utf-8
    import numpy as np
    import os
    from collections import Counter
    
    fhand = open('Sample Dataset.txt')
    t = []
    for line in fhand:
        if line.startswith('>'):
            continue
        else:
            line = line.rstrip()
        line_list = list(line)   #变成一个list
        t.append(line_list)      # 再把list 放入list
    a = np.array(t)  # 创建一个二维数组
    
    
    L1, L2, L3, L4 = [], [], [], []
    comsquence = ''
    
    for i in range(a.shape[1]): # shape[0],是7 行数,shape[1]是8 项目数
        l = [x[i] for x in a]  # 调出二维数组的每一列
        L1.append(l.count('A'))
        L2.append(l.count('C'))
        L3.append(l.count('T'))
        L4.append(l.count('G'))
        comsquence = comsquence + Counter(l).most_common()[0][0]
    print comsquence
    print 'A:', L1, '
    ', 'T:', L2, '
    ', 'C:', L3, '
    ', 'G:', L4
    

      

  • 相关阅读:
    HDU1006
    白话经典算法系列之六 高速排序 高速搞定
    pca算法
    硬件这碗饭有那么好吃吗?
    《学会等待》有感
    oracle execute immediate
    HDU 2059 龟兔赛跑
    C++ Tricks(一)—— 判断字符串 string 对象的所有字符都相等
    辨异 —— 数学基本概念
    辨异 —— 数学基本概念
  • 原文地址:https://www.cnblogs.com/think-and-do/p/7273364.html
Copyright © 2011-2022 走看看