zoukankan      html  css  js  c++  java
  • 05 Computing GC Content

    Problem

    The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.

    DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

    In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.

    Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

    Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

    Sample Dataset

    >Rosalind_6404
    CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
    TCCCACTAATAATTCTGAGG
    >Rosalind_5959
    CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
    ATATCCATTTGTCAGCAGACACGC
    >Rosalind_0808
    CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
    TGGGAACCTGCGGGCAGTAGGTGGAAT
    

    Sample Output

    Rosalind_0808
    60.919540


    方法一:
    # -*- coding: utf-8 -*-
    
    
    # to open FASTA format sequence file:
    s=open('Computing_GC_Content.txt','r').readlines()
    
    # to create two lists, one for names, one for sequences
    name_list=[]
    seq_list=[]
    
    data='' # to put the sequence from several lines together
    
    for line in s:
        line=line.strip()
        for i in line:
            if i == '>':
                name_list.append(line[1:])
                if data:
                    seq_list.append(data)         #将每一行的的核苷酸字符串连接起来
                    data=''                       # 合完后data 清零
                break
            else:
                line=line.upper()
        if all([k==k.upper() for k in line]):    #验证是不是所有的都是大写
            data=data+line
    seq_list.append(data)                         # is there a way to include the last sequence in the for loop?
    GC_list=[]
    for seq in seq_list:
        i=0
        for k in seq:
            if k=="G" or k=='C':
                i+=1
        GC_cont=float(i)/len(seq)*100.0
        GC_list.append(GC_cont)
    
    
    m=max(GC_list)
    print name_list[GC_list.index(m)]              # to find the index of max GC
    print "{:0.6f}".format(m)                    # 保留6位小数
    

      方法二:

    # -*- coding: utf-8 -*-
    
    def parse_fasta(s):
        results = {}
        strings = s.strip().split('>')
        # Python split()通过指定分隔符对字符串进行切片,如果参数num 有指定值,则仅分隔 num 个子字符串
    
        for s in strings:
            if len(s) == 0:
                continue
                # 如果字符串长度为0,就跳出循环。
    
            parts = s.split()
            label = parts[0]
            bases = ''.join(parts[1:])
    
            results[label] = bases
    
        return results
    
    
    def gc_content(s):
        n = len(s)
        m = 0
    
        for c in s:
            if c == 'G' or c == 'C':
                m += 1
    
        return 100 * (float(m) / n)
    
    
    if __name__ == "__main__":
    
        small_dataset = """
        >Rosalind_6404
        CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
        TCCCACTAATAATTCTGAGG
        >Rosalind_5959
        CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
        ATATCCATTTGTCAGCAGACACGC
        >Rosalind_0808
        CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
        TGGGAACCTGCGGGCAGTAGGTGGAAT
        """
    
        #large_dataset = open('datasets/rosalind_gc.txt').read()
    
        results = parse_fasta(small_dataset)
        results = dict([(k, gc_content(v)) for k, v in results.iteritems()])
        # 这里iteritem()和item()功能是一样的
        # 前一个results输出,名称+序列,后一个results输出,名称+百分比
    
    
        highest_k = None
        highest_v = 0
    
        for k, v in results.iteritems():
            if v > highest_v:
                highest_k = k
                highest_v = v
                # 输出GC含量高的
        print highest_k
        print '%f%%' % highest_v
    

      方法三:

    # -*- coding: utf-8 -*-
    
    ### 5. Computing GC Content ###
    from operator import itemgetter
    from collections import OrderedDict
    
    seqTest = OrderedDict()
    gcContent = OrderedDict()
    
    with open('Computing_GC_Content.txt', 'rt') as f:
        for line in f:
            line = line.rstrip()
            if line.startswith('>'):
                seqName = line[1:]
                seqTest[seqName] = ''
                continue
            seqTest[seqName] += line.upper()
    
    for ke, val in seqTest.items():
        totalLength = len(val)
        gcNumber = val.count('G') + val.count('C')
        gcContent[ke] = (float(gcNumber) / totalLength)*100
    
    sortedGCContent = sorted(gcContent.items(), key=itemgetter(1))
    largeName = sortedGCContent[-1][0]
    largeGCContent = sortedGCContent[-1][1]
    
    print ('most GC ratio gene is %s and it is %s ' % (largeName, largeGCContent))
    

      

  • 相关阅读:
    cookie的使用
    给定一个数组和一个目标值,找出和为目标值的整数,并返回他们的下标
    String详解
    阳哥讲面试题(四)生产调优
    springboot项目调优(针对特性的服务启动,定制化诉求)
    intern() 方法
    阳哥讲面试题(四)OOM,GC算法,垃圾收集器
    阳哥讲面试题(三)JVM,GC
    阳哥讲面试题(二)队列,线程池
    Semaphore用法
  • 原文地址:https://www.cnblogs.com/think-and-do/p/7258980.html
Copyright © 2011-2022 走看看