  • An explanation of the GTF genome annotation file

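      GTF stands for gene transfer format; the file annotates the genes on each chromosome. What does that mean in practice? Gene names, loci and so on are just labels that people later gave to stretches of DNA sequence; back in the cell there is only a long chromosome (a DNA sequence) in the nucleus. The main job of a GTF file is to state where each of these genes sits on the chromosome (its coordinates), along with other information about that interval.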

      I usually download GTF files from Ensembl; GENCODE works too. Here are the links:

      ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/

      http://www.gencodegenes.org/releases/current.html 

      

      For an explanation of the file, this post follows the official description from Ensembl: http://www.ensembl.org/info/website/upload/gff.html

     

    GFF/GTF File Format - Definition and supported options

    The GFF (General Feature Format) format consists of one line per feature, each containing 9 columns of data, plus optional track definition lines. The following documentation is based on the Version 2 specifications.

    The GTF (General Transfer Format) is identical to GFF version 2.

    Fields

    Fields must be tab-separated. Also, all but the final field in each feature line must contain a value; "empty" columns should be denoted with a '.'

    1. seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
    2. source - name of the program that generated this feature, or the data source (database or project name)
    3. feature - feature type name, e.g. Gene, Variation, Similarity
    4. start - Start position of the feature, with sequence numbering starting at 1.
    5. end - End position of the feature, with sequence numbering starting at 1.
    6. score - A floating point value.
    7. strand - defined as + (forward) or - (reverse).
    8. frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
    9. attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
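The nine columns listed above can be read with a few lines of Python. This is only a minimal sketch: the `Feature` class and `parse_feature_line` function are hypothetical names introduced here, not part of any Ensembl tool, and the sketch assumes that a '.' in the score, strand or frame column simply means "no value".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    seqname: str
    source: str
    feature: str
    start: int               # numbering starts at 1
    end: int                 # numbering starts at 1
    score: Optional[float]   # None when the column holds '.'
    strand: Optional[str]    # '+', '-' or None
    frame: Optional[int]     # 0, 1, 2 or None
    attribute: str           # raw text of column 9, parsed elsewhere


def parse_feature_line(line: str) -> Feature:
    """Split one tab-separated GFF/GTF feature line into its nine columns."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) == 8:
        cols.append("")      # only the final column is allowed to be empty
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqname, source, feature, start, end, score, strand, frame, attribute = cols
    if strand not in ("+", "-", "."):
        raise ValueError(f"unexpected strand: {strand!r}")
    if frame not in ("0", "1", "2", "."):
        raise ValueError(f"unexpected frame: {frame!r}")
    return Feature(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score),
        None if strand == "." else strand,
        None if frame == "." else int(frame),
        attribute.strip(),
    )


line = "X\tEnsembl\tPred.trans.\t2416676\t2418760\t450.19\t-\t2\tgenscan=GENSCAN00000019335"
print(parse_feature_line(line))
```

Splitting strictly on tabs matters here, because the attribute column itself contains spaces.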

    Note that where the attributes contain identifiers that link the features together into a larger structure, these will be used by Ensembl to display the features as joined blocks.

    Sample GTF output from Ensembl data dump:

     1 transcribed_unprocessed_pseudogene  gene        11869 14409 . + . gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; 
     1 processed_transcript                transcript  11869 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";
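The attribute column of the two sample lines uses `tag "value";` pairs, and the shared gene_id is what ties the transcript line to its gene line. A rough sketch of reading that column follows; `parse_gtf_attributes` is a hypothetical helper, and it assumes every value is double-quoted and every tag appears at most once per line (a repeated tag would overwrite the earlier value).

```python
import re

# One tag followed by a double-quoted value, as in the Ensembl sample lines.
_ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def parse_gtf_attributes(attribute: str) -> dict:
    """Turn 'gene_id "X"; gene_name "Y";' into {'gene_id': 'X', 'gene_name': 'Y'}."""
    return dict(_ATTRIBUTE.findall(attribute))


gene = parse_gtf_attributes(
    'gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "havana"; '
    'gene_biotype "transcribed_unprocessed_pseudogene";'
)
transcript = parse_gtf_attributes(
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
    'gene_name "DDX11L1"; transcript_name "DDX11L1-002";'
)

# The transcript belongs to the gene because both lines carry the same gene_id.
print(transcript["gene_id"] == gene["gene_id"])
```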

    Sample GFF output from Ensembl export:

    X    Ensembl    Repeat    2419108    2419128    42    .    .    hid=trf; hstart=1; hend=21
    X    Ensembl    Repeat    2419108    2419410    2502    -    .    hid=AluSx; hstart=1; hend=303
    X    Ensembl    Repeat    2419108    2419128    0    .    .    hid=dust; hstart=2419108; hend=2419128
    X    Ensembl    Pred.trans.    2416676    2418760    450.19    -    2    genscan=GENSCAN00000019335
    X    Ensembl    Variation    2413425    2413425    .    +    .
    X    Ensembl    Variation    2413805    2413805    .    +    .
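Unlike the GTF dump, the GFF export sample writes its attributes as `key=value` pairs, and the Variation rows leave the final column empty. The sketch below handles both cases and also computes a feature's length. Both function names are hypothetical, and the length formula assumes that start and end are inclusive, which is consistent with the hstart/hend values of the repeat rows in the sample.

```python
def parse_gff_attributes(attribute: str) -> dict:
    """Parse 'key=value; key=value' pairs; an empty column gives an empty dict."""
    pairs = {}
    for chunk in attribute.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        key, _, value = chunk.partition("=")
        pairs[key.strip()] = value.strip()
    return pairs


def feature_length(start: int, end: int) -> int:
    """Length in bases, assuming both coordinates are 1-based and inclusive."""
    return end - start + 1


attrs = parse_gff_attributes("hid=AluSx; hstart=1; hend=303")
print(attrs["hid"], feature_length(2419108, 2419410))
```

Values are kept as strings; converting hstart and hend to integers is left to the caller.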

    Track lines

    Although not part of the formal GFF specification, Ensembl uses track lines to further configure sets of features (thus maintaining compatibility with UCSC). Track lines should be placed at the beginning of the list of features they are to affect.

    The track line consists of the word 'track' followed by space-separated key=value pairs. Valid parameters used by Ensembl are:

    • name - unique name to identify this track when parsing the file
    • description - Label to be displayed under the track in Region in Detail
    • priority - integer defining the order in which to display tracks, if multiple tracks are defined.
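A track line of this shape can be split with the standard library. This is a sketch under assumptions: the example line is made up for illustration, `parse_track_line` is a hypothetical helper, and the handling of double quotes around values that contain spaces follows the UCSC convention rather than anything stated above.

```python
import shlex


def parse_track_line(line: str) -> dict:
    """Parse "track key=value ..." into a dict; priority is converted to int."""
    tokens = shlex.split(line)   # keeps quoted values with spaces in one token
    if not tokens or tokens[0] != "track":
        raise ValueError("not a track line")
    params = {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {token!r}")
        params[key] = value
    if "priority" in params:
        params["priority"] = int(params["priority"])
    return params


# Made-up track line, for illustration only.
print(parse_track_line('track name=demo description="Demo features" priority=2'))
```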

    More information

    For more information about this file format, see the documentation on the GMOD wiki.

                                    

  • Original post: https://www.cnblogs.com/Demo1589/p/6950196.html