  • hg19 genome | functional regions | extracting coordinates

    How do you get the coordinates of the CDS, UTR, intergenic, intron, and other regions of hg19?

    References:

    Hg19 regions for Intergenic, Promoters, Enhancer, Exon, Intron, 5-UTR, 3-UTR

    How to get genome feature intervals from a GTF file

    The coding region of a gene, also known as the CDS (from coding sequence), is the portion of a gene's DNA or RNA that codes for protein. The region usually begins at the 5' end with a start codon and ends at the 3' end with a stop codon.

    The CDS is essentially the combination of all exons,

    with a subtle difference:

    Exons = gene - introns

    CDS = gene - introns - UTRs

    therefore also:

    CDS = Exons - UTRs

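    In other words, the UTRs are counted as part of the exons.

    So "exon" is the broader concept, which is why the paper uses the CDS regions for its statistics rather than the exons.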
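    The relation CDS = Exons - UTRs can be sketched directly on coordinates. The snippet below is a minimal illustration, assuming BED-style 0-based half-open intervals and a single coding range per transcript (a start and an end position). The function name `clip_exons_to_cds` and the example coordinates are hypothetical, made up for this sketch.

```shell
# Clip exon intervals to the coding range: whatever lies outside the
# range is UTR, so what remains is the CDS part of each exon.
# Usage: clip_exons_to_cds CDS_START CDS_END < exons.bed
clip_exons_to_cds() {
  awk -v cs="$1" -v ce="$2" 'BEGIN { FS = OFS = "\t" }
  {
    s = ($2 > cs) ? $2 : cs      # later of exon start and coding start
    e = ($3 < ce) ? $3 : ce      # earlier of exon end and coding end
    if (s < e) print $1, s, e    # drop exons that are entirely UTR
  }'
}

# Hypothetical transcript: three exons, coding range 150-420.
printf 'chr1\t100\t200\nchr1\t300\t400\nchr1\t410\t500\n' |
  clip_exons_to_cds 150 420
```

    With these made-up numbers the first exon loses its first 50 bases (5' UTR), the second exon is kept whole, and the third exon is cut at position 420 (3' UTR).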

    Ways to get the data:

    1. UCSC - the most complete and the most customizable

    How to download, directly from the UCSC browser, the annotation file most similar to the one the author required:

    Go to the UCSC browser, open Tools, then Table Browser. Pick your reference genome and select the right version under clade/genome/assembly
    Make sure the group is "Genes and Gene Predictions"
    Choose your preferred track (RefSeq/RefGene or UCSC gene/KnownGene)
    Choose the table that gives gene information (RefSeq or KnownGene)
    Select your region or the entire genome to get coordinates for
    Select BED format as your output format
    Name your output file
    Click "get output"
    Be careful: the output does not label records as exon, intron, intergenic, 5'-UTR, or 3'-UTR if you save everything as a single file. Save each subset as a separate file so that you know which feature each file holds.

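    You then need to sort the files yourself. The commands below keep only records whose chromosome name starts with chr followed by a digit, drop the *_random contigs, and sort in the chromosome order given by ../genome.txt: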

    cat raw.txt/Gene.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > Gene.bed
    cat raw.txt/UTR3.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > UTR3.bed
    cat raw.txt/UTR5.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > UTR5.bed
    cat raw.txt/down2K.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > Down2K.bed
    cat raw.txt/CDS.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > CDS.bed
    cat raw.txt/exon.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > Exon.bed
    cat raw.txt/up2K.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > Up2K.bed
    cat raw.txt/intron.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > Intron.bed
    

      

    The CDS intervals obtained this way overlap, because they are reported per transcript and a single gene can have several transcripts.
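    If non-overlapping CDS intervals are needed, the per-transcript records can be collapsed. `bedtools merge` does this on sorted input. The awk-only sketch below shows the same idea without bedtools, assuming plain three-column BED input; the function name `merge_bed` is hypothetical. Like the bedtools default, it also joins intervals that merely touch.

```shell
# Merge overlapping or touching BED intervals.
# Usage: merge_bed < CDS.bed > CDS.merged.bed
merge_bed() {
  sort -k1,1 -k2,2n | awk 'BEGIN { FS = OFS = "\t" }
  NR == 1 { chrom = $1; cur_s = $2 + 0; cur_e = $3 + 0; next }
  $1 == chrom && ($2 + 0) <= cur_e {
    if (($3 + 0) > cur_e) cur_e = $3 + 0   # extend the current block
    next
  }
  {
    print chrom, cur_s, cur_e              # block is finished, emit it
    chrom = $1; cur_s = $2 + 0; cur_e = $3 + 0
  }
  END { if (NR > 0) print chrom, cur_s, cur_e }'
}

# Two transcripts of one hypothetical gene share part of a coding exon.
printf 'chr1\t150\t300\nchr1\t100\t200\nchr1\t400\t500\n' | merge_bed
```

    Here the first two records collapse into a single interval 100-300, while 400-500 stays separate.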

    2. GENCODE - the most automated; everything can be done in code
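    As a rough sketch of the code-only route, one feature type can be pulled out of a GTF file and written as BED with awk. GTF coordinates are 1-based and inclusive while BED is 0-based and half-open, so only the start is shifted by one. The function name `gtf_feature_to_bed` and the input line are hypothetical; which feature names exist (for example CDS, exon, UTR) depends on the annotation release and should be checked in the file itself.

```shell
# Extract one feature type from a GTF stream as 6-column BED.
# Usage: gtf_feature_to_bed FEATURE < annotation.gtf > FEATURE.bed
gtf_feature_to_bed() {
  awk -v feature="$1" 'BEGIN { FS = OFS = "\t" }
  /^#/ { next }                                    # skip header comments
  $3 == feature { print $1, $4 - 1, $5, feature, 0, $7 }'
}

# Hypothetical single-record input.
printf 'chr1\tHAVANA\tCDS\t101\t200\t.\t+\t0\tgene_id "G1";\n' |
  gtf_feature_to_bed CDS
```

    On a real file this would be run as something like `zcat annotation.gtf.gz | gtf_feature_to_bed CDS > CDS.bed`, where the file name is a placeholder. The output still has one record per transcript, so overlapping intervals remain.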

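    3. Other

    RSeQC ships fairly good BED annotation files, but they are not entirely plain text, so they are hard to reuse.

    Appendix:

    Intergenic regions have to be computed yourself: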

    wget http://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
    
    grep -v "_" hg19.chrlen.bed | egrep  "^chr([0-9]{1,2})" | cut -f1,3 > hg19.chrlen.1_22.bed
    # sort -k1,1 hg19.chrlen.1_22.bed > sorted.genome
    bedtools complement -i UCSC.anno/Gene.bed -g hg19.chrlen.1_22.bed > intergenic.bed
    sort -k1,1 -k2,2n intergenic.bed > sorted.intergenic.bed
    bedtools merge -i sorted.intergenic.bed > merged.sorted.intergenic.bed
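    What the complement step computes can be illustrated without bedtools: for each chromosome, walk through the gene intervals in order and report the gaps between them, plus the stretch before the first gene and after the last one. This is only a sketch, assuming a two-column chromosome-size file and gene intervals sorted by start within each chromosome; the function name `complement_bed` and the toy sizes are hypothetical.

```shell
# Report the regions of each chromosome not covered by any input interval.
# Usage: complement_bed genome.txt < sorted_genes.bed > intergenic.bed
complement_bed() {
  awk 'BEGIN { FS = OFS = "\t" }
  NR == FNR { size[$1] = $2 + 0; next }        # first input: chromosome sizes
  !($1 in size) { next }                       # skip chromosomes not listed
  {
    if (!($1 in cursor)) cursor[$1] = 0
    if (($2 + 0) > cursor[$1]) print $1, cursor[$1], $2   # gap before gene
    if (($3 + 0) > cursor[$1]) cursor[$1] = $3 + 0        # advance past gene
  }
  END {
    for (c in size) {
      s = (c in cursor) ? cursor[c] : 0
      if (s < size[c]) print c, s, size[c]     # gap up to chromosome end
    }
  }' "$1" - | sort -k1,1 -k2,2n
}

# Toy genome with two chromosomes; chr2 carries no genes at all.
genome=$(mktemp)
printf 'chr1\t1000\nchr2\t500\n' > "$genome"
printf 'chr1\t100\t200\nchr1\t150\t300\nchr1\t900\t1000\n' |
  complement_bed "$genome"
rm -f "$genome"
```

    Because the gaps are disjoint by construction, the result needs no further merging.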
    

      

      

  • Original article: https://www.cnblogs.com/leezx/p/11889925.html