  • Which version of the genome and annotation files should you use? | Hands-on test

    What Ensembl genome version should I use for alignments? (e.g. toplevel.fa vs. primary_assembly.fa)

    This is a small but very practical question: which version should you actually use?

    References:

    What Ensembl genome version should I use for alignments? (e.g. toplevel.fa vs. primary_assembly.fa)

    Results differ when using different ensembl versions

    First part options:

    • dna_sm - Repeats soft-masked (converts repeat nucleotides to lowercase)
    • dna_rm - Repeats masked (converts repeats to N's)
    • dna - No masking

    Second part options:

    • .toplevel - Includes haplotype and patch sequences alongside the primary chromosomes (aligners differ in how they handle these)

    • .primary_assembly - Single reference base per position
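The two parts above combine into a single file name. A minimal sketch, assuming the usual Ensembl pattern `<species>.<assembly>.<masking>.<regions>.fa.gz`; the function name `ensembl_fasta_name` is hypothetical and used only for illustration:

```python
# Sketch: build an Ensembl-style genome FASTA file name from the two parts.
# The helper name is hypothetical; only the naming pattern comes from Ensembl.
MASKING_OPTIONS = {"dna_sm", "dna_rm", "dna"}
REGION_OPTIONS = {"toplevel", "primary_assembly"}


def ensembl_fasta_name(species, assembly, masking, regions):
    """Return e.g. 'Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz'."""
    if masking not in MASKING_OPTIONS:
        raise ValueError(f"unknown masking option: {masking!r}")
    if regions not in REGION_OPTIONS:
        raise ValueError(f"unknown regions option: {regions!r}")
    return f"{species}.{assembly}.{masking}.{regions}.fa.gz"


print(ensembl_fasta_name("Homo_sapiens", "GRCh38", "dna_sm", "primary_assembly"))
```

Validating both parts up front catches typos such as `dna-sm` before a long download or index build starts.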

    Most people recommend the soft-masked version, that is, the one in which repeats are marked in lowercase rather than replaced with N.
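To make the difference between the masking options concrete, here is a small sketch on a made-up sequence. It assumes lowercase letters mark repeats, as in the dna_sm files; the function names are illustrative only:

```python
# Sketch: relate the three masking styles on a toy soft-masked sequence.
# The sequence is invented; lowercase bases stand for annotated repeats.

def hard_mask(seq):
    """Replace soft-masked (lowercase) bases with N, as in dna_rm."""
    return "".join("N" if base.islower() else base for base in seq)


def unmask(seq):
    """Uppercase everything, which gives the unmasked (dna) form."""
    return seq.upper()


def masked_fraction(seq):
    """Fraction of bases that are soft-masked."""
    if not seq:
        return 0.0
    return sum(base.islower() for base in seq) / len(seq)


soft = "ACGTacgtacGGCA"
print(hard_mask(soft))
print(unmask(soft))
print(f"{masked_fraction(soft):.2f}")
```

The soft-masked form keeps the most information: both other forms can be derived from it, but the bases cannot be recovered from a hard-masked file.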

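    Download the hg19 genome: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/

    Reference: correspondence between the various genome versions

    Download the hg19 annotation file from GENCODE: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/

    It can also be downloaded from UCSC, but only by exporting it from the web page: http://genome.ucsc.edu/cgi-bin/hgTables

    Note: GENCODE seems to have a problem. The EBI links on https://www.gencodegenes.org/releases/26lift37.html can no longer be downloaded.

    Reference: http://www.biotrainee.com/thread-2035-1-1.html

    A newer genome is not automatically better. Look at recent CNS (Cell, Nature, Science) papers: very few of them use the latest genome version. Why? Because the annotation has not caught up, and your results may not line up with other people's.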

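    Hands-on test

    What actually happens when you use different genome versions?

    I ran a transcriptome test with hg19 and GRCh38.

    The conclusions are:

    1. Read alignment to the genome is roughly the same, with essentially no difference;

    2. With different annotation files, the gene expression results differ greatly, even though featureCounts was used in both cases.

    GRCh38 results: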

    Assigned        306852
    Unassigned_Unmapped     0
    Unassigned_MappingQuality       0
    Unassigned_Chimera      0
    Unassigned_FragmentLength       0
    Unassigned_Duplicate    0
    Unassigned_MultiMapping 36280
    Unassigned_Secondary    0
    Unassigned_Nonjunction  0
    Unassigned_NoFeatures   56950
    Unassigned_Overlapping_Length   0
    Unassigned_Ambiguity    19771
    
    //================================= Running ==================================\
    ||                                                                            ||
    || Load annotation file /home/lizhixin/databases/ensembl/release91/Homo_s ... ||
    ||    Features : 1199851                                                      ||
    ||    Meta-features : 58302                                                   ||
    ||    Chromosomes/contigs : 47                                                ||
    ||                                                                            ||
    || Process BAM file /home/lizhixin/project/scRNA-seq/reanalyze/first_five ... ||
    ||    Paired-end reads are included.                                          ||
    ||    Assign fragments (read pairs) to features...                            ||
    ||                                                                            ||
    ||    WARNING: reads from the same pair were found not adjacent to each       ||
    ||             other in the input (due to read sorting by location or         ||
    ||             reporting of multi-mapping read pairs).                        ||
    ||                                                                            ||
    ||    Read re-ordering is performed.                                          ||
    ||                                                                            ||
    ||    Total fragments : 419853                                                ||
    ||    Successfully assigned fragments : 306852 (73.1%)                        ||
    ||    Running time : 0.05 minutes                                             ||
    

      

    hg19 results:

    Assigned        586467
    Unassigned_Unmapped     0
    Unassigned_MappingQuality       0
    Unassigned_Chimera      0
    Unassigned_FragmentLength       0
    Unassigned_Duplicate    0
    Unassigned_MultiMapping 66997
    Unassigned_Secondary    0
    Unassigned_Nonjunction  0
    Unassigned_NoFeatures   133437
    Unassigned_Overlapping_Length   0
    Unassigned_Ambiguity    47278
    
    //================================= Running ==================================\
    ||                                                                            ||
    || Load annotation file /home/lizhixin/databases/cellranger_ref/refdata-c ... ||
    ||    Features : 1130716                                                      ||
    ||    Meta-features : 32738                                                   ||
    ||    Chromosomes/contigs : 45                                                ||
    ||                                                                            ||
    || Process BAM file /home/lizhixin/project/scRNA-seq/reanalyze/first_five ... ||
    ||    Paired-end reads are included.                                          ||
    ||    Assign fragments (read pairs) to features...                            ||
    ||    Total fragments : 834179                                                ||
    ||    Successfully assigned fragments : 586467 (70.3%)                        ||
    ||    Running time : 0.05 minutes                                             ||
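The two runs report different total fragment counts (419853 and 834179), so fractions are easier to compare than raw counts. A minimal sketch that recomputes them from the non-zero summary rows copied from the two results above; the helper names are illustrative:

```python
# Sketch: turn featureCounts summary rows into per-category fractions.
# Only the non-zero rows from the two summaries above are included.
GRCH38_SUMMARY = """
Assigned                  306852
Unassigned_MultiMapping   36280
Unassigned_NoFeatures     56950
Unassigned_Ambiguity      19771
"""

HG19_SUMMARY = """
Assigned                  586467
Unassigned_MultiMapping   66997
Unassigned_NoFeatures     133437
Unassigned_Ambiguity      47278
"""


def parse_summary(text):
    """Parse 'Status<whitespace>Count' rows into a dict of integers."""
    counts = {}
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip anything that is not a status/count pair
        status, value = parts
        counts[status] = int(value)
    return counts


def fractions(counts):
    """Divide each count by the total over all categories."""
    total = sum(counts.values())
    return {status: value / total for status, value in counts.items()}


for label, text in (("GRCh38", GRCH38_SUMMARY), ("hg19", HG19_SUMMARY)):
    counts = parse_summary(text)
    print(label, "total fragments:", sum(counts.values()))
    for status, frac in fractions(counts).items():
        print(f"  {status}: {frac:.1%}")
```

The category counts sum to the "Total fragments" line in each log, which is a quick sanity check that the rows were copied correctly.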
    

    Never mix up annotation files from different sources!
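One way to see how far apart two annotation files are before counting is to compare the distinct `gene_id` values on their exon lines. With featureCounts defaults (`-t exon -g gene_id`) this should correspond to the "Meta-features" number in the logs above. A rough sketch on two tiny made-up GTF fragments; the gene and transcript IDs are invented, and real files would be read line by line from disk:

```python
import re

# Sketch: collect distinct gene_id values from exon lines of a GTF.
# The two annotations below are toy data, not real genes.
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def exon_gene_ids(gtf_lines, feature_type="exon"):
    """Return the set of gene_id values found on lines of the given feature type."""
    genes = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        match = GENE_ID.search(fields[8])
        if match:
            genes.add(match.group(1))
    return genes


ANNOTATION_A = [
    '1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "GENE_A";',
    '1\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";',
    '1\ttoy\texon\t500\t900\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";',
    '1\ttoy\texon\t2000\t2400\t.\t-\t.\tgene_id "GENE_B"; transcript_id "TX_B1";',
]
ANNOTATION_B = [
    "# toy annotation with fewer genes",
    '1\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";',
]

genes_a = exon_gene_ids(ANNOTATION_A)
genes_b = exon_gene_ids(ANNOTATION_B)
print("genes in A:", len(genes_a))
print("genes in B:", len(genes_b))
print("shared:", len(genes_a & genes_b))
print("only in A:", sorted(genes_a - genes_b))
```

A large gap between the two sets, like the 58302 versus 32738 meta-features reported above, is a sign that expression tables built from the two annotations are not directly comparable.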

      

  • Original article: https://www.cnblogs.com/leezx/p/8646225.html