zoukankan      html  css  js  c++  java
  • 使用Juicer组装HiC

     Juicer 相比其他软件还是蛮有优势的。文章中介绍突出的两个优点是(1)流程化,使用简单。(2)高性能,平行计算,对大数据友好。

    Under /home/user/juicedir, there should be a folder references that contains the reference fasta file for your genome and the BWA index files. You can soft-link if necessary, or otherwise download the fasta files from UCSC and run bwa index on the fasta file.

    Under /home/user/juicedir, you should also create a folder restriction_sites. This should contain your restriction site file. You can create this file using the generate_site_positions.py Python script, or download already created ones from the Juicer AWS mirror.

    Type screen then launch Juicer: /home/user/juicedir/scripts/juicer.sh [options]

    Running without any options will default to the genome of hg19 and the restriction site of MboI.

    See Usage for more options; to adjust genome and/or site, use -g <genomeID> and -s <restriction_site>.

    The files will be split if necessary and Juicer will launch.

    Results are available in the aligned directory. The Hi-C maps are in inter.hic (for MAPQ > 0) and inter_30.hic (for MAPQ >= 30). The Hi-C maps can be loaded in Juicebox and explored.

    When the pipeline has completed successfully, you will see the folders aligned, debug, and splits. The debug folder contains logging information for the pipeline.

    The splits folder is a temporary working directory and can be deleted once you are sure the pipeline ran successfully.

    The aligned folder contains the results: inter.hic / inter_30.hic: The .hic files for Hi-C contacts at MAPQ > 0 and at MAPQ >= 30, respectively

    MAPQ(Mapping Qualities) 用来表示每条read的比对情况,MAPQ越高,表示比对质量越好,后续可以根据分析需要来进行过滤。 

    MAPQ 定义 从概率的角度来看,每个read的比对都是一个真实比对的估计,它是一个随机变量,也有可能存在错误。错误的概率可以用 Phred 来衡量。假设一条read的MAPQ的值为 $mQ, $P 表示reads比对错误的概率。

    $P = 10 ^ (-$mQ / 10.0);
    如果mQ的值为30,那么 mQ的值为30,那么mQ的值为30,那么P(比对错误率) 就是 0.1%


    merged_nodups.txt: The Hi-C contacts with duplicates removed. This file is also input to the assembly and diploid pipelines

    collisions.txt: Reads that map to more than two places in the genome

    inter.txt, inter_hists.m / inter_30.txt, inter_30_hists.m: The statistics and graphs files for Hi-C contacts at MAPQ > 0 and at MAPQ >= 30, respectively.

    对Hi-C互作结果MAPQ>0的结果的统计指标:inter.txt

    These are also stored within the respective .hic files in the header. The .m files can be loaded into Matlab.

    The statistics and graphs are displayed under Dataset Metrics when loaded into Juicebox

    dups.txt, opt_dups.txt: Duplicates and optical duplicates

    abnormal.sam, unmapped.sam: Abnormal chimeric and unmapped reads 异常的嵌合体和未比对上的

    merged_sort.txt: This is a combination of merged_nodups / dups / opt_dups and can be deleted once the pipeline has successfully

    completed stats_dups.txt / stats_dups_hists.m: Statistics and graphs on the duplicates  基于duplicates的统计结果

    软件运行完成之后,在样本对应的目录下,会生成以下目录

    1. splits

    2. aligned

    splits目录下存放的是中间结果,由于hi-C数据量很大,所以会将原始序列拆分成很多份,并行运算,加快速度。默认每份包含22.5M的reads, 当然这个可以通过-C参数调整,该参数指定拆分文件的行数,默认是90000000, 注意fastq文件4行代表一条序列,所以这个参数的值必须是4的倍数。拆分后序列的R1和R2端分别通过bwa比对基因组,然后合并,筛选嵌合体序列,去重复,生成预处理后的结果文件。

    aligned目录下存放的是最终结果,包含了可以导入juicebox的后缀为hic的图谱文件, inter.hicinter_30.hic, 30表示通过MAPQ > 30进行过滤之后的结果。完整流程还会进行后续处理,包括识别TAD, 染色质环等结构。其中识别染色质环的HICCUPs算法必须通过GPU加速运行才可以,所以没有安装GPU卡的普通服务器无法运行这个步骤。

    从上述过程可以看到,juicer的使用确实非常简单。由于Hi-C数据的测序量非常大,以及后续分析算法的复杂度,对服务器计算资源的要求相当高,必须高性能服务器才能满足要求,而该软件所需的GPU卡成本也非常高,一块的成本在2万元左右,这些因素一定程度制约了Hi-C的普及和发展。

    ========================3d-dna=======================

    #需要用到juicer中的两个文件

    # merged_nodups.txt文件和基因组文件

    # pwd

    /home/hujiaxiang/tools_update/3d-dna

    # 新建一个文件,作为工作目录

    mkdir duck

    cd duck

    # 链接merged_nodups.txt文件

    ln -s /home/hujiaxiang/tools_update/juicer/scripts/aligned/merged_nodups.txt   .

                                                 

    # 链接基因组文件

    ln -s /home/hujiaxiang/tools_update/juicer/references/genome.fasta  .

    # 运行3d-dna

    run-asm-pipeline.sh genome.fasta merged_nodups.txt

    需要说明的是,这个软件生成的中间临时文件很大,并且保存在/tmp下面

    如果你的hic数据很多。/tmp目录很有可能装不下你的临时文件,造成报错

    解决办法如下

    新建一个文件夹,作为临时的文件夹

    mkdir /home/hujiaxiang/tools_update/3d-dna/tmp_duck

    将此文件夹绑定到/tmp文件夹

    mount --r bind /home/hujiaxiang/tools_update/3d-dna/tmp_duck  /tmp

    mount -o remount.rw /tmp

    来源:

    https://www.jianshu.com/p/de9400025e1d

    https://github.com/aidenlab/juicer/wiki/Running-Juicer-on-a-cluster

    http://blog.sina.com.cn/s/blog_4ab0b3390102ygde.html

    https://www.jianshu.com/p/de9400025e1d

    https://blog.csdn.net/tanzuozhev/java/article/details/89115080

  • 相关阅读:
    Optimization on content service with local search in cloud of clouds
    译:滑雪租赁问题(ski rental problem)
    计算机专业顶级学术会议
    论文WAN Optimized Replication of Backup Datasets Using Stream-Informed Delta Compression
    FADE:云存储中数据安全删除
    HTML 标签补充
    python mongodb 的调试
    django MultiValueDictKeyError 错误处理
    Using mongo in django to develop web app with python
    使用list和tuple
  • 原文地址:https://www.cnblogs.com/bio-mary/p/12404531.html
Copyright © 2011-2022 走看看