HiC-Pro analysis workflow and interpretation of results

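1. HiC-Pro can be installed with conda.

2. First, use the digest_genome.py script bundled with HiC-Pro to generate the BED file of digested (restriction) fragments and the chromosome sizes table. This step needs the restriction enzyme cut site and the reference genome.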
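To make the digestion step concrete, here is a minimal sketch of an in-silico digest. It is not the code of digest_genome.py; the function names, the HindIII site `AAGCTT` with a cut after the first base, and the `frag_N` naming are assumptions chosen for illustration only.

```python
def digest(sequence, site="AAGCTT", cut_offset=1):
    """Cut a sequence at every occurrence of `site`.

    Returns half-open (start, end) fragment coordinates.
    `cut_offset` is the cut position inside the site (assumed: A^AGCTT).
    """
    sequence = sequence.upper()
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    bounds = [0] + cuts + [len(sequence)]
    # Pair consecutive boundaries and drop empty fragments
    return [(s, e) for s, e in zip(bounds, bounds[1:]) if e > s]


def fragments_to_bed(chrom, fragments):
    """Format fragments as tab-separated BED lines (hypothetical names)."""
    return [
        "%s\t%d\t%d\tfrag_%d" % (chrom, start, end, i)
        for i, (start, end) in enumerate(fragments, 1)
    ]


if __name__ == "__main__":
    frags = digest("GGGAAGCTTCCCCAAGCTTTT")
    print("\n".join(fragments_to_bed("chr1", frags)))
```

For a real genome the same idea would be applied to each sequence in the FASTA file, and the sequence lengths would give the chromosome sizes table.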

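3. Build a bowtie2 index for falcon_contig.fasta.

HiC-Pro first aligns the two mates of the paired-end reads separately with bowtie2.

Reads that fail to align are trimmed and aligned again.

The results of the two alignment rounds are then merged, separately for the R1 and the R2 reads.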
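The trimming step can be sketched as follows. This assumes that "trim" means cutting the read at the ligation junction and keeping the 5' part for the second alignment round, which is how the HiC-Pro paper describes it; the function name and the HindIII junction `AAGCTAGCTT` are illustrative assumptions, not HiC-Pro code.

```python
def trim_at_ligation_site(read, ligation_site="AAGCTAGCTT"):
    """Keep only the 5' part of a read that spans a ligation junction.

    If the junction is not found, the read is returned unchanged.
    """
    pos = read.find(ligation_site)
    if pos == -1:
        return read
    return read[:pos]


if __name__ == "__main__":
    # A chimeric read: 4 bases from one locus, the junction, then another locus
    print(trim_at_ligation_site("ACGTAAGCTAGCTTGGCC"))
```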

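The standard Hi-C data processing workflow consists mainly of read alignment, data filtering, data binning (dividing the data into small units) and data correction. The paired-end Hi-C reads are aligned to the reference sequence, and only read pairs whose two ends both map uniquely to the reference (valid pairs) can be used in downstream analysis. The valid pairs that remain after PCR duplicates are removed are the final usable data for Hi-C analysis, and the proportion of this data among all sequenced data is an important indicator of the quality of the Hi-C experiment. (Source: "Hi-C系列三:数据解析基因组三维结构", part 3 of a Hi-C series on resolving the 3D structure of the genome)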

4. Use HiC-Pro's mergeSAM.py to merge the paired-end reads.

Use HiC-Pro's mapped_2hic_fragments.py to convert the alignments into Hi-C fragment information.

Merge all valid pairs and remove PCR duplicates.
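A minimal sketch of duplicate removal, assuming each valid pair is represented as a tuple `(chrom1, pos1, strand1, chrom2, pos2, strand2)` and that two pairs are duplicates when all six fields match. The tuple layout and function names are hypothetical; HiC-Pro's own implementation works on sorted valid-pair files.

```python
def remove_duplicates(pairs):
    """Keep the first occurrence of each (chrom, pos, strand) x 2 combination."""
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


def usable_fraction(n_unique_valid_pairs, n_sequenced_pairs):
    """Share of all sequenced pairs that ends up as unique valid pairs."""
    if n_sequenced_pairs == 0:
        return 0.0
    return n_unique_valid_pairs / float(n_sequenced_pairs)


if __name__ == "__main__":
    pairs = [
        ("chr1", 100, "+", "chr1", 5000, "-"),
        ("chr1", 100, "+", "chr1", 5000, "-"),  # PCR duplicate
        ("chr1", 200, "+", "chr2", 300, "+"),
    ]
    unique = remove_duplicates(pairs)
    print(len(unique), usable_fraction(len(unique), 10))
```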

Use HiC-Pro's merge_statfiles.py to merge the multiple bowtie2 alignment statistics files.

Build the contact matrix according to BIN_SIZE.
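Binning can be pictured as mapping each end of a valid pair to a genome-wide bin number and counting pairs per bin combination. The sketch below assumes 1-based bin IDs that run continuously across chromosomes and that only one triangle of the symmetric matrix is stored; the function names and input layout are illustrative, not HiC-Pro's.

```python
from collections import Counter


def build_bin_offsets(chrom_sizes, bin_size):
    """Return {chrom: ID of its first bin} for a list of (chrom, size)."""
    offsets = {}
    next_id = 1
    for chrom, size in chrom_sizes:
        offsets[chrom] = next_id
        next_id += -(-size // bin_size)  # ceiling division
    return offsets


def bin_pairs(pairs, offsets, bin_size):
    """Count pairs (chrom1, pos1, chrom2, pos2) per (bin_i, bin_j), i <= j."""
    counts = Counter()
    for chrom1, pos1, chrom2, pos2 in pairs:
        i = offsets[chrom1] + pos1 // bin_size
        j = offsets[chrom2] + pos2 // bin_size
        counts[(min(i, j), max(i, j))] += 1
    return counts


if __name__ == "__main__":
    offsets = build_bin_offsets([("chr1", 2500000), ("chr2", 1000000)], 1000000)
    pairs = [("chr1", 100, "chr1", 1500000), ("chr1", 10, "chr2", 5)]
    for (i, j), n in sorted(bin_pairs(pairs, offsets, 1000000).items()):
        print(i, j, n)
```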

Normalize the raw matrix with ICE.
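The idea behind ICE (iterative correction) is to repeatedly divide each entry by the product of the relative coverages of its two bins until all bins have roughly the same total. The following is a toy, pure-Python version on a small dense symmetric matrix, written only to illustrate the principle; it is not the implementation HiC-Pro calls, and the function name and iteration count are assumptions.

```python
def ice_normalize(matrix, iterations=50):
    """Iteratively balance a symmetric matrix given as a list of lists."""
    n = len(matrix)
    m = [[float(v) for v in row] for row in matrix]
    for _ in range(iterations):
        sums = [sum(row) for row in m]
        nonzero = [s for s in sums if s > 0]
        if not nonzero:
            break
        mean = sum(nonzero) / len(nonzero)
        # Relative coverage of each bin; empty bins are left untouched
        bias = [s / mean if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                m[i][j] /= bias[i] * bias[j]
    return m


if __name__ == "__main__":
    balanced = ice_normalize([[1, 1], [1, 3]])
    print([sum(row) for row in balanced])  # row sums should be nearly equal
```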

5. Compute the mapping rate and plot the statistics.

A minimal introduction to installing and using HiCPlotter

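HiCPlotter download: https://github.com/kcakdemir/HiCPlotter

Dependencies: Python >= 2.7 and three Python modules (Numpy >= 1.9.0, Scipy >= 0.14.0, Matplotlib >= 1.3.1).

Installation (unpack the archive): unzip HiCPlotter-master.zip

Usage: python /HiCPlotter_PATH/HiCPlotter.py -f <matrix> -tri 1 -bed <bed> -wg <1|0> -r 1000000 -chr <chromosome> -o <prefix> -n <name>
- -f: the Hi-C interaction matrix
- -bed: the Hi-C bin interval file
- -wg: whether to draw a whole-genome Hi-C plot (1 or 0)
- -r: plot resolution (1000000 here; default 100000)
- -chr: the chromosome to plot
- -o: output file prefix
- -n: name of the plotted chromosome range

(HiCPlotter has many usage modes; only the command that visualizes HiC-Pro output is listed here.)

(Results of a single HiC-Pro run that are used for plotting with HiCPlotter):

Two of the files produced by HiC-Pro are the inputs HiCPlotter needs in the next step: the Hi-C bin interval file and the Hi-C interaction matrix file.

Hi-C bin interval file: /PATH_to_analysis/HiC_data_out/hic_results/matrix/Lib/raw/1000000/Lib_1000000_abs.bed. Its contents look like this (the first three columns are the chromosome, start position and end position; the fourth column is the bin ID, which is used in the matrix file):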
- chr1 0 1000000 1
- chr1 1000000 2000000 2
- chr1 2000000 3000000 3
- chr1 3000000 4000000 4
- chr1 4000000 5000000 5
- chr1 5000000 6000000 6
- chr1 6000000 7000000 7
- chr1 7000000 8000000 8
- chr1 8000000 9000000 9
- chr1 9000000 10000000 10
- chr1 10000000 11000000 11
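As an illustration of how the bin IDs in the fourth column can be looked up, here is a small sketch that reads lines in this four-column layout into a dictionary. The function name is hypothetical, and the parser assumes whitespace-separated fields as in the excerpt above.

```python
def parse_abs_bed(lines):
    """Map bin ID -> (chrom, start, end) from *_abs.bed style lines."""
    bins = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, start, end, bin_id = fields[0], int(fields[1]), int(fields[2]), int(fields[3])
        bins[bin_id] = (chrom, start, end)
    return bins


if __name__ == "__main__":
    example = ["chr1 0 1000000 1", "chr1 1000000 2000000 2"]
    print(parse_abs_bed(example)[2])
```

With a real file, the lines would come from `open(path)` instead of a hard-coded list.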
Hi-C interaction matrix file: /PATH_to_analysis/HiC_data_out/hic_results/matrix/Lib/iced/1000000/L2_1000000_iced.matrix

Its contents look like this (each bin number stands for one 1 Mb interval and corresponds to the bin ID in the bed file):

The matrix file is in a three-column sparse format: the first two columns are the interacting bins and the third column is the interaction frequency. Pairs of bins that do not interact (score 0) are not listed in the file.
- 1 1 678.658080
- 1 2 324.615405
- 1 3 156.816128
- 1 4 135.730975
- 1 5 93.67516
- 1 6 104.083832
- 1 7 71.933971
- 1 8 51.807506
- 1 9 54.103978
- 1 10 42.184552
- 1 11 40.610061
- 1 12 23.162058
- 1 13 20.459862
- 1 14 8.426264
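To show how the sparse three-column format relates to a full contact matrix, the sketch below expands such lines into a dense list-of-lists. It assumes 1-based bin IDs and that each bin pair is listed only once, so the value is mirrored to keep the matrix symmetric; the function name is hypothetical.

```python
def load_sparse_matrix(lines, n_bins):
    """Expand 'bin_i bin_j value' lines into a dense symmetric matrix."""
    dense = [[0.0] * n_bins for _ in range(n_bins)]
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        i, j, value = int(fields[0]), int(fields[1]), float(fields[2])
        dense[i - 1][j - 1] = value
        dense[j - 1][i - 1] = value  # mirror: assumes one triangle is stored
    return dense


if __name__ == "__main__":
    example = ["1 1 678.658080", "1 2 324.615405", "1 3 156.816128"]
    for row in load_sparse_matrix(example, 3):
        print(row)
```

Bin pairs that are absent from the file simply stay at 0.0 in the dense matrix.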
III. HiCPlotter analysis and final visualization

Command for a whole-genome Hi-C interaction map:

python /HiCPlotter_PATH/HiCPlotter.py -f Lib1_1000000_iced.matrix -tri 1 -bed Lib1_1000000_abs.bed -wg 1 -r 1000000 -chr chr7 -n WholeGenome -o WholeGenome

Command for the Hi-C interaction map of a single chromosome:

python /HiCPlotter_PATH/HiCPlotter.py -f Lib1_1000000_iced.matrix -tri 1 -bed Lib1_1000000_abs.bed -wg 0 -r 1000000 -chr chr2 -n chr2 -o chr2 # The only difference from the whole-genome command is that -wg is set to 0, meaning a single chromosome or a single sequence is plotted
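When several chromosomes need to be plotted, the two commands above can be generated from a small helper. This is only a convenience sketch: the function name is hypothetical, and it uses just the flags that appear in the commands above, with the same meanings.

```python
def build_hicplotter_command(script, matrix, bed, chrom, name, out_prefix,
                             resolution=1000000, whole_genome=False):
    """Return the HiCPlotter command as an argument list (nothing is run)."""
    return [
        "python", script,
        "-f", matrix,
        "-tri", "1",
        "-bed", bed,
        "-wg", "1" if whole_genome else "0",
        "-r", str(resolution),
        "-chr", chrom,
        "-n", name,
        "-o", out_prefix,
    ]


if __name__ == "__main__":
    for chrom in ["chr1", "chr2"]:
        cmd = build_hicplotter_command(
            "/HiCPlotter_PATH/HiCPlotter.py",
            "Lib1_1000000_iced.matrix",
            "Lib1_1000000_abs.bed",
            chrom, chrom, chrom,
        )
        print(" ".join(cmd))
```

The returned list could be passed to `subprocess.run` to execute the plot; printing it first makes it easy to check the command before running it.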

Interpreting the results:

Although the exact value depends on the reference genome, a high-quality Hi-C experiment is usually associated with a high mapping rate.

Once the reads are aligned on the genome, the fraction of singleton or multiple hits is usually expected to be low.

In the same way, a high level of dangling-end or self-circle read pairs is associated with a bad quality experiment, and reveals a problem during the digestion, fill-in or ligation steps.

In the "Reads aligned to the reference genome" step:

Low-quality alignments, singletons and multiple hits are usually removed.

In the "Reads aligned to the restriction fragments" step:

Read pairs are assigned to a restriction fragment. Invalid pairs, such as dangling-end and self-circle pairs, are good indicators of library quality; they are tracked but discarded from subsequent analysis. The fractions of duplicated reads, as well as of short-range versus long-range interactions, are also reported.
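The distinction between valid pairs, dangling ends and self-circles can be sketched from the fragment each read falls in and the read orientations. The version below is a simplification under stated assumptions: pairs on different fragments are called valid, and pairs on the same fragment are called dangling-end when the reads face each other and self-circle when they face away from each other. HiC-Pro distinguishes further categories, and the function name and labels here are hypothetical.

```python
def classify_pair(frag1, pos1, strand1, frag2, pos2, strand2):
    """Classify a read pair from fragment IDs, positions and strands."""
    if frag1 != frag2:
        return "valid"
    # Same fragment: order the two reads by genomic position
    if pos1 <= pos2:
        left, right = strand1, strand2
    else:
        left, right = strand2, strand1
    if left == "+" and right == "-":
        return "dangling_end"  # reads point towards each other
    if left == "-" and right == "+":
        return "self_circle"   # reads point away from each other
    return "other"             # same strand on the same fragment


if __name__ == "__main__":
    print(classify_pair("frag_1", 100, "+", "frag_7", 90000, "-"))
    print(classify_pair("frag_1", 100, "+", "frag_1", 400, "-"))
```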

A high level of duplication indicates poor molecular complexity and a potential PCR bias.

A high-quality experiment is usually characterized by a significant fraction (> 40 %) of long-range intra-chromosomal valid pairs.
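As a rough way to check the 40 % criterion on a list of valid pairs, one can count the pairs whose two ends lie on the same chromosome and far apart. The 20 kb cutoff for "long-range" used below is an assumption for this sketch, as are the function name and the `(chrom1, pos1, chrom2, pos2)` tuple layout.

```python
def long_range_cis_fraction(pairs, min_distance=20000):
    """Fraction of pairs that are intra-chromosomal and > min_distance apart."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    long_range = sum(
        1 for chrom1, pos1, chrom2, pos2 in pairs
        if chrom1 == chrom2 and abs(pos2 - pos1) > min_distance
    )
    return long_range / float(len(pairs))


if __name__ == "__main__":
    valid_pairs = [
        ("chr1", 1000, "chr1", 500000),   # long-range cis
        ("chr1", 1000, "chr1", 3000),     # short-range cis
        ("chr1", 1000, "chr2", 500000),   # trans
        ("chr2", 900000, "chr2", 100),    # long-range cis
    ]
    fraction = long_range_cis_fraction(valid_pairs)
    print(fraction, "OK" if fraction > 0.4 else "below 40 %")
```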

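Further reading:

HiC-Pro: an optimized and flexible pipeline for Hi-C data processing (Materials and Methods)

HiC-Pro workflow:

Four modules:

read alignment;

detection and filtering of valid interaction products;

binning;

contact map normalization.

Reads are first aligned to the reference genome. Only uniquely aligned reads are kept and assigned to restriction fragments; the interactions are then classified and invalid read pairs are discarded.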

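If phased genotype data and an N-masked reference genome are provided, the reads are aligned to the N-masked reference genome and assigned to a parental genome.

The read pairs are then assigned to the restriction fragments produced by enzyme digestion, and invalid ligation products are filtered out.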

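(The steps above run in parallel on each chunk of reads. The data from the chunks are then merged and binned to produce a single genome-wide interaction map, and the final normalization removes systematic biases from that map.)

Sources: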

Servant, Nicolas, et al. "HiC-Pro: an optimized and flexible pipeline for Hi-C data processing." Genome biology 16.1 (2015): 259.

https://www.jianshu.com/p/9e9261dc5db1

http://blog.sciencenet.cn/blog-2970729-1182259.html

http://blog.sciencenet.cn/blog-2970729-1185463.html

http://www.jintiankansha.me/t/JImtr78k5f

Original article: https://www.cnblogs.com/bio-mary/p/12034244.html