最近在学习snakemake 用于生信流程管理,现在用一个snakemake 来完成小任务:将在某一文件夹下的多个bam文件截取一部分,然后建立索引,在提取出fastq序列,最后比对回基因组。
需要两个文件,一个配置文件config.yaml和snakemake文件。
config.yaml 文件内容用空格分隔,主要是要用到的软件,也可以包含一些流程中要用的参数
1 TMP_DIR: /home/share/work/TMP 2 3 HG19: /home/share/BioSoft/hg19_bwa/ucsc.hg19.fasta 4 5 BWA: /home/share/BioSoft/bin/bwa
snakemake文件:我一般都以 .py 做后缀名,主流程写在这里
1 configfile: "./config.yaml" 2 3 hg19=config["HG19"] 4 tmp=config["TMP_DIR"] 5 Bwa=config["BWA"] 6 7 SAMPLES,=glob_wildcards("bam/{sample}.05.mapped.bam") 8 9 rule all: 10 input: 11 expand("out/{sample}.bam.bai",sample=SAMPLES), 12 expand("out/bwa/{sample}.sam",sample=SAMPLES) 13 14 rule split: 15 input: 16 "bam/{sample}.05.mapped.bam" 17 output: 18 "out/{sample}.bam" 19 log: 20 "log/{sample}.log" 21 shell: 22 "samtools view {input} chr7:3240000-5260000 -b -o {output} 2> {log}" 23 24 rule index: 25 input: 26 "out/{sample}.bam" 27 output: 28 "out/{sample}.bam.bai" 29 shell: 30 "samtools index {input} {output}" 31 32 rule fq: 33 input: 34 "out/{sample}.bam" 35 output: 36 r1="out/{sample}_r1.fq", 37 r2="out/{sample}_r2.fq", 38 params: 39 null="/dev/null" 40 shell: 41 "samtools fastq -N {input} -1 {output.r1} -2 {output.r2} -0 {params.null} -s {params.null}" 42 43 rule bwa: 44 input: 45 r1="out/{sample}_r1.fq", 46 r2="out/{sample}_r2.fq" 47 output: 48 "out/bwa/{sample}.sam" 49 params: 50 RG=r"'@RG ID:{sample} PL:illumina SM:{sample}'", 51 ref=hg19, 52 bwa=Bwa 53 threads: 3 54 log: 55 "log/bwa/{sample}.log" 56 shell: 57 "{params.bwa} mem -t {threads} -M -R {params.RG} {params.ref} {input.r1} {input.r2} -o {output}"
写完后伪运行下,检查各个步骤代码
1 snakemake -np -s snakemake.py
检查没有问题之后投递计算节点,任务结束后还可以做个流程图:
1 snakemake --dag -s snakemake.py |dot -Tpdf > pipeline.pdf