zoukankan html css js c++ java

利用kseq.h parse fasta/fastq 文件

在分析中经常需要统计fasta/fastq文件的序列数和碱基数，但是没有找到一些专门做这件事的小工具，可能是这个功能太简单了；

之前用自己写的perl的脚本统计这些信息，当fastq文件非常大时，就变的很慢；

今天在网上搜到kseq.h可以parse fasta/fastq文件，用C写的，速度很快；

http://lh3lh3.users.sourceforge.net/parsefastq.shtml

自己试了一下, 在这个基础上添加个小功能，命名为parse.c：

#include <zlib.h>  
#include <stdio.h>
#include <string.h>  
#include "kseq.h"  
// STEP 1: declare the type of file handler and the read() function  
KSEQ_INIT(gzFile, gzread)  
  
int main(int argc, char *argv[])  
{  
    gzFile fp;  
    kseq_t *seq;
    long seqs = 0;
    long bases = 0;  
    int l;  
    if (argc == 1) {  
        fprintf(stderr, "Usage: %s <in.seq>
", argv[0]);  
        return 1;  
    }  
    fp = gzopen(argv[1], "r"); // STEP 2: open the file handler  
    seq = kseq_init(fp); // STEP 3: initialize seq  
    while ((l = kseq_read(seq)) >= 0) { // STEP 4: read sequence  
        //printf("name: %s
", seq->name.s);  
        //if (seq->comment.l) printf("comment: %s
", seq->comment.s);  
        //printf("seq: %s
", seq->seq.s);  
        //if (seq->qual.l) printf("qual: %s
", seq->qual.s);
        bases += strlen(seq->seq.s);
        seqs += 1;  
    }  
    //printf("return value: %d
", l);
    printf("reads: %ld
", seqs);
    printf("bases: %ld
", bases);         
    kseq_destroy(seq); // STEP 5: destroy seq  
    gzclose(fp); // STEP 6: close the file handler  
    return 0;  
}

然后编译

gcc -o fastx_read_length -lz parse.c

因为调用zlib,读取压缩文件，所以编译时需要添加-lz 选项；

测试了一下可以跑通；感觉kseq.h功能好强大，支持fasta/fastq,支持gzip压缩文件

查看全文

相关阅读:
NDBCLuster && MySQL Cluster 转
 Publisher FAQ | Readability: Enjoy Reading. Support Writing.
Eucalyptus Cloud Computing Software
Oracle Coherence
hbase note
使用python代码实现三叉搜索树高效率”自动输入提示”功能
 Facebook 的系统架构肇凯博客
 CentOS Linux 环境 Rails 安装记录
 linux socket网络编程进阶推荐阅读开源项目转
 HandlerSocket

原文地址：https://www.cnblogs.com/xudongliang/p/5307462.html