  • Processing multi-line FASTA files

    1. Create a FASTA file with the following content and name it 1.fa
    >SOX2
    ACGAGGGACGCATCGGACGACTGCAGGACTGTC
    ACGAGGGACGCATCGGACGACTGCAGGACTGTC
    ACGAGGGACGCATCGGACGACTGCAGGAC
    >POU5F1
    CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT
    CGGAAGGTAGTCGTCAGTGCAGCGAGTCC
    >NANOG
    ACGAGGGACGCATCGGACGACTGCAGGACTGTC
    ACGAGGGACGCATCGGACGACTGCAGG
    ACGAGGGACGCATCGGACGACTGCAGGACTGTC
    ACGAGGGACGCATCGGACGACTGCAGGACTGT
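
    One way to create this file without an editor is sketched below. It assumes a POSIX shell; printf writes the twelve example lines to 1.fa exactly as listed above, with no trailing whitespace, and grep then counts the header lines as a quick sanity check.

```shell
# Write the example records to 1.fa, one argument per line.
printf '%s\n' \
  '>SOX2' \
  'ACGAGGGACGCATCGGACGACTGCAGGACTGTC' \
  'ACGAGGGACGCATCGGACGACTGCAGGACTGTC' \
  'ACGAGGGACGCATCGGACGACTGCAGGAC' \
  '>POU5F1' \
  'CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT' \
  'CGGAAGGTAGTCGTCAGTGCAGCGAGTCC' \
  '>NANOG' \
  'ACGAGGGACGCATCGGACGACTGCAGGACTGTC' \
  'ACGAGGGACGCATCGGACGACTGCAGG' \
  'ACGAGGGACGCATCGGACGACTGCAGGACTGTC' \
  'ACGAGGGACGCATCGGACGACTGCAGGACTGT' \
  > 1.fa

# Count the records: every record starts with a ">" header line.
grep -c '^>' 1.fa    # prints 3
```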

    2. Append a TAB to the end of every line that starts with >, so that the name and the sequence are separated by a TAB.

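    sed 's/^\(>.*\)/\1\t/' 1.fa > 2.fa    ###  \( \) captures the matched text (here the whole header line); \1 in the replacement inserts the captured text again, and \t appends a TAB (GNU sed)
    cat -A 2.fa    (cat -A shows every character, including invisible ones: a TAB appears as ^I and the end of a line as $)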

    Result (output of cat -A):

    >SOX2^I$
    ACGAGGGACGCATCGGACGACTGCAGGACTGTC$
    ACGAGGGACGCATCGGACGACTGCAGGACTGTC$
    ACGAGGGACGCATCGGACGACTGCAGGAC$
    >POU5F1^I$
    CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT$
    CGGAAGGTAGTCGTCAGTGCAGCGAGTCC$
    >NANOG^I$
    ACGAGGGACGCATCGGACGACTGCAGGACTGTC$
    ACGAGGGACGCATCGGACGACTGCAGG$
    ACGAGGGACGCATCGGACGACTGCAGGACTGTC$
    ACGAGGGACGCATCGGACGACTGCAGGACTGT$
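
    The \t escape in the replacement part of s/// is a GNU sed extension; BSD/macOS sed inserts a literal t instead. A portable sketch, assuming only a POSIX shell and sed, builds the TAB character with printf and splices it into the sed script. The function name add_tab is made up for this illustration.

```shell
# add_tab: read FASTA from stdin and append a TAB to every header line.
add_tab() {
  tab=$(printf '\t')                 # a literal TAB character
  # \( \) captures the header line, \1 re-inserts it, then the TAB follows.
  sed "s/^\(>.*\)/\1${tab}/"
}

# Small demonstration on two lines of input.
printf '>SOX2\nACGT\n' | add_tab
```

    Usage with the files from this walkthrough would be: add_tab < 1.fa > 2.fa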
    

    3. Replace every newline with a space, using tr

    cat 2.fa | tr '\n' ' ' > 3.fa

    >SOX2     ACGAGGGACGCATCGGACGACTGCAGGACTGTC ACGAGGGACGCATCGGACGACTGCAGGACTGTC ACGAGGGACGCATCGGACGACTGCAGGAC >POU5F1     CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT CGGAAGGTAGTCGTCAGTGCAGCGAGTCC >NANOG     ACGAGGGACGCATCGGACGACTGCAGGACTGTC ACGAGGGACGCATCGGACGACTGCAGG ACGAGGGACGCATCGGACGACTGCAGGACTGTC ACGAGGGACGCATCGGACGACTGCAGGACTGT

    4. Replace the trailing space with a newline

    sed -e 's/ $/\n/' 3.fa > 4.fa

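    5. Replace every ' >' (a space followed by >) with a newline followed by >, so that each record starts on its own line

    sed -e 's/ >/\n>/g' 4.fa > 5.fa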

    >SOX2     ACGAGGGACGCATCGGACGACTGCAGGACTGTC   ACGAGGGACGCATCGGACGACTGCAGGACTGTC   ACGAGGGACGCATCGGACGACTGCAGGAC
    >POU5F1     CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT  CGGAAGGTAGTCGTCAGTGCAGCGAGTCC
    >NANOG     ACGAGGGACGCATCGGACGACTGCAGGACTGTC   ACGAGGGACGCATCGGACGACTGCAGG ACGAGGGACGCATCGGACGACTGCAGGACTGTC ACGAGGGACGCATCGGACGACTGCAGGACTGT

    6. Remove all spaces (the TAB after each name is kept)

    sed 's/ //g' 5.fa > 6.fa

    >SOX2    ACGAGGGACGCATCGGACGACTGCAGGACTGTCACGAGGGACGCATCGGACGACTGCAGGACTGTCACGAGGGACGCATCGGACGACTGCAGGAC
    >POU5F1    CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGTCGGAAGGTAGTCGTCAGTGCAGCGAGTCC
    >NANOG ACGAGGGACGCATCGGACGACTGCAGGACTGTCACGAGGGACGCATCGGACGACTGCAGGACGAGGGACGCATCGGACGACTGCAGGACTGTCACGAGGGACGCATCGGACGACTGCAGGACTGT
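
    At this point each record is a single line of the form name, TAB, sequence, which is convenient for quick checks. As an illustration only, the sketch below prints the length of every sequence with awk; the function name seq_lengths is hypothetical and not part of the original steps.

```shell
# seq_lengths FILE: for each "name<TAB>sequence" line, print "name<TAB>length".
seq_lengths() {
  awk -F '\t' '{ print $1 "\t" length($2) }' "$1"
}
```

    Usage: seq_lengths 6.fa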

    7. Convert each TAB into a newline

    sed 's/\t/\n/g' 6.fa > 7.fa

    >SOX2
    ACGAGGGACGCATCGGACGACTGCAGGACTGTCACGAGGGACGCATCGGACGACTGCAGGACTGTCACGAGGGACGCATCGGACGACTGCAGGAC
    >POU5F1
    CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGTCGGAAGGTAGTCGTCAGTGCAGCGAGTCC
    >NANOG
    ACGAGGGACGCATCGGACGACTGCAGGACTGTCACGAGGGACGCATCGGACGACTGCAGGACGAGGGACGCATCGGACGACTGCAGGACTGTCACGAGGGACGCATCGGACGACTGCAGGACTGT
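
    Steps 2 to 7 can also be chained into one pipeline without intermediate files. The sketch below assumes GNU sed (for \t and \n in the s command) and wraps the pipeline in a function named linearize_sed; a second function, linearize_awk, is an alternative approach using POSIX awk rather than the method described above. Both function names are made up for this example.

```shell
# linearize_sed FILE: steps 2-7 above as one pipeline (GNU sed assumed).
linearize_sed() {
  sed 's/^\(>.*\)/\1\t/' "$1" |   # step 2: TAB after each header
    tr '\n' ' ' |                 # step 3: newlines -> spaces
    sed -e 's/ $/\n/' \
        -e 's/ >/\n>/g' \
        -e 's/ //g' \
        -e 's/\t/\n/g'            # steps 4-7, in the same order as above
}

# linearize_awk FILE: alternative that joins sequence lines per record.
linearize_awk() {
  awk '
    /^>/ {                         # header line: flush the previous sequence
      if (seq != "") print seq
      print
      seq = ""
      next
    }
    { seq = seq $0 }               # sequence line: append to the current record
    END { if (seq != "") print seq }
  ' "$1"
}
```

    Usage: linearize_awk 1.fa > 7.fa. One difference worth noting: the sed/tr pipeline removes every space, so a header such as ">SOX2 some description" would lose its spaces, while the awk version prints header lines unchanged.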

  • Original post: https://www.cnblogs.com/lmt921108/p/7714906.html