  • Linux command line: downloading NCBI data with Entrez Direct

    I. Installing the software

    1. Download the software:

    curl ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/edirect.zip -O    (for how to download files with curl, see http://www.cnblogs.com/duhuo/p/5695256.html)

    2. Unzip the archive

    unzip edirect.zip

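    3. Add the directory to PATH and activate the change

    echo 'export PATH=/home/lmt/desktop/edirect/:$PATH' >> ~/.zshrc    (use the startup file that matches your own shell; it may be ~/.bashrc instead. Run echo $SHELL to see which shell you are using)

    source ~/.zshrc    (applies the new PATH to the current session)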
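
The choice of startup file in step 3 can be sketched as a small script. This is only an illustration under assumptions: the function names `rc_file_for_shell` and `add_edirect_to_path` are hypothetical helpers, not part of Entrez Direct, and the fallback to `~/.profile` for other shells is a guess that may not suit every system.

```shell
# Hypothetical helper: map a shell path (for example the value of $SHELL)
# to the startup file that shell usually reads.
rc_file_for_shell() {
    case "$(basename "$1")" in
        zsh)  echo "$HOME/.zshrc" ;;
        bash) echo "$HOME/.bashrc" ;;
        *)    echo "$HOME/.profile" ;;   # assumption: generic fallback
    esac
}

# Hypothetical helper: append an "export PATH=..." line for the EDirect
# directory to a startup file, unless the identical line is already present.
add_edirect_to_path() {
    edirect_dir=$1
    rc_file=$2
    line="export PATH=$edirect_dir:\$PATH"
    touch "$rc_file"
    grep -qxF "$line" "$rc_file" || echo "$line" >> "$rc_file"
}
```

With these defined, `add_edirect_to_path /home/lmt/desktop/edirect "$(rc_file_for_shell "$SHELL")"` would do the same job as the echo command above, and running it twice would not duplicate the line.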

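    II. Entrez Direct functions

    1. esearch    searches a database using the given indexed fields

    2. efilter    filters the results of a previous search

    3. efetch    downloads the matching records in the specified format

    ...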
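
These commands are meant to be chained with pipes, each one passing its result to the next. The following is a minimal sketch of that chain, assuming Entrez Direct is already on PATH. The wrapper name `fetch_filtered` and its argument order are hypothetical, and the publication-date filter (`-mindate`, `-maxdate`, `-datetype PDAT`) is just one illustrative way to use efilter.

```shell
# Hypothetical wrapper (not part of EDirect): search a database, keep only
# records published within a range of years, then download them.
# Usage: fetch_filtered DB QUERY MIN_YEAR MAX_YEAR FORMAT OUTFILE
fetch_filtered() {
    db=$1; query=$2; min_year=$3; max_year=$4; format=$5; outfile=$6
    esearch -db "$db" -query "$query" |
        efilter -mindate "$min_year" -maxdate "$max_year" -datetype PDAT |
        efetch -format "$format" > "$outfile"
}
```

For example, `fetch_filtered nucleotide 'CHN-JS-2014' 2014 2016 fasta filtered.fasta` would search the nucleotide database, keep records dated 2014 to 2016, and save them as FASTA. The year range here is only an example value.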

    III. Usage examples

    Download nucleotide or protein sequences (FASTA format)

    esearch -db nucleotide -query 'CHN-JS-2014' | efetch -format fasta > 11.fasta    # downloads the complete genome nucleotide sequence

    >KP757892.1 Porcine deltacoronavirus isolate CHN-JS-2014, complete genome
    ACATGGGGACTAAAGATAAAAATTATAGCATTAGTCTATAATTTTATCTCCCTAGCTTCGCTAGTTCTCT
    ACCGACACCAATCCAGGTGCGTCTGCCACCAAGTTGGCTACCCTTTCTAGGGGCGCTTTCGCGCTTGCTC
    ACCATTAGATTACCTGGAAACCAGCCATTCAGGTTGGAGTTTCCCCAGGCTCTTTTGTGTGGGCATTAGC

    esearch -db nucleotide -query 'CHN-JS-2014' | efetch -format gene_fasta > 22.fasta    # downloads the nucleotide sequence of each gene (S, E, M, etc.) as separate records

    >lcl|KP757892.1_gene_3 [gene=E] [locus_tag=PDCoV-CHN-JS-2014_gp3] [location=22797..23048]
    ATGGTAGTCGACGACTGGGCCGTTACCATCCCTGGACAATATATTATTGCTATACTAGTTGTCATCTGCA
    TTGGTGTGGCACTACTTTTTATTAACACTTGCTTAGCTTGTGTTAAATTATTTTACAAGTGCTACCTAGG
    GGCAGCATACCTTGTTAGGCCTATTATAGTGTACTACTCCAAGCCGAACCCCGTACCTGAGGATGAGTTT
    GTAAAAGTACACCAATTTCCTAGAAACACTCACTATGTCTGA
    >lcl|KP757892.1_gene_4 [gene=M] [locus_tag=PDCoV-CHN-JS-2014_gp4] [location=23041..23694]
    ATGTCTGACGCAGAAGAGTGGCAAATTATTGTTTTCATTGCGATCATATGGGCGCTTGGCGTCATCCTCC
    AAGGAGGCTATGCCACGCGTAATCGTGTGATCTATGTTATTAAACTTATTCTGCTTTGGCTGCTCCAACC
    CTTCACCCTAGTGGTGACCATTTGGACCGCAGTTGACAGATCATCTAAGAAGGACGCAGTTTTCATTGTG
    TCCATAATTTTTGCCGTACTGACCTTCATATCCTGGGCCAAGTACTGGTATGACTCAATTCGCTTATTAA
    TGAAAACCAGATCTGCATGGGCACTCTCACCTGAGAGTAGACTCCTTGCAGGGATTATGGATCCAATGGG
    TACATGGAGGTGCATTCCCATCGACCACATGGCTCCAATTCTCACACCAGTCGTTAAGCATGGCAAGCTC

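    esearch -db nucleotide -query 'CHN-JS-2014' | efetch -format fasta_cds_aa > 33.fasta    # downloads the protein sequence of each gene as separate records (the search runs against the nucleotide database; trying the protein database instead produced an error)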

    >lcl|KP757892.1_prot_AKC54443.1_3 [gene=E] [locus_tag=PDCoV-CHN-JS-2014_gp3] [protein=envelope protein] [protein_id=AKC54443.1] [location=22797..23048] [gbkey=CDS]
    MVVDDWAVTIPGQYIIAILVVICIGVALLFINTCLACVKLFYKCYLGAAYLVRPIIVYYSKPNPVPEDEF
    VKVHQFPRNTHYV
    >lcl|KP757892.1_prot_AKC54444.1_4 [gene=M] [locus_tag=PDCoV-CHN-JS-2014_gp4] [protein=membrane protein] [protein_id=AKC54444.1] [location=23041..23694] [gbkey=CDS]
    MSDAEEWQIIVFIAIIWALGVILQGGYATRNRVIYVIKLILLWLLQPFTLVVTIWTAVDRSSKKDAVFIV
    SIIFAVLTFISWAKYWYDSIRLLMKTRSAWALSPESRLLAGIMDPMGTWRCIPIDHMAPILTPVVKHGKL
    KLHGQELANGISVRNPPQDMVIVSPSDTFHYTFKKPVESNNDPEFAVLIYQGDRASNAGLHTITTSKAGD
    ARLYKYM
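
A quick way to check what was downloaded is to list each record's gene name and sequence length. The sketch below assumes FASTA headers that carry a `[gene=...]` tag, as in the output above. The function name `fasta_gene_lengths` is hypothetical, and records without a gene tag are reported as `unknown`.

```shell
# Hypothetical helper: print "<gene><TAB><sequence length>" for every record
# in a FASTA file. The gene name is taken from the [gene=...] tag in the header.
fasta_gene_lengths() {
    awk '
        function emit() { if (name != "") printf "%s\t%d\n", name, len }
        /^>/ {
            emit()                       # finish the previous record, if any
            name = "unknown"
            if (match($0, /\[gene=[^]]*\]/))
                name = substr($0, RSTART + 6, RLENGTH - 7)
            len = 0
            next
        }
        { gsub(/[ \t\r]/, ""); len += length($0) }   # sequence line: add its length
        END { emit() }
    ' "$1"
}
```

Running `fasta_gene_lengths 33.fasta` prints one line per protein, which makes it easy to confirm that every expected gene is present.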

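    esearch -db nucleotide -query 'CHN-JS-2014' | efetch -format fasta_cds_na > 44.fasta    # downloads the nucleotide sequence of each gene (S, E, M, etc.) as separate records; the sequences are the same as in 22.fasta, but the headers carry more annotation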

    >lcl|KP757892.1_cds_AKC54443.1_3 [gene=E] [locus_tag=PDCoV-CHN-JS-2014_gp3] [protein=envelope protein] [protein_id=AKC54443.1] [location=22797..23048] [gbkey=CDS]
    ATGGTAGTCGACGACTGGGCCGTTACCATCCCTGGACAATATATTATTGCTATACTAGTTGTCATCTGCA
    TTGGTGTGGCACTACTTTTTATTAACACTTGCTTAGCTTGTGTTAAATTATTTTACAAGTGCTACCTAGG
    GGCAGCATACCTTGTTAGGCCTATTATAGTGTACTACTCCAAGCCGAACCCCGTACCTGAGGATGAGTTT
    GTAAAAGTACACCAATTTCCTAGAAACACTCACTATGTCTGA
    >lcl|KP757892.1_cds_AKC54444.1_4 [gene=M] [locus_tag=PDCoV-CHN-JS-2014_gp4] [protein=membrane protein] [protein_id=AKC54444.1] [location=23041..23694] [gbkey=CDS]
    ATGTCTGACGCAGAAGAGTGGCAAATTATTGTTTTCATTGCGATCATATGGGCGCTTGGCGTCATCCTCC
    AAGGAGGCTATGCCACGCGTAATCGTGTGATCTATGTTATTAAACTTATTCTGCTTTGGCTGCTCCAACC
    CTTCACCCTAGTGGTGACCATTTGGACCGCAGTTGACAGATCATCTAAGAAGGACGCAGTTTTCATTGTG
    TCCATAATTTTTGCCGTACTGACCTTCATATCCTGGGCCAAGTACTGGTATGACTCAATTCGCTTATTAA
    TGAAAACCAGATCTGCATGGGCACTCTCACCTGAGAGTAGACTCCTTGCAGGGATTATGGATCCAATGGG
    TACATGGAGGTGCATTCCCATCGACCACATGGCTCCAATTCTCACACCAGTCGTTAAGCATGGCAAGCTC

    Download sequences (non-FASTA formats)

    esearch -db nucleotide -query 'CHN-JS-2014' | efetch -format gb > 55.fasta    # the output has the same layout as the record shown on the NCBI web page

    LOCUS       KP757892               25420 bp ss-RNA     linear   VRL 17-DEC-2015
    DEFINITION  Porcine deltacoronavirus isolate CHN-JS-2014, complete genome.
    ACCESSION   KP757892
    VERSION     KP757892.1
    KEYWORDS    .
    SOURCE      Porcine deltacoronavirus
      ORGANISM  Porcine deltacoronavirus
                Viruses; ssRNA viruses; ssRNA positive-strand viruses, no DNA
                stage; Nidovirales; Coronaviridae; Coronavirinae.
    REFERENCE   1  (bases 1 to 25420)
      AUTHORS   Dong,N., Fang,L., Zeng,S., Sun,Q., Chen,H. and Xiao,S.
      TITLE     Porcine Deltacoronavirus in Mainland China
      JOURNAL   Emerging Infect. Dis. 21 (12), 2254-2255 (2015)
       PUBMED   26584185
    REFERENCE   2  (bases 1 to 25420)
      AUTHORS   Dong,N., Fang,L., Zeng,S., Sun,Q. and Xiao,S.
      TITLE     Direct Submission
      JOURNAL   Submitted (06-FEB-2015) State Key Laboratory of Agricultural
                Microbiology, Huazhong Agricultural University, 1 Shizishan Street,
                Wuhan, Hubei 430070, China
    COMMENT     ##Assembly-Data-START##
                Sequencing Technology :: Sanger dideoxy sequencing
                ##Assembly-Data-END##
    FEATURES             Location/Qualifiers
    ...
         gene            22797..23048
                         /gene="E"
                         /locus_tag="PDCoV-CHN-JS-2014_gp3"
         CDS             22797..23048
                         /gene="E"
                         /locus_tag="PDCoV-CHN-JS-2014_gp3"
                         /codon_start=1
                         /product="envelope protein"
                         /protein_id="AKC54443.1"
                         /translation="MVVDDWAVTIPGQYIIAILVVICIGVALLFINTCLACVKLFYKC
                         YLGAAYLVRPIIVYYSKPNPVPEDEFVKVHQFPRNTHYV"
         gene            23041..23694
                         /gene="M"
    ...
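
The GenBank record carries the same protein IDs that appear in the headers of 33.fasta, so they can be pulled out of the saved file for cross-checking. This is a rough sketch, not a full GenBank parser. The function name `genbank_protein_ids` is hypothetical, and it assumes each `/protein_id="..."` qualifier sits on its own line, as in the standard flat-file layout.

```shell
# Hypothetical helper: print the value of every /protein_id="..." qualifier
# found in a GenBank flat file, one per line.
genbank_protein_ids() {
    sed -n 's/.*\/protein_id="\([^"]*\)".*/\1/p' "$1"
}
```

For the file saved above, `genbank_protein_ids 55.fasta` would list the IDs of the annotated proteins, including AKC54443.1 for the envelope protein shown in the excerpt.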

    Download run information for SRA data

    esearch -db sra -query SRP075747 | efetch -format runinfo > runinfo.txt

    Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
    SRR3589948,2016-09-09 16:27:05,2016-05-26 07:22:58,40008592,4080876384,40008592,102,1812,,https://sra-download.ncbi.nlm.nih.gov/traces/sra40/SRR/003505/SRR3589948,SRX1801292,,RIP-Seq,other,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 2500,SRP075747,PRJNA323422,2,323422,SRS1468122,SAMN05178619,simple,9606,Homo sapiens,GSM2177715,,,,,,,no,,,,,GEO,SRA429358,,public,D9CB6278FA440C16D04832F947BF338F,165928A89FAE018C75463F7074DADEA8
    SRR3589949,2016-09-09 16:27:05,2016-05-26 07:23:43,37825589,3858210078,37825589,102,1664,,https://sra-download.ncbi.nlm.nih.gov/traces/sra40/SRR/003505/SRR3589949,SRX1801293,,RIP-Seq,other,TRANSCRIPTOMIC,PAIRED,0,0,ILLUMINA,Illumina HiSeq 2500,SRP075747,PRJNA323422,2,323422,SRS1468123,SAMN05178620,simple,9606,Homo sapiens,GSM2177716,,,,,,,no,,,,,GEO,SRA429358,,public,4C986EE070A46559AF6F8892378A6E7C,EC2FFDCD9C997BED576391FD3B19CF9E
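
The runinfo output is a CSV file with a header row, so single columns such as the run accessions or the download paths can be extracted by name. The sketch below is a simple approach under assumptions: `runinfo_column` is a hypothetical helper, and it splits on every comma, so it would give wrong results if a field ever contained a quoted comma.

```shell
# Hypothetical helper: print the non-empty values of one named column
# from a runinfo CSV file.
# Usage: runinfo_column COLUMN_NAME FILE
runinfo_column() {
    awk -F',' -v col="$1" '
        { sub(/\r$/, "") }               # tolerate Windows line endings
        NR == 1 {                        # header row: find the column position
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            next
        }
        idx && $idx != "" { print $idx }
    ' "$2"
}
```

For example, `runinfo_column Run runinfo.txt` lists the SRR accessions of the study, and `runinfo_column download_path runinfo.txt` lists the corresponding download URLs.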
  • Original article: https://www.cnblogs.com/lmt921108/p/8087474.html