  • BLAST database description

       The BLAST Databases
                        Last updated on January 12, 2015
    
    This document describes the BLAST databases available on the NCBI FTP site under 
    the /blast/db directory. The direct URL is ftp://ftp.ncbi.nlm.nih.gov/blast/db
    
    1. Quick Start
        * Get all numbered files for a database with the same base name:
          Each of these files represents a subset (volume) of that database,
          and all of them are needed to reconstitute the database.
        * After extraction, there is no need to concatenate the resulting files:
          Call the database by its base name; for the nr database files, use "-db nr".
        * For easy download, use the update_blastdb.pl script from the blast+ package.
        * Incremental update is not available.
    
    2. General Introduction
    
    BLAST search pages under the Basic BLAST section of the NCBI BLAST home page
    (http://blast.ncbi.nlm.nih.gov/) use a standard set of BLAST databases for 
    nucleotide, protein, and translated BLAST searches.  These databases are made 
    available as compressed archives (in pre-formatted form) and can be downloaded from
    the /db directory of the BLAST ftp site (ftp://ftp.ncbi.nlm.nih.gov/blast/db/). 
    The FASTA files reside under the /FASTA subdirectory.
    
    The pre-formatted databases offer the following advantages:
        * Pre-formatting removes the need to run makeblastdb;
        * Species-level taxonomy ids are included for each database entry;
        * Databases are broken into smaller-sized volumes and are therefore easier 
          to download;
        * Sequences in FASTA format can be generated from the pre-formatted databases
          by using the blastdbcmd utility;
        * A convenient script (update_blastdb.pl) is available in the blast+ package 
          to download the pre-formatted databases.
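
    As a minimal sketch of the blastdbcmd point above, the function below
    regenerates FASTA from an installed pre-formatted database. The name
    dump_fasta and the output file name are illustrative assumptions; it
    relies on the blastdbcmd options -db, -entry all and -out from the blast+
    package.

```shell
#!/bin/sh
# dump_fasta is a hypothetical helper name, not part of the blast+ package.
# It writes every sequence of a pre-formatted database to one FASTA file.
dump_fasta() {
    db=$1     # database base name, for example nr
    out=$2    # FASTA file to create
    blastdbcmd -db "$db" -entry all -out "$out"
}
```

    A call such as "dump_fasta nr nr.fasta" would be run in the directory that
    holds the extracted nr volumes.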
    
    Pre-formatted databases must be downloaded using the update_blastdb.pl script or 
    via FTP in binary mode. Documentation for this script can be obtained by running 
    the script without any arguments; Perl installation is required.
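
    The download step can be sketched as a small wrapper around
    update_blastdb.pl. The function name fetch_blast_db and the target
    directory are assumptions made for illustration, and the sketch assumes
    the installed script supports the --decompress option (it downloads into
    the current working directory and unpacks the archives there).

```shell
#!/bin/sh
# fetch_blast_db is a hypothetical helper name.
# It downloads and unpacks all volumes of one database into a directory.
fetch_blast_db() {
    db=$1           # database base name, for example nr
    dest=${2:-.}    # directory that will hold the database files
    mkdir -p "$dest" || return 1
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$dest" && update_blastdb.pl --decompress "$db" )
}
```

    Running "update_blastdb.pl --showall" first lists the database names the
    script can fetch.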
    
    The downloaded compressed files must be inflated with gzip or another 
    decompression tool. The BLAST database files can then be extracted from the 
    resulting tar file using the tar utility on Unix/Linux, or WinZip and StuffIt 
    Expander on Windows and Macintosh platforms, respectively.  
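
    On Unix/Linux the two steps (gzip, then tar) can be combined. The sketch
    below assumes archives that were downloaded manually into the current
    directory and a tar that accepts -z and -C (GNU tar and bsdtar do);
    extract_volumes is a hypothetical helper name.

```shell
#!/bin/sh
# extract_volumes is a hypothetical helper name.
# It unpacks base.tar.gz and every numbered base.##.tar.gz volume into dest.
extract_volumes() {
    base=$1         # database base name, for example nr
    dest=${2:-.}    # directory that will hold the extracted files
    mkdir -p "$dest" || return 1
    found=0
    for archive in "$base".tar.gz "$base".*.tar.gz; do
        [ -f "$archive" ] || continue    # pattern matched nothing
        tar -xzf "$archive" -C "$dest" || return 1
        found=$((found + 1))
    done
    # Fail when no archive with this base name was present.
    [ "$found" -gt 0 ]
}
```

    The extracted files are left side by side in one directory; they are not
    concatenated.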
    
    Large databases are formatted in multiple one-gigabyte volumes, which are named 
    using the basename.##.tar.gz convention. All volumes with the same base name are 
    required. An alias file is provided to tie individual volumes together so that 
    the database can be called using the base name (without the .nal or .pal 
    extension). For example, to call the est database, simply use the "-db est" 
    option in the command line (without the quotes). 
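
    A small sketch of the naming rule: the value passed to -db is the alias
    file name with any .nal or .pal extension removed. The helper name
    db_base_name and the query and output file names in the comment are
    assumptions for illustration.

```shell
#!/bin/sh
# db_base_name is a hypothetical helper name.
# It strips a trailing .nal or .pal so the result can be passed to -db.
db_base_name() {
    name=$1
    name=${name%.nal}    # nucleotide alias file
    name=${name%.pal}    # protein alias file
    printf '%s\n' "$name"
}

# Possible use with a nucleotide search (file names are placeholders):
#   blastn -db "$(db_base_name est.nal)" -query query.fa -out hits.txt
```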
    
    Additional BLAST databases that are not provided in pre-formatted formats may 
    be available in the FASTA subdirectory. For other genomic BLAST databases, 
    please check the genomes ftp directory at:
        ftp://ftp.ncbi.nlm.nih.gov/genomes/
    
    3. Contents of the /blast/db/ directory
    
    The pre-formatted BLAST databases are archived in this directory. The names of 
    these databases and their contents are listed below.
    
    +-----------------------------+------------------------------------------------+
    File Name                     | Content Description
    +-----------------------------+------------------------------------------------+
    16SMicrobial.tar.gz           | Bacterial and Archaeal 16S rRNA sequences from 
                                    BioProjects 33175 and 33117
    FASTA/                        | Subdirectory for FASTA formatted sequences
    README                        | README for this subdirectory (this file)
    Representative_Genomes.*tar.gz| Representative bacterial/archaeal genomes database
    cdd_delta.tar.gz              | Conserved Domain Database sequences for use with 
                                    standalone deltablast
    cloud/                        | Subdirectory of databases for BLAST AMI; see
                                    http://1.usa.gov/TJAnEt
    env_nr.*tar.gz                | Protein sequences for metagenomes
    env_nt.*tar.gz                | Nucleotide sequences for metagenomes
    est.tar.gz                    | This file requires est_human.*.tar.gz, 
                                    est_mouse.*.tar.gz, and est_others.*.tar.gz files 
                                    to function. It contains the est.nal alias so that 
                                    searches against est (-db est) will include 
                                    est_human, est_mouse and est_others. 
    est_human.*.tar.gz            | Human subset of the est (expressed sequence tag) 
                                    database from the est division of GenBank, EMBL 
                                    and DDBJ.
    est_mouse.*.tar.gz            | Mouse subset of the est database
    est_others.*.tar.gz           | Non-human and non-mouse subset of the est database
    gss.*tar.gz                   | Sequences from the GSS (genome survey sequence) 
                                    division of GenBank, EMBL, and DDBJ
    htgs.*tar.gz                  | Sequences from the HTG division of GenBank, EMBL,
                                    and DDBJ
    human_genomic.*tar.gz         | Human RefSeq (NC_######) chromosome records with 
                                    gap adjusted concatenated NT_ contigs
    nr.*tar.gz                    | Non-redundant protein sequences from GenPept, 
                                    Swissprot, PIR, PRF, PDB, and NCBI RefSeq
    nt.*tar.gz                    | Partially non-redundant nucleotide sequences from 
                                    all traditional divisions of GenBank, EMBL, and DDBJ 
                                    excluding GSS, STS, PAT, EST, HTG, and WGS.
    other_genomic.*tar.gz         | RefSeq chromosome records (NC_######) for non-human
                                    organisms
    pataa.*tar.gz                 | Patent protein sequences
    patnt.*tar.gz                 | Patent nucleotide sequences. Both patent databases
                                    are directly from the USPTO, or from the EPO/JPO 
                                    via EMBL/DDBJ
    pdbaa.*tar.gz                 | Sequences for the protein structure from the 
                                    Protein Data Bank
    pdbnt.*tar.gz                 | Sequences for the nucleotide structure from the 
                                    Protein Data Bank. They are NOT the protein coding
                                    sequences for the corresponding pdbaa entries.
    refseq_genomic.*tar.gz        | NCBI genomic reference sequences
    refseq_protein.*tar.gz        | NCBI protein reference sequences
    refseq_rna.*tar.gz            | NCBI Transcript reference sequences
    sts.*tar.gz                   | Sequences from the STS (sequence tagged site) 
                                    division of GenBank, EMBL, and DDBJ
    swissprot.tar.gz              | Swiss-Prot sequence database (last major update)
    taxdb.tar.gz                  | Additional taxonomy information for the databases 
                                    listed here, providing common and scientific names
    tsa_nt.*tar.gz                | Sequences from the TSA division of GenBank, EMBL,
                                    and DDBJ
    vector.tar.gz                 | Vector sequences from 2010, see Note 2 in section 4.
    wgs.*tar.gz                   | Sequences from Whole Genome Shotgun assemblies
    +-----------------------------+------------------------------------------------+
    
    4. Contents of the /blast/db/FASTA directory
    
    This directory contains FASTA formatted sequence files. The file names 
    and database contents are listed below. These files must be unpacked and 
    processed through makeblastdb before they can be used by the BLAST programs. 
    +-----------------------+-----------------------------------------------------+
    |File Name              | Content Description
    +-----------------------+-----------------------------------------------------+
    alu.a.gz                | translation of alu.n repeats
    alu.n.gz                | alu repeat elements (from 2003)
    drosoph.aa.gz           | CDS translations from drosophila.nt
    drosoph.nt.gz           | genomic sequences for drosophila (from 2003)
    env_nr.gz*              | Protein sequences for metagenomes, taxid 408169
    env_nt.gz*              | Nucleotide sequences for metagenomes, taxid 408169
    est_human.gz*           | human subset of the est database (see Note 1)
    est_mouse.gz*           | mouse subset of the est database
    est_others.gz*          | non-human and non-mouse subset of the est database
    gss.gz*                 | sequences from the GSS division of GenBank, EMBL, 
                              and DDBJ
    htgs.gz*                | sequences from the HTG division of GenBank, EMBL, 
                              and DDBJ
    human_genomic.gz*       | human RefSeq (NC_######) chromosome records with 
                              gap adjusted concatenated NT_ contigs
    igSeqNt.gz              | human and mouse immunoglobulin variable region 
                              nucleotide sequences
    igSeqProt.gz            | human and mouse immunoglobulin variable region 
                              protein sequences
    mito.aa.gz              | CDS translations of complete mitochondrial genomes
    mito.nt.gz              | complete mitochondrial genomes
    nr.gz*                  | non-redundant protein sequence database with entries 
                              from GenPept, Swissprot, PIR, PRF, PDB, and RefSeq
    nt.gz*                  | nucleotide sequence database, with entries from all 
                              traditional divisions of GenBank, EMBL, and DDBJ; 
                              excluding bulk divisions (gss, sts, pat, est, htg) 
                              and wgs entries. Partially non-redundant.
    other_genomic.gz*       | RefSeq chromosome records (NC_######) for organisms 
                              other than human
    pataa.gz*               | patent protein sequences
    patnt.gz*               | patent nucleotide sequences. Both patent sequence 
                              files are from the USPTO, or EPO/JPO via EMBL/DDBJ
    pdbaa.gz*               | protein sequences from pdb protein structures
    pdbnt.gz*               | nucleotide sequences from pdb nucleic acid 
                              structures. They are NOT the protein coding 
                              sequences for the corresponding pdbaa entries.
    sts.gz*                 | database for sequence tag site entries
    swissprot.gz*           | swiss-prot database (last major release)
    vector.gz               | vector sequences from 2010. (See Note 2)
    wgs.gz*                 | whole genome shotgun genome assemblies
    yeast.aa.gz             | protein translations from yeast.nt
    yeast.nt.gz             | yeast genomes (from 2003)
    +-----------------------+-----------------------------------------------------+
    NOTE:
    (1) NCBI does not provide the complete est database in FASTA format. One 
        needs to get all three subsets (est_human, est_mouse, and est_others) 
        and concatenate them into the complete est FASTA database.
    (2) For screening for vector contamination, use the UniVec database:
        ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/
    
    * marked files have pre-formatted counterparts.
    
    5. Database updates
    
    The BLAST databases are updated regularly. There is no established incremental 
    update scheme. We recommend downloading the complete databases regularly to 
    keep their content current.
    
    6. Non-redundant defline syntax
    
    The non-redundant databases are nr, nt (partially) and pataa. In them, 
    identical sequences are merged into one entry. To be merged, two sequences 
    must have identical lengths and every residue at every position must be the 
    same. The FASTA deflines for the different entries that belong to one record 
    are separated by control-A characters, which are invisible to most programs. 
    In the example below, the entries gi|3023276, gi|1469284 and gi|1477453 have 
    the same sequence in every respect (each control-A is shown as ^A):
    
    >gi|3023276|sp|Q57293|AFUC_ACTPL Ferric transport ATP-binding protein afuC
    ^Agi|1469284|gb|AAB05030.1| afuC gene product ^Agi|1477453|gb|AAB17216.1|
    afuC [Actinobacillus pleuropneumoniae]
    MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVT
    KSSIQNRDICIVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQ
    QQRVALARALVLKPKVLILDEPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMN
    KGTIMQKARQKIFIYDRILYSLRNFMGESTICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPE
    AIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLINANPDQFDPDATKAFIHFTEQGIFLLNKE
    
    The syntax of sequence header lines used by the NCBI BLAST server depends on 
    the database from which each sequence was obtained. The table at 
    http://www.ncbi.nlm.nih.gov/toolkit/doc/book/ch_demo/?report=objectonly#ch_demo.T5 
    lists the supported FASTA identifiers.
    
    For databases whose entries are not from official NCBI sequence databases, 
    such as the Trace database, the gnl| convention is used. Custom databases 
    should follow this convention, and the id for each sequence must be unique, 
    in order to take advantage of an indexed database, which enables specific 
    sequence retrieval using the blastdbcmd program included in the blast 
    executable package. Refer to the documents distributed in the standalone 
    BLAST package for more details.
    
    7. Formatting a FASTA file into a BLASTable database
    
    FASTA files need to be formatted with makeblastdb before they can be used in 
    a local blast search. For those from NCBI, the following makeblastdb commands 
    are recommended:
    
    For nucleotide fasta file:   makeblastdb -in input_db -dbtype nucl -parse_seqids
    For protein fasta file:      makeblastdb -in input_db -dbtype prot -parse_seqids
    
    8. Technical Support
    
    Questions and comments on this document and NCBI BLAST related questions 
    should be sent to the blast-help group at:
        blast-help@ncbi.nlm.nih.gov
    
    For information about other NCBI resources/services, please send email to 
    NCBI User Service at:
        info@ncbi.nlm.nih.gov
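
    Returning to the defline syntax in section 6: because merged deflines are
    joined by control-A (octal 001) characters, they can be made visible by
    turning each control-A into a newline. This is only a sketch; the name
    split_defline is a hypothetical helper, and it assumes the header text is
    supplied on standard input.

```shell
#!/bin/sh
# split_defline is a hypothetical helper name.
# It reads FASTA text on stdin and puts each merged defline on its own line
# by replacing every control-A (octal 001) with a newline.
split_defline() {
    tr '\001' '\n'
}
```

    Only the first defline of a record keeps its leading ">"; the merged
    entries that follow it are printed without one.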
  • Original source: https://www.cnblogs.com/xiaofeiIDO/p/6429497.html