  • GO term annotation of protein sequences, and problems encountered

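    #===============================      Version 1  ===============================================
    Three ways to use InterProScan
    InterProScan predicts protein function by searching sequences against databases of protein domains and functional sites. It is developed by EBI on top of InterPro, a non-redundant database that integrates protein families, domains and functional sites. InterProScan combines several of the most widely used member databases and is used to assign InterPro and GO annotations to proteins of unknown function.
    Of the 3 InterPro annotation methods, only the third (a local installation) is described below: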

    3. Local InterProScan annotation
    3.1 Installing and configuring InterProScan locally

    3.1.1 Download the following 5 files from ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan:

    RELEASE/latest/iprscan_v4.8.tar.gz
    BIN/4.x/iprscan_bin4.x_[PLATFORM].tar.gz
    DATA/iprscan_DATA_[LATESTDATAVERSION].tar.gz
    DATA/iprscan_PTHR_DATA_[LATESTDATAVERSION].tar.gz
    DATA/iprscan_MATCH_DATA_[LATESTDATAVERSION].tar.gz

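    3.1.2 Extract the 5 archives into one directory, then run the Config.pl script found there to configure InterProScan.
    3.1.3 If you choose to set up the local web interface during configuration, edit the configuration file of your local web server so that the web version can run locally.
    3.2 Using the local InterProScan
    3.2.1 Running iprscan from the command line: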

    $bin/iprscan -cli -iprlookup -goterms -format xml -i test.fasta -o test.out

    # help

    http://www.chenlianfu.com/?tag=iprscan
    

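    This installation needs the Perl modules XML::Parser and XML::Parser::Expat. The latter must be installed first, then the former. Both bind to C code, so the expat development library has to be present on the system beforehand; otherwise the build stops with:

     Expat must be installed prior to building XML::Parser and I can't find it in the standard library directories. Install 'expat-devel' (or 'libexpat1-dev') package

    Tip (requires root or sudo): run yum install expat-devel, or apt-get install libexpat1-dev on Debian-based systems (the exact package name depends on the distribution).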
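Before rebuilding XML::Parser it can help to see which package name applies and whether the module already loads. The following is a minimal sketch, not part of the original post: the function names `expat_dev_package` and `perl_module_status` are made up for illustration, and the two package names are the ones quoted in the error message above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, package names come from the
# XML::Parser error message quoted above.

# Print the expat development package name for the package manager found.
expat_dev_package() {
    if command -v apt-get >/dev/null 2>&1; then
        echo "libexpat1-dev"
    elif command -v yum >/dev/null 2>&1; then
        echo "expat-devel"
    else
        echo "unknown"
    fi
}

# Print "ok" if the given Perl module can be loaded, otherwise "missing".
perl_module_status() {
    if perl -M"$1" -e 1 >/dev/null 2>&1; then
        echo "ok"
    else
        echo "missing"
    fi
}

echo "expat development package: $(expat_dev_package)"
echo "XML::Parser: $(perl_module_status XML::Parser)"
echo "XML::Parser::Expat: $(perl_module_status XML::Parser::Expat)"
```

If both modules report "ok", nothing further needs to be installed for this step.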

    #==============================================    Version 2   =============================================

    https://github.com/ebi-pf-team/interproscan/wiki   (source documentation)

    Step 1: Environment setup

    Software requirements:

    • 64-bit Linux
    • Perl (default on most Linux distributions)
    • Python 2.7.x only
    • Oracle's Java JDK/JRE version 8 (required by InterProScan 5.17-56.0 onwards). Earlier InterProScan release versions required Java 6 (version 6u4 and above) or Java 7.
    • Environment variables set
      • $JAVA_HOME should point to the location of the JVM

                        $JAVA_HOME/bin should be added to the $PATH
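The requirements above can be checked with a short script before installing. This is a minimal sketch under stated assumptions: `java_major_version` and `check_env` are hypothetical helper names, and the script only reports what it finds rather than enforcing the Java 8 and Python 2.7 requirements listed above.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

# Extract the major Java version from the first line of `java -version`.
# Handles the old "1.8.0_151" scheme and the newer "11.0.2" scheme.
java_major_version() {
    local banner="$1" ver
    ver=$(printf '%s\n' "$banner" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    case "$ver" in
        1.*) ver=${ver#1.}; echo "${ver%%.*}" ;;
        "")  echo "" ;;
        *)   echo "${ver%%[.-]*}" ;;
    esac
}

# Report the state of JAVA_HOME, perl and python on this machine.
check_env() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set"
    elif [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "JAVA_HOME is set but $JAVA_HOME/bin/java is not executable"
    else
        local banner
        banner=$("$JAVA_HOME/bin/java" -version 2>&1 | head -n 1)
        echo "Java major version: $(java_major_version "$banner")"
    fi
    if command -v perl >/dev/null 2>&1; then
        echo "perl found"
    else
        echo "perl not found"
    fi
    if command -v python >/dev/null 2>&1; then
        python --version 2>&1
    else
        echo "python not found on PATH"
    fi
}

check_env
```

The version parsing is kept in its own function so that it can be tried on a banner string without a JVM installed.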

    Step 2: Download

    wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.27-66.0/interproscan-5.27-66.0-64-bit.tar.gz

    wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.27-66.0/interproscan-5.27-66.0-64-bit.tar.gz.md5

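    md5sum -c interproscan-5.27-66.0-64-bit.tar.gz.md5   (run before extracting, with the .tar.gz and .tar.gz.md5 files in the same directory, to check the archive's integrity)

    tar -pxvzf interproscan-5.27-66.0-64-bit.tar.gz   (-p preserves file permissions; -v only lists files as they are extracted and is best dropped)


    (The extracted directory contains a data directory. The PANTHER data, downloaded separately, should be extracted into it because that is the default path in the configuration file; if you put it elsewhere, update the configuration accordingly.)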
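The verify-then-extract sequence above can be wrapped so that extraction only happens when the checksum matches. The sketch below is illustrative: `verify_and_extract` is a hypothetical helper name, and the demo builds a tiny throwaway archive instead of downloading the real InterProScan tarball.

```shell
#!/usr/bin/env bash
# Sketch only: verify_and_extract is a hypothetical helper name.

# Check ARCHIVE against ARCHIVE.md5 and extract it into DEST only on success.
verify_and_extract() {
    local archive="$1" dest="${2:-.}"
    local dir name
    dir=$(dirname "$archive")
    name=$(basename "$archive")
    # md5sum -c looks up the file named inside the .md5 file relative to
    # the current directory, so run it from the archive's own directory.
    ( cd "$dir" && md5sum -c "$name.md5" ) || {
        echo "checksum mismatch, not extracting: $archive" >&2
        return 1
    }
    mkdir -p "$dest"
    # -p keeps file permissions; -v is omitted to keep the output short.
    tar -pxzf "$archive" -C "$dest"
}

# Demo on a small throwaway archive.
work=$(mktemp -d)
mkdir "$work/demo"
echo "hello" > "$work/demo/file.txt"
tar -czf "$work/demo.tar.gz" -C "$work" demo
( cd "$work" && md5sum demo.tar.gz > demo.tar.gz.md5 )
verify_and_extract "$work/demo.tar.gz" "$work/out"
```

For the real download, the call would take the InterProScan archive path as its first argument, with the .md5 file sitting next to it.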

    Step 3: Test run

    ./interproscan.sh -i test_proteins.fasta -f tsv 
    ./interproscan.sh -i test_proteins.fasta -cpu 8 -f GFF3 -goterms -iprlookup -t p -T 20171127tmp

    #  Options: -i input file   -f output format   -goterms -iprlookup GO and InterPro annotation   -t sequence type   -T temporary directory name
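Once a TSV run with -goterms has finished, the GO terms can be pulled out per protein. The sketch below assumes the InterProScan 5 TSV layout in which column 1 is the protein accession and column 14 holds the GO terms separated by "|"; check this against the output of your own release. `protein_go_pairs` is a hypothetical helper name, and the sample rows are made up for the demo.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GO terms sit in column 14 of the TSV output,
# separated by "|". The helper name and the sample rows are hypothetical.

# Print unique "protein<TAB>GO term" pairs from an InterProScan TSV file.
protein_go_pairs() {
    awk -F'\t' '
        NF >= 14 && $14 != "" && $14 != "-" {
            n = split($14, terms, /[|]/)
            for (i = 1; i <= n; i++) {
                # Drop a parenthesised suffix, in case a release appends one.
                sub(/\(.*$/, "", terms[i])
                key = $1 "\t" terms[i]
                if (!(key in seen)) { seen[key] = 1; print key }
            }
        }
    ' "$1"
}

# Demo on made-up rows: two hits for P1 with GO terms, one hit for P2 without.
tsv=$(mktemp)
{
    printf 'P1\tmd5a\t100\tPfam\tPF00069\tProtein kinase domain\t10\t90\t1.0E-20\tT\t27-11-2017\tIPR000719\tProtein kinase domain\tGO:0004672|GO:0005524|GO:0006468\n'
    printf 'P1\tmd5a\t100\tSMART\tSM00220\tkinase\t10\t90\t1.0E-20\tT\t27-11-2017\tIPR000719\tProtein kinase domain\tGO:0004672\n'
    printf 'P2\tmd5b\t50\tCoils\tCoil\t\t5\t25\t-\tT\t27-11-2017\n'
} > "$tsv"

protein_go_pairs "$tsv"
```

Because several member databases can map to the same InterPro entry, the same GO term often appears on more than one row for a protein; the `seen` array keeps only the first occurrence.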

    Tips:

    TSV stands for tab-separated values.
    CSV stands for comma-separated values.

    #=============================      Full option list  ========================================

    27/11/2017 14:41:35:049 Welcome to InterProScan-5.27-66.0
    usage: java -XX:+UseParallelGC -XX:ParallelGCThreads=2 -XX:+AggressiveOpts
                -XX:+UseFastAccessorMethods -Xms128M -Xmx2048M -jar
                interproscan-5.jar
    
    
    Please give us your feedback by sending an email to
    
    interhelp@ebi.ac.uk
    
     -appl,--applications <ANALYSES>            Optional, comma separated list
                                                of analyses.  If this option
                                                is not set, ALL analyses will
                                                be run.
     -b,--output-file-base <OUTPUT-FILE-BASE>   Optional, base output filename
                                                (relative or absolute path).
                                                Note that this option, the
                                                --output-dir (-d) option and
                                                the --outfile (-o) option are
                                                mutually exclusive.  The
                                                appropriate file extension for
                                                the output format(s) will be
                                                appended automatically. By
                                                default the input file
                                                path/name will be used.
     -cpu,--cpu <CPU>                           Optional, number of cores for
                                                inteproscan.
     -d,--output-dir <OUTPUT-DIR>               Optional, output directory.
                                                Note that this option, the
                                                --outfile (-o) option and the
                                                --output-file-base (-b) option
                                                are mutually exclusive. The
                                                output filename(s) are the
                                                same as the input filename,
                                                with the appropriate file
                                                extension(s) for the output
                                                format(s) appended
                                                automatically .
     -dp,--disable-precalc                      Optional.  Disables use of the
                                                precalculated match lookup
                                                service.  All match
                                                calculations will be run
                                                locally.
     -dra,--disable-residue-annot               Optional, excludes sites from
                                                the XML, JSON output
     -f,--formats <OUTPUT-FORMATS>              Optional, case-insensitive,
                                                comma separated list of output
                                                formats. Supported formats are
                                                TSV, XML, JSON, GFF3, HTML and
                                                SVG. Default for protein
                                                sequences are TSV, XML and
                                                GFF3, or for nucleotide
                                                sequences GFF3 and XML.
     -goterms,--goterms                         Optional, switch on lookup of
                                                corresponding Gene Ontology
                                                annotation (IMPLIES -iprlookup
                                                option)
     -help,--help                               Optional, display help
                                                information
     -i,--input <INPUT-FILE-PATH>               Optional, path to fasta file
                                                that should be loaded on
                                                Master startup. Alternatively,
                                                in CONVERT mode, the
                                                InterProScan 5 XML file to
                                                convert.
     -iprlookup,--iprlookup                     Also include lookup of
                                                corresponding InterPro
                                                annotation in the TSV and GFF3
                                                output formats.
     -ms,--minsize <MINIMUM-SIZE>               Optional, minimum nucleotide
                                                size of ORF to report. Will
                                                only be considered if n is
                                                specified as a sequence type.
                                                Please be aware of the fact
                                                that if you specify a too
                                                short value it might be that
                                                the analysis takes a very long
                                                time!
     -o,--outfile <EXPLICIT_OUTPUT_FILENAME>    Optional explicit output file
                                                name (relative or absolute
                                                path).  Note that this option,
                                                the --output-dir (-d) option
                                                and the --output-file-base
                                                (-b) option are mutually
                                                exclusive. If this option is
                                                given, you MUST specify a
                                                single output format using the
                                                -f option.  The output file
                                                name will not be modified.
                                                Note that specifying an output
                                                file name using this option
                                                OVERWRITES ANY EXISTING FILE.
     -pa,--pathways                             Optional, switch on lookup of
                                                corresponding Pathway
                                                annotation (IMPLIES -iprlookup
                                                option)
     -t,--seqtype <SEQUENCE-TYPE>               Optional, the type of the
                                                input sequences (dna/rna (n)
                                                or protein (p)).  The default
                                                sequence type is protein.
     -T,--tempdir <TEMP-DIR>                    Optional, specify temporary
                                                file directory (relative or
                                                absolute path). The default
                                                location is temp/.
     -version,--version                         Optional, display version
                                                number
     -vtsv,--output-tsv-version                 Optional, includes a TSV
                                                version file along with any
                                                TSV output (when TSV output
                                                requested)
    Copyright © EMBL European Bioinformatics Institute, Hinxton, Cambridge,
    UK. (http://www.ebi.ac.uk) The InterProScan software itself is provided
    under the Apache License, Version 2.0
    (http://www.apache.org/licenses/LICENSE-2.0.html). Third party components
    (e.g. member database binaries and models) are subject to separate
    licensing - please see the individual member database websites for
    details.
    
    Available analyses:
                          TIGRFAM (15.0) : TIGRFAMs are protein families based on Hidden Markov Models or HMMs
                             SFLD (3) : SFLDs are protein families based on Hidden Markov Models or HMMs
                      SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.
                          PANTHER (12.0) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.
                           Gene3D (4.1.0) : Structural assignment for whole genes and genomes using the CATH domain structure database
                            Hamap (2017_10) : High-quality Automated and Manual Annotation of Microbial Proteomes
                            Coils (2.2.1) : Prediction of Coiled Coil Regions in Proteins
                  ProSiteProfiles (2017_09) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
                            SMART (7.1) : SMART allows the identification and analysis of domain architectures based on Hidden Markov Models or HMMs
                              CDD (3.16) : Prediction of CDD domains in Proteins
                           PRINTS (42.0) : A fingerprint is a group of conserved motifs used to characterise a protein family
                  ProSitePatterns (2017_09) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
                             Pfam (31.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
                           ProDom (2006.1) : ProDom is a comprehensive set of protein domain families automatically generated from the UniProt Knowledge Database.
                       MobiDBLite (1.0) : Prediction of disordered domains Regions in Proteins
                            PIRSF (3.02) : The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.
    
    Deactivated analyses:
                          Phobius (1.01) : Analysis Phobius is deactivated, because the resources expected at the following paths do not exist: bin/phobius/1.01/phobius.pl
                      SignalP_EUK (4.1) : Analysis SignalP_EUK is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
            SignalP_GRAM_POSITIVE (4.1) : Analysis SignalP_GRAM_POSITIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
                            TMHMM (2.0c) : Analysis TMHMM is deactivated, because the resources expected at the following paths do not exist: bin/tmhmm/2.0c/decodeanhmm, data/tmhmm/2.0c/TMHMM2.0c.model
            SignalP_GRAM_NEGATIVE (4.1) : Analysis SignalP_GRAM_NEGATIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
    
  • Original post: https://www.cnblogs.com/jinhh/p/7902854.html