BLAST is a widely used local similarity search tool for identifying homologous sequences. When a gene sequence (either a protein sequence or a nucleotide sequence) is used as a query to search for homologous sequences in a genome, the search results, represented as a list of high-scoring pairs (HSPs), are fragments of candidate genes rather than full-length candidate genes. Relevant HSPs ("signal"), which represent candidate genes in the target genome sequences, are buried within a report that also contains hundreds to thousands of random HSPs ("noise"). Consequently, BLAST results are often overwhelming and confusing even to experienced users. For effective use of BLAST, a program is needed that extracts the relevant HSPs representing candidate homologous genes from the entire HSP report. To achieve this goal, we have designed a graph-based algorithm, genBlastA, which automatically filters HSPs into well-defined groups, each representing a candidate gene in the target genome. The novelty of genBlastA is an edge length metric that reflects a set of biologically motivated requirements, so that each shortest path corresponds to an HSP group representing a homologous gene. We have demonstrated that this novel algorithm is both efficient and accurate for identifying homologous sequences, and that it outperforms existing approaches with similar functionalities.
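After tblastn has run, its output is a large number of sequence fragments, and the genes that these fragments correspond to are the candidate genes. The fragments are collected in a single report (here, the all-opsin.pep.gba.report file). The report contains both relevant HSPs (high-scoring pairs) and random HSPs (hits that arise by chance but that tblastn still reports as HSPs; these spurious hits are the noise). genBlastA is the tool that filters out this noise.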
genBlastA release v1.0.1
SYNOPSIS:
Given a list of query protein or DNA sequences and a target database that
consists of DNA sequences, this program runs the selected search program
(tblastn by default for protein queries) on the list of sequences provided.
Then, for each query, it groups the resulting HSPs so that each group of
HSPs corresponds to a potential target gene that is homologous to the
query. The groups in the output are ranked according to their homology to
the query.
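The grouping idea can be illustrated with a much simpler rule than the one genBlastA uses. The sketch below is not the genBlastA algorithm (which is graph-based and relies on its own edge length metric); it only chains HSPs that lie on the same target sequence and strand and are separated by no more than a maximum distance, in the spirit of the -d option, and then ranks the groups by summed score. The Hsp class, its field names, and the ranking by summed score are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hsp:
    """One high-scoring pair. Field names are hypothetical."""
    target_id: str  # name of the target sequence, e.g. a chromosome
    strand: str     # "+" or "-"
    t_start: int    # start on the target, 1-based inclusive
    t_end: int      # end on the target, inclusive
    score: float    # alignment score reported by the search program


def group_hsps(hsps: List[Hsp], max_gap: int = 100000) -> List[List[Hsp]]:
    """Chain HSPs on the same target and strand whose gap is <= max_gap.

    Simplified stand-in for HSP grouping; it ignores query order and
    does not implement genBlastA's shortest-path formulation.
    """
    groups: List[List[Hsp]] = []
    ordered = sorted(hsps, key=lambda h: (h.target_id, h.strand, h.t_start))
    for hsp in ordered:
        if groups:
            current = groups[-1]
            group_end = max(h.t_end for h in current)
            same_locus = (
                current[0].target_id == hsp.target_id
                and current[0].strand == hsp.strand
                and hsp.t_start - group_end <= max_gap
            )
            if same_locus:
                current.append(hsp)
                continue
        groups.append([hsp])
    return groups


def rank_groups(groups: List[List[Hsp]]) -> List[List[Hsp]]:
    """Order groups by summed HSP score, best first (an assumed ranking)."""
    return sorted(groups, key=lambda g: sum(h.score for h in g), reverse=True)


example = [
    Hsp("chr1", "+", 100, 200, 50.0),
    Hsp("chr1", "+", 5000, 5200, 40.0),
    Hsp("chr1", "+", 500000, 500100, 30.0),
    Hsp("chr1", "-", 150, 250, 20.0),
]
ranked = rank_groups(group_hsps(example, max_gap=100000))
```

With this toy input, the two nearby plus-strand HSPs fall into one group, while the distant HSP and the minus-strand HSP each form their own group.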
Command line options:
-P    Search program used to produce blast-format sequence alignments,
      can be either "blast" or "wublast", default is "blast",
      optional
-q    List of query sequences to blast, must be in fasta format,
      required
-t    The target database of genomic sequences in fasta format,
      required
-p    Whether query sequences are protein sequences (T/F)
      [default: T], optional
-pg   Specify which blast/wublast program to run. If not specified,
      the default behaviour is to run tblastn (for blast/wublast protein
      sequence) / blastn (for blast nucleotide sequence) or tblastx
      (for wublast nucleotide sequence).
-e    parameter for blast: The e-value, [default: 1e-2],
      optional
-g    parameter for blast: Perform gapped alignment (T/F)
      [default: T], optional
-f    parameter for blast: Perform filtering (T/F) [default: F],
      optional
-a    parameter for genBlast: weight of penalty for skipping HSPs,
      between 0 and 1 [default: 0.5], optional
-d    parameter for genBlast: maximum allowed distance between HSPs
      within the same gene, a non-negative integer [default: 100000],
      optional
-r    parameter for genBlast: number of ranks in the output,
      a positive integer, optional
-c    parameter for genBlast: minimum percentage of query gene
      coverage in the output, between 0 and 1 (e.g. for 50%
      gene coverage, use "0.5"), optional
-s    parameter for genBlast: minimum score of the HSP group in
      the output, a real number, optional
-o    output filename, optional. If not specified, the output
      will be the same as the query filename with ".gblast"
      extension.
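
To make the -c threshold concrete, the following sketch computes the fraction of a query covered by a group of HSPs and compares it with a minimum coverage value. It assumes that coverage means the union of the query intervals of the HSPs divided by the query length; the exact definition used by genBlastA may differ. The function names and the interval representation are hypothetical.

```python
from typing import Iterable, Tuple


def query_coverage(intervals: Iterable[Tuple[int, int]], query_length: int) -> float:
    """Fraction of query positions covered by at least one HSP.

    Intervals are (start, end) on the query, 1-based and inclusive.
    Overlapping intervals are counted once. This definition of
    coverage is an assumption, not taken from genBlastA itself.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    covered = 0
    current_end = 0  # last query position already counted
    for start, end in sorted(intervals):
        start = max(start, current_end + 1)  # skip positions counted before
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered / query_length


def passes_coverage(intervals: Iterable[Tuple[int, int]],
                    query_length: int,
                    min_coverage: float = 0.5) -> bool:
    """True if the group covers at least min_coverage of the query."""
    return query_coverage(intervals, query_length) >= min_coverage


# Three HSPs on a 200-residue query; the first two overlap by 11 positions.
group_intervals = [(1, 50), (40, 80), (101, 120)]
coverage = query_coverage(group_intervals, 200)
keep = passes_coverage(group_intervals, 200, min_coverage=0.5)
```

In this example the HSPs cover positions 1 to 80 and 101 to 120, which is 100 of 200 positions, so the group would just meet a threshold of 0.5.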
Example:
genblasta -P blast -pg tblastn -q myquery -t mytarget -p T -e 1e-2 -g T -f F -a 0.5 -d 100000 -r 10 -c 0.5 -s 0 -o myoutput
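
When genBlastA is run from a script, the command line can be assembled from a dictionary of the options documented above. The sketch below is one way to do this in Python; it assumes the executable is called genblasta and is on the PATH, as in the example, and the helper names build_genblasta_command and run_genblasta are hypothetical. Only flags listed in this README are used.

```python
import subprocess
from typing import Dict, List


def build_genblasta_command(options: Dict[str, str],
                            executable: str = "genblasta") -> List[str]:
    """Turn {"q": "myquery", ...} into ["genblasta", "-q", "myquery", ...].

    The keys are the option names from the README without the leading
    dash. -q and -t are checked because the README marks them required.
    """
    for required in ("q", "t"):
        if required not in options:
            raise ValueError(f"missing required option -{required}")
    command = [executable]
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command


def run_genblasta(options: Dict[str, str]) -> None:
    """Run genBlastA and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_genblasta_command(options), check=True)


# Same settings as the README example; nothing is executed here.
example_options = {
    "P": "blast", "pg": "tblastn", "q": "myquery", "t": "mytarget",
    "p": "T", "e": "1e-2", "g": "T", "f": "F", "a": "0.5",
    "d": "100000", "r": "10", "c": "0.5", "s": "0", "o": "myoutput",
}
command = build_genblasta_command(example_options)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file names contain spaces. Calling run_genblasta(example_options) would start the actual search, provided the program and the input files are available.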
(Rong She (rshe@cs.sfu.ca) May 2010)