Key word
文本操作(text operation) 标引词(indexing term)
倒排文档(inverted file) 用户反馈(user feedback)
检索评价(retrieval evaluation) 查询语言(query language)
用户界面(user interface) 特别检索(ad hoc retrieval)
用户需求档(user profile) 语词加权(term-weighting)
数据检索(data retrieval) 用户任务(User task)
查全率(R, Recall Ratio) 查准率(P, Precision Ratio)
漏检率(O, Omission Ratio) 误检率(M, Miss Ratio)
用户负担(user effort) 面向用户(user-oriented)
扩展布尔模型(extended Boolean model) 词干提取(stemming)
参考测试集(reference test collection) 扩展模式 extended patterns
容错查询 searching allowing errors 有序包含(ordered inclusion)
无序包含(unordered inclusion) 查询重构 query reformulation
查询扩展 query expansion 语词重新加权 term reweighting
用户相关反馈 User Relevance Feedback 描述性元数据 Descriptive Metadata
语义元数据 Semantic Metadata 信息检索(Information Retrieval, IR)
受控词汇表(controlled vocabulary) 文本压缩text compression
压缩比(compression ratio) 词汇分析lexical analysis
排除停用词elimination of stopwords 词汇表(vocabulary)
事件表(occurrence) 倒排文档inverted files
散列变换(hashing) 查询语法树(query syntax tree)
移位-或(shift-or) 顺序检索(sequential search)
并行计算(parallel computing) 分布计算(distributed computing)
算术逻辑单元(arithmetic logic unit, ALU) 虚拟处理器(virtual processor)
中介器(broker ) 数字图书馆(Digital Library, DL)
信息存取任务(information access tasks) 信息可视化(information visualization)
上下文关键词(keyword-in-context, KWIC) 半结构化数据(semi-structured data)
动态服务器(dynamic server) 档案库服务器(archive server)
精确匹配(exact match) 基于内容(content-based)
收集器-标引器(crawler-indexer) 机器人robots
蜘蛛spiders 查询界面the query interface
响应界面the answer interface 网页级别PageRank
漫游Web Crawling the Web 广度优先breadth-first
深度优先depth-first fashion 专指性查询Specific queries
泛指性查询Broad queries 网络目录Web Directories
元搜索引擎Metasearchers 用户培训Teaching the User
软件代理Software Agents 工程索引(Engineering Index,EI)
从用户的角度(from a user-centered perspective)
杜威十进分类法(Dewey Decimal Classification)
结构化文本检索(structured text retrieval, STR)
联机公共检索目录(online public access catalog, OPAC)
多媒体信息检索(Multimedia Information Retrieval, MIR)
从计算机学科的角度(from a computer-science perspective)
文献逻辑表示(视图)(logical view of the document)
检索性能评价(retrieval performance evaluation)
标准通用标记语言(SGML, Standard Generalized Markup Language)
机读目录记录(Machine Readable Cataloging Record, MARC)
资源描述框架(Resource Description Framework, RDF)
XML(eXtensible Markup Language, 可扩展标记语言)
HTML(HyperText Markup Language, 超文本标记语言)
分布式信息检索(distributed information retrieval)
通过图像内容查询(Query by Image Content, QBIC)
分布式结构(Distributed Architecture) 集中式结构(Centralized Architecture)
国会图书馆分类法(Library of Congress Classification)
联机计算机图书馆中心(Online Computer Library Center, OCLC)
数字图书馆创新项目(Digital Libraries Initiative, DLI)
数字对象标识符(Digital Object Identifier, DOI)
重点句子分析
第1章
The effective retrieval of relevant information is directly affected both by the user task and by the logical view of the documents
(直接影响检索效率的两个因素是:用户任务和信息系统逻辑视图文件)
The full text is clearly the most complete logical view of a document but its usage usually implies higher computational costs
(全文本是最完整的逻辑视图的文件,但它的使用通常意味着更高的成本计算)
Three dramatic and fundamental changes have occurred due to the advances in modern computer technology and the boom of the Web. First, it became a lot cheaper to have access to various sources of information. Second, the advances in all kinds of digital communication provided greater access to networks. Third, the freedom to post whatever information someone judges useful has greatly contributed to the popularity of the Web.
现代计算机技术和网络发展导致三个戏剧性和根本性变化的发生。首先,对于获得各种信息来源,它变得更加便宜;第二,各种先进的数字通信提供了更多的接入网络;第三,自由发布各种信息已经普及。
The text operations transform the original documents and generate a logical view of them. 文本操作对原始文献进行转换,并生成文献的逻辑视图
Once the logical view of the documents is defined, the database manager (using the DB Manager Module) builds an index of the text.
一旦逻辑视图文件被定义,数据库管理员用数据库管理模块建立文本索引
the most popular one is the inverted file 最流行的方式之一是倒排文档
Given that the document database is indexed, the retrieval process can be initiated. The user first specifies a user need which is then parsed and transformed by the same text operations applied to the text.
只要文献数据库被检索,检索程序初始化。用户先详细说明自己的需要,然后由同一文本业务应用到文本。
The query is then processed to obtain the retrieved documents
查询然后处理以便获得检出文献
Before being sent to the user, the retrieved documents are ranked according to a likelihood of relevance.
在送给用户之前,检出文献根据相关度进行排序
At this point, he might pinpoint a subset of the documents seen as definitely of interest and initiate a user feedback cycle.
此时, 他可能查明重要明确文件的子集而且开始用户反馈周期。
第2章
Traditional information retrieval systems usually adopt index terms to index and retrieve documents.
传统的信息检索系统通常采用标引词来标引和检索文件。
Clearly, one central problem regarding information retrieval systems is the issue of predicting which documents are relevant and which are not. Such a decision is usually dependent on a ranking algorithm which attempts to establish a simple ordering of the documents retrieved.
信息检索的核心问题是预测哪些文献相关,哪些文献不相关。这种常常取决于文件检索排序算法
ranking algorithms are at the core of information retrieval systems.
排序算法是信息检索系统的核心。
The three classic models in information retrieval are called Boolean, vector, and probabilistic. In the Boolean model, documents and queries are represented as sets of index terms. Thus, we say that the model is set theoretic. In the vector model, documents and queries are represented as vectors in a t-dimensional space. Thus, we say that the model is algebraic. In the probabilistic model, the framework for modeling document and query representations is based on probability theory. Thus, as the name indicates, we say that the model is probabilistic.
信息检索的三个经典模型分别是:布尔模型、向量模型和概率模型。在布尔模型中,文档和查询被表示成标引词的集合,我们通常称之集合论。在向量模型中,文档和查询表示成t维的向量,我们通常叫做代数模型。在概率模型中,文献建模和查询表示建立在概率论上,我们叫概率模型
User Task | Index Terms | Full Text | Full Text + Structure |
Retrieval | Classic, Set Theoretic, Algebraic, Probabilistic | Classic, Set Theoretic, Algebraic, Probabilistic | Structured |
Browsing | Flat | Flat, Hypertext | Structure Guided, Hypertext |
Figure 2.2 Retrieval models most frequently associated with distinct combinations of a document logical view and a user task.
通常和文献逻辑表示与用户任务的不同组合相关联的检索模型
In a conventional (传统的) information retrieval system, the documents in the collection remain relatively static (相对静止) while new queries are submitted to the system. This operational mode has been termed ad hoc retrieval in recent years and is the most common form of user task. A similar but distinct task is one in which the queries remain relatively static while new documents come into the system (and leave).
This operational mode has been termed filtering. 这个操作模型称之为过滤
In a filtering task , a user profile describing the user's preferences is constructed. Such a profile is then compared to the incoming documents in an attempt to determine those which might be of interest to this particular user.
在过滤任务中,系统先创建描述用户偏好的用户需求档,然后将这一需求档与进入系统的文献相比较,以确定那些可能符合该用户需要的文献。
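The filtering task above can be sketched in a few lines. This is a minimal illustration, not a method taken from the text: the profile is assumed to be a dictionary of term weights, and `profile_score`, `filter_stream`, the weights, and the threshold are all hypothetical names and values.

```python
def profile_score(profile, document_text):
    """Sum the profile weights of the terms that occur in the document."""
    words = set(document_text.lower().split())
    return sum(weight for term, weight in profile.items() if term in words)


def filter_stream(profile, incoming_documents, threshold):
    """Keep only the incoming documents whose score reaches the threshold."""
    return [doc for doc in incoming_documents
            if profile_score(profile, doc) >= threshold]


# Hypothetical user profile: positive weights for preferred topics,
# a negative weight for an unwanted one.
profile = {"retrieval": 1.0, "indexing": 0.8, "football": -1.0}
incoming = [
    "New indexing methods for text retrieval",
    "Football results and retrieval of lost tickets",
    "Cooking with herbs",
]
selected = filter_stream(profile, incoming, threshold=1.0)
```

The profile stays fixed while documents arrive, which is the reverse of ad hoc retrieval, where the collection stays fixed and the queries change.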
not all terms are equally useful for describing the document contents.
并不是所有元素都对描述文档内容有意义。
This effect is captured through the assignment of numerical weights to each index term of a document.
这一作用的产生,通过指定文件标引词的权值来获得。
The Boolean model is a simple retrieval model based on set theory and Boolean algebra
布尔模型是最简单的检索模型,它是建立在集合理论和布尔代数上
Given its inherent simplicity and neat formalism, the Boolean model received great attention in past years and was adopted by many of the early commercial bibliographic systems.
由于它内部简单和形式简洁,过去几年,他被很多早期的商业书目系统所采用。
The Boolean model considers that index terms are present or absent in a document. As a result, the index term weights are assumed to be all binary, i.e., w_{i,j} ∈ {0, 1}. A query q is composed of index terms linked by three connectives: not, and, or.
布尔模型假定标引词出现或不出现在文献中。这样,标引词得权重被设定为二值的,0或1。查询连接词有: not, and, or
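Because weights are binary, a Boolean query can be evaluated with plain set operations. The sketch below assumes a toy collection invented for illustration; `docs`, `having`, and the sample query are not from the text.

```python
# Each document is represented only by the set of index terms it contains,
# which is equivalent to binary term weights.
docs = {
    "d1": {"information", "retrieval", "index"},
    "d2": {"information", "theory"},
    "d3": {"retrieval", "evaluation"},
}
all_docs = set(docs)


def having(term):
    """Ids of the documents whose weight for `term` is 1."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}


# Query: information AND (retrieval OR theory) AND NOT evaluation
# and -> set intersection, or -> set union, not -> complement.
answer = (
    having("information")
    & (having("retrieval") | having("theory"))
    & (all_docs - having("evaluation"))
)
```

A document either satisfies the expression or it does not, so the answer is an unranked set.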
A hypertext is a high level interactive navigational structure which allows us to browse text non-sequentially on a computer screen
超文本是高层交互式导航结构,我们可以在电脑屏幕上游览非顺序的文本。
第3章
The most common measures of system performance are time and space. The shorter the response time, the smaller the space used, the better the system is considered to be.
常用的系统评价指标是时间和空间。最短响应时间,最小可用空间也是一个好的系统要考虑的。
information retrieval systems require the evaluation of how precise is the answer set. This type of evaluation is referred to as retrieval performance evaluation.
信息检索系统需要对检索结果集的准确度进行评价,这种评价称为检索性能评价
Such an evaluation is usually based on a test reference collection and on an evaluation measure. The test reference collection consists of a collection of documents, a set of example information requests, and a set of relevant documents (provided by specialists) for each example information request
这种评价常常建立在参考测试集和评价测度上。参考测试集由文献集,信息查询实例和每个信息查询实例的一组相关文献组成。
The goal of the conference series is to encourage research in information retrieval from large text applications by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results
系列会议的目标是通过提供大型测试集、统一评分程序以及检索结果对比的评价论坛,来鼓励大型文本应用的信息检索研究.
Such an effort consisted of promoting a yearly conference, named TREC for Text Retrieval Conference, dedicated to experimentation with a large test collection comprising over a million documents.
其具体工作是每年举行一次TREC(文本检索会议),该会议主要致力于大型测试集的试验研究。
Recall查全率is the fraction比值 of the relevant documents (the set R) which has been retrieved i.e., Recall=|Ra|/|R|
查全率是一个已经检索出相关文献数量和文献中全部相关文献数量的比值。
Precision查准率is the fraction比值 of the retrieved documents (the set A) which is relevant i.e., Precision=|Ra|/|A|
查准率是一个已经检索出相关文献数量和全部检出的文献数量。
Figure: the collection, the relevant documents (|R|), the answer set (|A|), and the relevant documents in the answer set (|Ra|).
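The two ratios can be computed directly from the sets R and A. The document ids below are made up for illustration, and the function name `recall_precision` is an assumption.

```python
def recall_precision(relevant, answer):
    """Return (recall, precision) for a relevant set R and an answer set A."""
    ra = relevant & answer  # Ra: relevant documents that were retrieved
    recall = len(ra) / len(relevant) if relevant else 0.0
    precision = len(ra) / len(answer) if answer else 0.0
    return recall, precision


R = {"d3", "d5", "d9", "d25", "d39"}   # relevant documents
A = {"d3", "d9", "d12", "d84"}         # documents returned by the system
recall, precision = recall_precision(R, A)
```

Here two of the five relevant documents were retrieved, and two of the four retrieved documents are relevant.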
As discussed above, a single measure单值测度 which combines recall and precision might be of interest. One such measure is the harmonic mean F of recall and precision which is computed as
F(j)=2 / (1/r(j)+1/p(j))
如上所述,把查全率和查准率结合起来的单值测度可能更有意义,其中一种测度是查全率和查准率的调和平均数F。
Another measure which combines recall and precision was proposed by van Rijsbergen and is called the E evaluation measure. The idea is to allow the user to specify whether he is more interested in recall or in precision. The E measure is defined as follows.
E(j) = 1 - (1 + b^2) / (b^2/r(j) + 1/p(j))
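A short sketch of both single-value measures, reading the E measure as 1 - (1 + b^2) / (b^2/r + 1/p), with r the recall, p the precision, and b the user-chosen parameter. The function names and the convention of returning the worst value when r or p is zero are assumptions.

```python
def harmonic_mean_f(r, p):
    """Harmonic mean F of recall r and precision p."""
    if r == 0 or p == 0:
        return 0.0  # assumed convention: no relevant document retrieved
    return 2 / (1 / r + 1 / p)


def e_measure(r, p, b=1.0):
    """E measure of van Rijsbergen; b sets the balance between r and p."""
    if r == 0 or p == 0:
        return 1.0  # assumed convention: worst possible value
    return 1 - (1 + b ** 2) / (b ** 2 / r + 1 / p)


f = harmonic_mean_f(0.4, 0.5)
e = e_measure(0.4, 0.5, b=1.0)
```

With b = 1 the E measure is the complement of F, so `f + e` equals 1.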
第4章
For the basic information retrieval models, keyword-based retrieval is the main type of querying task. 对于基本的信息检索模型而言,基于关键词的检索是查询任务的主要类型.
Keyword-Based Querying基于关键词查询
A query is the formulation of a user information need. In its simplest form,a query is composed of keywords
查询是用户信息需求的概要表示,最简单的表示是由关键词组成
Single-Word Queries单一词查询
Context Queries 上下文查询
Boolean Queries布尔查询
Natural Language自然语言查询
Pattern Matching模式匹配
Structural Queries结构性查询
the exact positions where a word appears in the text may be required for instance, by an interface which highlights each occurrence of that word.
单词在文本中出现的确切位置也是一种信息,在查询界面中可要求高亮度显示检出的每个单词
Many systems complement single-word queries with the ability to search words in a given context, that is, near other words. Words which appear near each other may signal a higher likelihood of relevance
有些系统采用单一词查询在给定的上下文中检索, 也即查找邻近的其他单词。因为邻近出现的单词可能具有更大的相关性
Boolean queries are a simplified abstraction of natural language queries.
布尔查询是自然语言查询的简化和抽象
A pattern is a set of syntactic features that must occur in a text segment. Those segments satisfying the pattern specifications are said to 'match' the pattern
模式就是必须出现在文本部分中的语法特征的集合,满足模式集合规范的这些部分就可以说是与此模式的匹配.
A hypertext is a directed graph where the nodes hold some text and the links represent connections between nodes or between positions inside the nodes
超文本是一个有向图,结点表示文本,有向边表示结点之间或者结点中位置之间的联系
第5章
Such query reformulation involves two basic steps: expanding the original query with new terms and reweighting the terms in the expanded query.
查询重构技术包括两个步骤:利用新的语词来扩展初始的查询;在扩展的查询中给语词重新加权
These approaches are grouped in three categories: (a) approaches based on feedback information from the user; (b) approaches based on information derived from the set of documents initially retrieved (called the local set of documents); and (c) approaches based on global information derived from the document collection
这些方法分成三组:
(a) 基于用户反馈信息的方法
(b) 基于最初检出文献集合(称为文献的局部集)信息的方法
(c) 基于文献集合全局信息的方法.
Relevance feedback is the most popular query reformulation strategy
相关反馈是最流行的查询重构技术.
The main idea consists of selecting important terms, or expressions, attached to the documents that have been identified as relevant by the user, and of enhancing the importance of these terms in a new query formulation. The expected effect is that the new query will be moved towards the relevant documents and away from the non-relevant ones.
其基本思想是从用户认为相关的文献中选择重要的语词或表达式,然后在新的查询表达式中不断提高这些语词的重要性,我们希望新的查询能够将相关文献与不相关的文献中区分开来
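One common way to realize this idea is a Rocchio-style update. The text does not name a specific formula, so the sketch below is an assumption: queries and documents are term-weight dictionaries, and `alpha`, `beta`, `gamma` are illustrative constants.

```python
def reformulate(query, relevant_docs, nonrelevant_docs,
                alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query towards relevant documents and away from the others."""
    terms = set(query)
    for doc in relevant_docs + nonrelevant_docs:
        terms |= set(doc)

    new_query = {}
    for term in terms:
        weight = alpha * query.get(term, 0.0)
        if relevant_docs:
            weight += beta * sum(d.get(term, 0.0) for d in relevant_docs) / len(relevant_docs)
        if nonrelevant_docs:
            weight -= gamma * sum(d.get(term, 0.0) for d in nonrelevant_docs) / len(nonrelevant_docs)
        if weight > 0:  # terms with non-positive weight are dropped
            new_query[term] = weight
    return new_query


query = {"jaguar": 1.0}
marked_relevant = [{"jaguar": 1.0, "cat": 1.0}]
marked_nonrelevant = [{"jaguar": 1.0, "car": 1.0}]
new_query = reformulate(query, marked_relevant, marked_nonrelevant)
```

The update covers both steps of query reformulation: "cat" is added (expansion) and "jaguar" gets a new weight (reweighting).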
第6章
a document can be any physical unit, for example a file, an email, or a World Wide Web (or just Web) page.
文献可能是任何一种物理单元,例如文件,邮件,或是网页。
A document has a given syntax and structure which is usually dictated by the application or by the person who created it.
文献有语法和结构,通常视应用而定,或由生产文献的人指定。
The current trend is to use languages which provide information on the document structure, format, and semantics while being readable by humans as well as computers.
在人机阅读时,流行的方式是用语言来提供文献结构,格式,语义等信息,
Metadata is information on the organization of the data, the various data domains, and the relationship between them. In short, metadata is 'data about the data.'
元数据是关于数据的组织、不同数据域及其相互关系的信息,简言之,就是“关于数据的数据”
the Dublin Core Metadata都柏林核心元数据
Descriptive Metadata, metadata that is external to the meaning of the document, and pertains more to how it was created. Another type of metadata characterizes the subject matter that can be found within the document's contents. We will refer to this as Semantic Metadata.
我们将这类信息称之为描述性元数据,而有关文献含义的外部元数据更多的是说明文献是怎样产生的,另一类元数据描述的是文献的主题内容,我们把这类元数据称之为语义元数据。
An important metadata format is the Machine Readable Cataloging Record (MARC) which is the most used format for library records.
一种重要的元数据格式是机读目录记录,通常用于图书馆记录。
In the Web, metadata can be used for many purposes. Some of them are cataloging (BibTeX is a popular format for this case), content rating (for example, to protect children from reading some type of documents), intellectual property rights, digital signatures (for authentication), privacy levels (who should and who should not have access to a document), applications to electronic commerce, etc. The new standard for Web metadata is the Resource Description Framework (RDF), which provides interoperability between applications.
在Web中,元数据常用于多种目的。如编目(BibTex是因特网上最常用的格式)、内容等级(如防止儿童浏览一些不健康的文档)、知识产权、数字签名(鉴别)、权限(哪些人可以访问该文档,哪些人不可以)、电子商务的应用等,新的web元数据标准是资源描述框架,提供了相互的操作应用。
With the advent of the computer, it was necessary to code text in binary digits. Information theory defines a special concept, entropy, to capture information content.
随着计算机的出现,把文本用二进制数来编码是必须的。信息论,定义了一特殊概念,用它来测定信息量
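The entropy of this text is defined as
E = - ∑_{i=1}^{σ} p_i log2 p_i
where σ is the number of distinct symbols in the alphabet and p_i is the probability of the i-th symbol.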
熵的求法: 例如 001001011011
0概率是0.5
1概率是0.5
带入上述公式得结果E=1
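The worked example can be checked with a few lines of code. This sketch estimates each symbol probability from its relative frequency in the string; the function name `entropy` is an assumption.

```python
from collections import Counter
from math import log2


def entropy(text):
    """Entropy in bits per symbol, with probabilities taken from frequencies."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * log2(c / n) for c in counts.values())


e = entropy("001001011011")  # six 0s and six 1s, so p = 0.5 for each symbol
```

With two equally likely symbols, each symbol carries exactly one bit.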
The marks are called tags, 标记又称为标签
SGML stands for Standard Generalized Markup Language (ISO 8879) and is a metalanguage for tagging text
SGML是用于标记文本的一种元语言
Each instance of SGML includes a description of the document structure called a document type definition. Hence, an SGML document is defined by: (1) a description of the structure of the document and (2) the text itself marked with tags which describe the structure.
每个SGML实例包括一个文献结构叫文献类型定义。因此SGML文献定义如下: (1) 对文献结构的描述(2) 用描述结构的标签标记文本
The document type definition is used to describe and name the pieces that a document is composed of and define how those pieces relate to each other.
DTD常用于描述和命名组成文献的部分,以及明确这些部分之间是如何彼此相互关联的,
HTML stands for HyperText Markup Language and is an instance of SGML. HTML was created in 1992
超文本标记语言是SGML的一个实例,创立在1992年。
XML stands for eXtensible Markup Language and is a simplified subset of SGML.
可扩展的标记语言是SGML简化的一个子集.
Multimedia usually stands for applications that handle different types of digital data originating from distinct types of media. The most common types of media in multimedia applications are text, sound, images, and video (which is an animated sequence of images).
多媒体通常表示处理不同类型的数字数据的应用,这些数据来源于不同的媒体。最普遍的类型是文本、音频、图像和视频(图像的一个动画结果)。
There are several formats for images. The simplest formats are direct representations of a bit-mapped 位图(or pixel-based像素图) display such as XBM, BMP, or PCX. However, those formats consume too much space.
图像格式:
Compuserve's Graphic Interchange Format (GIF)图形交换格式.
Joint Photographic Experts Group (JPEG)
Tagged Image File Format (TIFF标签图像文件格式).
Portable Network Graphics (PNG新型位图图像格式).
音乐格式:
Formats for digital audio include AU, MIDI, WAVE, and MP3.
The main one is MPEG (Moving Pictures Expert Group)
数字动画主要的是MPEG。
第7章
Usually, noun words (or groups of noun words) are the ones which are most representative代表性 of a document content.
通常说,名词或名词词组主要代表了文献的内容。
using the set of all words in a collection to index its documents generates too much noise for the retrieval task.
对于检索任务来说,用集合中的所有单词来标引文献就会产生太多的噪声
Text normalization and the building of a thesaurus are strategies aimed at improving the precision of the documents retrieved.
在提高文献检索查准率上,文本规范化和叙词表的建立是很有效的策略。
A good compression algorithm is able to reduce the text to 30-35% of its original size.
好的压缩算法可以把文本压缩到原文件大小的30-35%。
Document preprocessing is a procedure which can be divided mainly into five text operations (or transformations):
(1) Lexical analysis of the text
(2) Elimination of stop words
(3) Stemming
(4) Selection of index terms
(5)Construction of term categorization structures
主要划分为5个文本操作(变换)步骤:
(1)文本的词汇分析
(2)排除停用词
(3)词干提取
(4)选择标引词
(5)构造语词分类结构
Lexical analysis is the process of converting a stream of characters (the text of the documents) into a stream of words (the candidate words to be adopted as index terms).
词汇分析是把文献中的文本字符序列转换为单词序列的过程,这些单词可以作为标引词的侯选
Numbers are usually not good index terms 数字不适合作标引词
Breaking up hyphenated words might be useful due to inconsistency of usage.
把连字符分解来是一种比较好的方法
Normally, punctuation marks标点符号 are removed entirely in the process of lexical analysis.
一般来说,在词汇分析过程中要完全去除标点符号
The case of letters is usually not important for the identification of index terms. As a result, the lexical analyzer normally converts all the text to either lower or upper case.
字母的大小写对标引词的定义通常不是很重要,词汇分析都会转换成小写或大写。
In fact, a word which occurs in 80% of the documents in the collection is useless for purposes of retrieval. Such words are frequently referred to as stop words and are normally filtered out as potential index terms. Articles, prepositions, and conjunctions are natural candidates for a list of stop words.
事实上, 单词出现80%对检索没有意义,这种词都是停用词需要过滤掉,他们包括潜标引词,冠词,介词,连词,候选词。
A stem is the portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes).词干是单词的一部分,是去除词缀(前后缀)后剩下的部分
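The first three text operations can be sketched as a small pipeline. Everything specific here is an assumption made for illustration: the stop word list is a tiny sample, and `naive_stem` only strips a few suffixes, so it is not a real stemming algorithm such as Porter's.

```python
import re

STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}
SUFFIXES = ("ing", "ed", "es", "s")  # illustrative suffix list only


def lexical_analysis(text):
    """Lower-case the text, split on punctuation and hyphens, drop numbers."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if not any(ch.isdigit() for ch in t)]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    return [naive_stem(t) for t in remove_stop_words(lexical_analysis(text))]


terms = preprocess("The state-of-the-art indexing of 1999 connected documents.")
```

The hyphenated word is broken up, the number and the stop words disappear, and the remaining words are reduced to candidate index terms.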
If a full text representation of the text is adopted then all words in the text are used as index terms. The alternative is to adopt a more abstract view
如果采用全文的表示,文本中的所有单词都作为标引词,将会选择采用更抽象的视图。
an intuitively promising strategy for selecting index terms automatically is to use the nouns in the text.自动标引的一个最直观的方法是采用文本中的名词
The word thesaurus has Greek and Latin origins and is used as a reference to a treasury of words
希腊语和拉丁语常用作词库的引用
The main purposes of a thesaurus are basically: (a) to provide a standard vocabulary (or system of references) for indexing and searching; (b) to assist users with locating terms for proper query formulation; and (c) to provide classified hierarchies that allow the broadening and narrowing of the current query request according to the needs of the user.
叙词表的主要目的是:
(a) 为标引和检索提供标准化的词汇表或参照系统
(b) 帮助用户确定哪些语词适合于查询表达式
(c) 根据用户需要,提供当前查询上位类和下位类的分类层次
The main components of a thesaurus are its index terms, the relationships among the terms, and a layout design for these term relationships.
叙词表的主要组成部分是标引词、语词之间的关系以及其编排方式
Thesaurus Index Terms叙词表中的标引词
The terms are the indexing components of the thesaurus.标引词是叙词表的索引单元
Thesaurus Term Relationships叙词表中的词间关系
The set of terms related to a given thesaurus term is mostly composed of synonyms and near-synonyms
叙词表中的词间关系主要是同义词和近义词.
Document clustering is the operation of grouping together similar (or related) documents in classes文献聚类是把类中相似或相关文献分成一组.
The operation of clustering documents is usually of two types: global and local.
聚集文献的操作由两种类型: 全局和局部聚类
Text compression is about finding ways to represent the text in fewer bits or bytes.
文本压缩是用更少的位/比特或字节来表达文本内容的一种方法
Definition Compression ratio is the size of the compressed file as a fraction of the uncompressed file.压缩比是指已压缩文本的大小与未压缩文本大小的比率
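The definition translates into one division. The sketch uses `zlib` from the Python standard library as a stand-in general-purpose compressor; it is not one of the text compression methods the notes refer to, and the sample text is made up.

```python
import zlib


def compression_ratio(data: bytes) -> float:
    """Size of the compressed data as a fraction of the uncompressed size."""
    compressed = zlib.compress(data)
    return len(compressed) / len(data)


text = ("information retrieval " * 200).encode("utf-8")
ratio = compression_ratio(text)
```

A smaller ratio means better compression. This sample is highly repetitive, so its ratio is far lower than what ordinary text would give.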
第8章
An obvious option in searching for a basic query is to scan the text sequentially. A second option is to build data structures over the text (called indices) to speed up the search.
检索一个基本查询的常用方法是顺序地扫描文本,第二种方法是建立全面覆盖文本的数据结构(索引)以加快检索的速度
We cover three main indexing techniques: inverted files, suffix arrays, and signature files.
三种主要标引技术:倒排文档、后缀数组、签名档
An inverted file (or inverted index) is a word-oriented mechanism for indexing a text collection in order to speed up the searching task. The inverted file structure is composed of two elements: the vocabulary and the occurrences. The vocabulary is the set of all different words in the text. For each such word a list of all the text positions where the word appears is stored. The set of all those lists is called the 'occurrences'
倒排文档(或倒排索引)是一种面向单词的标引机制,它为文本集合建立标引,以加快检索任务的速度,倒排文档由词汇表和事件表两种结构组成。词汇表是文本中所包含的所有不同单词的集合,对于词汇表中的每一个这样一来的单词,其在文本中出现的所有位置都存储在一个列表中,所有这些列表的集合称为”事件表”
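A minimal sketch of the two elements, assuming a single short text and 0-based character offsets as positions. The function name `build_inverted_file` and the sample sentence are illustrative.

```python
import re
from collections import defaultdict


def build_inverted_file(text):
    """Map each distinct word (the vocabulary) to its list of text positions."""
    occurrences = defaultdict(list)
    for match in re.finditer(r"\w+", text.lower()):
        occurrences[match.group()].append(match.start())
    return dict(occurrences)


text = "This is a text. A text has many words. Words are made from letters."
index = build_inverted_file(text)
vocabulary = sorted(index)
```

Looking up a word is then a dictionary access into the vocabulary, followed by reading its occurrence list, for example `index["text"]`.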
To reduce space requirements, a technique called block addressing is used. The text is divided in blocks, and the occurrences point to the blocks where the word appears (instead of the exact positions)
Hence, searching on an inverted index always starts in the vocabulary.
为了节省空间,块寻址技术被应用。将文本分成若干个块,事件表指向单词所在块的位置(而无需指示出精确的位置,因此在倒排索引上进行检索总是从词汇表开始的
the most time-demanding operation on inverted indices is the merging or intersection of the lists of occurrences.
在倒排索引上进行操作时花费最多的时间是用于事件表的合并和相交
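For a query such as "a and b", the two occurrence lists have to be intersected. A sketch of the usual linear merge over sorted lists follows; the lists are assumed to hold sorted document or block numbers, and the values are invented.

```python
def intersect_sorted(a, b):
    """Intersect two sorted lists in one pass over both."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


common = intersect_sorted([1, 3, 5, 8, 13], [2, 3, 8, 9, 13, 21])
```

Each step advances at least one of the two lists, so the cost grows with the combined length of the lists, which is why long lists dominate query time.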
This index sees the text as one long string. Each position in the text is considered as a text suffix将文本看作一个长的字符串。文本中的每个位置都被认为是文本的一个后缀。
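A sketch of a suffix array built over every character position, under the assumption that the text is small enough to sort its suffixes directly. The construction is deliberately naive, and the function names are illustrative.

```python
def build_suffix_array(text):
    """Text positions sorted by the suffix that starts at each position."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Binary search for the first suffix that starts with `pattern`."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # All matching suffixes are adjacent in the sorted order.
    positions = []
    while lo < len(suffix_array) and text.startswith(pattern, suffix_array[lo]):
        positions.append(suffix_array[lo])
        lo += 1
    return sorted(positions)


text = "banana"
sa = build_suffix_array(text)
hits = find_occurrences(text, sa, "ana")
```

Because the suffixes are sorted, every occurrence of a pattern forms one contiguous range of the array.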
Signature files are word-oriented index structures based on hashing.
签名档是以散列变换为基础的面向单词的索引结构
These algorithms are used when operating on sets of results, which is the case in Boolean queries.
这些算法主要用于在结果集合上进行运算,布尔查询属于这种运算。
The problem of exact string matching字符串匹配 is: given a short pattern P of length m and a long text T of length n, find all the text positions where the pattern occurs.
给定一个长度为m的短模式P,以及一个长度为n的长文本T,在文本T中找出出现模式P的所有位置
The brute-force(BF)algorithm is the simplest possible one. It consists of merely trying all possible pattern positions in the text
简单的找到文本中所有可能模式的位置.
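The brute-force approach is short enough to show in full. Positions are 0-based here, and the function name is an assumption.

```python
def brute_force_search(text, pattern):
    """Try every possible start position and compare the pattern there."""
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


positions = brute_force_search("abracadabra", "abra")
```

In the worst case every one of the n - m + 1 positions needs m character comparisons, which is why faster algorithms such as shift-or exist.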
Searching and compression were traditionally regarded as exclusive operations
检索和压缩是.相互独立的操作.
Texts which were not to be searched could be compressed, and to search a compressed text it had to be decompressed first. In recent years, very efficient compression techniques have appeared that allow searching directly in the compressed text.
没有被检索的文本可以进行压缩,而检索一个压缩过的文本就必须对其进行解压。最近几年,在压缩文件中直接检索文件技术已经出现。
第9章
To support the demanding requirements of modern search environments, we must turn to alternative architectures and algorithms.
为了适应现代检索环境的要求,我们必须转向其他一些检索体系和算法
In this chapter we explore parallel and distributed information retrieval techniques.
在本章中,我们探讨并行和分布式信息检索技术。
Parallel computing is the simultaneous application of multiple processors to solve a single problem, where each processor works on a different part of the problem. With parallel computing, the overall time required to solve the problem can be reduced to the amount of time required by the longest running part.
并行计算是多处理器同时处理同一个单个问题, 每个处理器对应处理一个问题的不同部分。并行处理所用时间是不同部分中花费最长求解时间。
MIMD is the most general and most popular class of parallel architectures.
多指令多数据流是现在最为通用和流行的一类并行体系结构
Ideally, when running a parallel algorithm on N processors, we would obtain perfect speedup, or S = N.
理想情况下,在N个处理器上同时运行一个并行算法,可以获得最佳的加速率,即S=N.
In practice, perfect speedup is unattainable either because the problem cannot be decomposed into N equal subtasks, the parallel architecture imposes control overheads (e.g., scheduling调度 or synchronization同步), or the problem contains an inherently sequential component.
在实际中,理想的加速是不可能实现的。因为问题不可能被分解成N个相等的子任务,并行结构需要额外的控制开销(如调度或同步),或者问题中有些部分只能顺序处理。
The simplest way in which a retrieval system can exploit a MIMD computer is through the use of multitasking. Each of the processors in the parallel computer runs a separate, independent search engine. The search engines do not cooperate to process individual queries
一个最简单方式就是采用多任务处理模式,并行计算机中的每个处理器运行一个分散的、独立的搜索引擎, 搜索引擎并不协同处理单个查询。
Note, however, that the response time of individual queries remains unchanged.
然而要注意的是,每个查询的响应时间不变。
This high level data representation reveals two possible methods for partitioning the data. The first method is document partitioning. The second method is term partitioning.
高级数据表达技术提供了两种数据分割方法: 文献分割法和语词分割法
When the distributed system is centrally administered, more options are available. The first option is simple replication of the collection across all of the search servers. The second option is random distribution of the documents. The final option is explicit semantic partitioning of the documents.
当分布式系统被集中管理时,有很多的选择方案,首先,在所有搜索服务器上对文献集进行复制,然后将文献集随机分布,最后对文献集合进行语义分割
Source selection is the process of determining which of the distributed document collections are most likely to contain relevant documents for the current query, and therefore should receive the query for processing.
信息源的选取指明确哪些分布式文献集最有可能包含与当前查询相关的文献,即对于当前查询而言,应该检索哪些文献集
One approach is to always assume that every collection is equally likely to contain relevant documents and simply broadcast the query to all collections
一种方法是:假定每个文献集合包含相关文献的概率相等然后把查询发给所有的文献集合.
When document collections are partitioned into semantically meaningful collections or it is prohibitively expensive to search every collection every time, the collections can be ranked according to their likelihood of containing relevant documents
当文献被分割成语义集合或是每次搜索花费都很高,文献集将按其所含相关文献可能性的大小进行排序.
Query processing in a distributed IR system proceeds as follows:
(1) Select collections to search.
(2) Distribute query to selected collections
(3) Evaluate query at distributed collections in parallel.
(4) Combine results from distributed collections into final result
As described in the previous section, Step 1 may be eliminated if the query is always broadcast to every document collection in the system.
分布式信息检索系统的查询处理步骤如下:
(1)选取要检索的文献集
(2)将查询分发给选取的文献集
(3)在分发了查询的文献集中对查询进行并行处理
(4)将各文献集的中间结果合并成最终结果
如前所述,如果查询总是在系统中每个文献中传播,第一步骤可以删除。
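The four steps can be sketched on one machine, with threads standing in for remote search servers. The collections, the term-count scoring, and every function name below are assumptions for illustration; a real system would also have to make scores from different collections comparable before merging.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical distributed collections, held in memory for the sketch.
collections = {
    "news": {"n1": "web search engines rank pages", "n2": "football scores"},
    "papers": {"p1": "parallel search of inverted files",
               "p2": "search engines and web crawling"},
    "recipes": {"r1": "bread and butter"},
}


def score(query_terms, text):
    words = text.split()
    return sum(words.count(term) for term in query_terms)


def select_collections(query_terms):
    """Step 1: keep the collections that contain at least one query term."""
    return [name for name, docs in collections.items()
            if any(score(query_terms, text) > 0 for text in docs.values())]


def evaluate(name, query_terms):
    """Step 3: evaluate the query locally at one collection."""
    hits = []
    for doc_id, text in collections[name].items():
        s = score(query_terms, text)
        if s > 0:
            hits.append((s, doc_id))
    return hits


def distributed_search(query, k=3):
    query_terms = query.split()
    selected = select_collections(query_terms)
    with ThreadPoolExecutor() as pool:  # steps 2 and 3: distribute, evaluate
        partial = list(pool.map(lambda name: evaluate(name, query_terms), selected))
    merged = [hit for hits in partial for hit in hits]  # step 4: combine
    merged.sort(key=lambda hit: (-hit[0], hit[1]))
    return [doc_id for _, doc_id in merged[:k]]


results = distributed_search("web search")
```

Replacing `select_collections` with a function that returns every collection name gives the broadcast variant, in which step 1 is dropped.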
第10章
provide informative feedback提供说明性的反馈信息, permit easy reversal of actions允许用户随时修改操作, support an internal locus of control支持内部的控制轨迹, reduce working memory load降低工作存储器负载, and provide alternative interfaces for novice and expert users.并为新老用户分别提供相应的可选界面
Design Principles设计原则
Offer informative feedback.提供说明性的反馈信息
Reduce working memory load.降低工作存储器负载
Provide alternative interfaces for novice and expert users.为新、老用户提供可选界面
Humans are highly attuned to images and visual information
人类已经在很大程度上习惯于接受图象和可视化信息
The standard evaluations emphasize high recall levels; in the TREC tasks systems are compared to see how well they return the top 1000 documents
标准评价追求高查全率,在TREC中对各系统的评价比较主要是看它们返回前1000个文献的能力如何
useful metrics beyond precision and recall include: time required to learn the system, time required to achieve goals on benchmark tasks, error rates, and retention of the use of the interface over time
除了检全率和检准率还包括: 学习系统所需时间,实现基准任务的目标所需时间,出错率,界面长期使用方式的一致性。
an interaction cycle交互循环模式
(1)Start with an information need.以信息需求为出发点
(2)Select a system and collections to search on.选择查询系统及作为查询对象的信息集合
(3)Formulate a query.构造查询
(4)Send the query to the system向系统提交查询.
(5)Receive the results in the form of information items接收以信息项形式表达的查询结果
(6)Scan, evaluate, and interpret the results.查看、评价并说明查询结果
(7)Either stop; or,结束查询,或者
(8)Reformulate the query and go to step 4.重构查询并返回第四步
Search interfaces must provide users with good ways to get started.检索界面应该给用户提供较好的检索起始方式 An empty screen or a blank entry form空白的登陆表格 does not provide clues 提示to help a user decide how to start the search process.
In this section we will discuss four main types of starting points: lists, overviews, examples, and automated source selection.四种主要类型的检索起始方式:列表,概述,实例以及信息源的自动选择。
An overview can show the topic domains represented within the collections向用户展示信息集合涉及的主题范围, to help users select or eliminate sources from consideration帮助用户添加或删除某些检索信息集合.
A general framework for a query is shown to the user who then modifies it to construct a partially complete description of what they want.系统先给出查询的一般结构形式,然后用户通过修改这一结构来对其需求进行初步描述
Foremost among these is that most people find the basic syntax counter-intuitive.
首要的原因是大多数人发现其基本的语法与直觉相悖
A standard interface for relevance feedback 标准的相关反馈界面consists of a list of titles题名列表 with checkboxes beside the titles 题名旁边有校验框that allow the user to mark relevant documents.
In some cases users are allowed to indicate a value on a relevance scale在有些情况下还允许用户为相关度指定一个值
第11-12章
Multimedia systems must have the capability to store, retrieve, transport, and present data with very heterogeneous characteristics such as text, images (both still and moving), graphs and sound.
多媒体系统必须能够存储、检索、传送和表示具有不同特性的数据,如文本、图像(静止的和动态的)、图表和声音等
Data Modeling 1 数据建模
Data Retrieval 2 数据检索
The main goal of a Multimedia IR system is to efficiently perform retrieval, based on user requests, exploiting not only data attributes, as in traditional DBMS , but also the content of multimedia objects. 主要目标是根据用户需求有效地进行检索,不仅像传统数据库管理系统中那样利用数据的属性,而且还利用多媒体对象的内容
A Multimedia IR system differs from a traditional IR system in two main aspects. First, multimedia objects are more complex. Second, object retrieval is mainly based on a similarity approach.
多媒体信息检索系统与传统信息检索系统的两点不同:变得更复杂; 对象检索基于相似性方法
Multimedia IR systems should therefore combine both the DBMS and the IR technology, to integrate the data modeling capabilities of DBMS with the advanced and similarity-based query capabilities of IR systems.
多媒体信息检索系统应该是数据库管理系统和信息检索技术的结合,把数据库管理系统的数据建模功能和信息检索系统的高级检索、基于相似性查询功能结合起来
attribute-based queries基于属性的查询 as well as content-based queries.基于内容的查询
extensible type system.可扩充型系统
Because of the semi-structured nature of multimedia objects, the previous approach is no longer adequate in a Multimedia IR system.
由于多媒体对象半结构得特性, 早期的方法已经不再适用多媒体 IR 系统。
the exact match is only one of the possible ways of querying multimedia objects. More often, a similarity-based approach is applied that considers both the structure and the content of the objects. Queries of the latter type are called content-based queries since they retrieve multimedia objects depending on their global content.
精确匹配仅仅是查询多媒体数据的可能方法之一, 基于相似性的方法它不仅考虑对象的结构而且还考虑它的内容,基于内容的查询它检索多媒体对象主要依赖整体的内容。
a query by example 实例查询
In general, query predicates can be classified into three different groups:
Attribute predicates属性判定
Structural predicates结构判定
Semantic predicates语义判定
Similarity queries can be classified into two categories: 相似查询分为两种类型:
Whole match完全匹配
Sub-pattern match 子模式匹配
The QBIC (Query By Image Content通过图像内容查询) project studies methods to query large online image databases using the images' content as the basis of the queries. Examples of the content include color, texture, shape, position, and dominant edges of image items and regions.
研究如何利用图像内容作为查询的基础对大型联机图像数据库进行查询,这些内容包括颜色、纹理、形状、位置以及图像中对象的主要边缘和区域主要边缘
第13章
There are basically three different forms of searching the Web. The first is to use search engines that index a portion of the Web documents as a full-text database. The second is to use Web directories, which classify selected Web documents by subject. The third, not yet fully available, is to search the Web by exploiting its hyperlink structure.
搜索Web基本上有三种方式。第一种是使用搜索引擎,它把一部分Web文献作为全文数据库进行标引;第二种是使用Web目录,它按主题对所选择的Web文献进行分类;第三种是利用Web的超链接结构进行检索,这种方式目前还没有完全成熟。
So, the overall challenge, in spite of the intrinsic problems固有的问题 posed by the Web, is to submit a good query to the search system, and obtain a manageable and relevant answer
因此, 尽管网络本身存在一些固有的问题,最大的挑战是给检索系统提交一个好的查询以及获得可管理的相关返回结果.
retrieval systems that model the Web as a full-text database检索系统将Web看作一个全文本数据库.
One main difference between standard IR systems and the Web is that, in the Web, all queries must be answered without accessing the text (that is, only the indices索引 are available).
标准的信息检索系统和web检索最大的不同是:在web中全部的查询的解答不必访问文本。
Most search engines use a centralized crawler-indexer architecture. Crawlers are programs (software agents, 软件代理) that traverse the Web, sending new or updated pages to a main server where they are indexed. Crawlers are also called robots, spiders, wanderers, walkers (步行者), and knowbots.
There are two important aspects of the user interface of search engines: the query interface and the answer interface.
The answer usually consists of a list of the ten top-ranked Web pages.
All search engines also provide a query interface for complex queries, as well as a command language that includes Boolean operators and other features such as phrase search, proximity search, and wild cards.
Three classic ranking algorithms: Boolean spread (布尔扩展), vector spread (向量扩展), and most-cited (最大引用).
Some of the new ranking algorithms also use hyperlink information.
The number of hyperlinks that point to a page provides a measure of its popularity and quality.
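A minimal sketch of the in-link count as a popularity measure follows. It assumes a tiny in-memory link graph with made-up page names, and the function names inlink_counts and rank_by_inlinks are hypothetical. It illustrates the counting idea only, not any particular search engine's ranking algorithm.

```python
from collections import Counter

def inlink_counts(links):
    """For each page, count how many distinct other pages link to it."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):       # count each source page once
            if target != source:          # ignore self-links
                counts[target] += 1
    return counts

def rank_by_inlinks(links):
    """Pages ordered by descending in-link count, ties broken by name."""
    counts = inlink_counts(links)
    pages = set(links) | set(counts)
    return sorted(pages, key=lambda p: (-counts[p], p))

# Hypothetical link graph: page -> pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "c"],
}
ranking = rank_by_inlinks(links)
```

Here page c is pointed to by three different pages, so it ranks first; page d has no in-links and ranks last.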
Another technique is to partition the Web (分割网络) using country codes (国家代码) or Internet names (因特网名字), assign one or more robots to each partition, and explore each partition exhaustively (全面地).
The order in which the URLs are traversed is important.
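To see why traversal order matters, the sketch below visits the same small link graph in breadth-first and in stack-based depth-first order. The graph, its URLs, and the function name crawl_order are assumptions made for illustration; a real crawler would fetch pages over the network rather than read a dictionary.

```python
from collections import deque

def crawl_order(graph, seed, breadth_first=True):
    """Return the order in which URLs are visited, starting from seed.

    breadth_first=True treats the frontier as a queue (oldest URL first);
    breadth_first=False treats it as a stack (newest URL first).
    """
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for nxt in graph.get(url, []):
            if nxt not in seen:           # never enqueue a URL twice
                seen.add(nxt)
                frontier.append(nxt)
    return order

# Hypothetical site structure: URL -> URLs it links to.
graph = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
}
bfs = crawl_order(graph, "/", breadth_first=True)
dfs = crawl_order(graph, "/", breadth_first=False)
```

The breadth-first order covers every page one link away from the seed before going deeper, while the depth-first order follows one branch to its end first.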
Most indices use variants of the inverted file (倒排文档).
In short, an inverted file is a sorted list of words (the vocabulary), each one having a set of pointers to the pages where it occurs.
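The definition above can be sketched as a small data structure. This is a simplified illustration, assuming pages are short strings keyed by integer IDs and that a word is any run of letters or digits; the names build_inverted_file and search_and are hypothetical, and real indices are compressed variants of this idea.

```python
import re
from collections import defaultdict

def build_inverted_file(pages):
    """Map each word of the sorted vocabulary to the sorted IDs of pages containing it."""
    postings = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            postings[word].add(page_id)
    # Dicts keep insertion order, so inserting in sorted order yields a sorted vocabulary.
    return {word: sorted(ids) for word, ids in sorted(postings.items())}

def search_and(index, *words):
    """IDs of pages that contain every query word (Boolean AND)."""
    sets = [set(index.get(w.lower(), [])) for w in words]
    return sorted(set.intersection(*sets)) if sets else []

# Made-up pages, for illustration only.
pages = {
    1: "Web search engines index pages",
    2: "Inverted files index words",
    3: "Search the Web",
}
index = build_inverted_file(pages)
both = search_and(index, "web", "search")
```

A query is answered from the pointer lists alone, without scanning the page text.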
The best and oldest example of a Web directory is Yahoo!
Web directories are also called catalogs (类目), yellow pages (黄页), or subject directories (主题目录).
Directories are hierarchical taxonomies that classify human knowledge.
The main advantage of this technique is that if we find what we are looking for, the answer will be useful in most cases. On the other hand, the main disadvantage is that the classification is not specialized enough and that not all Web pages are classified.
Metasearchers are Web servers that send a given query to several search engines, Web directories, and other databases, collect the answers, and unify them.
The main advantages of metasearchers are the ability to combine the results of many sources and the fact that the user can pose the same query to various sources through a single common interface.
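The send, collect, and unify steps can be sketched as follows. Everything here is an assumption made for illustration: the engines are stand-in functions returning ranked lists of made-up URLs, and the merge rule (summing reciprocal ranks across engines) is one simple choice, not the method of any particular metasearcher.

```python
def metasearch(query, engines, top_k=10):
    """Send query to every engine, then merge the ranked answers into one list."""
    scores = {}
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            # A URL returned by several engines accumulates score, so duplicates merge.
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    return sorted(scores, key=lambda u: (-scores[u], u))[:top_k]

# Hypothetical engines with fixed answers; real ones would be remote services.
def engine_a(query):
    return ["x.example/1", "y.example/2", "z.example/3"]

def engine_b(query):
    return ["y.example/2", "x.example/1", "w.example/4"]

merged = metasearch("information retrieval", [engine_a, engine_b])
```

The user poses one query through a single function call, and results that several sources agree on rise to the top of the unified list.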
Dynamic search in the Web is equivalent to sequential text searching (顺序文本检索). The idea is to use an online search to discover relevant information by following links. The main advantage is that you are searching in the current structure of the Web, and not in what is stored in the index of a search engine.
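A rough sketch of dynamic search is given below. It assumes a caller-supplied fetch function (hypothetical) that returns a page's text and outgoing links, and it uses a small in-memory site in place of the live Web. Each page is scanned sequentially for the term as it is fetched, and no index is consulted.

```python
from collections import deque

def dynamic_search(fetch, start_url, term, max_pages=50):
    """Follow links from start_url and return URLs whose current text contains term."""
    frontier = deque([start_url])
    seen = {start_url}
    matches = []
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        text, links = fetch(url)          # online access to the page as it is now
        fetched += 1
        if term.lower() in text.lower():  # sequential scan of the page text
            matches.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return matches

# Stand-in for the live Web: URL -> (text, outgoing links).
site = {
    "home": ("welcome page", ["news", "docs"]),
    "news": ("retrieval news today", ["home"]),
    "docs": ("information retrieval manual", []),
}

def fetch(url):
    return site.get(url, ("", []))

found = dynamic_search(fetch, "home", "retrieval")
```

The max_pages limit reflects that following links online is slow, so such a search is usually restricted to a small part of the Web.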
Chapter 14
This early adoption took two main forms: searching remote electronic databases provided by commercial vendors in order to provide reference services to patrons, and the creation and searching of catalog records for materials held within the library.
Online IR Systems and Document Databases (联机信息检索系统和文献数据库)
A synergistic relationship exists between the producers and vendors of document databases.
Online Retrieval Systems (联机检索系统)
Today DIALOG operates worldwide, with databases offered via the Internet to libraries and other organizations as well as individuals.
Library catalogs serve as lists of the library's holdings, organized as finding tools for the collection.
By the 1980s, true online public access catalogs (OPACs) had been implemented.
The most common type of searching in OPACs is subject searching, and failures by users in topical searching are well documented. Common failures are null sets ('zero results') or, at the other extreme, information overload.