  • Reference documents on information extraction

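    Original by Line Eikvil (July 1999); translated by Chen Hongbiao (陈鸿标) (March 2003)
    Information extraction (IE) turns the information contained in text into a structured, table-like form. The input to an IE system is raw text, and the output is a set of information points in a fixed format. These information points are extracted from all kinds of documents and then integrated in a uniform form. This is the main task of information extraction ...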
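
    The raw-text-in, fixed-format-out idea described above can be illustrated with a minimal Python sketch. The `Appointment` record, the hand-written pattern, and the sample sentences are all assumptions made up for illustration; they do not come from the report.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Appointment:
    """Hypothetical fixed-format information point (one row of the output table)."""
    person: str
    position: str
    organization: str


# Illustrative hand-written pattern (an assumption, not from the report):
# "<Person> was named <position> of <Organization>"
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+) was named "
    r"(?P<position>[a-z ]+?) of "
    r"(?P<organization>[A-Z][\w&]*(?: [A-Z][\w&]*)*)"
)


def extract(text: str) -> List[Appointment]:
    """Turn raw text into zero or more fixed-format records."""
    return [Appointment(**m.groupdict()) for m in PATTERN.finditer(text)]


# Documents from different sources end up in one uniform table.
docs = [
    "Jane Smith was named chief executive of Acme Widgets yesterday.",
    "Analysts said John Brown was named chairman of Northern Rail after the vote.",
]
records = [rec for doc in docs for rec in extract(doc)]
for rec in records:
    print(rec)
```

    Real systems replace the single regular expression with much richer linguistic analysis, but the contract is the same: unstructured text goes in, uniform records come out.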

    2.Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence
     Silviu Cucerzan ,David Yarowsky


      This is a report introducing information extraction (IE), covering MUC, Web extraction, and related topics.

    5.FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text
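      This paper describes FASTUS, a system for extracting information from natural-language text; the extracted information is entered into a database or used for other purposes.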
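
      The "cascaded" idea in the title can be sketched as a pipeline in which each stage is a set of regular expressions (and so recognisable by a finite-state machine) that works on the output of the previous stage. The two stages, the tag names, and the sample sentence below are assumptions for illustration only; they are not FASTUS's actual stages or grammar.

```python
import re

# Stage 1 (assumed): mark simple phrases in the raw text.
STAGE1 = [
    (re.compile(r"\b([A-Z][a-z]+ (?:Corp|Inc|Ltd)\.?)"), r"<ORG>\1</ORG>"),
    (re.compile(r"\b((?:Mr|Ms|Dr)\. [A-Z][a-z]+)"), r"<PER>\1</PER>"),
]

# Stage 2 (assumed): match an event pattern over the output of stage 1.
STAGE2 = re.compile(
    r"<ORG>(?P<org>.+?)</ORG> (?:hired|appointed) <PER>(?P<per>.+?)</PER>"
)


def stage1(text):
    """Rewrite the text, wrapping recognised phrases in tags."""
    for pattern, replacement in STAGE1:
        text = pattern.sub(replacement, text)
    return text


def stage2(tagged):
    """Read database-ready records off the tagged text."""
    return [
        {"organization": m["org"], "person": m["per"]}
        for m in STAGE2.finditer(tagged)
    ]


def cascade(text):
    """Run the stages in order; each one consumes the previous one's output."""
    return stage2(stage1(text))


print(cascade("Bridgestone Corp. hired Mr. Tanaka last week."))
```

      Keeping each stage small means the later patterns can be written over tags rather than over raw words.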

    6.MUC-7 Information Extraction Task Definition


    8.Information Extraction: Techniques and Challenges
    This paper introduces information extraction (IE) techniques (18 pages).


    10.Class-based Language Modeling for Named Entity Identification (Draft)
    Jian Sun, Ming Zhou, Jianfeng Gao

    (Accepted by the special issue "Word Formation and Chinese Language Processing" of the International Journal of Computational Linguistics and Chinese Language Processing) Abstract: We address in this paper the problem of Chinese named entity (NE) identification using class-based language models (LM). This study concentrates on the three kinds of NEs that are most commonly used, namely personal names (PER), location names (LOC) and organization names (ORG). Our main contributions are three-fold: (1) In our research, Chinese word segmentation and NE identification have been integrated into a unified framework. It consists of several sub-models, each of which may in turn include other sub-models, which gives the overall model a hierarchical architecture. The class-based hierarchical LM not only effectively captures the features of named entities, but also handles the data sparseness problem. (2) Modeling for NE abbreviation is put forward. Our modeling-based method for NE abbreviation has significant advantages over rule-based ones. (3) In addition, we employ a two-level architecture for the ORG model, so that the nested entities in organization names can be identified. When decoding, a two-step strategy is adopted: identifying PER and LOC first, and then identifying ORG. The evaluation on large, wide-coverage open-test data has empirically demonstrated that the class-based hierarchical language modeling, which integrates segmentation and NE identification and unifies the abbreviation modeling into one framework, achieves competitive results for Chinese NE identification.

    Scott Miller, Michael Crystal, Heidi Fox, Lance Ramshaw, Richard Schwartz,

    12.(slides) Chinese Named Entity Identification using class-based language model
    Jian Sun, Jianfeng Gao, Lei Zhang, Ming Zhou, and Changning Huang
    These are the slides presented at the 19th International Conference on Computational Linguistics.

    13.Chinese Named Entity Identification using class-based language model
    Jian Sun, Jianfeng Gao, Lei Zhang, Ming Zhou, and Changning Huang
    We consider here the problem of Chinese named entity (NE) identification using a statistical language model (LM). In this research, word segmentation and NE identification have been integrated into a unified framework that consists of several class-based language models. We also adopt a hierarchical structure for one of the LMs so that the nested entities in organization names can be identified. The evaluation on a large test set shows consistent improvements. Our experiments further demonstrate the improvement obtained by seamlessly integrating linguistic heuristic information, a cache-based model and NE abbreviation identification.

    14.MUC-7 EVALUATION OF IE TECHNOLOGY: Overview of Results
    Elaine Marsh, Dennis Perzanowski
    Reviews MUC-7 and presents the results and progress reported at the conference.

    15.Method of k-Nearest Neighbors


    16.Multilingual Topic Detection and Tracking: Successful Research Enabled by Corpora and Evaluation
    Charles L. Wayne
    Topic Detection and Tracking (TDT) refers to automatic techniques for locating topically related material in streams of data such as newswire and broadcast news. DARPA-sponsored research has made enormous progress during the past three years, and the tasks have been made progressively more difficult and realistic. Well-designed corpora and objective performance evaluations have enabled this success.


    18.Information Extraction Supported Question Answering


    20.Description of the American University in Cairo's System Used for MUC-7


    21.Analyzing the Complexity of a Domain With Respect To An Information Extraction Task



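    The author, Stephen Soderland, is a professor in the Department of Computer Science at the University of Washington. The paper has been cited more than 50 times. Using the WHISK information extraction system as its example, it describes a machine-learning technique that trains the system on a small set of examples so that it automatically learns extraction patterns for the target text, which makes information extraction automatic. The technique is both highly instructive and of real practical value.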










    27.XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources

    28.Data Mining on Symbolic Knowledge Extracted from the Web

    Rayid Ghani, Rosie Jones, Dunja Mladenić, Kamal Nigam, Seán Slattery
    Information extractors and classifiers operating on unrestricted, unstructured texts are an errorful source of large amounts of potentially useful information, especially when combined with a crawler which automatically augments the knowledge base from the World Wide Web. At the same time, there is much structured information on the World Wide Web. Wrapping the web sites which provide this kind of information provides us with a second source of information, possibly less up-to-date, but reliable as facts. We give a case study of combining information from these two kinds of sources in the context of learning facts about companies. We provide results of association rules, propositional and relational learning, which demonstrate that data mining can help us improve our extractors, and that using information from two kinds of sources improves the reliability of data-mined rules.

    29.A Brief Survey of Web Data Extraction Tools
    Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Altigran S. da Silva, Juliana S. Teixeira

    In the last few years, several works in the literature have addressed the problem of data extraction from Web pages. The importance of this problem derives from the fact that, once extracted, the data can be handled in a way similar to instances of a traditional database. The approaches proposed in the literature to address the problem of Web data extraction use techniques borrowed from areas such as natural language processing, languages and grammars, machine learning, information retrieval,...

    30.Toward Semantic Understanding: An Approach Based on Information Extraction Ontologies
    Information is ubiquitous, and we are flooded with more than we can process. Somehow, we must rely less on visual processing, point-and-click navigation, and manual decision making and more on computer sifting and organization of information and automated negotiation and decision making. A resolution of these problems requires software with semantic understanding, a grand challenge of our time. More particularly, we must solve problems of automated interoperability, integration, and knowledge sharing, and we must build information agents and process agents that we can trust to give us the information we want and need and to negotiate on our behalf in harmony with our beliefs and goals. This paper proffers the use of information-extraction ontologies as an approach that may lead to semantic understanding.
    Keywords: Semantics, information extraction, high-precision classification, schema mapping, data integration, Semantic Web, agent communication, ontology, ontology generation.

    The Chinese message structure is composed of several Chinese fragments, which may be characters, words, or phrases. Every message structure carries certain information. We have developed a HowNet-based extractor that can extract Chinese message structures from real text and serves as an interactive tool for building a large-scale bank of Chinese message structures. The system uses the HowNet Knowledge System as its basic resource. It is an integrated system combining a rule-based analyzer, example-based statistics, and analogy given by a HowNet-based concept similarity calculator.
    Keywords: Chinese message structure; Knowledge Database Mark-up Language (KDML); parsing

    32.Wrapper Induction: Efficiency and Expressiveness (Extended Abstract)

    Recently, many systems have been built that automatically interact with Internet information resources. However, these resources are usually formatted for use by people; e.g., the relevant content is embedded in HTML pages. Wrappers are often used to extract a resource's content, but hand-coding wrappers is tedious and error-prone. We advocate wrapper induction, a technique for automatically constructing wrappers. We have identified several wrapper classes that can be learned quickly (most sites require only a handful of examples, consuming a few CPU seconds of processing), yet which are useful for handling numerous Internet resources (... of surveyed sites can be handled by our ...
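
    As a rough illustration of learning a wrapper from a handful of examples, the Python sketch below induces a pair of left/right delimiters for a single field from one labelled page and then applies them to an unseen page. It is a simplified, assumed variant of delimiter-based wrappers: the abstract does not specify the wrapper classes, and the function names and sample pages are hypothetical.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def learn_lr_wrapper(page, values):
    """Induce (left, right) delimiters from labelled values, given in page order."""
    lefts, rights = [], []
    pos = 0
    for value in values:
        start = page.index(value, pos)
        end = start + len(value)
        lefts.append(page[:start])   # everything before the value
        rights.append(page[end:])    # everything after the value
        pos = end
    left = common_suffix(lefts)
    right = os.path.commonprefix(rights)
    if not left or not right:
        raise ValueError("no consistent delimiters found for these examples")
    return left, right


def apply_wrapper(wrapper, page):
    """Extract every string that sits between the learned delimiters."""
    left, right = wrapper
    results, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        results.append(page[start:end])
        pos = end + len(right)
    return results


# Hypothetical training page with two hand-labelled values.
train = "<ul><li><b>Congo</b> <i>242</i></li><li><b>Spain</b> <i>34</i></li></ul>"
wrapper = learn_lr_wrapper(train, ["Congo", "Spain"])

# Unseen page from the same (hypothetical) site.
new_page = "<ul><li><b>Belize</b> <i>501</i></li><li><b>Japan</b> <i>81</i></li></ul>"
print(wrapper, apply_wrapper(wrapper, new_page))
```

    With too few or unlucky examples the learned delimiters can overfit (for instance, by absorbing characters the examples happen to share), which is why an induction method needs enough examples to separate the page template from the data.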

    33.WysiWyg Web Wrapper Factory (W4F)

    In this paper, we present the W4F toolkit for the generation of wrappers for Web sources. W4F consists of a retrieval language to identify Web sources, a declarative extraction language (the HTML Extraction Language) to express robust extraction rules and a mapping interface to export the extracted information into some user-defined data-structures. To assist the user and make the creation of wrappers rapid and easy, the toolkit offers some wysiwyg support via some wizards. Together, they permit the fast and semi-automatic generation of ready-to-go wrappers provided as Java classes. W4F has been successfully used to generate wrappers for database systems and software agents, making the content of Web sources easily accessible to any kind of application.

    34.Adaptive Information Extraction from Text by Rule Induction and Generalisation
    (LP)² is a covering algorithm for adaptive Information Extraction from text (IE). It induces symbolic rules that insert SGML tags into texts by learning from examples found in a user-defined tagged corpus. Training is performed in two steps: initially a set of tagging rules is learned; then additional rules are induced to correct mistakes and imprecision in tagging. Induction is performed by bottom-up generalization of examples in the training corpus. Shallow knowledge about Natural Language Processing (NLP) is used in the generalization process. The algorithm has a considerable success story. From a scientific point of view, experiments report excellent results with respect to the current state of the art on two publicly available corpora. From an application point of view, a successful industrial IE tool has been based on (LP)². Real world applications have been developed and licenses have been released to external companies for building other applications. This paper presents (LP)², experimental results and applications, and discusses the role of shallow NLP in rule induction.
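
    The two kinds of rules mentioned in the abstract, tagging rules that insert SGML tags and correction rules that repair imprecise tags, can be pictured with the Python sketch below. It only shows how such rules might be applied: the rules are written by hand here rather than induced, and the `<stime>` tag, the conditions, and the sample sentence are assumptions, not taken from the paper.

```python
import re

# A token that looks like a clock time, e.g. "4" or "4:30" (assumed condition).
TIME = re.compile(r"^\d{1,2}(:\d{2})?$")


def apply_tagging_rules(tokens):
    """Step 1: insert SGML tags around a time that follows the word 'at'."""
    out = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1].lower() if i > 0 else ""
        if TIME.match(tok) and prev == "at":
            # The closing tag is placed right after the number, which is imprecise.
            out += ["<stime>", tok, "</stime>"]
        else:
            out.append(tok)
    return out


def apply_correction_rules(tagged):
    """Step 2: shift a closing tag one token right when 'am'/'pm' follows it."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i] == "</stime>" and out[i + 1].lower() in {"am", "pm"}:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out


tokens = "The seminar starts at 4 pm in room 12".split()
tagged = apply_tagging_rules(tokens)
corrected = apply_correction_rules(tagged)
print(" ".join(tagged))
print(" ".join(corrected))
```

    Splitting the work this way lets the first rule set stay simple and general while the second set fixes boundary errors.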

    35.Advanced Web Technology Information Extraction

    Ling Liu, Calton Pu, Wei Han

    This paper describes the methodology and the software development of XWRAP, an XML-enabled wrapper construction system for semi-automatic generation of wrapper programs. By XML-enabled we mean that the metadata about information content that are implicit in the original web pages will be extracted and encoded explicitly as XML tags in the wrapped documents. In addition, the query-based content filtering process is performed against the XML documents. The XWRAP wrapper generation framework has three distinct features. First, it explicitly separates tasks of building wrappers that are specific to a Web source from the tasks that are repetitive for any source, and uses a component library to provide basic building blocks for wrapper programs. Second, it provides a user-friendly interface program to allow wrapper developers to generate their wrapper code with a few mouse clicks. Third and most importantly, we introduce and develop a two-phase code generation framework. The first phase utilizes an interactive interface facility to encode the source-specific metadata knowledge identified by individual wrapper developers as declarative information extraction rules. The second phase combines the information extraction rules generated at the first phase with the XWRAP component library to construct an executable wrapper program for the given web source. We report the initial experiments on performance of the XWRAP code generation system and the wrapper programs generated by XWRAP.
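
    The split between source-specific declarative rules and a reusable component that turns them into an executable wrapper can be sketched in Python as follows. The rule format, the `build_wrapper` helper, and the sample HTML are assumptions for illustration; they are not XWRAP's actual rule language or component library.

```python
import re
import xml.etree.ElementTree as ET

# Phase 1 (assumed form): source-specific knowledge written as declarative rules.
RULES = {
    "record": r"<tr>(.*?)</tr>",
    "fields": {
        "city": r'<td class="c">(.*?)</td>',
        "temp": r'<td class="t">(.*?)</td>',
    },
}


def build_wrapper(rules):
    """Phase 2 (assumed): generic code that combines rules into a runnable wrapper."""
    record_re = re.compile(rules["record"], re.S)
    field_res = {name: re.compile(p, re.S) for name, p in rules["fields"].items()}

    def wrapper(html):
        # Metadata that is only implicit in the page becomes explicit XML tags.
        root = ET.Element("records")
        for rec in record_re.finditer(html):
            node = ET.SubElement(root, "record")
            for name, field_re in field_res.items():
                m = field_re.search(rec.group(1))
                if m:
                    ET.SubElement(node, name).text = m.group(1).strip()
        return ET.tostring(root, encoding="unicode")

    return wrapper


html = (
    '<table><tr><td class="c">Oslo</td><td class="t">4</td></tr>'
    '<tr><td class="c">Cairo</td><td class="t">31</td></tr></table>'
)
weather_wrapper = build_wrapper(RULES)
xml_out = weather_wrapper(html)
print(xml_out)
```

    Only `RULES` changes from one web source to the next; `build_wrapper` plays the role of the repetitive, source-independent part.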
  • Original source: https://www.cnblogs.com/cy163/p/687451.html