  • A pile of information extraction resources

    Keywords: structured information extraction

     


    "A pile" means unorganized, just heaped up; these were collected, not written by me.
    I will keep adding to it here, still as a "pile". Take a look if you are interested; if not, just skip it.


    If you have good papers, bring them out and share them with everyone.


     


    1. An Overview of Web Information Extraction Techniques (download)
    Line Eikvil, original (July 1999); translated by Chen Hongbiao (March 2003)
    Information Extraction (IE) turns the information contained in text into a structured, table-like form of organization. The input to an information extraction system is raw text; the output is information points in a fixed format. Information points are extracted from all kinds of documents and then integrated in a uniform form. This is the main task of information extraction...
    Chapter 1: Introduction
    Chapter 2: A brief introduction to information extraction techniques
    Chapter 3: Developing web-page wrappers
    Chapter 4: Web-site information extraction systems that have been developed
    Chapter 5: Application areas of information extraction and the first commercial systems in operation
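    As a toy illustration of the task described above (raw text in, fixed-format information points out), here is a minimal sketch; the pattern, field names, and example sentence are all invented for illustration and are not taken from the report:

```python
import re

# Toy IE "system": one hand-written pattern turns free text into
# fixed-format records. Pattern, fields, and sentence are invented.
PATTERN = re.compile(
    r"(?P<company>[A-Z]\w+) acquired (?P<target>[A-Z]\w+) on (?P<date>\d{4}-\d{2}-\d{2})"
)

def extract(text):
    """Return the fixed-format records (dicts) found in raw text."""
    return [m.groupdict() for m in PATTERN.finditer(text)]

records = extract("Acme acquired Widgets on 2003-05-01, the filing said.")
print(records)
# [{'company': 'Acme', 'target': 'Widgets', 'date': '2003-05-01'}]
```

    Real IE systems replace the single regular expression with learned patterns or grammars, but the input/output contract is the same.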


    2. Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence
     Silviu Cucerzan, David Yarowsky
     A language-independent approach to named entity recognition.


    3. A Survey of Information Extraction Research
      Wang Jianhui's research work on improving automatic summarization algorithms.


    4. A Survey of Information Extraction
      A report introducing Information Extraction, covering MUC, Web Extraction, and related topics.


    5. FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text
      This paper describes FASTUS, a system for extracting information from natural-language text; the extracted information can be entered into a database or put to other uses.


    6. MUC-7 Information Extraction Task Definition
      The definition of the MUC-7 information extraction tasks.


    7. OVERVIEW OF MUC-7/MET-2
     A brief overview of the MUC-7/MET-2 tasks.


    8. Information Extraction: Techniques and Challenges
    An introduction to Information Extraction (IE) techniques (18 pages).


    9. A Survey of Information Extraction Research. Li Baoli, Chen Yuzhong, Yu Shiwen
    Abstract: Information extraction research aims to provide people with more powerful information access tools, to meet the serious challenge posed by the information explosion. Unlike information retrieval, information extraction extracts factual information directly from natural-language text. Over the past decade or so, information extraction has gradually developed into an important branch of natural language processing. Its distinctive development path (driving research forward through systematic, large-scale quantitative evaluation) and some of its successful lessons, such as the effectiveness of partial parsing techniques and the need for rapid NLP system development, have greatly advanced natural language processing research and promoted the close integration of NLP research and applications. Reviewing the history of information extraction research and summarizing its current state will help this line of work move forward.


    10. Class-based Language Modeling for Named Entity Identification (Draft)
    Jian Sun, Ming Zhou, Jianfeng Gao


    (Accepted by the special issue "Word Formation and Chinese Language Processing" of the International Journal of Computational Linguistics and Chinese Language Processing.) Abstract: We address in this paper the problem of Chinese named entity (NE) identification using class-based language models (LM). This study concentrates on the three kinds of NEs that are most commonly used, namely personal names (PER), location names (LOC) and organization names (ORG). Our main contributions are three-fold. (1) Chinese word segmentation and NE identification are integrated into a unified framework. It consists of several sub-models, each of which may in turn include other sub-models, giving the overall model a hierarchical architecture. The class-based hierarchical LM not only effectively captures the features of named entities but also handles the data sparseness problem. (2) Modeling of NE abbreviations is put forward; our modeling-based method for NE abbreviations has significant advantages over rule-based ones. (3) In addition, we employ a two-level architecture for the ORG model, so that nested entities in organization names can be identified. When decoding, a two-step strategy is adopted: first identifying PER and LOC, then identifying ORG. Evaluation on a large, wide-coverage open test set has empirically demonstrated that the class-based hierarchical language modeling, which integrates segmentation and NE identification and unifies abbreviation modeling in one framework, achieves competitive results on Chinese NE identification.
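    The class-based idea in this abstract can be caricatured in a few lines: score a class sequence by class-transition probabilities times word-emission probabilities, then pick the best sequence. The probabilities and vocabulary below are invented toy numbers, and exhaustive search stands in for the real hierarchical decoder:

```python
import itertools

# Toy class-based model (all numbers invented for illustration):
# P(words, classes) = prod over i of P(c_i | c_{i-1}) * P(w_i | c_i)
trans = {("<s>", "PER"): 0.3, ("<s>", "O"): 0.7,
         ("PER", "PER"): 0.2, ("PER", "O"): 0.8,
         ("O", "PER"): 0.4, ("O", "O"): 0.6}
emit = {("PER", "zhang"): 0.5, ("O", "zhang"): 0.01,
        ("PER", "said"): 0.001, ("O", "said"): 0.3}

def best_tagging(words, classes=("PER", "O")):
    """Exhaustively score every class sequence and return the best one."""
    def score(seq):
        p, prev = 1.0, "<s>"
        for w, c in zip(words, seq):
            p *= trans[(prev, c)] * emit.get((c, w), 1e-6)
            prev = c
        return p
    return max(itertools.product(classes, repeat=len(words)), key=score)

print(best_tagging(["zhang", "said"]))  # ('PER', 'O')
```

    A real system would decode with Viterbi search and back the class models with large corpora; the scoring rule is the part the abstract describes.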


    11. BBN's Information Extraction System SIFT (detailed description in Chinese)
    Scott Miller, Michael Crystal, Heidi Fox, Lance Ramshaw, Richard Schwartz
    This is the description of SIFT, BBN's entry in the MUC-7 evaluation. I have translated it; the overall meaning is clear, but I may not have captured some details accurately. If you spot a problem, please write to me.


    12. (slides) Chinese Named Entity Identification using class-based language model
    Jian Sun, Jianfeng Gao, Lei Zhang, Ming Zhou, and Changning Huang
    These are the slides for the 19th International Conference on Computational Linguistics.


    13. Chinese Named Entity Identification using class-based language model
    Jian Sun, Jianfeng Gao, Lei Zhang, Ming Zhou, and Changning Huang
    We consider here the problem of Chinese named entity (NE) identification using a statistical language model (LM). In this research, word segmentation and NE identification have been integrated into a unified framework that consists of several class-based language models. We also adopt a hierarchical structure for one of the LMs so that the nested entities in organization names can be identified. The evaluation on a large test set shows consistent improvements. Our experiments further demonstrate the improvement after seamlessly integrating linguistic heuristic information, a cache-based model and NE abbreviation identification.


    14. MUC-7 EVALUATION OF IE TECHNOLOGY: Overview of Results
    Elaine Marsh, Dennis Perzanowski
    Reviews MUC-7 and presents the results and progress reported at the conference.


    15. Method of k-Nearest Neighbors
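    Since this entry carries no description, here is a generic textbook sketch of the k-nearest-neighbors method, not tied to any particular paper above: classify a query point by majority vote among its k closest training points.

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Majority vote among the k training points closest to `query`.

    `train` is a list of ((features...), label) pairs; distance is
    squared Euclidean, which preserves the nearest-neighbor ordering.
    """
    dist = lambda p: sum((a - b) ** 2 for a, b in zip(p, query))
    nearest = sorted(train, key=lambda item: dist(item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_classify(train, (1, 1)))  # A
```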


     


    16. Multilingual Topic Detection and Tracking: Successful Research Enabled by Corpora and Evaluation
    Charles L. Wayne
    Topic Detection and Tracking (TDT) refers to automatic techniques for locating topically related material in streams of data such as newswire and broadcast news. DARPA-sponsored research has made enormous progress during the past three years, and the tasks have been made progressively more difficult and realistic. Well-designed corpora and objective performance evaluations have enabled this success.


    17. An Overview of Information Extraction
    A survey report by Luo Weihua.


    18. Information Extraction Supported Question Answering
      Cymfony's IE system, mainly oriented toward question answering, covering the implemented NE system and prototypes of the planned CE and GE modules.


    19. ALGORITHMS THAT LEARN TO EXTRACT INFORMATION


    20. Description of the American University in Cairo's System Used for MUC-7


     


    21. Analyzing the Complexity of a Domain With Respect To An Information Extraction Task


     


    22. Learning Information Extraction Rules from Semi-structured and Free-format Text

    The author, Stephen Soderland, is a professor of computer science at Washington State University, and the paper has been cited more than 50 times. Using the WHISK extraction system as an example, it describes a technique in which machine learning over a small set of training samples lets the system automatically learn extraction patterns for the target text, enabling automated information extraction. The technique is both highly instructive and of real practical value.
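    WHISK's learned rules use wildcards to skip text and parenthesized slots to capture fields. The sketch below imitates the style of the rental-ad examples in the paper, but the rule syntax and its translation to a regular expression are simplified assumptions, not Soderland's actual formalism:

```python
import re

# Simplified WHISK-style rule: "*" skips text, parenthesised slots
# capture fields, quoted tokens must match literally. The rule below
# only resembles the paper's rental-ad examples; it is not its syntax.
def rule_to_regex(rule):
    parts = []
    for tok in rule.split():
        if tok == "*":
            parts.append(r".*?")                 # lazily skip text
        elif tok.startswith("(") and tok.endswith(")"):
            parts.append(r"(\d+)")               # both slots here capture digits
        else:
            parts.append(re.escape(tok.strip("'")))  # literal token
    return re.compile(r"\s*".join(parts))

rule = "* (Digits) 'BR' * '$' (Number)"
m = rule_to_regex(rule).search("Sunny 2 BR apartment, rent $675/mo")
print(m.groups())  # ('2', '675')
```

    The point of WHISK is that such rules are induced automatically from a handful of annotated examples rather than written by hand.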


    23. A Survey of Information Extraction Research

    From the Department of Computer Science and Technology, Peking University; it surveys the basic concepts of information extraction.


    24. Visual Information Extraction with Lixto

    The authors analyze the architecture of the Lixto extraction system and present a semi-automatic wrapper generation technique together with an automatic Web information extraction technique.


    25. A Survey of Web Data Extraction Tools

    The authors group current Web data extraction tools into six classes: wrapper development languages, HTML-aware tools, NLP-based tools, wrapper induction tools, modeling-based tools, and semantics-based tools. They describe each tool's working principles and characteristics in turn, and compare their typical output quality.


     


    26. An Extraction Annotation Tool for Short BBS Texts


    The first half of this paper introduces basic concepts of ontologies; the second half describes how the ontology is used in our system. Information extraction needs some prior knowledge and statistical information, so we built our own extraction annotation tool for short BBS texts, constructing the ontology knowledge behind it and presenting it in an intuitive way. Combined with an ontology reasoner, the annotation tool can reason while it annotates, making annotation smarter, and it can call a packaged extraction algorithm to preview extraction results.


    27. XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources


    Ling Liu, Calton Pu, Wei Han

    This paper describes the methodology and the software development of XWRAP, an XML-enabled wrapper construction system for semi-automatic generation of wrapper programs. By XML-enabled we mean that the metadata about information content that are implicit in the original web pages will be extracted and encoded explicitly as XML tags in the wrapped documents. In addition, the query-based content filtering process is performed against the XML documents. The XWRAP wrapper generation framework has three distinct features. First, it explicitly separates tasks of building wrappers that are specific to a Web source from the tasks that are repetitive for any source, and uses a component library to provide basic building blocks for wrapper programs. Second, it provides a user-friendly interface program to allow wrapper developers to generate their wrapper code with a few mouse clicks. Third, and most importantly, we introduce and develop a two-phase code generation framework. The first phase utilizes an interactive interface facility to encode the source-specific metadata knowledge identified by individual wrapper developers as declarative information extraction rules. The second phase combines the information extraction rules generated at the first phase with the XWRAP component library to construct an executable wrapper program for the given web source. We report the initial experiments on performance of the XWRAP code generation system and the wrapper programs generated by XWRAP.


    28. Data Mining on Symbolic Knowledge Extracted from the Web


    Rayid Ghani, Rosie Jones, Dunja Mladenić, Kamal Nigam, Seán Slattery
    Information extractors and classifiers operating on unrestricted, unstructured texts are an errorful source of large amounts of potentially useful information, especially when combined with a crawler which automatically augments the knowledge base from the World Wide Web. At the same time, there is much structured information on the World Wide Web. Wrapping the web sites which provide this kind of information provides us with a second source of information: possibly less up-to-date, but reliable as facts. We give a case study of combining information from these two kinds of sources in the context of learning facts about companies. We provide results of association rules, propositional and relational learning, which demonstrate that data mining can help us improve our extractors, and that using information from two kinds of sources improves the reliability of data-mined rules.


    29. A Brief Survey of Web Data Extraction Tools
    Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Altigran S. da Silva, Juliana S. Teixeira

    In the last few years, several works in the literature have addressed the problem of data extraction from Web pages. The importance of this problem derives from the fact that, once extracted, the data can be handled in a way similar to instances of a traditional database. The approaches proposed in the literature to address the problem of Web data extraction use techniques borrowed from areas such as natural language processing, languages and grammars, machine learning, information retrieval,...


    30. Toward Semantic Understanding: An Approach Based on Information Extraction Ontologies
    Information is ubiquitous, and we are flooded with more than we can process. Somehow, we must rely less on visual processing, point-and-click navigation, and manual decision making, and more on computer sifting and organization of information and automated negotiation and decision making. A resolution of these problems requires software with semantic understanding, a grand challenge of our time. More particularly, we must solve problems of automated interoperability, integration, and knowledge sharing, and we must build information agents and process agents that we can trust to give us the information we want and need and to negotiate on our behalf in harmony with our beliefs and goals. This paper proffers the use of information-extraction ontologies as an approach that may lead to semantic understanding.
    Keywords: semantics, information extraction, high-precision classification, schema mapping, data integration, Semantic Web, agent communication, ontology, ontology generation.


    31. HowNet-based Extraction of Chinese Message Structures
    A Chinese message structure is composed of several Chinese fragments, which may be characters, words or phrases; every message structure carries certain information. We have developed a HowNet-based extractor that can extract Chinese message structures from real text and serves as an interactive tool for building a large-scale bank of Chinese message structures. The system uses the HowNet Knowledge System as its basic resource. It is an integrated system of a rule-based analyzer, example-based statistics, and analogy provided by a HowNet-based concept similarity calculator.
    Keywords: Chinese message structure; Knowledge Database Mark-up Language (KDML); parsing; chunk


    32. Wrapper Induction: Efficiency and Expressiveness (Extended Abstract)


    Recently, many systems have been built that automatically interact with Internet information resources. However, these resources are usually formatted for use by people; e.g., the relevant content is embedded in HTML pages. Wrappers are often used to extract a resource's content, but hand-coding wrappers is tedious and error-prone. We advocate wrapper induction, a technique for automatically constructing wrappers. We have identified several wrapper classes that can be learned quickly (most sites require only a handful of examples, consuming a few CPU seconds of processing), yet which are useful for handling numerous Internet resources ( of surveyed sites can be handled by our techniques).


    33. WysiWyg Web Wrapper Factory (W4F)


    In this paper, we present the W4F toolkit for the generation of wrappers for Web sources. W4F consists of a retrieval language to identify Web sources, a declarative extraction language (the HTML Extraction Language) to express robust extraction rules, and a mapping interface to export the extracted information into some user-defined data structures. To assist the user and make the creation of wrappers rapid and easy, the toolkit offers some WYSIWYG support via some wizards. Together, they permit the fast and semi-automatic generation of ready-to-go wrappers provided as Java classes. W4F has been successfully used to generate wrappers for database systems and software agents, making the content of Web sources easily accessible to any kind of application.


    34. Adaptive Information Extraction from Text by Rule Induction and Generalisation
    (LP)2 is a covering algorithm for adaptive Information Extraction from text (IE). It induces symbolic rules that insert SGML tags into texts by learning from examples found in a user-defined tagged corpus. Training is performed in two steps: initially a set of tagging rules is learned; then additional rules are induced to correct mistakes and imprecision in tagging. Induction is performed by bottom-up generalization of examples in the training corpus. Shallow knowledge about Natural Language Processing (NLP) is used in the generalization process. The algorithm has a considerable success story. From a scientific point of view, experiments report excellent results with respect to the current state of the art on two publicly available corpora. From an application point of view, a successful industrial IE tool has been based on (LP)2. Real-world applications have been developed and licenses have been released to external companies for building other applications. This paper presents (LP)2, experimental results and applications, and discusses the role of shallow NLP in rule induction.
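    A single induced rule of the flavor described above can be caricatured as a contextual pattern that inserts an SGML tag where it fires. The trigger pattern and tag name below are assumptions chosen to resemble the seminar-announcement domain often used with (LP)2, not rules taken from the paper:

```python
import re

# One illustrative hand-written tagging rule: a clock time preceded
# by "at" is wrapped in an SGML start-time tag. (LP)2 induces such
# rules from a tagged corpus instead of having them written by hand.
def apply_rule(text):
    return re.sub(r"\bat (\d{1,2}:\d{2})", r"at <stime>\1</stime>", text)

print(apply_rule("The seminar starts at 4:00 in room 7."))
# The seminar starts at <stime>4:00</stime> in room 7.
```

    The second training step the abstract mentions would then induce correction rules that move or remove tags this rule places wrongly.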


    35. Advanced Web Technology Information Extraction

  • Original post: https://www.cnblogs.com/cy163/p/557329.html