  • Textual Data Mining and WEBSOM

    http://users.tkk.fi/~hhyotyni/latex/Final/node63.html#SECTION03150000000000000000

    This section presents the task of finding relevant information in large document collections, together with the WEBSOM method developed at the Neural Networks Research Centre of the Helsinki University of Technology. The presentation is based on [gif].

    One of the most common computing tasks nowadays is information retrieval. Especially in the rapidly growing World Wide Web (WWW) there is a vast amount of potentially useful information available, but reaching it is not straightforward. It is important to develop more powerful methods for the exploration of miscellaneous document collections.

    Searching for relevant documents has traditionally been based on keywords and Boolean expressions of them. The search results often show high recall and low precision, or vice versa. Considerable effort has been devoted to developing alternative methods, but their practical applicability has been low.
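
    The trade-off between recall and precision can be illustrated with a small sketch. The toy documents, the relevance judgements, and the helper names (`boolean_or_search`, `boolean_and_search`, `precision_recall`) below are assumptions made for illustration only; they are not part of the original presentation.

```python
def _words(text):
    """Lower-cased set of whitespace-separated words in a text."""
    return set(text.lower().split())

def boolean_or_search(docs, keywords):
    """Ids of documents containing at least one keyword (Boolean OR)."""
    return {i for i, t in docs.items() if any(k in _words(t) for k in keywords)}

def boolean_and_search(docs, keywords):
    """Ids of documents containing every keyword (Boolean AND)."""
    return {i for i, t in docs.items() if all(k in _words(t) for k in keywords)}

def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical toy collection and relevance judgements.
docs = {
    1: "neural network models for image analysis",
    2: "self organizing map for document clustering",
    3: "map of hiking trails",
    4: "neural map models of document collections",
}
relevant = {2, 4}

broad = boolean_or_search(docs, ["map", "neural"])
narrow = boolean_and_search(docs, ["map", "document", "clustering"])

print(precision_recall(broad, relevant))   # (0.5, 1.0): high recall, low precision
print(precision_recall(narrow, relevant))  # (1.0, 0.5): high precision, low recall
```

    The broad OR query retrieves every relevant document but also the irrelevant ones, whereas the narrow AND query returns only relevant documents and misses half of them.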

    The WEBSOM method is based on an algorithm called the Self-Organizing Map (SOM). The latter, developed in our laboratory, is a general unsupervised learning algorithm for analyzing and visualizing high-dimensional statistical data. It is one of the most widespread artificial neural network models used in application areas like process monitoring, image analysis, telecommunications, and categorization of economic data. The SOM, its mathematical basis, and about one thousand applications are presented in the recent monograph [gif].
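
    As a rough illustration of how a SOM learns, the sketch below implements a common online variant of the algorithm in NumPy. The function names, the Gaussian neighbourhood, the linear decay schedules, and all parameter values are assumptions chosen for brevity; this is not the SOM program package mentioned later in this section.

```python
import numpy as np

def train_som(data, rows, cols, n_iter=2000, lr0=0.5, seed=0):
    """Train a rows x cols SOM on data of shape (n_samples, dim).

    Returns (weights, grid): weights has shape (rows*cols, dim) and
    grid holds the 2-D map coordinates of each unit.
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise each unit from a randomly chosen data sample.
    weights = data[rng.integers(0, n, size=rows * cols)].astype(float)
    sigma0 = max(rows, cols) / 2.0          # initial neighbourhood radius
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)             # learning rate decays to zero
        sigma = sigma0 + (0.5 - sigma0) * frac  # radius shrinks towards 0.5
        x = data[rng.integers(0, n)]
        # Best-matching unit: the unit whose weight vector is closest to x.
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
        # Gaussian neighbourhood measured on the map grid, not in data space.
        d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-d2 / (2.0 * sigma ** 2))
        # Move the BMU and its map neighbours towards the sample.
        weights += lr * h[:, None] * (x - weights)
    return weights, grid

def best_matching_unit(weights, x):
    """Index of the map unit closest to x."""
    return int(np.argmin(((weights - x) ** 2).sum(axis=1)))

# Two well-separated synthetic clusters (illustrative data only).
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
                  rng.normal(10.0, 0.1, size=(50, 2))])
weights, grid = train_som(data, rows=1, cols=4)
print(best_matching_unit(weights, np.array([0.0, 0.0])),
      best_matching_unit(weights, np.array([10.0, 10.0])))
```

    Because the neighbourhood is defined on the map grid, units that are close on the map end up with similar weight vectors, which is what makes the map usable for visualization.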

    The basic WEBSOM architecture consists of two levels. On the first level, the word category map [gif] learns, in a self-organizing process, to represent relations between words based on their averaged short contexts. The words are mapped onto a two-dimensional map grid and ordered according to the similarities in their usage. The word category map is then used to form a word histogram of the textual document to be analyzed. This histogram, a "fingerprint" of the document, is used as input to the second SOM, the document map. The document map self-organizes to represent the similarities between the contents of the documents; each document attains a location on the map based on its contents. Different areas of the map specialize in different topics, and the topics change smoothly across the map.
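
    The two encoding steps between the maps can be sketched as follows. This is a simplified illustration under stated assumptions: each word receives a random code vector, its "averaged short context" is taken to be the mean of the codes of its immediate left and right neighbours, and the word-to-node assignment (`word_to_node`) is written by hand rather than produced by a trained word category map. Training of the two SOMs themselves is omitted, and all function and variable names are hypothetical.

```python
import numpy as np

def averaged_context_vectors(token_lists, dim=8, seed=0):
    """Average the codes of each word's left and right neighbours.

    Returns {word: vector of length 2*dim}; these vectors would be the
    training input of the word category map.
    """
    rng = np.random.default_rng(seed)
    vocab = sorted({w for toks in token_lists for w in toks})
    code = {w: rng.standard_normal(dim) for w in vocab}  # random word codes
    sums = {w: np.zeros(2 * dim) for w in vocab}
    counts = {w: 0 for w in vocab}
    zero = np.zeros(dim)  # used where a neighbour is missing
    for toks in token_lists:
        for i, w in enumerate(toks):
            prev = code[toks[i - 1]] if i > 0 else zero
            nxt = code[toks[i + 1]] if i + 1 < len(toks) else zero
            sums[w] += np.concatenate([prev, nxt])
            counts[w] += 1
    return {w: sums[w] / counts[w] for w in vocab}

def document_histogram(tokens, word_to_node, n_nodes):
    """Normalised histogram of word-category-map nodes hit by a document."""
    hist = np.zeros(n_nodes)
    for w in tokens:
        node = word_to_node.get(w)
        if node is not None:      # skip words that are not on the map
            hist[node] += 1
    total = hist.sum()
    return hist / total if total > 0 else hist

# Words used in the same contexts get identical averaged context vectors.
ctx = averaged_context_vectors([["the", "cat", "runs"], ["the", "dog", "runs"]])
print(np.allclose(ctx["cat"], ctx["dog"]))  # True

# Hand-written stand-in for a trained word category map with 3 nodes.
word_to_node = {"cat": 0, "dog": 0, "runs": 1, "sleeps": 1, "the": 2}
hist = document_histogram(["the", "cat", "sleeps", "the", "dog"], word_to_node, 3)
print(hist)  # [0.4 0.2 0.4]
```

    The histogram is the document "fingerprint": two documents that use different but similarly behaving words (here "cat" and "dog") produce the same histogram entries, which is what lets the document map group them by content.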

    The WEBSOM demo is available on the Internet. To make it easy and practical to explore the organized document collections, we have developed a WWW-based browsing environment. The self-organized document map offers a general idea of the underlying document space. The user may view any area of the map in detail simply by pointing at the map image with the mouse. The WEBSOM browsing interface is implemented as a set of HTML documents that can be viewed using a graphical WWW browser, such as Mosaic or Netscape, at the WEBSOM home page at http://websom.hut.fi/websom/ [gif].

    The WEBSOM method is in principle applicable to any kind of collection of textual documents. It is especially suitable for exploration tasks in which the users either do not know the domain very well or have only a limited idea of the contents of the full-text database being examined. With WEBSOM, the documents are ordered meaningfully according to their contents. The maps also help exploration by giving an overall view of what the information space looks like.

    In the World Wide Web, one application could be the organization of home pages rather than newsgroup articles. Electronic mail messages may also be positioned automatically on a suitable map according to personal interests. Relevant areas and single nodes on the map can then be used as "mailboxes" in which specified information is automatically gathered.
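
    The "mailbox" idea can be sketched as a nearest-node lookup on a document map. The node vectors, the message histograms, and the function names below are hypothetical stand-ins chosen for illustration; in practice the node vectors would come from a trained document map and the message vectors from the word histogram encoding.

```python
import numpy as np
from collections import defaultdict

def nearest_node(node_vectors, x):
    """Index of the map node whose model vector is closest to x."""
    return int(np.argmin(((node_vectors - x) ** 2).sum(axis=1)))

def route_messages(node_vectors, messages, watched_nodes):
    """Collect message ids into 'mailboxes' keyed by watched map node."""
    mailboxes = defaultdict(list)
    for msg_id, vec in messages.items():
        node = nearest_node(node_vectors, np.asarray(vec, dtype=float))
        if node in watched_nodes:   # ignore nodes the user did not select
            mailboxes[node].append(msg_id)
    return dict(mailboxes)

# Hypothetical document map with three nodes and three incoming messages.
node_vectors = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])
messages = {
    "a": [0.9, 0.1, 0.0],
    "b": [0.1, 0.8, 0.1],
    "c": [0.0, 0.2, 0.8],
}
print(route_messages(node_vectors, messages, watched_nodes={0, 2}))
# {0: ['a'], 2: ['c']}
```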

    For more detailed information on the WEBSOM method in general, its variants, and application examples, see, e.g., [gif,gif,gif]. A detailed description of the SOM as a method and tool for exploring numerical or textual data can be found in [gif]. The SOM has previously been used to create document maps, e.g., by Lin et al. [gif], who formed a map based on the titles of scientific documents. Scholtes has developed a SOM-based neural filter and a neural interest map for information retrieval [gif,gif]. Merkl [gif] has clustered textual descriptions of software library components. In comparison, one of the novel features of the WEBSOM method is the idea of applying the SOM algorithm twice: first for word category analysis, and then for creating document maps based on that analysis. The natural language processing model of Miikkulainen [gif] also contains the SOM as a central component.

    The SOM program package with documentation is available for non-commercial purposes [gif]. The original technical report and some WEBSOM demonstrations are also available [gif].

  • Original article: https://www.cnblogs.com/cy163/p/669677.html