  • VIPS: a VIsion based Page Segmentation Algorithm



    Introduction

    The VIsion-based Page Segmentation (VIPS) algorithm aims to extract the semantic structure of a web page based on its visual presentation. This semantic structure is a tree in which each node corresponds to a block. Each node is assigned a value, the Degree of Coherence (DoC), that indicates how coherent the content of the block is in terms of visual perception: the larger the DoC value, the more coherent the block. The VIPS algorithm makes full use of the page layout structure. It first extracts all suitable blocks from the HTML DOM tree and then finds the separators between these blocks. Here, separators are the horizontal or vertical lines in a web page that do not visually cross any block. Based on these separators, the semantic tree of the web page is constructed. A web page can thus be represented as a set of blocks (the leaf nodes of the semantic tree). Compared with DOM-based methods, the segments obtained by VIPS are much more semantically aggregated. Noisy information, such as navigation, advertisements, and decoration, can be removed easily because it is often placed in certain positions of a page. Content on different topics is separated into distinct blocks.
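
    The separator step described above can be illustrated with a minimal Python sketch. It is not the VIPS implementation: it assumes blocks are already available as plain rectangles, it only looks for horizontal separators, and it ignores the DoC value and all other visual cues. The names Block, horizontal_separators, and split_by_separators are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Block:
    """Axis-aligned rectangle of a rendered block (hypothetical type)."""
    name: str
    left: int
    top: int
    right: int
    bottom: int


def horizontal_separators(
    blocks: List[Block], page_top: int, page_bottom: int
) -> List[Tuple[int, int]]:
    """Return (start_y, end_y) bands that no block overlaps vertically."""
    # Start with one band covering the whole page, then carve out each block.
    separators = [(page_top, page_bottom)]
    for block in blocks:
        updated = []
        for start, end in separators:
            if block.bottom <= start or block.top >= end:
                updated.append((start, end))  # block does not touch this band
                continue
            if block.top > start:
                updated.append((start, block.top))  # part above the block
            if block.bottom < end:
                updated.append((block.bottom, end))  # part below the block
        separators = updated
    # Bands touching the page border do not separate anything, so drop them.
    return [(s, e) for s, e in separators if s > page_top and e < page_bottom]


def split_by_separators(
    blocks: List[Block], separators: List[Tuple[int, int]]
) -> List[List[str]]:
    """Group block names into the segments between consecutive separators."""
    cuts = sorted(start for start, _ in separators)
    groups: List[List[str]] = [[] for _ in range(len(cuts) + 1)]
    for block in sorted(blocks, key=lambda b: b.top):
        index = sum(1 for cut in cuts if block.top >= cut)
        groups[index].append(block.name)
    return groups


# Assumed toy layout: header, two side-by-side columns, footer.
page = [
    Block("header", 0, 0, 800, 100),
    Block("nav", 0, 120, 200, 600),
    Block("content", 220, 120, 800, 600),
    Block("footer", 0, 640, 800, 700),
]
bands = horizontal_separators(page, page_top=0, page_bottom=700)
print(bands)
print(split_by_separators(page, bands))
```

    In this toy layout the two side-by-side columns end up in the same segment, because no horizontal band passes between them; a vertical pass over that segment would be needed to split them further.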

     


    Paper List

    Original Paper

    Applications using VIPS


    If you find the VIPS algorithm useful, we would appreciate it very much if you cited the following works:

    @Inproceedings{CHWM04,
    author = "Deng Cai and Xiaofei He and Ji-Rong Wen and Wei-Ying Ma",
    title = "Block-level link analysis",
    booktitle = "Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR'04)",
    pages = {440--447},
    year = "2004"}

    @Inproceedings{CYWM04,
    author = "Deng Cai and Shipeng Yu and Ji-Rong Wen and Wei-Ying Ma",
    title = "Block-based web search",
    booktitle = "Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR'04)",
    pages = {456--463},
    year = "2004"}

    @Inproceedings{YCWM03,
    author = "Shipeng Yu and Deng Cai and Ji-Rong Wen and Wei-Ying Ma",
    title = "Improving Pseudo-Relevance Feedback in Web Information Retrieval Using Web Page Segmentation",
    booktitle = "Twelfth International World Wide Web Conference (WWW2003)",
    year = "2003"}

    @Inproceedings{CYWM03,
    author = "Deng Cai and Shipeng Yu and Ji-Rong Wen and Wei-Ying Ma",
    title = "Extracting Content Structure for Web Pages based on Visual Representation",
    booktitle = "Fifth Asia Pacific Web Conference (APWeb2003)",
    year = "2003"}

     


    Demo

    Copyright Notice: All these programs can only be used for research.

    VIPS dll (The VIPS DLL is always under development. All versions are downloadable here.)

    • VIPS dll (pageanalyzer.dll) (release date: 03/26/2008. One bug fixed. Thanks to Ankur Gupta for pointing out the bug.)

       

    • VIPS dll (pageanalyzer.dll) (release date: 01/16/2006. Some people requested HTML source code output, so I added it. I also changed some interfaces, so you need to rebuild your program if you want to use this new dll. Please also download the newest demo.)

       

    • VIPS Demo (release date: 01/16/2006) (You should download VIPS dll and register it first! This demo works only with the new VIPS dll.)

       

    • VIPS dll (pageanalyzer.dll) (release date: 03/20/2005, some bugs fixed)

       

    • VIPS dll (pageanalyzer.dll) (release date: 08/20/2004)

       

    • VIPS Demo (release date: 08/20/2004) (You should download VIPS dll and register it first!)

    How to use VIPS dll.

    • You should be familiar with how to host a web browser (Internet Explorer) in your program. Some articles in MSDN are very useful.

       

    • A more powerful example of using VIPS dll in VS2003 (release date: 01/25/2006)
      (This example provides source code showing how to process batch jobs using VIPS dll. The framework of this example is based on MFCbrowser, which is a demo project in MSDN. You only need to focus on MFCbrowserView.cpp and MFCbrowserView.h. I added some comments, and hopefully these two files are self-explanatory. Email me if you still have any questions.)

       

    • An example of using VIPS dll in VC6.0 (release date: 08/20/2004) (You should download VIPS dll and register it first!)
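
    The notes above ask you to register the dll before using it. The page does not say how, so the following Python sketch rests on an assumption: that "register" means standard COM registration with the Windows regsvr32 tool. The helper names build_register_command and register_dll are hypothetical.

```python
import platform
import subprocess
from typing import List


def build_register_command(dll_path: str, silent: bool = True) -> List[str]:
    """Build the regsvr32 command line for a COM dll (does not run it)."""
    command = ["regsvr32"]
    if silent:
        command.append("/s")  # suppress regsvr32's message boxes
    command.append(dll_path)
    return command


def register_dll(dll_path: str) -> bool:
    """Register the dll on Windows; return False elsewhere or on failure."""
    if platform.system() != "Windows":
        return False  # regsvr32 only exists on Windows
    result = subprocess.run(build_register_command(dll_path), check=False)
    return result.returncode == 0


# Assumption: pageanalyzer.dll sits in the current working directory.
print(build_register_command("pageanalyzer.dll"))
```

    Calling register_dll("pageanalyzer.dll") would then perform the registration; on most systems this has to be run from an elevated (administrator) prompt.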

    Notice: We are currently working to enhance the VIPS algorithm. Any suggestions or problems can be sent to dengcai2 AT cs DOT uiuc DOT edu.

  • Original article: https://www.cnblogs.com/lexus/p/3600470.html