  • A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERING (worth looking at the related research in it when there is time)

    Excerpted from: http://aircconline.com/ijdkp/V4N6/4614ijdkp04.pdf

    In the syntactical approach we define binary attributes that correspond to each fixed-length substring of words (or characters). These substrings, called shingles, form the framework for near-duplicate detection. A shingle is a sequence of words with two parameters: the length (the number of words in the shingle) and the offset (the distance between the beginnings of consecutive shingles). We assign a hash code to each shingle, so that equal shingles get the same hash code and it is improbable that different shingles get the same hash code (this depends on the hashing algorithm used). We then randomly choose a subset of shingles as a concise image of the document [6, 8, 9]. M. Henzinger [32] uses a similar approach in the AltaVista search engine. There are several methods for selecting the shingles for the image: a fixed number of shingles, a logarithmic number of shingles, a linear number of shingles (every nth shingle), etc.
    In lexical methods, representative words are chosen according to their significance, and these values are usually based on frequencies. Words whose frequencies fall inside a chosen interval are taken, except for stop-words from a special list (about 30 stop-words covering articles, prepositions and pronouns). Words with high frequency can be uninformative, and words with low frequency can be misprints or occasional words.
    In lexical methods such as I-Match [11], a large text corpus is used to generate the lexicon. The words that appear in the lexicon represent the document. When the lexicon is generated, the words with the lowest and highest frequencies are deleted. I-Match generates a signature and a hash code of the document. If two documents get the same hash code, it is likely that their similarity measures are equal as well. I-Match is sometimes unstable to changes in texts [22]. Jun Fan et al. [16] introduced the idea of fusing algorithms (shingling, I-Match, simhash) and presented experiments. Random-lexicon-based multi-fingerprint generation is imported into the shingling-based simhash algorithm, named the "shingling based multi fingerprints simhash algorithm". The performance of the combination was much better than that of the original simhash.
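    A minimal sketch of the I-Match idea as described above: build a lexicon from corpus word frequencies, drop the lowest- and highest-frequency words, represent each document by the lexicon words it contains, and hash that set into a single signature. The frequency cut-offs, whitespace tokenisation and SHA-1 are illustrative assumptions, not details taken from the paper.

    import hashlib
    from collections import Counter

    def build_lexicon(corpus, low=2, high=1000):
        """Count word frequencies over a large corpus and keep only words whose
        frequency lies strictly between the (assumed) low and high cut-offs."""
        freq = Counter(w for doc in corpus for w in doc.lower().split())
        return {w for w, c in freq.items() if low < c < high}

    def imatch_signature(document, lexicon):
        """Represent the document by the lexicon words it contains and hash the
        sorted, de-duplicated word set into one signature."""
        terms = sorted({w for w in document.lower().split() if w in lexicon})
        return hashlib.sha1(" ".join(terms).encode("utf-8")).hexdigest()

    Two documents with identical signatures are treated as near-duplicates; note that adding or removing a single lexicon word changes the signature completely, which is the kind of instability to small text changes mentioned in [22].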
     
    The paper proposes detecting and eliminating duplicate and near-duplicate web pages in order to increase the efficiency of web crawling. The proposed technique is meant to help document classification and document clustering in web content mining by eliminating near-duplicate documents. To this end, a novel algorithm is proposed to evaluate the similarity of the content of two documents.
     
     
    Duplicate Detection (DD) Algorithm
    Step 1: Consider the stemmed keywords of the web page.
    Step 2: Based on the starting character (A-Z), the hash values are assumed to start with 1-26.
    Step 3: Scan every word from the sample and compare it with the DB (database); initially the DB contains no key values. When a new keyword is found, generate the corresponding hash value and store that key value in a temporary DB.
    Step 4: Repeat step 3 until all the keywords have been processed.
    Step 5: Store all hash values for the given sample in the local DB (here, an array list).
    Step 6: Repeat steps 1 to 5 for N samples.
    Step 7: Once the selected samples have been processed, calculate the similarity measure between the samples' hash values stored in the local DB and the web pages in the repository.
    Step 8: From the similarity measure, generate a report on the samples as percentage scores. Pages that are 80% similar are considered to be near-duplicates (see the Python sketch below).
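    A rough Python reading of these steps, for illustration: the per-letter 1-26 prefix, the way the rest of the hash value is formed, the in-memory dictionaries standing in for the DBs, and the set-overlap percentage are all assumptions, since the steps above do not pin them down.

    def dd_hash(word, db):
        """Steps 2-3: assign a hash value whose leading digits are 1-26 according
        to the word's first letter; the remainder is a running counter (assumed)."""
        if word in db:
            return db[word]
        prefix = ord(word[0].lower()) - ord("a") + 1   # 'a' -> 1, ..., 'z' -> 26
        value = prefix * 1000 + len(db)                # assumed value scheme
        db[word] = value
        return value

    def page_hashes(stemmed_keywords, db):
        """Steps 1, 4-5: hash every stemmed keyword of a page and keep the
        values in a local list."""
        return [dd_hash(w, db) for w in stemmed_keywords]

    def similarity(hashes_a, hashes_b):
        """Step 7: similarity score between two pages as the percentage of
        shared hash values (an assumed interpretation of the paper's measure)."""
        a, b = set(hashes_a), set(hashes_b)
        return 100.0 * len(a & b) / len(a | b) if (a | b) else 0.0

    def near_duplicates(samples, repository, db, threshold=80.0):
        """Steps 6, 8: compare every sample against every repository page and
        report the pairs whose score reaches the 80% threshold."""
        report = []
        for sample_name, sample_kw in samples.items():
            sample_h = page_hashes(sample_kw, db)
            for page_name, page_kw in repository.items():
                score = similarity(sample_h, page_hashes(page_kw, db))
                if score >= threshold:
                    report.append((sample_name, page_name, score))
        return report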
     
    Ugh, I don't think I actually found the key insight in there!
     
  • Original post: https://www.cnblogs.com/bonelee/p/6420721.html