  • Search Engine Optimization: TF-IDF Implemented in Java

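    Before the implementation, a few things need explaining up front:

    We use Redis to persist the data, storing two kinds of map:

    key is a term, value is the set of urls that contain that term
    key is a url, value is a map recording each term and the number of times it occurs in the article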
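To make the two layouts concrete, here is a minimal in-memory sketch that mirrors them with plain Java collections instead of Redis. The class name `InMemoryIndexSketch` and its methods are hypothetical, introduced only for illustration; the article's own code reads the same shapes from Redis through Jedis.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the two Redis structures described above.
class InMemoryIndexSketch {
    // Map 1: term -> set of urls containing that term.
    private final Map<String, Set<String>> termToUrls = new HashMap<>();
    // Map 2: url -> (term -> number of occurrences on that page).
    private final Map<String, Map<String, Integer>> urlToTermCounts = new HashMap<>();

    // Record one occurrence of a term on a page, updating both maps.
    public void add(String url, String term) {
        termToUrls.computeIfAbsent(term, t -> new HashSet<>()).add(url);
        urlToTermCounts.computeIfAbsent(url, u -> new HashMap<>()).merge(term, 1, Integer::sum);
    }

    // All urls known to contain the term (empty if the term was never seen).
    public Set<String> urlsFor(String term) {
        return termToUrls.getOrDefault(term, Collections.emptySet());
    }

    // Occurrences of the term on one page (0 if the page or term is unknown).
    public int countOf(String url, String term) {
        Map<String, Integer> counts = urlToTermCounts.getOrDefault(url, Collections.emptyMap());
        return counts.getOrDefault(term, 0);
    }

    public static void main(String[] args) {
        InMemoryIndexSketch index = new InMemoryIndexSketch();
        for (String word : "java is fun and java is fast".split(" ")) {
            index.add("page-a", word);
        }
        index.add("page-b", "java");
        System.out.println("urls containing java: " + index.urlsFor("java").size());
        System.out.println("count of java on page-a: " + index.countOf("page-a", "java"));
    }
}
```

The first map answers "which pages should be scored for this query", and the second supplies the per-page counts that the term-frequency step needs.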
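    The overall formula, as implemented by the code below, is TF-IDF(url, term) = TF(url, term) × IDF(term), where TF = (occurrences of the term in the url) / (total words in the url) and IDF = log10(total indexed urls / urls containing the term).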
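A worked example with assumed numbers may help here. The sketch below uses plain doubles and a hypothetical class `TfIdfFormulaSketch`; it computes TF as term count over word count, IDF as the base-10 log of total pages over pages containing the term, and multiplies the two, which is what the methods later in this article do with `BigDecimal`.

```java
// Hypothetical worked example of the scoring formula with plain doubles.
class TfIdfFormulaSketch {
    // TF: the share of the page's words that are the term.
    static double tf(int termCount, int wordCount) {
        return wordCount == 0 ? 0.0 : (double) termCount / wordCount;
    }

    // IDF: base-10 log of (total pages / pages containing the term).
    // Assumes urlsWithTerm > 0.
    static double idf(int totalUrls, int urlsWithTerm) {
        return Math.log10((double) totalUrls / urlsWithTerm);
    }

    // TF-IDF: the product of the two.
    static double tfIdf(int termCount, int wordCount, int totalUrls, int urlsWithTerm) {
        return tf(termCount, wordCount) * idf(totalUrls, urlsWithTerm);
    }

    public static void main(String[] args) {
        // Assumed numbers: the term appears 3 times on a 100-word page,
        // and 10 of 1000 indexed pages contain it.
        System.out.println(tfIdf(3, 100, 1000, 10));
    }
}
```

With these numbers TF is 3/100 = 0.03 and IDF is log10(100) = 2, so the score comes out at about 0.06.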


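    1. Computing term frequency (TF)
    Given a url, we look up how many times the search term occurs on that page and divide by the page's total word count to get the TF.


    Get the total number of words in a url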

    /**
     * @Author Ragty
     * @Description Get the total number of words in a url
     * @Date 11:18 2019/6/4
     **/
    public Integer getWordCount(String url) {
        String redisKey = urlSetKey(url);
        // Redis hash for this url: term -> occurrence count (stored as a string)
        Map<String, String> map = jedis.hgetAll(redisKey);
        int count = 0;

        for (Map.Entry<String, String> entry : map.entrySet()) {
            count += Integer.parseInt(entry.getValue());
        }
        return count;
    }

    Return the number of times the search term occurs in a url

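    /**
     * @Author Ragty
     * @Description Return the number of times the search term occurs in a url
     * @Date 22:12 2019/5/14
     **/
    public Integer getTermCount(String url, String term) {
        String redisKey = urlSetKey(url);
        String count = jedis.hget(redisKey, term);
        // hget returns null when the term is absent from this url's hash
        return count == null ? 0 : Integer.parseInt(count);
    }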

    Get the term frequency of the search term

    /**
     * @Author Ragty
     * @Description Get the term frequency (TF) of the search term
     * @Date 11:25 2019/6/4
     **/
    public BigDecimal getTermFrequency(String url, String term) {
        if (!isIndexed(url)) {
            System.out.println("Not indexed: " + url);
            return null;
        }

        Integer documentCount = getWordCount(url);
        Integer termCount = getTermCount(url, term);
        // Scale 6 with HALF_UP rounding; RoundingMode (java.math) replaces the deprecated BigDecimal.ROUND_HALF_UP
        return documentCount == 0
                ? BigDecimal.ZERO
                : new BigDecimal(termCount).divide(new BigDecimal(documentCount), 6, RoundingMode.HALF_UP);
    }
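
The explicit scale and rounding mode in the division matter: `BigDecimal.divide` with no scale throws an `ArithmeticException` when the exact quotient has a non-terminating decimal expansion, such as 1/3. The following is a small sketch of that behavior; the class `DivideScaleSketch` and its method names are hypothetical.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical sketch: why the TF division passes a scale and a rounding mode.
class DivideScaleSketch {
    // Divide with scale 6 and HALF_UP rounding, returning zero for an empty page.
    static BigDecimal ratio(int part, int whole) {
        if (whole == 0) {
            return BigDecimal.ZERO;
        }
        return new BigDecimal(part).divide(new BigDecimal(whole), 6, RoundingMode.HALF_UP);
    }

    // Returns true if exact division (no scale given) throws for these operands.
    static boolean exactDivideFails(int part, int whole) {
        try {
            new BigDecimal(part).divide(new BigDecimal(whole));
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(ratio(1, 3));            // rounded to six decimal places
        System.out.println(exactDivideFails(1, 3)); // 1/3 has no exact decimal form
        System.out.println(exactDivideFails(1, 4)); // 1/4 = 0.25 is exact
    }
}
```

Here `ratio(1, 3)` yields 0.333333, while the unscaled division of the same operands fails.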

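    2. Computing inverse document frequency (IDF)
    To compute the inverse document frequency, we need the total number of indexed documents and the number of documents that contain the search term.


    Get the total number of articles indexed in redis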

    /**
     * @Author Ragty
     * @Description Get the total number of articles indexed in redis
     * @Date 19:46 2019/6/5
     **/
    public Integer getUrlCount() {
        return urlSetKeys().size();
    }

    Get the number of articles that contain the search term

    /**
     * @Author Ragty
     * @Description Get the number of articles that contain the search term
     * @Date 22:42 2019/6/5
     **/
    public Integer getUrlTermCount(String term) {
        return getUrls(term).size();
    }

    Compute the inverse document frequency (IDF)

    /**
     * @Author Ragty
     * @Description Compute the inverse document frequency (IDF)
     * @Date 23:32 2019/6/5
     **/
    public BigDecimal getInverseDocumentFrequency(String term) {
        Integer totalUrl = getUrlCount();
        Integer urlTermCount = getUrlTermCount(term);
        // Assumption: a term found on no indexed page scores 0, which also avoids dividing by zero
        if (urlTermCount == 0) {
            return BigDecimal.ZERO;
        }
        double ratio = new BigDecimal(totalUrl)
                .divide(new BigDecimal(urlTermCount), 6, RoundingMode.HALF_UP)
                .doubleValue();
        return BigDecimal.valueOf(Math.log10(ratio));
    }
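
Two boundary cases of this formula are worth seeing with numbers: a term that appears on every indexed page gets an IDF of exactly 0, and a term that appears on no page would make the denominator zero. The sketch below uses doubles and a hypothetical class `IdfEdgeCaseSketch`; returning 0 for the second case is an assumption made for illustration, not something the article specifies.

```java
// Hypothetical sketch of IDF boundary cases, using plain doubles.
class IdfEdgeCaseSketch {
    static double idf(int totalUrls, int urlsWithTerm) {
        // Assumption: a term that no page contains contributes no score.
        if (urlsWithTerm == 0) {
            return 0.0;
        }
        return Math.log10((double) totalUrls / urlsWithTerm);
    }

    public static void main(String[] args) {
        System.out.println(idf(100, 100)); // term on every page: log10(1) = 0
        System.out.println(idf(100, 10));  // rarer term: larger IDF
        System.out.println(idf(100, 1));   // rarest term: largest IDF
        System.out.println(idf(100, 0));   // guarded case
    }
}
```

The rarer a term is across the indexed pages, the larger its IDF, so very common terms contribute little to the final score.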

    3. Computing TF-IDF
    /**
     * @Author Ragty
     * @Description Get the tf-idf value
     * @Date 23:34 2019/6/5
     **/
    public BigDecimal getTFIDF(String url, String term) {
        BigDecimal tf = getTermFrequency(url, term);
        // getTermFrequency returns null for a url that is not indexed
        if (tf == null) {
            return BigDecimal.ZERO;
        }
        BigDecimal idf = getInverseDocumentFrequency(term);
        return tf.multiply(idf);
    }

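    4. Testing with data
    Here I run a simple test on a portion of data I crawled myself (some results may be inaccurate because of the data set).

    Test class methods: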

    /**
     * @Author Ragty
     * @Description Get the relevance of a url to a term under tf-idf
     * @Date 8:47 2019/6/6
     **/
    private static BigDecimal getRelevance(String url, String term, JedisIndex index) {
        return index.getTFIDF(url, term);
    }


    /**
     * @Author Ragty
     * @Description Run a search
     * @Date 23:49 2019/5/30
     **/
    public static WikiSearch search(String term, JedisIndex index) {
        Map<String, BigDecimal> map = new HashMap<String, BigDecimal>();
        Set<String> urls = index.getUrls(term);

        // Score every url that contains the term, rounded to six decimal places
        for (String url : urls) {
            BigDecimal tfidf = getRelevance(url, term, index).setScale(6, RoundingMode.HALF_UP);
            map.put(url, tfidf);
        }

        return new WikiSearch(map);
    }


    /**
     * @Author Ragty
     * @Description Print the results in descending order of relevance
     * @Date 13:46 2019/5/30
     **/
    private void print() {
        List<Entry<String, BigDecimal>> entries = sort();
        for (Entry<String, BigDecimal> entry : entries) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }



    /**
     * @Author Ragty
     * @Description Sort the data by relevance
     * @Date 13:54 2019/5/30
     **/
    public List<Entry<String, BigDecimal>> sort() {
        List<Entry<String, BigDecimal>> entries = new LinkedList<Entry<String, BigDecimal>>(map.entrySet());

        // Compare o2 against o1 so that higher scores come first
        Comparator<Entry<String, BigDecimal>> comparator = new Comparator<Entry<String, BigDecimal>>() {
            @Override
            public int compare(Entry<String, BigDecimal> o1, Entry<String, BigDecimal> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        };

        Collections.sort(entries, comparator);
        return entries;
    }
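
On Java 8 or later, the same descending sort can be written more compactly with `Map.Entry.comparingByValue` and `Comparator.reverseOrder`. This is an alternative sketch, not the article's code; the class `SortByScoreSketch`, the method `sortDescending`, and the sample scores are hypothetical.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: sorting url scores from highest to lowest.
class SortByScoreSketch {
    static List<Map.Entry<String, BigDecimal>> sortDescending(Map<String, BigDecimal> scores) {
        List<Map.Entry<String, BigDecimal>> entries = new ArrayList<>(scores.entrySet());
        // Compare entries by value, in reverse of BigDecimal's natural order.
        entries.sort(Map.Entry.<String, BigDecimal>comparingByValue(Comparator.reverseOrder()));
        return entries;
    }

    public static void main(String[] args) {
        // Made-up scores, for illustration only.
        Map<String, BigDecimal> scores = new HashMap<>();
        scores.put("page-a", new BigDecimal("0.002"));
        scores.put("page-b", new BigDecimal("0.030"));
        scores.put("page-c", new BigDecimal("0.010"));
        for (Map.Entry<String, BigDecimal> entry : sortDescending(scores)) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

Copying the entry set into a new list before sorting leaves the underlying map untouched, as the `LinkedList` copy in the method above also does.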

    Test code:

    public static void main(String[] args) throws IOException {
        Jedis jedis = JedisMaker.make();
        JedisIndex index = new JedisIndex(jedis);

        // search for the first term
        String term1 = "java";
        System.out.println("Query: " + term1);
        WikiSearch search1 = search(term1, index);
        search1.print();

        // search for the second term
        String term2 = "programming";
        System.out.println("Query: " + term2);
        WikiSearch search2 = search(term2, index);
        search2.print();
    }

    Test results:

    Query: java
    https://baike.baidu.com/item/LiveScript 0.029956
    https://baike.baidu.com/item/Java/85979 0.019986
    https://baike.baidu.com/item/Brendan%20Eich 0.017188
    https://baike.baidu.com/item/%E7%94%B2%E9%AA%A8%E6%96%87/471435 0.013163
    https://baike.baidu.com/item/Sun/69463 0.005504
    https://baike.baidu.com/item/Rhino 0.004401
    https://baike.baidu.com/item/%E6%8E%92%E7%89%88%E5%BC%95%E6%93%8E 0.003452
    https://baike.baidu.com/item/javascript 0.002212
    https://baike.baidu.com/item/js/10687961 0.002212
    https://baike.baidu.com/item/%E6%BA%90%E7%A0%81 0.002205
    https://baike.baidu.com/item/%E6%BA%90%E7%A0%81/344212 0.002205
    https://baike.baidu.com/item/%E8%84%9A%E6%9C%AC%E8%AF%AD%E8%A8%80 0.001989
    https://baike.baidu.com/item/SQL 0.001779
    https://baike.baidu.com/item/PHP/9337 0.001503
    https://baike.baidu.com/item/iOS/45705 0.001499
    https://baike.baidu.com/item/Netscape 0.000863
    https://baike.baidu.com/item/%E6%93%8D%E4%BD%9C%E7%B3%BB%E7%BB%9F 0.000835
    https://baike.baidu.com/item/Mac%20OS%20X 0.000521
    https://baike.baidu.com/item/C%E8%AF%AD%E8%A8%80 0.000318

    Query: programming
    https://baike.baidu.com/item/C%E8%AF%AD%E8%A8%80 0.004854
    https://baike.baidu.com/item/%E8%84%9A%E6%9C%AC%E8%AF%AD%E8%A8%80 0.002529
  • Original article: https://www.cnblogs.com/ly570/p/11106215.html