zoukankan      html  css  js  c++  java
  • tf-idf知多少?

    1、最完整的解释

    TF-IDF是一种统计方法,用以评估一个词对于语料库中的其中一份文件的重要程度。

    --------------就是给定语料库的情况下(给定语料库就是说已知语料库的属性信息),给定一个词语term,计算一个term对于文件的重要性(就是计算一个得分),文件是可变的;

    这样的话可以计算在词语在多个文件的得分然后做个排序就好了

    一个直观的感觉是:一个词语在当前文件出现次数越多越重要;但是有可能是停用词

    所以还得限制这个词语在语料库中出现文件次数;

    所以就是一个词语对于一个文件的重要性取决于:  当前这个词语在文件中出现的次数(越多越重要);以及这个词语在整个语料库中出现次数(越少越重要);

    其实就是评价单词对于文件的重要性,这个重要性怎么衡量?-就是后文所说  这里x是term,y是代表document,如果是一个query就会对x做sum

    2、关于 normalization 

    是有两种 normalization的操作,叫做 instance-wise和 feature-wise的normalization:

    >Normalizationしてみる(instance-wise normalization)

    >Scaleを合わせてみる(feature-wise normalization)

    最近在做text classificaion,所以会用到 tf-idf,这个时候一般只会做 instance-wise的,这个也就是针对每个样本的每个特征做nomalizaiton

    具体为什么要做instance normalization?一段文字告诉你

    According to our empirical experience, instance-wise  data normalization makes the optimization problem easier to be solved.  For example, for the following vector x (0, 0, 1, 1, 0, 1,0, 1, 0), the standard data normalization divides each element of x by by the 2-norm of x. The normalized x becomes (0, 0, 0.5, 0.5,0, 0.5, 0, 0.5, 0)

    1. Transforming infrequent features into a special tag. Conceptually, infrequent
    features should only include very little or even no information, so it
    should be very hard for a model to extract those information. In fact, these
    features can be very noisy. We got significant improvement (more than 0.0005)
    by transforming these features into a special tag.

    2. Transforming numerical features (I1-I13) to categorical features.
    Empirically we observe using categorical features is always better than
    using numerical features. However, too many features are generated if
    numerical features are directly transformed into categorical features, so we
    use

    v <- floor(log(v)^2)

    to reduce the number of features generated.

    3. Data normalization. According to our empirical experience, instance-wise
    data normalization makes the optimization problem easier to be solved.
    For example, for the following vector x

    (0, 0, 1, 1, 0, 1, 0, 1, 0),

    the standard data normalization divides each element of x by by the 2-norm
    of x. The normalized x becomes

    (0, 0, 0.5, 0.5, 0, 0.5, 0, 0.5, 0).

    The hashing trick do not contribute any improvement on the leaderboard. We
    apply the hashing trick only because it makes our life easier to generate
    features. :)

  • 相关阅读:
    xxl-job 使用相关
    Kettle 使用相关
    C# 调用 WinApi 中 ShellExecute 打开 Excel 的方法
    SQL Server 日志文件清理
    windows远程桌面无法粘贴复制的解决方法
    大数据、业务多元化将是商业银行未来的发展趋势
    TeraData金融数据模型
    整理ORACLE数据库备份常用术语
    ORACLE恢复神器之ODU/AUL/DUL
    ORACLE之UTL_FILE包详解
  • 原文地址:https://www.cnblogs.com/amazement/p/6053944.html
Copyright © 2011-2022 走看看