  • integer encoding vs 1-hot (py)

    https://github.com/szilard/benchm-ml/issues/1

     

    glouppe commented on 7 May 2015

    Thanks for the benchmarks! Proper handling of categorical variables is not an easy issue anyway.

    I would expect integer encoding to be faster and use less memory, but with a decrease in AUC (or the same AUC in some cases).

    When the categories are ordered, it indeed makes more sense to handle them as numerical variables. I don't have a strong argument as to why it may also be better when there is no natural ordering. I guess it could boil down to the fact that one-hot encoded splits are often very unbalanced, while integer-encoded splits may be less unbalanced.
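The balance argument can be made concrete with a small sketch. It assumes a hypothetical toy column with four equally frequent categories and measures, for each candidate split, the fraction of rows that lands in the smaller child. The function names and the data are illustrative and not taken from the benchmark.

```python
from collections import Counter

def one_hot_split_balances(values):
    """Smaller-child fraction for each one-hot split 'category == c'."""
    n = len(values)
    counts = Counter(values)
    return {c: min(k, n - k) / n for c, k in counts.items()}

def integer_split_balances(values, order):
    """Smaller-child fraction for each threshold split 'code <= code(c)'."""
    n = len(values)
    counts = Counter(values)
    balances = {}
    left = 0
    # Skip the last category: that threshold would send every row left.
    for c in order[:-1]:
        left += counts[c]
        balances[c] = min(left, n - left) / n
    return balances

# Hypothetical column: four categories, ten rows each.
values = ["a"] * 10 + ["b"] * 10 + ["c"] * 10 + ["d"] * 10

one_hot = one_hot_split_balances(values)
integer = integer_split_balances(values, ["a", "b", "c", "d"])

best_one_hot = max(one_hot.values())   # most balanced one-hot split
best_integer = max(integer.values())   # most balanced threshold split
```

With k equally frequent categories, every one-hot split isolates only 1/k of the rows, whereas a threshold on the integer codes can cut near the middle. Whether that helps AUC on real data is a separate question that this sketch does not answer.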

    Thanks @glouppe. I read a paper somewhere that, AFAIR, suggested sorting the (non-ordered) categoricals by their frequency in the data and encoding them as integers in that order. Any idea what that paper might be?

    glouppe commented on 7 May 2015


    Yes, it is Breiman's book :) When your output is binary, this strategy is in fact optimal (it will find the best subset among the values of the categorical variables) and linear.

    See section 3.6.3.2 of my thesis if you don't have the CART book.
    http://orbi.ulg.ac.be/bitstream/2268/170309/1/thesis.pdf
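A minimal sketch of the ordering trick for a binary output, under stated assumptions: the result in the CART book, as it is usually stated, orders the categories by the proportion of positive labels within each category (not by how often the category occurs), and the impurity used here is Gini. The toy rows and function names are hypothetical. The sketch checks that scanning the k - 1 prefix splits of the ordered categories reaches the same impurity as trying every subset.

```python
from itertools import combinations

def weighted_gini(rows, left_cats):
    """Weighted Gini impurity of splitting rows by membership in left_cats."""
    def gini(group):
        if not group:
            return 0.0
        p = sum(y for _, y in group) / len(group)
        return 2 * p * (1 - p)
    left = [r for r in rows if r[0] in left_cats]
    right = [r for r in rows if r[0] not in left_cats]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(rows)

def best_split_ordered(rows):
    """Search only prefixes of the categories sorted by positive rate."""
    cats = sorted({c for c, _ in rows})
    rate = {c: sum(y for cc, y in rows if cc == c) /
               sum(1 for cc, _ in rows if cc == c) for c in cats}
    order = sorted(cats, key=lambda c: rate[c])
    prefixes = [set(order[:i]) for i in range(1, len(order))]
    return min(weighted_gini(rows, s) for s in prefixes)

def best_split_exhaustive(rows):
    """Search every non-empty proper subset of categories."""
    cats = sorted({c for c, _ in rows})
    subsets = [set(s) for size in range(1, len(cats))
               for s in combinations(cats, size)]
    return min(weighted_gini(rows, s) for s in subsets)

def make_rows(cat, positives, total):
    """Hypothetical helper: `positives` rows with y=1, the rest with y=0."""
    return [(cat, 1)] * positives + [(cat, 0)] * (total - positives)

# Hypothetical data: four categories with different positive rates.
rows = (make_rows("a", 1, 10) + make_rows("b", 8, 10) +
        make_rows("c", 3, 10) + make_rows("d", 9, 10))

ordered = best_split_ordered(rows)
exhaustive = best_split_exhaustive(rows)
```

The ordered search evaluates k - 1 candidate splits, while the exhaustive one grows exponentially with the number of categories. That is why the shortcut matters in practice.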

    One-hot encoding could be helpful when the number of categories is small (on the order of 10 to 100). In such cases, one-hot encoding can discover interesting interactions like (gender=male) AND (job=teacher).

    Ordering them makes such interactions harder to discover (it takes two splits on job). However, there is indeed no unified way of handling categorical features in trees, and what trees are really good at is usually ordered continuous features anyway.
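To illustrate the "two splits" point, here is a small sketch with a hypothetical integer coding in which teacher falls between two other jobs. With one-hot features a single test on a `job_teacher` indicator isolates teachers, while the integer code needs a lower and an upper threshold. The rows, job names, and codes are made up for illustration.

```python
# Hypothetical toy rows: (gender, job).
rows = [
    ("male", "clerk"),
    ("male", "teacher"),
    ("female", "teacher"),
    ("male", "nurse"),
]

# Hypothetical integer coding: clerk=0, teacher=1, nurse=2.
jobs = ["clerk", "teacher", "nurse"]
job_code = {job: i for i, job in enumerate(jobs)}

def one_hot_rule(gender, job):
    # Two binary tests in total: gender_male == 1, then job_teacher == 1.
    return gender == "male" and job == "teacher"

def integer_rule(gender, job):
    # The job part alone needs two threshold tests to fence off code 1.
    code = job_code[job]
    return gender == "male" and code > 0.5 and code <= 1.5

one_hot_hits = [r for r in rows if one_hot_rule(*r)]
integer_hits = [r for r in rows if integer_rule(*r)]
```

Both rules select the same rows. The difference is only in how many splits a tree has to spend on `job` to get there, and a category that happens to sit at either end of the coding would need just one threshold.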

     
     

     
  • Original article: https://www.cnblogs.com/xinping-study/p/7085221.html