  • How tree-based models handle categorical variables

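    Categorical variables fall into two kinds: ordered and unordered. For ordered variables, encoding them directly as integers with sklearn.preprocessing.LabelEncoder is the best approach, provided the resulting codes preserve the actual order. For unordered variables, the main question is whether to use one-hot encoding. This Quora answer (author: Clem Wang) lists some other approaches: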

    One can try a few other approaches:

    • look at how the response variable responds to the categorical values and try to group them.
    • Find another ML algorithm that works better with categorical features or with one-hot encoding and use that to train a submodel that just uses the categorical features. Then replace the categorical feature with a probability score. For instance, use a Logistic Regression on the hot-encoded values.
    • Try to combine the categorical feature with some other features.
    • Build N xgboost classifiers, one for each category.

    This may require playing around with the data a bit. Plotting the data may help you see patterns that you didn't know that were there.
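
    The second suggestion in the quoted list (train a submodel on the categorical feature alone, then replace the feature with a probability score) might look roughly like the sketch below. The toy data, the `city` feature, and the use of 5-fold out-of-fold predictions are assumptions made for illustration; none of them come from the quoted answer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up for illustration): one unordered categorical feature
# and a binary response whose positive rate differs by category.
rng = np.random.default_rng(0)
cities = np.array(["beijing", "shanghai", "shenzhen", "chengdu"])
true_rate = {"beijing": 0.8, "shanghai": 0.6, "shenzhen": 0.3, "chengdu": 0.1}
city = rng.choice(cities, size=400)
y = np.array([rng.random() < true_rate[c] for c in city]).astype(int)

# Submodel: logistic regression on the one-hot encoded category only.
submodel = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(),
)

# Out-of-fold predictions: each row is scored by a model that was fitted
# without that row, which limits leakage of the label into the new feature.
X_cat = city.reshape(-1, 1)
city_score = cross_val_predict(
    submodel, X_cat, y, cv=5, method="predict_proba"
)[:, 1]

# city_score is a single numeric column that can stand in for `city`
# in the feature matrix passed to the tree-based model.
for name in cities:
    print(name, round(float(city_score[city == name].mean()), 3))
```

    The out-of-fold step is a design choice rather than a requirement: fitting the submodel on all rows and scoring those same rows would also yield a probability score, but that score would have seen each row's own label.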

    This blog post gives an overall conclusion on using one-hot encoding with xgboost:

    In summary, there are roughly two conclusions:

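    • 1. For categorical variables whose categories are ordered, such as age, it is fine to treat them as numeric variables. For categorical variables whose categories are unordered, one-hot encoding is recommended, although it increases memory use and training time.
    • 2. One-hot encoding is recommended when the number of categories is small (tqchen gives a range of [10, 100]).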
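
    The two encodings discussed above can be sketched as follows. The `size` and `color` columns are hypothetical toy data. For the ordered column this sketch uses OrdinalEncoder with an explicit level list instead of LabelEncoder, because LabelEncoder assigns codes in sorted label order, which for string levels is alphabetical and may not match the real order.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy columns: "size" has a natural order, "color" does not.
size = np.array([["small"], ["large"], ["medium"], ["small"]])
color = np.array([["red"], ["green"], ["blue"], ["green"]])

# Ordered variable: list the levels explicitly so the integer codes follow
# the real order (small < medium < large) instead of alphabetical order.
size_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = size_encoder.fit_transform(size).ravel()

# Unordered variable: one indicator column per category.
color_encoder = OneHotEncoder()
color_onehot = color_encoder.fit_transform(color).toarray()

print(size_codes)                    # [0. 2. 1. 0.]
print(color_encoder.categories_[0])  # ['blue' 'green' 'red']
print(color_onehot.shape)            # (4, 3)
```

    The one-hot matrix has one column per category, which is where the extra memory and training time mentioned in the first point come from as the number of categories grows.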

    Other related material

    comment:re sklearn -- integer encoding vs 1-hot

  • Original article: https://www.cnblogs.com/ZeroTensor/p/10097069.html