Thoughts on one-hot encoding

Many learning algorithms either learn a single weight per feature, or they use distances between samples. The former is the case for linear models such as logistic regression, which are easy to explain.

Suppose you have a dataset with only a single categorical feature "nationality", with values "UK", "French" and "US". Assume, without loss of generality, that these are encoded as 0, 1 and 2. You then have a single weight w for this feature in a linear classifier, which makes its decision by thresholding the linear score: w×x + b > 0 or, after absorbing the signs into w and b, a test of the form w×x < b.

The problem now is that the weight w cannot encode a three-way choice. The three possible values of w×x are 0, w and 2×w. Either these three all lead to the same decision (they're all < b or all ≥ b), or "UK" and "French" lead to the same decision, or "French" and "US" lead to the same decision. There's no possibility for the model to learn that "UK" and "US" should be given the same label, with "French" the odd one out. (Note: in a binary classification problem with the integer encoding above, "US" and "UK" can never form a class by themselves, whereas with one-hot encoding the model can fit any grouping.)
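The argument above can be checked by brute force. The following is a minimal sketch, assuming the decision rule w×x < b and the encoding UK=0, French=1, US=2 from the text; the search grid and the helper name `labels` are made up for illustration.

```python
from itertools import product

# Integer encoding assumed in the text: UK=0, French=1, US=2.
CODES = {"UK": 0, "French": 1, "US": 2}

def labels(w, b):
    """Decision of the rule w*x < b for each nationality, in CODES order."""
    return tuple(w * x < b for x in CODES.values())

# Hypothetical search grid over w and b (steps of 0.5 between -4 and 4).
grid = [v / 2 for v in range(-8, 9)]
reachable = {labels(w, b) for w, b in product(grid, grid)}

# Patterns where "UK" and "US" agree and "French" is the odd one out.
odd_one_out = {(True, False, True), (False, True, False)}

print(sorted(reachable))        # six of the eight possible labelings
print(reachable & odd_one_out)  # empty: no (w, b) on the grid separates French
```

Because 0, w and 2×w are ordered monotonically in x, a single threshold can only cut that ordering once, which is why the two "odd one out" labelings never appear.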

By one-hot encoding, you effectively blow up the feature space to three features, which will each get their own weights, so the decision function is now w[UK]x[UK] + w[FR]x[FR] + w[US]x[US] < b, where all the x's are booleans. In this space, such a linear function can express any sum/disjunction of the possibilities (e.g. "UK or US", which might be a predictor for someone speaking English).
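As a small sketch of the one-hot decision function, the snippet below uses hand-picked (not learned) weights so that "UK or US" falls on one side of the threshold. The weight values, the threshold b = 0 and the helper names are assumptions for illustration only.

```python
CATEGORIES = ["UK", "French", "US"]

def one_hot(value):
    """Return the boolean indicator vector for a nationality."""
    return [1 if value == c else 0 for c in CATEGORIES]

# Hand-picked illustrative weights, one per one-hot feature.
WEIGHTS = [-1.0, 1.0, -1.0]  # w[UK], w[FR], w[US]
B = 0.0

def decide(value):
    """Evaluate w[UK]x[UK] + w[FR]x[FR] + w[US]x[US] < b."""
    score = sum(w * x for w, x in zip(WEIGHTS, one_hot(value)))
    return score < B

decisions = {c: decide(c) for c in CATEGORIES}
print(decisions)  # UK and US share a label; French is the odd one out
```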

Similarly, any learner based on standard distance metrics (such as k-nearest neighbors) between samples will get confused without one-hot encoding. With the naive encoding and Euclidean distance, the distance between French and US is 1. The distance between US and UK is 2. But with the one-hot encoding, the pairwise distances between [1, 0, 0], [0, 1, 0] and [0, 0, 1] are all equal to √2.
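The distances quoted above can be reproduced with a few lines of standard-library Python. This is only a sketch; the dictionary and function names are made up for the example, and `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import combinations

# Integer encoding (as 1-D points) and one-hot encoding of the same categories.
integer_codes = {"UK": (0,), "French": (1,), "US": (2,)}
one_hot_codes = {"UK": (1, 0, 0), "French": (0, 1, 0), "US": (0, 0, 1)}

def pairwise_distances(codes):
    """Euclidean distance between every pair of encoded categories."""
    return {(a, b): math.dist(codes[a], codes[b])
            for a, b in combinations(codes, 2)}

integer_dist = pairwise_distances(integer_codes)
one_hot_dist = pairwise_distances(one_hot_codes)

print(integer_dist)  # UK-US is twice as far apart as the other pairs
print(one_hot_dist)  # every pair is sqrt(2) apart
```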

This is not true for all learning algorithms; decision trees and derived models such as random forests, if deep enough, can handle categorical variables without one-hot encoding.
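To see why depth matters, here is a hand-written (not learned) depth-2 tree over the integer encoding UK=0, French=1, US=2. It uses two threshold splits on the single feature to isolate "French", which a single linear threshold cannot do. The thresholds and label names are assumptions chosen for illustration.

```python
CODES = {"UK": 0, "French": 1, "US": 2}

def tiny_tree(x):
    """A depth-2 decision tree written by hand: two splits on one feature."""
    if x < 0.5:          # first split separates UK (0) from the rest
        return "English"
    if x < 1.5:          # second split separates French (1) from US (2)
        return "other"
    return "English"

predictions = {name: tiny_tree(code) for name, code in CODES.items()}
print(predictions)  # UK and US get the same label, French a different one
```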

One-hot encoding a DataFrame: the pandas.get_dummies method
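A minimal usage sketch, assuming a toy DataFrame with a single "nationality" column (the data is made up for illustration). The `dtype=int` argument is passed so the indicator columns are 0/1 integers regardless of the pandas version's default.

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({"nationality": ["UK", "French", "US", "UK"]})

# One indicator column per category; column names get the "nationality_" prefix.
encoded = pd.get_dummies(df, columns=["nationality"], dtype=int)

print(encoded)
```

Each row of the result has exactly one 1, in the column matching its original category.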

References:

https://gist.github.com/ramhiser/982ce339d5f8c9a769a0

http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.get_dummies.html

Original article: https://www.cnblogs.com/ljygoodgoodstudydaydayup/p/6874046.html