• Advanced Feature Engineering II

The following are notes on the Coursera course How to Win a Data Science Competition: Learn from Top Kagglers.

    Statistics and distance based features

This part focuses on two kinds of advanced features: various statistics of one feature computed over groups defined by another feature, and features obtained by analyzing the neighborhood of a given point.

    groupby and nearest neighbor methods

Example: here is some data from a CTR (click-through rate) task.

    statistic_ctr_data.png

We can hypothesize that the ad with the lowest price on a page will attract most of the attention, while the other ads on that page will be less attractive. Features expressing this intuition are very easy to compute: for every user and web page, we can add the minimum and maximum price over all ads shown. The position of the lowest-priced ad can also be used.

    statistic_ctr_data2.png

Code implementation
    statistic_ctr_data_code.png

• More features (a sketch covering several of these follows the list)
    • How many pages user visited
    • Standard deviation of prices
    • Most visited page
    • Many, many more
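
Below is a minimal pandas sketch of such groupby features; the dataframe and its column names (user_id, page_id, price) are illustrative assumptions, not the course's actual data.

    import pandas as pd

    # Illustrative CTR-style data; column names are assumptions
    df = pd.DataFrame({
        'user_id': [1, 1, 1, 2, 2],
        'page_id': [10, 10, 11, 10, 12],
        'price':   [3.0, 5.0, 2.0, 4.0, 1.0],
    })

    # Lowest/highest ad price for each (user, page) pair
    g = df.groupby(['user_id', 'page_id'])['price']
    df['min_price'] = g.transform('min')
    df['max_price'] = g.transform('max')
    df['std_price'] = g.transform('std')  # standard deviation of prices

    # How many distinct pages each user visited
    df['n_pages_visited'] = df.groupby('user_id')['page_id'].transform('nunique')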

What if there is no explicit feature to group by like this? We can use nearest neighbors instead.

    Neighbors

    • Explicit group is not needed
    • More flexible
    • Much harder to implement

    Examples

    • Number of houses in 500m, 1000m,..
    • Average price per square meter in 500m, 1000m,..
    • Number of schools/supermarkets/parking lots in 500m, 1000m,..
• Distance to closest subway station (a sketch of such radius-based features follows the list)
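
A minimal sketch of such radius-based features using sklearn, assuming house coordinates are already projected to meters; all names and numbers are illustrative.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    # Illustrative data: house coordinates (meters) and price per square meter
    rng = np.random.RandomState(0)
    coords = rng.rand(1000, 2) * 5000
    price_per_m2 = rng.rand(1000) * 3000 + 1000

    nn = NearestNeighbors().fit(coords)
    for radius in [500, 1000]:
        # For each house, indices of all houses within `radius` meters
        idx = nn.radius_neighbors(coords, radius=radius, return_distance=False)
        n_houses = np.array([len(i) - 1 for i in idx])            # exclude the house itself
        avg_price = np.array([price_per_m2[i].mean() for i in idx])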

The instructor used this approach in the Springleaf competition.

KNN features in Springleaf

    • Mean encode all the variables
• For every point, find 2000 nearest neighbors using the Bray-Curtis metric:

\[ \frac{\sum_i |u_i - v_i|}{\sum_i |u_i + v_i|} \]

    • Calculate various features from those 2000 neighbors

Evaluate (a sketch of these features follows the list)

• Mean target of nearest 5, 10, 15, 500, 2000 neighbors
    • Mean distance to 10 closest neighbors
    • Mean distance to 10 closest neighbors with target 1
    • Mean distance to 10 closest neighbors with target 0
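
A minimal sketch of these KNN features, assuming X_enc holds the mean-encoded variables and y the binary target (both illustrative); in practice the encodings would be computed out-of-fold to avoid target leakage.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    # Illustrative mean-encoded features and binary target
    rng = np.random.RandomState(0)
    X_enc = rng.rand(3000, 20)
    y = (rng.rand(3000) > 0.5).astype(int)

    K = 2000
    nn = NearestNeighbors(n_neighbors=K + 1, metric='braycurtis').fit(X_enc)
    dist, idx = nn.kneighbors(X_enc)
    dist, idx = dist[:, 1:], idx[:, 1:]   # drop each point itself (distance 0)
    nbr_target = y[idx]                   # targets of the 2000 nearest neighbors

    feats = {}
    for k in [5, 10, 15, 500, 2000]:
        feats['mean_target_%d' % k] = nbr_target[:, :k].mean(axis=1)
    feats['mean_dist_10'] = dist[:, :10].mean(axis=1)
    for t in (0, 1):
        # mean distance to the 10 closest neighbors whose target equals t
        feats['mean_dist_10_t%d' % t] = np.array(
            [d[m == t][:10].mean() for d, m in zip(dist, nbr_target)])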

    Matrix factorizations for feature extraction

    • Example of feature fusion
      fusion.png

Notes about Matrix Factorization

• Can be applied to only some columns
    • Can provide additional diversity
    • Good for ensembles
• It is a lossy transformation. Its efficiency depends on:
    • Particular task
    • Number of latent factors
      • Usually 5-100

Implementation

• Several MF methods can be found in sklearn (a sketch follows the list below)
• SVD and PCA
  • Standard tools for Matrix Factorization
• TruncatedSVD
  • Works with sparse matrices
• Non-negative Matrix Factorization (NMF)
  • Ensures that all latent factors are non-negative
  • Good for count-like data
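
A minimal sketch of these tools, assuming a sparse, non-negative count matrix; all shapes and names are illustrative.

    import scipy.sparse as sp
    from sklearn.decomposition import TruncatedSVD, NMF

    # Illustrative sparse count-like matrix (e.g., bag of words)
    X = sp.random(1000, 500, density=0.01, random_state=0, format='csr') * 10

    # TruncatedSVD works directly with sparse matrices
    svd = TruncatedSVD(n_components=50, random_state=0)   # 5-100 latent factors is typical
    X_svd = svd.fit_transform(X)

    # NMF requires non-negative input; good for count-like data
    nmf = NMF(n_components=50, random_state=0, max_iter=500)
    X_nmf = nmf.fit_transform(X)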

    NMF for tree-based methods

Non-negative matrix factorization (NMF) transforms data in a way that makes it more suitable for decision trees.
    NMF.png

As can be seen, NMF transforms the data so that it forms lines parallel to the axes.

Factorization tricks

The same transformation tricks used for linear models can be applied before factorizing a matrix.
    NMF_note.png
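
For example, one such trick for count-like data is a log(1 + x) transform before factorization; a minimal sketch with illustrative data:

    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    # Illustrative count features; log1p is a trick borrowed from linear-model preprocessing
    X_counts = np.random.RandomState(0).poisson(3.0, size=(1000, 100))
    X_factors = TruncatedSVD(n_components=20, random_state=0).fit_transform(np.log1p(X_counts))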

    Conclusion

    • Matrix Factorization is a very general approach for dimensionality reduction and feature extraction
• It can be applied to transform categorical features into real-valued ones
• Many tricks suitable for linear models are also useful for MF

    Feature interactions

Interactions are built from combinations of feature values.

• Example: banner selection

Suppose we are building a model that predicts the best ad banner to display on a website.

...  category_ad      category_site  ...  is_clicked
...  auto_part        game_news      ...  0
...  music_tickets    music_news     ...  1
...  mobile_phones    auto_blog      ...  0

Combining the category of the ad banner itself with the category of the site where the banner will be shown forms a very strong feature.

...  ad_site                      ...  is_clicked
...  auto_part | game_news        ...  0
...  music_tickets | music_news   ...  1
...  mobile_phones | auto_blog    ...  0

Here ad_site is the combined feature built from these two features.

From a technical point of view, there are two ways to construct such an interaction (a sketch of both follows the two figures below).

    • Example of interactions

Method 1
    interaction1.png

Method 2
    interaction2.png
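
A minimal sketch of the two methods as I read the figures: method 1 concatenates the values and one-hot encodes the result, method 2 multiplies pairs of one-hot columns. Data is the toy table above.

    import pandas as pd

    df = pd.DataFrame({
        'category_ad':   ['auto_part', 'music_tickets', 'mobile_phones'],
        'category_site': ['game_news', 'music_news', 'auto_blog'],
    })

    # Method 1: concatenate the values, then one-hot encode the combined feature
    df['ad_site'] = df['category_ad'] + ' | ' + df['category_site']
    method1 = pd.get_dummies(df['ad_site']).astype(int)

    # Method 2: one-hot encode each feature, then take products of column pairs
    ohe_ad = pd.get_dummies(df['category_ad'], prefix='ad').astype(int)
    ohe_site = pd.get_dummies(df['category_site'], prefix='site').astype(int)
    method2 = pd.DataFrame({'%s*%s' % (a, s): ohe_ad[a] * ohe_site[s]
                            for a in ohe_ad.columns for s in ohe_site.columns})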

• A similar idea can also be applied to numerical variables
      interge_interaction.png

In fact, this is not limited to multiplication; other operations can be used as well (a sketch follows the list):

    • Multiplication
    • Sum
    • Diff
    • Division
    • ..
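
A minimal sketch generating all pairwise interactions of a few numeric columns; names and values are illustrative.

    import itertools
    import pandas as pd

    df = pd.DataFrame({'f1': [1.0, 2.0, 3.0],
                       'f2': [4.0, 5.0, 6.0],
                       'f3': [7.0, 8.0, 9.0]})

    for a, b in itertools.combinations(list(df.columns), 2):
        df[a + '*' + b] = df[a] * df[b]   # multiplication
        df[a + '+' + b] = df[a] + df[b]   # sum
        df[a + '-' + b] = df[a] - df[b]   # diff
        df[a + '/' + b] = df[a] / df[b]   # division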

Practical Notes

• We have a lot of possible interactions: N*N for N features.
  • Even more if we use several types of interactions
• We need to reduce their number:
  • a. Dimensionality reduction
  • b. Feature selection

This approach generates a large number of features, which can be reduced using feature selection or dimensionality reduction. Below, feature selection is used as an illustration, followed by a sketch.
    sele.png
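
A minimal sketch of such selection via random forest feature importances, assuming X_int holds the generated interactions and y the target (both illustrative):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Illustrative interaction matrix and target
    rng = np.random.RandomState(0)
    X_int = rng.rand(1000, 200)
    y = (rng.rand(1000) > 0.5).astype(int)

    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_int, y)
    top = np.argsort(rf.feature_importances_)[::-1][:50]   # keep the 50 most important
    X_selected = X_int[:, top]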

    Interactions' order

    • We looked at 2nd order interactions.
    • Such approach can be generalized for higher orders.
    • It is hard to do generation and selection automatically.
    • Manual building of high-order interactions is some kind of art.

    Extract features from DT

    tree_interaction.png

Consider a decision tree, and map each of its leaves to a binary feature; the index of the leaf an object falls into can be used as the value of a new categorical feature. If we use an ensemble of trees instead of a single tree, for example a random forest, this operation can be applied to every tree. This is a powerful way to extract high-order interactions.

    • How to use it

    In sklearn:

    tree_model.apply()
    

    In xgboost:

    booster.predict(pred_leaf=True)
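
A minimal runnable sketch of the sklearn variant, one-hot encoding the leaf indices so each leaf becomes a binary feature; the data is illustrative.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import OneHotEncoder

    # Illustrative data
    rng = np.random.RandomState(0)
    X = rng.rand(500, 10)
    y = (rng.rand(500) > 0.5).astype(int)

    rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
    leaves = rf.apply(X)                                   # leaf index per tree, shape (n_samples, n_trees)
    leaf_features = OneHotEncoder().fit_transform(leaves)  # sparse binary leaf features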
    

    Conclusion

    • We looked at ways to build an interaction of categorical attributes
    • Extended this approach to real-valued features
• Learned how to extract features via decision trees

    t-SNE

t-SNE is used for exploratory data analysis. It can also be viewed as a method for obtaining features from data.

    Practical Notes

• Results heavily depend on hyperparameters (perplexity)
• Good practice is to use several projections with different perplexities (5-100)
• Due to its stochastic nature, tSNE provides different projections even for the same data and hyperparameters
• Train and test should be projected together (see the sketch after this list)
• tSNE runs for a long time when there is a big number of features
  • It is common to do dimensionality reduction before projection
• An implementation of tSNE can be found in the sklearn library
  • But the instructor personally prefers the stand-alone Python package tsne due to its faster speed
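
A minimal sketch following these notes: reduce dimensionality first, then project train and test together. All sizes and the perplexity value are illustrative.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    # Illustrative train/test features
    rng = np.random.RandomState(0)
    X_train, X_test = rng.rand(800, 100), rng.rand(200, 100)

    # Reduce dimensionality first, then project train and test together
    X_all = PCA(n_components=30, random_state=0).fit_transform(np.vstack([X_train, X_test]))
    proj = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_all)
    proj_train, proj_test = proj[:len(X_train)], proj[len(X_train):]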

    Conclusion

    • tSNE is a great tool for visualization
• It can be used as a feature as well
    • Be careful with interpretation of results
    • Try different perplexities

