Venn Diagram Comparison of Boruta, FSelectorRcpp and GLMnet Algorithms

Feature selection is the process of extracting valuable features that have a significant influence on the dependent variable. It is still an active field of research and machine learning. In this post I compare a few feature selection algorithms: traditional GLM with regularization, the computationally demanding Boruta, and an entropy-based filter from the FSelectorRcpp package (free of Java/Weka). Check out the comparison on a Venn diagram, carried out on data from the RTCGA factory of R data packages.

I would like to thank Magda Sobiczewska and pbiecek for inspiring this comparison. I had a chance to use Boruta and FSelectorRcpp in action. GLMnet is here only to enrich the Venn diagram.

RTCGA data

The data used for this comparison come from RTCGA (http://rtcga.github.io/RTCGA/) and contain gene expression levels (RNASeq) from sequenced human genomes. The RNASeq datasets are available via the RTCGA.rnaseq data package and were originally provided by The Cancer Genome Atlas. It’s a great set of over 20 thousand features (1 gene expression = 1 continuous feature) that might influence various aspects of human survival. Let’s use the data for breast cancer (Breast invasive carcinoma / BRCA) and try to find valuable genes that have an impact on the dependent variable, which denotes whether a sample came from tumor or from normal, healthy tissue.
```
## biocLite() has been retired; current Bioconductor releases install via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("RTCGA.rnaseq")
```
```
library(RTCGA.rnaseq)
BRCA.rnaseq$bcr_patient_barcode <- 
   substr(BRCA.rnaseq$bcr_patient_barcode, 14, 14)
```
The dependent variable, bcr_patient_barcode, is derived from the TCGA barcode: its 14th character tells us whether a sample came from tumor or from normal, healthy tissue.
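As a minimal sketch of what that substr() call does, the snippet below applies it to two made-up barcodes (hypothetical, not taken from RTCGA.rnaseq). It assumes the TCGA convention that characters 14-15 hold the sample type code, where codes starting with 0 denote tumor samples and codes starting with 1 denote normal samples; the labels "tumor" and "normal" are introduced here for illustration only.

```r
# Two hypothetical barcodes in the TCGA layout (not real samples)
barcodes <- c("TCGA-XX-0001-01A-11R-0000-07",
              "TCGA-XX-0002-11A-11R-0000-07")

# Keep only the 14th character, as in the post
code <- substr(barcodes, 14, 14)

# Assumed mapping: "0" = tumor, "1" = normal
tissue <- factor(code, levels = c("0", "1"),
                 labels = c("tumor", "normal"))

# Count samples per class
table(tissue)
```

Running table() on the recoded column of the real dataset is a quick way to check how unbalanced the two classes are before fitting any model.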

Check another RTCGA use case: TCGA and The Curse of BigData.

GLMnet

Logistic regression, a member of the generalized linear model (GLM) family and a common first-attempt model for class prediction, can be extended with an elastic-net regularization penalty (the lasso by default in glmnet) to provide prediction and variable selection at the same time. We can assume that features that are not valuable will end up with coefficients equal to zero in the final model fitted with the best regularization parameter. A broader explanation can be found in the vignette of the glmnet package. Below is the code I use to extract valuable features with the extra help of cross-validation and parallel computing.
```
library(doMC)
registerDoMC(cores=6)
library(glmnet)
# fit the model
cv.glmnet(x = as.matrix(BRCA.rnaseq[, -1]),
          y = factor(BRCA.rnaseq[, 1]),
          family = "binomial", 
          type.measure = "class", 
          parallel = TRUE) -> cvfit
# extract names of features that have a
# non-zero coefficient; the first name is
# the intercept, so drop it with [-1]
names(which(
   coef(cvfit, s = "lambda.min")[, 1] != 0)
   )[-1] -> glmnet.features
```
The coef function extracts coefficients from the fitted model. The argument s specifies the regularization parameter for which we would like to extract them: lambda.min is the value for which the cross-validated misclassification error is minimal. You may also try lambda.1se, the largest value whose error is within one standard error of that minimum.
```
plot(cvfit)
```
A discussion about standardization for LASSO can be found here. I normally don’t do this, since I work with streaming data, for which checking assumptions, model diagnostics and standardization are problematic and still a rapidly developing field of research.
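For readers who want to see what standardization means in practice, the following base-R sketch centres and scales the columns of a small made-up matrix (the gene names and values are hypothetical). Note that glmnet has a standardize argument that is TRUE by default, so predictors are standardized internally before fitting and the coefficients are returned on the original scale; the manual scale() call here is only an illustration.

```r
# Hypothetical expression matrix: 4 samples x 2 genes on very different scales
x <- matrix(c(1, 2, 3, 4,
              1000, 3000, 2000, 4000),
            ncol = 2,
            dimnames = list(NULL, c("geneA", "geneB")))

# Centre each column to mean 0 and scale it to standard deviation 1
x_std <- scale(x)

# Check the result column by column
colMeans(x_std)
apply(x_std, 2, sd)
```

Without this step a penalty on coefficient size treats features measured on large scales differently from features measured on small scales, which is why the question matters for LASSO.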

Reposted from: http://r-addict.com/2016/06/19/Venn-Diagram-RTCGA-Feature-Selection.html

Original address: https://www.cnblogs.com/payton/p/5604104.html