
    Python for Data Science - Decision Trees With CART

    Decision Tree

    A decision tree is a decision-support tool that models decisions and their possible consequences in order to predict the likely outcomes of those decisions.

    Decision Trees Have Three Types of Nodes

    • Root node: the topmost node, representing the entire training set
    • Decision nodes: internal nodes where the data is split on a feature value
    • Leaf nodes: terminal nodes that carry the final prediction

    Decision Tree Algorithms

    A class of supervised machine learning methods that are useful for making predictions from nonlinear data

    Decision tree algorithms are an appropriate fit for:

    • Continuous input and/or output variables
    • Categorical input and/or output variables

    Two Types of Decision Trees

    • Categorical variable decision tree (or classification tree): used when the decision tree predicts a categorical target variable
    • Continuous variable decision tree (or regression tree): used when the decision tree predicts a continuous target variable

    Main Benefits of Decision Tree Algorithms

    • Decision trees are nonlinear models
    • Decision trees are very easy to interpret
    • Trees can be easily represented graphically, helping in interpretation (see the plotting sketch after this list)
    • Decision trees require less data preparation
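
    As an illustration of the graphical point above, scikit-learn can draw a fitted tree directly. A minimal sketch, assuming scikit-learn and matplotlib are installed (the dataset and depth are illustrative choices):

        # Sketch: visualize a small fitted tree (illustrative dataset/depth)
        import matplotlib.pyplot as plt
        from sklearn.datasets import load_iris
        from sklearn.tree import DecisionTreeClassifier, plot_tree

        X, y = load_iris(return_X_y=True)
        clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

        plot_tree(clf, filled=True)  # renders root, decision, and leaf nodes
        plt.show()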

    Assumptions of Decision Tree Algorithms

    • Root node = Entire training set
    • Predictive features are either categorical, or (if continuous) they are binned before the model is built
    • Records are distributed recursively down the tree on the basis of attribute values

    How Decision Tree Algorithms Work

    1. Apply recursive binary splitting to stratify or segment the predictor space into a number of simple, nonoverlapping regions.
    2. Make a prediction for a given observation by using the mean or the mode of the training observations in the region to which it belongs (see the sketch after these steps).
    3. Use the set of splitting rules to summarize the tree.
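
    A minimal sketch of steps 1 and 2 with scikit-learn's CART implementation (the dataset and split sizes here are illustrative assumptions):

        # Sketch: fit a CART classifier, then predict by leaf-region mode
        from sklearn.datasets import load_iris
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_iris(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=0)

        # Step 1: recursive binary splitting happens inside fit()
        clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

        # Step 2: each prediction is the modal class of the leaf region
        print(clf.predict(X_test[:5]))
        print("test accuracy:", clf.score(X_test, y_test))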

    Recursive Binary Splitting

    Recursive binary splitting is the process that's used to segment a predictor space into regions in order to create a binary decision tree.

    How Recursive Binary Splitting Works

    At every stage we split the current region into two, using the following criteria:

    • In regression trees: use the sum of squared errors (SSE) as the loss function to identify the best split
    • In classification trees: use the Gini index as the loss function to identify the best split (both criteria appear in the sketch below)
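
    A minimal sketch of how these loss functions are selected in scikit-learn (parameter values per recent scikit-learn releases; older releases used "mse" instead of "squared_error"):

        from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

        # Classification trees: the Gini index is the default criterion
        clf = DecisionTreeClassifier(criterion="gini")

        # Regression trees: squared error (the SSE-style loss) is the default
        reg = DecisionTreeRegressor(criterion="squared_error")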

    Characteristics of Recursive Binary Splitting

    • Top-down: splitting begins at the root with the full training set and proceeds downward
    • Greedy: the best split is chosen at each step, with no look-ahead to future splits

    Disadvantages of Decision Tree Algorithms

    • Very non-robust: sensitive to the training data, since small changes in the data can produce a very different tree
    • A globally optimal tree is not guaranteed, because splitting is greedy

    Using Decision Trees for Regression

    Regression trees are appropriate when:

    • Target is a continuous variable
    • Relationship between features and target is nonlinear or complex (if it were linear, a linear model would likely perform better)

    Output values from terminal nodes represent the mean response:

    • Values of new data points will be predicted from that mean (see the sketch below)
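
    A minimal regression-tree sketch on synthetic data (the data and depth are illustrative assumptions): every leaf stores the mean response of its training observations, so predictions form a step function over the predictor space.

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        rng = np.random.default_rng(0)
        X = rng.uniform(0, 10, size=(200, 1))
        y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

        reg = DecisionTreeRegressor(max_depth=3).fit(X, y)

        # Each prediction is the mean response of one terminal node
        print(reg.predict([[2.5], [7.5]]))
        print("distinct leaf means:", np.unique(reg.predict(X)))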

    Recursive Splitting in Regression Trees

    At every stage in a regression tree, the region is split into two according to the sum of squares error (SSE):

    The model begins with the entire data set, S, and searches every distinct value of every predictor to find the predictor and split value that partition the data into two groups (S1 and S2) such that the overall SSE is minimized:

    \[ \mathrm{SSE} = \sum_{i \in S_1}{(y_i - \overline{y}_1)^2} + \sum_{i \in S_2}{(y_i - \overline{y}_2)^2} \]
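
    A minimal numpy sketch of this search for a single predictor (illustrative only, not the library's implementation; real CART code also loops over all predictors):

        import numpy as np

        def best_sse_split(x, y):
            """Scan every distinct value of one predictor and return the
            threshold that minimizes SSE(S1) + SSE(S2)."""
            best_t, best_sse = None, np.inf
            for t in np.unique(x)[:-1]:  # exclude max so both sides are nonempty
                left, right = y[x <= t], y[x > t]
                sse = (((left - left.mean()) ** 2).sum()
                       + ((right - right.mean()) ** 2).sum())
                if sse < best_sse:
                    best_t, best_sse = t, sse
            return best_t, best_sse

        x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
        y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
        print(best_sse_split(x, y))  # best threshold is 3.0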

    Using Decision Trees for Classification

    Classification trees are appropriate when:

    • Target is a categorical variable (binary or multiclass)

    Output values from terminal nodes represent the mode response:

    • Values of new data points will be predicted from that mode

    Recursive Splitting in Classification Trees

    At every stage in a classification tree, the region is split into two according to a user-defined metric, for example:

    The Gini index (G) is a measure of total variance across the K classes; it measures the probability of misclassification

    \[ G = \sum^{K}_{k=1}{\widehat{p}_{mk}(1 - \widehat{p}_{mk})} \]

    • G takes on a small value if all of the \(\widehat{p}_{mk}\) values are close to zero or one
    • Measure of node purity: a small G value indicates that a node contains predominantly observations from a single class (see the sketch below)

    Tree Pruning

    Tree pruning is the process used to overcome model overfitting by removing subtrees of a decision tree (that is, replacing a whole subtree with a leaf node).

    Why Prune a Decision Tree

    Why tree pruning is necessary:

    A tree grown too deep overfits the training data, which leads to bad performance on new data.

    A subtree is pruned (replaced by a single leaf) when its expected error rate is higher than that of the single leaf.

    Two Popular Tree Pruning Methods

    • Hold-out test: the fastest, simplest method; candidate pruning steps are evaluated on a held-out validation set
    • Cost-complexity pruning: penalizes tree size with a complexity parameter, trading training error against subtree size (see the sketch after this list)
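
    A minimal sketch of cost-complexity pruning with scikit-learn (the dataset is illustrative, and the alpha value would normally be chosen by cross-validation):

        from sklearn.datasets import load_breast_cancer
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_breast_cancer(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        # Effective alphas along the pruning path for the training data
        path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
            X_train, y_train)
        print("candidate alphas:", path.ccp_alphas[:5])

        # Refit with a chosen alpha: a larger alpha prunes more aggressively
        pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01)
        pruned.fit(X_train, y_train)
        print("leaves after pruning:", pruned.get_n_leaves())
        print("test accuracy:", pruned.score(X_test, y_test))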