
    Python for Data Science - Decision Trees With CART

    Decision Tree

    A decision tree is a decision-support tool that models decisions and their possible consequences in order to predict the likely outcomes of those decisions.

    Decision Trees Have Three Types of Nodes

    • Root node: the topmost node, representing the entire training set
    • Decision nodes: internal nodes where the data is split on a feature value
    • Leaf nodes: terminal nodes that carry the final prediction

    Decision Tree Algorithms

    A class of supervised machine learning methods that are useful for making predictions from nonlinear data

    Decision tree algorithms are an appropriate fit for:

    • Continuous input and/or output variables
    • Categorical input and/or output variables

    Two Types of Decision Trees

    • Categorical variable decision tree (or classification tree): used when the decision tree predicts a categorical target variable
    • Continuous variable decision tree (or regression tree): used when the decision tree predicts a continuous target variable

    Main Benefits of Decision Tree Algorithms

    • Decision trees are nonlinear models
    • Decision trees are very easy to interpret
    • Trees can be easily represented graphically, helping in interpretation (see the plotting sketch after this list)
    • Decision trees require less data preparation
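
    As an illustration of the graphical point above, scikit-learn can draw a fitted tree directly. A minimal sketch, assuming scikit-learn and matplotlib are installed (the dataset and depth are illustrative choices):

        # Sketch: visualize a small fitted tree (illustrative dataset/depth)
        import matplotlib.pyplot as plt
        from sklearn.datasets import load_iris
        from sklearn.tree import DecisionTreeClassifier, plot_tree

        X, y = load_iris(return_X_y=True)
        clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

        plot_tree(clf, filled=True)  # renders root, decision, and leaf nodes
        plt.show()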

    Assumptions of Decision Tree Algorithms

    • Root node = Entire training set
    • Predictive features are either categorical, or (if continuous) they are binned before the model is built
    • Records are distributed recursively down the tree on the basis of attribute values

    How Decision Tree Algorithms Work

    1. Apply recursive binary splitting to stratify or segment the predictor space into a number of simple, nonoverlapping regions.
    2. Make a prediction for a given observation by using the mean or the mode of the training observations in the region to which it belongs (see the sketch after these steps).
    3. Use the set of splitting rules to summarize the tree.
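
    A minimal sketch of steps 1 and 2 with scikit-learn's CART implementation (the dataset and split sizes here are illustrative assumptions):

        # Sketch: fit a CART classifier, then predict by leaf-region mode
        from sklearn.datasets import load_iris
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_iris(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=0)

        # Step 1: recursive binary splitting happens inside fit()
        clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

        # Step 2: each prediction is the modal class of the leaf region
        print(clf.predict(X_test[:5]))
        print("test accuracy:", clf.score(X_test, y_test))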

    Recursive Binary Splitting

    Recursive binary splitting is the process that's used to segment a predictor space into regions in order to create a binary decision tree.

    How Recursive Binary Splitting Works

    At every stage we split the current region into two, using the following criteria:

    • In regression trees: use the sum of squared errors (SSE) as the loss function to identify the best split
    • In classification trees: use the Gini index as the loss function to identify the best split (both criteria appear in the sketch below)
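
    A minimal sketch of how these loss functions are selected in scikit-learn (parameter values per recent scikit-learn releases; older releases used "mse" instead of "squared_error"):

        from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

        # Classification trees: the Gini index is the default criterion
        clf = DecisionTreeClassifier(criterion="gini")

        # Regression trees: squared error (the SSE-style loss) is the default
        reg = DecisionTreeRegressor(criterion="squared_error")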

    Characteristics of Recursive Binary Splitting

    • Top-down: splitting begins at the root with the full training set and proceeds downward
    • Greedy: the best split is chosen at each step, with no look-ahead to future splits

    Disadvantages of Decision Tree Algorithms

    • Very non-robust: sensitive to the training data, since small changes in the data can produce a very different tree
    • A globally optimal tree is not guaranteed, because splitting is greedy

    Using Decision Trees for Regression

    Regression trees are appropriate when:

    • Target is a continuous variable
    • Relationship between features and target is nonlinear or complex (if it were linear, a linear model would likely perform better)

    Output values from terminal nodes represent the mean response:

    • Values of new data points will be predicted from that mean (see the sketch below)
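
    A minimal regression-tree sketch on synthetic data (the data and depth are illustrative assumptions): every leaf stores the mean response of its training observations, so predictions form a step function over the predictor space.

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        rng = np.random.default_rng(0)
        X = rng.uniform(0, 10, size=(200, 1))
        y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

        reg = DecisionTreeRegressor(max_depth=3).fit(X, y)

        # Each prediction is the mean response of one terminal node
        print(reg.predict([[2.5], [7.5]]))
        print("distinct leaf means:", np.unique(reg.predict(X)))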

    Recursive Splitting in Regression Trees

    At every stage in a regression tree, the region is split into two according to the sum of squares error (SSE):

    The model begins with the entire data set, S, and searches every distinct value of every predictor to find the predictor and split value that partition the data into two groups (S1 and S2) such that the overall SSE is minimized:

    \[ \mathrm{SSE} = \sum_{i \in S_1}{(y_i - \overline{y}_1)^2} + \sum_{i \in S_2}{(y_i - \overline{y}_2)^2} \]
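
    A minimal numpy sketch of this search for a single predictor (illustrative only, not the library's implementation; real CART code also loops over all predictors):

        import numpy as np

        def best_sse_split(x, y):
            """Scan every distinct value of one predictor and return the
            threshold that minimizes SSE(S1) + SSE(S2)."""
            best_t, best_sse = None, np.inf
            for t in np.unique(x)[:-1]:  # exclude max so both sides are nonempty
                left, right = y[x <= t], y[x > t]
                sse = (((left - left.mean()) ** 2).sum()
                       + ((right - right.mean()) ** 2).sum())
                if sse < best_sse:
                    best_t, best_sse = t, sse
            return best_t, best_sse

        x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
        y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
        print(best_sse_split(x, y))  # best threshold is 3.0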

    Using Decision Trees for Classification

    Classification trees are appropriate when:

    • Target is a categorical variable (binary or multiclass)

    Output values from terminal nodes represent the mode response:

    • Values of new data points will be predicted from that mode

    Recursive Splitting in Classification Trees

    At every stage in a classification tree, the region is split into two according to a user-defined metric, for example:

    The Gini index (G) is a measure of total variance across the K classes; it measures the probability of misclassification

    \[ G = \sum^{K}_{k=1}{\widehat{p}_{mk}(1 - \widehat{p}_{mk})} \]

    • G takes on a small value if all of the \(\widehat{p}_{mk}\) values are close to zero or one
    • Measure of node purity: a small G value indicates that a node contains predominantly observations from a single class (see the sketch below)

    Tree Pruning

    Tree pruning is the process used to overcome model overfitting by removing subtrees of a decision tree (that is, replacing a whole subtree with a leaf node).

    Why Prune a Decision Tree

    Why tree pruning is necessary:

    A tree grown too deep overfits the training data, which leads to bad performance on new data.

    A subtree is pruned (replaced by a single leaf) when its expected error rate is higher than that of the single leaf.

    Two Popular Tree Pruning Methods

    • Hold-out test: the fastest, simplest method; candidate pruning steps are evaluated on a held-out validation set
    • Cost-complexity pruning: penalizes tree size with a complexity parameter, trading training error against subtree size (see the sketch after this list)
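
    A minimal sketch of cost-complexity pruning with scikit-learn (the dataset is illustrative, and the alpha value would normally be chosen by cross-validation):

        from sklearn.datasets import load_breast_cancer
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_breast_cancer(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        # Effective alphas along the pruning path for the training data
        path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
            X_train, y_train)
        print("candidate alphas:", path.ccp_alphas[:5])

        # Refit with a chosen alpha: a larger alpha prunes more aggressively
        pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01)
        pruned.fit(X_train, y_train)
        print("leaves after pruning:", pruned.get_n_leaves())
        print("test accuracy:", pruned.score(X_test, y_test))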