zoukankan      html  css  js  c++  java
  • 决策树 机器学习,西瓜书p80 表4.2 使用信息增益生成决策树及后剪枝

    使用信息增益构造决策树,完成后剪枝

    1 构造决策树

    1 根结点的选择

    色泽 信息增益

    根据色泽划分为 青绿,乌黑,浅白 三个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{2}{4} log_2 frac{2}{4}+frac{2}{4} log_2 frac{2}{4})=1 \ Ent(D^2) &= -(frac{1}{4} log_2 frac{1}{4}+frac{3}{4} log_2 frac{3}{4})=0.811 \ Ent(D^3)&= -(frac{2}{2} log_2 frac{2}{2}+frac{0}{2} log_2 frac{0}{2})=0 \ Ent(D)&= -(frac{5}{10} log_2 frac{5}{10}+frac{5}{10} log_2 frac{5}{10})=1 \ Gani(D,色泽)&=Ent(D)-sum_{v=1}^3 frac{|D^v|}{|D|}Ent(D^v) \ &= 1 - (frac{4}{10} imes 1+frac{4}{10} imes 0.811+frac{2}{10} imes0) \ &= 0.2756 end{aligned} ]

    根蒂 信息增益

    根据根蒂划分为 蜷缩 稍蜷 硬挺 三个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{2}{5} log_2 frac{2}{5}+frac{3}{5} log_2 frac{3}{5})=0.971 \ Ent(D^2) &= -(frac{2}{4} log_2 frac{2}{4}+frac{2}{4} log_2 frac{2}{4})=1 \ Ent(D^3)&= -(frac{1}{1} log_2 frac{1}{1}+frac{0}{1} log_2 frac{0}{1})=0 \ Ent(D)&= -(frac{5}{10} log_2 frac{5}{10}+frac{5}{10} log_2 frac{5}{10})=1 \ Gani(D,根蒂)&=Ent(D)-sum_{v=1}^3 frac{|D^v|}{|D|}Ent(D^v) \ &= 1 - (frac{5}{10} imes 0.971+frac{4}{10} imes 1+frac{1}{10} imes0) \ &= 0.1145 end{aligned} ]

    敲声 信息增益

    根据色泽划分为 浊响,沉闷,清脆 三个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{2}{6} log_2 frac{2}{6}+frac{4}{6} log_2 frac{4}{6})=0.918 \ Ent(D^2) &= -(frac{2}{3} log_2 frac{2}{3}+frac{1}{3} log_2 frac{1}{3})=0.918 \ Ent(D^3)&= -(frac{1}{1} log_2 frac{1}{1}+frac{0}{1} log_2 frac{0}{1})=0 \ Ent(D)&= -(frac{5}{10} log_2 frac{5}{10}+frac{5}{10} log_2 frac{5}{10})=1 \ Gani(D,敲声)&=Ent(D)-sum_{v=1}^3 frac{|D^v|}{|D|}Ent(D^v) \ &= 1 - (frac{6}{10} imes 0.918+frac{3}{10} imes 0.918+frac{1}{10} imes0) \ &=0.2346 end{aligned} ]

    纹理 信息增益

    根据 纹理 划分为 清晰 稍糊 模糊 三个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{2}{6} log_2 frac{2}{6}+frac{4}{6} log_2 frac{4}{6})=0.918 \ Ent(D^2) &= -(frac{2}{3} log_2 frac{2}{3}+frac{1}{3} log_2 frac{1}{3})=0.918 \ Ent(D^3)&= -(frac{1}{1} log_2 frac{1}{1}+frac{0}{1} log_2 frac{0}{1})=0 \ Ent(D)&= -(frac{5}{10} log_2 frac{5}{10}+frac{5}{10} log_2 frac{5}{10})=1 \ Gani(D,纹理)&=Ent(D)-sum_{v=1}^3 frac{|D^v|}{|D|}Ent(D^v) \ &= 1 - (frac{6}{10} imes 0.918+frac{3}{10} imes 0.918+frac{1}{10} imes0) \ &= 0.2346 end{aligned} ]

    脐部 信息增益

    根据色泽划分为 凹陷,稍凹,平坦 三个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{1}{4} log_2 frac{1}{4}+frac{3}{4} log_2 frac{3}{4})=0.811 \ Ent(D^2) &= -(frac{2}{4} log_2 frac{2}{4}+frac{2}{4} log_2 frac{2}{4})=1 \ Ent(D^3)&= -(frac{2}{2} log_2 frac{2}{2}+frac{0}{2} log_2 frac{0}{2})=0 \ Ent(D)&= -(frac{5}{10} log_2 frac{5}{10}+frac{5}{10} log_2 frac{5}{10})=1 \ Gani(D,脐部)&=Ent(D)-sum_{v=1}^3 frac{|D^v|}{|D|}Ent(D^v) \ &= 1 - (frac{4}{10} imes 0.811+frac{4}{10} imes 1+frac{2}{10} imes0) \ &= 0.2756 end{aligned} ]

    触感 信息增益

    根据色泽划分为 硬滑,软粘 两个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{3}{6} log_2 frac{3}{6}+frac{3}{6} log_2 frac{3}{6})=1 \ Ent(D^2) &= -(frac{2}{4} log_2 frac{2}{4}+frac{2}{4} log_2 frac{2}{4})=1 \ Ent(D)&= -(frac{5}{10} log_2 frac{5}{10}+frac{5}{10} log_2 frac{5}{10})=1 \ Gani(D,触感)&=Ent(D)-sum_{v=1}^2 frac{|D^v|}{|D|}Ent(D^v) \ &= 1 - (frac{6}{10} imes 1 +frac{4}{10} imes 1 \ &= 0 end{aligned} ]

    选择根结点构建决策树

    [egin{aligned} Gain(D,色泽)=0.2756 Gain(D,根蒂)=0.1145 Gain(D,敲声)=0.2346 \ Gain(D,纹理)=0.2346 Gain(D,脐部)=0.2756 Gain(D,触感)=0 end{aligned} ]

    比较六个属性的信息增益大小,选择脐部作为根结点

    则数据集被划分为

    image-20211031203958969

    2 对分支结点({1,2,3,14})进行划分

    色泽 信息增益

    根据色泽划分为 青绿,乌黑,浅白 三个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{0}{1} log_2 frac{0}{1}+frac{1}{1} log_2 frac{1}{1})=0 \ Ent(D^2) &= -(frac{0}{2} log_2 frac{0}{2}+frac{2}{2} log_2 frac{2}{2})=0 \ Ent(D^3)&= -(frac{1}{1} log_2 frac{1}{1}+frac{0}{1} log_2 frac{0}{1})=0 \ Ent(D)&= -(frac{1}{4} log_2 frac{1}{4}+frac{3}{4} log_2 frac{3}{4})=0.811 \ Gani(D,色泽)&=Ent(D)-sum_{v=1}^3 frac{|D^v|}{|D|}Ent(D^v) \ &= 0.811 - (frac{1}{4} imes 0+frac{2}{4} imes 0 +frac{1}{4} imes 0) \ &= 0.811 end{aligned} ]

    根蒂 信息增益

    根据根蒂划分为 蜷缩 稍蜷 两个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{0}{3} log_2 frac{0}{3}+frac{3}{3} log_2 frac{3}{3})=0 \ Ent(D^2) &= -(frac{1}{1} log_2 frac{1}{1}+frac{0}{1} log_2 frac{0}{1})=0 \ Ent(D)&= -(frac{1}{4} log_2 frac{1}{4}+frac{3}{4} log_2 frac{3}{4})= 0.811\ Gani(D,根蒂)&=Ent(D)-sum_{v=1}^2 frac{|D^v|}{|D|}Ent(D^v) \ &= 0.811 - (frac{3}{4} imes 0 +frac{1}{4} imes 0) \ &= 0.811 end{aligned} ]

    敲声 信息增益

    根据色泽划分为 浊响,沉闷 两个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{0}{2} log_2 frac{0}{2}+frac{2}{2} log_2 frac{2}{2})=0 \ Ent(D^2) &= -(frac{1}{2} log_2 frac{1}{2}+frac{1}{2} log_2 frac{1}{2})=1 \ Ent(D)&= -(frac{1}{4} log_2 frac{1}{4}+frac{3}{4} log_2 frac{3}{4})=0.811 \ Gani(D,敲声)&=Ent(D)-sum_{v=1}^2 frac{|D^v|}{|D|}Ent(D^v) \ &= 0.811 - (frac{2}{4} imes 0 +frac{2}{4} imes 1 ) \ &=0.311 end{aligned} ]

    纹理 信息增益

    根据 纹理 划分为 清晰 稍糊 两个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{0}{3} log_2 frac{0}{3}+frac{3}{3} log_2 frac{3}{3})=0 \ Ent(D^2) &= -(frac{1}{1} log_2 frac{1}{1}+frac{0}{1} log_2 frac{0}{1})=0 \ Ent(D)&= -(frac{1}{4} log_2 frac{1}{4}+frac{3}{4} log_2 frac{3}{4})=0.811 \ Gani(D,纹理)&=Ent(D)-sum_{v=1}^2 frac{|D^v|}{|D|}Ent(D^v) \ &= 0.811 - (frac{3}{4} imes 0+frac{1}{4} imes 0 ) \ &= 0.811 end{aligned} ]

    触感 信息增益

    根据触感划分为 硬滑 一个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{1}{4} log_2 frac{1}{4}+frac{3}{4} log_2 frac{3}{4})=0.811 \ Ent(D)&= -(frac{1}{4} log_2 frac{1}{4}+frac{3}{4} log_2 frac{3}{4})= 0.811 \ Gani(D,触感)&=Ent(D)-sum_{v=1}^1 frac{|D^v|}{|D|}Ent(D^v) \ &= 0.811 - (frac{4}{4} imes 0.811 ) \ &= 0 end{aligned} ]

    选择分类结点构建决策树

    [egin{aligned} Gain(D,色泽)=0.811 Gain(D,根蒂)=0.811 Gain(D,敲声)=0.311 \ Gain(D,纹理)=0.811 Gain(D,触感)=0 end{aligned} ]

    不妨选择色泽作为分类依据

    形成的决策树

    image-20211031211024832

    3 对分支 ({6,7,15,17})进行划分

    色泽 信息增益

    根据色泽划分为 青绿,乌黑 两个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{1}{2} log_2 frac{1}{2}+frac{1}{2} log_2 frac{1}{2})=1 \ Ent(D^2) &= -(frac{1}{2} log_2 frac{0}{2}+frac{1}{2} log_2 frac{1}{2})=1 \ Ent(D)&= -(frac{2}{4} log_2 frac{2}{4}+frac{2}{4} log_2 frac{2}{4})= 1 \ Gani(D,色泽)&=Ent(D)-sum_{v=1}^2 frac{|D^v|}{|D|}Ent(D^v) \ &= 1 - (frac{2}{4} imes 1+frac{2}{4} imes 1 ) \ &= 0 end{aligned} ]

    根蒂 信息增益

    根据根蒂划分为 蜷缩 稍蜷 两个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{1}{3} log_2 frac{1}{3}+frac{2}{3} log_2 frac{2}{3})=0。918\ Ent(D^2) &= -(frac{1}{1} log_2 frac{1}{1}+frac{0}{1} log_2 frac{0}{1})=0 \ Ent(D)&= -(frac{2}{4} log_2 frac{2}{4}+frac{2}{4} log_2 frac{2}{4})= 1\ Gani(D,根蒂)&=Ent(D)-sum_{v=1}^2 frac{|D^v|}{|D|}Ent(D^v) \ &= 1 - (frac{3}{4} imes 0.918 +frac{1}{4} imes 0) \ &= 0.3115 end{aligned} ]

    敲声 信息增益

    根据色泽划分为 浊响,沉闷 两个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{1}{3} log_2 frac{1}{3}+frac{2}{3} log_2 frac{2}{3})=0.918 \ Ent(D^2) &= -(frac{1}{1} log_2 frac{1}{1}+frac{0}{1} log_2 frac{0}{1})= 0 \ Ent(D)&= -(frac{2}{4} log_2 frac{2}{4}+frac{2}{4} log_2 frac{2}{4})=1 \ Gani(D,敲声)&=Ent(D)-sum_{v=1}^2 frac{|D^v|}{|D|}Ent(D^v) \ &= 1 - (frac{3}{4} imes 0.918 +frac{1}{4} imes 0 ) \ &=0.3115 end{aligned} ]

    纹理 信息增益

    根据 纹理 划分为 清晰 稍糊 两个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{1}{2} log_2 frac{1}{2}+frac{1}{2} log_2 frac{1}{2})=1 \ Ent(D^2) &= -(frac{1}{2} log_2 frac{1}{2}+frac{1}{2} log_2 frac{1}{2})=1 \ Ent(D)&= -(frac{2}{4} log_2 frac{2}{4}+frac{2}{4} log_2 frac{2}{4})=1 \ Gani(D,纹理)&=Ent(D)-sum_{v=1}^2 frac{|D^v|}{|D|}Ent(D^v) \ &= 1 - (frac{2}{4} imes 1+frac{2}{4} imes 1 ) \ &= 0 end{aligned} ]

    触感 信息增益

    根据触感划分为 硬滑,软粘 两个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{1}{3} log_2 frac{1}{3}+frac{2}{3} log_2 frac{2}{3})=0.918 \ Ent(D^2) &= -(frac{1}{1} log_2 frac{1}{1}+frac{0}{1} log_2 frac{0}{1})=0 \ Ent(D)&= -(frac{2}{4} log_2 frac{2}{4}+frac{2}{4} log_2 frac{2}{4})=1 \ Gani(D,触感)&=Ent(D)-sum_{v=1}^2 frac{|D^v|}{|D|}Ent(D^v) \ &= 1 - (frac{3}{4} imes 0.918+frac{1}{4} imes 0 ) \ &= 0.2295 end{aligned} ]

    选择分类结点构建决策树

    [egin{aligned} Gain(D,色泽)=0 Gain(D,根蒂)=0.3115 Gain(D,敲声)=0.3115 \ Gain(D,纹理)=0 Gain(D,触感)=0.2295 end{aligned} ]

    不妨选择根蒂作为分类依据

    此时决策树为

    image-20211031213416296

    4 对分支({6,7,15})进行划分

    色泽 信息增益

    根据色泽划分为 青绿,乌黑 两个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{0}{1} log_2 frac{0}{1}+frac{1}{1} log_2 frac{1}{1})=0 \ Ent(D^2) &= -(frac{1}{2} log_2 frac{1}{2}+frac{1}{2} log_2 frac{1}{2})=1 \ Ent(D)&= -(frac{1}{3} log_2 frac{1}{3}+frac{2}{3} log_2 frac{2}{3})= 0.918 \ Gani(D,色泽)&=Ent(D)-sum_{v=1}^2 frac{|D^v|}{|D|}Ent(D^v) \ &= 0.918 - (frac{1}{3} imes 0+frac{2}{3} imes 1 ) \ &= 0.252 end{aligned} ]

    敲声 信息增益

    根据色泽划分为 浊响 一个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{1}{3} log_2 frac{1}{3}+frac{2}{3} log_2 frac{2}{3})=0.918 \ Ent(D)&= -(frac{2}{3} log_2 frac{2}{3}+frac{2}{3} log_2 frac{2}{3})=0.918 \ Gani(D,敲声)&=Ent(D)-sum_{v=1}^1 frac{|D^v|}{|D|}Ent(D^v) \ &= 0.918 - (frac{3}{3} imes 0.918 ) \ &=0 end{aligned} ]

    纹理 信息增益

    根据 纹理 划分为 清晰 稍糊 两个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{1}{2} log_2 frac{1}{2}+frac{1}{2} log_2 frac{1}{2})=1 \ Ent(D^2) &= -(frac{0}{1} log_2 frac{0}{1}+frac{1}{1} log_2 frac{1}{1})=0\ Ent(D)&= -(frac{1}{3} log_2 frac{1}{3}+frac{2}{3} log_2 frac{2}{3})=0.918 \ Gani(D,纹理)&=Ent(D)-sum_{v=1}^2 frac{|D^v|}{|D|}Ent(D^v) \ &= 0.918 - (frac{2}{3} imes 1+frac{1}{3} imes 0 ) \ &= 0.252 end{aligned} ]

    触感 信息增益

    根据触感划分为 软粘 一个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{1}{3} log_2 frac{1}{3}+frac{2}{3} log_2 frac{2}{3})=0.918 \ Ent(D)&= -(frac{1}{3} log_2 frac{1}{3}+frac{2}{3} log_2 frac{2}{3})=0.918\ Gani(D,触感)&=Ent(D)-sum_{v=1}^1 frac{|D^v|}{|D|}Ent(D^v) \ &= 0.918 - (frac{3}{3} imes 0.918 ) \ &= 0 end{aligned} ]

    选择分类结点构建决策树

    [egin{aligned} Gain(D,色泽)=0 .252 Gain(D,敲声)=0 \ Gain(D,纹理)=0.252 Gain(D,触感)=0 end{aligned} ]

    不妨选择色泽作为分类依据

    此时决策树为

    image-20211031214745472

    5 对分支({7,15})​进行划分

    敲声 信息增益

    根据色泽划分为 浊响 一个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{1}{2} log_2 frac{1}{2}+frac{1}{2} log_2 frac{1}{2})=1 \ Ent(D) &= -(frac{1}{2} log_2 frac{1}{2}+frac{1}{2} log_2 frac{1}{2})=1\ Gani(D,敲声)&=Ent(D)-sum_{v=1}^1 frac{|D^v|}{|D|}Ent(D^v) \ &= 1 - (frac{2}{2} imes 0.918 ) \ &= 0 end{aligned} ]

    纹理 信息增益

    根据 纹理 划分为 清晰 稍糊 两个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{0}{1} log_2 frac{0}{1}+frac{1}{1} log_2 frac{1}{1})=0 \ Ent(D^2) &= -(frac{1}{1} log_2 frac{1}{1}+frac{0}{1} log_2 frac{0}{1})=0\ Ent(D)&= -(frac{1}{2} log_2 frac{1}{2}+frac{1}{2} log_2 frac{1}{2})=1 \ Gani(D,纹理)&=Ent(D)-sum_{v=1}^2 frac{|D^v|}{|D|}Ent(D^v) \ &= 1 - (frac{1}{2} imes 0+frac{1}{2} imes 0 ) \ &= 1 end{aligned} ]

    触感 信息增益

    根据触感划分为 软粘 一个子集

    计算信息熵

    [egin{aligned} Ent(D^1) &= -(frac{1}{2} log_2 frac{1}{2}+frac{1}{2} log_2 frac{1}{2})=1 \ Ent(D) &= -(frac{1}{2} log_2 frac{1}{2}+frac{1}{2} log_2 frac{1}{2})=1\ Gani(D,触感)&=Ent(D)-sum_{v=1}^1 frac{|D^v|}{|D|}Ent(D^v) \ &= 1 - (frac{2}{2} imes 0.918 ) \ &= 0 end{aligned} ]

    选择分类结点构建决策树

    [egin{aligned} Gain(D,敲声)=0 Gain(D,纹理)=1 Gain(D,触感)=0 end{aligned} ]

    选择纹理作为分类依据

    此时决策树为

    image-20211031215835835

    2 决策树后剪枝

    1 考虑结点(7,15)

    原分支(剪枝前),有三个样本被正确分类 验证集精度为 42.8%

    剪枝后的决策树

    image-20211101090712293

    此时验证集有四个样本被正确分类,精度为57.1%

    于是后剪枝策略决定剪枝,得到上图的决策树

    2 考虑结点(6,715)色泽=?

    由上图,决策树精度为57.1%

    剪去结点后的决策树为

    image-20211101091759701

    此时验证集有四个样本被正确分类,精度为57.1%

    与未剪枝时的精度相同,西瓜书中采用了不剪枝的策略。在这里我们不妨采用剪枝的策略,于是得到上图的决策树

    3 考虑结点(1,2,3,14)​​色泽=?

    在上图基础上来考虑剪去结点(1,2,3,14)色泽=? ,剪枝后的决策树为

    image-20211101092352485

    此时的决策树正确分类的样本5个,精度为71.4%

    根据后剪枝策略,进行剪枝,得到上图的决策树

    4考虑 (6,7,15,17)根蒂=?

    剪枝后的决策树为

    image-20211101092815594

    此时的决策树的精度仍然为71.4%

    与未剪枝时的精度相同,西瓜书中采用了不剪枝的策略。在这里我们不妨采用剪枝的策略,于是得到上图的决策树

    最终得到上图的决策树

  • 相关阅读:
    day5-装饰器的正确打开方式------------->懒
    python_day5--->递归函数,二分法查找
    day4_生成器yield
    day4_迭代器
    day4_装饰器
    day4_函数嵌套和闭包以及作用域
    this作用域,改变作用域的方法
    $.ajax传数据.php
    ul标签的引用
    URL
  • 原文地址:https://www.cnblogs.com/zuti666/p/15492141.html
Copyright © 2011-2022 走看看