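  • Experiment 4: Decision Tree Algorithms and Applications

    I. General Information

    Class: https://edu.cnblogs.com/campus/ahgc/machinelearning
    Requirements: https://edu.cnblogs.com/campus/ahgc/machinelearning/homework/12086
    Objective: understand the principles of decision tree algorithms and how to apply them
    Student ID: 3180701338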

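    II. Experiment Details

    [Purpose]

    1. Understand the principles of decision tree algorithms and master the overall algorithmic framework.
    2. Understand feature selection, tree generation, and tree pruning in decision tree learning.
    3. Choose an appropriate decision tree algorithm for a given data type.
    4. Apply decision tree algorithms to practical problems for a specific scenario and dataset.

    [Content]

    1. Implement entropy, empirical conditional entropy, and information gain.
    2. Implement the ID3 algorithm.
    3. Become familiar with the decision tree algorithms in the sklearn library.
    4. Use sklearn's decision tree to predict classes on the iris dataset.
    5. Use the self-written decision tree to predict classes on the iris dataset.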
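
    For reference, the three quantities named in item 1 can be written as below. The notation is an assumption (the report itself does not state the formulas): D is the training set, C_k is the subset of D belonging to class k, and D_i is the subset of D on which feature A takes its i-th value. The definitions match what calc_ent, cond_ent, and info_gain compute in the code of section III.

```latex
H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}

H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} \, H(D_i)

g(D, A) = H(D) - H(D \mid A)
```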

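    [Report Requirements]

    1. Following the experiment content, write up the procedure, the algorithms, and the test results.
    2. Keep the code clean: consistent naming and comments.
    3. Analyze the complexity of the core algorithm.
    4. Consult the literature and discuss the application scenarios of the ID3 and C4.5 algorithms.
    5. Consult the literature and analyze decision tree pruning strategies.

    III. Completed Work

    (1) Main code with partial comments: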

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    %matplotlib inline
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from collections import Counter
    import math
    from math import log
    import pprint
    
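    # Problem 5.1 from the textbook (loan application data).
    # Values: 青年/中年/老年 = young/middle-aged/old, 是/否 = yes/no,
    # 一般/好/非常好 = fair/good/very good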
    def create_data():
        datasets = [['青年', '否', '否', '一般', '否'],
                    ['青年', '否', '否', '好', '否'],
                    ['青年', '是', '否', '好', '是'],
                    ['青年', '是', '是', '一般', '是'],
                    ['青年', '否', '否', '一般', '否'],
                    ['中年', '否', '否', '一般', '否'],
                    ['中年', '否', '否', '好', '否'],
                    ['中年', '是', '是', '好', '是'],
                    ['中年', '否', '是', '非常好', '是'],
                    ['中年', '否', '是', '非常好', '是'],
                    ['老年', '否', '是', '非常好', '是'],
                    ['老年', '否', '是', '好', '是'],
                    ['老年', '是', '否', '好', '是'],
                    ['老年', '是', '否', '非常好', '是'],
                    ['老年', '否', '否', '一般', '否'],]
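        # Column names: age, has a job, owns a house, credit rating, class
        labels = [u'年龄', u'有工作', u'有自己的房子', u'信贷情况', u'类别']
        # Return the dataset and the name of each column
        return datasets, labels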
    
    datasets, labels = create_data()
    train_data = pd.DataFrame(datasets, columns=labels)
    train_data
    
    # Entropy of the class labels (last column of each row)
    def calc_ent(datasets):
        data_length = len(datasets)
        label_count = {}
        for i in range(data_length):
            label = datasets[i][-1]
            if label not in label_count:
                label_count[label] = 0
            label_count[label] += 1
        ent = -sum([(p / data_length) * log(p / data_length, 2)
                    for p in label_count.values()])
        return ent
    # def entropy(y):
    # """
    # Entropy of a label sequence
    # """
    # hist = np.bincount(y)
    # ps = hist / np.sum(hist)
    # return -np.sum([p * np.log2(p) for p in ps if p > 0])
    # Empirical conditional entropy H(D|A) for the feature in column `axis`
    def cond_ent(datasets, axis=0):
        data_length = len(datasets)
        feature_sets = {}
        for i in range(data_length):
            feature = datasets[i][axis]
            if feature not in feature_sets:
                feature_sets[feature] = []
            feature_sets[feature].append(datasets[i])
        # Weighted sum of the entropies of the subsets induced by the feature
        return sum([(len(p) / data_length) * calc_ent(p) for p in feature_sets.values()])
    # Information gain
    def info_gain(ent, cond_ent):
        return ent - cond_ent
    def info_gain_train(datasets):
        count = len(datasets[0]) - 1
        ent = calc_ent(datasets)
        # ent = entropy(datasets)

        best_feature = []
        for c in range(count):
            c_info_gain = info_gain(ent, cond_ent(datasets, axis=c))
            best_feature.append((c, c_info_gain))
            print('Feature ({}) - info_gain - {:.3f}'.format(labels[c], c_info_gain))
        # Pick the feature with the largest information gain
        best_ = max(best_feature, key=lambda x: x[-1])
        return 'Feature ({}) has the largest information gain; selected as the root node feature'.format(labels[best_[0]])
    
    info_gain_train(np.array(datasets))
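
    As a sanity check on the call above, the entropy of the loan table and the information gain of the home-ownership feature can be recomputed by hand from class counts. This is a minimal sketch, not part of the original notebook; the helper name entropy and the variable names are made up for illustration, and the counts are read directly off the table in create_data (9 "是" and 6 "否" overall; 6 owners, all "是"; 9 non-owners with 3 "是" and 6 "否").

```python
from math import log2

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Whole table: 9 positive and 6 negative examples.
h_d = entropy([9, 6])

# Split on home ownership: 6 owners (6 yes, 0 no), 9 non-owners (3 yes, 6 no).
h_d_given_house = 6 / 15 * entropy([6, 0]) + 9 / 15 * entropy([3, 6])
gain_house = h_d - h_d_given_house

print(round(h_d, 3))         # 0.971
print(round(gain_house, 3))  # 0.42
```

    The value printed by info_gain_train for the home-ownership column should agree with gain_house to three decimals.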
    
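    # Tree node. Each internal node has one child per value of its split feature,
    # so the tree is multi-way rather than binary.
    class Node:
        def __init__(self, root=True, label=None, feature_name=None, feature=None):
            self.root = root                  # True marks a leaf (single-node tree)
            self.label = label                # class label, set on leaves
            self.feature_name = feature_name  # name of the split feature
            self.feature = feature            # column index of the split feature
            self.tree = {}                    # feature value -> child Node
            self.result = {
                'label': self.label,
                'feature': self.feature,
                'tree': self.tree}
        def __repr__(self):
            return '{}'.format(self.result)
        def add_node(self, val, node):
            self.tree[val] = node
        def predict(self, features):
            if self.root is True:
                return self.label
            # Note: self.feature is a position within the table seen at this node
            return self.tree[features[self.feature]].predict(features)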
    
    class DTree:
        def __init__(self, epsilon=0.1):
            self.epsilon = epsilon
            self._tree = {}
        # Entropy
        @staticmethod
        def calc_ent(datasets):
            data_length = len(datasets)
            label_count = {}
            for i in range(data_length):
                label = datasets[i][-1]
                if label not in label_count:
                    label_count[label] = 0
                label_count[label] += 1
            ent = -sum([(p / data_length) * log(p / data_length, 2)
                        for p in label_count.values()])
            return ent
        # Empirical conditional entropy
        def cond_ent(self, datasets, axis=0):
            data_length = len(datasets)
            feature_sets = {}
            for i in range(data_length):
                feature = datasets[i][axis]
                if feature not in feature_sets:
                    feature_sets[feature] = []
                feature_sets[feature].append(datasets[i])
            cond_ent = sum([(len(p) / data_length) * self.calc_ent(p)
                            for p in feature_sets.values()])
            return cond_ent
        # Information gain
        @staticmethod
        def info_gain(ent, cond_ent):
            return ent - cond_ent
        def info_gain_train(self, datasets):
            count = len(datasets[0]) - 1
            ent = self.calc_ent(datasets)
            best_feature = []
            for c in range(count):
                c_info_gain = self.info_gain(ent, self.cond_ent(datasets, axis=c))
                best_feature.append((c, c_info_gain))
            # Compare only after every feature has been scored, then return the
            # (column index, information gain) pair with the largest gain
            best_ = max(best_feature, key=lambda x: x[-1])
            return best_
        def train(self, train_data):
            """
            input: dataset D (DataFrame), feature set A, threshold eta (self.epsilon)
            output: decision tree T
            """
            y_train = train_data.iloc[:, -1]
            features = train_data.columns[:-1]
            # 1. If all instances in D belong to one class Ck, T is a single-node
            #    tree labeled Ck; return T
            if len(y_train.value_counts()) == 1:
                return Node(root=True, label=y_train.iloc[0])
            # 2. If A is empty, T is a single-node tree labeled with the class Ck
            #    that has the most instances in D; return T
            if len(features) == 0:
                return Node(
                    root=True,
                    label=y_train.value_counts().sort_values(
                        ascending=False).index[0])
            # 3. Compute the information gains as in Problem 5.1; Ag is the
            #    feature with the largest gain
            max_feature, max_info_gain = self.info_gain_train(np.array(train_data))
            max_feature_name = features[max_feature]
            # 4. If the gain of Ag is below the threshold eta, T is a single-node
            #    tree labeled with the class Ck that has the most instances in D; return T
            if max_info_gain < self.epsilon:
                return Node(
                    root=True,
                    label=y_train.value_counts().sort_values(ascending=False).index[0])
            # 5. Split D into subsets according to the values of Ag
            node_tree = Node(
                root=False, feature_name=max_feature_name, feature=max_feature)
            feature_list = train_data[max_feature_name].value_counts().index
            for f in feature_list:
                sub_train_df = train_data.loc[train_data[max_feature_name] == f].drop([max_feature_name], axis=1)
                # 6. Build the subtrees recursively
                sub_tree = self.train(sub_train_df)
                node_tree.add_node(f, sub_tree)
            # pprint.pprint(node_tree.tree)
            return node_tree
        def fit(self, train_data):
            self._tree = self.train(train_data)
            return self._tree
        def predict(self, X_test):
            return self._tree.predict(X_test)
    
    datasets, labels = create_data()
    data_df = pd.DataFrame(datasets, columns=labels)
    dt = DTree()
    tree = dt.fit(data_df)
    tree
    
    dt.predict(['老年', '否', '否', '一般'])
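
    Node.predict looks up the test sample by column position, and each recursive call to train drops the column it has just split on, so positions stored deeper in the tree refer to the reduced table rather than to the original one. One way to sidestep this is to key the lookup by feature name. The sketch below is an illustration only, not the notebook's implementation: the nested-dict layout and the predict helper are hypothetical, and the tree is written out by hand in the shape ID3 is expected to produce on the loan table (split on 有自己的房子, then on 有工作).

```python
# Hypothetical nested-dict tree: internal nodes name the feature they test,
# leaves carry a class label.
tree = {
    "feature": "有自己的房子",
    "children": {
        "是": {"label": "是"},
        "否": {
            "feature": "有工作",
            "children": {"是": {"label": "是"}, "否": {"label": "否"}},
        },
    },
}

def predict(node, sample, default=None):
    """Walk the tree by feature name; return `default` for an unseen value."""
    while "label" not in node:
        value = sample[node["feature"]]
        if value not in node["children"]:
            return default
        node = node["children"][value]
    return node["label"]

sample = {"年龄": "老年", "有工作": "否", "有自己的房子": "否", "信贷情况": "一般"}
print(predict(tree, sample))  # 否
```

    Because the sample is a mapping from feature name to value, the order of the columns and the columns dropped during training no longer matter.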
    
    # data
    def create_data():
        iris = load_iris()
        df = pd.DataFrame(iris.data, columns=iris.feature_names)
        df['label'] = iris.target
        df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
        data = np.array(df.iloc[:100, [0, 1, -1]])
        # print(data)
        return data[:, :2], data[:, -1] 
    X, y = create_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.tree import export_graphviz
    import graphviz
    clf = DecisionTreeClassifier()
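    clf.fit(X_train, y_train)
    
    clf.score(X_test, y_test)
    
    # export_graphviz writes Graphviz DOT source (plain text), not a PDF,
    # so the output file gets a .dot extension
    export_graphviz(clf, out_file="mytree.dot")
    with open('mytree.dot') as f:
        dot_graph = f.read()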
    
    graphviz.Source(dot_graph)
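
    The split above is unseeded, so the score changes from run to run, and viewing the tree requires Graphviz. The following variation is a sketch under a few assumptions that are not in the original notebook: a fixed random_state for repeatability, criterion="entropy" so that splits are scored by information gain, an arbitrary max_depth=3 as a simple form of pre-pruning, and export_text to print the rules as plain text. Note that scikit-learn builds binary CART-style trees, so this is not ID3 even with the entropy criterion.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# Same subset as above: first 100 samples (two classes), sepal length and width.
X, y = iris.data[:100, :2], iris.target[:100]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# max_depth=3 is an illustrative limit, not a tuned value.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
rules = export_text(clf, feature_names=["sepal length", "sepal width"])
print(accuracy)
print(rules)
```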
    

    (2) Screenshots of the results:

    IV. Summary

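    ID3 is one of the basic, classic algorithms for building decision trees. It is structurally simple, its theory is clear and easy to follow, and it learns effectively and flexibly. Its drawbacks are that it cannot handle continuous features directly, is not suited to incremental datasets, is slow on large datasets, and can overfit. ID3 is widely known and has received a great deal of attention, particularly in machine learning, knowledge discovery, and data mining.
    This experiment produced clearly visible results, which helped me better understand the core of ID3: information entropy. ID3 computes the information gain of every attribute, treats attributes with high information gain as good ones, and at each split selects the attribute with the highest gain as the splitting criterion. It repeats this process until the tree classifies the training examples perfectly or, in this implementation, until the best gain falls below the threshold epsilon.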
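
    The limitation with continuous features is usually worked around by turning a numeric feature into a binary test of the form value <= threshold, which is the approach taken by later algorithms such as C4.5 and CART. The sketch below shows the idea with information gain as the score. It is not part of the experiment code: the function names are hypothetical, the toy data is made up, and using midpoints between consecutive distinct values as candidate thresholds is an illustrative choice.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the split `value <= threshold`
    with the largest information gain, or (None, 0.0) if no split helps."""
    base = entropy(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2  # candidate threshold: midpoint of neighbors
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - cond
        if gain > best[1]:
            best = (t, gain)
    return best

# Made-up toy data: one continuous feature, two classes.
values = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0]
labels = [0, 0, 0, 1, 1, 1]
print(best_threshold(values, labels))  # (2.5, 1.0)
```

    A tree builder could call such a function for each numeric column of the iris data and then recurse on the two halves, in the same way the train method recurses on the subsets of a categorical feature.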

  • Original post: https://www.cnblogs.com/666888ZWL/p/14939023.html