  • UDA Machine Learning Basics: Causes of Error

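    1. Causes of model error

    (1) Bias: the model cannot represent the complexity of the underlying data.

    (2) Variance: the model is overly sensitive to the particular data it was trained on.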
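
    One common way to make these two sources precise is the bias-variance decomposition of the expected squared error. The sketch below assumes the targets follow y = f(x) + ε with zero-mean noise of variance σ², and that the expectation is taken over training sets and noise. The symbols f, \hat{f}, ε and σ² are notation introduced here, not part of the original post, and the third term (irreducible noise) goes beyond the two sources listed above.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```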

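    2. Error due to bias: accuracy and underfitting

    Even when there is enough data, a model that is not complex enough cannot capture the underlying relationship, and this produces error.

    The model then misrepresents the data systematically, which lowers its accuracy. This is called underfitting.

    Put simply, a model that is a poor match for the problem produces bias.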
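
    To make underfitting concrete, here is a minimal sketch using only the Python standard library. The quadratic toy data, the helper names (make_data, fit_line, r2) and the choice of R² as the score are all assumptions made for illustration; none of them come from the original exercise. A straight line is fitted to data that follows a curve, and it scores poorly on both the training set and the test set, which is the signature of high bias.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def make_data(n):
    """Noisy samples of y = x^2 on [-3, 3] (an assumed toy relationship)."""
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    ys = [x * x + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Closed-form least squares for y = a * x + b."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def r2(xs, ys, a, b):
    """Coefficient of determination of the line a * x + b on (xs, ys)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

train_x, train_y = make_data(500)
test_x, test_y = make_data(500)

# A straight line in x cannot represent the curve: low score on BOTH sets.
a, b = fit_line(train_x, train_y)
train_r2_line = r2(train_x, train_y, a, b)
test_r2_line = r2(test_x, test_y, a, b)

# Same fitting procedure, but fed x^2 as the feature: the model can now
# express the relationship, and the score on unseen data is high.
train_sq = [x * x for x in train_x]
test_sq = [x * x for x in test_x]
a2, b2 = fit_line(train_sq, train_y)
test_r2_squared = r2(test_sq, test_y, a2, b2)

print("line in x    train R^2: {:.3f}  test R^2: {:.3f}".format(train_r2_line, test_r2_line))
print("line in x^2  test R^2:  {:.3f}".format(test_r2_squared))
```

    Adding more training points would not rescue the first model, because the limitation is in what it can express rather than in how much data it has seen.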

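    3. Error due to variance: precision and overfitting

    When training a model we usually work with a finite dataset, even if it is a fairly large one. If the model is trained repeatedly on randomly chosen subsets of that data, its predictions can be expected to differ depending on which subset it was given. Variance measures how much the predictions change across these different training subsets. Some variance is normal, but high variance means the model cannot generalize its predictions to more data. Being overly sensitive to the training set is called overfitting. High variance shows up as very good results on the training set and poor results on the test set.

    Training on more data usually reduces the variance of the model's predictions and improves them. If more data is not available, reducing the complexity of the model also reduces variance.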
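
    The effect of both remedies can be measured directly by retraining on random subsets and looking at how much the prediction at one fixed point moves. The sketch below is an illustration under stated assumptions, using only the standard library: the sine-plus-noise data, the piecewise-constant "bin" model (a crude stand-in for a regression tree), the query point x0 and all helper names are hypothetical choices, not part of the original exercise. The number of bins plays the role of model complexity.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

def noisy_point():
    """One sample of y = sin(x) + Gaussian noise with x in [0, 6]."""
    x = rng.uniform(0.0, 6.0)
    return x, math.sin(x) + rng.gauss(0.0, 0.5)

# The finite dataset from which random training subsets are drawn.
pool = [noisy_point() for _ in range(20000)]

def bin_predict(train, x0, n_bins):
    """Piecewise-constant model: mean target of the training points in x0's bin."""
    width = 6.0 / n_bins
    ys = [y for x, y in train if int(x / width) == int(x0 / width)]
    if not ys:  # empty bin: fall back to the mean of all training targets
        ys = [y for _, y in train]
    return sum(ys) / len(ys)

def prediction_variance(n_bins, subset_size, x0=2.05, repeats=200):
    """Variance of the prediction at x0 across models trained on random subsets."""
    preds = [bin_predict(rng.sample(pool, subset_size), x0, n_bins)
             for _ in range(repeats)]
    return statistics.pvariance(preds)

var_flexible = prediction_variance(n_bins=30, subset_size=100)    # complex model, little data
var_simpler = prediction_variance(n_bins=3, subset_size=100)      # lower complexity
var_more_data = prediction_variance(n_bins=30, subset_size=1000)  # same model, more data

print("30 bins,  100 points: {:.4f}".format(var_flexible))
print(" 3 bins,  100 points: {:.4f}".format(var_simpler))
print("30 bins, 1000 points: {:.4f}".format(var_more_data))
```

    Note that the 3-bin model buys its lower variance with extra bias, since it averages the targets over a wide interval. That is the trade-off described in section 2.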

    # In this exercise we'll examine a learner which has high variance, and tries to learn
    # nonexistent patterns in the data.
    # Use the learning_curve function from sklearn.model_selection to plot learning curves
    # of both training and testing scores.
    # The code to fill in is the learning_curve call inside plot_curve().
    
    from sklearn.tree import DecisionTreeRegressor
    import matplotlib.pyplot as plt
    # NOTE: sklearn.learning_curve and sklearn.cross_validation (sklearn 0.17) were removed
    # in sklearn 0.20; since 0.18 both names live in sklearn.model_selection.
    from sklearn.model_selection import KFold, learning_curve
    from sklearn.metrics import explained_variance_score, make_scorer
    import numpy as np
    
    # Set the learning curve parameters; you'll need this for learning_curves
    size = 1000
    # KFold no longer takes the sample count; n_splits=3 matches the old default of 3 folds
    cv = KFold(n_splits=3, shuffle=True)
    score = make_scorer(explained_variance_score)
    
    # Create a series of data that forces a learner to have high variance
    X = np.round(np.reshape(np.random.normal(scale=5,size=2*size),(-1,2)),2)
    y = np.array([[np.sin(x[0]+np.sin(x[1]))] for x in X])
    
    def plot_curve():
        # Defining our regression algorithm
        reg = DecisionTreeRegressor()
        # Fit our model using X and y
        reg.fit(X,y)
        print("Regressor score: {:.4f}".format(reg.score(X, y)))
        
        # TODO: Use learning_curve imported above to create learning curves for both the
        #       training data and testing data. You'll need reg, X, y, cv and score from above.
        
        train_sizes, train_scores, test_scores = learning_curve(reg,X,y,cv=cv,scoring=score)
        
        # Taking the mean of the test and training scores
        train_scores_mean = np.mean(train_scores,axis=1)
        test_scores_mean = np.mean(test_scores,axis=1)
        
        # Plotting the training curves and the testing curves using train_scores_mean and test_scores_mean
        plt.plot(train_sizes, train_scores_mean, '-o', color='b', label="train_scores_mean")
        plt.plot(train_sizes, test_scores_mean, '-o', color='r', label="test_scores_mean")
        
        # Plot aesthetics
        plt.ylim(-0.1, 1.1)
        plt.ylabel("Curve Score")
        plt.xlabel("Training Points")
        plt.legend(bbox_to_anchor=(1.1, 1.1))
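        plt.show()
    
    # The original exercise relied on an external grader to call plot_curve();
    # call it here so the script also produces the plot when run on its own.
    if __name__ == "__main__":
        plot_curve()
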

  • Original post: https://www.cnblogs.com/fuhang/p/8515614.html