  • Python for Data Science

    Chapter 6 - Other Popular Machine Learning Methods

    Segment 6 - Ensemble methods with random forest

    Ensemble Models

    Ensemble models are machine learning methods that combine several base models to produce one optimal predictive model.

    They combine decisions from multiple models to improve the overall performance.

    How Ensemble Learning Works

    • Ensemble learning involves creating a collection (or "ensemble") of multiple algorithms in order to generate a single model that is more powerful and reliable than any of its component parts.

    The Ensemble Can Consist Of

    The same algorithm used more than once

    • Random forest is an ensemble of decision trees

    Many different types of algorithms aggregated together

    Types of Ensemble Methods

    • Max voting
    • Averaging
    • Weighted averaging
    • Bagging
    • Boosting

    Majority Voting Method

    The majority voting method picks the result that receives the majority of votes from the different models.

    This method is generally used in classification problems.
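    To make this concrete, here is a minimal hard-voting sketch. It is an illustration rather than part of the original lesson: it assumes scikit-learn is available and uses its VotingClassifier, whose voting='hard' option implements exactly this majority vote.

    # Hard voting: each base model casts one vote per sample,
    # and the majority class label wins.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    voter = VotingClassifier(
        estimators=[('lr', LogisticRegression(max_iter=1000)),
                    ('knn', KNeighborsClassifier()),
                    ('dt', DecisionTreeClassifier(random_state=0))],
        voting='hard')  # pick the label that receives the most votes
    voter.fit(X, y)
    print(voter.predict(X[:5]))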

    Averaging Method

    The averaging method involves running multiple models and then averaging their predictions. It can be used for both classification (by averaging the predicted class probabilities) and regression problems. Weighted averaging is a variation in which each model's prediction is multiplied by a weight reflecting that model's importance before averaging.
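    A minimal soft-voting sketch (again an illustration assuming scikit-learn, not part of the original lesson): voting='soft' averages the models' predicted class probabilities, and the optional weights argument turns plain averaging into weighted averaging.

    # Soft voting: average the class probabilities from each model.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    averager = VotingClassifier(
        estimators=[('lr', LogisticRegression(max_iter=1000)),
                    ('dt', DecisionTreeClassifier(random_state=0))],
        voting='soft',   # average predict_proba outputs
        weights=[2, 1])  # optional: weighted averaging
    averager.fit(X, y)
    print(averager.predict_proba(X[:2]))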

    Bagging Method

    The bagging (bootstrap aggregating) method trains the same model on multiple random subsets of the data, drawn with replacement, and combines their results to get a final prediction.

    Decision trees are used frequently with bagging.

    Process overview: create random subsets of the original data, run a model on each subset in parallel, and aggregate the results.
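    A minimal bagging sketch, assuming scikit-learn's BaggingClassifier (an illustration, not code from the lesson): each tree is fit on a bootstrap sample of the rows, and the trees' predictions are aggregated.

    # Bagging: train copies of one base model on random bootstrap
    # subsets (conceptually in parallel) and aggregate their votes.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    bagger = BaggingClassifier(
        DecisionTreeClassifier(),  # base model run on each subset
        n_estimators=50,           # number of bootstrap subsets
        random_state=0)
    bagger.fit(X, y)
    print(bagger.score(X, y))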

    Boosting Method

    The boosting method combines multiple models sequentially: each new model is trained to correct the errors made by the previous ones, and the results are combined into a final prediction.

    Process Overview: create subsets of the original data and run models on them one after another, each learning from the errors of its predecessor (a sketch follows the steps below)

    1. Create a subset of the data.
    2. Run a model on the subset of data and get the predictions.
    3. Calculate the errors on these predictions.
    4. Assign higher weights to the incorrectly predicted observations.
    5. Create another model that tries to correct the errors of the previous one, and create the next subset of data.
    6. The cycle repeats until a strong learner is created.
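    A minimal boosting sketch, assuming scikit-learn's AdaBoostClassifier, which implements the reweighting cycle above using shallow decision "stumps" as weak learners (an illustration, not code from the lesson).

    # AdaBoost: fit weak learners sequentially; each round increases
    # the weights of the samples the previous learner got wrong.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    booster = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1),  # one-split "stump"
        n_estimators=100,
        random_state=0)
    booster.fit(X, y)
    print(booster.score(X, y))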

    Random Forest

    Random forest is an ensemble model that follows the bagging method, adding random feature selection at each split.

    This model uses decision trees to form ensembles.

    This approach is useful for both classification and regression problems.

    Random Forests - How It Works

    • When predicting a new value for a target feature, each tree uses either regression or classification to come up with a value that serves as a "vote"
    • The random forest algorithm then aggregates the votes from all the trees in the ensemble (by averaging or majority vote)
    • This aggregate is the predicted value of the target feature for the observation in question

    Random Forest Process

    1. Create a random subset (bootstrap sample) from the original data.
    2. For each subset of data, create a separate model (a "base learner").
    3. At each node in each decision tree, randomly select a set of features.
    4. Decide the best split among only those features.
    5. Compute the final prediction by aggregating the predictions from all the individual models; the sketch below maps these steps onto scikit-learn parameters.
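    In scikit-learn these steps correspond to constructor parameters; the parameter values below are illustrative assumptions, not part of the lesson.

    # Step 1 maps to bootstrap=True (random row subsets); step 3 maps
    # to max_features (random feature subset considered at each node).
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)

    rf = RandomForestClassifier(
        n_estimators=100,     # number of base learners (trees)
        max_features='sqrt',  # features considered at each split
        bootstrap=True,       # each tree sees a bootstrap sample
        random_state=0)
    rf.fit(X, y)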

    Random Forest - Advantages

    • Easy to understand
    • Useful for data exploration
    • Reduced data cleaning (scaling not required)
    • Handle multiple data types
    • Highly flexible and gives good accuracy
    • Works well on large datasets
    • Overfitting is reduced by averaging across many trees

    Random Forest - Disadvantages

    • Can still overfit, particularly on noisy data
    • Less precise for continuous target variables, since regression predictions are averages of the trees' outputs
    • Does not work well with sparse datasets
    • Computationally expensive
    • Far less interpretable than a single decision tree

    This is a classification problem, in which we will estimate the species labels of iris flowers.

    import numpy as np
    import pandas as pd
    
    import sklearn.datasets as datasets
    from sklearn.model_selection import train_test_split 
    from sklearn import metrics
    
    from sklearn.ensemble import RandomForestClassifier
    
    iris = datasets.load_iris()
    
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    y = pd.DataFrame(iris.target)
    
    y.columns = ['labels']
    
    print(df.head())
    y[0:5]
    
       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
    0                5.1               3.5                1.4               0.2
    1                4.9               3.0                1.4               0.2
    2                4.7               3.2                1.3               0.2
    3                4.6               3.1                1.5               0.2
    4                5.0               3.6                1.4               0.2
    
       labels
    0       0
    1       0
    2       0
    3       0
    4       0

    The data set contains information on the following:

    • sepal length (cm)
    • sepal width (cm)
    • petal length (cm)
    • petal width (cm)
    • species type

    df.isnull().any()
    
    sepal length (cm)    False
    sepal width (cm)     False
    petal length (cm)    False
    petal width (cm)     False
    dtype: bool
    
    print(y.labels.value_counts())
    
    2    50
    1    50
    0    50
    Name: labels, dtype: int64
    

    Preparing the data for training the model

    X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=.2, random_state=17)
    

    Build a Random Forest model

    # 200 trees in the forest; fixed random_state for reproducibility
    classifier = RandomForestClassifier(n_estimators=200, random_state=0)

    # flatten the one-column DataFrame to the 1-D array that fit() expects
    y_train_array = np.ravel(y_train)

    classifier.fit(X_train, y_train_array)

    y_pred = classifier.predict(X_test)
    

    Evaluating the model on the test data

    print(metrics.classification_report(y_test, y_pred))
    
                  precision    recall  f1-score   support
    
               0       1.00      1.00      1.00         7
               1       0.92      1.00      0.96        11
               2       1.00      0.92      0.96        12
    
        accuracy                           0.97        30
       macro avg       0.97      0.97      0.97        30
    weighted avg       0.97      0.97      0.97        30
    
    y_test_array = np.ravel(y_test)
    print(y_test_array)
    
    [0 1 2 1 2 2 1 2 1 2 2 0 1 0 2 0 0 2 2 2 2 0 2 1 1 1 1 1 0 1]
    
    print(y_pred)
    
    [0 1 2 1 2 2 1 2 1 2 2 0 1 0 2 0 0 2 2 2 1 0 2 1 1 1 1 1 0 1]
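    The two arrays differ on a single observation (index 20: actual class 2, predicted class 1), which matches the 97% accuracy reported above. As an optional follow-up, the fitted forest exposes feature_importances_, which shows how much each measurement contributed to the trees' splits; this short sketch reuses the classifier and df objects defined above.

    # Which measurements drive the forest's predictions?
    importances = pd.Series(classifier.feature_importances_, index=df.columns)
    print(importances.sort_values(ascending=False))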