  • Feature Quality

    Reposted; I could not find the original source, so I am recording it here:

    Select Inputs
    Here the focus is on the quality of your data, specifically the quality of each column of data. You may want to consider discarding data columns (Attributes) that provide less value.
    How do you know which Attributes are valuable, and which are worthless? A key point is that you're looking for patterns. Without some variation in the data and some discernible patterns, the data is not likely to be useful. A quick summary of things to look out for (more details below) includes:
    (C) Columns that too closely mirror the target column,
    (I) Columns where nearly all values are different,
    (S) Columns where nearly all values are identical,
    (M) Columns with missing values.
    To help you make a decision, we indicate how useful each Attribute is with a color-coded status bubble (red / yellow / green). Details are provided by the quality bars (C / I / S / M). As a general rule, it is a good idea to deselect at least those Attributes that have a red status bubble. The input for the machine learning model will only include the selected Attributes.

    You can deselect Attributes by clicking on them individually. Or you can deselect a group of Attributes by clicking the buttons marked Deselect Red or Deselect Yellow at the top of the screen.
    First Time?
    For example, several columns in the Titanic data are problematic and should be removed. The Attributes "Name" and "Ticket Number" are unique to each passenger; they are equivalent to IDs, and machine learning cannot learn anything from them. Those Attributes have a large blue bar for ID-ness (I) and - consequently - a red status bubble. The "Cabin" information is missing (M) in most cases (red bar), and it should also be removed.
    "Lifeboat" is the only Attribute with a yellow status bubble. It has a very high correlation (C) with our target Attribute, "Survived". While a high correlation is sometimes desirable, it is problematic in this case. The machine learning model will quickly discover that a person survived because they made it to a lifeboat, but you know that already! "Lifeboat" and "Survived" are effectively synonyms, so it is better to remove the "Lifeboat" Attribute and let the model discover the underlying reasons for survival.
    In summary, you should remove all those Attributes with a red status bubble from the data. And in this case, you should remove the Attribute with a yellow status bubble, as well. You can deselect them manually, or by clicking Deselect Red and Deselect Yellow. Then click on Next.
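    The effect of Deselect Red and Deselect Yellow on the Titanic example can be sketched in a few lines of Python. This is only an illustration, not RapidMiner's implementation: the function name `deselect` is hypothetical, and the green Attributes "Age" and "Sex" are assumed for the sake of the example (the text above names only the red and yellow ones).

```python
# Status per Attribute as described in the Titanic walkthrough.
# ASSUMPTION: "Age" and "Sex" are illustrative green Attributes,
# they are not named in the text.
STATUS = {
    "Name": "red",           # ID-like
    "Ticket Number": "red",  # ID-like
    "Cabin": "red",          # mostly missing
    "Lifeboat": "yellow",    # too highly correlated with "Survived"
    "Age": "green",
    "Sex": "green",
}


def deselect(status_by_attribute, colors):
    """Return the Attributes that stay selected after dropping the given colors."""
    return [name for name, color in status_by_attribute.items()
            if color not in colors]


# Equivalent to clicking Deselect Red and then Deselect Yellow.
selected = deselect(STATUS, {"red", "yellow"})
print(selected)
```

    Only the Attributes left in `selected` would be passed on to the model.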
    Background
    In the "Background" section, we provide optional additional information about machine learning and about RapidMiner.
    Status
    The colored status bubble provides a quality indicator for a data column.
    Red: A red bubble indicates a column of poor quality, which in most cases you should remove from the data set. Red can indicate one of the following problems:
    More than 70% of all values in this column are missing,
    The column is practically an ID with (almost) as many different values as you have rows in your data set, or
    The column is practically constant, with more than 90% of all values being the same (stable).
    Yellow: A yellow bubble indicates a column which has either a very low or a very high correlation with the target column. It can only appear if the task is "Predict".
    Low Correlation: a correlation of less than 0.01% indicates that this column is not likely to contribute to the predictions. While keeping such a column is not problematic, removing it may speed up the model building.
    High Correlation: a correlation of more than 50% may be an indicator for information you don't have at prediction time. In that case, you should remove this column. Sometimes, however, the prediction problem is simple, and you will get a better model when the column is included. Only you can decide.
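    The red and yellow rules above can be written down as a small function. The sketch below is an assumption-laden reading of the text, not RapidMiner's code: the name `status_bubble` is hypothetical, the ID-ness cutoff of 0.99 is a guess (the text only says "(almost) as many different values as you have rows"), and the correlation is compared by absolute value, which the text does not state.

```python
MISSING_LIMIT = 0.70       # from the text: more than 70% missing
STABILITY_LIMIT = 0.90     # from the text: more than 90% identical values
ID_NESS_LIMIT = 0.99       # ASSUMPTION: the text gives no exact cutoff
LOW_CORRELATION = 0.0001   # 0.01% expressed as a fraction
HIGH_CORRELATION = 0.50    # 50% expressed as a fraction


def status_bubble(missing, id_ness, stability, correlation=None):
    """Classify one column as 'red', 'yellow', or 'green'.

    All measures are fractions in [0, 1]. Pass correlation=None when the
    task is not "Predict"; a yellow status is then impossible.
    """
    if (missing > MISSING_LIMIT
            or id_ness >= ID_NESS_LIMIT
            or stability > STABILITY_LIMIT):
        return "red"
    if correlation is not None:
        strength = abs(correlation)  # ASSUMPTION: the sign is ignored
        if strength < LOW_CORRELATION or strength > HIGH_CORRELATION:
            return "yellow"
    return "green"


print(status_bubble(missing=0.8, id_ness=0.1, stability=0.2))
print(status_bubble(missing=0.0, id_ness=0.1, stability=0.5, correlation=0.9))
```

    Red is checked first, so a column that is both mostly missing and highly correlated is still reported as red.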
    Quality Bars
    The color of the status bubble is based on the following quality measures, displayed as bars together with each Attribute:
    Correlation (C): measures the linear correlation between the data column and the target column. This quality bar is only available when the task is "Predict".
    ID-ness (I): measures the degree to which this Attribute resembles an ID. It is the number of different values for the Attribute divided by the number of data rows.
    Stability (S): measures how stable or constant this column is. It is the number of rows with the most frequent non-missing value divided by the total number of data rows with non-missing values.
    Missing (M): the number of missing values in this column as a fraction of the total number of data rows.
    In general, you should prefer Attributes with low values for Missing, Stability, and ID-ness. Columns with high Correlation are typically preferred, but not if the high correlation occurs because of a direct cause-and-effect relationship with the data you want to predict.
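    The four quality measures defined above can be computed with the Python standard library alone. The following is a minimal sketch under stated assumptions, not RapidMiner's implementation: the names `quality_bars` and `pearson` are hypothetical, missing values are represented as `None`, missing values are not counted as a "different value" for ID-ness, the column is assumed to be non-empty, and the correlation is a plain Pearson coefficient that only works for numeric columns.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson linear correlation; None if either side has no variation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def quality_bars(column, target=None):
    """Return Missing, ID-ness, Stability and (optionally) Correlation."""
    rows = len(column)  # ASSUMPTION: the column has at least one row
    present = [v for v in column if v is not None]
    bars = {
        # Missing values as a fraction of all rows.
        "missing": (rows - len(present)) / rows,
        # Different (non-missing) values divided by the number of rows.
        "id_ness": len(set(present)) / rows,
        # Most frequent non-missing value divided by the non-missing rows.
        "stability": (Counter(present).most_common(1)[0][1] / len(present)
                      if present else 0.0),
    }
    if target is not None:
        # Only rows where both the column and the target are present.
        pairs = [(x, y) for x, y in zip(column, target)
                 if x is not None and y is not None]
        bars["correlation"] = (
            pearson([x for x, _ in pairs], [y for _, y in pairs])
            if pairs else None
        )
    return bars


print(quality_bars(["a", "a", "b", None]))
print(quality_bars([1, 2, 3, 4], target=[2, 4, 6, 8]))
```

    The first call covers a categorical column without a target, so no Correlation is reported, which mirrors the note that this bar is only available for the "Predict" task.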

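    The algorithm comes from RapidMiner, which defines the quality status of a feature:

    1. Red: very poor quality; simply delete the column
      a. More than 70% of the values are missing
      b. Only one value (constant)
      c. ID-ness: every value differs from all the others
      d. More than 90% of the values are identical (stable, i.e. stability)

    2. Yellow: can be left to the user; only show a hint
      a. Correlation below 0.01%
      b. Correlation above 50%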

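    The strategy above yields three indicators, each on a 0-1 scale:

    1. Missing-value rate: the lower the better; [0, 0.7) is acceptable
    2. Distinct-value rate: must not be too high, and there must be more than one value; (0, 0.9) is acceptable
    3. Correlation: must be neither too high nor too low; (0.0001, 0.5) is acceptable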
  • Original post: https://www.cnblogs.com/oaks/p/13749742.html