  • My R Journey: Data Exploration

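    1. Inspecting the data

    First, we look at the size and structure of the iris data set. The functions dim() and names() return its dimensions and column names, respectively.

    The functions str() and attributes() return the structure and the attributes of the data.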

    > head(iris)
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1          5.1         3.5          1.4         0.2  setosa
    2          4.9         3.0          1.4         0.2  setosa
    3          4.7         3.2          1.3         0.2  setosa
    4          4.6         3.1          1.5         0.2  setosa
    5          5.0         3.6          1.4         0.2  setosa
    6          5.4         3.9          1.7         0.4  setosa
    > attach(iris)
    > dim(iris)
    [1] 150   5
    > names(iris)
    [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"
    [5] "Species"
    > str(iris)
    'data.frame':   150 obs. of  5 variables:
     $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
     $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
     $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
     $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
     $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
    > attributes(iris)
    $names
    [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"
    [5] "Species"

    $row.names
      [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
     [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
     [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
     [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
     [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
     [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
    [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
    [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
    [145] 145 146 147 148 149 150

    $class
    [1] "data.frame"

    > summary(iris)
      Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
     Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
     1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
     Median :5.800   Median :3.000   Median :4.350   Median :1.300
     Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
     3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
     Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
           Species
     setosa    :50
     versicolor:50
     virginica :50

    2. Univariate analysis

    > quantile(Sepal.Length)  # minimum, quartiles and maximum
      0%  25%  50%  75% 100%
     4.3  5.1  5.8  6.4  7.9
    > var(Sepal.Length)  # variance
    [1] 0.6856935

    > hist(Sepal.Length, col = 3, main = "Figure 1", sub = "Sepal.Length", ylab = "Frequency")  # main title, subtitle, y-axis label

    > plot(density(Sepal.Length), col = 3)  # kernel density estimate
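
    The histogram and the density estimate above can also share one panel. The following is a minimal sketch, assuming the histogram is switched to the density scale with freq = FALSE so that both use the same y-axis; the names x, h, d and total_area are illustrative and not from the original session.

```r
# Histogram on the density scale with a kernel density curve drawn on top.
x <- iris$Sepal.Length            # refer to the column directly, no attach() needed

h <- hist(x, freq = FALSE, col = 3,
          main = "Sepal.Length", xlab = "Sepal.Length", ylab = "Density")
d <- density(x)                   # kernel density estimate with default settings
lines(d, lwd = 2)                 # overlay the curve on the existing histogram

# On the density scale the bar areas add up to 1.
total_area <- sum(h$density * diff(h$breaks))
```

    Because both layers are densities, the curve and the bars are directly comparable, which is not the case when the histogram shows raw counts.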

    > table(Sepal.Length)  # frequency of each distinct value
    Sepal.Length
    4.3 4.4 4.5 4.6 4.7 4.8 4.9   5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9   6 6.1
      1   3   1   4   2   5   6  10   9   4   1   6   7   6   8   7   3   6   6
    6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9   7 7.1 7.2 7.3 7.4 7.6 7.7 7.9
      4   9   7   5   2   8   3   4   1   1   3   1   1   1   4   1

    > pie(table(Sepal.Length))
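
    With 35 distinct values, the pie chart above has 35 thin slices. One possible way to make it readable is to group the values into intervals first. This is a sketch only: the 1 cm interval width and the names bins and counts are assumptions chosen for illustration.

```r
# Group Sepal.Length into 1 cm wide intervals before tabulating.
x <- iris$Sepal.Length

# Breaks at 4, 5, 6, 7, 8 cover the observed range 4.3 to 7.9.
# include.lowest = TRUE closes the first interval on the left.
bins <- cut(x, breaks = seq(4, 8, by = 1), include.lowest = TRUE)

counts <- table(bins)             # one count per interval
pie(counts, main = "Sepal.Length by interval")
```

    A bar chart of the same table, barplot(counts), is usually easier to compare across intervals than a pie chart.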

    > boxplot(iris[, 1:4], col = 2:5)  # one box per numeric variable; the Species factor is left out
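
    A single box per variable hides the differences between species. As a minimal sketch, the formula interface of boxplot() can split one variable by the Species factor; the name bp is illustrative and not from the original.

```r
# One box per species for a single variable, using the formula interface.
bp <- boxplot(Sepal.Length ~ Species, data = iris,
              col = 2:4, ylab = "Sepal.Length")

# boxplot() invisibly returns the statistics it drew.
# Rows of bp$stats: lower whisker, lower hinge, median, upper hinge, upper whisker.
species_medians <- bp$stats[3, ]
names(species_medians) <- bp$names
```

    Keeping the returned object makes it possible to read off the medians and hinges without measuring them from the plot.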


    3. Multivariate exploration

    > cov(Sepal.Length, Sepal.Width)
    [1] -0.042434
    > cor(Sepal.Length,Sepal.Width)
    [1] -0.1175698
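
    The pooled correlation above is weak and negative, but the data contain three species. A natural follow-up, sketched here under the assumption that a per-species breakdown is of interest, is to compute the full correlation matrix and then the same correlation within each species. The names num, cor_all and cor_by_species are illustrative.

```r
# Correlation matrix of the four numeric columns.
num <- iris[, 1:4]
cor_all <- cor(num)               # 4 x 4 symmetric matrix

# Correlation of Sepal.Length and Sepal.Width computed separately per species.
cor_by_species <- sapply(
  split(iris, iris$Species),
  function(d) cor(d$Sepal.Length, d$Sepal.Width)
)

print(round(cor_all, 3))
print(round(cor_by_species, 3))
```

    In this data set the within-species correlations are positive even though the pooled value is negative, so the pooled figure mostly reflects differences between the species.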

    > plot(Sepal.Length, Sepal.Width, col = Species, pch = as.numeric(Species))  # colour and plotting symbol by species

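    When there are many points, some of them overlap in the plot. Wrapping each coordinate in jitter() adds a small amount of random noise, so that coincident points become visible.

    > plot(jitter(Sepal.Length), jitter(Sepal.Width), col = Species, pch = as.numeric(Species))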
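
    To see what jitter() changes, the number of exactly coincident points can be counted before and after. This is a small sketch; the seed and the names xy, xy_jit, n_overlap_raw, n_overlap_jit and max_shift are assumptions made for the example.

```r
# Count coincident (Sepal.Length, Sepal.Width) pairs before and after jitter().
set.seed(1)                       # jitter() is random; fix the seed for repeatability

xy <- iris[, c("Sepal.Length", "Sepal.Width")]
n_overlap_raw <- sum(duplicated(xy))        # rows that repeat an earlier pair

xy_jit <- data.frame(
  Sepal.Length = jitter(xy$Sepal.Length),
  Sepal.Width  = jitter(xy$Sepal.Width)
)
n_overlap_jit <- sum(duplicated(xy_jit))    # coincident pairs after adding noise

# Largest displacement applied to Sepal.Length by the default settings.
max_shift <- max(abs(xy_jit$Sepal.Length - xy$Sepal.Length))
```

    With the default arguments the noise is small relative to the 0.1 cm measurement step, so the overall pattern is preserved; the factor and amount arguments of jitter() control its size.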

    > pairs(iris)  # scatter-plot matrix

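    4. Further exploration

    Other useful tools for exploration include 3D scatter plots, level plots, contour plots, interactive plots, and parallel coordinates. 3D scatter plots and parallel coordinates are shown here.

    A 3D scatter plot can be produced with the scatterplot3d package.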

    install.packages("scatterplot3d")  # only needed once
    library(scatterplot3d)
    # with() looks the columns up in iris; each species gets its own plotting symbol
    with(iris, scatterplot3d(Sepal.Length, Sepal.Width, Petal.Length, pch = as.numeric(Species)))

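    Parallel coordinates give a good visualization of multidimensional data. They can be drawn with parcoord() from the MASS package and with parallelplot() from the lattice package.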

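    install.packages("MASS")  # only needed if MASS is not already installed
    library(MASS)
    # one colour per species (2, 3, 4); straight quotes are required around the title
    parcoord(iris[, 1:4], col = as.numeric(iris$Species) + 1, main = "Parallel coordinates")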

    install.packages("lattice")
    library(lattice)
    parallelplot(~iris[,1:5],data=iris)
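
    The idea behind a parallel-coordinate plot can be sketched in base R: rescale each variable to a common range, then draw one polyline per observation across the variables. This is an illustration of the technique under that assumption, not the implementation used by MASS or lattice; the names num, scaled and cols are hypothetical.

```r
# Parallel coordinates by hand: rescale each column, then one line per row.
num <- as.matrix(iris[, 1:4])

# Min-max rescaling puts every variable on the range 0 to 1.
scaled <- apply(num, 2, function(v) (v - min(v)) / (max(v) - min(v)))

# One colour per species, repeated for each observation.
cols <- c(2, 3, 4)[as.integer(iris$Species)]

# matplot() draws one line per column, so the matrix is transposed:
# after t(), each of the 150 columns is one flower and the x-axis is the variable index.
matplot(t(scaled), type = "l", lty = 1, col = cols,
        xaxt = "n", xlab = "", ylab = "Scaled value")
axis(1, at = 1:4, labels = colnames(num))
```

    Rescaling matters because the four measurements have different ranges; without it, the variable with the widest range would dominate the picture.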

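    The ggplot2 package supports more complex graphics and is very useful for exploring data. The iris data are used again below; more ggplot2 examples can be found at http://had.co.nz/ggplot2/: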

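    install.packages("ggplot2")  # only needed once
    library(ggplot2)
    # qplot() is deprecated as of ggplot2 3.4.0, so the equivalent ggplot() call is used
    ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
      geom_point(colour = "red") +
      facet_grid(Species ~ .)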

  • Original article: https://www.cnblogs.com/alsely/p/6711392.html