拓端数据tecdat|R语言多元逐步回归模型分析房价和葡萄酒价格：选择最合适的预测变量

zoukankan html css js c++ java

拓端数据tecdat|R语言多元逐步回归模型分析房价和葡萄酒价格：选择最合适的预测变量
原文链接：http://tecdat.cn/?p=19405

包含更多的预测变量不是免费的：在系数估算的更多可变性，更难的解释以及可能包含高度依赖的预测变量方面要付出代价。确实，对于样本大小，在线性模型中可以考虑的预测变量最大数量为 p 。或等效地，使用预测变量p 拟合模型需要最小样本量。

如果我们考虑p = 1 和 p = 2 的几何，这一事实的解释很简单：
- 如果p = 1，则至少需要n = 2个点才能唯一地拟合一条线。但是，这条线没有给出关于其周围变化的信息，因此无法估计。因此，我们至少需要个点，换句话说就是。
- 如果p = 2 ，则至少需要n = 3个点才能唯一地拟合平面。但是同样，该平面没有提供有关其周围数据变化的信息，因此无法估计。因此，我们需要。
下一部分代码的输出阐明了和之间的区别。
2. # 数据：n个观测值，p = n-1个预测变量
4. n <- 5
5. p <- n - 1
6. df <- data.frame(y = rnorm(n), x = matrix(rnorm(n * p), nrow = n, ncol = p))
8. # 情况p = n-1 = 2：可以估计beta，但不能估计sigma ^ 2（因此，不能执行推断，因为它需要估计的sigma ^ 2）
11. summary(lm(y ~ ., data = df))
13. # 情况p = n-2 = 1：可以估计beta和sigma ^ 2（因此可以进行推断）
15. summary(lm(y ~ . - x.1, data = df))
当减小时，自由度量化的变异性的增加。

既然我们已经更多地了解了预测变量过多的问题，我们将重点放在为多元回归模型选择最合适的预测变量上。如果没有独特的解决方案，这将是一项艰巨的任务。但是，有一个行之有效的程序通常会产生良好的结果：逐步模型选择。其原理是依次比较具有不同预测变量的多个线性回归模型。

在介绍该方法之前，我们需要了解什么是信息准则。信息标准在模型的适用性与采用的预测变量数量之间取得平衡。两个常见标准是 贝叶斯信息标准 （BIC）和 赤池信息标准 （AIC）。两者都基于模型适用性和复杂性之间的平衡：

其中是模型的对 数似然度 （模型拟合数据的程度），而是考虑的参数数量在模型中，对于具有p个预测变量的多元线性回归模型，则为p + 2。AIC在用替换了，因此，与BIC相比，它对较复杂的模型的处罚较少。这就是为什么一些从业者更喜欢BIC进行模型比较的原因之一。BIC和AIC可以通过BIC 和计算 AIC。

我们使用地区房价数据，变量介绍：

（1）town：每一个人口普查区所在的城镇

（2）LON: 人口普查区中心的经度

（3）LAT: 人口普查区中心的纬度

（4）MEDV: 每一个人口普查区所对应的房子价值的中位数 (单位为$1000)

（5）CRIM: 人均犯罪率

（6）ZN: 土地中有多少是地区是大量住宅物业

（7）INDUS: 区域中用作工业用途的土地占比

（8）CHAS: 1：该人口普查区紧邻查尔斯河；0：该人口普查区没有紧邻查尔斯河

（9）NOX: 空气中氮氧化物的集中度 (衡量空气污染的指标)

（10）RM: 每个房子的平均房间数目

（11）AGE: 建于1940年以前的房子的比例

（12）DIS: 该人口普查区距离波士顿市中心的距离

（13）RAD: 距离重要高速路的远近程度 (1代表最近；24代表最远)

（14）TAX: 房子每$10,000价值所对应的税收金额

（15）PTRATIO: 该城镇学生与老师的比例

他们将作为模型输入。
2. # 具有不同预测变量的两个模型
4. mod1 <- lm(medv ~ age + crim, data = Boston)
5. mod2 <- lm(medv ~ age + crim + lstat, data = Boston)
7. # BICs
8. BIC(mod1)
9. ## [1] 3581.893
10. BIC(mod2) # 较小->较好
11. ## [1] 3300.841
13. # AICs
14. AIC(mod1)
15. ## [1] 3564.987
16. AIC(mod2) # 较小->较好
17. ## [1] 3279.708
19. # 检查摘要
22. ##
23. ## Residuals:
24. ## Min 1Q Median 3Q Max
25. ## -13.940 -4.991 -2.420 2.110 32.033
26. ##
27. ## Coefficients:
28. ## Estimate Std. Error t value Pr(>|t|)
29. ## (Intercept) 29.80067 0.97078 30.698 < 2e-16 ***
30. ## age -0.08955 0.01378 -6.499 1.95e-10 ***
31. ## crim -0.31182 0.04510 -6.914 1.43e-11 ***
32. ## ---
33. ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
34. ##
35. ## Residual standard error: 8.157 on 503 degrees of freedom
36. ## Multiple R-squared: 0.2166, Adjusted R-squared: 0.2134
37. ## F-statistic: 69.52 on 2 and 503 DF, p-value: < 2.2e-16
38. summary(mod2)
39. ##
40. ## Call:
41. ## lm(formula = medv ~ age + crim + lstat, data = Boston)
42. ##
43. ## Residuals:
44. ## Min 1Q Median 3Q Max
45. ## -16.133 -3.848 -1.380 1.970 23.644
46. ##
47. ## Coefficients:
48. ## Estimate Std. Error t value Pr(>|t|)
49. ## (Intercept) 32.82804 0.74774 43.903 < 2e-16 ***
50. ## age 0.03765 0.01225 3.074 0.00223 **
51. ## crim -0.08262 0.03594 -2.299 0.02193 *
52. ## lstat -0.99409 0.05075 -19.587 < 2e-16 ***
53. ## ---
54. ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
55. ##
56. ## Residual standard error: 6.147 on 502 degrees of freedom
57. ## Multiple R-squared: 0.5559, Adjusted R-squared: 0.5533
58. ## F-statistic: 209.5 on 3 and 502 DF, p-value: < 2.2e-16
让我们回到预测变量的选择。如果我们有p个预测变量，那么一个简单的过程就是检查所有可用它们构建的可能模型，然后根据BIC / AIC选择最佳模型。这就是所谓的 最佳子集选择。问题在于存在个可能的模型！
让我们看看如何研究 wine 数据集，将使用所有可用预测变量的数据作为初始模型。

波尔多是法国的葡萄酒产区。尽管这种酒的生产方式几乎相同，但已有数百年历史，但每年的价格和质量差异有时非常显着。人们普遍认为波尔多葡萄酒陈年越老越好，因此有动力去储存葡萄酒直至成熟。主要问题在于，仅通过品尝就很难确定葡萄酒的质量，因为在实际饮用时，味道会发生很大变化。这就是为什么葡萄酒品尝师和专家会有所帮助的原因。他们品尝葡萄酒，然后预测以后将是最好的葡萄酒。
1990年3月4日，《纽约时报》宣布普林斯顿大学经济学教授奥利·阿森费尔特（Orley Ashenfelter）可以预测波尔多葡萄酒的质量而无需品尝一滴。 Ashenfelter使用了一种称为线性回归的方法。该方法预测结果变量或因变量。作为自变量，他使用了酒的年份（因此，较老的酒会更昂贵）和与天气有关的信息，特别是平均生长季节温度，收成雨和冬雨。

stepAIC 将参数 k 设为2 （默认值）或，其中n是样本大小。k = 2 它采用AIC准则， k = log(n) 它采用BIC准则。
2. # 完整模型
5. # 用 BIC
6. ## Start: AIC=-53.29
7. ## Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop
8. ##
9. ##
10. ## Step: AIC=-53.29
11. ## Price ~ Year + WinterRain + AGST + HarvestRain + FrancePop
12. ##
13. ## Df Sum of Sq RSS AIC
14. ## - FrancePop 1 0.0026 1.8058 -56.551
15. ## - Year 1 0.0048 1.8080 -56.519
16. ## <none> 1.8032 -53.295
17. ## - WinterRain 1 0.4585 2.2617 -50.473
18. ## - HarvestRain 1 1.8063 3.6095 -37.852
19. ## - AGST 1 3.3756 5.1788 -28.105
20. ##
21. ## Step: AIC=-56.55
22. ## Price ~ Year + WinterRain + AGST + HarvestRain
23. ##
24. ## Df Sum of Sq RSS AIC
25. ## <none> 1.8058 -56.551
26. ## - WinterRain 1 0.4809 2.2867 -53.473
27. ## - Year 1 0.9089 2.7147 -48.840
28. ## - HarvestRain 1 1.8760 3.6818 -40.612
29. ## - AGST 1 3.4428 5.2486 -31.039
31. summary(modBIC)
32. ##
33. ## Call:
34. ## lm(formula = Price ~ Year + WinterRain + AGST + HarvestRain,
35. ## data = wine)
36. ##
37. ## Residuals:
38. ## Min 1Q Median 3Q Max
39. ## -0.46024 -0.23862 0.01347 0.18601 0.53443
40. ##
41. ## Coefficients:
42. ## Estimate Std. Error t value Pr(>|t|)
43. ## (Intercept) 43.6390418 14.6939240 2.970 0.00707 **
44. ## Year -0.0238480 0.0071667 -3.328 0.00305 **
45. ## WinterRain 0.0011667 0.0004820 2.420 0.02421 *
46. ## AGST 0.6163916 0.0951747 6.476 1.63e-06 ***
47. ## HarvestRain -0.0038606 0.0008075 -4.781 8.97e-05 ***
48. ## ---
49. ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
50. ##
51. ## Residual standard error: 0.2865 on 22 degrees of freedom
52. ## Multiple R-squared: 0.8275, Adjusted R-squared: 0.7962
53. ## F-statistic: 26.39 on 4 and 22 DF, p-value: 4.057e-08
55. # 用 AIC
56. ## Start: AIC=-61.07
57. ## Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop
58. ##
59. ##
60. ## Step: AIC=-61.07
61. ## Price ~ Year + WinterRain + AGST + HarvestRain + FrancePop
62. ##
63. ## Df Sum of Sq RSS AIC
64. ## - FrancePop 1 0.0026 1.8058 -63.031
65. ## - Year 1 0.0048 1.8080 -62.998
66. ## <none> 1.8032 -61.070
67. ## - WinterRain 1 0.4585 2.2617 -56.952
68. ## - HarvestRain 1 1.8063 3.6095 -44.331
69. ## - AGST 1 3.3756 5.1788 -34.584
70. ##
71. ## Step: AIC=-63.03
72. ## Price ~ Year + WinterRain + AGST + HarvestRain
73. ##
74. ## Df Sum of Sq RSS AIC
75. ## <none> 1.8058 -63.031
76. ## - WinterRain 1 0.4809 2.2867 -58.656
77. ## - Year 1 0.9089 2.7147 -54.023
78. ## - HarvestRain 1 1.8760 3.6818 -45.796
79. ## - AGST 1 3.4428 5.2486 -36.222
80. summary(modAIC)
81. ##
82. ## Call:
83. ## lm(formula = Price ~ Year + WinterRain + AGST + HarvestRain,
84. ## data = wine)
85. ##
86. ## Residuals:
87. ## Min 1Q Median 3Q Max
88. ## -0.46024 -0.23862 0.01347 0.18601 0.53443
89. ##
90. ## Coefficients:
91. ## Estimate Std. Error t value Pr(>|t|)
92. ## (Intercept) 43.6390418 14.6939240 2.970 0.00707 **
93. ## Year -0.0238480 0.0071667 -3.328 0.00305 **
94. ## WinterRain 0.0011667 0.0004820 2.420 0.02421 *
95. ## AGST 0.6163916 0.0951747 6.476 1.63e-06 ***
96. ## HarvestRain -0.0038606 0.0008075 -4.781 8.97e-05 ***
97. ## ---
98. ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
99. ##
100. ## Residual standard error: 0.2865 on 22 degrees of freedom
101. ## Multiple R-squared: 0.8275, Adjusted R-squared: 0.7962
102. ## F-statistic: 26.39 on 4 and 22 DF, p-value: 4.057e-08
接下来是stepAIC 对执行情况的解释。在每个步骤中， stepAIC 显示有关信息标准当前值的信息。例如，对于 modBIC，第一步的BIC是Step: AIC=-53.29 ，然后在第二步进行了改进 Step: AIC=-56.55 （即使使用“ BIC”，该功能也会始终输出“ AIC”）。下一个继续前进的模型是stepAIC 通过研究添加或删除预测变量后得出的不同模型的信息标准来决定的（取决于 direction 参数，在下文中进行解释）。例如modBIC在第一步中，删除导致的模型 FrancePop 的BIC等于 -56.551，如果 Year 删除，则BIC将为 -56.519。逐步回归，然后删除 FrancePop （因为它给出了最低的BIC），然后重复此过程，最终导致删除 <none> 预测变量是可能的最佳操作。下面的代码块说明了stepsAIC的输出 extractAIC，和BIC / AIC的输出BIC/ AIC。
2. # 相同的BIC，标准不同
5. AIC(modBIC, k = log(n))
6. ## [1] -56.55135
7. BIC(modBIC)
8. ## [1] 23.36717
9. # 观察到MASS :: stepAIC（mod，k = log（nrow（wine）））返回的最终BIC是由extractAIC（）而不是BIC（）给出的！但是两者是等效的
10. # 相同的AIC，标准不同
13. AIC(modBIC, k = 2)
14. ## [1] -63.03053
15. BIC(modBIC)
16. ## [1] 23.36717
19. BIC(modBIC) - AIC(modBIC
20. ## [1] 79.91852
21. n * (log(2 * pi+ 1) + log(n)
22. ## [1] 79.91852
24. #与AIC相同
27. AIC(modAIC) - AIC(modAIC
28. ## [1] 78.62268
29. n * (log(2 * pi + 1 + 2
30. ## [1] 78.62268
请注意，所选模型 modBIC 和 modAIC 等效于 modWine2 ，我们选择的最佳模型。这说明，选择的模型 stepAIC 通常是进一步添加或删除预测变量的良好起点。

使用 stepAIC BIC / AIC时，可能会选择不同的最终模型 direction。这是解释：
- “backward”：从给定模型中顺序删除预测变量。
- “forward”：将预测变量顺序添加到给定模型中。
- “both” （默认）：以上的组合。
该建议是尝试几种这些方法并保留一个最小的BIC / AIC。设置 trace = 0 为省略冗长的搜索过程信息输出。下面的代码块清楚地说明了如何使用数据集的修改版本来利用 direction 参数和的其他选项。stepAICwinedirection = "forward"direction = "both"scope
2. # 将无关的预测变量添加到葡萄酒数据集中
4. # 向后选择：从给定模型中顺序删除预测变量
7. # 从具有所有预测变量的模型开始
11. modAll, direction = "backward", k = log(n)
12. ## Start: AIC=-50.13
13. ## Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop +
14. ## noisePredictor
15. ##
16. ##
17. ## Step: AIC=-50.13
18. ## Price ~ Year + WinterRain + AGST + HarvestRain + FrancePop +
19. ## noisePredictor
20. ##
21. ## Df Sum of Sq RSS AIC
22. ## - FrancePop 1 0.0036 1.7977 -53.376
23. ## - Year 1 0.0038 1.7979 -53.374
24. ## - noisePredictor 1 0.0090 1.8032 -53.295
25. ## <none> 1.7941 -50.135
26. ## - WinterRain 1 0.4598 2.2539 -47.271
27. ## - HarvestRain 1 1.7666 3.5607 -34.923
28. ## - AGST 1 3.3658 5.1599 -24.908
29. ##
30. ## Step: AIC=-53.38
31. ## Price ~ Year + WinterRain + AGST + HarvestRain + noisePredictor
32. ##
33. ## Df Sum of Sq RSS AIC
34. ## - noisePredictor 1 0.0081 1.8058 -56.551
35. ## <none> 1.7977 -53.376
36. ## - WinterRain 1 0.4771 2.2748 -50.317
37. ## - Year 1 0.9162 2.7139 -45.552
38. ## - HarvestRain 1 1.8449 3.6426 -37.606
39. ## - AGST 1 3.4234 5.2212 -27.885
40. ##
41. ## Step: AIC=-56.55
42. ## Price ~ Year + WinterRain + AGST + HarvestRain
43. ##
44. ## Df Sum of Sq RSS AIC
45. ## <none> 1.8058 -56.551
46. ## - WinterRain 1 0.4809 2.2867 -53.473
47. ## - Year 1 0.9089 2.7147 -48.840
48. ## - HarvestRain 1 1.8760 3.6818 -40.612
49. ## - AGST 1 3.4428 5.2486 -31.039
50. ##
51. ## Call:
52. ## lm(formula = Price ~ Year + WinterRain + AGST + HarvestRain,
53. ## data = wineNoise)
54. ##
55. ## Coefficients:
56. ## (Intercept) Year WinterRain AGST HarvestRain
57. ## 43.639042 -0.023848 0.001167 0.616392 -0.003861
59. # 从中间模型开始
62. AIC(modInter, direction = "backward", k = log(n)
63. ## Start: AIC=-32.38
64. ## Price ~ noisePredictor + Year + AGST
65. ##
66. ## Df Sum of Sq RSS AIC
67. ## - noisePredictor 1 0.0146 5.0082 -35.601
68. ## <none> 4.9936 -32.384
69. ## - Year 1 0.7522 5.7459 -31.891
70. ## - AGST 1 3.2211 8.2147 -22.240
71. ##
72. ## Step: AIC=-35.6
73. ## Price ~ Year + AGST
74. ##
75. ## Df Sum of Sq RSS AIC
76. ## <none> 5.0082 -35.601
77. ## - Year 1 0.7966 5.8049 -34.911
78. ## - AGST 1 3.2426 8.2509 -25.417
79. ##
80. ## Call:
81. ## lm(formula = Price ~ Year + AGST, data = wineNoise)
82. ##
83. ## Coefficients:
84. ## (Intercept) Year AGST
85. ## 41.49441 -0.02221 0.56067
86. # 回想一下，在搜索过程中未探索未包含在modInter中的其他预测变量（因此未添加相关的预测变量HarvestRain）
90. # 正向选择：从给定模型顺序添加预测变量
93. # 从没有预测变量的模型开始，仅截距模型（表示为〜1）
96. AIC(modZero, direction = "forward"
97. ## Start: AIC=-22.28
98. ## Price ~ 1
99. ##
100. ## Df Sum of Sq RSS AIC
101. ## + AGST 1 4.6655 5.8049 -34.911
102. ## + HarvestRain 1 2.6933 7.7770 -27.014
103. ## + FrancePop 1 2.4231 8.0472 -26.092
104. ## + Year 1 2.2195 8.2509 -25.417
105. ## + Age 1 2.2195 8.2509 -25.417
106. ## <none> 10.4703 -22.281
107. ## + WinterRain 1 0.1905 10.2798 -19.481
108. ## + noisePredictor 1 0.1761 10.2942 -19.443
109. ##
110. ## Step: AIC=-34.91
111. ## Price ~ AGST
112. ##
113. ## Df Sum of Sq RSS AIC
114. ## + HarvestRain 1 2.50659 3.2983 -46.878
115. ## + WinterRain 1 1.42392 4.3809 -39.214
116. ## + FrancePop 1 0.90263 4.9022 -36.178
117. ## + Year 1 0.79662 5.0082 -35.601
118. ## + Age 1 0.79662 5.0082 -35.601
119. ## <none> 5.8049 -34.911
120. ## + noisePredictor 1 0.05900 5.7459 -31.891
121. ##
122. ## Step: AIC=-46.88
123. ## Price ~ AGST + HarvestRain
124. ##
125. ## Df Sum of Sq RSS AIC
126. ## + FrancePop 1 1.03572 2.2625 -53.759
127. ## + Year 1 1.01159 2.2867 -53.473
128. ## + Age 1 1.01159 2.2867 -53.473
129. ## + WinterRain 1 0.58356 2.7147 -48.840
130. ## <none> 3.2983 -46.878
131. ## + noisePredictor 1 0.06084 3.2374 -44.085
132. ##
133. ## Step: AIC=-53.76
134. ## Price ~ AGST + HarvestRain + FrancePop
135. ##
136. ## Df Sum of Sq RSS AIC
137. ## + WinterRain 1 0.45456 1.8080 -56.519
138. ## <none> 2.2625 -53.759
139. ## + noisePredictor 1 0.00829 2.2542 -50.562
140. ## + Age 1 0.00085 2.2617 -50.473
141. ## + Year 1 0.00085 2.2617 -50.473
142. ##
143. ## Step: AIC=-56.52
144. ## Price ~ AGST + HarvestRain + FrancePop + WinterRain
145. ##
146. ## Df Sum of Sq RSS AIC
147. ## <none> 1.8080 -56.519
148. ## + noisePredictor 1 0.0100635 1.7979 -53.374
149. ## + Year 1 0.0048039 1.8032 -53.295
150. ## + Age 1 0.0048039 1.8032 -53.295
151. ##
152. ## Call:
153. ## lm(formula = Price ~ AGST + HarvestRain + FrancePop + WinterRain,
154. ## data = wineNoise)
155. ##
156. ## Coefficients:
157. ## (Intercept) AGST HarvestRain FrancePop WinterRain
158. ## -5.945e-01 6.127e-01 -3.804e-03 -5.189e-05 1.136e-03
159. # 在进行正向搜索时，充分设置范围参数非常重要！在范围中，您必须定义包含可探索模型集的“最小”（下部）和“最大”（上部）模型。如果未提供，则将最大模型用作传递的起始模型（在这种情况下为modZero），而stepAIC将不执行任何搜索
163. #从中间模型开始
167. ## Start: AIC=-32.38
168. ## Price ~ noisePredictor + Year + AGST
169. ##
170. ## Df Sum of Sq RSS AIC
171. ## + HarvestRain 1 2.71878 2.2748 -50.317
172. ## + WinterRain 1 1.35102 3.6426 -37.606
173. ## <none> 4.9936 -32.384
174. ## + FrancePop 1 0.24004 4.7536 -30.418
175. ##
176. ## Step: AIC=-50.32
177. ## Price ~ noisePredictor + Year + AGST + HarvestRain
178. ##
179. ## Df Sum of Sq RSS AIC
180. ## + WinterRain 1 0.47710 1.7977 -53.376
181. ## <none> 2.2748 -50.317
182. ## + FrancePop 1 0.02094 2.2539 -47.271
183. ##
184. ## Step: AIC=-53.38
185. ## Price ~ noisePredictor + Year + AGST + HarvestRain + WinterRain
186. ##
187. ## Df Sum of Sq RSS AIC
188. ## <none> 1.7977 -53.376
189. ## + FrancePop 1 0.0036037 1.7941 -50.135
190. ##
191. ## Call:
192. ## lm(formula = Price ~ noisePredictor + Year + AGST + HarvestRain +
193. ## WinterRain, data = wineNoise)
194. ##
195. ## Coefficients:
196. ## (Intercept) noisePredictor Year AGST HarvestRain WinterRain
197. ## 44.096639 -0.019617 -0.024126 0.620522 -0.003840 ＃回想一下，在搜索期间不会删除modInter中包含的预测变量（因此会保留无关的noisePredictor）
199. ＃两种选择：如果从中间模型开始，则很有用
201. ＃消除了与从中间模型完成的“向后”和“向前”搜索相关的问题
204. ## Start: AIC=-32.38
205. ## Price ~ noisePredictor + Year + AGST
206. ##
207. ## Df Sum of Sq RSS AIC
208. ## + HarvestRain 1 2.7188 2.2748 -50.317
209. ## + WinterRain 1 1.3510 3.6426 -37.606
210. ## - noisePredictor 1 0.0146 5.0082 -35.601
211. ## <none> 4.9936 -32.384
212. ## - Year 1 0.7522 5.7459 -31.891
213. ## + FrancePop 1 0.2400 4.7536 -30.418
214. ## - AGST 1 3.2211 8.2147 -22.240
215. ##
216. ## Step: AIC=-50.32
217. ## Price ~ noisePredictor + Year + AGST + HarvestRain
218. ##
219. ## Df Sum of Sq RSS AIC
220. ## - noisePredictor 1 0.01182 2.2867 -53.473
221. ## + WinterRain 1 0.47710 1.7977 -53.376
222. ## <none> 2.2748 -50.317
223. ## + FrancePop 1 0.02094 2.2539 -47.271
224. ## - Year 1 0.96258 3.2374 -44.085
225. ## - HarvestRain 1 2.71878 4.9936 -32.384
226. ## - AGST 1 2.94647 5.2213 -31.180
227. ##
228. ## Step: AIC=-53.47
229. ## Price ~ Year + AGST + HarvestRain
230. ##
231. ## Df Sum of Sq RSS AIC
232. ## + WinterRain 1 0.48087 1.8058 -56.551
233. ## <none> 2.2867 -53.473
234. ## + FrancePop 1 0.02497 2.2617 -50.473
235. ## + noisePredictor 1 0.01182 2.2748 -50.317
236. ## - Year 1 1.01159 3.2983 -46.878
237. ## - HarvestRain 1 2.72157 5.0082 -35.601
238. ## - AGST 1 2.96500 5.2517 -34.319
239. ##
240. ## Step: AIC=-56.55
241. ## Price ~ Year + AGST + HarvestRain + WinterRain
242. ##
243. ## Df Sum of Sq RSS AIC
244. ## <none> 1.8058 -56.551
245. ## - WinterRain 1 0.4809 2.2867 -53.473
246. ## + noisePredictor 1 0.0081 1.7977 -53.376
247. ## + FrancePop 1 0.0026 1.8032 -53.295
248. ## - Year 1 0.9089 2.7147 -48.840
249. ## - HarvestRain 1 1.8760 3.6818 -40.612
250. ## - AGST 1 3.4428 5.2486 -31.039
251. ##
252. ## Call:
253. ## lm(formula = Price ~ Year + AGST + HarvestRain + WinterRain,
254. ## data = wineNoise)
255. ##
256. ## Coefficients:
257. ## (Intercept) Year AGST HarvestRain WinterRain
258. ## 43.639042 -0.023848 0.616392 -0.003861 0.001167
259. # 正确定义范围也很重要，因为“两个”都求助于“前进”（以及“后退”）
262. #使用完整模型中的默认值实质上会进行向后选择，但允许已删除的预测变量在以后的步骤中再次输入
265. AIC(modAll direction = "both", k = log(n)
266. ## Start: AIC=-50.13
267. ## Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop +
268. ## noisePredictor
269. ##
270. ##
271. ## Step: AIC=-50.13
272. ## Price ~ Year + WinterRain + AGST + HarvestRain + FrancePop +
273. ## noisePredictor
274. ##
275. ## Df Sum of Sq RSS AIC
276. ## - FrancePop 1 0.0036 1.7977 -53.376
277. ## - Year 1 0.0038 1.7979 -53.374
278. ## - noisePredictor 1 0.0090 1.8032 -53.295
279. ## <none> 1.7941 -50.135
280. ## - WinterRain 1 0.4598 2.2539 -47.271
281. ## - HarvestRain 1 1.7666 3.5607 -34.923
282. ## - AGST 1 3.3658 5.1599 -24.908
283. ##
284. ## Step: AIC=-53.38
285. ## Price ~ Year + WinterRain + AGST + HarvestRain + noisePredictor
286. ##
287. ## Df Sum of Sq RSS AIC
288. ## - noisePredictor 1 0.0081 1.8058 -56.551
289. ## <none> 1.7977 -53.376
290. ## - WinterRain 1 0.4771 2.2748 -50.317
291. ## + FrancePop 1 0.0036 1.7941 -50.135
292. ## - Year 1 0.9162 2.7139 -45.552
293. ## - HarvestRain 1 1.8449 3.6426 -37.606
294. ## - AGST 1 3.4234 5.2212 -27.885
295. ##
296. ## Step: AIC=-56.55
297. ## Price ~ Year + WinterRain + AGST + HarvestRain
298. ##
299. ## Df Sum of Sq RSS AIC
300. ## <none> 1.8058 -56.551
301. ## - WinterRain 1 0.4809 2.2867 -53.473
302. ## + noisePredictor 1 0.0081 1.7977 -53.376
303. ## + FrancePop 1 0.0026 1.8032 -53.295
304. ## - Year 1 0.9089 2.7147 -48.840
305. ## - HarvestRain 1 1.8760 3.6818 -40.612
306. ## - AGST 1 3.4428 5.2486 -31.039
307. ##
308. ## Call:
309. ## lm(formula = Price ~ Year + WinterRain + AGST + HarvestRain,
310. ## data = wineNoise)
311. ##
312. ## Coefficients:
313. ## (Intercept) Year WinterRain AGST HarvestRain
314. ## 43.639042 -0.023848 0.001167 0.616392 -0.003861
316. # 省略冗长的输出
319. AIC(modAll, direction = "both", trace = 0
320. ##
321. ## Call:
322. ## lm(formula = Price ~ Year + WinterRain + AGST + HarvestRain,
323. ## data = wineNoise)
324. ##
325. ## Coefficients:
326. ## (Intercept) Year WinterRain AGST HarvestRain
327. ## 43.639042 -0.023848 0.001167 0.616392 -0.003861
对Boston 数据集运行逐步选择，目的是清楚地了解不同的搜索方向。特别：
- "forward" 从逐步拟合 medv ~ 1开始做。
- "forward" 从逐步拟合 medv ~ crim + lstat + age开始做。
- "both" 从逐步拟合 medv ~ crim + lstat + age开始做。
- "both" 从逐步拟合 medv ~ .开始做。
- "backward" 从逐步拟合 medv ~ .开始做。
stepAIC 假定数据中不存在 NA（缺失值）。建议先删除数据中的缺失值。它们的存在可能会导致错误。为此，请使用 data = na.omit(dataset) 调用 lm （如果您的数据集为 dataset）。

我们通过强调使用BIC和AIC得出结论：它们的构造是假设样本大小n 远大于模型中参数的数量p + 2。因此，如果n >> p + 2 ，它们将工作得相当好，但是如果不是这样，则它们可能会支持不切实际的复杂模型。下图对此现象进行了说明。BIC和AIC曲线倾向于使局部最小值接近p = 2，然后增加。但是当p + 2 接近n 时，它们会迅速下降。

图：n = 200和p从1 到198 的BIC和AIC的比较。M = 100数据集仅在前两个预测变量有效的情况下进行了模拟。较粗的曲线是每种颜色曲线的平均值。

房价案例研究应用

我们要建立一个线性模型进行预测和解释 medv。有大量的预测模型，其中一些可能对预测medv没什么用。但是，目前尚不清楚哪个预测变量可以更好地解释 medv 的信息。因此，我们可以对所有预测变量进行线性模型处理：
2. summary(modHouse)
3. ##
5. ##
6. ## Residuals:
7. ## Min 1Q Median 3Q Max
8. ## -15.595 -2.730 -0.518 1.777 26.199
9. ##
10. ## Coefficients:
11. ## Estimate Std. Error t value Pr(>|t|)
12. ## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
13. ## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
14. ## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
15. ## indus 2.056e-02 6.150e-02 0.334 0.738288
16. ## chas 2.687e+00 8.616e-01 3.118 0.001925 **
17. ## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
18. ## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
19. ## age 6.922e-04 1.321e-02 0.052 0.958229
20. ## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
21. ## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
22. ## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
23. ## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
24. ## black 9.312e-03 2.686e-03 3.467 0.000573 ***
25. ## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
26. ## ---
27. ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
28. ##
29. ## Residual standard error: 4.745 on 492 degrees of freedom
30. ## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
31. ## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
有几个不重要的变量，但是到目前为止，该模型具有R ^ 2 = 0.74，并且拟合系数对预期的结果很敏感。例如 crim， tax， ptratio，和 nox 对medv有负面影响，同时 rm， rad和 chas 有正面的影响。但是，不重要的系数不会显着影响模型，而只会增加噪声并降低系数估计的总体准确性。让我们稍微完善一下以前的模型。
2. # 最佳模型
4. AIC(modHouse, k = log(nrow(Boston)
5. ## Start: AIC=1648.81
6. ## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
7. ## tax + ptratio + black + lstat
8. ##
9. ## Df Sum of Sq RSS AIC
10. ## - age 1 0.06 11079 1642.6
11. ## - indus 1 2.52 11081 1642.7
12. ## <none> 11079 1648.8
13. ## - chas 1 218.97 11298 1652.5
14. ## - tax 1 242.26 11321 1653.5
15. ## - crim 1 243.22 11322 1653.6
16. ## - zn 1 257.49 11336 1654.2
17. ## - black 1 270.63 11349 1654.8
18. ## - rad 1 479.15 11558 1664.0
19. ## - nox 1 487.16 11566 1664.4
20. ## - ptratio 1 1194.23 12273 1694.4
21. ## - dis 1 1232.41 12311 1696.0
22. ## - rm 1 1871.32 12950 1721.6
23. ## - lstat 1 2410.84 13490 1742.2
24. ##
25. ## Step: AIC=1642.59
26. ## medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax +
27. ## ptratio + black + lstat
28. ##
29. ## Df Sum of Sq RSS AIC
30. ## - indus 1 2.52 11081 1636.5
31. ## <none> 11079 1642.6
32. ## - chas 1 219.91 11299 1646.3
33. ## - tax 1 242.24 11321 1647.3
34. ## - crim 1 243.20 11322 1647.3
35. ## - zn 1 260.32 11339 1648.1
36. ## - black 1 272.26 11351 1648.7
37. ## - rad 1 481.09 11560 1657.9
38. ## - nox 1 520.87 11600 1659.6
39. ## - ptratio 1 1200.23 12279 1688.4
40. ## - dis 1 1352.26 12431 1694.6
41. ## - rm 1 1959.55 13038 1718.8
42. ## - lstat 1 2718.88 13798 1747.4
43. ##
44. ## Step: AIC=1636.48
45. ## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
46. ## black + lstat
47. ##
48. ## Df Sum of Sq RSS AIC
49. ## <none> 11081 1636.5
50. ## - chas 1 227.21 11309 1640.5
51. ## - crim 1 245.37 11327 1641.3
52. ## - zn 1 257.82 11339 1641.9
53. ## - black 1 270.82 11352 1642.5
54. ## - tax 1 273.62 11355 1642.6
55. ## - rad 1 500.92 11582 1652.6
56. ## - nox 1 541.91 11623 1654.4
57. ## - ptratio 1 1206.45 12288 1682.5
58. ## - dis 1 1448.94 12530 1692.4
59. ## - rm 1 1963.66 13045 1712.8
60. ## - lstat 1 2723.48 13805 1741.5
62. # 模型比较
63. compare(modBIC, modAIC)
64. ## Calls:
65. ## 1: lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + black + lstat, data = Boston)
66. ## 2: lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + black + lstat, data = Boston)
67. ##
68. ## Model 1 Model 2
69. ## (Intercept) 36.34 36.34
70. ## SE 5.07 5.07
71. ##
72. ## crim -0.1084 -0.1084
73. ## SE 0.0328 0.0328
74. ##
75. ## zn 0.0458 0.0458
76. ## SE 0.0135 0.0135
77. ##
78. ## chas 2.719 2.719
79. ## SE 0.854 0.854
80. ##
81. ## nox -17.38 -17.38
82. ## SE 3.54 3.54
83. ##
84. ## rm 3.802 3.802
85. ## SE 0.406 0.406
86. ##
87. ## dis -1.493 -1.493
88. ## SE 0.186 0.186
89. ##
90. ## rad 0.2996 0.2996
91. ## SE 0.0634 0.0634
92. ##
93. ## tax -0.01178 -0.01178
94. ## SE 0.00337 0.00337
95. ##
96. ## ptratio -0.947 -0.947
97. ## SE 0.129 0.129
98. ##
99. ## black 0.00929 0.00929
100. ## SE 0.00267 0.00267
101. ##
102. ## lstat -0.5226 -0.5226
103. ## SE 0.0474 0.0474
104. ##
105. summary(modBIC)
106. ##
107. ## Call:
108. ## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
109. ## tax + ptratio + black + lstat, data = Boston)
110. ##
111. ## Residuals:
112. ## Min 1Q Median 3Q Max
113. ## -15.5984 -2.7386 -0.5046 1.7273 26.2373
114. ##
115. ## Coefficients:
116. ## Estimate Std. Error t value Pr(>|t|)
117. ## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
118. ## crim -0.108413 0.032779 -3.307 0.001010 **
119. ## zn 0.045845 0.013523 3.390 0.000754 ***
120. ## chas 2.718716 0.854240 3.183 0.001551 **
121. ## nox -17.376023 3.535243 -4.915 1.21e-06 ***
122. ## rm 3.801579 0.406316 9.356 < 2e-16 ***
123. ## dis -1.492711 0.185731 -8.037 6.84e-15 ***
124. ## rad 0.299608 0.063402 4.726 3.00e-06 ***
125. ## tax -0.011778 0.003372 -3.493 0.000521 ***
126. ## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
127. ## black 0.009291 0.002674 3.475 0.000557 ***
128. ## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
129. ## ---
130. ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
131. ##
132. ## Residual standard error: 4.736 on 494 degrees of freedom
133. ## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
134. ## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
136. # 置信区间
138. conf(modBIC)
139. ## 2.5 % 97.5 %
140. ## (Intercept) 26.384649126 46.29764088
141. ## crim -0.172817670 -0.04400902
142. ## zn 0.019275889 0.07241397
143. ## chas 1.040324913 4.39710769
144. ## nox -24.321990312 -10.43005655
145. ## rm 3.003258393 4.59989929
146. ## dis -1.857631161 -1.12779176
147. ## rad 0.175037411 0.42417950
148. ## tax -0.018403857 -0.00515209
149. ## ptratio -1.200109823 -0.69293932
150. ## black 0.004037216 0.01454447
151. ## lstat -0.615731781 -0.42937513
请注意，相对于完整模型，略有增加，以及所有预测变量显着。

我们已经量化了预测变量对房价（Q1）的影响，可以得出结论，在最终模型（Q2）中，显着性水平为：
- chas， age， rad， black 对medv有显著正面的影响 ;
- nox， dis， tax， ptratio， lstat 对medv有显著负面的影响。
检查：
- modBIC 不能通过消除预测指标来改善BIC。
- modBIC 无法通过添加预测变量来改进BIC。使用 addterm(modBIC, scope = lm(medv ~ ., data = Boston), k = log(nobs(modBIC))) 。
1. 应用其公式，我们将获得，因此将不会定义。
2. 具有相同的因变量。
3. 如果是，则。
4. 同样，由于BIC 在选择真实的分布/回归模型时是 一致的：如果提供了足够的数据，则可以保证BIC在候选列表中选择真实的数据生成模型。如果真实模型包含在该列表中，则模型为线性模型。但是，由于实际模型可能是非线性的，因此在实践中这可能是不现实的。
最受欢迎的见解

1.R语言多元Logistic逻辑回归应用案例

2.面板平滑转移回归(PSTR)分析案例实现

3.matlab中的偏最小二乘回归（PLSR）和主成分回归（PCR）

4.R语言泊松Poisson回归模型分析案例

5.R语言回归中的Hosmer-Lemeshow拟合优度检验

6.r语言中对LASSO回归，Ridge岭回归和Elastic Net模型实现

7.在R语言中实现Logistic逻辑回归

8.python用线性回归预测股票价格

9.R语言如何在生存分析与Cox回归中计算IDI，NRI指标
▍关注我们【大数据部落】第三方数据服务提供商,提供全面的统计分析与数据挖掘咨询服务,为客户定制个性化的数据解决方案与行业报告等。 ▍咨询链接：http://y0.cn/teradat ▍联系邮箱：3025393450@qq.com
查看全文

相关阅读:
CentOS 6.3 安装过程
 thinkphp的学习笔记
 全栈工程师之路
 Anne pro 2 开箱实录
 IDEA创建SpringBoot项目整合JPA，连接Oracle数据库，使用Swagger进行测试
 AOP思想的一点想法
 STS（Eclipse）修改Project Exploer背景颜色及字体大小
 IDEA运行Tomcat，控制台日志乱码
 MyBatis自动生成实体类、DAO和XML映射文件，并去掉实体类中的getter和setter方法
 vue+jspdf+html2canvas导出PDF文件

原文地址：https://www.cnblogs.com/tecdat/p/14314866.html

拓端数据tecdat|R语言多元逐步回归模型分析房价和葡萄酒价格：选择最合适的预测变量

原文链接：http://tecdat.cn/?p=19405

房价案例研究应用