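Today's overview
- Mathematical and statistical analysis
Today's content in detail
Mathematical and statistical analysis
- Advanced mathematics
- Statistics
Essential basic models in statistics
Determining whether variables are related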
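- Plot the data (scatter plot)
- Compute the correlation coefficient between the variables. Common rule-of-thumb cutoffs for |r| are 0.8, 0.5 and 0.3; a value below 0.3 only indicates that the two variables have no linear relationship (a nonlinear relationship may still exist).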
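To make the last point concrete, here is a minimal sketch of a variable pair that is fully dependent yet shows no linear correlation. The helper `linear_strength` and its labels ("strong", "moderate", "weak") are assumptions added for illustration; the notes only give the cutoffs 0.8, 0.5 and 0.3.

```python
import numpy as np

def linear_strength(r):
    """Hypothetical helper: map |r| to a rough label using the 0.8 / 0.5 / 0.3 cutoffs."""
    size = abs(r)
    if size >= 0.8:
        return 'strong'
    if size >= 0.5:
        return 'moderate'
    if size >= 0.3:
        return 'weak'
    return 'no linear relationship'

# y is fully determined by x, but the relationship is a parabola, not a line
x = np.arange(-5, 6)
y = x ** 2
r = np.corrcoef(x, y)[0, 1]
print(r, linear_strength(r))
```

This is why a scatter plot is worth drawing first: the curve is obvious to the eye even though the coefficient alone suggests nothing is there.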
Computing the correlation coefficient in code
import pandas as pd
import numpy as np
x = np.array([52, 19, 7, 33, 2])
y = np.array([162, 61, 22, 100, 6])
# Mean
xmean = np.mean(x)
# xmean = 22.6
ymean = np.mean(y)
# ymean = 70.2
# Standard deviation (SD); np.std defaults to the population SD (ddof=0)
xsd = np.std(x)
# xsd = 18.183509012289132
ysd = np.std(y)
# ysd = 56.29351650057047
# Z-scores
zx = (x - xmean) / xsd
# zx = array([ 1.61684964, -0.19798159, -0.85792022, 0.57194681, -1.13289465])
zy = (y - ymean) / ysd
# zy = array([ 1.63073842, -0.16342912, -0.85622649, 0.52936824, -1.14045105])
# Correlation coefficient: the mean of the products of the z-scores
r = np.sum(zx * zy) / len(zx)
# r = 0.999674032661831
# Compute it directly with numpy's corrcoef function
t = np.corrcoef(x, y)
# t = array([[1.        , 0.99967403],
#            [0.99967403, 1.        ]])
# Compute it directly with the pandas corr method
data = pd.DataFrame({'x': x, 'y': y})
t2 = data.corr()
t2
Output:
          x         y
x  1.000000  0.999674
y  0.999674  1.000000
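One detail worth checking when reproducing the z-score calculation by hand: np.std uses the population standard deviation (ddof=0) by default, which pairs with dividing by n. If the sample standard deviation (ddof=1) is used instead, the divisor has to become n - 1. The sketch below, with a hypothetical helper `pearson_r`, shows that both consistent pairings agree with np.corrcoef on the same five points.

```python
import numpy as np

x = np.array([52, 19, 7, 33, 2])
y = np.array([162, 61, 22, 100, 6])

def pearson_r(x, y, ddof):
    """Pearson r from z-scores; the same ddof is used in the SD and in the divisor."""
    zx = (x - x.mean()) / x.std(ddof=ddof)
    zy = (y - y.mean()) / y.std(ddof=ddof)
    return np.sum(zx * zy) / (len(x) - ddof)

r_population = pearson_r(x, y, ddof=0)  # population SD, divide by n
r_sample = pearson_r(x, y, ddof=1)      # sample SD, divide by n - 1
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_population, r_sample, r_numpy)
```

Mixing the two conventions (for example, sample SD with a divisor of n) gives a slightly different number, which is a common source of small mismatches.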
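Deriving the formula (for background)
y = a + bx + ε
ε = y - (a + bx)
Least squares picks a and b to minimize the sum of squared errors: Σε**2 = Σ(y - (a + bx))**2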
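For the simple model y = a + bx + ε, the least-squares estimates have a well-known closed form: b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)**2 and a = ȳ - b·x̄. The following is a minimal sketch that applies these formulas to the five-point sample used in the correlation example; np.polyfit is used only as an independent cross-check and is not part of the original notes.

```python
import numpy as np

x = np.array([52, 19, 7, 33, 2], dtype=float)
y = np.array([162, 61, 22, 100, 6], dtype=float)

# Closed-form least-squares estimates for y = a + b*x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# Residuals and the sum of squared errors that the estimates minimize
residuals = y - (a + b * x)
sse = np.sum(residuals ** 2)

# Cross-check against numpy's degree-1 polynomial fit (returns slope first)
slope_check, intercept_check = np.polyfit(x, y, deg=1)
print(a, b, sse)
```

A useful sanity check is that the residuals of a fitted line with an intercept always sum to zero, up to floating-point error.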
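Solving it in code
# Import the third-party module
import statsmodels.api as sm
sm.formula.ols(formula, data, subset=None, drop_cols=None)
formula: the linear regression model formula, given as a string; for example, 'y~x' denotes a simple linear regression model
data: the dataset used to fit the model
subset: a boolean array that selects the subset of data used to fit the model
drop_cols: the columns to exclude from the model (statsmodels documents this as dropping columns from the design matrix)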
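# Import third-party modules
import pandas as pd
import statsmodels.api as sm
income = pd.read_csv('Salary_Data.csv')
# income
# Build the regression model from the income dataset
fit = sm.formula.ols('Salary ~ YearsExperience', data=income).fit()
# Estimated coefficients: Intercept and the YearsExperience slope
fit.params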
Multiple linear regression in code
# Import modules
import pandas as pd
import statsmodels.api as sm
from sklearn import model_selection
# Load the data
Profit = pd.read_excel(r'C:\Users\Administrator\Desktop\Predict to Profit.xlsx')
Profit
Output:
RD_Spend Administration Marketing_Spend State Profit
0 165349.20 136897.80 471784.10 New York 192261.83
1 162597.70 151377.59 443898.53 California 191792.06
2 153441.51 101145.55 407934.54 Florida 191050.39
3 144372.41 118671.85 383199.62 New York 182901.99
4 142107.34 91391.77 366168.42 Florida 166187.94
5 131876.90 99814.71 362861.36 New York 156991.12
6 134615.46 147198.87 127716.82 California 156122.51
7 130298.13 145530.06 323876.68 Florida 155752.60
8 120542.52 148718.95 311613.29 New York 152211.77
9 123334.88 108679.17 304981.62 California 149759.96
10 101913.08 110594.11 229160.95 Florida 146121.95
11 100671.96 91790.61 249744.55 California 144259.40
12 93863.75 127320.38 249839.44 Florida 141585.52
13 91992.39 135495.07 252664.93 California 134307.35
14 119943.24 156547.42 256512.92 Florida 132602.65
15 114523.61 122616.84 261776.23 New York 129917.04
16 78013.11 121597.55 264346.06 California 126992.93
17 94657.16 145077.58 282574.31 New York 125370.37
18 91749.16 114175.79 294919.57 Florida 124266.90
19 86419.70 153514.11 0.00 New York 122776.86
20 76253.86 113867.30 298664.47 California 118474.03
21 78389.47 153773.43 299737.29 New York 111313.02
22 73994.56 122782.75 303319.26 Florida 110352.25
23 67532.53 105751.03 304768.73 Florida 108733.99
24 77044.01 99281.34 140574.81 New York 108552.04
25 64664.71 139553.16 137962.62 California 107404.34
26 75328.87 144135.98 134050.07 Florida 105733.54
27 72107.60 127864.55 353183.81 New York 105008.31
28 66051.52 182645.56 118148.20 Florida 103282.38
29 65605.48 153032.06 107138.38 New York 101004.64
30 61994.48 115641.28 91131.24 Florida 99937.59
31 61136.38 152701.92 88218.23 New York 97483.56
32 63408.86 129219.61 46085.25 California 97427.84
33 55493.95 103057.49 214634.81 Florida 96778.92
34 46426.07 157693.92 210797.67 California 96712.80
35 46014.02 85047.44 205517.64 New York 96479.51
36 28663.76 127056.21 201126.82 Florida 90708.19
37 44069.95 51283.14 197029.42 California 89949.14
38 20229.59 65947.93 185265.10 New York 81229.06
39 38558.51 82982.09 174999.30 California 81005.76
40 28754.33 118546.05 172795.67 California 78239.91
41 27892.92 84710.77 164470.71 Florida 77798.83
42 23640.93 96189.63 148001.11 California 71498.49
43 15505.73 127382.30 35534.17 New York 69758.98
44 22177.74 154806.14 28334.72 California 65200.33
45 1000.23 124153.04 1903.93 New York 64926.08
46 1315.46 115816.21 297114.46 Florida 49490.75
47 0.00 135426.92 0.00 California 42559.73
48 542.05 51743.15 0.00 New York 35673.41
# Split the dataset into a training set and a test set
train, test = model_selection.train_test_split(Profit, test_size=0.2, random_state=1234)
train, test
Output:
( RD_Spend Administration Marketing_Spend State Profit
36 28663.76 127056.21 201126.82 Florida 90708.19
43 15505.73 127382.30 35534.17 New York 69758.98
17 94657.16 145077.58 282574.31 New York 125370.37
10 101913.08 110594.11 229160.95 Florida 146121.95
21 78389.47 153773.43 299737.29 New York 111313.02
20 76253.86 113867.30 298664.47 California 118474.03
22 73994.56 122782.75 303319.26 Florida 110352.25
1 162597.70 151377.59 443898.53 California 191792.06
32 63408.86 129219.61 46085.25 California 97427.84
46 1315.46 115816.21 297114.46 Florida 49490.75
27 72107.60 127864.55 353183.81 New York 105008.31
34 46426.07 157693.92 210797.67 California 96712.80
25 64664.71 139553.16 137962.62 California 107404.34
33 55493.95 103057.49 214634.81 Florida 96778.92
0 165349.20 136897.80 471784.10 New York 192261.83
11 100671.96 91790.61 249744.55 California 144259.40
7 130298.13 145530.06 323876.68 Florida 155752.60
3 144372.41 118671.85 383199.62 New York 182901.99
37 44069.95 51283.14 197029.42 California 89949.14
6 134615.46 147198.87 127716.82 California 156122.51
2 153441.51 101145.55 407934.54 Florida 191050.39
35 46014.02 85047.44 205517.64 New York 96479.51
45 1000.23 124153.04 1903.93 New York 64926.08
9 123334.88 108679.17 304981.62 California 149759.96
16 78013.11 121597.55 264346.06 California 126992.93
5 131876.90 99814.71 362861.36 New York 156991.12
28 66051.52 182645.56 118148.20 Florida 103282.38
40 28754.33 118546.05 172795.67 California 78239.91
39 38558.51 82982.09 174999.30 California 81005.76
30 61994.48 115641.28 91131.24 Florida 99937.59
26 75328.87 144135.98 134050.07 Florida 105733.54
41 27892.92 84710.77 164470.71 Florida 77798.83
23 67532.53 105751.03 304768.73 Florida 108733.99
15 114523.61 122616.84 261776.23 New York 129917.04
24 77044.01 99281.34 140574.81 New York 108552.04
12 93863.75 127320.38 249839.44 Florida 141585.52
38 20229.59 65947.93 185265.10 New York 81229.06
19 86419.70 153514.11 0.00 New York 122776.86
47 0.00 135426.92 0.00 California 42559.73,
RD_Spend Administration Marketing_Spend State Profit
8 120542.52 148718.95 311613.29 New York 152211.77
48 542.05 51743.15 0.00 New York 35673.41
14 119943.24 156547.42 256512.92 Florida 132602.65
42 23640.93 96189.63 148001.11 California 71498.49
29 65605.48 153032.06 107138.38 New York 101004.64
44 22177.74 154806.14 28334.72 California 65200.33
4 142107.34 91391.77 366168.42 Florida 166187.94
31 61136.38 152701.92 88218.23 New York 97483.56
13 91992.39 135495.07 252664.93 California 134307.35
18 91749.16 114175.79 294919.57 Florida 124266.90)
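The split sizes above follow from how train_test_split handles a fractional test_size: the test set gets the number of rows rounded up, and the training set gets the rest. The sketch below uses a plain array of 49 row labels in place of the Profit table (an assumption made so that the example runs without the Excel file) and also checks that fixing random_state makes the split repeatable.

```python
import numpy as np
from sklearn import model_selection

rows = np.arange(49)  # stand-in for the 49 row labels 0..48

train, test = model_selection.train_test_split(
    rows, test_size=0.2, random_state=1234)
train_again, test_again = model_selection.train_test_split(
    rows, test_size=0.2, random_state=1234)

print(len(train), len(test))              # 39 10
print(np.array_equal(test, test_again))   # True: same seed, same split
```

Without random_state, each run would draw a different test set, so the predictions later in this section would not be reproducible.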
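# Fit the model on the train dataset; C(State) treats State as a categorical variable
model = sm.formula.ols('Profit ~ RD_Spend + Administration + Marketing_Spend + C(State)', data=train).fit()
# print('Partial regression coefficients of the model:\n', model.params)
# Drop the Profit variable from the test dataset and predict from the remaining explanatory variables
test_X = test.drop(labels='Profit', axis=1)
pred = model.predict(exog=test_X)
print('Predicted vs. actual values:\n', pd.DataFrame({'Prediction': pred, 'Real': test.Profit}))
Output:
Predicted vs. actual values: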
Prediction Real
8 150621.345802 152211.77
48 55513.218079 35673.41
14 150369.022458 132602.65
42 74057.015562 71498.49
29 103413.378282 101004.64
44 67844.850378 65200.33
4 173454.059691 166187.94
31 99580.888895 97483.56
13 128147.138396 134307.35
18 130693.433835 124266.90
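Reading ten rows by eye only goes so far, so it can help to condense the comparison into a couple of error measures. The sketch below copies the ten Prediction and Real values from the table above and computes the mean absolute error and the root mean squared error. These metrics are a suggested way to summarize the table, not something the original notes compute.

```python
import numpy as np

# Values copied from the Prediction / Real table above (row order preserved)
prediction = np.array([150621.345802, 55513.218079, 150369.022458,
                       74057.015562, 103413.378282, 67844.850378,
                       173454.059691, 99580.888895, 128147.138396,
                       130693.433835])
real = np.array([152211.77, 35673.41, 132602.65, 71498.49, 101004.64,
                 65200.33, 166187.94, 97483.56, 134307.35, 124266.90])

errors = prediction - real
mae = np.mean(np.abs(errors))            # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))     # root mean squared error
worst = int(np.argmax(np.abs(errors)))   # position of the largest miss
print(mae, rmse, worst)
```

The largest miss is at position 1, which is row 48 in the table: a company with almost no R&D or marketing spend, where the model overestimates profit by a wide margin.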