import stata_setup
stata_setup.config('C:/Program Files/Stata18', 'mp', splash=False)
12 - Multiple Linear Regression
1 Importing the data
%%stata
sysuse auto.dta,clear
(1978 automobile data)
2 Simple versus multiple linear regression
Simple linear regression has the form \(y=\beta_0+\beta_1 x\). For example:
%%stata
reg price weight
Source | SS df MS Number of obs = 74
-------------+---------------------------------- F(1, 72) = 29.42
Model | 184233937 1 184233937 Prob > F = 0.0000
Residual | 450831459 72 6261548.04 R-squared = 0.2901
-------------+---------------------------------- Adj R-squared = 0.2802
Total | 635065396 73 8699525.97 Root MSE = 2502.3
------------------------------------------------------------------------------
price | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
weight | 2.044063 .3768341 5.42 0.000 1.292857 2.795268
_cons | -6.707353 1174.43 -0.01 0.995 -2347.89 2334.475
------------------------------------------------------------------------------
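\(\beta_1\) (weight): for each one-unit increase in the car's weight, the price increases by 2.04 units (95% CI: 1.29, 2.80).
_cons is \(\beta_0\), the intercept. Note that its sign need not be realistic: the estimate here is negative (-6.71), although a price cannot be negative.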
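To make the roles of \(\beta_0\) and \(\beta_1\) concrete, here is a minimal pure-Python sketch of the closed-form least-squares fit for one predictor. The function name `fit_simple` and the four data points are made up for illustration; they are not taken from auto.dta, and this is not how Stata computes its output internally.

```python
# Closed-form simple linear regression: y = beta0 + beta1 * x.
# The data below are hypothetical and lie exactly on y = 1 + 2x.

def fit_simple(x, y):
    """Return (beta0, beta1) that minimize the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: covariance of x and y divided by the variance of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = s_xy / s_xx
    # Intercept: the fitted line passes through (x_bar, y_bar).
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

x = [1.0, 2.0, 3.0, 4.0]  # hypothetical predictor values
y = [3.0, 5.0, 7.0, 9.0]  # hypothetical outcomes
beta0, beta1 = fit_simple(x, y)
print(beta0, beta1)
```

The intercept is the fitted value at x = 0. In the price-on-weight fit, that corresponds to a car weighing nothing, which is why its value has no practical interpretation.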
A multiple linear regression model has more than one independent variable and takes the form \(y=\beta_0+\beta_1 x_1 +\beta_2 x_2 + \dots\)
%%stata
reg price weight length
Source | SS df MS Number of obs = 74
-------------+---------------------------------- F(2, 71) = 18.91
Model | 220725280 2 110362640 Prob > F = 0.0000
Residual | 414340116 71 5835776.28 R-squared = 0.3476
-------------+---------------------------------- Adj R-squared = 0.3292
Total | 635065396 73 8699525.97 Root MSE = 2415.7
------------------------------------------------------------------------------
price | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
weight | 4.699065 1.122339 4.19 0.000 2.461184 6.936946
length | -97.96031 39.1746 -2.50 0.015 -176.0722 -19.84838
_cons | 10386.54 4308.159 2.41 0.019 1796.316 18976.76
------------------------------------------------------------------------------
\(\beta_2\) (length): holding weight constant, each one-unit increase in the car's length is associated with a price decrease of 97.96 units (95% CI: -176.07, -19.85).
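The phrase "holding weight constant" can be checked by plugging the fitted coefficients into the model equation. The sketch below copies the three coefficients from the `reg price weight length` output above. The helper `predict_price` and the example car (weight 3000, length 190) are assumptions chosen only for illustration.

```python
# Coefficients copied from the `reg price weight length` output.
B0 = 10386.54         # _cons
B_WEIGHT = 4.699065   # weight
B_LENGTH = -97.96031  # length

def predict_price(weight, length):
    """Fitted price from the two-predictor model."""
    return B0 + B_WEIGHT * weight + B_LENGTH * length

# Hypothetical car: same weight, length differing by one unit.
base = predict_price(weight=3000, length=190)
longer = predict_price(weight=3000, length=191)
diff = longer - base
print(round(diff, 2))  # about -97.96, the length coefficient
```

Because weight is identical in both calls, the weight term cancels and the difference is exactly the length coefficient, whatever weight is chosen.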
%%stata
reg price weight length mpg
Source | SS df MS Number of obs = 74
-------------+---------------------------------- F(3, 70) = 12.98
Model | 226957412 3 75652470.6 Prob > F = 0.0000
Residual | 408107984 70 5830114.06 R-squared = 0.3574
-------------+---------------------------------- Adj R-squared = 0.3298
Total | 635065396 73 8699525.97 Root MSE = 2414.6
------------------------------------------------------------------------------
price | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
weight | 4.364798 1.167455 3.74 0.000 2.036383 6.693213
length | -104.8682 39.72154 -2.64 0.010 -184.0903 -25.64607
mpg | -86.78928 83.94335 -1.03 0.305 -254.209 80.63046
_cons | 14542.43 5890.632 2.47 0.016 2793.94 26290.93
------------------------------------------------------------------------------
%%stata
reg price weight length mpg rep78
Source | SS df MS Number of obs = 69
-------------+---------------------------------- F(4, 64) = 12.68
Model | 255066807 4 63766701.9 Prob > F = 0.0000
Residual | 321730151 64 5027033.62 R-squared = 0.4422
-------------+---------------------------------- Adj R-squared = 0.4074
Total | 576796959 68 8482308.22 Root MSE = 2242.1
------------------------------------------------------------------------------
price | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
weight | 4.959534 1.119624 4.43 0.000 2.722827 7.196241
length | -115.0177 38.56456 -2.98 0.004 -192.0592 -37.97612
mpg | -106.7122 81.15836 -1.31 0.193 -268.8446 55.42027
rep78 | 910.9859 304.5274 2.99 0.004 302.6226 1519.349
_cons | 11934.51 5774.178 2.07 0.043 399.2604 23469.75
------------------------------------------------------------------------------
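\(\beta_4\) (rep78): holding the car's weight, length, and mileage (mpg) constant, each one-unit increase in rep78 (repair record) is associated with a price increase of 910.99 units (95% CI: 302.62, 1519.35).
Here rep78 enters the model as a continuous variable. It can instead be treated as a multi-category variable, coded with dummy variables.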
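2.1 What are dummy variables, and why use them?
When an independent variable X is a multi-category variable, such as occupation, education level, blood type, or disease severity, a single regression coefficient is a poor way to describe the differences between its categories and their effect on the dependent variable Y. Dummy variables replace X with a set of 0/1 indicators, one for each category except a reference category, so that each category gets its own coefficient.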
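As a rough illustration of this coding, the sketch below turns a categorical column into 0/1 indicators and leaves out a reference level. The function `dummy_code` and the sample values are hypothetical. The sketch only mimics the idea and is not Stata's implementation.

```python
def dummy_code(values, reference):
    """Map each value to a dict of 0/1 indicators, omitting the reference level."""
    levels = sorted(set(values) - {reference})
    return [{level: int(v == level) for level in levels} for v in values]

rep = [1, 2, 3, 4, 5, 3]  # hypothetical category codes with five levels
rows = dummy_code(rep, reference=1)
print(rows[0])  # reference level: every indicator is 0
print(rows[2])  # level 3: only the indicator for 3 is 1
```

A variable with five levels therefore contributes four indicator columns, each with its own coefficient, and every coefficient is read as a comparison with the reference level.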
2.2 How to set up dummy variables
- Prefix the multi-category variable with "i." to tell Stata that it is not a continuous variable
- For example:
%%stata
reg price weight length mpg i.rep78
Source | SS df MS Number of obs = 69
-------------+---------------------------------- F(7, 61) = 7.25
Model | 262008114 7 37429730.6 Prob > F = 0.0000
Residual | 314788844 61 5160472.86 R-squared = 0.4542
-------------+---------------------------------- Adj R-squared = 0.3916
Total | 576796959 68 8482308.22 Root MSE = 2271.7
------------------------------------------------------------------------------
price | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
weight | 5.186695 1.163383 4.46 0.000 2.860367 7.513022
length | -124.1544 40.07637 -3.10 0.003 -204.292 -44.01671
mpg | -126.8367 84.49819 -1.50 0.138 -295.8012 42.12791
|
rep78 |
2 | 1137.284 1803.332 0.63 0.531 -2468.701 4743.269
3 | 1254.642 1661.545 0.76 0.453 -2067.823 4577.108
4 | 2267.188 1698.018 1.34 0.187 -1128.208 5662.584
5 | 3850.759 1787.272 2.15 0.035 276.8886 7424.63
|
_cons | 14614.49 6155.842 2.37 0.021 2305.125 26923.86
------------------------------------------------------------------------------
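rep78 = 2: holding the car's weight, length, and mileage (mpg) constant, cars with rep78 = 2 are priced 1137.28 units higher than cars with rep78 = 1, the reference category (95% CI: -2468.70, 4743.27). The interval includes 0, so this difference is not statistically significant (P = 0.531).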
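Each rep78 coefficient is a difference from the reference category, so the difference between any two categories is a subtraction of their coefficients. The sketch below copies the four point estimates from the `i.rep78` output above. The dictionary `REP78_COEF` and the helper `contrast` are names introduced here for illustration. The sketch gives point estimates only; a confidence interval for such a contrast needs the coefficient covariances, which are not shown in the output.

```python
# Point estimates copied from the `reg price weight length mpg i.rep78` output.
# Level 1 is the reference category, so its coefficient is 0 by construction.
REP78_COEF = {1: 0.0, 2: 1137.284, 3: 1254.642, 4: 2267.188, 5: 3850.759}

def contrast(level_a, level_b):
    """Estimated price difference, level_a minus level_b, other predictors held equal."""
    return REP78_COEF[level_a] - REP78_COEF[level_b]

print(round(contrast(2, 1), 2))  # 1137.28, the reported rep78 = 2 coefficient
print(round(contrast(5, 2), 2))  # difference between levels 5 and 2
```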
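3 Things to watch for in regression analysis
- Always check how many observations the model actually uses
- Case-wise deletion: an observation enters the regression only if none of the model's variables is missing for it; observations with any missing value are dropped by default. This is why the models that include rep78 use 69 observations rather than 74
- How can \(\beta_0\) be made meaningful?
- Center the predictor, \(y=\beta_0+\beta_1(x-\bar x)\), so that \(\beta_0\) is the expected value of y at the mean of x
- Aim for \(\beta_0\geq 0\), since an outcome such as price cannot be negative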
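The two points in the list above, case-wise deletion and centering, can be sketched together in a few lines of pure Python. The helpers `complete_cases` and `fit_simple` and the five data rows are hypothetical and chosen for illustration; Stata's `regress` drops incomplete observations on its own.

```python
def complete_cases(rows):
    """Keep only rows in which no value is missing (None)."""
    return [r for r in rows if all(v is not None for v in r)]

def fit_simple(x, y):
    """Closed-form least squares for y = beta0 + beta1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = s_xy / s_xx
    return y_bar - beta1 * x_bar, beta1

# Hypothetical (x, y) pairs; the third row has a missing x.
data = [(1.0, 3.0), (2.0, 5.0), (None, 6.0), (3.0, 7.0), (4.0, 9.0)]
kept = complete_cases(data)
print(len(data), len(kept))  # 5 rows supplied, 4 rows used

x = [r[0] for r in kept]
y = [r[1] for r in kept]
y_bar = sum(y) / len(y)

# Centering: subtract the mean of x before fitting.
x_bar = sum(x) / len(x)
x_centered = [xi - x_bar for xi in x]

b0_raw, b1_raw = fit_simple(x, y)
b0_centered, b1_centered = fit_simple(x_centered, y)
print(b0_raw, b0_centered, y_bar)
```

Centering leaves the slope unchanged and moves the intercept to the mean of y, that is, the fitted value for an observation with an average x.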