12 - Multiple Linear Regression

Author

Simon Zhou

Published

May 7, 2025

# Point stata_setup at the local Stata installation (path and edition are machine-specific)
import stata_setup
stata_setup.config('C:/Program Files/Stata18', 'mp', splash=False)

1 Importing the data

%%stata
sysuse auto.dta,clear
(1978 automobile data)

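2 Simple vs. multiple linear regression

Simple linear regression has a single independent variable and takes the form \(y=\beta_0+\beta_1 x\). For example: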

%%stata
reg price weight

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(1, 72)        =     29.42
       Model |   184233937         1   184233937   Prob > F        =    0.0000
    Residual |   450831459        72  6261548.04   R-squared       =    0.2901
-------------+----------------------------------   Adj R-squared   =    0.2802
       Total |   635065396        73  8699525.97   Root MSE        =    2502.3

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      weight |   2.044063   .3768341     5.42   0.000     1.292857    2.795268
       _cons |  -6.707353    1174.43    -0.01   0.995     -2347.89    2334.475
------------------------------------------------------------------------------

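\(\beta_1\) (weight): for each one-unit increase in a car's weight (1 lb in auto.dta), the price increases by 2.04 units on average (95% CI: 1.29, 2.80).

_cons is \(\beta_0\), the intercept. Interpret its sign and value with care: it is the predicted price at weight = 0, which no real car has, so the negative estimate (-6.71) has no practical meaning.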
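The summary statistics in the output above can be recomputed from the ANOVA table as a sanity check. The following is a minimal sketch in plain Python; the variable names are illustrative, and the numbers are copied from the `reg price weight` output.

```python
import math

# Values copied from the `reg price weight` output above
ss_model, df_model = 184233937, 1
ss_resid, df_resid = 450831459, 72
ss_total, df_total = 635065396, 73
coef_weight, se_weight = 2.044063, 0.3768341

ms_model = ss_model / df_model   # mean square of the model
ms_resid = ss_resid / df_resid   # mean square of the residuals

r_squared = ss_model / ss_total                            # share of variance explained
adj_r_squared = 1 - (1 - r_squared) * df_total / df_resid  # adjusted for the number of predictors
f_stat = ms_model / ms_resid                               # overall F statistic
root_mse = math.sqrt(ms_resid)                             # residual standard deviation
t_weight = coef_weight / se_weight                         # t statistic for weight

print(f"R-squared     = {r_squared:.4f}")
print(f"Adj R-squared = {adj_r_squared:.4f}")
print(f"F(1, 72)      = {f_stat:.2f}")
print(f"Root MSE      = {root_mse:.1f}")
print(f"t (weight)    = {t_weight:.2f}")
```

The printed values should agree with the Stata table up to rounding.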

A multiple linear regression model has more than one independent variable and takes the form \(y=\beta_0+\beta_1 x_1 +\beta_2 x_2 + \dots\)

%%stata
reg price weight length

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     18.91
       Model |   220725280         2   110362640   Prob > F        =    0.0000
    Residual |   414340116        71  5835776.28   R-squared       =    0.3476
-------------+----------------------------------   Adj R-squared   =    0.3292
       Total |   635065396        73  8699525.97   Root MSE        =    2415.7

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      weight |   4.699065   1.122339     4.19   0.000     2.461184    6.936946
      length |  -97.96031    39.1746    -2.50   0.015    -176.0722   -19.84838
       _cons |   10386.54   4308.159     2.41   0.019     1796.316    18976.76
------------------------------------------------------------------------------

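\(\beta_2\) (length): holding weight constant, each one-unit increase in a car's length (1 in. in auto.dta) is associated with a 97.96-unit decrease in price on average (95% CI: -176.07, -19.85).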
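To make "holding weight constant" concrete, the sketch below plugs the fitted coefficients into the regression equation for two hypothetical cars that differ only in length. The function name and the example cars are assumptions for illustration; the coefficients are copied from the `reg price weight length` output.

```python
# Coefficients copied from the `reg price weight length` output above
B0, B_WEIGHT, B_LENGTH = 10386.54, 4.699065, -97.96031


def predict_price(weight, length):
    """Fitted price for a car with the given weight and length."""
    return B0 + B_WEIGHT * weight + B_LENGTH * length


# Two hypothetical cars with the same weight, one unit apart in length
base = predict_price(weight=3000, length=190)
longer = predict_price(weight=3000, length=191)

print(f"fitted price at length 190: {base:.2f}")
print(f"fitted price at length 191: {longer:.2f}")
print(f"difference:                 {longer - base:.2f}")
```

Because weight is identical for both cars, the difference in fitted price equals the length coefficient, which is exactly what \(\beta_2\) measures.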

%%stata
reg price weight length mpg

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(3, 70)        =     12.98
       Model |   226957412         3  75652470.6   Prob > F        =    0.0000
    Residual |   408107984        70  5830114.06   R-squared       =    0.3574
-------------+----------------------------------   Adj R-squared   =    0.3298
       Total |   635065396        73  8699525.97   Root MSE        =    2414.6

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      weight |   4.364798   1.167455     3.74   0.000     2.036383    6.693213
      length |  -104.8682   39.72154    -2.64   0.010    -184.0903   -25.64607
         mpg |  -86.78928   83.94335    -1.03   0.305     -254.209    80.63046
       _cons |   14542.43   5890.632     2.47   0.016      2793.94    26290.93
------------------------------------------------------------------------------

%%stata
reg price weight length mpg rep78

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(4, 64)        =     12.68
       Model |   255066807         4  63766701.9   Prob > F        =    0.0000
    Residual |   321730151        64  5027033.62   R-squared       =    0.4422
-------------+----------------------------------   Adj R-squared   =    0.4074
       Total |   576796959        68  8482308.22   Root MSE        =    2242.1

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      weight |   4.959534   1.119624     4.43   0.000     2.722827    7.196241
      length |  -115.0177   38.56456    -2.98   0.004    -192.0592   -37.97612
         mpg |  -106.7122   81.15836    -1.31   0.193    -268.8446    55.42027
       rep78 |   910.9859   304.5274     2.99   0.004     302.6226    1519.349
       _cons |   11934.51   5774.178     2.07   0.043     399.2604    23469.75
------------------------------------------------------------------------------

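\(\beta_4\) (rep78): holding weight, length, and mileage (mpg) constant, each one-unit increase in a car's repair record (rep78) is associated with a 910.99-unit increase in price on average (95% CI: 302.62, 1519.35). Note that the number of observations drops from 74 to 69 in this model.

Here rep78 is treated as a continuous variable. What happens if it is treated instead as a multi-category variable (dummy variables)?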

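2.1 What are dummy variables, and why use them?

When an independent variable X is a multi-category variable (for example occupation, education level, blood type, or disease severity), a single regression coefficient is a poor way to describe the differences between categories and their effect on the dependent variable Y. A single coefficient forces every one-step change between adjacent categories to have the same effect on Y, which rarely holds for unordered or unevenly spaced categories. Dummy variables instead give each category, except one reference category, its own 0/1 indicator and its own coefficient.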
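The coding itself is simple: a variable with k categories becomes k-1 indicator columns, with the omitted category serving as the baseline. The sketch below illustrates the idea in plain Python; `dummy_code` is a hypothetical helper written for this example, not a Stata or library function, and the input values are made up.

```python
def dummy_code(values, reference=None):
    """Expand a categorical variable into k-1 indicator (0/1) columns.

    The reference level gets no column; it is the baseline the other
    levels are compared against. Missing values (None) stay missing.
    """
    levels = sorted({v for v in values if v is not None})
    if reference is None:
        reference = levels[0]  # use the lowest level as the baseline
    columns = {}
    for level in levels:
        if level == reference:
            continue
        columns[level] = [None if v is None else int(v == level) for v in values]
    return columns


# Hypothetical category values, for illustration only
rep = [1, 3, 5, 3, None, 2, 4]
for level, column in dummy_code(rep).items():
    print(level, column)
```

With five levels, four indicator columns are produced, and an observation in the baseline category has 0 in all of them.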

2.2 How to set up dummy variables

  • Prefix the multi-category variable with “i.” to tell Stata that it is not a continuous variable
  • For example:

%%stata
reg price weight length mpg i.rep78

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(7, 61)        =      7.25
       Model |   262008114         7  37429730.6   Prob > F        =    0.0000
    Residual |   314788844        61  5160472.86   R-squared       =    0.4542
-------------+----------------------------------   Adj R-squared   =    0.3916
       Total |   576796959        68  8482308.22   Root MSE        =    2271.7

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      weight |   5.186695   1.163383     4.46   0.000     2.860367    7.513022
      length |  -124.1544   40.07637    -3.10   0.003     -204.292   -44.01671
         mpg |  -126.8367   84.49819    -1.50   0.138    -295.8012    42.12791
             |
       rep78 |
          2  |   1137.284   1803.332     0.63   0.531    -2468.701    4743.269
          3  |   1254.642   1661.545     0.76   0.453    -2067.823    4577.108
          4  |   2267.188   1698.018     1.34   0.187    -1128.208    5662.584
          5  |   3850.759   1787.272     2.15   0.035     276.8886     7424.63
             |
       _cons |   14614.49   6155.842     2.37   0.021     2305.125    26923.86
------------------------------------------------------------------------------

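rep78 = 2: holding weight, length, and mileage (mpg) constant, cars with a repair record of 2 are priced 1137.28 units higher on average than cars with a repair record of 1, the reference category (95% CI: -2468.70, 4743.27). The interval includes 0, so this difference is not statistically significant (p = 0.531).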
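The two ways of entering rep78 imply different price gaps between categories. The sketch below compares them using coefficients copied from the two outputs above; the helper names are assumptions for illustration. Each contrast holds the other predictors constant within its own model, and the two models also differ in their remaining coefficients.

```python
# Coefficients copied from the two regression outputs above
slope_continuous = 910.9859  # rep78 entered as a continuous variable
factor_coefs = {1: 0.0, 2: 1137.284, 3: 1254.642, 4: 2267.188, 5: 3850.759}  # i.rep78, level 1 is the base


def contrast_continuous(a, b):
    """Price gap between levels a and b when rep78 is continuous."""
    return slope_continuous * (a - b)


def contrast_factor(a, b):
    """Price gap between levels a and b when rep78 is a factor variable."""
    return factor_coefs[a] - factor_coefs[b]


# Step between adjacent levels under each coding
for level in range(2, 6):
    print(
        f"{level} vs {level - 1}: "
        f"continuous {contrast_continuous(level, level - 1):9.2f}, "
        f"factor {contrast_factor(level, level - 1):9.2f}"
    )
```

The continuous coding assumes the same step between every pair of adjacent levels, while the factor coding lets each step differ. Contrasts between two non-base levels are differences of their coefficients.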

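3 Things to watch for in regression analysis

  • Always check the number of observations included in the model
    • Case-wise deletion: an observation enters the regression only if none of the model variables is missing; observations with any missing value are dropped by default. Above, adding rep78 reduced the sample from 74 to 69.
  • How can \(\beta_0\) be made meaningful?
    • Center the predictor: fit \(y=\beta_0+\beta_1(x-\bar x)\), so that \(\beta_0\) is the predicted \(y\) at \(x=\bar x\)
    • For a non-negative outcome such as price, this gives \(\beta_0\geq 0\)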
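Both points can be illustrated with a small sketch in plain Python. The data are hypothetical and `ols_simple` is a helper written for this example. It first applies case-wise deletion, then fits the same simple regression with the raw and the centered predictor.

```python
def ols_simple(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


# Hypothetical (x, y) pairs, for illustration only; None marks a missing value
rows = [(2.0, 5.1), (3.0, None), (4.0, 9.2), (None, 7.0), (5.0, 10.8), (7.0, 15.1)]

# Case-wise deletion: keep a row only if every model variable is observed
complete = [(x, y) for x, y in rows if x is not None and y is not None]
x = [row[0] for row in complete]
y = [row[1] for row in complete]
print(f"{len(rows)} rows in the data, {len(complete)} used in the model")

# Fit with the raw predictor, then with the predictor centered at its mean
mean_x = sum(x) / len(x)
b0_raw, b1_raw = ols_simple(x, y)
b0_centered, b1_centered = ols_simple([xi - mean_x for xi in x], y)

print(f"raw:      b0 = {b0_raw:.3f}, b1 = {b1_raw:.3f}")
print(f"centered: b0 = {b0_centered:.3f}, b1 = {b1_centered:.3f}")
print(f"mean of y = {sum(y) / len(y):.3f}")
```

Centering leaves the slope unchanged and moves the intercept to the mean of y, which is why the centered \(\beta_0\) has a direct interpretation.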