13 - Binary Logistic Regression

Logistic regression for binary outcomes

Author

Simon Zhou

Published

May 7, 2025

import stata_setup
stata_setup.config('C:/Program Files/Stata18', 'mp', splash=False)

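1 When to use logistic regression

When the outcome incidence exceeds 15%, the odds ratio (OR) from logistic regression overstates the relative risk (RR): the OR lies farther from 1 than the RR does.

  • Traditional logistic regression (yields an OR)
  • Mantel-Haenszel (yields an RR)
  • Poisson regression with a robust variance estimate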
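To see why the outcome incidence matters, the following minimal sketch holds the relative risk fixed and computes the odds ratio it implies at several baseline risks. The RR of 2 and the baseline risks are illustrative assumptions, not estimates from any dataset.

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)


def implied_odds_ratio(p0, rr):
    """Odds ratio implied by a baseline risk p0 and a relative risk rr."""
    p1 = p0 * rr  # risk in the exposed group
    if not (0 < p0 < 1 and 0 < p1 < 1):
        raise ValueError("both risks must lie strictly between 0 and 1")
    return odds(p1) / odds(p0)


RR = 2.0  # hypothetical relative risk, for illustration only
for p0 in (0.01, 0.05, 0.15, 0.30):  # hypothetical baseline risks
    print(f"baseline risk {p0:.2f}: RR = {RR:.2f}, OR = {implied_odds_ratio(p0, RR):.2f}")
```

With a rare outcome the OR stays close to the RR; as the baseline risk grows, the OR drifts farther from 1.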

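1.1 Newer methods

  • Zhang and Yu (1998), What's the Relative Risk?

\[RR=\frac{OR}{(1-P_0)+(P_0\times OR)}\]

where \(P_0\) is the incidence of the outcome in the unexposed group.

  • McNutt et al. (2003), Estimating the Relative Risk in Cohort Studies and Clinical Trials of Common Outcomes

  • Gold standard: log-binomial regression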
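The Zhang and Yu conversion can be sketched in a few lines. The function name `zhang_yu_rr` and the input values are hypothetical; the formula is an exact identity for a single unadjusted 2×2 table and only an approximation when applied to an adjusted OR.

```python
def zhang_yu_rr(odds_ratio, p0):
    """Approximate the relative risk from an odds ratio.

    p0 is the outcome incidence in the unexposed group.
    """
    if odds_ratio <= 0:
        raise ValueError("odds_ratio must be positive")
    if not 0 <= p0 <= 1:
        raise ValueError("p0 must lie between 0 and 1")
    return odds_ratio / ((1 - p0) + p0 * odds_ratio)


OR = 2.0  # hypothetical odds ratio, for illustration only
for p0 in (0.05, 0.15, 0.30):  # hypothetical baseline incidences
    print(f"P0 = {p0:.2f}: OR = {OR:.2f} -> RR = {zhang_yu_rr(OR, p0):.2f}")
```

The higher the baseline incidence, the more the converted RR shrinks toward 1 relative to the OR.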

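2 Assumptions of the binary logistic model

  1. Assumption 1: The dependent variable (outcome) is binary.
  2. Assumption 2: There is at least one independent variable, which may be continuous or categorical.
  3. Assumption 3: Observations are independent of one another. The categories of every categorical variable (dependent and independent alike) must be exhaustive and mutually exclusive.
  4. Assumption 4: The minimum sample size is 15 times the number of independent variables, although some researchers recommend 50 times.
  5. Assumption 5: Each continuous independent variable is linearly related to the logit of the dependent variable.
  6. Assumption 6: There is no multicollinearity among the independent variables.
  7. Assumption 7: There are no pronounced outliers, high-leverage points, or highly influential points.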
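The sample-size rule in Assumption 4 is simple arithmetic. The sketch below applies it to the 189 observations of the lbw dataset used later in this chapter. How to count predictors is an assumption here: 3 if age, smoke, and race each count once, 4 if the two race indicator terms are counted separately.

```python
def sample_size_check(n_obs, n_predictors, multiplier):
    """Return (required sample size, whether n_obs meets it)."""
    required = multiplier * n_predictors
    return required, n_obs >= required


N_OBS = 189  # observations in the lbw dataset

for n_predictors in (3, 4):  # assumed ways of counting the predictors
    for multiplier in (15, 50):  # the two rules of thumb
        required, ok = sample_size_check(N_OBS, n_predictors, multiplier)
        verdict = "met" if ok else "not met"
        print(f"{n_predictors} predictors x {multiplier}: need {required}, have {N_OBS} -> {verdict}")
```

The stricter 50-fold rule is sensitive to how predictors are counted, which is worth stating explicitly when reporting it.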

3 Requirements for logistic regression

  1. Y is a binary variable.
  2. The incidence of Y is below 15%.

4 Loading the data

The variable low is the outcome event. The goal is to find which factors are associated with low birthweight in the child.

%%stata
webuse lbw,clear
(Hosmer & Lemeshow data)
%%stata
tab low

Birthweight |
     <2500g |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        130       68.78       68.78
          1 |         59       31.22      100.00
------------+-----------------------------------
      Total |        189      100.00

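Disclaimer: In this dataset the outcome incidence is far above 15%, so a log-binomial model should be used. Logistic regression is used here only to demonstrate the Stata workflow and to set up the log-binomial model in the next section.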
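The incidence check behind this disclaimer can be reproduced from the `tab low` counts. This is a minimal sketch; the variable names are illustrative.

```python
# Frequencies copied from the `tab low` output above
counts = {0: 130, 1: 59}

THRESHOLD = 0.15  # incidence limit used in this chapter

total = sum(counts.values())
incidence = counts[1] / total

print(f"incidence of low = {counts[1]}/{total} = {incidence:.2%}")
print("above the 15% threshold" if incidence > THRESHOLD else "within the 15% threshold")
```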

5 Logistic regression

Syntax:

logistic y x1 x2 x3 ...

5.1 Model 1: \(\operatorname{logit}(P(low=1))=\beta_0+\beta_1 age\)

%%stata
logistic low age

Logistic regression                                     Number of obs =    189
                                                        LR chi2(1)    =   2.76
                                                        Prob > chi2   = 0.0966
Log likelihood = -115.95598                             Pseudo R2     = 0.0118

------------------------------------------------------------------------------
         low | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .9501333   .0299423    -1.62   0.105     .8932232    1.010669
       _cons |      1.469   1.075492     0.53   0.599     .3498129    6.168901
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

\(\beta_1\) (reported as an odds ratio): for each additional year of maternal age, the odds of low birthweight are multiplied by 0.95 (95% CI: 0.89, 1.01).
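The z statistic and confidence interval in the age row can be recovered from the reported odds ratio and its standard error. The sketch below assumes the usual approach: work on the log-odds scale, obtain the coefficient's standard error by the delta method (SE of the OR divided by the OR), and exponentiate the interval limits. The function name `or_summary` is illustrative.

```python
import math
from statistics import NormalDist


def or_summary(odds_ratio, se_or, level=0.95):
    """Recover z, p-value, and CI for an odds ratio from its standard error."""
    b = math.log(odds_ratio)      # coefficient on the log-odds scale
    se_b = se_or / odds_ratio     # delta method: SE(OR) = OR * SE(b)
    z = b / se_b
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    lower = math.exp(b - z_crit * se_b)
    upper = math.exp(b + z_crit * se_b)
    return z, p, lower, upper


# Age row of Model 1: odds ratio and standard error from the output above
z, p, lower, upper = or_summary(0.9501333, 0.0299423)
print(f"z = {z:.2f}, p = {p:.3f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

Because the interval is built on the log scale, it is not symmetric around the odds ratio itself.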

5.2 Model 2: \(\operatorname{logit}(P(low=1))=\beta_0+\beta_1 age+\beta_2 smoke\)

%%stata
logistic low age i.smoke

Logistic regression                                     Number of obs =    189
                                                        LR chi2(2)    =   7.40
                                                        Prob > chi2   = 0.0248
Log likelihood = -113.63815                             Pseudo R2     = 0.0315

------------------------------------------------------------------------------
         low | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .9514394   .0304194    -1.56   0.119     .8936482    1.012968
             |
       smoke |
     Smoker  |   1.997405    .642777     2.15   0.032     1.063027    3.753081
       _cons |   1.062798   .8048781     0.08   0.936     .2408901    4.689025
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

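\(\beta_1\) (reported as an odds ratio): after adjusting for the mother's smoking status, each additional year of maternal age multiplies the odds of low birthweight by 0.95 (95% CI: 0.89, 1.01).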
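Because `_cons` is the baseline odds and every other row is a multiplicative odds ratio, the Model 2 output can be turned into predicted probabilities by hand. The sketch below does this for a mother aged 25; that age is a hypothetical choice for illustration, and the estimates are copied from the output above.

```python
# Estimates copied from the Model 2 output above
BASELINE_ODDS = 1.062798   # _cons: odds at age 0 for a nonsmoker
OR_AGE = 0.9514394         # odds ratio per additional year of age
OR_SMOKER = 1.997405       # odds ratio for smokers vs nonsmokers


def predicted_probability(age, smoker):
    """Predicted P(low = 1) from the Model 2 odds ratios."""
    odds = BASELINE_ODDS * OR_AGE ** age
    if smoker:
        odds *= OR_SMOKER
    return odds / (1 + odds)  # convert odds back to a probability


AGE = 25  # hypothetical maternal age, for illustration only
p_nonsmoker = predicted_probability(AGE, smoker=False)
p_smoker = predicted_probability(AGE, smoker=True)
print(f"age {AGE}, nonsmoker: P(low) = {p_nonsmoker:.3f}")
print(f"age {AGE}, smoker:    P(low) = {p_smoker:.3f}")
print(f"ratio of predicted risks = {p_smoker / p_nonsmoker:.2f} vs OR = {OR_SMOKER:.2f}")
```

The ratio of the two predicted risks is smaller than the smoker odds ratio, which is the OR-versus-RR gap discussed at the start of this chapter.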

5.3 Model 3: \(\operatorname{logit}(P(low=1))=\beta_0+\beta_1 age+\beta_2 smoke+\beta_3 race\)

%%stata
logistic low age i.smoke i.race

Logistic regression                                     Number of obs =    189
                                                        LR chi2(4)    =  15.81
                                                        Prob > chi2   = 0.0033
Log likelihood = -109.4311                              Pseudo R2     = 0.0674

------------------------------------------------------------------------------
         low | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .9657186   .0322573    -1.04   0.296     .9045206    1.031057
             |
       smoke |
     Smoker  |    3.00582   1.118001     2.96   0.003     1.449982    6.231081
             |
        race |
      Black  |   2.749483   1.356659     2.05   0.040     1.045318    7.231924
      Other  |   2.876948   1.167921     2.60   0.009     1.298314    6.375062
             |
       _cons |    .365111   .3146026    -1.17   0.242     .0674491    1.976395
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

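\(\beta_1\) (reported as an odds ratio): after adjusting for the mother's smoking status and race, each additional year of maternal age multiplies the odds of low birthweight by 0.97 (95% CI: 0.90, 1.03).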
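Since the three models are nested, their log likelihoods can be compared with a likelihood-ratio test. This is a minimal sketch using only the standard library, so it covers just the 1 and 2 degree-of-freedom cases needed here. The log likelihoods are copied from the outputs above; the function name `lr_test` is illustrative.

```python
import math


def lr_test(ll_restricted, ll_full, df):
    """Likelihood-ratio statistic and chi-squared p-value for df = 1 or 2."""
    stat = 2 * (ll_full - ll_restricted)
    if df == 1:
        p = math.erfc(math.sqrt(stat / 2))  # chi2(1) upper tail
    elif df == 2:
        p = math.exp(-stat / 2)             # chi2(2) upper tail
    else:
        raise ValueError("this sketch only handles df = 1 or 2")
    return stat, p


LL_MODEL1 = -115.95598  # low ~ age
LL_MODEL2 = -113.63815  # low ~ age + smoke
LL_MODEL3 = -109.4311   # low ~ age + smoke + race

stat_12, p_12 = lr_test(LL_MODEL1, LL_MODEL2, df=1)  # adding smoke: 1 term
stat_23, p_23 = lr_test(LL_MODEL2, LL_MODEL3, df=2)  # adding race: 2 terms
print(f"Model 2 vs Model 1: LR chi2(1) = {stat_12:.2f}, p = {p_12:.4f}")
print(f"Model 3 vs Model 2: LR chi2(2) = {stat_23:.2f}, p = {p_23:.4f}")
```

In Stata the same comparison is typically run with `estimates store` after each model, followed by `lrtest`.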