import stata_setup
stata_setup.config('C:/Program Files/Stata18', 'mp', splash=False)
13 - Binary logistic regression
Logistic regression for binary outcomes
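1 When to use logistic regression
When the outcome incidence is above 15%, the OR from logistic regression overestimates the actual RR (more precisely, the OR lies farther from 1 than the RR). Methods to choose from:
- Conventional logistic regression (yields an OR)
- Mantel-Haenszel (yields an RR)
- Poisson regression with robust variance estimate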
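To see why the 15% threshold matters, the minimal sketch below holds the RR fixed at 2 and computes the OR implied at several baseline incidences. The function name `odds_ratio_from_risks`, the RR of 2, and the incidences are illustrative assumptions, not values from any dataset.

```python
def odds_ratio_from_risks(p_exposed: float, p_unexposed: float) -> float:
    """Odds ratio implied by the outcome risks in the exposed and unexposed groups."""
    odds_exposed = p_exposed / (1 - p_exposed)
    odds_unexposed = p_unexposed / (1 - p_unexposed)
    return odds_exposed / odds_unexposed


TRUE_RR = 2.0  # assumed relative risk, held constant for illustration

for p0 in (0.01, 0.05, 0.15, 0.30):  # hypothetical baseline incidences
    p1 = TRUE_RR * p0  # risk in the exposed group
    or_value = odds_ratio_from_risks(p1, p0)
    print(f"P0 = {p0:.2f}  RR = {TRUE_RR:.2f}  OR = {or_value:.2f}")
```

At a 1% baseline incidence the OR is about 2.02, close to the RR. At 30% it reaches 3.5 although the RR has not changed.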
1.1 Newer methods
- Zhang and Yu (1998), "What's the Relative Risk?"
\[RR=\frac{OR}{(1-P_0)+(P_0\times OR)}\]
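where \(P_0\) is the incidence of the outcome in the unexposed group.
- McNutt et al. (2003), "Estimating the Relative Risk in Cohort Studies and Clinical Trials of Common Outcomes"
Gold standard: log-binomial regression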
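The Zhang and Yu conversion above is simple arithmetic, so it can be sketched in a few lines. The function name `zhang_yu_rr` and the example inputs are hypothetical; the sketch only evaluates the formula and makes no claim about how well the correction performs in adjusted models.

```python
def zhang_yu_rr(odds_ratio: float, p0: float) -> float:
    """Approximate RR from an OR; p0 is the outcome incidence in the unexposed group."""
    if not 0 <= p0 < 1:
        raise ValueError("p0 must be in [0, 1)")
    return odds_ratio / ((1 - p0) + p0 * odds_ratio)


# Hypothetical inputs: an OR of 2 at a rare and at a common baseline incidence.
rr_rare = zhang_yu_rr(2.0, 0.05)
rr_common = zhang_yu_rr(2.0, 0.40)
print(f"P0 = 0.05: RR = {rr_rare:.3f}")
print(f"P0 = 0.40: RR = {rr_common:.3f}")
```

The same OR corresponds to a much smaller RR when the baseline incidence is high, which is the reason for the conversion.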
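2 Assumptions of the binary logistic model
- Assumption 1: The dependent variable (outcome) is binary.
- Assumption 2: There is at least one independent variable, which may be continuous or categorical.
- Assumption 3: Observations are independent of one another. The categories of every categorical variable (dependent and independent) must be exhaustive and mutually exclusive.
- Assumption 4: The minimum sample size is 15 times the number of independent variables, although some researchers argue it should be 50 times.
- Assumption 5: Each continuous independent variable has a linear relationship with the logit of the dependent variable.
- Assumption 6: There is no multicollinearity among the independent variables.
- Assumption 7: There are no notable outliers, high-leverage points, or highly influential points.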
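The sample-size rule in Assumption 4 can be checked with a short calculation. The sketch below assumes the 189 observations of the lbw dataset used in this document and three independent variables (age, smoke, race); the helper name `min_sample_size` is hypothetical.

```python
def min_sample_size(n_predictors: int, multiplier: int = 15) -> int:
    """Minimum sample size under an 'N times the number of predictors' rule."""
    return n_predictors * multiplier


n_obs = 189        # observations in the lbw dataset
n_predictors = 3   # age, smoke, race (assumed largest model)

for multiplier in (15, 50):
    required = min_sample_size(n_predictors, multiplier)
    status = "met" if n_obs >= required else "not met"
    print(f"{multiplier}x rule: need {required}, have {n_obs} ({status})")
```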
3 Requirements for running logistic regression
- Y is a binary variable
- The incidence of Y is <15%
4 Loading the data
The variable low is our outcome event. We want to see which factors are associated with low birthweight in the child.
%%stata
webuse lbw,clear
(Hosmer & Lemeshow data)
%%stata
tab low
Birthweight |
<2500g | Freq. Percent Cum.
------------+-----------------------------------
0 | 130 68.78 68.78
1 | 59 31.22 100.00
------------+-----------------------------------
Total | 189 100.00
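Disclaimer: In this section's dataset the outcome incidence is far above 15%, so a log-binomial model should be used for the analysis. Logistic regression is used here only to show how to run it in Stata and to set up the log-binomial model in the next section.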
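The incidence check behind this disclaimer can be reproduced from the `tab low` frequencies. This is a minimal sketch: the counts are copied from the table above, and the name `THRESHOLD` is an illustrative assumption for the 15% rule.

```python
THRESHOLD = 0.15  # incidence above which the OR no longer approximates the RR

counts = {0: 130, 1: 59}  # frequencies from the `tab low` output
incidence = counts[1] / sum(counts.values())

print(f"Incidence of low birthweight: {incidence:.2%}")
if incidence < THRESHOLD:
    print("Outcome is rare enough for the OR to approximate the RR.")
else:
    print("Outcome is common; prefer a model that estimates the RR directly.")
```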
5 Logistic regression
Syntax:
logistic y x1 x2 x3 ...
5.1 Model 1: \(\operatorname{logit}(P(low=1))=\beta_0+\beta_1 age\)
%%stata
logistic low age
Logistic regression Number of obs = 189
LR chi2(1) = 2.76
Prob > chi2 = 0.0966
Log likelihood = -115.95598 Pseudo R2 = 0.0118
------------------------------------------------------------------------------
low | Odds ratio Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
age | .9501333 .0299423 -1.62 0.105 .8932232 1.010669
_cons | 1.469 1.075492 0.53 0.599 .3498129 6.168901
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
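\(\beta_1\) (reported as the odds ratio \(e^{\beta_1}\)): for each one-year increase in maternal age, the odds of low birthweight are multiplied by 0.95 (95% CI: 0.89, 1.01).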
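The z statistic, p-value, and confidence interval in the age row can be recovered from the odds ratio and its standard error. The sketch below assumes the reported standard error is the delta-method value (odds ratio times the coefficient's standard error) and that the interval is built on the log-odds scale and then exponentiated. The 10-year odds ratio at the end is an added illustration.

```python
import math
from statistics import NormalDist

odds_ratio = 0.9501333  # age row, "Odds ratio" column
se_or = 0.0299423       # age row, "Std. err." column

beta = math.log(odds_ratio)   # coefficient on the log-odds scale
se_beta = se_or / odds_ratio  # assumed delta-method back-transformation
z = beta / se_beta
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

z_crit = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value
ci_low = math.exp(beta - z_crit * se_beta)
ci_high = math.exp(beta + z_crit * se_beta)

or_10_years = odds_ratio ** 10  # odds ratio for a 10-year age difference

print(f"z = {z:.2f}, p = {p_value:.3f}")
print(f"95% CI: ({ci_low:.4f}, {ci_high:.4f})")
print(f"OR per 10 years: {or_10_years:.2f}")
```

Because the model is linear on the log-odds scale, the odds ratio for a 10-year difference is the one-year odds ratio raised to the 10th power.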
5.2 Model 2: \(\operatorname{logit}(P(low=1))=\beta_0+\beta_1 age+\beta_2 smoke\)
%%stata
logistic low age i.smoke
Logistic regression Number of obs = 189
LR chi2(2) = 7.40
Prob > chi2 = 0.0248
Log likelihood = -113.63815 Pseudo R2 = 0.0315
------------------------------------------------------------------------------
low | Odds ratio Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
age | .9514394 .0304194 -1.56 0.119 .8936482 1.012968
|
smoke |
Smoker | 1.997405 .642777 2.15 0.032 1.063027 3.753081
_cons | 1.062798 .8048781 0.08 0.936 .2408901 4.689025
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
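\(\beta_1\) (reported as the odds ratio \(e^{\beta_1}\)): after controlling for maternal smoking status, for each one-year increase in maternal age the odds of low birthweight are multiplied by 0.95 (95% CI: 0.89, 1.01).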
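One way to judge whether adding smoke improves on Model 1 is a likelihood-ratio test using the two log likelihoods printed in the outputs. This is a sketch under the assumption that the models are nested and fitted to the same 189 observations; it is not part of the original analysis.

```python
import math

ll_model1 = -115.95598  # log likelihood of Model 1 (age only)
ll_model2 = -113.63815  # log likelihood of Model 2 (age + smoke)

lr_stat = 2 * (ll_model2 - ll_model1)

# One added parameter gives 1 degree of freedom. For df = 1 the chi-square
# upper-tail probability equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr_stat / 2))

print(f"LR chi2(1) = {lr_stat:.2f}, p = {p_value:.4f}")
```

The statistic equals the difference between the two reported LR chi2 values (7.40 and 2.76), and the resulting p-value is close to the Wald p-value shown for smoke.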
5.3 Model 3: \(\operatorname{logit}(P(low=1))=\beta_0+\beta_1 age+\beta_2 smoke+\beta_3 race\)
%%stata
logistic low age i.smoke i.race
Logistic regression Number of obs = 189
LR chi2(4) = 15.81
Prob > chi2 = 0.0033
Log likelihood = -109.4311 Pseudo R2 = 0.0674
------------------------------------------------------------------------------
low | Odds ratio Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
age | .9657186 .0322573 -1.04 0.296 .9045206 1.031057
|
smoke |
Smoker | 3.00582 1.118001 2.96 0.003 1.449982 6.231081
|
race |
Black | 2.749483 1.356659 2.05 0.040 1.045318 7.231924
Other | 2.876948 1.167921 2.60 0.009 1.298314 6.375062
|
_cons | .365111 .3146026 -1.17 0.242 .0674491 1.976395
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
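\(\beta_1\) (reported as the odds ratio \(e^{\beta_1}\)): after controlling for maternal smoking status and race, for each one-year increase in maternal age the odds of low birthweight are multiplied by 0.97 (95% CI: 0.90, 1.03).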
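The odds ratios in this output can be combined with the baseline odds (`_cons`) to get predicted probabilities. The sketch below multiplies the baseline odds by the relevant odds ratios and converts the result to a probability. The helper `predicted_probability` and the covariate profiles (a 25-year-old mother) are hypothetical illustrations.

```python
def predicted_probability(baseline_odds: float, *odds_ratios: float) -> float:
    """Multiply baseline odds by each odds ratio, then convert odds to probability."""
    odds = baseline_odds
    for ratio in odds_ratios:
        odds *= ratio
    return odds / (1 + odds)


# Values copied from the Model 3 output above.
BASELINE_ODDS = 0.365111  # _cons: odds at age 0, non-smoker, reference race category
OR_AGE = 0.9657186        # per one-year increase in age
OR_SMOKER = 3.00582
OR_BLACK = 2.749483

age = 25  # hypothetical maternal age

p_reference = predicted_probability(BASELINE_ODDS, OR_AGE ** age)
p_smoker_black = predicted_probability(BASELINE_ODDS, OR_AGE ** age, OR_SMOKER, OR_BLACK)

print(f"Non-smoker, reference race, age {age}: {p_reference:.3f}")
print(f"Smoker, Black, age {age}: {p_smoker_black:.3f}")
```

Odds ratios multiply on the odds scale, so the probability has to be computed from the combined odds rather than by scaling a probability directly.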