14-广义线性模型(GLM)

Generalized Linear Model

作者

Simon Zhou

发布于

2025年5月8日

import stata_setup
stata_setup.config('C:/Program Files/Stata18', 'mp', splash=False)
%%stata
sysuse auto.dta,clear
(1978 automobile data)

大于15%会产生Bias的这个问题,只有在我们想用odds去estimaterisk的时候才会发生。

  1. 在cohort study和clinicaltrial中,outcome是结局事件的发生率;
  2. 在cross-sectional中,outcome是结局事件的prevalence;
  3. 在case-control中,我们不报告risk,只报告odds,因此就没有这个问题了。

1 什么是广义线性模型

  • 线性回归 Linear regression:
    • \(exp Y = \beta_0 + \beta_1*X_1+\cdots + \beta_n *X_n\)
  • 广义线性回归 Generalized linear model:
    • \(f(exp Y)= β_0 + β_1*X_1 + \cdots + \beta_n* X_n\)
    • Linear regression is a special case of GLM
    • Same for logistic regression, Poisson regression, Log-Binomial regression
    • Even for Cox regression (Aug.31st & Sep.7th), GEE model (Sep.14th)
  • 总结:在GLM中,转化Y使得转化之后的Y和X们呈线性关系

2 Simple GLM vs. Multiple GLM

2.1 简单广义线性模型

\[f(exp Y)= β_0 + β_1*X_1\]

2.2 多元广义线性模型

\[f(exp Y)= β_0 + β_1*X_1 + \cdots + \beta_n* X_n\]

X变量可以是各种类型的变量(连续、分类)

3 Specify Options

3.1 Family

Y 变量的分布类型

4 linear regression

4.1 Y var

  1. 连续分布(continuous variable)
  2. 如何分布? 高斯分布(Gaussian Distribution)
  3. Model:
    • \[f(exp Y)= β_0 + β_1*X_1 + \cdots + \beta_n* X_n\]
    • Y 怎么转化才和 X 成 linear 关系?
      • 不用转化
      • Link:Identity(恒等)

5 Logistic regression

5.1 Y var

  1. Binary variable
  2. 如何分布? 二项分布(Binomial)
  3. Model:
    • \[ln[P(Y=1)/P(Y=0)]= β_0 + β_1*X_1+\cdots + β_n*X_n\]
    • Exp Y = P(Y=1)
    • Y 怎么转化才和X成linear关系?Link:logit

6 Poisson regression

6.1 Y var

  1. Count variable:整数(1,2,3,.n)
  2. 如何分布? 泊松分布(Poisson)
  3. Model:
    • \[ln[risk of event)]= β_0 + β_1*X_1+\cdots + β_n*X_n\]
    • Y 怎么转化才和X成linear关系?Link:Log

7 Log-binomial Regression

7.1 Y var

  1. Binary variable
  2. 如何分布? 二项分布(Binomial)
  3. Model:
    • \[ln[P(Y=1)]= β_0 + β_1*X_1+\cdots + β_n*X_n\]
    • Exp Y = P(Y=1)
    • Y 怎么转化才和X成linear关系?Link:Log
%%stata
help glm

[R] glm -- Generalized linear models
           (View complete PDF manual entry)


Syntax
------

        glm depvar [indepvars] [if] [in] [weight] [, options]

    options                     Description
    -------------------------------------------------------------------------
    Model
      family(familyname)        distribution of depvar; default is
                                  family(gaussian)
      link(linkname)            link function; default is canonical link for
                                  family() specified

    Model 2
      noconstant                suppress constant term
      exposure(varname)         include ln(varname) in model with coefficient
                                  constrained to 1
      offset(varname)           include varname in model with coefficient
                                  constrained to 1
      constraints(constraints)  apply specified linear constraints
      asis                      retain perfect predictor variables
      mu(varname)               use varname as the initial estimate for the
                                  mean of depvar
      init(varname)             synonym for mu(varname)

    SE/Robust
      vce(vcetype)              vcetype may be oim, robust, cluster clustvar,
                                  eim, opg, bootstrap, jackknife, hac kernel,
                                  jackknife1, or unbiased
      vfactor(#)                multiply variance matrix by scalar #
      disp(#)                   quasilikelihood multiplier
      scale(x2|dev|#)           set the scale parameter

    Reporting
      level(#)                  set confidence level; default is level(95)
      eform                     report exponentiated coefficients
      nocnsreport               do not display constraints
      display_options           control columns and column formats, row
                                  spacing, line width, display of omitted
                                  variables and base and empty cells, and
                                  factor-variable labeling

    Maximization
      ml                        use maximum likelihood optimization; the
                                  default
      irls                      use iterated, reweighted least-squares
                                  optimization of the deviance
      maximize_options          control the maximization process; seldom used
      fisher(#)                 use the Fisher scoring Hessian or expected
                                  information matrix (EIM)
      search                    search for good starting values

      noheader                  suppress header table from above coefficient
                                  table
      notable                   suppress coefficient table
      nodisplay                 suppress the output; iteration log is still
                                  displayed
      collinear                 keep collinear variables
      coeflegend                display legend instead of statistics
    -------------------------------------------------------------------------

    familyname               Description
    -------------------------------------------------------------------------
    gaussian                 Gaussian (normal)
    igaussian                inverse Gaussian
    binomial[varnameN|#N]    Bernoulli/binomial
    poisson                  Poisson
    nbinomial[#k|ml]         negative binomial
    gamma                    gamma
    -------------------------------------------------------------------------

    linkname                 Description
    -------------------------------------------------------------------------
    identity                 identity
    log                      log
    logit                    logit
    probit                   probit
    cloglog                  cloglog
    power #                  power
    opower #                 odds power
    nbinomial                negative binomial
    loglog                   log-log
    logc                     log-complement
    -------------------------------------------------------------------------

    indepvars may contain factor variables; see fvvarlist.
    depvar and indepvars may contain time-series operators; see tsvarlist.
    bayes, bootstrap, by, collect, fmm, fp, jackknife, mfp, mi estimate,
      nestreg, rolling, statsby, stepwise, and svy are allowed; see prefix.
      For more details, see [BAYES] bayes: glm and [FMM] fmm: glm.
    vce(bootstrap), vce(jackknife), and vce(jackknife1) are not allowed with
      the mi estimate prefix.
    Weights are not allowed with the bootstrap prefix.
    aweights are not allowed with the jackknife prefix.
    vce(), vfactor(), disp(), scale(), irls, fisher(), noheader, notable,
      nodisplay, and weights are not allowed with the svy prefix.
    fweights, aweights, iweights, and pweights are allowed; see weight.
    noheader, notable, nodisplay, collinear, and coeflegend do not appear in
      the dialog box.
    See [R] glm postestimation for features available after estimation.


Menu
----

    Statistics > Generalized linear models > Generalized linear models (GLM)


Description
-----------

    glm fits generalized linear models.  It can fit models by using either
    IRLS (maximum quasilikelihood) or Newton-Raphson (maximum likelihood)
    optimization, which is the default.

    See [U] 27 Overview of Stata estimation commands for a description of all
    of Stata's estimation commands, several of which fit models that can also
    be fit using glm.


Links to PDF documentation
--------------------------

        Quick start

        Remarks and examples

        Methods and formulas

    The above sections are not included in this help file.


Options
-------

        +-------+
    ----+ Model +------------------------------------------------------------

    family(familyname) specifies the distribution of depvar; family(gaussian)
        is the default.

    link(linkname) specifies the link function; the default is the canonical
        link for the family() specified (except for family(nbinomial)).

        +---------+
    ----+ Model 2 +----------------------------------------------------------

    noconstant, exposure(varname), offset(varname), constraints(constraints);
        see [R] Estimation options.  constraints(constraints) is not allowed
        with irls.

    asis forces retention of perfect predictor variables and their
        associated, perfectly predicted observations and may produce
        instabilities in maximization; see [R] probit.  This option is
        allowed only with option family(binomial) with a denominator of 1.

    mu(varname) specifies varname as the initial estimate for the mean of 
        depvar.  This option can be useful with models that experience
        convergence difficulties, such as family(binomial) models with power
        or odds-power links.  init(varname) is a synonym.

        +-----------+
    ----+ SE/Robust +--------------------------------------------------------

    vce(vcetype) specifies the type of standard error reported, which
        includes types that are derived from asymptotic theory (oim, opg),
        that are robust to some kinds of misspecification (robust), that
        allow for intragroup correlation (cluster clustvar), and that use
        bootstrap or jackknife methods (bootstrap, jackknife); see [R]
        vce_option.

        In addition to the standard vcetypes, glm allows the following
        alternatives:

        vce(eim) specifies that the EIM estimate of variance be used.

        vce(jackknife1) specifies that the one-step jackknife estimate of
            variance be used.

        vce(hac kernel [#]) specifies that a heteroskedasticity- and
            autocorrelation-consistent (HAC) variance estimate be used.  HAC
            refers to the general form for combining weighted matrices to
            form the variance estimate.  There are three kernels built into
            glm.  kernel is a user-written program or one of

                          nwest | gallant | anderson

            # specifies the number of lags.  If # is not specified, N - 2 is
            assumed.  If you wish to specify vce(hac ... ), you must tsset
            your data before calling glm.

        vce(unbiased) specifies that the unbiased sandwich estimate of
            variance be used.

    vfactor(#) specifies a scalar by which to multiply the resulting variance
        matrix.  This option allows you to match output with other packages,
        which may apply degrees of freedom or other small-sample corrections
        to estimates of variance.

    disp(#) multiplies the variance of depvar by # and divides the deviance
        by #.  The resulting distributions are members of the quasilikelihood
        family.  This option is allowed only with option irls.

    scale(x2|dev|#) overrides the default scale parameter.  This option is
        allowed only with Hessian (information matrix) variance estimates.

        By default, scale(1) is assumed for the discrete distributions
        (binomial, Poisson, and negative binomial), and scale(x2) is assumed
        for the continuous distributions (Gaussian, gamma, and inverse
        Gaussian).

        scale(x2) specifies that the scale parameter be set to the Pearson
        chi-squared (or generalized chi-squared) statistic divided by the
        residual degrees of freedom, which is recommended by McCullagh and
        Nelder (1989) as a good general choice for continuous distributions.

        scale(dev) sets the scale parameter to the deviance divided by the
        residual degrees of freedom.  This option provides an alternative to
        scale(x2) for continuous distributions and overdispersed or
        underdispersed discrete distributions.  This option is allowed only
        with option irls.

        scale(#) sets the scale parameter to #.  For example, using scale(1)
        in family(gamma) models results in exponential-errors regression.
        Additional use of link(log) rather than the default link(power -1)
        for family(gamma) essentially reproduces Stata's streg, dist(exp)
        nohr command (see [ST] streg) if all the observations are uncensored.

        +-----------+
    ----+ Reporting +--------------------------------------------------------

    level(#); see [R] Estimation options.

    eform displays the exponentiated coefficients and corresponding standard
        errors and confidence intervals.  For family(binomial) link(logit)
        (that is, logistic regression), exponentiation results are odds
        ratios; for family(nbinomial) link(log) (that is, negative binomial
        regression) and for family(poisson) link(log) (that is, Poisson
        regression), exponentiated coefficients are incidence-rate ratios.

    nocnsreport; see [R] Estimation options.

    display_options:  noci, nopvalues, noomitted, vsquish, noemptycells,
        baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style),
        cformat(%fmt), pformat(%fmt), sformat(%fmt), and nolstretch; see [R]
        Estimation options.

        +--------------+
    ----+ Maximization +-----------------------------------------------------

    ml requests that optimization be carried out using Stata's ml commands
        and is the default.

    irls requests iterated, reweighted least-squares (IRLS) optimization of
        the deviance instead of Newton-Raphson optimization of the log
        likelihood.  If the irls option is not specified, the optimization is
        carried out using Stata's ml commands, in which case all options of
        ml maximize are also available.

    maximize_options:  difficult, technique(algorithm_spec), iterate(#),
        [no]log, trace, gradient, showstep, hessian, showtolerance,
        tolerance(#), ltolerance(#), nrtolerance(#), nonrtolerance, and
        from(init_specs); see [R] Maximize.  These options are seldom used.

        Setting the optimization method to technique(bhhh) resets the default
        vcetype to vce(opg).

        If option irls is specified, only maximize_options iterate(), nolog,
        trace, and ltolerance() are allowed.  With irls specified, the
        convergence criterion is satisfied when the absolute change in
        deviance from one iteration to the next is less than or equal to
        ltolerance(), where ltolerance(1e-6) is the default.

    fisher(#) specifies the number of Newton-Raphson steps that should use
        the Fisher scoring Hessian or EIM before switching to the observed
        information matrix (OIM).  This option is useful only for
        Newton-Raphson optimization (and not when using irls).

    search specifies that the command search for good starting values.  This
        option is useful only for Newton-Raphson optimization (and not when
        using irls).

    The following options are available with glm but are not shown in the
    dialog box:

    noheader suppresses the header information from the output.  The
        coefficient table is still displayed.

    notable suppresses the table of coefficients from the output.  The header
        information is still displayed.

    nodisplay suppresses the output.  The iteration log is still displayed.

    collinear, coeflegend; see [R] Estimation options.  collinear is not
        allowed with irls.


Remarks
-------

    Although glm can be used to fit linear regression (and, in fact, does so
    by default), this should be viewed as an instructional feature; regress
    produces such estimates more quickly, and many postestimation commands
    are available to explore the adequacy of the fit; see [R] regress and [R]
    regress postestimation.

    In any case, you should specify the link function by using the link()
    option and specify the distributional family by using family().  The
    available link functions are

                   Link function            glm option     
                   ----------------------------------------
                   identity                 link(identity) 
                   log                      link(log)      
                   logit                    link(logit)    
                   probit                   link(probit)   
                   complementary log-log    link(cloglog)  
                   odds power               link(opower #) 
                   power                    link(power #)  
                   negative binomial        link(nbinomial)
                   log-log                  link(loglog)   
                   log-complement           link(logc)     

    The available distributional families are

                   Family                 glm option       
                   ----------------------------------------
                   Gaussian(normal)       family(gaussian) 
                   inverse Gaussian       family(igaussian)
                   Bernoulli/binomial     family(binomial) 
                   Poisson                family(poisson)  
                   negative binomial      family(nbinomial)
                   gamma                  family(gamma)    

    You do not have to specify both family() and link(); the default link()
    is the canonical link for the specified family() (except for nbinomial):

                    Family                  Default link  
                    --------------------------------------
                    family(gaussian)        link(identity)
                    family(igaussian)       link(power -2)
                    family(binomial)        link(logit)   
                    family(poisson)         link(log)     
                    family(nbinomial)       link(log)     
                    family(gamma)           link(power -1)

    If you specify both family() and link(), not all combinations make sense.
    You may choose from the following combinations:

          | id  log  logit  probit  clog  pow  opower  nbinomial  loglog  logc
----------+-------------------------------------------------------------------
Gaussian  |  x   x                         x
inv. Gau. |  x   x                         x
binomial  |  x   x     x      x       x    x     x                  x      x
Poisson   |  x   x                         x
neg. bin. |  x   x                         x              x
gamma     |  x   x                         x


Examples
--------

    ---------------------------------------------------------------------------
    Setup
        . webuse lbw

    Generalized linear model with Bernoulli family and default logit link
        . glm low age lwt i.race smoke ptl ht ui, family(binomial)

    Replay results and report exponentiated coefficients
        . glm, eform

    ---------------------------------------------------------------------------
    Setup
        . webuse ldose

    Generalized linear model with binomial family and default logit link
        . glm r ldose, family(binomial n)

    Generalized linear model with binomial family and cloglog link
        . glm r ldose, family(binomial n) link(cloglog)

    ---------------------------------------------------------------------------
    Setup
        . webuse beetle

    Generalized linear model with binomial family and cloglog link
        . glm r i.beetle ldose, family(binomial n) link(cloglog)

    Replay results with 99% confidence intervals
        . glm, level(99)
    ---------------------------------------------------------------------------


Stored results
--------------

    glm, ml stores the following in e():

    Scalars           
      e(N)                   number of observations
      e(k)                   number of parameters
      e(k_eq)                number of equations in e(b)
      e(k_eq_model)          number of equations in overall model test
      e(k_dv)                number of dependent variables
      e(df_m)                model degrees of freedom
      e(df)                  residual degrees of freedom
      e(phi)                 scale parameter
      e(aic)                 model AIC
      e(bic)                 model BIC
      e(ll)                  log likelihood, if NR
      e(N_clust)             number of clusters
      e(chi2)                chi-squared
      e(p)                   p-value for model test
      e(deviance)            deviance
      e(deviance_s)          scaled deviance
      e(deviance_p)          Pearson deviance
      e(deviance_ps)         scaled Pearson deviance
      e(dispers)             dispersion
      e(dispers_s)           scaled dispersion
      e(dispers_p)           Pearson dispersion
      e(dispers_ps)          scaled Pearson dispersion
      e(nbml)                1 if negative binomial parameter estimated via
                               ML, 0 otherwise
      e(vf)                  factor set by vfactor(), 1 if not set
      e(power)               power set by link(power #) or link(opower #)
      e(rank)                rank of e(V)
      e(ic)                  number of iterations
      e(rc)                  return code
      e(converged)           1 if converged, 0 otherwise

    Macros            
      e(cmd)                 glm
      e(cmdline)             command as typed
      e(depvar)              name of dependent variable
      e(varfunc)             program to calculate variance function
      e(varfunct)            variance title
      e(varfuncf)            variance function
      e(link)                program to calculate link function
      e(linkt)               link title
      e(linkf)               link function
      e(m)                   number of binomial trials
      e(wtype)               weight type
      e(wexp)                weight expression
      e(title)               title in estimation output
      e(clustvar)            name of cluster variable
      e(offset)              linear offset variable
      e(chi2type)            Wald; type of model chi-squared test
      e(cons)                noconstant, if specified
      e(hac_kernel)          HAC kernel
      e(hac_lag)             HAC lag
      e(vce)                 vcetype specified in vce()
      e(vcetype)             title used to label Std. err.
      e(opt)                 ml or irls
      e(opt1)                optimization title, line 1
      e(opt2)                optimization title, line 2
      e(which)               max or min; whether optimizer is to perform
                               maximization or minimization
      e(ml_method)           type of ml method
      e(user)                name of likelihood-evaluator program
      e(technique)           maximization technique
      e(properties)          b V
      e(predict)             program used to implement predict
      e(marginsok)           predictions allowed by margins
      e(marginsnotok)        predictions disallowed by margins
      e(asbalanced)          factor variables fvset as asbalanced
      e(asobserved)          factor variables fvset as asobserved

    Matrices          
      e(b)                   coefficient vector
      e(Cns)                 constraints matrix
      e(ilog)                iteration log (up to 20 iterations)
      e(gradient)            gradient vector
      e(V)                   variance-covariance matrix of the estimators
      e(V_modelbased)        model-based variance

    Functions         
      e(sample)              marks estimation sample

    In addition to the above, the following is stored in r():

    Matrices          
      r(table)               matrix containing the coefficients with their
                               standard errors, test statistics, p-values,
                               and confidence intervals

    Note that results stored in r() are updated when the command is replayed
    and will be replaced when any r-class command is run after the estimation
    command.

    glm, irls stores the following in e():

    Scalars           
      e(N)                   number of observations
      e(k)                   number of parameters
      e(k_eq_model)          number of equations in overall model test
      e(df_m)                model degrees of freedom
      e(df)                  residual degrees of freedom
      e(phi)                 scale parameter
      e(disp)                dispersion parameter
      e(bic)                 model BIC
      e(N_clust)             number of clusters
      e(deviance)            deviance
      e(deviance_s)          scaled deviance
      e(deviance_p)          Pearson deviance
      e(deviance_ps)         scaled Pearson deviance
      e(dispers)             dispersion
      e(dispers_s)           scaled dispersion
      e(dispers_p)           Pearson dispersion
      e(dispers_ps)          scaled Pearson dispersion
      e(nbml)                1 if negative binomial parameter estimated via
                               ML, 0 otherwise
      e(vf)                  factor set by vfactor(), 1 if not set
      e(power)               power set by link(power #) or link(opower #)
      e(rank)                rank of e(V)
      e(rc)                  return code

    Macros            
      e(cmd)                 glm
      e(cmdline)             command as typed
      e(depvar)              name of dependent variable
      e(varfunc)             program to calculate variance function
      e(varfunct)            variance title
      e(varfuncf)            variance function
      e(link)                program to calculate link function
      e(linkt)               link title
      e(linkf)               link function
      e(m)                   number of binomial trials
      e(wtype)               weight type
      e(wexp)                weight expression
      e(clustvar)            name of cluster variable
      e(offset)              linear offset variable
      e(cons)                noconstant, if specified
      e(hac_kernel)          HAC kernel
      e(hac_lag)             HAC lag
      e(vce)                 vcetype specified in vce()
      e(vcetype)             title used to label Std. err.
      e(opt)                 ml or irls
      e(opt1)                optimization title, line 1
      e(opt2)                optimization title, line 2
      e(properties)          b V
      e(predict)             program used to implement predict
      e(marginsok)           predictions allowed by margins
      e(marginsnotok)        predictions disallowed by margins
      e(asbalanced)          factor variables fvset as asbalanced
      e(asobserved)          factor variables fvset as asobserved

    Matrices          
      e(b)                   coefficient vector
      e(V)                   variance-covariance matrix of the estimators
      e(V_modelbased)        model-based variance

    Functions         
      e(sample)              marks estimation sample

    In addition to the above, the following is stored in r():

    Matrices          
      r(table)               matrix containing the coefficients with their
                               standard errors, test statistics, p-values,
                               and confidence intervals

    Note that results stored in r() are updated when the command is replayed
    and will be replaced when any r-class command is run after the estimation
    command.


Reference
---------

    McCullagh, P., and J. A. Nelder. 1989.  Generalized Linear Models. 2nd
        ed.  London: Chapman & Hall/CRC.

glm command in Stata - Umbrella Command - linear regression: family(gaussian) link(identity) - logistic regression: family(binomial) link(logit) - poisson regression: family(poisson) link(log) - log-binomial regression: family(binomial) link(log)

加粗部分表示,在实际代码中可以简写为加粗的字母或单词即可

8 代码的转换

8.1 Linear regression:

  • regress price weight length mpg i.rep78
  • glm price weight length mpg i.rep78, family(gaussian)link(identity)

8.2 Logistic regression:

  • logistic low age i.smoke i.race
  • glm low age i.smoke i.race,family(binomial) link(logit)
  • glm low age i.smoke i.race,family(binomial) link(logit) eform

eform 直接输出 OR 值,即 \(e^{\beta}\)

9 线性回归模型的比较

%%stata
regress price weight length mpg i.rep78

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(7, 61)        =      7.25
       Model |   262008114         7  37429730.6   Prob > F        =    0.0000
    Residual |   314788844        61  5160472.86   R-squared       =    0.4542
-------------+----------------------------------   Adj R-squared   =    0.3916
       Total |   576796959        68  8482308.22   Root MSE        =    2271.7

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      weight |   5.186695   1.163383     4.46   0.000     2.860367    7.513022
      length |  -124.1544   40.07637    -3.10   0.003     -204.292   -44.01671
         mpg |  -126.8367   84.49819    -1.50   0.138    -295.8012    42.12791
             |
       rep78 |
          2  |   1137.284   1803.332     0.63   0.531    -2468.701    4743.269
          3  |   1254.642   1661.545     0.76   0.453    -2067.823    4577.108
          4  |   2267.188   1698.018     1.34   0.187    -1128.208    5662.584
          5  |   3850.759   1787.272     2.15   0.035     276.8886     7424.63
             |
       _cons |   14614.49   6155.842     2.37   0.021     2305.125    26923.86
------------------------------------------------------------------------------
%%stata
glm price weight length mpg i.rep78, family(gaussian)link(identity)

Iteration 0:  Log likelihood = -626.90582  

Generalized linear models                         Number of obs   =         69
Optimization     : ML                             Residual df     =         61
                                                  Scale parameter =    5160473
Deviance         =  314788844.4                   (1/df) Deviance =    5160473
Pearson          =  314788844.4                   (1/df) Pearson  =    5160473

Variance function: V(u) = 1                       [Gaussian]
Link function    : g(u) = u                       [Identity]

                                                  AIC             =   18.40307
Log likelihood   = -626.9058204                   BIC             =   3.15e+08

------------------------------------------------------------------------------
             |                 OIM
       price | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
      weight |   5.186695   1.163383     4.46   0.000     2.906506    7.466883
      length |  -124.1544   40.07637    -3.10   0.002    -202.7026   -45.60612
         mpg |  -126.8367   84.49819    -1.50   0.133    -292.4501    38.77675
             |
       rep78 |
          2  |   1137.284   1803.332     0.63   0.528    -2397.182     4671.75
          3  |   1254.642   1661.545     0.76   0.450    -2001.927    4511.211
          4  |   2267.188   1698.018     1.34   0.182    -1060.866    5595.241
          5  |   3850.759   1787.272     2.15   0.031     347.7711    7353.747
             |
       _cons |   14614.49   6155.842     2.37   0.018     2549.263    26679.72
------------------------------------------------------------------------------

10 logistic 回归的比较

重新导入数据:

%%stata
webuse lbw,clear
(Hosmer & Lemeshow data)
%%stata
logistic low age i.smoke i.race

Logistic regression                                     Number of obs =    189
                                                        LR chi2(4)    =  15.81
                                                        Prob > chi2   = 0.0033
Log likelihood = -109.4311                              Pseudo R2     = 0.0674

------------------------------------------------------------------------------
         low | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .9657186   .0322573    -1.04   0.296     .9045206    1.031057
             |
       smoke |
     Smoker  |    3.00582   1.118001     2.96   0.003     1.449982    6.231081
             |
        race |
      Black  |   2.749483   1.356659     2.05   0.040     1.045318    7.231924
      Other  |   2.876948   1.167921     2.60   0.009     1.298314    6.375062
             |
       _cons |    .365111   .3146026    -1.17   0.242     .0674491    1.976395
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
%%stata
glm low age i.smoke i.race,family(binomial) link(logit) eform

Iteration 0:  Log likelihood = -109.53148  
Iteration 1:  Log likelihood = -109.43111  
Iteration 2:  Log likelihood =  -109.4311  
Iteration 3:  Log likelihood =  -109.4311  

Generalized linear models                         Number of obs   =        189
Optimization     : ML                             Residual df     =        184
                                                  Scale parameter =          1
Deviance         =  218.8621974                   (1/df) Deviance =   1.189468
Pearson          =  182.9642078                   (1/df) Pearson  =   .9943707

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = ln(u/(1-u))             [Logit]

                                                  AIC             =   1.210911
Log likelihood   = -109.4310987                   BIC             =  -745.6193

------------------------------------------------------------------------------
             |                 OIM
         low | Odds ratio   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .9657186   .0322573    -1.04   0.296     .9045206    1.031057
             |
       smoke |
     Smoker  |    3.00582   1.118001     2.96   0.003     1.449982    6.231081
             |
        race |
      Black  |   2.749483   1.356659     2.05   0.040     1.045318    7.231924
      Other  |   2.876948   1.167921     2.60   0.009     1.298314    6.375062
             |
       _cons |    .365111   .3146026    -1.17   0.242     .0674491    1.976395
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

11 Log-Binomial Model

glm low age i.smoke i.race,family(binomial) link(log)

  • Logistic regression 的一种替代
  • Fail to converge
  • Poisson regression with robust variance estimate

为了避免出现 Fail to converge 的问题,采用稳健方差估计:

glm low age i.smoke i.race,family(poisson) link(log) robust

  • robust 可以缩写为 r
%%stata
glm low age i.smoke i.race,family(poisson) link(log) robust

Iteration 0:  Log pseudolikelihood = -124.55888  
Iteration 1:  Log pseudolikelihood = -122.39663  
Iteration 2:  Log pseudolikelihood = -122.39591  
Iteration 3:  Log pseudolikelihood = -122.39591  

Generalized linear models                         Number of obs   =        189
Optimization     : ML                             Residual df     =        184
                                                  Scale parameter =          1
Deviance         =   126.791811                   (1/df) Deviance =   .6890859
Pearson          =  124.4629927                   (1/df) Pearson  =   .6764293

Variance function: V(u) = u                       [Poisson]
Link function    : g(u) = ln(u)                   [Log]

                                                  AIC             =   1.348105
Log pseudolikelihood = -122.3959055               BIC             =  -837.6896

------------------------------------------------------------------------------
             |               Robust
         low | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0246781   .0203034    -1.22   0.224    -.0644722    .0151159
             |
       smoke |
     Smoker  |   .6972155    .211848     3.29   0.001     .2820011     1.11243
             |
        race |
      Black  |    .639629   .2818353     2.27   0.023     .0872419    1.192016
      Other  |   .6813692   .2428128     2.81   0.005     .2054649    1.157273
             |
       _cons |  -1.286544   .5405667    -2.38   0.017    -2.346035   -.2270526
------------------------------------------------------------------------------