global path "/Users/ryanmfinnigan/Dropbox/FinniganMarginalEffects"
log using $path/FinniganMarginalEffects.txt, replace text
*************************************
***** MARGINAL EFFECTS TUTORIAL *****
*************************************
// Ryan Finnigan
// rfinnigan@ucdavis.edu
// May, 2015
/* This brief 'tutorial' covers basic calculation and presentation of marginal
effects using logistic regression in Stata. The 'tutorial' has four sections:
1. A comparison of the substantive meanings of proportions, marginal probabilities,
and odds.
2. The presentation and substantive interpretation of results from a logistic regression
model using log-odds coefficients, odds ratios, and marginal effects.
3. An illustration of calculating average marginal effects and marginal effects
at the means for continuous and dichotomous variables.
4. A comparison of results from linear probability models and marginal effects.
Check out Carina Mood's paper for much more on the complications of log-odds coefficients
and odds ratios relative to marginal effects, including more detailed examples:
Mood, Carina. 2010. "Logistic Regression: Why We Cannot Do What We Think We Can,
and What We Can Do About It." European Sociological Review 26:67-82
Mood draws a lot on some work by Paul Allison (and others) on logistic regression,
particularly with regards to the problems of unobserved heterogeneity between groups:
Allison, Paul D. 1999. "Comparing Logit and Probit Coefficients Across Groups."
Sociological Methods and Research 28(2):186-208.
Also check out Ai and Norton if you're interested in a more technical treatment of interaction effects (which
are kind of a mess in logistic regression):
Ai, Chunrong, and Edward Norton. 2003. "Interaction Terms in Logit and Probit Models."
Economics Letters 80:123-129.
Finally, Richard Williams at Notre Dame has a Stata package (oglm) and some papers on dealing
with heterogeneity between groups in logistic and ordered logit:
Williams, Richard. 2009. "Using Heterogeneous Choice Models to Compare Logit and
Probit Coefficients Across Groups." Sociological Methods and Research 37(4):531-559.
__. 2010. "Estimating Heterogeneous Choice Models with OGLM." The Stata Journal
*/
*******************
***** 0. DATA *****
*******************
/*
The data for the exercise come from the 2013 March Current Population Survey, downloaded
from IPUMS (https://cps.ipums.org/cps/). IPUMS is a fantastic resource for CPS data,
Census and American Community Survey microdata, geographic census data, international
censuses, and all kinds of other cool stuff. Caution: it's easy to fall down the rabbit
hole of data hoarding if you're a data nerd. Get ready to invest in external hard
drives for masses of data you'll never use.
The dependent variable for the exercise is dichotomous self-rated health
(excellent, very good, good = 1; fair, poor = 0).
We'll use two continuous predictors: age and household income
And a few categorical predictors: sex, race/ethnicity, marital status, and education
*/
use if year==2013 & relate==101 & popstat!=2 & age>=18 & occ1990!=905 using "/Users/rfinniga/Documents/CPS_1962_2013.dta", clear
// taking a random subset of the data
// some of the commands below take a while with large data
// this is just to speed up the tutorial
set seed 123456
gen r = runiform()
keep if r<.1
// health
recode health (1/3 = 1 "Good") (4/5 = 0 "Poor"), gen(srh)
label var srh "Health"
// dummy for female from "sex"
recode sex (1 = 0 "Male") (2 = 1 "Female"), gen(female)
label var female "Sex"
// standard race categories from "race" and "hispan"
recode race (100 = 1 "White") (200 = 2 "Black") (0 = 3 "Latino") (650/652 = 4 "Asian") (300 700/830 = 5 "Other Race"), gen(racecat)
replace racecat = 3 if hispan>0 & hispan<900
label var racecat "Race/Ethnicity"
// marital status
recode marst (1/2 = 1 "Married") (3/5 = 2 "Previously Married") (6 = 3 "Never Married"), gen(marcat)
label var marcat "Marital Status"
// education
recode educ (2/71 = 1 "Less Than HS") (73 = 2 "HS/GED") (81/92 = 3 "Some College") (111/max = 4 "College+"), gen(edcat)
label var edcat "Education"
// logged household income, adjusted for household size
rename numprec hhsize
recode hhincome (min/1 = 1) // bottom-coding at 1 so the log is defined
gen loghhincome = ln(hhincome/sqrt(hhsize))
label var loghhincome "ln(HH Income)"
keep srh age female racecat marcat edcat loghhincome
save $path/FinniganMarginalEffects.dta, replace
******************************************************
***** 1. PROPORTIONS, MARGINAL EFFECTS, AND ODDS *****
******************************************************
/* The first descriptive steps generally include cross-tabulations. Let's start
with health by sex. */
tab srh female, col
/* The marginal probabilities from logistic regression are the predicted probabilities
of good health for the sex categories. In a bivariate model, these marginal probabilities
recover the proportions of good health. The margins command is really useful for marginal
probabilities and marginal effects. Its learning curve is a bit steep, but it's pretty
powerful. See the Stata manual for way more (http://www.stata.com/manuals13/rmargins.pdf). */
quietly logit srh i.female
margins female
/* The margins command can also calculate the marginal effect, the difference in
the predicted probability of good health between female and male respondents. */
margins , dydx(female)
/* The correspondence between the marginal probabilities/effects and descriptive
proportions gives them a substantive interpretability that odds ratios and log-odds
coefficients lack (aside from the technical reasons that make odds ratios and log-odds
coefficients less comparable between models and groups). Logistic regression presented
using odds ratios can still recover the descriptive patterns, but the proportions must be
transformed. */
logistic srh female
/*
This odds ratio is the same as using the formula OR = [p1/(1-p1)]/[p0/(1-p0)]
0.7972 = (.8323/(1 - .8323))/(.8616/(1-.8616))
Females have 0.79 times the odds of reporting good health relative to males
(you could also say their odds are 1.25 = 1/0.797 times lower). Or, you could say
females have a 3 percentage point lower probability of reporting good health than
males. In my opinion, the second is more meaningful than the first. */
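/* As a quick check, the odds ratio can be recovered by hand from the two
proportions in the tabulation above (plugging them into the OR formula): */
display (.8323/(1 - .8323))/(.8616/(1 - .8616)) // ~ 0.797, matching the logistic output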
/* The proportions and marginal probabilities are also more informative to display
graphically than odds and odds ratios. The marginsplot command can produce graphs
following the margins command. See the stata manual for more details
(http://www.stata.com/manuals13/rmarginsplot.pdf). */
graph bar srh, over(female) scheme(s1color)
quietly logit srh i.female
quietly margins female
marginsplot , recast(bar) scheme(s1color)
***********************************************************************
***** 2. LOG-ODDS COEFFICIENTS, ODDS RATIOS, AND MARGINAL EFFECTS *****
***********************************************************************
/* Now let's compare results from a full model presented as log-odds coefficients,
odds ratios, and marginal effects. The user-written outreg2 command produces an
excel table with one column for each type of presentation. */
// table settings
global regtable "bdec(3) tdec(2) alpha(0.001, 0.01, 0.05) symbol(***, **, *) excel tstat"
// log-odds coefficients
quietly logit srh age i.female i.marcat i.racecat i.edcat loghhincome
outreg2 using $path/RegressionTable.xls, $regtable ctitle("log-odds") replace
// odds ratios, using last regression
outreg2 using $path/RegressionTable.xls, $regtable eform ctitle("OR") append
// marginal effects, using last regression
// the "post" options makes the marginal effects the last regression estimates in
// memory, rather than those from the logit command above
quietly margins , dydx(*) post
outreg2 using $path/RegressionTable.xls, $regtable ctitle("AME") append label
/* The marginal effects for categorical variables are the average differences in the
predicted probabilities of health between the given category and the reference,
with other variables at their observed values. For continuous variables, the
marginal effect is the change in the predicted probability for a very small
change in x, averaged across observations. This is essentially the average
derivative of the predicted probability with respect to x. */
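/* As a sketch of what margins computes for a continuous variable: with no
higher-order or interaction terms, the derivative of the predicted probability
with respect to age is _b[age]*p*(1-p) for each observation, and the AME is the
average of those derivatives. (phat and ame_age_i are temporary names used only
for this illustration.) */
quietly logit srh age i.female i.marcat i.racecat i.edcat loghhincome
predict phat
gen ame_age_i = _b[age]*phat*(1-phat)
// the mean of the observation-level derivatives should match margins, dydx(age)
sum ame_age_i
drop phat ame_age_i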
**************************************************
***** 3. DIFFERENT FORMS OF MARGINAL EFFECTS *****
**************************************************
/* The marginal effects reported above are with all other variables at their observed
values, referred to as the average marginal effect (AME). However, there are multiple
forms of marginal effects that you may wish to present, such as marginal effects at
the mean (MEM), or marginal effects at representative values (MER). Mood (2010) gets
into some of the differences in more technical detail. Also, Cameron and Trivedi (2009)
show many of the computations in Stata in their manual, Microeconometrics Using Stata.
The different forms of marginal effects are important to know because each unique
combination of values of the X's comes with its own estimate of the marginal effect.
In contrast, the log-odds coefficients and odds ratios are relative effects, constant
across all observations. Let's compare the odds ratio and average marginal effect of
sex across the range of predicted probabilities/odds of good health. */
preserve
// full model predicting health
quietly logit srh age i.female i.marcat i.racecat i.edcat loghhincome
// generating a variable equal to the odds ratio for females relative to males
gen or = exp(_b[1.female])
// predicted probability of good health
predict p
// predicted odds of good health
gen odds = p/(1-p)
// calculating the average marginal effect
margins , dydx(female)
// calculating the marginal effect for all observations by hand
gen female2 = female
quietly replace female = 0
predict p0
quietly replace female = 1
predict p1
gen me = p1 - p0
// comparing the average of the marginal effects for all observations to the AME from margins
sum me
tw scatter or odds if female2==1 || scatter or odds if female2==0, ///
legend(order(1 "Female" 2 "Male")) ylabel(.8(.1)1.2) ytitle("Odds Ratio") ///
scheme(s1color) nodraw name(odds, replace)
tw scatter me p if female2==1 || scatter me p if female2==0, ///
legend(order(1 "Female" 2 "Male")) ytitle("Marginal Effect") ///
scheme(s1color) nodraw name(me, replace)
graph combine odds me, title("Odds Ratios and Marginal Effects for the Sex Difference in Health") ///
row(1) scheme(s1color) ysize(4) xsize(6.5) imargin(vsmall) iscale(1.1)
restore
/* The marginal effect is largest around a 50% probability of good health, and
differs slightly between males and females. The odds ratio is constant and equal
between groups. You may be interested in the marginal effect for particular
observations/values of the Xs, so the AME may not be the droid you're looking for. */
/* You may be interested in the marginal effects with all other variables
held constant at their means, MEM. The non-linearity of the predicted probabilities
makes the probability of health with X = means different from the average probability
of health at the observed values of X. */
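/* A quick illustration of that difference: the average predicted probability at
the observed values is not the same as the predicted probability with all
covariates set to their means. */
quietly logit srh age i.female i.marcat i.racecat i.edcat loghhincome
margins            // average predicted probability at observed values
margins , atmeans  // predicted probability with covariates at their means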
quietly logit srh age i.female i.marcat i.racecat i.edcat loghhincome
margins , dydx(*) atmeans post
outreg2 using $path/RegressionTable.xls, $regtable ctitle("MEM") append label
/* You might also be interested in the marginal effects with some variables at
substantively relevant values. Maybe the marginal effects for those with a high
school education and those with college or more. */
// AME
quietly logit srh age i.female i.marcat i.racecat i.edcat loghhincome
eststo: margins, dydx(*) post
// AME for high school educ
quietly logit srh age i.female i.marcat i.racecat i.edcat loghhincome
eststo: margins, dydx(*) at(edcat = 2) post
// AME for college
quietly logit srh age i.female i.marcat i.racecat i.edcat loghhincome
eststo: margins, dydx(*) at(edcat = 4) post
esttab est*, label b(3)
/* The marginal effects for education are the same, but the marginal effects for
the other variables are a little stronger among those with HS than with college+.
This makes sense, because the predicted probability of good health is lower with
less education, closer to that range of probabilities with the largest marginal
effects in the graph above. You might also be interested in just the effect of
income by education. */
quietly logit srh age i.female i.marcat i.racecat i.edcat loghhincome
margins , dydx(loghhincome) at(edcat = (1 2 3 4))
/* Maybe you have bizarrely specific hypotheses, and you're interested in never
married, male Latino 40 year-olds with HS education and median HH income. */
quietly logit srh age i.female i.marcat i.racecat i.edcat loghhincome
margins , dydx(*) at(age = 40 female = 0 marcat = 3 racecat = 3 edcat = 2 (median) loghhincome)
/* The "at" option also allows you to calculated marginal probabilities for a range
of values. You could calculate the predicted probability of good health by race/ethnicity
with the other variables at their means. */
margins racecat, atmeans
/* Or by marital status across a range of ages, with other variables at their means. */
margins marcat, at((means) _all age = (20(20)60))
/* Finally, the marginal effects of continuous variables aren't super interpretable.
A "very small difference" in age is less clear than a 10 year age difference from
the mean, or a standard deviation difference in income from the mean. There's likely
a better way to do this, but here's a way that works. */
sum age loghhincome
margins , at(age = (49.32 59.32))
margins , at(loghhincome = (10.25 11.75))
/* The marginal effect of a 10 year age difference from the mean is -5.2 percentage
points (-5.2 = 81.31 - 86.51). The marginal effect of a standard deviation greater
income than the mean is 2.34 percentage points (2.34 = 87.50 - 85.16). */
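/* A tidier way to get these differences (with standard errors) is to post the
margins results and take the contrast with lincom; after margins, post, the
coefficient names 1._at and 2._at refer to the first and second at() values. */
quietly logit srh age i.female i.marcat i.racecat i.edcat loghhincome
quietly margins , at(age = (49.32 59.32)) post
lincom _b[2._at] - _b[1._at]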
****************************************
***** 4. LINEAR PROBABILITY MODELS *****
****************************************
/* Another option to circumvent many of the complications of logistic regression
is to abandon it altogether. OLS with a binary dependent variable is usually referred
to as a linear probability model (LPM), particularly in economics. In sociology,
it's The Thing That Should Never Be Done. Elsewhere, Mood comments, "For the
braver among us, there is the taboo-breaking option of using LPM, i.e. OLS for
binary outcomes. Many economists do it, there are much fewer mistakes to make
(coefficients can be compared across models and groups, and interactions are
intuitive), and the coefficients are almost always very close to the AME (or to
their discrete counterparts in case of dummy variables). If all sociologists
swapped from logit to OLS, fewer errors would be made and the results would be
more reliable and interpretable."
If you're interested in average effects, and not potentially non-linear differences
at different ranges of values, LPMs may be for you. As a brave and taboo-breaking
quantitative sociologist (???), I use LPMs. They've become pretty commonplace in
economics, and several econometricians have analyzed the correspondence of results
from LPMs to AMEs. I've noticed them popping up more at conferences in sociology
recently too. So I leave you with the exciting/depressing notion that there's a
whole world of ways to analyze binary dependent variables beyond logistic regression.
*/
quietly reg srh age i.female i.marcat i.racecat i.edcat loghhincome, robust
outreg2 using $path/RegressionTable.xls, $regtable ctitle("LPM") append label
log close