Binary Logistic Regression

Harrell, Frank E.

doi:10.1007/978-3-319-19425-7_10

Frank E. Harrell Jr.

Part of the book series: Springer Series in Statistics (SSS)

Abstract

Binary responses are commonly studied in many fields. Examples include the presence or absence of a particular disease, death during surgery, or a consumer purchasing a product. Often one wishes to study how a set of predictor variables X is related to a dichotomous response variable Y. The predictors may describe such quantities as treatment assignment, dosage, risk factors, and calendar time. For convenience we define the response to be Y = 0 or 1, with Y = 1 denoting the occurrence of the event of interest. Often a dichotomous outcome can be studied by calculating certain proportions, for example, the proportion of deaths among females and the proportion among males. However, in many situations, there are multiple descriptors, or one or more of the descriptors are continuous. Without a statistical model, studying patterns such as the relationship between age and occurrence of a disease, for example, would require the creation of arbitrary age groups to allow estimation of disease prevalence as a function of age.

Notes

1. The general formula for the sample size required to achieve a margin of error of δ in estimating a true probability of θ at the 0.95 confidence level is \(n = \left(\frac{1.96}{\delta}\right)^{2} \times \theta(1-\theta)\). Set \(\theta = \frac{1}{2}\) (intercept = 0) for the worst case.
2. The R code can easily be modified for other event frequencies, or the minimum of the number of events and non-events for a dataset at hand can be compared with \(\frac{n}{2}\) in this simulation. An average maximum absolute error of 0.05 corresponds roughly to a half-width of the 0.95 confidence interval of 0.1.
3. In the wireframe plots that follow, predictions for cholesterol–age combinations for which fewer than 5 exterior points exist are not shown, so as to not extrapolate to regions not supported by at least five points beyond the data perimeter.
4. Note that D and B (below) and other indexes not related to c (below) do not work well in case-control studies because of their reliance on absolute probability estimates.
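
As a rough illustration of the formula in note 1, the following Python sketch computes the required sample size for a given margin of error δ and true probability θ. The function name `sample_size_for_margin` and the choice to round up to the next whole observation are assumptions made for this example and do not come from the chapter.

```python
import math


def sample_size_for_margin(delta, theta=0.5, z=1.96):
    """Sample size n = (z / delta)^2 * theta * (1 - theta) from note 1.

    delta : desired margin of error (half-width of the confidence interval)
    theta : assumed true probability; 0.5 is the worst case
    z     : normal critical value, 1.96 for the 0.95 confidence level

    Rounding up with math.ceil is an assumption of this sketch, so that the
    returned n is a whole number that meets or beats the requested margin.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    if not 0 < theta < 1:
        raise ValueError("theta must lie strictly between 0 and 1")
    return math.ceil((z / delta) ** 2 * theta * (1 - theta))


# Worst case (theta = 1/2) for margins of error 0.1 and 0.05
print(sample_size_for_margin(0.1))   # 97
print(sample_size_for_margin(0.05))  # 385
```

Halving the margin of error roughly quadruples the required sample size, because n scales with 1/δ².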

Author information

Authors and Affiliations

Department of Biostatistics, School of Medicine Vanderbilt University, Nashville, TN, USA
Frank E. Harrell Jr.

About this chapter

Cite this chapter

Harrell, F.E. (2015). Binary Logistic Regression. In: Regression Modeling Strategies. Springer Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-19425-7_10

DOI: https://doi.org/10.1007/978-3-319-19425-7_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19424-0
Online ISBN: 978-3-319-19425-7
eBook Packages: Mathematics and Statistics (R0)
