Abstract
Binary responses are commonly studied in many fields. Examples include the presence or absence of a particular disease, death during surgery, or a consumer purchasing a product. Often one wishes to study how a set of predictor variables X is related to a dichotomous response variable Y. The predictors may describe such quantities as treatment assignment, dosage, risk factors, and calendar time. For convenience we define the response to be Y = 0 or 1, with Y = 1 denoting the occurrence of the event of interest. Often a dichotomous outcome can be studied by calculating certain proportions, for example, the proportion of deaths among females and the proportion among males. However, in many situations, there are multiple descriptors, or one or more of the descriptors are continuous. Without a statistical model, studying patterns such as the relationship between age and occurrence of a disease, for example, would require the creation of arbitrary age groups to allow estimation of disease prevalence as a function of age.
Notes
1. The general formula for the sample size required to achieve a margin of error of δ in estimating a true probability of θ at the 0.95 confidence level is \(n = \left(\frac{1.96}{\delta}\right)^{2} \times \theta(1-\theta)\). Set \(\theta = \frac{1}{2}\) (intercept = 0) for the worst case.
2. The R code can easily be modified for other event frequencies, or the minimum of the number of events and non-events for a dataset at hand can be compared with \(\frac{n}{2}\) in this simulation. An average maximum absolute error of 0.05 corresponds roughly to a half-width of the 0.95 confidence interval of 0.1.
3. In the wireframe plots that follow, predictions for cholesterol–age combinations for which fewer than 5 exterior points exist are not shown, so as to not extrapolate to regions not supported by at least five points beyond the data perimeter.
4. Note that D and B (below) and other indexes not related to c (below) do not work well in case-control studies because of their reliance on absolute probability estimates.
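The margin-of-error formula in note 1 can be evaluated directly. The following is a minimal Python sketch rather than code from the chapter; the function name `sample_size_for_margin` and its parameter names are hypothetical, chosen only for illustration.

```python
import math

# Normal critical value for the 0.95 confidence level, as used in note 1.
Z_95 = 1.96


def sample_size_for_margin(delta, theta=0.5, z=Z_95):
    """Return n = (z / delta)^2 * theta * (1 - theta), unrounded.

    delta: desired margin of error for the estimated probability
    theta: assumed true probability (0.5 is the worst case)
    Hypothetical helper for illustration only.
    """
    if not 0 < delta < 1:
        raise ValueError("delta must lie strictly between 0 and 1")
    if not 0 <= theta <= 1:
        raise ValueError("theta must lie between 0 and 1")
    return (z / delta) ** 2 * theta * (1 - theta)


# Worst case (theta = 1/2) for two margins of error, rounded up to whole subjects.
for delta in (0.1, 0.05):
    n = sample_size_for_margin(delta)
    print(f"delta={delta}: n = {n:.2f}, round up to {math.ceil(n)}")
```

Under the worst-case assumption θ = 1/2, a margin of 0.1 gives n ≈ 96.04 and a margin of 0.05 gives n ≈ 384.16, so halving the margin roughly quadruples the required sample size.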
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Harrell, F.E. (2015). Binary Logistic Regression. In: Regression Modeling Strategies. Springer Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-19425-7_10
DOI: https://doi.org/10.1007/978-3-319-19425-7_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19424-0
Online ISBN: 978-3-319-19425-7
eBook Packages: Mathematics and Statistics (R0)