Abstract
Chapter 2 dealt with aspects of modeling such as transformations of predictors, relaxing linearity assumptions, modeling interactions, and examining lack of fit. Chapter 3 dealt with missing data, focusing on utilization of incomplete predictor information. All of these areas are important in the overall scheme of model development, and they cannot be separated from what is to follow. In this chapter we concern ourselves with issues related to the whole model, with emphasis on deciding on the amount of complexity to allow in the model and on dealing with large numbers of predictors. The chapter concludes with three default modeling strategies depending on whether the goal is prediction, estimation, or hypothesis testing.
Notes
- 1.
Even then, the two blood pressures may need to be transformed to meet distributional assumptions.
- 2.
Shrinkage (penalized estimation) is a general solution (see Section 4.5). One can always use complex models that are “penalized towards simplicity,” with the amount of penalization being greater for smaller sample sizes.
- 3.
One can also perform a joint test of all parameters associated with nonlinear effects. This can be useful in demonstrating to the reader that some complexity was actually needed.
- 4.
Lockhart et al. [425] provide an example with n = 100 and 10 orthogonal predictors where all true βs are zero. The test statistic for the first variable to enter has type I error of 0.39 when the nominal α is set to 0.05, in line with what one would expect from multiple testing: \(1 - 0.95^{10} = 0.40\).
- 5.
AIC works successfully when the models being entertained lie on a progression defined by a single parameter, e.g., a common shrinkage coefficient or a single number of knots used by all continuous predictors. AIC can also work when the model that is best by AIC is much better than the runner-up, so that if the process were bootstrapped the same model would almost always be found. When used for one-variable-at-a-time variable selection, AIC is just a restatement of the P-value and, as such, does not solve the severe problems with stepwise variable selection other than forcing us to use slightly more sensible α values. Burnham and Anderson [84] recommend selection based on AIC for a limited number of theoretically well-founded models. Some statisticians try to deal with multiplicity problems caused by stepwise variable selection by making α smaller than 0.05. This increases bias by giving variables whose effects are estimated with error a greater relative chance of being selected. Variable selection does not compete well with shrinkage methods that simultaneously model all potential predictors.
- 6.
This is akin to doing a t-test to compare the two treatments (out of 10, say) that are apparently most different from each other.
- 7.
These are situations where the true \(R^{2}\) is low, unlike tightly controlled experiments and mechanistic models where signal:noise ratios can be quite high. In those situations, many parameters can be estimated from small samples, and the \(\frac{m}{15}\) rule of thumb can be significantly relaxed.
- 8.
See [487]. If one compares the power of a two-sample binomial test with that of a Wilcoxon test when the response could be made continuous and the proportional odds assumption holds, the effective sample size for a binary response is \(3n_{1}n_{2}/n \approx 3\min(n_{1}, n_{2})\) if \(n_{1}/n\) is near 0 or 1 [664, Eqs. 10, 15]. Here \(n_{1}\) and \(n_{2}\) are the marginal frequencies of the two response levels.
- 9.
Based on the power of a proportional odds model two-sample test when the marginal cell sizes for the response are \(n_{1}, \ldots, n_{k}\), compared with all cell sizes equal to unity (response is continuous) [664, Eq. 3]. If all cell sizes are equal, the relative efficiency of having \(k\) response categories compared with a continuous response is \(1 - 1/k^{2}\) [664, Eq. 14]; for example, a five-level response is almost as efficient as a continuous one if proportional odds holds across category cutoffs.
- 10.
This is approximate, as the effective sample size may sometimes be boosted somewhat by censored observations, especially for non-proportional-hazards methods such as Wilcoxon-type tests [49].
- 11.
An even more stringent assessment is obtained by stratifying calibration curves by predictor settings.
- 12.
It is interesting that researchers are quite comfortable with adjusting P-values for post hoc selection of comparisons using, for example, the Bonferroni inequality, but they do not realize that post hoc selection of comparisons also biases point estimates.
- 13.
There is an option to force continuous variables to be linear when they are being predicted.
- 14.
If one were to estimate transformations without removing observations that had these constants inserted for the current Y-variable, the resulting transformations would likely have a spike at Y = the imputation constant.
- 15.
Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments
- 16.
Whether this statistic should be used to change the model is problematic in view of model uncertainty.
- 17.
The R function score.binary in the Hmisc package (see Section 6.2) assists in computing a summary variable from the series of binary conditions.
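The multiplicity claim in Note 4 can be checked by simulation. The sketch below (in Python rather than the R used in this book; sample size, replication count, and critical value are illustrative choices) draws 10 orthogonal null predictors, tests each one's univariate association with the response, and counts how often the most significant variable — the first to enter a stepwise algorithm — would be declared significant at nominal α = 0.05.

```python
# Monte Carlo check of Note 4: with 10 orthogonal null predictors and
# nominal alpha = 0.05, the "first variable to enter" test rejects with
# probability about 1 - 0.95^10 = 0.40, not 0.05.
import numpy as np

def first_entry_type1(n=100, p=10, reps=4000, seed=1):
    rng = np.random.default_rng(seed)
    tcrit = 1.984  # approximate two-sided 0.05 critical t with 98 df
    rejections = 0
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)  # all true betas are zero
        # univariate t statistic for each predictor via its correlation with y
        xc = X - X.mean(axis=0)
        yc = y - y.mean()
        r = xc.T @ yc / np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
        t = r * np.sqrt((n - 2) / (1 - r ** 2))
        if np.abs(t).max() > tcrit:  # most significant variable enters first
            rejections += 1
    return rejections / reps

print(round(1 - 0.95 ** 10, 3))  # multiplicity bound: 0.401
print(first_entry_type1())       # empirical type I error, roughly 0.4
```

With nearly independent predictors the empirical rate tracks the \(1 - 0.95^{10}\) bound closely, matching the 0.39 reported by Lockhart et al.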
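Note 5's point that AIC is a restatement of the P-value can be made concrete for a single added parameter: AIC prefers the larger model exactly when the 1 d.f. chi-square statistic exceeds 2, which corresponds to an implied α of about 0.157 rather than 0.05. A stdlib-only sketch:

```python
# For one added parameter, AIC = -2*loglik + 2*p, so the larger model
# wins whenever the likelihood ratio chi-square exceeds 2.  Since a
# chi^2_1 variable is the square of a standard normal,
#   P(chi^2_1 > 2) = P(|Z| > sqrt(2)) = 2*(1 - Phi(sqrt(2))) = erfc(1).
import math

implied_alpha = math.erfc(1)
print(round(implied_alpha, 3))  # 0.157
```

So one-at-a-time AIC selection is stepwise selection with α ≈ 0.157 — slightly more sensible than 0.05, but stepwise nonetheless.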
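The effective sample size formulas in Notes 8 and 9 are simple to apply. The helper names below are illustrative, not from the book's software:

```python
# Notes 8 and 9: effective sample size for a binary response, and the
# relative efficiency of a k-level ordinal response under proportional odds.

def binary_effective_n(n1, n2):
    """3 * n1 * n2 / n, where n1, n2 are the two response-level frequencies."""
    return 3 * n1 * n2 / (n1 + n2)

def ordinal_relative_efficiency(k):
    """Efficiency of k equal-size response categories vs. a continuous response."""
    return 1 - 1 / k ** 2

# 1000 subjects with 100 events carry the information of only ~270 subjects
print(binary_effective_n(100, 900))    # 270.0
# a five-level ordinal response retains 96% of the continuous-response efficiency
print(ordinal_relative_efficiency(5))  # 0.96
```

This is why dichotomizing a response is so costly, while even a modest number of well-populated ordinal categories loses little.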
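The idea behind Note 17's summary variable can be sketched as follows. This is a hypothetical Python analogue, not a port of Hmisc's `score.binary` or its API: given binary conditions ordered from least to most severe, it records the rank of the worst condition present.

```python
# Hypothetical analogue of summarizing a series of binary conditions
# (Note 17): conditions are listed in increasing order of severity, and
# the summary is the rank of the most severe condition present (0 if none).

def score_binary(conditions):
    """conditions: sequence of 0/1 indicators in increasing order of severity."""
    score = 0
    for rank, present in enumerate(conditions, start=1):
        if present:
            score = rank  # keep the most severe condition seen so far
    return score

print(score_binary([1, 0, 1, 0]))  # 3: third condition is the worst present
print(score_binary([0, 0, 0, 0]))  # 0: no conditions present
```

Collapsing a series of correlated indicators into one ordinal score in this way spends one degree of freedom instead of one per condition.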
References
D. G. Altman and P. K. Andersen. Bootstrap investigation of the stability of a Cox regression model. Stat Med, 8:771–783, 1989.
A. C. Atkinson. A note on the generalized information criterion for choice of a model. Biometrika, 67:413–418, 1980.
P. C. Austin. Bootstrap model selection had similar performance for selecting authentic and noise variables compared to backward variable elimination: a simulation study. J Clin Epi, 61:1009–1017, 2008.
D. A. Belsley. Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, New York, 1991.
D. A. Belsley, E. Kuh, and R. E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, New York, 1980.
J. K. Benedetti, P. Liu, H. N. Sather, J. Seinfeld, and M. A. Epton. Effective sample size for tests of censored survival data. Biometrika, 69:343–349, 1982.
L. Breiman. The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error. J Am Stat Assoc, 87:738–754, 1992.
L. Breiman and J. H. Friedman. Estimating optimal transformations for multiple regression and correlation (with discussion). J Am Stat Assoc, 80:580–619, 1985.
K. P. Burnham and D. R. Anderson. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer, 2nd edition, Dec. 2003.
C. Chatfield. Avoiding statistical pitfalls (with discussion). Statistical Sci, 6:240–268, 1991.
C. Chatfield. Model uncertainty, data mining and statistical inference (with discussion). J Roy Stat Soc A, 158:419–466, 1995.
S. Chatterjee and A. S. Hadi. Regression Analysis by Example. Wiley, New York, fifth edition, 2012.
F. Chiaromonte, R. D. Cook, and B. Li. Sufficient dimension reduction in regressions with categorical predictors. Appl Stat, 30:475–497, 2002.
A. Ciampi, J. Thiffault, J. P. Nakache, and B. Asselain. Stratification by stepwise regression, correspondence analysis and recursive partition. Comp Stat Data Analysis, 1986:185–204, 1986.
N. R. Cook. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation, 115:928–935, 2007.
R. D. Cook. Fisher Lecture: Dimension reduction in regression. Statistical Sci, 22:1–26, 2007.
R. D. Cook and L. Forzani. Principal fitted components for dimension reduction in regression. Statistical Sci, 23(4):485–501, 2008.
J. B. Copas. Regression, prediction and shrinkage (with discussion). J Roy Stat Soc B, 45:311–354, 1983.
J. B. Copas and T. Long. Estimating the residual variance in orthogonal regression with variable selection. The Statistician, 40:51–59, 1991.
N. J. Crichton and J. P. Hinde. Correspondence analysis as a screening method for indicants for clinical diagnosis. Stat Med, 8:1351–1362, 1989.
E. E. Cureton and R. B. D’Agostino. Factor Analysis, An Applied Approach. Erlbaum, Hillsdale, NJ, 1983.
R. B. D’Agostino, A. J. Belanger, E. W. Markson, M. Kelly-Hayes, and P. A. Wolf. Development of health risk appraisal functions in the presence of multiple indicators: The Framingham Study nursing home institutionalization model. Stat Med, 14:1757–1770, 1995.
C. E. Davis, J. E. Hyde, S. I. Bangdiwala, and J. J. Nelson. An example of dependencies among variables in a conditional logistic regression. In S. H. Moolgavkar and R. L. Prentice, editors, Modern Statistical Methods in Chronic Disease Epi, pages 140–147. Wiley, New York, 1986.
A. C. Davison and D. V. Hinkley. Bootstrap Methods and Their Application. Cambridge University Press, Cambridge, 1997.
J. de Leeuw and P. Mair. Gifi methods for optimal scaling in R: The package homals. J Stat Software, 31(4):1–21, Aug. 2009.
S. Derksen and H. J. Keselman. Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British J Math Stat Psych, 45:265–282, 1992.
B. Efron. Estimating the error rate of a prediction rule: Improvement on cross-validation. J Am Stat Assoc, 78:316–331, 1983.
B. Efron. How biased is the apparent error rate of a prediction rule? J Am Stat Assoc, 81:461–470, 1986.
B. Efron and C. Morris. Stein’s paradox in statistics. Sci Am, 236(5):119–127, 1977.
B. Efron and R. Tibshirani. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Sci, 1:54–77, 1986.
B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, New York, 1993.
J. J. Faraway. The cost of data analysis. J Comp Graph Stat, 1:213–229, 1992.
L. Ferré. Determining the dimension in sliced inverse regression and related methods. J Am Stat Assoc, 93:132–149, 1998.
J. H. Friedman. A variable span smoother. Technical Report 5, Laboratory for Computational Statistics, Department of Statistics, Stanford University, 1984.
L. Friedman and M. Wall. Graphical views of suppression and multicollinearity in multiple linear regression. Am Statistician, 59:127–136, 2005.
J. H. Giudice, J. R. Fieberg, and M. S. Lenarz. Spending degrees of freedom in a poor economy: A case study of building a sightability model for moose in northeastern Minnesota. J Wildlife Manage, 2011.
S. A. Glantz and B. K. Slinker. Primer of Applied Regression and Analysis of Variance. McGraw-Hill, New York, 1990.
H. H. H. Göring, J. D. Terwilliger, and J. Blangero. Large upward bias in estimation of locus-specific effects from genomewide scans. Am J Hum Gen, 69:1357–1369, 2001.
P. M. Grambsch and P. C. O’Brien. The effects of transformations and preliminary tests for non-linearity in regression. Stat Med, 10:697–709, 1991.
R. J. Gray. Flexible methods for analyzing survival data using splines, with applications to breast cancer prognosis. J Am Stat Assoc, 87:942–951, 1992.
M. J. Greenacre. Correspondence analysis of multivariate categorical data by weighted least-squares. Biometrika, 75:457–467, 1988.
S. Greenland. When should epidemiologic regressions use random coefficients? Biometrics, 56:915–921, 2000.
J. Guo, G. James, E. Levina, G. Michailidis, and J. Zhu. Principal component analysis with sparse fused loadings. J Comp Graph Stat, 19(4):930–946, 2011.
P. Hall and H. Miller. Using generalized correlation to effect variable selection in very high dimensional problems. J Comp Graph Stat, 18(3):533–550, 2009.
F. E. Harrell. The LOGIST Procedure. In SUGI Supplemental Library Users Guide, pages 269–293. SAS Institute, Inc., Cary, NC, Version 5 edition, 1986.
F. E. Harrell, K. L. Lee, R. M. Califf, D. B. Pryor, and R. A. Rosati. Regression modeling strategies for improved prognostic prediction. Stat Med, 3:143–152, 1984.
F. E. Harrell, K. L. Lee, and D. B. Mark. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med, 15:361–387, 1996.
F. E. Harrell, K. L. Lee, D. B. Matchar, and T. A. Reichert. Regression models for prognostic prediction: Advantages, problems, and suggested solutions. Ca Trt Rep, 69:1071–1077, 1985.
F. E. Harrell, P. A. Margolis, S. Gove, K. E. Mason, E. K. Mulholland, D. Lehmann, L. Muhe, S. Gatchalian, and H. F. Eichenwald. Development of a clinical prediction model for an ordinal outcome: The World Health Organization ARI Multicentre Study of clinical signs and etiologic agents of pneumonia, sepsis, and meningitis in young infants. Stat Med, 17:909–944, 1998.
T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman & Hall/CRC, Boca Raton, FL, 1990. ISBN 9780412343902.
X. He and L. Shen. Linear regression after spline transformation. Biometrika, 84:474–481, 1997.
J. Hilden and T. A. Gerds. A note on the evaluation of novel biomarkers: do not rely on integrated discrimination improvement and net reclassification index. Statist. Med., 33(19):3405–3414, Aug. 2014.
W. Hoeffding. A non-parametric test of independence. Ann Math Stat, 19:546–557, 1948.
C. M. Hurvich and C. L. Tsai. The impact of model selection on inference in linear regression. Am Statistician, 44:214–217, 1990.
J. E. Jackson. A User’s Guide to Principal Components. Wiley, New York, 1991.
I. T. Jolliffe. Discarding variables in a principal component analysis. I. Artificial data. Appl Stat, 21:160–173, 1972.
I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, second edition, 2010.
R. E. Kass and A. E. Raftery. Bayes factors. J Am Stat Assoc, 90:773–795, 1995.
H. J. Keselman, J. Algina, R. K. Kowalchuk, and R. D. Wolfinger. A comparison of two approaches for selecting covariance structures in the analysis of repeated measurements. Comm Stat - Sim Comp, 27:591–604, 1998.
W. A. Knaus, F. E. Harrell, J. Lynn, L. Goldman, R. S. Phillips, A. F. Connors, N. V. Dawson, W. J. Fulkerson, R. M. Califf, N. Desbiens, P. Layde, R. K. Oye, P. E. Bellamy, R. B. Hakim, and D. P. Wagner. The SUPPORT prognostic model: Objective estimates of survival for seriously ill hospitalized adults. Ann Int Med, 122:191–203, 1995.
W. F. Kuhfeld. The PRINQUAL procedure. In SAS/STAT 9.2 User’s Guide. SAS Publishing, Cary, NC, second edition, 2009.
J. F. Lawless and K. Singhal. Efficient screening of nonnormal regression models. Biometrics, 34:318–327, 1978.
S. le Cessie and J. C. van Houwelingen. Ridge estimators in logistic regression. Appl Stat, 41:191–201, 1992.
M. LeBlanc and R. Tibshirani. Adaptive principal surfaces. J Am Stat Assoc, 89:53–64, 1994.
A. Leclerc, D. Luce, F. Lert, J. F. Chastang, and P. Logeay. Correspondence analysis and logistic modelling: Complementary use in the analysis of a health survey among nurses. Stat Med, 7:983–995, 1988.
S. Lee, J. Z. Huang, and J. Hu. Sparse logistic principal components analysis for binary data. Ann Appl Stat, 4(3):1579–1601, 2010.
K. Li, J. Wang, and C. Chen. Dimension reduction for censored regression data. Ann Stat, 27:1–23, 1999.
K. C. Li. Sliced inverse regression for dimension reduction. J Am Stat Assoc, 86:316–327, 1991.
R. Lockhart, J. Taylor, R. J. Tibshirani, and R. Tibshirani. A significance test for the lasso. Technical report, arXiv, 2013.
X. Luo, L. A. Stefanski, and D. D. Boos. Tuning variable selection procedures by adding noise. Technometrics, 48:165–175, 2006.
N. Mantel. Why stepdown procedures in variable selection. Technometrics, 12:621–625, 1970.
G. Marshall, F. L. Grover, W. G. Henderson, and K. E. Hammermeister. Assessment of predictive models for binary outcomes: An empirical approach using operative death from cardiac surgery. Stat Med, 13:1501–1511, 1994.
J. M. Massaro. Battery Reduction. 2005.
G. P. McCabe. Principal variables. Technometrics, 26:137–144, 1984.
N. Meinshausen. Hierarchical testing of variable importance. Biometrika, 95(2):265–278, 2008.
G. Michailidis and J. de Leeuw. The Gifi system of descriptive multivariate analysis. Statistical Sci, 13:307–336, 1998.
R. H. Myers. Classical and Modern Regression with Applications. PWS-Kent, Boston, 1990.
T. G. Nick and J. M. Hardin. Regression modeling strategies: An illustrative case study from medical rehabilitation outcomes research. Am J Occ Ther, 53:459–470, 1999.
P. Peduzzi, J. Concato, A. R. Feinstein, and T. R. Holford. Importance of events per independent variable in proportional hazards regression analysis. II. Accuracy and precision of regression estimates. J Clin Epi, 48:1503–1510, 1995.
P. Peduzzi, J. Concato, E. Kemper, T. R. Holford, and A. R. Feinstein. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epi, 49:1373–1379, 1996.
N. Peek, D. G. T. Arts, R. J. Bosman, P. H. J. van der Voort, and N. F. de Keizer. External validation of prognostic models for critically ill patients required substantial sample sizes. J Clin Epi, 60:491–501, 2007.
M. J. Pencina, R. B. D’Agostino, and O. V. Demler. Novel metrics for evaluating improvement in discrimination: net reclassification and integrated discrimination improvement for normal variables and nested models. Stat Med, 31(2):101–113, 2012.
M. J. Pencina, R. B. D’Agostino, and E. W. Steyerberg. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med, 30:11–21, 2011.
M. J. Pencina, R. B. D’Agostino Sr, R. B. D’Agostino Jr, and R. S. Vasan. Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Stat Med, 27:157–172, 2008.
A. N. Phillips, S. G. Thompson, and S. J. Pocock. Prognostic scores for detecting a high risk group: Estimating the sensitivity when applied to new data. Stat Med, 9:1189–1198, 1990.
E. B. Roecker. Prediction error and its estimation for subset-selected models. Technometrics, 33:459–468, 1991.
W. Sarle. The VARCLUS procedure. In SAS/STAT User’s Guide, volume 2, chapter 43, pages 1641–1659. SAS Institute, Inc., Cary, NC, fourth edition, 1990.
W. Sauerbrei and M. Schumacher. A bootstrap resampling procedure for model building: Application to the Cox regression model. Stat Med, 11:2093–2109, 1992.
J. Shao. Linear model selection by cross-validation. J Am Stat Assoc, 88:486–494, 1993.
X. Shen, H. Huang, and J. Ye. Inference after model selection. J Am Stat Assoc, 99:751–762, 2004.
L. R. Smith, F. E. Harrell, and L. H. Muhlbaier. Problems and potentials in modeling survival. In M. L. Grady and H. A. Schwartz, editors, Medical Effectiveness Research Data Methods (Summary Report), AHCPR Pub. No. 92-0056, pages 151–159. US Dept. of Health and Human Services, Agency for Health Care Policy and Research, Rockville, MD, 1992.
I. Spence and R. F. Garrison. A remarkable scatterplot. Am Statistician, 47:12–19, 1993.
D. J. Spiegelhalter. Probabilistic prediction in patient management and clinical trials. Stat Med, 5:421–433, 1986.
E. W. Steyerberg, M. J. C. Eijkemans, F. E. Harrell, and J. D. F. Habbema. Prognostic modelling with logistic regression analysis: A comparison of selection and estimation methods in small data sets. Stat Med, 19:1059–1079, 2000.
E. W. Steyerberg, M. J. C. Eijkemans, F. E. Harrell, and J. D. F. Habbema. Prognostic modeling with logistic regression analysis: In search of a sensible strategy in small data sets. Med Decis Mak, 21:45–56, 2001.
E. W. Steyerberg, A. J. Vickers, N. R. Cook, T. Gerds, M. Gonen, N. Obuchowski, M. J. Pencina, and M. W. Kattan. Assessing the performance of prediction models: a framework for traditional and novel measures. Epi (Cambridge, Mass.), 21(1):128–138, Jan. 2010.
G. Sun, T. L. Shook, and G. L. Kay. Inappropriate use of bivariable analysis to screen risk factors for use in multivariable analysis. J Clin Epi, 49:907–916, 1996.
J. M. G. Taylor, A. L. Siqueira, and R. E. Weiss. The cost of adding parameters to a model. J Roy Stat Soc B, 58:593–607, 1996.
R. Tibshirani. Regression shrinkage and selection via the lasso. J Roy Stat Soc B, 58:267–288, 1996.
R. Tibshirani. The lasso method for variable selection in the Cox model. Stat Med, 16:385–395, 1997.
T. van der Ploeg, P. C. Austin, and E. W. Steyerberg. Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints. BMC Medical Research Methodology, 14(1):137, Dec. 2014.
H. C. van Houwelingen and J. Thorogood. Construction, validation and updating of a prognostic model for kidney graft survival. Stat Med, 14:1999–2008, 1995.
J. C. van Houwelingen and S. le Cessie. Predictive value of statistical models. Stat Med, 9:1303–1325, 1990.
W. N. Venables and B. D. Ripley. Modern Applied Statistics with S-Plus. Springer-Verlag, New York, third edition, 1999.
P. Verweij and H. C. van Houwelingen. Penalized likelihood in Cox regression. Stat Med, 13:2427–2436, 1994.
P. J. M. Verweij and H. C. van Houwelingen. Cross-validation in survival analysis. Stat Med, 12:2305–2314, 1993.
S. K. Vines. Simple principal components. Appl Stat, 49:441–451, 2000.
E. Vittinghoff and C. E. McCulloch. Relaxing the rule of ten events per variable in logistic and Cox regression. Am J Epi, 165:710–718, 2006.
A. Wang and E. A. Gehan. Gene selection for microarray data analysis using principal component analysis. Stat Med, 24:2069–2087, 2005.
Y. Wax. Collinearity diagnosis for a relative risk regression analysis: An application to assessment of diet-cancer relationship in epidemiological studies. Stat Med, 11:1273–1287, 1992.
R. E. Weiss. The influence of variable selection: A Bayesian diagnostic perspective. J Am Stat Assoc, 90:619–625, 1995.
J. Whitehead. Sample size calculations for ordered categorical data. Stat Med, 12:2257–2271, 1993. See letter to the editor in Stat Med 15:1065–1066 for the binary case; see errata in Stat Med 13:871, 1994.
R. E. Wiegand. Performance of using multiple stepwise algorithms for variable selection. Stat Med, 29:1647–1659, 2010.
S. N. Wood. Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC, Boca Raton, FL, 2006. ISBN 9781584884743.
F. W. Young, Y. Takane, and J. de Leeuw. The principal components of mixed measurement level multivariate data: An alternating least squares method with optimal scaling features. Psychometrika, 43:279–281, 1978.
H. Zhou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. J Comp Graph Stat, 15:265–286, 2006.
© 2015 Springer International Publishing Switzerland

Harrell, F.E. (2015). Multivariable Modeling Strategies. In: Regression Modeling Strategies. Springer Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-19425-7_4