Focus on: Contemporary Methods in Biostatistics (I)
Regression Modeling Strategies

https://doi.org/10.1016/j.rec.2011.01.017

Abstract

Multivariable regression models are widely used in health science research, mainly for two purposes: prediction and effect estimation. Various strategies have been recommended when building a regression model: a) use the right statistical method that matches the structure of the data; b) ensure an appropriate sample size by limiting the number of variables according to the number of events; c) prevent or correct for model overfitting; d) be aware of the problems associated with automatic variable selection procedures (such as stepwise), and e) always assess the performance of the final model in regard to calibration and discrimination measures. If resources allow, validate the prediction model on external data.

Resumen (English translation)

Multivariable regression models are now an important part of the clinical research toolkit, whether for building prognostic scores or for hypothesis-generating research. When building these models, one should: a) use a statistical technique appropriate to the type of data available; b) keep the ratio of events to variables at 10:1 or higher to avoid overfitting the model, a ratio that can be regarded as a rough measure of statistical power; c) bear in mind the drawbacks of automatic variable selection procedures, and d) evaluate the final model in terms of calibration and discrimination. For prediction models, these same measures should, as far as possible, be evaluated in a different population.
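As a rough illustration of the 10:1 guideline, the arithmetic can be sketched in Python. The function name and the default of 10 events per variable are illustrative assumptions, not something the article prescribes.

```python
def max_candidate_predictors(n_events: int, events_per_variable: int = 10) -> int:
    """Return the largest number of candidate predictors allowed by an
    events-per-variable rule of thumb (hypothetical helper)."""
    if n_events < 0:
        raise ValueError("n_events must be non-negative")
    if events_per_variable <= 0:
        raise ValueError("events_per_variable must be positive")
    # Integer division: a partial "slot" does not justify another predictor.
    return n_events // events_per_variable


# Example: a cohort with 85 observed events.
allowed = max_candidate_predictors(85)
print(allowed)
```

With 85 events the sketch allows 8 candidate predictors; a stricter rule can be tried by raising `events_per_variable`.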

Introduction

Multivariable regression models are widely used in health science research. Data are frequently collected to investigate interrelationships among variables or to determine factors affecting an outcome of interest. Here, multivariable regression models serve as a tool for finding a simplified mathematical description of the relationship between the candidate predictors and the outcome. The ultimate goal is to derive a parsimonious model that makes sense from the subject matter point of view, closely matches the observed data, and yields valid predictions on independent data.

Advances in statistical software have made these packages more user-friendly, so more researchers with a limited background in biostatistics are now engaged in data analysis. The goal of this review is therefore to provide practical advice on how to build a parsimonious and more effective multivariable model. The overall steps in any regression modeling exercise are listed in Table 1. Due to limited space, only the most practical points are presented.

Section snippets

Data structure and type of regression analysis

Regression models share a general form that should be familiar to most, usually: response = weight1 × predictor1 + weight2 × predictor2 + … + weightk × predictork + normal error term. The variable to be explained is called the dependent (or response) variable. When the dependent variable is binary, the medical literature refers to it as an outcome (or endpoint). The factors that explain the dependent variable are called independent variables, which encompass the variable of interest (or explanatory variable)
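The general form above can be made concrete with a minimal Python sketch. It simulates a response from two predictors plus a normal error term and recovers the weights by ordinary least squares. The simulated data, the chosen weights, and the helper name `fit_weights` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (assumed for illustration): two predictors, 200 subjects.
n = 200
X = rng.normal(size=(n, 2))
true_weights = np.array([1.5, -0.7])

# response = weight1 * predictor1 + weight2 * predictor2 + normal error term
y = X @ true_weights + rng.normal(scale=0.5, size=n)


def fit_weights(predictors: np.ndarray, response: np.ndarray) -> np.ndarray:
    """Estimate the weights by ordinary least squares (hypothetical helper)."""
    weights, *_ = np.linalg.lstsq(predictors, response, rcond=None)
    return weights


estimated = fit_weights(X, y)
print(estimated)
```

The sketch omits an intercept to mirror the form as written; adding a column of ones to `X` would estimate one.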

Data manipulation

Not infrequently, the data require clean-up before fitting the model. Three important areas need to be considered here:

  • 1. Missing data. This is a ubiquitous problem in health science research. Three types of missingness mechanisms have been distinguished [19]: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR) (Table 3). Multiple imputation was developed for dealing with missing data under MAR and MCAR assumptions by replacing missing values with a set of

Model building strategies

Variable selection is a crucial step in the process of model creation (Table 1). Including the right variables in the model is a process heavily influenced by the prespecified balance between complexity and parsimony (Table 3). Predictive models should include those variables that reflect the pattern in the population from which the sample was drawn. Here, what matters is the information that the model as a whole represents. For effect estimation, however, a fitted model that reflects the

Final model evaluation

Central to the idea of building a regression model is the question of model performance assessment. A number of model performance metrics have been proposed, and they can be grouped into two main categories: calibration and discrimination measures (Table 1, Table 3). Regardless of the aim for which the model was created, these two performance measures should be derived from the data that gave rise to the model or, better still, through bootstrap resampling (known as internal validation).
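To make the two categories concrete, here is a minimal Python sketch of one discrimination measure (the c-statistic) and one simple calibration measure (calibration-in-the-large, the mean predicted risk minus the observed event rate). The choice of these two measures, the function names, and the toy data are assumptions for illustration. The bootstrap step is not shown, since it would require refitting the model in each resample.

```python
import numpy as np


def c_statistic(y_true, y_prob) -> float:
    """Proportion of event/non-event pairs in which the event received the
    higher predicted risk; ties count as one half."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    events = y_prob[y_true == 1]
    non_events = y_prob[y_true == 0]
    if events.size == 0 or non_events.size == 0:
        raise ValueError("need at least one event and one non-event")
    # Compare every event with every non-event.
    diff = events[:, None] - non_events[None, :]
    concordant = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(concordant / diff.size)


def calibration_in_the_large(y_true, y_prob) -> float:
    """Mean predicted risk minus observed event rate (0 means agreement)."""
    return float(np.mean(y_prob) - np.mean(y_true))


# Toy data (assumed): observed outcomes and predicted risks for 4 subjects.
observed = [0, 0, 1, 1]
predicted = [0.10, 0.40, 0.35, 0.80]

print(c_statistic(observed, predicted))
print(calibration_in_the_large(observed, predicted))
```

A c-statistic of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation; a negative calibration-in-the-large value means that the predicted risks are, on average, lower than the observed event rate.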

Results presentation

The final consideration in the process of model creation is how the estimated parameters will be presented. When comparing two groups on a binary outcome, statistical packages commonly express the effect size of the explanatory variable in relative metrics. For logistic and Cox regressions, the OR and HR are the traditional metrics to indicate the degree of association found between a factor and the outcome. Because these are a ratio of proportions, the information conveyed about their
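As a small illustration of relative versus absolute metrics, the following Python sketch computes the odds ratio, risk ratio, and risk difference from a 2 × 2 table. The function name and the counts are hypothetical and are not taken from the article.

```python
def effect_measures(events_exposed: int, n_exposed: int,
                    events_unexposed: int, n_unexposed: int) -> dict:
    """Relative and absolute effect measures from a 2 x 2 table
    (hypothetical helper)."""
    risk_exposed = events_exposed / n_exposed
    risk_unexposed = events_unexposed / n_unexposed
    if not (0 < risk_exposed < 1 and 0 < risk_unexposed < 1):
        raise ValueError("risks must lie strictly between 0 and 1")
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return {
        "odds_ratio": odds_exposed / odds_unexposed,        # relative
        "risk_ratio": risk_exposed / risk_unexposed,        # relative
        "risk_difference": risk_exposed - risk_unexposed,   # absolute
    }


# Hypothetical counts: 20/100 events in one group, 10/100 in the other.
measures = effect_measures(20, 100, 10, 100)
print(measures)
```

The same odds ratio can correspond to very different risk differences depending on the baseline risk, which is one reason to report an absolute measure alongside the relative one.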

Conflicts of interest

None declared.

References (45)

  • I. Annesi et al. Efficiency of the logistic regression and Cox proportional hazards models in longitudinal studies. Stat Med (1989)
  • T. Martinussen et al. Dynamic regression models for survival data. (2006)
  • P.C. Lambert et al. Further development of flexible parametric models for survival analysis. Stata J (2009)
  • M. Pintilie. Competing risks: a practical perspective. (2007)
  • J.P. Fine et al. A proportional hazard model for the subdistribution of a competing risk. J Am Stat Assoc (1999)
  • J.G. Ibrahim et al. Basic concepts and methods for joint models of longitudinal and survival data. J Clin Oncol (2010)
  • D. Rizopoulos. Joint modelling of longitudinal and time-to-event data: challenges and future directions.
  • G. Touloumi et al. A comparison of two methods for the estimation of precision with incomplete longitudinal data, jointly modelled with a time-to-event outcome. Stat Med (2003)
  • G. Touloumi et al. Impact of missing data due to selective dropouts in cohort studies and clinical trials. Epidemiology (2002)
  • D. Rizopoulos. JM: An R package for the joint modelling of longitudinal and time-to-event data. J Stat Soft (2010)
  • N. Pantazis et al. Analyzing longitudinal data in the presence of informative drop-out: The jmre1 command. Stata J (2010)
  • L. Meira-Machado et al. Multi-state models for the analysis of time-to-event data. Stat Methods Med Res (2009)