Abstract
The measurement of many human traits, states, and disorders begins with a set of items on a questionnaire. The response format for these items is often binary (e.g., yes/no) or ordered (e.g., high, medium, or low). During data analysis, the items are frequently summed or used to estimate factor scores. In clinical applications, the resulting scores are often non-normally distributed in the general population because many respondents are unaffected and therefore asymptomatic. As a result, these measures frequently violate the statistical assumptions required for subsequent analyses. To reduce the influence of non-normality and of quasi-continuous measurement, variables are frequently recoded into binary (affected–unaffected) or ordinal (mild–moderate–severe) diagnoses. Ordinal data therefore present challenges at multiple levels of analysis. Categorizing continuous variables into ordered categories typically results in a loss of statistical power, which gives the data analyst an incentive to assume that the data are normally distributed even when they are not. Despite the conventional wisdom that, for example, variables with more than 10 ordered categories may be analyzed as if they were continuous, we show via simulation studies that this is not generally the case. In particular, using Pearson product-moment correlations instead of maximum likelihood estimates of polychoric correlations biases the estimated correlations towards zero. This bias is especially severe when a plurality of the observations fall into a single observed category, such as a score of zero. By contrast, estimating the ordinal correlation by maximum likelihood yields no estimation bias, although standard errors are (appropriately) larger. We also illustrate how odds ratios depend critically on the proportion or prevalence of affected individuals in the population, and are therefore sub-optimal for studies that need to compare association metrics.
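The attenuation described above can be illustrated with a small simulation. The following is a minimal sketch, not the simulation code used in the article: it assumes a liability-threshold model with a true liability correlation of 0.5 and a prevalence of 10% (both values chosen only for illustration), and all function names are hypothetical. It compares the Pearson correlation of the 0/1 diagnoses with a maximum likelihood tetrachoric estimate, and computes the population odds ratio at a given prevalence. It uses only the Python standard library; the bivariate normal probability is obtained by Simpson's rule rather than a library routine.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def both_above_prob(t1, t2, rho, steps=200):
    """P(X > t1, Y > t2) for standard bivariate normal liabilities with
    correlation rho, integrating P(Y > t2 | X = x) over x by Simpson's rule."""
    upper = t1 + 8.0  # far enough into the tail to stand in for +infinity
    scale = math.sqrt(1.0 - rho * rho)

    def integrand(x):
        return STD_NORMAL.pdf(x) * (1.0 - STD_NORMAL.cdf((t2 - rho * x) / scale))

    h = (upper - t1) / steps
    total = integrand(t1) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(t1 + i * h)
    return total * h / 3.0


def simulate_diagnoses(n, rho, prevalence, seed=1):
    """Draw n pairs of correlated liabilities and dichotomize each at the
    threshold that leaves `prevalence` of the population affected."""
    rng = random.Random(seed)
    threshold = STD_NORMAL.inv_cdf(1.0 - prevalence)
    scale = math.sqrt(1.0 - rho * rho)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + scale * rng.gauss(0.0, 1.0)
        pairs.append((int(x > threshold), int(y > threshold)))
    return pairs


def pearson(pairs):
    """Pearson product-moment correlation, treating the 0/1 codes as continuous."""
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    cov = sum((a - mx) * (b - my) for a, b in pairs) / n
    vx = sum((a - mx) ** 2 for a, _ in pairs) / n
    vy = sum((b - my) ** 2 for _, b in pairs) / n
    return cov / math.sqrt(vx * vy)


def tetrachoric_ml(pairs):
    """Tetrachoric correlation: thresholds from the margins, then rho chosen
    to maximize the multinomial likelihood of the 2x2 table. For a 2x2 table
    the model is saturated, so this matches the full ML solution."""
    n = len(pairs)
    n11 = sum(1 for a, b in pairs if a and b)
    n10 = sum(1 for a, b in pairs if a and not b)
    n01 = sum(1 for a, b in pairs if b and not a)
    n00 = n - n11 - n10 - n01
    p_x, p_y = (n11 + n10) / n, (n11 + n01) / n
    t_x, t_y = STD_NORMAL.inv_cdf(1.0 - p_x), STD_NORMAL.inv_cdf(1.0 - p_y)

    def neg_loglik(rho):
        p11 = both_above_prob(t_x, t_y, rho)
        cells = [(n11, p11), (n10, p_x - p11), (n01, p_y - p11),
                 (n00, 1.0 - p_x - p_y + p11)]
        return -sum(c * math.log(max(p, 1e-12)) for c, p in cells)

    # Golden-section search for the minimum over a bounded range of rho.
    lo, hi = -0.95, 0.95
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(40):
        a = hi - ratio * (hi - lo)
        b = lo + ratio * (hi - lo)
        if neg_loglik(a) < neg_loglik(b):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2.0


def population_odds_ratio(prevalence, rho):
    """Odds ratio of the 2x2 table implied by a liability correlation rho
    when both variables are cut at the same prevalence."""
    t = STD_NORMAL.inv_cdf(1.0 - prevalence)
    p11 = both_above_prob(t, t, rho)
    p10 = prevalence - p11
    p00 = 1.0 - 2.0 * prevalence + p11
    return (p11 * p00) / (p10 * p10)


if __name__ == "__main__":
    data = simulate_diagnoses(n=20000, rho=0.5, prevalence=0.1)
    print("Pearson on binary codes:", round(pearson(data), 3))
    print("Tetrachoric (ML):", round(tetrachoric_ml(data), 3))
    for prev in (0.5, 0.1, 0.01):
        print("prevalence", prev, "odds ratio", round(population_odds_ratio(prev, 0.5), 2))
```

Under these assumptions the Pearson estimate should fall well below the generating correlation while the tetrachoric estimate should land close to it, and the odds ratio should change with prevalence even though the liability correlation is held fixed.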
Finally, we extend these analyses to the classical twin model and demonstrate that treating binary data as continuous underestimates the genetic and common environmental variance components and overestimates the unique environmental (residual) variance. These biases increase as prevalence declines. Although modeling ordinal data appropriately may be more computationally intensive and time-consuming, failing to do so will likely yield biased correlations and biased parameter estimates in models fitted to those correlations.
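The direction of the twin-model bias can be sketched with a population-level calculation. This is an illustration under stated assumptions, not the article's analysis: the true variance components (A = 0.5, C = 0.2, E = 0.3) are hypothetical, and Falconer's moment estimators stand in for the maximum likelihood twin model. The sketch computes the Pearson (phi) correlation that MZ and DZ pairs would show if liabilities were dichotomized at a given prevalence, then converts those correlations into variance component estimates.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Hypothetical true variance components on the liability scale.
TRUE_A, TRUE_C, TRUE_E = 0.5, 0.2, 0.3


def both_affected_prob(threshold, rho, steps=400):
    """P(X > t, Y > t) for standard bivariate normal liabilities with
    correlation rho, integrating P(Y > t | X = x) over x by Simpson's rule."""
    upper = threshold + 8.0  # stands in for +infinity
    scale = math.sqrt(1.0 - rho * rho)

    def integrand(x):
        tail = 1.0 - STD_NORMAL.cdf((threshold - rho * x) / scale)
        return STD_NORMAL.pdf(x) * tail

    h = (upper - threshold) / steps
    total = integrand(threshold) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(threshold + i * h)
    return total * h / 3.0


def binary_pearson(prevalence, rho):
    """Population Pearson (phi) correlation between two 0/1 diagnoses formed
    by cutting liabilities with correlation rho at the same prevalence."""
    t = STD_NORMAL.inv_cdf(1.0 - prevalence)
    p11 = both_affected_prob(t, rho)
    return (p11 - prevalence ** 2) / (prevalence * (1.0 - prevalence))


def falconer_ace(r_mz, r_dz):
    """Moment estimators of the ACE components from twin correlations."""
    a2 = 2.0 * (r_mz - r_dz)
    c2 = 2.0 * r_dz - r_mz
    e2 = 1.0 - r_mz
    return a2, c2, e2


def ace_from_binary(prevalence):
    """ACE estimates obtained when binary diagnoses are treated as continuous."""
    r_mz_liability = TRUE_A + TRUE_C        # MZ pairs share all of A and C
    r_dz_liability = 0.5 * TRUE_A + TRUE_C  # DZ pairs share half of A, all of C
    r_mz = binary_pearson(prevalence, r_mz_liability)
    r_dz = binary_pearson(prevalence, r_dz_liability)
    return falconer_ace(r_mz, r_dz)


if __name__ == "__main__":
    print("true  A, C, E:", (TRUE_A, TRUE_C, TRUE_E))
    for prev in (0.5, 0.2, 0.05):
        a2, c2, e2 = ace_from_binary(prev)
        print("prevalence", prev, "A, C, E:", round(a2, 3), round(c2, 3), round(e2, 3))
```

Because the calculation works with population probabilities rather than simulated samples, any departure from the true components reflects the choice of correlation and not sampling error.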
Ethics declarations
Funding
This study was supported by NIDA grants R01-DA018673 and R01-DA049867.
Conflict of interest
Brad Verhulst and Michael C. Neale declare that they have no conflicts of interest related to the publication of this article.
Ethical Approval
This article does not contain any studies with human participants or animal subjects performed by any of the authors.
We would like to express our deepest gratitude to an anonymous reviewer and to Professor Conor Dolan for their invaluable comments as reviewers of this manuscript. Their outstanding critiques improved the overall quality of the manuscript, and Professor Dolan also provided an initial draft of the R code for the fourth simulation study.
About this article
Cite this article
Verhulst, B., Neale, M.C. Best Practices for Binary and Ordinal Data Analyses. Behav Genet 51, 204–214 (2021). https://doi.org/10.1007/s10519-020-10031-x
DOI: https://doi.org/10.1007/s10519-020-10031-x