Best Practices for Binary and Ordinal Data Analyses

Verhulst, Brad; Neale, Michael C.

doi:10.1007/s10519-020-10031-x

Original Research
Published: 05 January 2021

Volume 51, pages 204–214, (2021)

Behavior Genetics

Abstract

The measurement of many human traits, states, and disorders begins with a set of items on a questionnaire. The response format for these questions is often simply binary (e.g., yes/no) or ordered (e.g., high, medium or low). During data analysis, these items are frequently summed or used to estimate factor scores. In clinical applications, such assessments are often non-normally distributed in the general population because many respondents are unaffected, and therefore asymptomatic. As a result, in many cases these measures violate the statistical assumptions required for subsequent analyses. To reduce the influence of non-normality and quasi-continuous assessment, variables are frequently recoded into binary (affected–unaffected) or ordinal (mild–moderate–severe) diagnoses. Ordinal data therefore present challenges at multiple levels of analysis. Categorizing continuous variables into ordered categories typically results in a loss of statistical power, which gives the data analyst an incentive to assume that the data are normally distributed, even when they are not. Despite the long-standing conventional wisdom that variables with more than 10 ordered categories may be regarded as continuous and analyzed as such, we show via simulation studies that this is not generally the case. In particular, using Pearson product-moment correlations instead of maximum likelihood estimates of polychoric correlations biases the estimated correlations towards zero. This bias is especially severe when a plurality of the observations fall into a single observed category, such as a score of zero. By contrast, estimating the ordinal correlation by maximum likelihood yields no estimation bias, although standard errors are (appropriately) larger. We also illustrate how odds ratios depend critically on the proportion or prevalence of affected individuals in the population, and therefore are sub-optimal for studies where comparisons of association metrics are needed.
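The attenuation of the Pearson correlation and the prevalence dependence of the odds ratio described above can be illustrated with a minimal sketch. It is not the article's simulation code: it works with population cell probabilities rather than simulated samples, and it assumes two standard normal liabilities with a hypothetical latent correlation of 0.5 (`TRUE_RHO`), both dichotomized at the same threshold. All function names are illustrative. The tetrachoric value is obtained by solving for the latent correlation that reproduces the 2×2 table; because the tetrachoric model is saturated for a single 2×2 table, this should coincide with the maximum likelihood estimate for that table.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def both_above(rho, t1, t2, steps=2000, upper=8.0):
    """P(X > t1, Y > t2) for standard bivariate normal liabilities with correlation rho.

    Integrates pdf(x) * P(Y > t2 | X = x) over x from t1 to `upper` with
    Simpson's rule; the probability mass beyond `upper` is negligible.
    """
    sd = math.sqrt(1.0 - rho * rho)  # sd of Y given X

    def integrand(x):
        return STD_NORMAL.pdf(x) * (1.0 - STD_NORMAL.cdf((t2 - rho * x) / sd))

    h = (upper - t1) / steps  # `steps` must be even for Simpson's rule
    total = integrand(t1) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(t1 + i * h)
    return total * h / 3.0


def two_by_two(rho, prevalence):
    """Cell probabilities (p11, p10, p01, p00) when both liabilities are
    dichotomized at the threshold implied by `prevalence`."""
    t = STD_NORMAL.inv_cdf(1.0 - prevalence)
    p11 = both_above(rho, t, t)
    p10 = prevalence - p11
    return p11, p10, p10, 1.0 - p11 - 2.0 * p10


def pearson_binary(p11, p10, p01, p00):
    """Pearson correlation of the two 0/1 variables (the phi coefficient)."""
    m1, m2 = p11 + p10, p11 + p01
    return (p11 - m1 * m2) / math.sqrt(m1 * (1 - m1) * m2 * (1 - m2))


def odds_ratio(p11, p10, p01, p00):
    """Cross-product ratio of the 2x2 table."""
    return (p11 * p00) / (p10 * p01)


def tetrachoric(p11, p10, p01, p00, tol=1e-7):
    """Latent correlation that reproduces the table, found by bisection."""
    t1 = STD_NORMAL.inv_cdf(1.0 - (p11 + p10))
    t2 = STD_NORMAL.inv_cdf(1.0 - (p11 + p01))
    lo, hi = -0.99, 0.99
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # P(both above) increases with rho, so move toward the observed p11.
        if both_above(mid, t1, t2) < p11:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


TRUE_RHO = 0.5  # hypothetical latent (liability) correlation
for prevalence in (0.5, 0.2, 0.05, 0.01):
    table = two_by_two(TRUE_RHO, prevalence)
    print(f"prevalence={prevalence:<5} pearson={pearson_binary(*table):.3f} "
          f"odds_ratio={odds_ratio(*table):.2f} tetrachoric={tetrachoric(*table):.3f}")
```

Under these assumptions the Pearson correlation of the binary variables stays below the latent correlation and shrinks as the prevalence falls, the odds ratio changes with prevalence even though the latent correlation is fixed, and the tetrachoric estimate stays at the latent value.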
Finally, we extend these analyses to the classical twin model and demonstrate that treating binary data as continuous will underestimate genetic and common environmental variance components, and overestimate unique environment (residual) variance. These biases increase as prevalence declines. While modeling ordinal data appropriately may be more computationally intensive and time consuming, failing to do so will likely yield biased correlations and, in turn, biased parameter estimates from models fitted to them.
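The twin-model bias described above can be sketched in a few lines under strong simplifying assumptions. This is not the article's model-fitting procedure: instead of fitting a structural equation model by maximum likelihood, the sketch plugs population Pearson correlations of dichotomized liabilities into Falconer's closed-form estimators (a² = 2(rMZ − rDZ), c² = 2rDZ − rMZ, e² = 1 − rMZ). The true variance components (`A2`, `C2`, `E2`) are hypothetical values chosen for illustration, and all function names are illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def both_affected(rho, t, steps=2000, upper=8.0):
    """P(X > t, Y > t) for standard bivariate normal liabilities with correlation rho.

    Simpson's rule over pdf(x) * P(Y > t | X = x) for x from t to `upper`.
    """
    sd = math.sqrt(1.0 - rho * rho)

    def integrand(x):
        return STD_NORMAL.pdf(x) * (1.0 - STD_NORMAL.cdf((t - rho * x) / sd))

    h = (upper - t) / steps  # `steps` must be even for Simpson's rule
    total = integrand(t) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(t + i * h)
    return total * h / 3.0


def pearson_binary(rho, prevalence):
    """Pearson correlation between twins' 0/1 diagnoses, given the latent
    twin correlation `rho` and a common prevalence."""
    t = STD_NORMAL.inv_cdf(1.0 - prevalence)
    p11 = both_affected(rho, t)
    return (p11 - prevalence ** 2) / (prevalence * (1.0 - prevalence))


def falconer(r_mz, r_dz):
    """Closed-form ACE estimates from MZ and DZ twin correlations."""
    a2 = 2.0 * (r_mz - r_dz)
    c2 = 2.0 * r_dz - r_mz
    e2 = 1.0 - r_mz
    return a2, c2, e2


# Hypothetical true variance components on the liability scale.
A2, C2, E2 = 0.4, 0.3, 0.3
R_MZ = A2 + C2        # latent MZ correlation
R_DZ = 0.5 * A2 + C2  # latent DZ correlation

print(f"true values:       a2={A2:.3f} c2={C2:.3f} e2={E2:.3f}")
for prevalence in (0.5, 0.2, 0.05, 0.01):
    a2, c2, e2 = falconer(pearson_binary(R_MZ, prevalence),
                          pearson_binary(R_DZ, prevalence))
    print(f"prevalence={prevalence:<5} a2={a2:.3f} c2={c2:.3f} e2={e2:.3f}")
```

With these hypothetical values, the estimates based on binary-as-continuous correlations understate a² and c² and overstate e² at every prevalence shown, and the overstatement of e² grows as the prevalence falls. The exact pattern for each component depends on the chosen values and on the estimator used.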

Author information

Authors and Affiliations

Department of Psychiatry, Texas A&M University, College Station, USA
Brad Verhulst
Virginia Commonwealth University, Richmond, USA
Michael C. Neale

Corresponding author

Correspondence to Brad Verhulst.

Ethics declarations

Funding

This study was supported by NIDA grants R01-DA018673 and R01-DA049867.

Conflict of interest

Brad Verhulst and Michael C. Neale declare that they have no conflicts of interest related to the publication of this article.

Ethical Approval

This article does not contain any studies with human participants or animal subjects performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The authors would like to express their deepest gratitude to an anonymous reviewer and to Professor Conor Dolan for their invaluable comments as reviewers of this manuscript. Not only did they provide outstanding critiques that undoubtedly improved the overall quality of the manuscript, but Professor Dolan also provided an initial draft of the R code for the fourth simulation study.

About this article

Cite this article

Verhulst, B., Neale, M.C. Best Practices for Binary and Ordinal Data Analyses. Behav Genet 51, 204–214 (2021). https://doi.org/10.1007/s10519-020-10031-x

Received: 13 June 2020
Accepted: 31 October 2020
Published: 05 January 2021
Issue Date: May 2021
DOI: https://doi.org/10.1007/s10519-020-10031-x
