Prediction models for high risk of suicide in Korean adolescents using machine learning techniques

Jun Su Jung; Sung Jin Park; Eun Young Kim; Kyoung-Sae Na; Young Jae Kim; Kwang Gi Kim

doi:10.1371/journal.pone.0217639

Abstract

Objective

Suicide in adolescents is a major problem worldwide, and a previous history of suicidal ideation or attempt is one of the strongest predictors of future suicidal behavior. The aim of this study was to develop a prediction model that identifies Korean adolescents at high risk of suicide (defined as a history of suicidal ideation or attempt in the previous year) using machine learning techniques.

Methods

A nationally representative dataset of Korea Youth Risk Behavior Web-based Survey (KYRBWS) was used (n = 59,984 of middle and high school students in 2017). The classification process was performed using machine learning techniques such as logistic regression (LR), random forest (RF), support vector machine (SVM), artificial neural network (ANN), and extreme gradient boosting (XGB).

Results

A total of 7,443 adolescents (12.4%) had a previous history of suicidal ideation/attempt. In the multivariable analysis, sadness (odds ratio [OR], 6.41; 95% confidence interval [95% CI], 6.08–6.87), violence (OR, 2.32; 95% CI, 2.01–2.67), substance use (OR, 1.93; 95% CI, 1.52–2.45), and stress (OR, 1.63; 95% CI, 1.40–1.86) were associated factors. With 26 variables as predictors, the accuracy of the machine learning models in predicting high risk of suicide was comparable with that of LR; accuracy was best for XGB (79.0%), followed by SVM (78.7%), LR (77.9%), RF (77.8%), and ANN (77.5%).

Conclusions

The machine learning techniques showed performance comparable with that of LR in classifying adolescents who have a previous history of suicidal ideation/attempt. This model will hopefully serve as a foundation for decreasing future suicides, as it enables early identification of adolescents at risk of suicide and modification of risk factors.

Citation: Jung JS, Park SJ, Kim EY, Na K-S, Kim YJ, Kim KG (2019) Prediction models for high risk of suicide in Korean adolescents using machine learning techniques. PLoS ONE 14(6): e0217639. https://doi.org/10.1371/journal.pone.0217639

Editor: Vincenzo De Luca, University of Toronto, CANADA

Received: January 10, 2019; Accepted: April 27, 2019; Published: June 6, 2019

Copyright: © 2019 Jung et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: These analyses were performed using data from the Korea Youth Risk Behavior Web-based Survey (KYRBWS) (link: http://www.cdc.go.kr/CDC/eng/main.jsp). The code used in this study has been uploaded to the following GitHub link: https://github.com/gcubme/PredictSuicide.git.

Funding: This research was supported by MD-PhD research through the Korea Research-Driven Hospital (grant 2018-5287) and Future Innovation Challenges through the Gachon University (grant 2018-0669).

Competing interests: No authors have competing interests.

Introduction

In South Korea, suicide in adolescents has been emerging as a major public health problem. The adolescent suicide rate has increased annually and is not only one of the highest but also the most rapidly increasing among Organization for Economic Cooperation and Development (OECD) countries.

Although several studies have identified risk factors for suicide [1–5], a recent meta-analysis revealed that the ability to predict suicidal behaviors has remained limited [6]. New applications of machine learning techniques are gaining attention as a way to identify suicide risk in various clinical settings [7]. Passos et al. classified individuals with a history of suicide attempt among patients with mood disorders based on demographic and clinical data [8]. Oh et al. distinguished suicide attempters from non-attempters among patients with depression or anxiety disorders by applying ANN to multiple psychiatric scales and sociodemographic data [5]. Using general characteristics and insurance data from the National Health Insurance Service cohort in Korea, one recent study analyzed the probability of death by suicide [9].

Since previous suicidal ideation/attempt is one of the strongest predictors of future suicidal behavior and death by suicide [6], it is important to identify adolescents who have a history of suicidal ideation/attempt. The purpose of this study was therefore to establish prediction models for high risk of suicide in Korean adolescents using machine learning techniques.

Materials and methods

Data collection and preparation

Data used in this study were obtained from the Korea Youth Risk Behavior Web-based Survey (KYRBWS) XIII, conducted in 2017. The KYRBWS is a self-administered online survey, and it was approved by the Institutional Review Board (Certificate Number: 11758) of the Korea Centers for Disease Control and Prevention (KCDC).

The survey is designed to assess South Korean adolescents’ health-risk behaviors such as smoking, alcohol use, obesity, physical activity, eating habits, injury prevention, mental health, sexual behaviors, oral health, allergic disorders, personal hygiene, internet addiction, and health equity. Participants were provided with identification numbers and were guaranteed anonymity, and all participants completed an online, self-reported questionnaire in a school computer room after the survey had been fully explained. All data used in this study were fully anonymized before we accessed them. All procedures complied with the terms and conditions of the survey and were performed in accordance with the Declaration of Helsinki (7th version), and informed consent was obtained from all participants. The test–retest reliability of the KYRBWS questionnaire has been reported to be stable [10]. The dataset and questionnaire are provided, together with guidelines for calculating a health-related index, through the KCDC online site (http://www.cdc.go.kr/CDC/eng/main.jsp).

In 2017, the KYRBWS dataset included a total of 62,276 adolescents from 799 middle and high schools (response rate: 95.8%), selected using a complex sampling design that involves stratification, clustering, and multistage sampling.

Suicide

High risk of suicide, the dependent variable, was defined as having had either suicidal ideation or a suicide attempt in the previous year. Suicidal ideation was defined as a yes response to the question, “Did you consider suicide in the last 12 months?” and suicide attempt was defined as a yes response to the question, “Did you attempt suicide in the last 12 months?” Respondents who reported either suicidal ideation or a suicide attempt were categorized within the high risk of suicide group.
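The outcome definition above amounts to a logical OR of the two 12-month items. A minimal sketch follows; the function name and the toy respondents are hypothetical and are not taken from the study's code.

```python
def is_high_risk(ideation_yes: bool, attempt_yes: bool) -> bool:
    """Return True when either 12-month item was answered 'yes'."""
    return ideation_yes or attempt_yes


# Hypothetical respondents: (considered suicide, attempted suicide)
respondents = [(False, False), (True, False), (False, True), (True, True)]

# 1 = high risk of suicide group, 0 = comparison group
labels = [int(is_high_risk(ideation, attempt)) for ideation, attempt in respondents]
print(labels)
```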

Independent variables

Independent variables included socio-demographic variables (sex, grade, city type, academic achievement, family structure, family socioeconomic status, and education level of father and mother), health-related lifestyle factors (current smoking, current alcohol consumption, substance use, physical activity, obesity, sexual experience, and internet addiction), and psychological stress factors (sadness, stress, self-rated health, sleep satisfaction, self-rated weight, distorted weight perception, school injury, and violence). Comorbidities included asthma, allergic rhinitis, and atopic dermatitis.

School grade was divided into middle school (Grades 1–3, corresponding age 12–15 years) and high school (Grades 4–6, corresponding age 16–18 years). City type was categorized as big cities, small and medium-sized cities, and countryside. Academic achievement was categorized as high, high middle, middle, low middle, and low. Family structure was categorized as living with both parents, with one parent, or with neither parent. Family socioeconomic status (SES) was categorized as high, high middle, middle, low middle, and low. Education level of father and mother was categorized as unknown, middle school graduate or less, high school graduate, and college or graduate degree.

Current smoking, current alcohol consumption, and substance use were defined as a yes response to the questions: “Did you smoke or drink alcohol more than once within the last 30 days?” and “Have you ever used any substance or sniffed glue or butane habitually on purpose?”

Physical activity was categorized as “active” (vigorous physical activities more than two days among the last seven days) or “inactive.” Vigorous physical activities were defined as those that make one sweat or feel breathless for 20 minutes or more in the questionnaire.

Body mass index (BMI) was calculated based on the self-reported height and weight, and was categorized as underweight (≤ 5th percentile), normal (5th–85th percentile), overweight (85th–95th percentile), and obesity (≥ 95th percentile or BMI ≥ 25 kg/m²). Self-rated weight was categorized as very fat, fat, normal, thin, and very thin. Distorted weight perception was defined as answering “very fat” or “fat” to the self-rated weight question while the respondent’s actual weight was categorized as underweight or normal.
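The weight-related variables can be sketched as follows. This is an illustration only: the BMI percentile is passed in as an input because the text does not say which age- and sex-specific reference was used, the handling of the exact 5th, 85th, and 95th percentile boundaries is an assumption (the stated ranges touch at those points), and all function names are hypothetical.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index in kg/m^2 from self-reported weight and height."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def weight_category(bmi_value: float, percentile: float) -> str:
    """Map BMI and its (externally supplied) percentile to the four categories."""
    # Obesity is checked first because BMI >= 25 applies regardless of percentile.
    if percentile >= 95 or bmi_value >= 25:
        return "obesity"
    if percentile >= 85:      # assumption: exactly the 85th percentile counts as overweight
        return "overweight"
    if percentile > 5:
        return "normal"
    return "underweight"      # at or below the 5th percentile


def has_distorted_weight_perception(self_rated: str, category: str) -> bool:
    """True when a respondent feels fat although the measured category is not."""
    return self_rated in {"very fat", "fat"} and category in {"underweight", "normal"}


# Hypothetical respondent: 50 kg, 160 cm, at the 40th BMI percentile, feels "fat"
example_bmi = bmi(50, 160)
example_category = weight_category(example_bmi, percentile=40)
example_distorted = has_distorted_weight_perception("fat", example_category)
print(round(example_bmi, 2), example_category, example_distorted)
```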

Information regarding sexual experience, school injury, and internet addiction was also collected. For sadness, the adolescents were asked, “In the last 12 months, has a feeling of sadness interrupted your daily activities for at least two weeks?” In addition, stress, self-rated health, and sleep satisfaction were categorized in five levels by the extent of these symptoms.

Models to predict high risk of suicide

To prevent learning bias resulting from an imbalanced dataset (the non-suicide group was about 7 times larger than the suicide group in the entire dataset), a balanced dataset (an equal number of age- and sex-matched adolescents from the non-suicide group for the suicide group, n = 7,647 for each group) was selected from the preprocessed data by down-sampling (Fig 1). To prevent overfitting, the preprocessed dataset was split into five equally sized random groups for 5-fold cross-validation. One group was used as the test set and the other groups were used as the training sets for the machine learning prediction models. Five machine learning methods were trained: logistic regression (LR), random forest (RF), support vector machine (SVM), artificial neural network (ANN), and extreme gradient boosting (XGB). Optimal parameters for each machine learning method were selected through a grid search (Table 1). The variables used in the models were categorical; hence, they were converted to 0/1 indicator values by one-hot encoding.
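The down-sampling and one-hot encoding steps can be illustrated with a small standard-library sketch. The text does not specify how matching was carried out, so drawing one control per case from the same (age, sex) stratum without replacement is an assumption, and the function names and toy records are hypothetical.

```python
import random
from collections import defaultdict


def matched_downsample(cases, controls, keys=("age", "sex"), seed=0):
    """Pick one control per case from the same stratum, without replacement."""
    rng = random.Random(seed)
    pool = defaultdict(list)
    for row in controls:
        pool[tuple(row[k] for k in keys)].append(row)
    for rows in pool.values():
        rng.shuffle(rows)                 # random choice within each stratum
    matched = []
    for case in cases:
        stratum = pool[tuple(case[k] for k in keys)]
        if stratum:                       # a case with no remaining match gets none
            matched.append(stratum.pop())
    return matched


def one_hot(rows, column):
    """Replace one categorical column by 0/1 indicator columns, one per level."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row[f"{column}={level}"] = int(row[column] == level)
        encoded.append(new_row)
    return encoded


# Toy records (hypothetical): two cases and four candidate controls
cases = [{"age": 15, "sex": "F", "stress": "high"},
         {"age": 16, "sex": "M", "stress": "low"}]
controls = [{"age": 15, "sex": "F", "stress": "low"},
            {"age": 15, "sex": "F", "stress": "mid"},
            {"age": 16, "sex": "M", "stress": "mid"},
            {"age": 17, "sex": "M", "stress": "low"}]

matched = matched_downsample(cases, controls)
balanced = one_hot(cases + matched, "stress")
print(len(cases), len(matched))
```

In this toy run the 17-year-old control is never selected because no case shares that stratum, which is the effect of matching on age and sex.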

Fig 1. Scheme of prediction model development.

https://doi.org/10.1371/journal.pone.0217639.g001

Table 1. Optimal parameters for each machine learning model are selected through the grid search.

https://doi.org/10.1371/journal.pone.0217639.t001
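The combination of 5-fold cross-validation and grid search described above can be sketched with the standard library alone. This is not the study's code: a toy threshold rule stands in for LR, RF, SVM, ANN, and XGB, and the parameter grid, data, and function names are all hypothetical.

```python
import random
from itertools import product


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the indices once and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(fit, predict, X, y, params, k=5):
    """Mean held-out accuracy over k folds for one parameter setting."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx], **params)
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k


def grid_search(fit, predict, X, y, param_grid, k=5):
    """Try every parameter combination and keep the best cross-validated one."""
    names = sorted(param_grid)
    best_params, best_score = None, -1.0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cross_val_accuracy(fit, predict, X, y, params, k)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy stand-in model: flag a respondent when enough 0/1 indicators are set.
def fit_threshold_rule(X_train, y_train, threshold):
    return {"threshold": threshold}       # nothing is learned; only the hyperparameter is kept


def predict_threshold_rule(model, x):
    return int(sum(x) >= model["threshold"])


# Toy data: every combination of three 0/1 indicators, repeated five times
X = [list(bits) for bits in product([0, 1], repeat=3)] * 5
y = [int(sum(x) >= 2) for x in X]

folds = k_fold_indices(len(X))
best_params, best_score = grid_search(
    fit_threshold_rule, predict_threshold_rule, X, y, {"threshold": [1, 2, 3]}
)
print(best_params, best_score)
```

Each parameter setting is scored only on data held out from fitting, so the selected setting is the one that generalizes best across the five folds rather than the one that fits the training data best.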

LR and the other machine learning models were compared in terms of sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy in predicting adolescents who had a history of suicidal ideation or attempt. For the test dataset, the area under the receiver operating characteristic curve (AUC) for each model was also calculated to evaluate general prediction performance.
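For reference, the five threshold-based measures follow directly from the four cells of a confusion matrix, and the AUC can be computed as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below uses made-up counts and scores, not values from the study, and the function names are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard measures derived from the cells of a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),   # true positives among all actual positives
        "specificity": tn / (tn + fp),   # true negatives among all actual negatives
        "ppv": tp / (tp + fp),           # true positives among predicted positives
        "npv": tn / (tn + fn),           # true negatives among predicted negatives
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def auc(scores_pos, scores_neg):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Made-up confusion matrix and predicted scores, for illustration only
metrics = classification_metrics(tp=80, fp=30, tn=70, fn=20)
auc_value = auc(scores_pos=[0.9, 0.8, 0.4], scores_neg=[0.7, 0.3, 0.2])
print(metrics, auc_value)
```

The pairwise AUC is quadratic in the sample size; it is shown here because it makes the definition explicit, not because it is the most efficient way to compute the value.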

Statistical analysis

Results are presented as percentages for categorical variables and as means (± standard deviation) for continuous variables. Categorical and continuous variables were compared between adolescents with and without high risk of suicide using the chi-square test and Student’s t-test, respectively. Multivariate regression analysis with backward stepwise selection was used to identify factors associated with previous suicidal ideation or attempt.
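Two of these calculations are compact enough to sketch. The first function computes the Pearson chi-square statistic for a 2x2 table, without continuity correction, and its p-value for one degree of freedom. The second converts a logistic regression coefficient and its standard error into an odds ratio with a Wald 95% confidence interval. Whether the study's intervals were Wald intervals is an assumption, and the counts and coefficient below are invented for illustration.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]] and its p-value (1 df)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio exp(beta) with Wald limits exp(beta - z*se) and exp(beta + z*se)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Invented 2x2 table: rows = exposed / unexposed, columns = high risk / not high risk
stat, p_value = chi_square_2x2(30, 70, 10, 90)

# Invented coefficient: beta = ln(2) corresponds to an odds ratio of 2
odds_ratio, ci_low, ci_high = odds_ratio_ci(beta=math.log(2), se=0.1)
print(stat, p_value, odds_ratio, ci_low, ci_high)
```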

The statistical analyses, machine learning models, and evaluation of diagnostic performance were carried out using the open-source software Python version 3.6.0. P-values of less than 0.05 (two-sided) were considered significant.

Results

The clinical characteristics of a total of 59,984 subjects with valid information regarding previous history of suicidal ideation/attempt are summarized in Table 2. The high-risk suicide group showed higher proportions of girls, low school grade, low academic achievement, those not living with both parents, low family SES, low parental education level, current smoking, current alcohol drinking, substance use, inactive physical activity, sexual experience, internet addiction, sadness, high stress, poor self-rated health, low sleep satisfaction, high self-rated weight, distorted weight perception, experience of school injury and violence, and presence of comorbid diseases (asthma, allergic rhinitis, atopic dermatitis).

Table 2. Characteristics of high-risk suicide (n = 7,443) and no high-risk suicide (n = 52,541).

https://doi.org/10.1371/journal.pone.0217639.t002

A multivariate regression analysis was performed to identify factors associated with high risk of suicide (Table 3). Sadness (odds ratio [OR], 6.41; 95% confidence interval [95% CI], 6.08–6.87), violence (OR, 2.32; 95% CI, 2.01–2.67), substance use (OR, 1.93; 95% CI, 1.52–2.45), and stress (OR, 1.63; 95% CI, 1.40–1.86) showed relatively strong associations with previous suicidal ideation/attempt. Other factors were also associated with high risk of suicide: female sex, grade, academic achievement, family structure, family SES, parental education level, current smoking, current alcohol drinking, physical activity, overweight, self-rated health, sleep satisfaction, sexual experience, school injury, and violence.

Table 3. Multivariate logistic regression analysis to identify factors associated with high risk of suicide.

https://doi.org/10.1371/journal.pone.0217639.t003

For the test dataset, the confusion matrix and receiver operating characteristic (ROC) curve show that the diagnostic performance of the machine learning techniques is comparable with that of LR (Table 4 and Fig 2). XGB showed the best performance, with a sensitivity of 78.5%, specificity of 79.4%, PPV of 79.2%, NPV of 78.7%, classification accuracy of 79.0%, and AUC of 0.863.
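The reported XGB figures can be cross-checked against one another. If the test data contain equal numbers of high-risk and non-high-risk adolescents, which is an assumption based on the balanced down-sampled design, then accuracy is the mean of sensitivity and specificity, and PPV and NPV follow from the same two numbers. The function name below is hypothetical.

```python
def balanced_summary(sensitivity, specificity):
    """Accuracy, PPV, and NPV when positives and negatives are equally frequent."""
    # With N positives and N negatives: TP = sens*N, FN = (1-sens)*N,
    # TN = spec*N, FP = (1-spec)*N, so N cancels in every ratio.
    accuracy = (sensitivity + specificity) / 2
    ppv = sensitivity / (sensitivity + (1 - specificity))
    npv = specificity / (specificity + (1 - sensitivity))
    return accuracy, ppv, npv


# Sensitivity and specificity reported for XGB in the text
accuracy, ppv, npv = balanced_summary(0.785, 0.794)
print(f"accuracy={accuracy:.3f} ppv={ppv:.3f} npv={npv:.3f}")
```

Under that assumption the derived values agree, to within rounding, with the reported accuracy of 79.0%, PPV of 79.2%, and NPV of 78.7%.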

Fig 2. Receiver operating characteristic (ROC) curve.

https://doi.org/10.1371/journal.pone.0217639.g002

Table 4. Confusion matrix for prediction models (Test set).

https://doi.org/10.1371/journal.pone.0217639.t004

Discussion

Machine learning techniques offer promise for improving suicide risk prediction. A systematic review found that machine learning achieved greater accuracy in predicting self-injurious thoughts and behaviors than previous studies using traditional statistical methods [7].

Machine learning techniques have advantages beyond traditional statistical approaches in psychological research [11]. For example, traditional approaches greatly minimize the number of variables and impose linearity on relationships that likely have more complex associations. On the other hand, machine learning approaches enable the simultaneous testing of numerous variables and their complex interactions and allow for non-linearity in producing predictive models [11].

The purpose of this study was to develop models that identify adolescents at risk of suicide by applying machine learning methods to a nationally representative survey dataset in Korea. We applied the LR method and several other machine learning algorithms, and XGB showed the best performance in the test dataset, with an accuracy of 79.0% (AUC = 0.863). XGB is highly efficient and flexible and can be easily used on distributed platforms for further computational efficiency [12]. Ensemble learning is possible by combining XGB with another algorithm. Future studies might achieve better performance if XGB is combined with various algorithms rather than used as a single-algorithm model.

However, the machine learning techniques showed diagnostic performance that was, overall, comparable with that of LR. The main reason might be the type of dataset used in the present study. The KYRBWS survey data are composed of general health-risk behaviors, and we arbitrarily selected 26 categorical variables to develop the prediction models. Further study is warranted to explore whether accuracy can be increased using latent variables.

The present study has several limitations. First, the KYRBWS was developed to cover general health-risk behaviors, including psychological status and previous suicidal behavior, which were examined by simple questions and scales. If the survey had included more detailed questions regarding suicidal behavior or psychological status, the performance of the models might have improved. Second, because this model was developed using the KYRBWS dataset, the same diagnostic performance is not guaranteed with other datasets or populations. In the present study, we used paired (matched) sampling for the imbalanced outcome together with cross-validation to avoid the problems of “limited generalization” and “overfitting.” Despite these limitations, this is the first study to apply machine learning techniques to a large, nationally representative sample (n = 59,984) of Korean adolescents.

In conclusion, this study showed that machine learning techniques have the potential to identify Korean adolescents at risk of suicide using a nationally representative survey dataset of general health-risk behaviors. Several machine learning models performed comparably to the conventional LR method and have potential for further development. Establishing accurate prediction models through additional studies would facilitate early screening of high-risk adolescents and correction of modifiable risk factors, so that society can prevent future suicidal behavior and death by suicide.

References

1. Byeon KH, Kimm H, Jee SH, Sull JW, Choi B. Relationship between heavy drinking experience and suicide attempts in Korean adolescents: Based on 2013 The Korea Youth Risk Behavior Web-based Survey. Epidemiology and health. 2018:e2018046. Epub 2018/10/20. pmid:30336665.
2. Choi SB, Lee W, Yoon JH, Won JU, Kim DW. Risk factors of suicide attempt among people with suicidal ideation in South Korea: a cross-sectional study. BMC public health. 2017;17(1):579. Epub 2017/06/18. pmid:28619107; PubMed Central PMCID: PMCPMC5472995.
3. Jordan P, Shedden-Mora MC, Lowe B. Predicting suicidal ideation in primary care: An approach to identify easily assessable key variables. General hospital psychiatry. 2018;51:106–11. Epub 2018/02/13. pmid:29428582.
4. Kim YJ, Moon SS, Lee JH, Kim JK. Risk Factors and Mediators of Suicidal Ideation Among Korean Adolescents. Crisis. 2018;39(1):4–12. Epub 2016/11/22. pmid:27869508.
5. Oh J, Yun K, Hwang JH, Chae JH. Classification of Suicide Attempts through a Machine Learning Algorithm Based on Multiple Systemic Psychiatric Scales. Frontiers in psychiatry. 2017;8:192. Epub 2017/10/19. pmid:29038651; PubMed Central PMCID: PMCPMC5632514.
6. Franklin JC, Ribeiro JD, Fox KR, Bentley KH, Kleiman EM, Huang X, et al. Risk factors for suicidal thoughts and behaviors: A meta-analysis of 50 years of research. Psychol Bull. 2017;143(2):187–232. pmid:27841450.
7. Burke TA, Ammerman BA, Jacobucci R. The use of machine learning in the study of suicidal and non-suicidal self-injurious thoughts and behaviors: A systematic review. Journal of affective disorders. 2019;245:869–84. pmid:30699872.
8. Passos IC, Mwangi B, Cao B, Hamilton JE, Wu MJ, Zhang XY, et al. Identifying a clinical signature of suicidality among patients with mood disorders: A pilot study using a machine learning approach. Journal of affective disorders. 2016;193:109–16. Epub 2016/01/17. pmid:26773901; PubMed Central PMCID: PMCPMC4744514.
9. Choi SB, Lee W, Yoon JH, Won JU, Kim DW. Ten-year prediction of suicide death using Cox regression and machine learning in a nationwide retrospective cohort study in South Korea. J Affect Disord. 2018;231:8–14. Epub 2018/02/07. pmid:29408160.
10. Bae J, Joung H, Kim JY, Kwon KN, Kim YT, Park SW. Test-retest reliability of a questionnaire for the Korea Youth Risk Behavior Web-based Survey. J Prev Med Public Health. 2010;43(5):403–10. pmid:20959711.
11. McArdle JJ, Ritschard G. Contemporary Issues In Exploratory Data Mining in the Behavior Sciences. 2014.
12. Friedman JH. Greedy function approximation: A gradient boosting machine. Ann Statist. 2001;29(5):1189–232.
