Introduction

Several recent attempts have been made to classify suicidal behavior using machine learning (Burke et al., 2020; Miché et al., 2020; Shen et al., 2020; van Vuuren et al., 2021). In this paper, we point out a critical issue that has not been addressed in the literature and that contrasts with the common understanding of False Positive cases (FP), which are usually considered non-informative classification errors (see for example Linthicum et al., 2019; van Vuuren et al., 2021). In a nutshell, when evaluating machine learning binary classification models of suicidal behavior, FP deserve a closer look. Our argument is that FP may exhibit psycho-socio-behavioral response patterns similar to those of True Positive cases (TP) and may therefore include individuals at high risk of developing suicidal tendencies. Further, it is known that individuals with suicidal tendencies may avoid reporting their suicide attempts or ideations, which is particularly common among adolescents (Brahmbhatt & Grupp-Phelan, 2019; Christl et al., 2006; Hart et al., 2013; Jones et al., 2019). Thus, it is plausible that a well-trained model can identify high-risk individuals who have not yet attempted suicide or who have refused to self-report their previous attempts. This group may be a clinically relevant target for prevention programs.

Consider an imperfect binary classifier that categorizes individuals into two groups, positives and negatives, of which TP and True Negative cases (TN) are correct classifications, and FP and False Negative cases (FN) are misclassifications. Machine learning classification models of suicidal behavior are typically trained using multiple mental health indicators and risk factors and can detect patterns in the data that lead to accurate classifications (Healy, 2021; Ley et al., 2022; Walsh et al., 2017). Additionally, some of the common predictors and risk factors of suicidal behavior, such as depressive symptoms, non-suicidal self-harm, and substance use, might mediate or be causally related to the development of suicidal behavior, and they are also expected to persist over time. Persistence of risk factors and lack of protective factors will increase the risk of developing suicidal behavior in the future. Furthermore, suicidal behavior is expected to develop over time, as there is evidence for pathways that lead to the development of suicidality among adolescents (Haghish et al., 2023; Van Orden et al., 2010). Therefore, if the classification model demonstrates high accuracy and the classification is made at a high cutoff on the estimated probabilistic risk scores (high specificity), the response patterns and symptoms of FP and TP are expected to be similar (see Fig. 1). Consequently, FP are expected to have worse mental health conditions than TN, providing evidence that FP are a risk group and potentially relevant to a suicide prevention program.
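The four classification groups described above can be made concrete with a minimal Python sketch. The function name, the example risk scores, and the cutoff of 0.8 are hypothetical and serve only to illustrate how a single cutoff on the estimated risk score assigns each individual to TP, FP, TN, or FN; they are not taken from the study.

```python
def classification_group(risk_score, attempted, cutoff):
    """Return 'TP', 'FP', 'TN', or 'FN' for one individual.

    risk_score: estimated probability of a suicide attempt (model output)
    attempted:  True if the individual self-reported a suicide attempt
    cutoff:     scores at or above this value are classified as positive
    """
    predicted_positive = risk_score >= cutoff
    if predicted_positive:
        return "TP" if attempted else "FP"
    return "FN" if attempted else "TN"


# Hypothetical (risk score, self-reported attempt) pairs, for illustration only.
sample = [(0.91, True), (0.88, False), (0.12, False), (0.35, True)]
groups = [classification_group(score, attempted, cutoff=0.8)
          for score, attempted in sample]
print(groups)  # ['TP', 'FP', 'TN', 'FN']
```

In this toy example, the second individual reports no attempt but receives a risk score close to that of the first, which is the kind of case the argument above is concerned with.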

Fig. 1. The rationale for considering FP as a risk group in suicide attempt classification models

Drawing on this rationale, we tested two hypotheses using cross-sectional data. First, we hypothesized that at higher specificity thresholds, the severity of depression and anxiety symptoms would be more similar between the FP and TP groups. Second, we hypothesized that at a high specificity threshold, the prevalence of psychological distress and non-suicidal self-harm would be significantly higher among the FP group compared to the TN group.

Methods

We utilized data from the Ungdata project (www.ungdata.no), which included 173,663 adolescents from all municipalities in Norway who participated between 2014 and 2019. The participants completed a battery of questionnaires that covered socio-demographic information, internalizing and externalizing problems, traumatic experiences, interpersonal relationships, and suicidal behavior. Following the approach proposed by Strand et al. (2003), we computed the psychological distress score as the average of the depression and anxiety sum scores, and classified participants who scored at least 3 out of 4 as distressed. A comprehensive description of the instruments and their items can be accessed on the Open Science repository of this paper via https://osf.io/a7fgb/.
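The distress scoring rule can be sketched as follows. This is an illustration under stated assumptions rather than the scoring code used in the study: it assumes that each item is answered on a 1 to 4 scale and that each scale score is expressed as the mean item response, so that the averaged score also lies between 1 and 4. The function names and the example responses are hypothetical.

```python
def scale_mean(item_responses):
    """Mean response across the items of one scale (assumed 1-4 per item)."""
    return sum(item_responses) / len(item_responses)


def psychological_distress(depression_items, anxiety_items, threshold=3.0):
    """Average the two scale scores and flag scores of at least `threshold`."""
    score = (scale_mean(depression_items) + scale_mean(anxiety_items)) / 2
    return score, score >= threshold


# Hypothetical responses for one participant.
score, distressed = psychological_distress([4, 4, 3, 3], [3, 3, 2, 2])
print(score, distressed)  # 3.0 True
```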

The Extreme Gradient Boosting algorithm (XGBoost; Chen & Guestrin, 2016) was used to develop the classification model, as it is expected to outperform other decision tree ensemble algorithms as well as generalized linear models (Haghish et al., 2023; Sahin, 2020). We trained the algorithm on 80% of the data (n = 138,931), with the remaining 20% (n = 34,732) reserved for testing. To optimize performance on the imbalanced outcome variable, we fine-tuned the model using 10-fold cross-validation to maximize the Area Under the Precision-Recall Curve (AUPRC), which is the preferred performance metric for imbalanced (low-prevalence) outcomes (Davis & Goadrich, 2006). The adjROC R package (Haghish, 2022) was employed to compute the classification cutoffs for specificity values ranging from 0.4 to 1.0, and for each threshold, we computed the severity of depression and anxiety symptoms for all classification groups. To test our first hypothesis, we subtracted the depression and anxiety scores of TP from those of FP at each threshold and fitted linear regression models on the resulting differences. For the second hypothesis, we compared the prevalence of psychological distress and non-suicidal self-harm between FP and TN using Fisher’s exact test. Note that the binary psychological distress item was computed after model training. Moreover, the non-suicidal self-harm item was excluded from the dataset during model training.
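To illustrate what a classification cutoff for a given specificity means, the following Python sketch searches the observed risk scores for the smallest cutoff whose specificity reaches a target value. It is a simplified stand-in and does not reproduce the adjROC package; the function names and the toy data are hypothetical.

```python
def specificity(scores, labels, cutoff):
    """Share of true negatives among all individuals without the outcome."""
    negatives = [s for s, y in zip(scores, labels) if not y]
    true_negatives = sum(1 for s in negatives if s < cutoff)
    return true_negatives / len(negatives)


def cutoff_for_specificity(scores, labels, target):
    """Smallest observed score that, used as cutoff, reaches the target."""
    for candidate in sorted(set(scores)):
        if specificity(scores, labels, candidate) >= target:
            return candidate
    # No observed score reaches the target: classify nobody as positive.
    return float("inf")


# Hypothetical risk scores and self-reported attempts (True = attempt).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [False, False, False, True, False, True, False, True]
cutoff = cutoff_for_specificity(scores, labels, target=0.8)
print(cutoff)  # 0.7
```

Raising the target specificity moves the cutoff upward, so fewer individuals without a reported attempt are classified as positive, and those who remain in the FP group have the highest estimated risk.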

Results

The trained XGBoost model achieved an AUC of 92.8% and an AUPRC of 48.8%. Additionally, the model exhibited a sensitivity of 51.7%, specificity of 96.9%, and a Cohen’s Kappa of 0.466. These results suggest that the model's accuracy and inter-rater agreement in classifying suicidal behavior are high.
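For readers less familiar with these metrics, the sketch below shows how sensitivity, specificity, and Cohen's Kappa follow from the four cells of a confusion matrix. The counts are hypothetical and were chosen only for illustration; they are not the counts of the testing sample.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, and Cohen's Kappa from confusion counts."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    observed = (tp + tn) / n  # observed agreement
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return sensitivity, specificity, kappa


# Hypothetical counts for a low-prevalence outcome.
sens, spec, kappa = confusion_metrics(tp=50, fp=30, fn=50, tn=870)
print(round(sens, 3), round(spec, 3), round(kappa, 3))
```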

Figure 2 illustrates that as specificity increased, the average sum scores of depression and anxiety for the TP and FP groups became more similar. Linear regression analysis confirmed a significant negative slope for the distance between the TP and FP scores for depression (β = −0.69, adjusted R² = 0.991, F(1, 97) = 11240, p < 0.0001) and anxiety (β = −0.71, adjusted R² = 0.972, F(1, 97) = 3382, p < 0.0001). These findings indicate that the difference between the mean sum scores of the two groups decreased as specificity increased, thereby supporting our first hypothesis.
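The slope test can be illustrated with an ordinary least squares fit of the TP and FP score distance on the specificity threshold. The sketch below uses plain Python and hypothetical distances; it shows only how a negative slope arises when the distance shrinks as specificity increases, and it does not reproduce the reported estimates.

```python
def fit_line(x, y):
    """Ordinary least squares slope and intercept for y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical distances between TP and FP mean scores per threshold.
specificities = [0.5, 0.6, 0.7, 0.8, 0.9]
distances = [0.40, 0.33, 0.26, 0.19, 0.12]
slope, intercept = fit_line(specificities, distances)
print(round(slope, 2), round(intercept, 2))  # -0.7 0.75
```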

Fig. 2. Scaled mean score of anxiety and depression sum scores evaluated for specificity thresholds ranging from 0.4 to 1.0

In the testing sample, 7.2% of adolescents were found to experience psychological distress, with varying rates across the different classification groups. Specifically, the prevalence was highest in the TP group at 63.4%, followed by FP at 55.3%, FN at 20.0%, and TN at 4.2%. Fisher's exact test revealed a statistically significant difference in the prevalence of psychological distress between the FP and TN groups (Odds ratio = 0.036, 95% CI = 0.031 – 0.041, p < 0.0001). Moreover, the rate of psychological distress in the FP group was also significantly higher than in the FN group (Odds ratio = 4.958, 95% CI = 3.976 – 6.202, p < 0.0001). Additionally, the prevalence of non-suicidal self-harm was highest in the TP group at 95.2%, followed by FN at 80.8%, FP at 65.2%, and TN at 9.9%. Consistent with expectations, Fisher's exact test also revealed a statistically significant difference in non-suicidal self-harm prevalence between the FP and TN groups (Odds ratio = 0.059, 95% CI = 0.050 – 0.068, p < 0.0001), supporting the second hypothesis.
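As a rough plausibility check, sample odds ratios can be recomputed from the rounded prevalences reported above. The sketch below is not Fisher's exact test, which works on the underlying counts and yields a conditional estimate, so the values are expected to match the reported odds ratios only approximately. The function names are hypothetical.

```python
def odds(prevalence):
    """Convert a prevalence (proportion) into odds."""
    return prevalence / (1 - prevalence)


def odds_ratio(prevalence_a, prevalence_b):
    """Odds of the condition in group A relative to group B."""
    return odds(prevalence_a) / odds(prevalence_b)


# Psychological distress: TN (4.2%) relative to FP (55.3%).
distress_tn_vs_fp = odds_ratio(0.042, 0.553)
# Psychological distress: FP (55.3%) relative to FN (20.0%).
distress_fp_vs_fn = odds_ratio(0.553, 0.200)
# Non-suicidal self-harm: TN (9.9%) relative to FP (65.2%).
self_harm_tn_vs_fp = odds_ratio(0.099, 0.652)

for value in (distress_tn_vs_fp, distress_fp_vs_fn, self_harm_tn_vs_fp):
    print(round(value, 3))
```

The three values come out close to 0.035, 4.95, and 0.059, respectively.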

Discussion

The results supported our claim that, for an accurate suicide classification model and at a high specificity threshold, adolescents in the FP group might be comparable to those in the TP group, showing severe symptoms of psychological distress and non-suicidal self-harm at rates much higher than those in the TN group. In our analysis, FP also showed more severe signs of psychological distress than the FN group, which helps explain why the model estimated their suicide attempt risk to be higher than that of FN. Depression, anxiety, and non-suicidal self-harm are well-established risk factors for suicidal behavior, and thus the study's findings support our hypotheses and arguments (Carballo et al., 2020; Darke et al., 2010; Greening et al., 2008; Lewis et al., 2014; Lohner & Konrad, 2006; Toprak et al., 2011).

Well-trained machine learning suicide classification models are expected to identify important predictors and interactions between the predictors. In estimating suicide attempt risk, machine learning models are also expected to take the severity of these predictors into account. Therefore, it is to be expected that the severity of important suicide-related indicators such as depression, anxiety, and psychological distress, as well as the prevalence of non-suicidal self-harm, is reflected in the suicide risk estimations of the model. What is noteworthy here, however, is that the identified high-risk non-suicidal adolescents should be conceptualized as a risk group relevant to a suicide prevention program rather than as mere classification errors to be dismissed. Although they have not reported any suicide attempts, they are likely to be at higher risk than true negative cases of developing suicidal tendencies or attempting suicide in the near future.

It is important to note that this study relied on cross-sectional data and only identified similarities and differences between the FP, TP, and TN groups as evidence for suicide attempt risk, which is a limitation. Future longitudinal studies should investigate whether FP adolescents are significantly more likely to attempt suicide or develop suicidal behavior than those in the TN group. Although our representative dataset covered Norwegian adolescents only, we anticipate that the findings could also apply to other age groups; however, this should be confirmed in future research. Nevertheless, the study had several strengths, in particular the use of a comprehensive dataset. This dataset included not only a large number of participants but also risk and protective survey items from a broad range of psycho-socio-environmental domains, thereby enabling an accurate estimation of suicide attempt risk.

Research has found low effectiveness for suicide intervention programs (Fox et al., 2020; Large, 2018), highlighting the importance of prevention rather than addressing suicidal tendencies after they emerge (Carter & Spittal, 2018). Therefore, identifying adolescents vulnerable to developing suicidal behavior is pivotal. Our study suggests that the FP group may be a relevant target for a suicide prevention program. If future longitudinal research confirms that FP adolescents (or other age groups) are clinically relevant and at high risk of attempting suicide, it will be necessary to devise new measures to evaluate the performance of machine learning classifiers for suicidal behavior. Such measures should consider the clinical relevance of the FP group. This information could lead to a redefinition of clinical relevance and the development of optimized cutoff values that maximize clinical relevance rather than overemphasizing sensitivity and specificity (Brown & Barlow, 2016; Hayes & Bell, 2014). This approach may also be applicable to other health-related binary outcomes and is not limited to suicide research. While this idea is novel, its generalization requires further investigation, particularly by using data from longitudinal studies.