This paper presents a unique analysis of vocalization categories from a database of vocalizations produced by non- or minimally-speaking individuals in real-world environments. The categories—frustration, dysregulation, request, self-talk, and delight—can be interpreted by caregivers of non- or minimally-speaking individuals, but they can be hard for a new listener to tell apart. Correlation-based features, representing the coordination within and across three speech subsystems—articulatory, laryngeal, and respiratory—were used to identify the types of changes that non- or minimally-speaking individuals may make in the movements of their speech production subsystems to differentiate between the vocalization categories. This work provides insight into both the acoustic differences between the categories and the subsystems in which those acoustic differences may arise. The results from this paper could be used to develop a curated feature set that makes vocalizations from non- or minimally-speaking individuals more interpretable, and to better understand how motor coordination and communication are linked in speech production. The insights could additionally be used by clinicians to better understand how speech production movements differ across vocalization classes and how they could be modified to better communicate the desired intent.
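To make the correlation-based coordination features more concrete, the sketch below illustrates one common way such features can be computed: stacking time-delayed copies of low-level subsystem signals, forming their correlation matrix, and summarizing its eigenvalue spectrum. The delay values, signal choices, and function name are illustrative assumptions, not the exact configuration used in this paper.

```python
import numpy as np

def coordination_features(signals, delays=(0, 3, 7, 11)):
    """Channel-delay correlation features: stack time-delayed copies of each
    low-level signal, form their correlation matrix, and return its
    rank-ordered eigenvalue spectrum.

    signals : array of shape (n_channels, n_frames), e.g., one row per
              subsystem signal (F0, intensity, a formant track, ...).
    """
    n_frames = signals.shape[1] - max(delays)
    # Each row is one channel at one delay, truncated so all rows align.
    stacked = np.vstack([sig[d:d + n_frames] for sig in signals for d in delays])
    corr = np.corrcoef(stacked)
    # Roughly speaking, a flatter eigenvalue spectrum indicates more independent
    # dimensions of variation (more complex coordination), while a peaked
    # spectrum indicates simpler, more tightly coupled movement.
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

# Illustrative usage with two synthetic, partially coupled tracks.
rng = np.random.default_rng(0)
f0 = rng.normal(size=500)
intensity = 0.6 * f0 + 0.4 * rng.normal(size=500)
print(coordination_features(np.vstack([f0, intensity]))[:5])
```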
Vocalization Analysis
The frustrated and dysregulated categories did not show many clear differences in movement and coordination across and within speech subsystems, other than in the laryngeal subsystem. Consistent with the spectrograms, the frustrated category appeared to have lower complexity of movement, with less change over time in the frequency of vocal fold vibration (F0) and the irregularity of this vibration (CPP) compared to vocalizations from the dysregulated category, as shown in Fig. 5. This indicates that individuals do not modulate their pitch (vocal fold vibration) as frequently during frustrated utterances as during dysregulated utterances.
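As a rough illustration of the laryngeal measures discussed here, the sketch below extracts an F0 track with librosa's pYIN implementation and a simplified per-frame cepstral peak prominence (CPP). The file name, sampling rate, pitch range, and frame settings are placeholders rather than the exact extraction pipeline used in this study.

```python
import numpy as np
import librosa

def frame_cpp(frame, sr, f0_range=(60.0, 400.0)):
    """Simplified cepstral peak prominence: height of the cepstral peak in the
    expected pitch-quefrency band above a linear trend fit to the cepstrum."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    cepstrum = np.abs(np.fft.irfft(np.log(spectrum + 1e-10)))
    quefrency = np.arange(len(cepstrum)) / sr
    band = (quefrency >= 1.0 / f0_range[1]) & (quefrency <= 1.0 / f0_range[0])
    trend = np.polyval(np.polyfit(quefrency, cepstrum, 1), quefrency)
    return float(np.max(cepstrum[band] - trend[band]))

# Illustrative file name; any vocalization clip would do.
y, sr = librosa.load("vocalization.wav", sr=16000)

# F0 track via pYIN; its variation over time is one proxy for pitch modulation.
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
print("F0 variability (Hz):", np.nanstd(f0))

# Per-frame CPP over 64 ms frames with a 16 ms hop.
frames = librosa.util.frame(y, frame_length=1024, hop_length=256).T
cpp_track = np.array([frame_cpp(f, sr) for f in frames])
print("CPP variability:", cpp_track.std())
```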
Subsequently, a clear division was found between frustrated and dysregulated vocalizations on the one hand and self-talk, request, and delighted vocalizations on the other. Specifically, there appears to be lower complexity of laryngeal and respiratory movements in the self-talk, request, and delighted vocalizations than in the frustrated and dysregulated vocalizations, as shown in Figs. 7 and 8. This may indicate that prosodic information, typically modulated by the laryngeal subsystem, helps distinguish frustrated and dysregulated states from these other categories. Perceptually, this may manifest as rapid changes in loudness and pitch during frustration and dysregulation as compared to the other vocalization classes. On the other hand, there is higher complexity of articulatory movements in the self-talk, request, and delighted vocalizations as compared to frustration and dysregulation, as shown in the spectrograms in Fig. 6 and the effect size patterns in Fig. 7, which suggests that articulatory information is perhaps not used to communicate frustrated and dysregulated states for these participants. Individuals are likely utilizing rapid articulatory movements to convey self-talk, request, and delighted vocalizations.
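The between-category comparisons referenced above rest on effect sizes. As a small illustration, a standard effect size such as Cohen's d could be computed between the distributions of a single coordination feature for two categories, as sketched below; the particular measure used in the paper's figures is not assumed, and the category labels and feature values here are synthetic.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d between two sets of feature values, e.g., one coordination
    feature measured on frustrated vs. self-talk vocalizations."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled_sd

# Synthetic feature values for two vocalization categories (illustrative only).
frustrated = np.random.default_rng(1).normal(0.2, 1.0, 40)
self_talk = np.random.default_rng(2).normal(0.9, 1.0, 55)
print(cohens_d(self_talk, frustrated))
```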
Request vocalizations had lower complexity in the correlation across signals from all three subsystems as compared to self-talk and delighted vocalizations, as shown in Fig. 8. Combined with the comparison to frustrated and dysregulated vocalizations, this suggests that non- or minimally-speaking individuals may use simple, coupled movements of the respiratory and laryngeal subsystems to ask for something from the individuals around them, and include articulatory movements to further distinguish their intent from frustrated and dysregulated states. These observations match perceptual observations: there were typically more consonants in request sounds, while frustrated and dysregulated vocalizations included exclamations, squeals, and content that sounded more like a sustained vowel, such as holding the vowel ‘a’ for a period of time. On the other hand, delighted vocalizations appeared to have the highest complexity of movements across all three subsystems compared to request and self-talk, which may indicate that there is less control over the coordination of movements across the speech production subsystems during the expression of this vocalization category.
Limitations, Future Work, and Clinical Implications
Many of the tools and algorithms we utilized for low-level acoustic feature extraction were developed for speech recordings made in controlled environments and have typically been applied to verbal communication. Therefore, there may be some noise in the extracted low-level features, as recordings in this study were made in a real-world environment, with possible background noise, varied recording conditions, and a focus on non-verbal vocalizations. In future work, we will explore noise-reduction techniques and further evaluate the parameters of the feature extraction techniques to increase the robustness of feature extraction.
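As one example of the kind of noise-reduction step that could be explored, the sketch below applies off-the-shelf spectral-gating denoising via the noisereduce package before feature extraction. The file paths are placeholders, and this is an assumption about a possible preprocessing choice, not a method adopted in the present study.

```python
import librosa
import noisereduce as nr
import soundfile as sf

# Load a recording made in a real-world environment (path is illustrative).
y, sr = librosa.load("home_recording.wav", sr=16000)

# Spectral-gating noise reduction; parameters left at library defaults.
y_denoised = nr.reduce_noise(y=y, sr=sr)

# Write the denoised audio so downstream feature extraction can use it.
sf.write("home_recording_denoised.wav", y_denoised, sr)
```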
In the analysis presented here, we eliminated the ‘social exchange’ vocalization category due to the low number of labeled occurrences across all of the subjects. Moving forward, it will be important to emphasize collecting examples from this category, as social exchanges can help in better understanding communication differences across non- or minimally-speaking individuals. Within the social exchange category and across all other vocalization categories, we may also be able to understand the changes made in speech production subsystems when interacting with a caregiver vs. a peer vs. a clinician, which would be important to incorporate into our understanding of how non- or minimally-speaking individuals communicate with different people. Given a potential overlap between the social exchange category and other categories, we can gain insight into how the acoustics of social exchanges change depending on additional vocalization categories and contexts. In addition, the current analysis relies on labeling from the caregiver, but incorporating labels from clinicians, teachers, and other individuals may also aid in developing personalized models and insights to understand how non-verbal vocalizations change with respect to social context.
Caregivers were instructed to label vocalizations only when they were certain that a single communicative function or emotion was being conveyed. However, this excludes the possibility of overlapping functions and emotions in a vocalization, such as a ‘dysregulated request’ or ‘delighted self-talk’, which would be important to identify for the individual. In addition, there could be vocalizations that sound very similar to each other acoustically but were interpreted as different communicative intents given the context in which they were produced. In initial versions of the application, the introduction of multiple labels per vocalization was overwhelming for the labeler, which led to the instructions used in the data collection for this study: the application instead asked caregivers to label a vocalization only when they were confident in its meaning and intent, given the contextual clues. This greatly reduced the cognitive load of the labeling procedure and made it much easier for the caregivers to provide the data included in this study. As this study is the first of its kind to label the vocalizations of non- or minimally-speaking individuals in the wild, we strove to reduce cognitive load and interference with the caregiver's daily life, thus allowing for a large number of collected utterances from each individual. Moving forward, however, we will test application interfaces that allow multiple labels so that we can more accurately analyze vocalizations with multiple communicative intents. We will also provide the opportunity for multiple caregivers to participate in labeling at the same time so that we can establish inter-rater reliability and determine how consistent the labels for each individual are across caregivers.
The current analysis focuses on the vocalizations of 7 individuals, but given the variability across the population of non- or minimally-speaking individuals and the different conditions that can cause an individual to be non- or minimally-speaking, there might be patterns that are stronger for some subgroups of individuals compared to the larger group. While there were many additional individuals who were interested in participating in the study, many were unable to incorporate the recording procedure into their daily lives easily. This warrants the creation of a simpler recording procedure and application, which will be as non-invasive in daily life as possible, while also collecting a large variety of vocalizations. With this simpler recording procedure, we hope to expand this study to include a larger number of individuals and conditions, such that we can verify whether the patterns observed here carry over to a larger set of individuals. In addition, we will be able to conduct subgroup level analyses within specific diagnoses to determine whether characteristics of the disorder affect these patterns, e.g., in CP vs. ASD.
We also analyzed subjects across a wide age range of 6–23 years. Over this range, neurotypical individuals typically undergo physiological changes, such as lengthening of the vocal tract, as well as improved coordination of the articulators. However, these changes have not yet been studied in non- or minimally-speaking individuals, who have limited expressive vocabulary, and have not been analyzed in non-verbal vocalizations. As we expand the number of subjects in this study and conduct a longitudinal analysis, we aim to analyze any changes to non-verbal vocalizations that may occur as individuals age and link those changes to physiological changes in each individual. Future work will aim to increase the number of subjects so that we can compare specific subgroups, i.e., focusing on similar levels of cognitive status and smaller age groups. We could also compare to nonverbal infants aged 0–1 years to see whether vocalization classes from infants are similar to those from non- or minimally-speaking individuals.
Additionally, while the group-level analyses did highlight clear patterns that distinguished between categories, there was variation across individuals, with some individuals showing patterns opposite to those at the group level. This highlights the need for personalized modeling of vocalizations. There are subtleties in the types of acoustic modulations that non- or minimally-speaking individuals use to communicate different intents, and these may not necessarily carry over from one individual to the next. For clinicians and therapists, who may be seeing multiple patients, it will be important to develop a personalized model and analysis of the vocalizations of each non- or minimally-speaking individual, which would allow the clinician or therapist to easily interpret the intent of a vocalization. An initial set of models has been developed for some of the individuals in this study (Narain et al., 2022), and we aim to augment and further inform feature selection for these models based on the patterns observed in this paper. Motoric movements (both fine- and gross-motor) and gestures also contribute to a caregiver's perception and understanding of a non-verbal vocalization and were utilized when caregivers were labeling the audio. Though we would like to continue to focus on characterizing vocalizations using audio, as this is a simple, non-invasive way of collecting data, future work may evaluate the utility of visual and environmental information to augment the characterization of non-verbal vocalization categories, which may lead to improved personalized classification models.
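To illustrate what a personalized, per-individual model informed by such features might look like, the sketch below fits a classifier on a single individual's labeled vocalizations. The feature matrix, labels, and model choice are hypothetical stand-ins for illustration, not the models of Narain et al. (2022).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical per-individual data: one row of coordination features per
# labeled vocalization, with caregiver labels as the target.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 30))               # e.g., eigenvalue-based features
y = rng.choice(["frustrated", "dysregulated", "request",
                "self-talk", "delighted"], size=200)

# Fit and evaluate a model for this single individual only; a separate model
# would be trained for each participant rather than pooling across people.
model = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print("Mean cross-validated accuracy:", scores.mean())
```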
Altogether, future work will aim to incorporate the speech-based features of this paper into an easy-to-use application that can interpret vocalizations and additional contextual information in real time. This can help facilitate communication between non- or minimally-speaking individuals and the people around them who are unfamiliar with their communication patterns, such as visiting grandparents, new teachers, or new therapists, and will increase the independence and agency of non- or minimally-speaking individuals. We will additionally gain insight into the motor coordination strategies used by non- or minimally-speaking individuals for communication and compare them to those of neurotypical individuals to get closer to understanding the link between motor coordination and communication. Finally, in clinical practice, consistent tracking of motor coordination-based features could aid in understanding the effect of therapeutics on motor coordination and communication over time, and may lead to the development of additional strategies to assist communication in non- or minimally-speaking individuals.