Introduction
Hypnosis and hypnotic suggestions have been shown to be useful experimental tools to test theories of cognitive neuroscience (Oakley & Halligan,
2013; Raz,
2011), especially theories related to consciousness (Cardeña,
2014; Terhune, Cleeremans, Raz, & Lynn,
2017). For instance, hypnotic suggestions can evoke changes in the feeling of voluntariness (Weitzenhoffer,
1974,
1980) or modify one’s sense of agency (Haggard, Cartledge, Dafydd, & Oakley,
2004; Lush et al.,
2017; Polito, Barnier, & Woody,
2013). Responses to suggestions frequently involve alterations in perception, such as the experience of positive and negative hallucinations or delusions (Kihlstrom,
1985; Oakley & Halligan,
2009). Moreover, hypnotic suggestions can be employed to simulate some properties of neurological and psychiatric conditions in healthy subjects (Barnier & McConkey,
2003; Oakley,
2006). Finally, correlations between hypnotisability and measures employed by consciousness researchers (e.g. the rubber hand illusion; the vicarious pain questionnaire; mirror touch synaesthesia) have recently been found (Lush et al.,
2019). These correlations suggest that measures common in the consciousness literature are driven by hypnotic suggestibility. There is therefore an increasing need for an expansion of hypnosis research. Unfortunately, the successful application of hypnotic suggestions demands considerable resources, making it impractical for researchers to run large-scale hypnosis-related studies. In order to conduct experiments involving hypnosis, researchers generally need to recruit from a specific subsample of people based on their tendency to respond to hypnotic suggestions. To achieve this, researchers run hypnosis screening sessions before recruitment, so that, for example, they can identify the participants at the lowest and highest ends of the scale (low and highly hypnotisable people, respectively). High and low hypnotisability are usually defined as the top and bottom 10–15% of screening scores (Barnier & McConkey,
2004; Anlló, Becchio & Sackur,
2017). Therefore, screening procedures are time-consuming; to identify a single highly suggestible participant for an experiment, one has to find, on average, ten people who are willing to undertake a screening that can last from 40 up to 90 min depending on the applied method.
The hypnosis screening procedure has gone through a long developmental process in which it has become increasingly user-friendly. Initially, the screening consisted of two steps: a preliminary group session applying the Harvard Group Scale of Hypnotic Susceptibility Form A (HGSHS:A; Shor & Orne,
1963) and an individual session using the Stanford Hypnotic Susceptibility Scale Form C (SHSS:C; Weitzenhoffer & Hilgard,
1962) conducted with only those scoring very high or low in the first session. The later development of a reliable group screening method, the Waterloo-Stanford Group Scale of Hypnotic Susceptibility (WSGC; Bowers,
1993), has drastically reduced the time required for screening as it allows researchers to screen up to a dozen people in about 90 min (although it was originally intended to act as a second screen after an HGSHS:A, a single screen with the WSGC is reliable enough to select subjects capable of later having compelling subjective responses to difficult suggestions, e.g. digit–colour synaesthesia, Anderson, Seth, Dienes, & Ward,
2014, or compelling objective reductions in Stroop interference to alexia (word blindness) suggestions, e.g. Parris, Dienes, Bate, & Gothard,
2014). Recently, the Sussex Waterloo Scale of Hypnotizability (SWASH; Lush, Moga, McLatchie & Dienes,
2018) was introduced, which is a modified version of the WSGC. The SWASH includes new items to measure the subjective experiences of the participants (compare also the Carleton University Responsiveness to Suggestion Scale [CURSS, Spanos, Radtke, Hodgins, Stam, & Bertrand,
1983], the Creative Imagination Scale [CIS, Wilson & Barber,
1978], and the Experiential Scale for the WSGC [Kirsch, Milling, & Burgess, 1988]). The length of the procedure was reduced to 40 min and it can be run with larger groups than the WSGC (Lush et al.,
2018). Moreover, the dream and age regression suggestions were not included in the SWASH. These highly personalised items of the WSGC can be risky by virtue of possibly triggering unpleasant memories or emotions (Cardeña & Terhune,
2009; Hilgard,
1974).
Nonetheless, the application of the least demanding methods (such as the SWASH, the CURSS or the CIS) still requires potential participants to attend a group session, which makes the screening procedure relatively time-consuming and limits the subject pools to psychology students, who are the easiest to incentivise to participate in a group screening on campus. These two barriers to large-scale hypnosis studies could be overcome by employing fully automatised, online hypnosis screening procedures. In the last two decades, psychological science has witnessed growth in the application of online data collection for experimental purposes, paving the way for researchers to collect large samples in a short period of time (Reips,
2000; though it can come with its own problems, e.g. Dennis, Goodson & Pearson,
2018). To adapt the hypnosis screening procedure for online use, one needs to ensure that the non-“live” version can induce objective and subjective hypnotic responses similar to those obtained with a “live” hypnotist. Indeed, suggestibility scores of participants are comparable when the hypnotic induction and suggestions are delivered by a pre-recorded audiotape and when they are delivered by an experimenter (Barber & Calverley,
1964; Fassler, Lynn, & Knox,
2008; Lush, Scott, Moga, & Dienes,
2019). These findings underpin the idea that the participants could easily undergo a hypnosis screening procedure in their own rooms by listening to a pre-recorded script and filling out the booklets online. Nevertheless, online data collection has its own perils: the data acquired by online questionnaires might not be as reliable, and the results might not be consistent with those of traditional data collection procedures (Krantz & Dalal,
2000). Therefore, the reliability of new online questionnaires, such as the online version of a hypnosis screening procedure, needs to be tested even if there is evidence that the quality of the data and the findings of online-based studies can be similar to those obtained by traditional methods (Gosling, Vazire, Srivastava, & John,
2004; Buhrmester, Kwang, & Gosling,
2011).
In this project, our purpose is to explore the extent to which an online hypnotic screening procedure is reliable and consistent with an offline procedure. To this end, we measured people’s hypnotic suggestibility with the SWASH on two separate occasions and in two different environments. Henceforth, we call every type of data collection carried out in a controlled environment with the experimenter present an offline screening, whereas undertaking a hypnotic screening alone in one’s own room under one’s own control will be called online screening. In addition, we are interested in the extent to which the length of the delay between the first and second screens can influence the reliability and the scores of hypnotic suggestibility. The question of the stability of hypnotic suggestibility over periods of a few days or even decades has inspired various research projects (e.g. Fassler, Lynn, & Knox,
2008; Lynn, Weekes, Matyi, & Neufeld,
1988; Piccione, Hilgard, & Zimbardo,
1989). To assess the stability of hypnotic suggestibility, we recruited half of the sample from the 2016 subject pool and the other half from the 2017 subject pool; all of these participants had already received offline screening. Therefore, for some of the participants, the delay between the two screenings is not more than 6 months (short delay group), whereas for the others, it is at least one and a half years (long delay group). For practical reasons, the first screening was organised offline, in groups of 20–40 for all the participants, whereas the second screening was either an online screening or another offline one. By this method, we are able to estimate how strongly the type of the screening and the length of the delay can influence the suggestibility scores of the participants; we can also assess their influence on the test–retest reliability and the validity of the screening. Taken together, this project explores whether a well-established offline screening procedure could be replaced for practical purposes by an online version, which could help consciousness researchers run more and larger hypnosis studies by drastically cutting the recruitment-related costs.
While responding to hypnotic suggestions, people tend to experience being in some form of trance or altered state (Kihlstrom,
2005; Kirsch,
2011). This experience is usually measured by subjective reports of depth of hypnosis (e.g. Hilgard & Tart,
1966), which is, interestingly, strongly associated with people’s ability to respond to hypnotic suggestions (Wagstaff, Cole, & Brunas-Wagstaff,
2008). We investigate this link by assessing the strength of the relationship between hypnotic suggestibility scores and depth of hypnosis reports, and the extent to which the mentioned experimental manipulations can influence this relationship. We also aim to evaluate the extent to which depth of hypnosis is influenced by the type of data collection and the length of the delay between screens to ensure that people experience a comparable level of hypnotic depth during online and offline screens.
In our analyses, we solely employed estimation procedures instead of testing the existence of differences with an inferential statistical tool such as the null-hypothesis significance test (Fisher,
1925; Neyman & Pearson,
1933) or the Bayes factor (e.g. Dienes,
2011; Rouder et al.,
2009). Estimation is recommended over inferential statistics when the existence of a difference is already established or is not relevant (Jeffreys,
1961; Wagenmakers et al.,
2018). The second point proves to be decisive for our case, since it is not necessary to test the existence of any investigated effect to answer our research questions. For instance, the core aim of the current project was to draw conclusions about the applicability of online hypnosis screening by comparing the SWASH scores, the reliability and the validity of online and offline hypnosis screening. Imagine a scenario in which an inferential statistical tool demonstrates evidence for a difference between the offline and online groups in favour of the offline group in all aspects that assess the quality of the measurement. Importantly, this outcome per se cannot give a definite answer to our central question, as the mere fact that offline screening is significantly better than online screening neglects the question of the magnitude of the difference. To reject or accept the idea that online screening is viable, we need to know the extent to which the quality of offline and online screening differs so that we can decide whether the benefits of the online screen outweigh its costs. Further, the fact that the two types of screening will correlate cannot be in doubt; the question is simply the strength of the relationship between them.
To explore the range of plausible effect sizes, estimation methods, either from the Bayesian (Kruschke,
2010,
2013; Rouder, Lu, Speckman, Sun, & Jiang,
2005; Wagenmakers, Morey, & Lee,
2016) or from the frequentist school (Cumming,
2014) can be used. Here, we applied a Bayesian tool, estimation by calculating the 95% Bayesian Credibility Intervals, as this is the method that is appropriate to answer our research question; namely, how confident can we be that the true effect size lies within a specific interval (Wagenmakers et al.,
2018). Only Credibility Intervals allow us to make claims such as that the true value of the effect size is probably not larger or smaller than a particular value.
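The logic of such an interval can be illustrated with a minimal sketch. It assumes a normal likelihood, a standard error treated as known, and a flat prior on the difference, so that the posterior is a normal distribution centred on the observed difference; these modelling choices, the function name and the scores below are illustrative assumptions and are not taken from the analyses reported here.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def credible_interval_diff(group_a, group_b, mass=0.95):
    """Central credible interval for mean(group_a) - mean(group_b).

    Assumption: normal likelihood, standard error treated as known, and a
    flat prior, so the posterior is Normal(observed difference, SE).
    """
    diff = mean(group_a) - mean(group_b)
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    posterior = NormalDist(mu=diff, sigma=se)
    tail = (1 - mass) / 2
    return posterior.inv_cdf(tail), posterior.inv_cdf(1 - tail)


# Hypothetical screening scores (out of ten), for illustration only.
offline = [4.1, 5.0, 3.2, 6.3, 4.8, 5.5, 2.9, 4.4]
online = [3.8, 4.6, 3.0, 5.9, 4.1, 5.2, 3.3, 4.0]

lower, upper = credible_interval_diff(offline, online)
print(f"95% interval for offline minus online: [{lower:.2f}, {upper:.2f}]")
```

Under these assumptions, the upper limit can be read as the largest plausible advantage of offline over online screening, which is the kind of statement the research question calls for.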
Discussion
The purpose of the present study was to explore whether online hypnosis screening is feasible, as the adoption of this method could reduce the recruitment-related costs of hypnosis research. To this end, we estimated the extent to which offline and online hypnosis screening scores, measured by the SWASH, are comparable. The results revealed that the difference between offline and online groups was small to negligible in all aspects and, importantly, applying online rather than offline screening is unlikely to reduce the composite screening score by more than 1.22 and the objective score by more than 1.36 out of ten. To put these effect sizes in perspective, a recent meta-analysis of four studies investigating the influence of standard induction procedures on suggestibility found that, on average, people score 1.46 higher (out of ten) on scales assessing objective responses to suggestions if they had received a prior induction compared to no induction (Martin & Dienes,
2019). Moreover, the average SWASH score in the online group was comparable to the result of an earlier screen conducted in group sessions at the same university (Lush et al.,
2018). Finally, it is not only the average scores in the online group that can be deemed acceptable; the distributions of SWASH scores were also similar in the offline and online groups, even at the upper end of the scale. This implies that some people can successfully respond to many suggestions when they undertake an online screening (see Fig.
1). None of this was obvious before the data were collected.
The correlation between objective and subjective scores was strong for both the offline and online groups; crucially, the correlation in the online group is unlikely to be smaller than 0.65. This indicates that the validity of the SWASH remained acceptable even with online data collection. Moreover, the strength of the correlation between the subjective and objective components of the SWASH found by Lush et al. (
2018) was 0.70, which is consistent with our results. The strength of the correlation between SWASH scores of the first and second screens was medium in the offline and strong in the online group. The lower bound of the 95% CI in the online group was 0.57 implying that the test–retest reliability of the online measurement is adequate. These values are also appropriate in relative terms. For instance, Fassler et al. (
2008) employed the CURSS, which, like the SWASH, has an objective and a subjective subscale, on two occasions, and the test–retest correlations were 0.59 and 0.77 for the objective and subjective components, respectively. These results are in line with the correlations we found in the online group. Overall, the psychometric properties of online screening were excellent; the quality of data collected online was shown to be consistent with the quality of offline data gathered within this study and as part of earlier studies with the SWASH and other hypnosis screening tools.
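The reasoning from a lower bound to an adequacy judgement can be sketched as follows. The sketch swaps in a standard approximation, the Fisher z transformation with a normal approximation, which under a flat prior on z can be read as an approximate credibility interval; it is not necessarily the procedure used for the intervals reported here, and the function names and scores are hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist, mean


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_interval(x, y, mass=0.95):
    """Return (r, lower, upper) using the Fisher z approximation.

    Assumption: z = atanh(r) is roughly normal with SE 1 / sqrt(n - 3).
    """
    n = len(x)
    if n <= 3:
        raise ValueError("need more than three paired scores")
    r = pearson_r(x, y)
    z = atanh(r)  # undefined for a perfect correlation of exactly 1 or -1
    se = 1 / sqrt(n - 3)
    crit = NormalDist().inv_cdf(1 - (1 - mass) / 2)
    return r, tanh(z - crit * se), tanh(z + crit * se)


# Hypothetical first- and second-screen scores, for illustration only.
first_screen = [2, 5, 7, 3, 8, 4, 6, 1, 9, 5]
second_screen = [3, 4, 6, 2, 9, 5, 5, 2, 8, 6]

r, lower, upper = correlation_interval(first_screen, second_screen)
print(f"test-retest r = {r:.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

The lower limit is the quantity of interest: if even that value would count as acceptable reliability, the measurement can be judged adequate without a separate test of whether the correlation differs from zero.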
Modern theories of hypnosis advocate the notion that all hypnosis is self-hypnosis, since the hypnotic subject is the one who actively responds to the suggestions and creates the requested experience (Kihlstrom,
2008; Raz,
2011). This does not mean, however, that the experimenter has no influence on the responsiveness of the subject. For instance, the presence of an experimenter can be helpful in building up a rapport and facilitating responsiveness of the participants (e.g. Gfeller, Lynn, & Pribble,
1987). Nonetheless, the experimenter can also bias the responses of the subjects (e.g. Barber & Calverley,
1966; Troffer & Tart,
1964), and importantly, this level of bias can strongly vary across participants as it is almost impossible to deliver the induction and suggestions in an identical way multiple times. Therefore, the application of fully automatised screenings, such as the online version, can subserve the standardisation of the assessment of hypnotic suggestibility.
Introducing online hypnosis screening would markedly decrease the amount of time experimenters need to invest to find participants for their studies. However, to complete a screening procedure, the participants still need to spend 45–60 min without taking a break; otherwise, the data would not be usable for recruitment purposes. A substantial part of the screening is devoted to the standard hypnotic induction, which consists of various suggestions, mostly to relax; however, the responses to these suggestions are not assessed directly during the screening (e.g. Shor & Orne,
1963; Weitzenhoffer & Hilgard,
1962). Would it be feasible to exclude the standard induction from the screening procedure to save time for the participants? Cognitive theories of hypnosis, such as the cold control theory (Barnier, Dienes, & Mitchell,
2008; Dienes & Perner,
2007), emphasise the role of the feeling of involuntariness in differentiating hypnotic from non-hypnotic responses. This feeling is also known as the “classical suggestion effect” (Weitzenhoffer,
1974,
1980). Therefore, according to cold control theory, not the practice of induction, but the feeling of involuntariness is the demarcation criterion, and it is important to ensure with self-report measures that the participants experienced a reduction in the level of control over their own behaviour (e.g. Palfi, Parris, McLatchie, Kekecs, & Dienes,
2018).
From a practical perspective, it is important to bear in mind that the presence of a standard induction can increase responsiveness to the suggestions in the screening, on average, by 1.46 (Martin & Dienes,
2019) compared to the absence of the induction; and that the strength of the effect of an induction fluctuates across suggestions (Terhune & Cardeña,
2016). Nonetheless, as argued earlier in this paper, a general reduction of responsiveness does not qualify as a decisive argument for retaining the induction procedure. As long as the absence of the induction does not produce a floor effect or markedly alter the ranking of the suggestibility scores, the procedure can be perfectly adequate for screening people for individual differences in response. Indeed, there are existing attempts to assess responsiveness to suggestions without exposing the participants to an induction, such as the Barber Suggestibility Scale (Barber & Glass,
1962) and the CIS (Wilson & Barber,
1978). These scales can be easily administered in a context presented as a test of imagination while applying motivational instructions to replace the induction or simply leaving out the induction. The existing evidence suggests that employing motivational instructions creates a level of responsiveness similar to that produced by the induction; however, the absence of the induction significantly reduces the level of responsiveness to suggestions (Barber & Wilson,
1978). Future research could explore the extent to which the exclusion or replacement of the induction from the SWASH would be feasible and assess whether it would be beneficial.
A secondary interest of the current study was to assess the extent to which the length of the delay between the first and second screening affects the outcome of the screen and the psychometric properties of the measurement tool. Repeated assessment of suggestibility can negatively affect the suggestibility scores, for instance, if the delay between the two occasions is only a few days or weeks (Barber & Calverley,
1966; Fassler et al.,
2008; Lynn et al.,
1988). This reduction in suggestibility may be caused by boredom; the participants can become disengaged with the procedure by virtue of finding it repetitive (Barber & Calverley,
1966; Fassler et al.,
2008). In our case, the short delay was a minimum of 5 months and we found no indication of substantial differences between the short and long delay groups among the SWASH subscales. For instance, Fassler et al. (
2008) found a difference of 0.77 on the objective scores between the first and second session, but according to our data, the largest plausible difference is only 0.34. Nonetheless, the effect of boredom on the subjective scores observed by Fassler et al. (
2008) was 1.05,
which is compatible with our results as the lower bound of the difference in that aspect was 1.12. Taken together, our data imply that the negative effect of boredom might wear off or become negligible after 5 months; however, more research is needed to settle this matter and identify the ideal amount of delay that can prevent boredom effects in repeated designs.
We note that our sample was restricted to university students, which might limit the generalisation of our findings, and crucially the applicability of online hypnosis screening, to a wider population. Nonetheless, limited generalisability is a universal issue in experimental hypnosis research. For instance, a meta-analysis of 27 studies investigating hypnotically induced analgesia found that of the studies with non-clinical samples (
N = 19), only one was run with people recruited from the local community whereas all the other studies were run with students (Montgomery, Duhamel, & Redd,
2000). Recruiting from a wider population would not only increase the generalisability of the findings, but would also help researchers run large-scale hypnosis studies, strengthening the replicability of the findings. Future research is needed to explore the extent to which online hypnosis research can be applied to screen and recruit people from local communities.
Finally, the vast majority of our participants were females; hence, the gender imbalance in our sample might be another factor hindering the generalisability of our findings. Research on the link between gender and hypnotic suggestibility has provided ambiguous results with some studies finding virtually no effect (Cooper & London,
1966; Dienes, Brown, Hutton, Kirsch, Mazzoni, & Wright,
2009; McConkey, Barnier, Maccallum, & Bishop,
1996) and some studies demonstrating a small effect size (Green,
2004; Green & Lynn,
2010; Morgan & Hilgard,
1973; Page & Green,
2007; Rudski, Marra, & Graham,
2004). Studies showing a small effect size of gender consistently found that women score higher than men, which might be caused by a divergence in a personality trait that partly underlies suggestibility or by a difference between women and men in how they assess the difficulty of the suggestions (Rudski, Marra, & Graham,
2004). Nonetheless, these explanations are conjectures that have yet to be tested. With only seven men in the current data set, we can only speculate how much gender might moderate the difference between the online and the offline measurement of hypnotic suggestibility.