According to cognitive load theory (Sweller, 2010; Sweller, Van Merriënboer, & Paas, 1998; Van Merriënboer & Sweller, 2005), instruction can impose three types of cognitive load (CL) on a learner’s cognitive system: task complexity and the learner’s prior knowledge determine the intrinsic load (IL), instructional features that are not beneficial for learning contribute to extraneous load (EL), and instructional features that are beneficial for learning contribute to germane load (GL). IL should be optimized in instructional design by selecting learning tasks that match learners’ prior knowledge (Kalyuga, 2009), whereas EL should be minimized to reduce ineffective load (Kalyuga & Hanham, 2011; Paas, Renkl, & Sweller, 2003) and to allow learners to engage in activities imposing GL (Van Merriënboer & Sweller, 2005).

The extent to which instructional features contribute to EL or GL may depend on the individual learner and on the extent to which that learner experiences IL. For example, less knowledgeable learners may learn better from worked examples (i.e., the worked example effect; Cooper & Sweller, 1987; Paas & Van Merriënboer, 1994; Sweller & Cooper, 1985) or from completing a partially solved problem (i.e., the problem completion effect; Paas, 1992; Van Merriënboer, 1990) than from autonomous problem-solving, whereas more knowledgeable learners benefit most from autonomous problem-solving (i.e., the expertise reversal effect; Kalyuga, Ayres, Chandler, & Sweller, 2003; Kalyuga, Chandler, Tuovinen, & Sweller, 2001). The information presented in worked examples is redundant for more knowledgeable learners, who already have the cognitive schemata to solve the problem without instructional guidance, and processing redundant information imposes EL (i.e., the redundancy effect; Chandler & Sweller, 1991). Likewise, when instructions are presented in such a way that learners need to split their attention between two or more mutually referring information sources, they are likely to experience higher EL (i.e., the split-attention effect; Sweller, Chandler, Tierney, & Cooper, 1990).

When IL is optimal and EL is low, learners can engage in knowledge elaboration processes (Kalyuga, 2009) like self-explanation (Atkinson, Renkl, & Merrill, 2003; Berthold & Renkl, 2009) and argumentation (Fischer, 2002; Knipfer, Mayr, Zahn, Schwan, & Hesse, 2009) that impose GL and facilitate learning.

Being able to properly measure the different types of CL would help educational researchers and instructional designers to better understand why learning outcomes attained with instructional formats may differ between formats or between learners. If IL differs between learners who are given the same instructions, the difference in IL provides us with information on the learners’ level of expertise and—if measured repeatedly—how that changes over time. Meanwhile, when instructions are varied—for example in experimental studies—such measurements can help us gain a better understanding of instructional effects for learners with similar or distinct levels of expertise. Thus far, however, only a few attempts have been made to develop instruments for measuring these different types of cognitive load (Cierniak, Scheiter, & Gerjets, 2009; DeLeeuw & Mayer, 2008; Eysink et al., 2009).

The measurement of CL, IL, EL, and GL

Subjective rating scales like Paas’s (1992) nine-point mental effort rating scale have been used extensively (for reviews, see Paas, Tuovinen, Tabbers, & Van Gerven, 2003; Van Gog & Paas, 2008) and have been identified as reliable and valid estimators of overall CL (Ayres, 2006; Paas, Ayres, & Pachman, 2008; Paas, Tuovinen, et al., 2003; Paas, Van Merriënboer, & Adam, 1994). The reviews by Paas, Tuovinen, et al. and Van Gog and Paas also make clear that in many studies task difficulty, rather than mental effort, is used as an estimator of CL. In addition to measures of overall CL, attempts have been made to measure the different types of CL separately. Ayres, for instance, presented a rating scale for the measurement of IL, and other researchers have used rating scales for measuring IL, EL, and GL separately (e.g., Eysink et al., 2009). To measure EL, Cierniak et al. (2009) asked learners to rate on a six-point scale how difficult it was to learn with the material, and to measure GL, they adopted Salomon’s (1984) question of how much learners concentrated during learning.

Generally, the use of different scales, varying in both number of categories and labels, is a problem, especially because some of these scales have not been validated. Moreover, whether overall CL or (one of) the types of CL is measured, in most cases a single Likert item is used, and the number of categories in that item varies (see also Van Gog & Paas, 2008): it can be five (e.g., Camp, Paas, Rikers, & Van Merriënboer, 2001; Salden, Paas, Broers, & Van Merriënboer, 2004), six (e.g., Cierniak et al., 2009), seven (e.g., Ayres, 2006), or nine (e.g., Eysink et al., 2009; Paas, 1992). Although load data are typically assumed to be measured at the interval (i.e., metric) level, using fewer than seven categories may yield measurement at the ordinal rather than the interval level. Furthermore, referring to very specific instructional features to measure EL or GL may pose a conceptual problem, because the expertise reversal effect shows that a particular instructional feature may be associated with GL (i.e., enhancing learning outcomes) for one learner and with EL (i.e., hindering learning outcomes) for another learner (Kalyuga et al., 2003). An alternative approach to the formulation of questions for EL and GL might solve this problem. Measurement could also become more precise when multiple items are used for each of the separate types of CL, with a scale that differs from the scales used in previous research. It is not entirely clear to what extent workload and cognitive load refer to the same concept across settings, but the NASA-TLX is an example of an instrument that assesses workload on five 7-point scales, in which increments of high, medium, and low estimates for each point result in 21 gradations per scale (Hart & Staveland, 1988; Hilbert & Renkl, 2009; Zumbach & Mohraz, 2008).

A new instrument for the measurement of IL, EL, and GL

In this study, a new instrument for the measurement of IL, EL, and GL in complex knowledge domains was developed. The data for the present article were collected in four lectures and in a randomized experiment in statistics. Statistics is an important subject in many disciplines, jobs, study programs, and everyday situations. In this domain, abstract concepts are hierarchically organized and typically have little or no meaning outside the domain. Not only do learners need to learn formulas and how to apply them correctly, they also need to develop knowledge of key concepts and definitions and learn to understand how statistical concepts are interrelated (Huberty, Dresden, & Bak, 1993). Although the latter requires intensive training, knowledge of key concepts and definitions and proficiency with basic formulas can be developed at an early stage (Leppink, Broers, Imbos, Van der Vleuten, & Berger, 2011, 2012a, b). Therefore, asking learners to rate the difficulty or complexity of formulas, concepts, and definitions may be feasible at an early stage, whereas asking them to rate the complexity of relationships between various concepts may not, because they may not yet be able to perceive any of these relationships. With this in mind, the items displayed in Appendix 1 were developed.

Items 2 and 9 refer to formulas, whereas Items 1, 3, 7, 8, and 10 refer to concepts, definitions, or just the topics covered. Although Item 8 directly refers to understanding of statistics, of course the term “statistics” can be replaced by the term representing another complex knowledge domain if data are to be collected in, for example, mathematics, programming, physics, economics, or biology.

The ten items had been piloted in an online study at a Belgian university (teaching in Dutch), involving 100 first-year bachelor students in psychology and 67 master’s students in psychology.

The present set of studies

In a set of four studies, all carried out in the same Dutch university, the performance of the new instrument was examined. In a first study (henceforth, Study I), the instrument was administered in a lecture in statistics for 56 PhD students in psychology and health sciences, and Hypotheses 1–3 were tested using principal component analysis:

  • Hypothesis 1. Items 1, 2, and 3 all deal with complexity of the subject matter itself and are therefore expected to load on the factor of IL;

  • Hypothesis 2. Items 4, 5, and 6 all deal with negative characteristics of instructions and explanations and are therefore expected to load on the factor of EL;

  • Hypothesis 3. Items 7, 8, 9, and 10 all deal with the extent to which instructions and explanations contribute to learning and are therefore expected to load on the factor of GL.

In a second study (henceforth, Study II), we administered a questionnaire comprising these ten items and the aforementioned scales by Paas (1992) for CL, Ayres (2006) for IL, Cierniak et al. (2009) for EL, and Salomon (1984) for GL in a lecture in statistics for 171 second-year bachelor students in psychology, to test the first three and the following four hypotheses (i.e., Hypotheses 1–7) using confirmatory factor analysis:

  • Hypothesis 4. Ayres’s (2006) scale for IL loads on IL but not on EL or GL;

  • Hypothesis 5. Cierniak et al.’s (2009) scale for EL loads on EL but not on IL or GL;

  • Hypothesis 6. Salomon’s (1984) scale for GL loads on GL but not on IL or EL;

  • Hypothesis 7. Paas’s (1992) scale for CL loads on IL, EL, and GL.

Hypotheses 4–7 received no support from the data in Study II. Ayres’s (2006) scale for IL had a lower loading on IL than Items 1, 2, and 3, and it had a significant cross-loading on EL. Cierniak et al.’s (2009) scale for EL and Salomon’s (1984) scale for GL diverged from the other items in the instrument, and Paas’s (1992) scale for CL had relatively weak loadings on all three factors. Therefore, only Hypotheses 1–3 were tested using confirmatory factor analysis in a third study (henceforth, Study III). The data for this analysis were collected in a lecture in statistics for 136 third-year bachelor students in psychology and in a lecture in statistics for 148 first-year bachelor students in health sciences. As Studies I, II, and III together provided support for Hypotheses 1–3, a three-factor approach for IL, EL, and GL was adopted in a fourth study (henceforth, Study IV).

In Study IV, a randomized experiment was conducted to examine the effects of experimental treatment and prior knowledge on CL, IL, EL, GL, and learning outcomes. In this experiment, a total of 58 novice learners studied a problem either in a familiar format (textual explanation) and subsequently in an unfamiliar format (formula; n = 29), or in an unfamiliar format (formula) and subsequently in a familiar format (textual explanation; n = 29). Studies by Reisslein, Atkinson, Seeling, and Reisslein (2006) and Van Gog, Kester, and Paas (2011) have demonstrated that example–problem pairs are more effective for novices’ learning than problem–example pairs. Even though both conditions received the same tasks, the order matters, presumably because studying an example first induces lower EL and higher GL, allowing for schema building; that schema can subsequently be used when solving the problem. Solving a problem first, by contrast, induces very high EL and little learning. In line with these findings, we expected that learners who studied the problem in a familiar (textual) format first would demonstrate better learning outcomes (because they could use what they had learned from the text to understand the formula) and report lower levels of EL and higher levels of GL. Further, we expected learners with more prior knowledge to demonstrate better learning outcomes and report lower levels of IL than less knowledgeable learners. Thus, Hypotheses 8–12 were tested in a randomized experiment:

  • Hypothesis 8. Learners who have more prior knowledge experience lower IL than learners who have less prior knowledge;

  • Hypothesis 9. Learners who have more prior knowledge demonstrate better learning outcomes than learners who have less prior knowledge;

  • Hypothesis 10. Studying a problem first in a familiar format and subsequently in an unfamiliar format enhances learning outcomes more than studying the same problem first in an unfamiliar format and subsequently in a familiar format;

  • Hypothesis 11. Studying a problem first in a familiar format and subsequently in an unfamiliar format imposes less EL on a learner than studying the same problem first in an unfamiliar format and subsequently in a familiar format;

  • Hypothesis 12. Studying a problem first in a familiar format and subsequently in an unfamiliar format imposes more GL on a learner than studying the same problem first in an unfamiliar format and subsequently in a familiar format.

In what follows, the method and results are first presented for each study separately. Next, findings and limitations are discussed per study, and implications for future research are outlined.

Study I: Exploratory analysis

Method

A total of 56 PhD students in the social and health sciences, who attended a lecture on multiple linear regression analysis and analysis of variance, completed the questionnaire. To avoid potential confounding from specific item-order effects, the items presented in Appendix 1 were counterbalanced in three orders: order A (n = 19), Items 1, 7, 4, 2, 8, 5, 3, 9, 6, and 10; order B (n = 20), Items 6, 10, 9, 3, 5, 8, 2, 7, 1, and 4; and order C (n = 17), Items 9, 3, 6, 8, 2, 4, 10, 5, 7, and 1. The forms were distributed in randomized order, so that people sitting next to each other were not necessarily responding to the same item at the same time. Although this was also stated in the written instructions on the questionnaire, 2 min of oral instruction were provided at the beginning of the lecture to emphasize that each of the items in the questionnaire referred to the lecture that students were about to attend. All students completed the questionnaire on paper at the very end of the lecture and returned it right away. The lecture lasted 120 min, and students had a break of about 15 min roughly halfway through. The same procedure was followed in the lectures in Studies II and III.

Hypotheses 1–3 were tested using principal component analysis. Principal component analysis is a type of exploratory factor analysis, in that loadings from all items on all components are explored.

Results

Although the sample size of this lecture was rather small for a ten-item instrument, the distributional properties of the data allowed for this type of factor analysis [no outliers or extreme skewness or kurtosis, and sufficient interitem correlation; KMO = .692, Bartlett’s χ²(45) = 228, p < .001]. With a small sample like this, principal component analysis is preferred to principal factor analysis because it is less dependent on assumptions (e.g., normally distributed residuals are assumed in the latter).

Oblique (i.e., Oblimin) rotation was performed to take the correlated nature of the components into account (orthogonal rotation assumes that the factors are uncorrelated). If the components underlying the ten items are as hypothesized (IL, EL, and GL), correlation between components is to be expected. For the knowledgeable learner, IL may be low, and the instructional features that contribute to EL and GL, respectively, may differ from the instructional features that contribute to EL and GL for less knowledgeable learners. Learners who experience extremely high IL and/or high EL may not be able or willing to engage in GL activities. With oblique rotation in principal component analysis, the correlation between each pair of components is estimated and taken into account in the component solution. Means (and standard deviations, SD), skewness, kurtosis, and component loadings are presented in Table 1. No outliers were detected.
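For readers who wish to reproduce this kind of analysis, the sketch below shows how the sampling adequacy checks and the Oblimin-rotated principal component extraction could be run in Python with the factor_analyzer package. It is a minimal sketch, not the analysis code used in this study: the file name and the item column names (item1–item10) are hypothetical placeholders.

```python
# Minimal sketch of the Study I analysis pipeline, assuming a DataFrame
# `responses` with hypothetical columns item1..item10 (one row per student).
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import (
    calculate_bartlett_sphericity,
    calculate_kmo,
)

responses = pd.read_csv("study1_items.csv")  # hypothetical file

# Check whether the interitem correlations are strong enough for factoring.
chi_square, p_value = calculate_bartlett_sphericity(responses)
kmo_per_item, kmo_total = calculate_kmo(responses)
print(f"Bartlett chi2 = {chi_square:.1f}, p = {p_value:.4f}, KMO = {kmo_total:.3f}")

# Principal component extraction with oblique (Oblimin) rotation,
# which allows the three components to correlate.
pca = FactorAnalyzer(n_factors=3, method="principal", rotation="oblimin")
pca.fit(responses)
loadings = pd.DataFrame(pca.loadings_, index=responses.columns,
                        columns=["IL", "EL", "GL"])
print(loadings.round(2))
```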

Table 1 Means (and SD), skewness, kurtosis, and component loadings in Study I

Figure 1 shows a component loading plot. The component loadings are in line with Hypotheses 1–3, and no cross-loadings above .40 are present. Although the absence of cross-loadings above .40 is a positive sign, given the limited sample size of n = 56, the component loadings reported in Table 1 only provide a preliminary indication of what the component solution may be. In Table 2, we present the correlations between the three components.

Fig. 1 Component loading plot for Study I

Table 2 Component correlations in Study I

Reliability analysis for the three components revealed Cronbach’s alpha values of .81 for Items 1, 2, and 3 (expected to measure IL); .75 for Items 4, 5, and 6 (expected to measure EL); and .82 for Items 7, 8, 9, and 10 (expected to measure GL).
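As an illustration of how these scale reliabilities can be computed, the sketch below implements the standard Cronbach's alpha formula for the three hypothesized subscales. It assumes the same hypothetical `responses` DataFrame and column names as the sketch above.

```python
# Minimal sketch: Cronbach's alpha for the three hypothesized subscales.
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the sum score)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

il_alpha = cronbach_alpha(responses[["item1", "item2", "item3"]])
el_alpha = cronbach_alpha(responses[["item4", "item5", "item6"]])
gl_alpha = cronbach_alpha(responses[["item7", "item8", "item9", "item10"]])
print(f"IL: {il_alpha:.2f}, EL: {el_alpha:.2f}, GL: {gl_alpha:.2f}")
```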

Study II: Confirmatory analysis

Method

Data were collected in a lecture for 171 second-year bachelor students in psychology on one-way and two-way analysis of variance. The use of a different cohort of students in this second study was justified because both lectures covered topics of a comparable level of difficulty: the students in both cohorts had limited knowledge of the topics covered, and the lectures were therefore of a rather introductory level. Furthermore, if a three-factor structure underlies the items in an instrument, one would expect that structure to hold across cohorts and potentially across settings.

To test Hypotheses 4–7, we added four items to the ten items presented in Appendix 1: Paas’s (1992) scale, which is assumed to be an estimator of CL; a nine-point version of Ayres’s (2006) six-point rating scale for IL; a nine-point version of Cierniak et al.’s (2009) seven-point rating scale for EL; and a nine-point version of the seven-point rating scale for GL used by Cierniak et al., who adopted it from Salomon (1984). These four items, presented in Appendix 2, formed the first four items of the questionnaire.

The item order for the ten new items was the same as order C in Study I. The reason that nine-point scales were used for each of these four items was to ease the standardization and interpretation of outcomes in the confirmatory factor analysis. If these items measure what they are expected to measure, using a nine-point scale should cause no harm to the measurement. For example, higher EL should still be reflected in higher ratings on the nine-point version of Cierniak et al.’s (2009) seven-point rating scale for EL.

As in the principal component analysis on the data obtained in Study I, in the confirmatory factor analysis on the data in Study II, the correlation between each pair of factors was estimated and taken into account in the factor solution.
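The hypothesized measurement model can be written in lavaan-style syntax. The sketch below shows one way such a three-factor model with correlated factors could be estimated in Python with the semopy package (in R, lavaan's cfa() would be the analogue); it is a sketch under assumptions, not the software actually used in this study, and the item column names are hypothetical.

```python
# Minimal sketch of a three-factor CFA with correlated factors (semopy),
# assuming a DataFrame `responses` with hypothetical columns item1..item10.
import pandas as pd
import semopy

model_desc = """
IL =~ item1 + item2 + item3
EL =~ item4 + item5 + item6
GL =~ item7 + item8 + item9 + item10
IL ~~ EL
IL ~~ GL
EL ~~ GL
"""

responses = pd.read_csv("study2_items.csv")  # hypothetical file

cfa = semopy.Model(model_desc)
cfa.fit(responses)                    # maximum likelihood estimation
print(cfa.inspect())                  # loadings and factor covariances
print(semopy.calc_stats(cfa).T)       # chi-square, CFI, TLI, RMSEA, etc.
```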

Results

In Table 3, we present the means (and SD), skewness, and kurtosis, as well as the squared multiple correlation (R²) of each of the items administered in Study II. The R² is an indicator of item reliability and should preferably be .25 or higher.
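As a reminder of how these item reliabilities relate to the loadings reported below: in a standardized solution in which an item loads on a single factor and has no residual covariances, the squared multiple correlation reduces to the squared standardized loading,

\[
R^2_i \;=\; \lambda_i^2 .
\]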

Table 3 Means (and SD), skewness, and kurtosis in Study II

The R² values reported in Table 3 and the factor loadings presented in Table 4 indicate that Cierniak et al.’s (2009) scales for EL and GL diverge from the other items in the instrument.

Table 4 Factor loadings for each of the 14 items administered in Study II

In addition, Paas’s (1992) scale for CL had relatively weak loadings on all three factors, possibly because it captures overall load, whereas all other items in the questionnaire focus on a specific type of load. Although the loading of .61 for Ayres’s (2006) scale for IL could be acceptable in itself, the modification indices reveal a significant cross-loading on EL, indicating that this scale may diverge from the other items that are expected to measure IL. In line with this, both its factor loading and its R² are lower than those of the other items that load on IL and have no significant cross-loadings.

With the present study design, we cannot answer the question of why these measures diverge, or which of the measures is a better measure of the different types of load, because the instructional tasks used in our study differed extensively from those in the prior studies. However, given that the ten recently developed items appear to form a three-factor solution from which the other four items diverge, we continued by testing a model with only the ten recently developed items. The three factors are significantly correlated: the correlation between IL and EL is .41 (p < .001), the correlation between IL and GL is .33 (p < .001), and the correlation between EL and GL is –.19 (p = .025). Two residual covariance paths were added to the model: between Item 7 and Item 9, and between Item 9 and Item 10. Item 9 asks students to rate the extent to which the activity contributed to their understanding of formulas, whereas Items 7 and 10 refer more to verbal information. These residual covariance paths were included because the three lecturers involved in Study II and Study III differed in their emphasis on verbal explanation versus formula-based explanation.

Table 5 contains the factor loadings of Items 1–10 in Study II and the correlations of the two residual covariance paths. The two residual covariance paths have small coefficients, and one of them was not statistically significant. The model yields χ²(30) = 62.36, p < .001, CFI = .965, TLI = .947, RMSEA = .079. The modification indices do not provide any meaningful suggestions for additional paths. Although the CFI and TLI appear to indicate a well-fitting model, the RMSEA is on the edge (i.e., values above .08 are inadequate, values around .06 are acceptable, and values of .05 and lower are preferred). We decided to test this model on the new data collected in two lectures in Study III.
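For reference, the conventional definitions of these fit indices are given below (exact conventions differ slightly by software; for instance, some implementations use N − 1 rather than N in the RMSEA denominator), together with a check of the reported RMSEA against the Study II chi-square, degrees of freedom, and sample size (N = 171). The subscript M denotes the hypothesized model and B the baseline (independence) model.

\[
\mathrm{RMSEA} = \sqrt{\frac{\max(\chi^2_M - df_M,\,0)}{df_M \cdot N}}, \qquad
\mathrm{CFI} = 1 - \frac{\max(\chi^2_M - df_M,\,0)}{\max(\chi^2_B - df_B,\ \chi^2_M - df_M,\ 0)}, \qquad
\mathrm{TLI} = \frac{\chi^2_B/df_B - \chi^2_M/df_M}{\chi^2_B/df_B - 1}.
\]

\[
\text{Study II: } \mathrm{RMSEA} \approx \sqrt{\frac{62.36 - 30}{30 \times 171}} \approx .079 .
\]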

Table 5 Factor loadings for each of the ten recently developed items administered in Study II

Study III: Cross-validation

Method

The instrument was administered in a lecture for 136 third-year bachelor students in psychology on logistic regression and in a lecture for 148 first-year bachelor students in health sciences on null hypothesis significance testing. In the lecture on logistic regression, the items were administered in the order presented in Appendix 1. In the lecture on null hypothesis significance testing, the items presented in Appendix 1 were presented in three orders: the order in Appendix 1 (n = 50), order D (n = 49; Items 1, 5, 10, 2, 6, 3, 7, 8, 4, and 9), and order E (n = 49; Items 5, 9, 1, 3, 10, 4, 6, 8, 2, and 7). Orders D and E were used because they differed from orders A, B, and C used previously. The forms were distributed in randomized order, so that people sitting next to each other were not necessarily answering the same questions.

We are aware that the cohorts in Study III differ from each other in terms of knowledge of statistics and that both differ from the cohorts in Study I and Study II. All four lectures in the three studies, however, covered content that had not been taught to these cohorts before and were therefore of a rather introductory level. Furthermore, administering an instrument in different cohorts potentially increases the variability of responses and allows the stability of a factor solution to be examined; if a factor solution is consistent across datasets, this is an indicator of the stability of the solution.

Results

Table 6 shows the factor loadings of the ten items and the correlations of the two residual covariance paths in the lecture on logistic regression.

Table 6 Factor loadings for each of the ten recently developed items administered in the lecture on logistic regression

The residual covariance that had been statistically significant in Study II was not statistically significant in the lecture on logistic regression, whereas the other residual covariance had a moderate coefficient and was statistically significant.

The three factors were significantly correlated: the correlation between IL and EL was .61 (p < .001), the correlation between IL and GL was –.36 (p < .001), and the correlation between EL and GL was –.56 (p < .001). The analysis yielded χ²(30) = 35.036, p = .24, CFI = .995, TLI = .992, RMSEA = .035. Table 7 contains the factor loadings of the ten items and the correlations of the two residual covariance paths in the lecture on null hypothesis significance testing.

Table 7 Factor loadings for each of the ten recently developed items administered in the lecture on null hypothesis significance testing

Both residual covariance paths were close to zero and not statistically significant in the lecture on null hypothesis significance testing. Furthermore, only IL and EL were significantly correlated: the correlation between IL and EL was .25 (p = .007), the correlation between IL and GL was .04 (p = .65), and the correlation between EL and GL was –.11 (p = .24). These results yielded χ²(30) = 30.298, p = .45, CFI = 1.000, TLI = .999, RMSEA = .008. Table 8 shows the R² values for each of the ten items in the final model and the Cronbach’s alpha values per scale for the lectures in Studies II and III.

Table 8 R² values for each of the ten items in the final model, along with Cronbach’s alpha values, per scale in Study II and Study III

The lowest R² value was .42 in Study II (Item 6, which appears to be an indicator of EL), which indicates that every item has a sufficient amount of variance in common with the other items in the questionnaire.

Study IV: Experiment

Method

A total of 58 university freshmen who were about to enter a course in basic inferential statistics participated in a randomized experiment in which two groups studied a problem on conditional and joint probabilities in counterbalanced order. Prior knowledge of conditional and joint probabilities was assessed before the study phase, and a posttest on conditional and joint probabilities was administered immediately after the study phase.

The students had a stake in the experiment, as the content covered in the experiment would form the content of the first week of their upcoming statistics course. The students were informed that they would participate in a short experiment and that this experiment would be followed by a one-hour lecture in which the content covered in the experiment (conditional and joint probabilities) would be explained. Participation in the experiment lasted 45 min, and the subsequent lecture lasted 60 min.

In the lecture, conditional and joint probabilities, as well as frequent misconceptions about these topics, were discussed by a statistics teacher. The lecture was interactive: not only did the lecturer explain the concepts of conditional and joint probability, but the lecturer also encouraged students in the audience who knew the answer to the problem presented on the screen to explain their reasoning to their peers. After the lecture, students were debriefed about the setup of the experiment. Finally, the lecture slides, as well as the correct calculations and answers for all items in the prior knowledge test and posttest, were provided to the students, and students were allowed to stay in touch with the lecturer via e-mail to ask questions about the content or the provided materials.

From an ethical perspective, we wanted to avoid potential disadvantage for individual students arising from their assignment to a specific treatment order condition. Through an additional lecture for all participating students together, we expected to compensate for unequal learning outcomes resulting from the experiment. From a motivational perspective, we expected that providing students with feedback on their performance in (as well as after) such a lecture would stimulate them to take the experiment seriously, which could reduce noise in their responses to the various items.

At the very start of the meeting, all students completed the prior knowledge test on conditional and joint probabilities that is presented in Appendix 3.

To reduce guessing behavior, multiple-choice items were avoided and open-answer questions were used. Students had to calculate a conditional probability in the first question and a joint probability in the second question. As expected, both questions were of a sufficient difficulty level, in that they did not lead to extremely low proportions of correct responses: the first question yielded 15 correct responses (about 26% of the sample) and the second question yielded 31 correct responses (about 53% of the sample). At the end of the prior knowledge test, students completed the same questionnaire as presented in Appendix 1.

Next, students were assigned randomly to either of two treatment order conditions. In both conditions, students were presented the same problem on conditional and joint probabilities in two modes: in an explanation of six lines text, and in formula notation. In treatment order condition TF, students first studied the text explanation (T) and then the formula explanation (F), and in condition FT, the order was the other way around. The two presentation formats—text and formula—are presented in Appendix 4.

As expected, students reported that they were not familiar with the specific notation of conditional probabilities like P(man | psychology). In both treatment order conditions, students completed the same questionnaire as after the prior knowledge test, once after each study format. The two formats were not presented simultaneously; students received the two formats in counterbalanced order, and which format they received first depended on the treatment order condition.

To assess learning outcomes, a five-item posttest on conditional and joint probabilities was administered. The items were similar to the questions in the prior knowledge test and resembled the problem studied in the two formats, but were more difficult, to avoid potential ceiling effects for some items. The number of correct responses per item varied from 16 respondents (about 31 % of the sample) to 32 respondents (about 55 % of the sample). The average number of correctly answered items was 1.97, and Cronbach’s alpha of the five-item scale was .79. Having completed the five-item posttest, students completed the same questionnaire once more, as they had after the prior knowledge test and after the two study formats. Thus, we had four measurements of all the CL-related items per participating student. Completed questionnaires were checked for missing responses right away, which confirmed that all participants responded to all the items in the questionnaire. Likewise, no missing responses were found on the prior knowledge test and posttest.

Results

The reliability analysis revealed that Items 1, 2, and 3 form a homogeneous scale, and when we added Ayres’s (2006) item for IL, the Cronbach’s alpha of the scale remained more or less the same. Furthermore, Items 4, 5, and 6 form a scale for which Cronbach’s alpha decreased considerably in three of the four measurements when Cierniak et al.’s (2009) item for EL was added. Similarly, Items 7, 8, 9, and 10 form a homogeneous scale for which Cronbach’s alpha decreased considerably when Salomon’s (1984) item for GL was added. Finally, Paas’s (1992) item for CL appears to be correlated only with the items that aim to measure IL, and adding Paas’s item to the scale comprising Items 1, 2, and 3 and Ayres’s item for IL did not lead to remarkable changes in Cronbach’s alpha. These findings are presented in Table 9 for the four time points (i.e., after the prior knowledge test, after the text format, after the formula format, and after the posttest), respectively.

Table 9 Cronbach’s alphas of three scales in Study IV

Table 10 shows the means and standard deviations for each of the three scales of Items 1–10 and for the four 9-point scales at each of the four time points, per treatment order condition (i.e., TF and FT).

Table 10 Mean (and SD) for each of the three scales of Items 1–10 and for the four 9-point scales, per treatment order condition in Study IV

The somewhat lower Cronbach’s alpha value for the scale of Items 4, 5, and 6 after the prior knowledge test and after the posttest may be a consequence of restriction-of-range effects. After both treatment formats, there was more variation in scores on this scale, and the Cronbach’s alpha values of the scale were within the expected range. As expected, the average score on this scale was highest after the formula format in treatment order condition FT, in which students were confronted with the formula format before they received the text format.

Linear contrast analysis for the effect of prior knowledge (number of items correct: 0, 1, or 2) on posttest performance (0–5) revealed a linear effect, F(1, 24) = 8.973, p < .01, η² = .134, and the deviation from linearity was not statistically significant, F(1, 7) = 2.76, p = .10, η² = .041. We therefore included prior knowledge as a linear predictor in our subsequent regression analysis of posttest performance. None of the CL-related scores obtained after the prior knowledge test, after the text format, or after the formula format contributed significantly to posttest performance. In Table 11, we present the results of an analysis of covariance (ANCOVA) model for posttest performance, using prior knowledge score, treatment order, and the average score after the posttest on the scale of Items 7, 8, 9, and 10 (the four items that are supposed to measure GL) as predictors. None of the other CL-related scales administered after the posttest contributed significantly to posttest performance, which makes sense because only GL activities should contribute to learning and result in better learning outcomes.
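As an illustration of how such an ANCOVA could be set up, the sketch below uses statsmodels with a hypothetical wide-format DataFrame; the column names (posttest, prior, order, gl_post) are placeholders, and the effect sizes it reports are partial eta-squared values, which need not coincide with the η² values reported above.

```python
# Minimal sketch of the Study IV posttest ANCOVA, assuming a hypothetical
# DataFrame with columns: posttest (0-5), prior (0-2), order ("TF"/"FT"),
# and gl_post (mean of Items 7-10 administered after the posttest).
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

study4 = pd.read_csv("study4.csv")  # hypothetical file

ancova = smf.ols("posttest ~ prior + C(order) + gl_post", data=study4).fit()
table = anova_lm(ancova, typ=2)  # Type II sums of squares

# Partial eta-squared per predictor: SS_effect / (SS_effect + SS_residual).
ss_resid = table.loc["Residual", "sum_sq"]
table["eta_sq_partial"] = table["sum_sq"] / (table["sum_sq"] + ss_resid)
table.loc["Residual", "eta_sq_partial"] = float("nan")
print(table.round(3))
```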

Table 11 ANCOVA model for posttest performance, using as covariates prior knowledge score, treatment order, and average score on the scale of Items 7–10 after the posttest in Study IV

In line with Hypothesis 9, a higher prior knowledge score was a statistically significant predictor of higher posttest performance. Furthermore, posttest performance was non-significantly worse in the TF condition, so Hypothesis 10 was not supported. Finally, there is limited evidence that higher scores on the scale of Items 7, 8, 9, and 10 (intended to measure GL) predict higher posttest performance (η² = .064). It is possible that students were still learning to a greater or lesser extent while completing the posttest.

For the effects of prior knowledge and experimental treatment on IL, EL, and GL, as measured by the scales of Items 1–10, mixed linear models with a Toeplitz covariance structure provided the best solution for the analysis.
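A sketch of how a comparable repeated-measures model could be fitted in Python is given below. Note that statsmodels' MixedLM does not offer a Toeplitz residual covariance structure (the structure used here, as available in, e.g., SPSS MIXED), so the sketch falls back on a random-intercept model as a simplified stand-in; the long-format DataFrame and its column names are hypothetical.

```python
# Simplified sketch of the repeated-measures analysis for IL, assuming a
# hypothetical long-format DataFrame with columns: student, time (1-4),
# prior (0-2), order ("TF"/"FT"), and il (mean of Items 1-3 at that time point).
import pandas as pd
import statsmodels.formula.api as smf

long_df = pd.read_csv("study4_long.csv")  # hypothetical file

il_model = smf.mixedlm(
    "il ~ prior + C(order) * C(time)",    # fixed effects
    data=long_df,
    groups=long_df["student"],            # repeated measurements nested in students
).fit(reml=True)
print(il_model.summary())
```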

Table 12 shows the outcomes of this model for average IL (i.e., Items 1, 2, and 3). In line with Hypothesis 8, the model presented in Table 12 indicates that more prior knowledge predicts lower IL. Furthermore, presenting the formula format before the text format appears to lower the IL experienced when studying the text format, but not the IL experienced when studying the formula format.

Table 12 Mixed linear model for IL in Study IV

Table 13 presents the outcomes of the model for average EL (i.e., Items 4, 5, and 6). Confirming Hypothesis 11, the model presented in Table 13 indicates that when the formula format is presented before the text format, EL is elevated significantly for the formula format.

Table 13 Mixed linear model for EL in Study IV

Finally, Table 14 presents the outcomes of the model for average GL (i.e., Items 7, 8, 9, and 10). The model presented in Table 14 indicates that the text format imposes significantly more GL when presented after the formula format. On the one hand, one may argue that the formula format confronted students with difficulties, leading them to engage more in GL activities when the textual explanation was subsequently provided. On the other hand, no significantly elevated posttest performance was detected.

Table 14 Mixed linear model for GL in Study IV

Discussion

In this section, findings and limitations are discussed for each of the four studies, and implications for future research are outlined.

Exploratory analysis

Although the sample size was small for a ten-item instrument, the principal component analysis in Study I provided preliminary support for Hypotheses 1, 2, and 3. Also, as one would expect, the components expected to measure EL and GL were negatively correlated, whereas the components expected to measure IL and GL had a correlation around zero. The relationship between IL and GL may not be linear: extremely low as well as extremely high levels of IL may lead to limited GL activity. On the one hand, if a learning task is too easy for a student, the explanations and instructions in the task may not contribute to actual learning on the part of that student. On the other hand, if a learning task is too complex for a particular student, the cognitive capacity available for GL activity may be very limited. Finally, the components expected to measure IL and EL had a moderately positive correlation.

Confirmatory support for a three-factor model

The fact that the items presented in Appendix 1 have different factor loadings than the previously developed scales for measuring the different types of load separately is interesting, but also hard to explain on the basis of the present data. Moreover, since no learning outcomes were measured after the lectures, these studies do not provide insight into how the various scales are related to learning outcomes. For this reason, we conducted the randomized experiment in Study IV (1) to examine how the different scales vary across two experimental conditions that we expected to have differential effects on IL, EL, and GL, and (2) to examine how the various scales are related to learning outcomes. Together, the results of Study II and Study III provide support for the three-component solution found in Study I.

The high item reliabilities (i.e., R² values), high Cronbach’s alpha values, and high fit indices (i.e., CFI and TLI) across the lectures in Studies I–III, together with the low RMSEA in two of the three confirmatory factor analyses, support our expectation that a three-factor structure underlies Items 1–10. It has been suggested that the concept of GL should be redefined as referring to the actual working memory resources devoted to dealing with IL rather than EL (Kalyuga, 2011; Sweller, 2010). Kalyuga suggested that “the dual intrinsic/extraneous framework is sufficient and non-redundant and makes boundaries of the theory transparent” (2011, p. 1). Contrary to EL and IL, GL “was added to the cognitive framework based on theoretical considerations rather than on specific empirical results that could not be explained without this concept” (Kalyuga, 2011, p. 1). The present findings suggest, however, that such a two-factor framework may not be sufficient; the three-factor solution is consistent across lectures.

On the use of different cohorts in Studies I, II, and III

We justified the use of different cohorts of students in the four lectures studied: if a factor solution is consistent across such varied datasets, this is an indicator of the stability of the solution. The reason that we chose two lectures instead of one in Study III was to have two independent lectures, in addition to the lecture in Study II, in which to test the hypothesized three-factor model. However, the use of different cohorts and different lecturers may introduce confounds, which may partly explain why the correlations between factor pairs and the residual covariances differ somewhat across lectures.

Cohort-related factors may form one source of confounding. PhD students—and to some extent also advanced bachelor students—are, more than university freshmen, aware of the importance of statistics in their later work.

Teaching style may form a second source of confounding: Whereas some lecturers emphasize conceptual understanding, others emphasize formulas and computations. In a lecture in which the focus is on conceptual understanding rather than on formulas, Item 9 may be a somewhat weaker indicator of GL. If the focus in a lecture is on formulas and conceptual understanding is of minor importance, Item 10 may be a somewhat weaker indicator of GL.

A third potential source of confounding in these studies was the subject matter. Whereas the lectures in Study I and Study II covered similar topics, the lectures in Study III were on different topics, which could have affected the measurement of the different types of load.

Future validation studies should administer this instrument in different lectures of a number of courses given by the same lecturers and for the same cohorts of students, repeatedly, to estimate the magnitude of student-related, teacher-related, and subject-related factors in item response and to examine the stability of the three-factor model across time.

Additional support for the three-factor solution in the experiment

The experiment in Study IV provides evidence for the validity of the three-factor solution underlying Items 1–10. First of all, as expected, higher prior knowledge predicted lower IL throughout the study (all four time points) and higher posttest performance. More knowledgeable learners have more elaborated knowledge structures in their long-term memory and are therefore expected to experience lower IL due to novelty of elements and element interactivity in a task (Kalyuga, 2011; Van Merriënboer & Sweller, 2005).

Secondly, as expected, EL during learning was higher when the problem to be studied was presented first in a format learners were not familiar with (the formula format); however, learners appeared to engage more in GL activities if the problem was subsequently presented in a format they were familiar with (the text format). Also, the familiar format was reported to impose less IL when presented after the unfamiliar format. Although the students who received the unfamiliar (formula) format first complained that it was difficult and reported higher levels of EL after that format, they subsequently reported lower levels of IL and higher levels of GL after the text format. These findings are difficult to explain and suggest that order effects may influence the IL experienced by a learner. A limitation of this study is that only one posttest was administered, after studying both formats, so we cannot determine to what extent each of the formats separately contributed to posttest performance. Future studies should include a test after each format instead of only after both formats. This may also provide more insight into why, in the present experiment, no negative effects of EL on learning performance were found. It is possible that the higher EL experienced by students who received the formula format first was compensated for by increased investment in GL activities during subsequent study of the text format.

Finally, there is limited evidence that higher scores on GL after the posttest predict higher posttest performance. New experiments, using larger sample sizes, are needed to further investigate this finding.

Question wording effects

More experimentation is also needed to examine, across a wide range of learning tasks and contexts, the correlations between the items presented in Appendix 2 and the three factors that underlie Items 1–10. Specific wording effects may play a role. For example, Paas’s (1992) item for CL directly asks how much effort learners invested in an activity; this “investment” wording is not used in any of the other items included. In addition, the question of “how difficult it is to learn with particular material” could refer to EL for some learners and to IL for others. New studies should examine qualitatively how exactly learners interpret these items across a range of tasks.

Implications and suggestions for future research

For the present set of studies, the statistics knowledge domain was chosen because it is a complex knowledge domain that is important in many professions and academic curricula, and potentially even in everyday contexts. As with the items developed by Paas (1992), Ayres (2006), Cierniak et al. (2009), and Salomon (1984), however, the intended applicability of Items 1–10 is not restricted to a particular knowledge domain. With minor adjustments (e.g., replacing the term “statistics” in some items), these items could be used in research in other complex knowledge domains.

Finally, studies combining the subjective measures presented in this article (including the four items developed by Paas, 1992; Ayres, 2006; Cierniak et al., 2009; and Salomon, 1984) with physiological measures such as eye tracking (Holmqvist et al., 2011; Van Gog & Scheiter, 2010) may lead to new insights into the convergence between physiological and subjective measures and into what these different types of measures actually measure. If both physiological and subjective measures tap the same constructs (in this context, IL, EL, and GL, and potentially overall CL as a function of these three types), one would expect high, positive correlations between these measures across educational settings. If such correlations are found, this may imply that using either type of measure is sufficient in educational studies. If other patterns of correlations are found, this opens doors for new research on why and under what circumstances the different types of measures diverge.