Reliability and agreement of student ratings of the classroom environment: A reanalysis of TIMSS data

Original Paper, published in Learning Environments Research

Abstract

In educational research, characteristics of the learning environment are generally assessed by asking students to evaluate features of their lessons. The student ratings produced by this simple and efficient research strategy can be analysed from two different perspectives. At the individual level, they represent the individual student’s perception of the learning environment. Scores aggregated to the classroom level reflect perceptions of the shared learning environment, corrected for individual idiosyncrasies. This second approach is often pursued in studies of teaching quality and effectiveness, where student-level ratings are aggregated to the class level to obtain general information about the learning environment. Although this strategy is widely applied in educational research, neither the reliability of aggregated student ratings nor the within-group agreement between the students in a class has received much investigation. The present study introduces and discusses procedures developed in organisational psychology for assessing the reliability of, and agreement among, students’ ratings of their instruction. The application of the proposed indexes is demonstrated by a reanalysis of student ratings of mathematics instruction obtained in the Third International Mathematics and Science Study (N = 2,064 students in 100 classes).
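
To make the aggregation logic concrete, the following sketch estimates the two indices most commonly used for this purpose: the ICC(1), the reliability of a single student’s rating, and the ICC(2), the reliability of the class-mean rating, both computed from one-way random-effects ANOVA mean squares (cf. Bliese, 2000; Shrout & Fleiss, 1979). This is a minimal sketch on simulated data; the class size, variance components, and all variable names are illustrative assumptions, not values from the TIMSS reanalysis.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative data: 100 classes with 20 students each. The variance
# components (0.25 between classes, 1.0 within) are assumptions for
# this sketch, not estimates from the study.
n_classes, k = 100, 20
class_effect = rng.normal(0.0, 0.5, size=n_classes)        # shared classroom climate
ratings = class_effect[:, None] + rng.normal(0.0, 1.0, size=(n_classes, k))

# One-way random-effects ANOVA mean squares.
grand_mean = ratings.mean()
class_means = ratings.mean(axis=1)
ms_between = k * ((class_means - grand_mean) ** 2).sum() / (n_classes - 1)
ms_within = ((ratings - class_means[:, None]) ** 2).sum() / (n_classes * (k - 1))

# ICC(1): reliability of a single student's rating, i.e. the share of
# total rating variance attributable to class membership.
icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# ICC(2): reliability of the aggregated (class-mean) rating; algebraically
# equal to the Spearman-Brown step-up of ICC(1) with k raters per class.
icc2 = (ms_between - ms_within) / ms_between

print(f"ICC(1) = {icc1:.3f}  ICC(2) = {icc2:.3f}")
```

With these assumed variance components, a single rating is a fairly noisy indicator (ICC(1) ≈ .20), while the mean of 20 ratings is far more dependable (ICC(2) ≈ .83), which is precisely the rationale for aggregating student ratings to the class level.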

Notes

  1. The literature on observer agreement also refers to the ICC(1) as the unadjusted intraclass correlation, because mean differences between raters affect the error variance, meaning that absolute agreement between the raters is required (McGraw & Wong, 1996). In organisational psychology (e.g. Bliese, 2000; Cohen et al., 2001), in contrast, the ICC(1) is seen as a measure of reliability. This is the approach taken in the present article.

  2. In their original article, James et al. (1984) presented the r_WG as a measure of interrater reliability among ratings of a single target. Following critical analysis by Schmidt and Hunter (1989), it became standard practice to treat the r_WG as a measure of interrater agreement instead (James et al., 1993; Kozlowski & Hattrup, 1992). Schmidt and Hunter’s main criticism was that, in psychometric test theory, the concept of reliability presupposes variance between true values, and that this variance exists only when more than one stimulus is rated. (A computational sketch of the r_WG follows these notes.)
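
To complement these notes, the sketch below computes the single-item r_WG of James et al. (1984): one minus the ratio of the observed within-class variance to the variance expected if students responded at random over the scale points (the uniform null distribution). The function name and example ratings are illustrative assumptions, not data from the study.

```python
import numpy as np

def rwg(ratings, n_options):
    """Single-item r_WG (James, Demaree, & Wolf, 1984): 1 minus the ratio of
    the observed within-group variance to the variance expected if raters
    responded at random over `n_options` scale points (uniform null)."""
    observed = np.var(ratings, ddof=1)
    expected = (n_options ** 2 - 1) / 12.0   # variance of a discrete uniform
    return 1.0 - observed / expected

# Hypothetical class of eight students rating one item on a 1-4 scale.
print(f"r_WG = {rwg([3, 3, 4, 3, 2, 3, 4, 3], n_options=4):.2f}")
```

By convention, negative values (observed variance exceeding the null variance) are truncated to zero, and the multi-item variant r_WG(J) pools J parallel items with a Spearman-Brown-type adjustment (LeBreton et al., 2005).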

References

  • Anderson, C. S. (1982). The search for school climate: A review of the research. Review of Educational Research, 52, 368–420.

  • Baumert, J., Lehmann, R. H., Lehrke, M., Schmitz, B., Clausen, M., Hosenfeld, I., et al. (1997). TIMSS: Mathematisch-Naturwissenschaftlicher Unterricht im internationalen Vergleich [TIMSS: Mathematics and science instruction in an international comparison]. Opladen, Germany: Leske and Budrich.

  • Beaton, A. E., Mullis, I. V. S., Martin, M. O., Gonzales, E. J., Kelly, D. L., & Smith, T. A. (1996). Mathematics achievement in the middle school years: IEA’s Third International Mathematics and Science Study. Chestnut Hill, MA: Boston College.

  • Bliese, P. D. (1998). Group size, ICC values, and group-level correlations: A simulation. Organizational Research Methods, 1, 355–373.

  • Bliese, P. D. (2000). Within-group agreement, non-independence, and reliability: Implications for data aggregation and analysis. In K. J. Klein & S. W. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations (pp. 349–381). San Francisco: Jossey-Bass.

  • Brown, R. D., & Hauenstein, N. M. A. (2005). Interrater agreement reconsidered: An alternative to the r_WG indices. Organizational Research Methods, 8, 165–184.

  • Burke, M. J., & Dunlap, W. P. (2002). Estimating interrater agreement with the average deviation index: A user’s guide. Organizational Research Methods, 5, 159–172.

  • Burke, M. J., Finkelstein, L. M., & Dusig, M. S. (1999). On average deviation indices for estimating interrater agreement. Organizational Research Methods, 2, 49–68.

  • Chan, D. (1998). Functional relations among constructs in the same content domain at different levels of analysis: A typology of composition models. Journal of Applied Psychology, 83, 234–246.

  • Clausen, M. (2002). Unterrichtsqualität: Eine Frage der Perspektive? [Quality of instruction: A matter of perspective?]. Münster, Germany: Waxmann.

  • Cohen, A., Doveh, E., & Eick, U. (2001). Statistical properties of the r_WG(J) index of agreement. Psychological Methods, 6, 297–310.

  • Dunlap, W. P., Burke, M. J., & Smith-Crowe, K. (2003). Accurate tests of statistical significance for r_WG and average deviation interrater agreement indexes. Journal of Applied Psychology, 88, 356–362.

  • Finn, R. H. (1970). A note on estimating the reliability of categorical data. Educational and Psychological Measurement, 30, 71–76.

  • Fraser, B. J. (1991). Two decades of classroom environment research. In B. J. Fraser & H. J. Walberg (Eds.), Educational environments: Evaluation, antecedents and consequences (pp. 3–27). Elmsford, NY: Pergamon.

  • Grawitch, M. J., & Munz, D. C. (2004). Are your data nonindependent? A practical guide to evaluating nonindependence and within-group agreement. Understanding Statistics, 3, 231–257.

  • Griffith, J. (2002). Is quality/effectiveness an empirically demonstrable school attribute? Statistical aids for determining appropriate levels of analysis. School Effectiveness and School Improvement, 13, 91–122.

  • Gruehn, S. (2000). Unterricht und schulisches Lernen: Schüler als Quellen der Unterrichtsbeschreibung [Instruction and learning in school: Students as sources of information]. Münster, Germany: Waxmann.

  • James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 69, 85–98.

  • James, L. R., Demaree, R. G., & Wolf, G. (1993). r_WG: An assessment of within-group interrater agreement. Journal of Applied Psychology, 78, 306–309.

  • Kane, M. T., & Brennan, R. L. (1977). The generalizability of class means. Review of Educational Research, 47, 267–292.

  • Klein, K. J., Conn, A. B., Smith, B. D., & Sorra, J. S. (2001). Is everyone in agreement? An exploration of within-group agreement in employee perceptions of the work environment. Journal of Applied Psychology, 86, 3–16.

  • Kozlowski, S. W., & Hattrup, K. (1992). A disagreement about within-group agreement: Disentangling issues of consistency versus consensus. Journal of Applied Psychology, 77, 161–167.

  • Kunter, M. (2005). Multiple Ziele im Mathematikunterricht [Multiple objectives in mathematics instruction]. Münster, Germany: Waxmann.

  • Kunter, M., Baumert, J., & Köller, O. (2005). Effective classroom management and the development of subject-related interest. Manuscript submitted for publication.

  • LeBreton, J. M., James, L. R., & Lindell, M. K. (2005). Recent issues regarding r_WG, r*_WG, r_WG(J), and r*_WG(J). Organizational Research Methods, 8, 128–138.

  • Lindell, M. K., & Brandt, C. J. (1997). Measuring interrater agreement for ratings of a single target. Applied Psychological Measurement, 21, 271–278.

  • Lindell, M. K., & Brandt, C. J. (1999). Assessing interrater agreement on the job relevance of a test: A comparison of CVI, T, r_WG(J), and r*_WG(J) indexes. Journal of Applied Psychology, 84, 640–647.

  • Lindell, M. K., & Brandt, C. J. (2000). Climate quality and climate consensus as mediators of the relationship between organizational antecedents and outcomes. Journal of Applied Psychology, 85, 331–348.

  • Lindell, M. K., Brandt, C. J., & Whitney, D. J. (1999). A revised index of interrater agreement for multi-item ratings of a single target. Applied Psychological Measurement, 23, 127–135.

  • Lüdtke, O., Köller, O., Marsh, H. W., & Trautwein, U. (2005). Teacher frame of reference and the big-fish-little-pond effect. Contemporary Educational Psychology, 30, 263–285.

  • Lüdtke, O., Robitzsch, A., & Köller, O. (2002). Statistische Artefakte bei Kontexteffekten in der pädagogisch-psychologischen Forschung [Statistical artifacts in educational studies on context effects]. Zeitschrift für Pädagogische Psychologie, 16, 217–231.

  • Lüdtke, O., Trautwein, U., Kunter, M., & Baumert, J. (2006). Analyse von Lernumwelten: Ansätze zur Bestimmung der Reliabilität und Übereinstimmung von Schülerwahrnehmungen [Analysis of learning environments: Approaches to determining the reliability and agreement of student ratings]. Zeitschrift für Pädagogische Psychologie, 20, 85–96.

  • McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30–46.

  • Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models (2nd ed.). Thousand Oaks, CA: Sage.

  • Schmidt, F. L., & Hunter, J. E. (1989). Interrater reliability coefficients cannot be computed when only one stimulus is rated. Journal of Applied Psychology, 75, 322–327.

  • Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.

  • Snijders, T. A. B., & Bosker, R. J. (1999). Multilevel analysis: An introduction to basic and advanced multilevel modeling. London: Sage.

  • Urdan, T., Midgley, C., & Anderman, E. M. (1998). The role of classroom goal structure in students’ use of self-handicapping strategies. American Educational Research Journal, 35, 101–122.

  • Wong, A. F., Young, D. J., & Fraser, B. J. (1997). A multilevel analysis of learning environments and student attitudes. Educational Psychology, 17, 449–468.

Author information

Correspondence to Oliver Lüdtke.

Cite this article

Lüdtke, O., Trautwein, U., Kunter, M. et al. Reliability and agreement of student ratings of the classroom environment: A reanalysis of TIMSS data. Learning Environ Res 9, 215–230 (2006). https://doi.org/10.1007/s10984-006-9014-8
