Reliability and agreement of student ratings of the classroom environment: A reanalysis of TIMSS data

Original Paper, published in Learning Environments Research

Abstract

In educational research, characteristics of the learning environment are generally assessed by asking students to evaluate features of their lessons. The student ratings produced by this simple and efficient research strategy can be analysed from two different perspectives. At the individual level, they represent the individual student’s perception of the learning environment. Scores aggregated to the classroom level reflect perceptions of the shared learning environment, corrected for individual idiosyncrasies. This second approach is often pursued in studies of teaching quality and effectiveness, where student-level ratings are aggregated to the class level to obtain general information about the learning environment. Although this strategy is widely applied in educational research, neither the reliability of aggregated student ratings nor the within-group agreement between the students in a class has received much investigation. The present study introduces and discusses procedures developed in organisational psychology for assessing the reliability of, and agreement among, students’ ratings of their instruction. The application of the proposed indexes is demonstrated by a reanalysis of student ratings of mathematics instruction obtained in the Third International Mathematics and Science Study (N = 2,064 students in 100 classes).
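
To make the aggregation logic concrete, the following sketch estimates the two indices most commonly used for this purpose: the ICC(1), the reliability of a single student’s rating, and the ICC(2), the reliability of the class-mean rating, both computed from one-way random-effects ANOVA mean squares (cf. Bliese, 2000; Shrout & Fleiss, 1979). This is a minimal sketch on simulated data; the class size, variance components, and all variable names are illustrative assumptions, not values from the TIMSS reanalysis.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative data: 100 classes with 20 students each. The variance
# components (0.25 between classes, 1.0 within) are assumptions for
# this sketch, not estimates from the study.
n_classes, k = 100, 20
class_effect = rng.normal(0.0, 0.5, size=n_classes)        # shared classroom climate
ratings = class_effect[:, None] + rng.normal(0.0, 1.0, size=(n_classes, k))

# One-way random-effects ANOVA mean squares.
grand_mean = ratings.mean()
class_means = ratings.mean(axis=1)
ms_between = k * ((class_means - grand_mean) ** 2).sum() / (n_classes - 1)
ms_within = ((ratings - class_means[:, None]) ** 2).sum() / (n_classes * (k - 1))

# ICC(1): reliability of a single student's rating, i.e. the share of
# total rating variance attributable to class membership.
icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# ICC(2): reliability of the aggregated (class-mean) rating; algebraically
# equal to the Spearman-Brown step-up of ICC(1) with k raters per class.
icc2 = (ms_between - ms_within) / ms_between

print(f"ICC(1) = {icc1:.3f}  ICC(2) = {icc2:.3f}")
```

With these assumed variance components, a single rating is a fairly noisy indicator (ICC(1) ≈ .20), while the mean of 20 ratings is far more dependable (ICC(2) ≈ .83), which is precisely the rationale for aggregating student ratings to the class level.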

Notes

  1. The literature on observer agreement also refers to the ICC(1) as the unadjusted intraclass correlation, because mean differences between raters affect the error variance, meaning that absolute agreement between the raters is required (McGraw & Wong, 1996). In organisational psychology (e.g. Bliese, 2000; Cohen et al., 2001), in contrast, the ICC(1) is seen as a measure of reliability. This is the approach taken in the present article.

  2. In their original article, James et al. (1984) presented the r_WG as a measure of interrater reliability among ratings of a single target. Following critical analysis by Schmidt and Hunter (1989), it became standard practice to treat the r_WG as a measure of interrater agreement instead (James et al., 1993; Kozlowski & Hattrup, 1992). Schmidt and Hunter’s main criticism was that, in psychometric test theory, the concept of reliability presupposes variance between true values, and that this variance exists only when more than one stimulus is rated. (A computational sketch of the r_WG follows these notes.)
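
To complement these notes, the sketch below computes the single-item r_WG of James et al. (1984): one minus the ratio of the observed within-class variance to the variance expected if students responded at random over the scale points (the uniform null distribution). The function name and example ratings are illustrative assumptions, not data from the study.

```python
import numpy as np

def rwg(ratings, n_options):
    """Single-item r_WG (James, Demaree, & Wolf, 1984): 1 minus the ratio of
    the observed within-group variance to the variance expected if raters
    responded at random over `n_options` scale points (uniform null)."""
    observed = np.var(ratings, ddof=1)
    expected = (n_options ** 2 - 1) / 12.0   # variance of a discrete uniform
    return 1.0 - observed / expected

# Hypothetical class of eight students rating one item on a 1-4 scale.
print(f"r_WG = {rwg([3, 3, 4, 3, 2, 3, 4, 3], n_options=4):.2f}")
```

By convention, negative values (observed variance exceeding the null variance) are truncated to zero, and the multi-item variant r_WG(J) pools J parallel items with a Spearman-Brown-type adjustment (LeBreton et al., 2005).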

References

  • Anderson, C. S. (1982). The search for school climate: A review of the research. Review of Educational Research, 52, 368–420.

  • Baumert, J., Lehmann, R. H., Lehrke, M., Schmitz, B., Clausen, M., Hosenfeld, I., et al. (1997). TIMSS: Mathematisch-Naturwissenschaftlicher Unterricht im internationalen Vergleich [TIMSS: Mathematics and science instruction in an international comparison]. Opladen, Germany: Leske and Budrich.

  • Beaton, A. E., Mullis, I. V. S., Martin, M. O., Gonzales, E. J., Kelly, D. L., & Smith, T. A. (1996). Mathematics achievement in the middle school years: IEA’s Third International Mathematics and Science Study. Chestnut Hill, MA: Boston College.

  • Bliese, P. D. (1998). Group size, ICC values, and group-level correlations: A simulation. Organizational Research Methods, 1, 355–373.

  • Bliese, P. D. (2000). Within-group agreement, non-independence, and reliability: Implications for data aggregation and analysis. In K. J. Klein & S. W. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations (pp. 349–381). San Francisco: Jossey-Bass.

  • Brown, R. D., & Hauenstein, N. M. A. (2005). Interrater agreement reconsidered: An alternative to the r_WG indices. Organizational Research Methods, 8, 165–184.

  • Burke, M. J., & Dunlap, W. P. (2002). Estimating interrater agreement with the average deviation index: A user’s guide. Organizational Research Methods, 5, 159–172.

  • Burke, M. J., Finkelstein, L. M., & Dusig, M. S. (1999). On average deviation indices for estimating interrater agreement. Organizational Research Methods, 2, 49–68.

  • Chan, D. (1998). Functional relations among constructs in the same content domain at different levels of analysis: A typology of composition models. Journal of Applied Psychology, 83, 234–246.

  • Clausen, M. (2002). Unterrichtsqualität: Eine Frage der Perspektive? [Quality of instruction: A matter of perspective?]. Münster, Germany: Waxmann.

  • Cohen, A., Doveh, E., & Eick, U. (2001). Statistical properties of the r_WG(J) index of agreement. Psychological Methods, 6, 297–310.

  • Dunlap, W. P., Burke, M. J., & Smith-Crowe, K. (2003). Accurate tests of statistical significance for r_WG and average deviation interrater agreement indexes. Journal of Applied Psychology, 88, 356–362.

  • Finn, R. H. (1970). A note on estimating the reliability of categorical data. Educational and Psychological Measurement, 30, 71–76.

  • Fraser, B. J. (1991). Two decades of classroom environment research. In B. J. Fraser & H. J. Walberg (Eds.), Educational environments: Evaluation, antecedents and consequences (pp. 3–27). Elmsford, NY: Pergamon.

  • Grawitch, M. J., & Munz, D. C. (2004). Are your data nonindependent? A practical guide to evaluating nonindependence and within-group agreement. Understanding Statistics, 3, 231–257.

  • Griffith, J. (2002). Is quality/effectiveness an empirically demonstrable school attribute? Statistical aids for determining appropriate levels of analysis. School Effectiveness and School Improvement, 13, 91–122.

  • Gruehn, S. (2000). Unterricht und schulisches Lernen: Schüler als Quellen der Unterrichtsbeschreibung [Instruction and learning in school: Students as sources of information]. Münster, Germany: Waxmann.

  • James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 69, 85–98.

  • James, L. R., Demaree, R. G., & Wolf, G. (1993). r_WG: An assessment of within-group interrater agreement. Journal of Applied Psychology, 78, 306–309.

  • Kane, M. T., & Brennan, R. L. (1977). The generalizability of class means. Review of Educational Research, 47, 267–292.

  • Klein, K. J., Conn, A. B., Smith, B. D., & Sorra, J. S. (2001). Is everyone in agreement? An exploration of within-group agreement in employee perceptions of the work environment. Journal of Applied Psychology, 86, 3–16.

  • Kozlowski, S. W., & Hattrup, K. (1992). A disagreement about within-group agreement: Disentangling issues of consistency versus consensus. Journal of Applied Psychology, 77, 161–167.

  • Kunter, M. (2005). Multiple Ziele im Mathematikunterricht [Multiple objectives in mathematics instruction]. Münster, Germany: Waxmann.

  • Kunter, M., Baumert, J., & Köller, O. (2005). Effective classroom management and the development of subject-related interest. Manuscript submitted for publication.

  • LeBreton, J. M., James, L. R., & Lindell, M. K. (2005). Recent issues regarding r_WG, r*_WG, r_WG(J), and r*_WG(J). Organizational Research Methods, 8, 128–138.

  • Lindell, M. K., & Brandt, C. J. (1997). Measuring interrater agreement for ratings of a single target. Applied Psychological Measurement, 21, 271–278.

  • Lindell, M. K., & Brandt, C. J. (1999). Assessing interrater agreement on the job relevance of a test: A comparison of CVI, T, r_WG(J), and r*_WG(J) indexes. Journal of Applied Psychology, 84, 640–647.

  • Lindell, M. K., & Brandt, C. J. (2000). Climate quality and climate consensus as mediators of the relationship between organizational antecedents and outcomes. Journal of Applied Psychology, 85, 331–348.

  • Lindell, M. K., Brandt, C. J., & Whitney, D. J. (1999). A revised index of interrater agreement for multi-item ratings of a single target. Applied Psychological Measurement, 23, 127–135.

  • Lüdtke, O., Köller, O., Marsh, H. W., & Trautwein, U. (2005). Teacher frame of reference and the big-fish-little-pond effect. Contemporary Educational Psychology, 30, 263–285.

  • Lüdtke, O., Robitzsch, A., & Köller, O. (2002). Statistische Artefakte bei Kontexteffekten in der pädagogisch-psychologischen Forschung [Statistical artifacts in educational studies on context effects]. Zeitschrift für Pädagogische Psychologie, 16, 217–231.

  • Lüdtke, O., Trautwein, U., Kunter, M., & Baumert, J. (2006). Analyse von Lernumwelten: Ansätze zur Bestimmung der Reliabilität und Übereinstimmung von Schülerwahrnehmungen [Analysis of learning environments: Approaches to determining the reliability and agreement of student ratings]. Zeitschrift für Pädagogische Psychologie, 20, 85–96.

  • McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30–46.

  • Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models (2nd ed.). Thousand Oaks, CA: Sage.

  • Schmidt, F. L., & Hunter, J. E. (1989). Interrater reliability coefficients cannot be computed when only one stimulus is rated. Journal of Applied Psychology, 75, 322–327.

  • Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.

  • Snijders, T. A. B., & Bosker, R. J. (1999). Multilevel analysis: An introduction to basic and advanced multilevel modeling. London: Sage.

  • Urdan, T., Midgley, C., & Anderman, E. M. (1998). The role of classroom goal structure in students’ use of self-handicapping strategies. American Educational Research Journal, 35, 101–122.

  • Wong, A. F., Young, D. J., & Fraser, B. J. (1997). A multilevel analysis of learning environments and student attitudes. Educational Psychology, 17, 449–468.

Author information

Correspondence to Oliver Lüdtke.

Cite this article

Lüdtke, O., Trautwein, U., Kunter, M. et al. Reliability and agreement of student ratings of the classroom environment: A reanalysis of TIMSS data. Learning Environ Res 9, 215–230 (2006). https://doi.org/10.1007/s10984-006-9014-8
