The measurement properties of the Teaching Strategies GOLD® assessment system

doi:10.1016/j.ecresq.2015.05.004

Early Childhood Research Quarterly

Volume 33, 4th Quarter 2015, Pages 49-63

Highlights

• Evidence for the reliability and validity of the Teaching Strategies GOLD® scale scores is reported.
• Results of confirmatory factor analysis, classical and modern indexes of reliability, and inter-rater agreement statistics are presented.
• Moderate associations with the scores from an external direct assessment measure are presented.

Abstract

The Teaching Strategies GOLD® assessment system (GOLD®) is a teacher-rated child observation tool (an authentic performance assessment) designed to measure the ongoing development and learning progress of children from birth through kindergarten across several domains: social–emotional, physical, language, cognitive, literacy, mathematics, and English language acquisition. This article explores evidence for the reliability and validity of the information provided by GOLD® using two national samples (n₁ = 10,963, n₂ = 20,970). Support for the reliability and validity of scale scores based on teacher ratings is reported, including confirmatory factor analysis, classical and modern indexes of reliability, and inter-rater reliability statistics. In a separate study, concurrent validity was explored using a different sample of 3- and 4-year-olds (n = 1241). Accounting for teacher ratings and clustering effects, moderate associations were found between GOLD® scale scores and a direct assessment measure. Implications for teachers using the measure in the early childhood classroom and for future research are discussed.

Introduction

Appropriate child assessment plays a vital role in high-quality early care and education programs (NAEYC & NAECS/SDE, 2003). Assessment measures that are well designed, implemented effectively, and interpreted and used appropriately can inform teaching and contribute to better child outcomes (Snow & Van Hemel, 2008). On the other hand, inadequately designed measures or those that do not consider the rapidly changing demographics in early education programs (Barnett et al., 2010, Copple and Bredekamp, 2009) can result in children being mislabeled or not receiving appropriate support and optimal learning experiences. To ensure that all children regardless of culture, language, or disabilities are assessed fairly, scientifically-informed assessment measures are needed (Hirsh-Pasek et al., 2005, Snow and Van Hemel, 2008, Qi and Marley, 2009). In addition, assessment measures should be validated using samples that are representative of the diversity of children with whom the measure will be used (Bordignon and Lam, 2004, Pena and Halle, 2011, Snow and Van Hemel, 2008). The present study adds to the limited research on one type of child assessment, authentic assessment (Hallam, Grisham-Brown, Gao, & Brookshire, 2007). More specifically, it explores the reliability and validity of the recently developed Teaching Strategies GOLD® (GOLD®) (Heroman, Burts, Berke, & Bickart, 2010) and its use with children birth through kindergarten.

National reports and public policy statements support the call for better accountability and the need to decrease disparities among subgroups of children. Simultaneously they issue cautions and identify standards related to the assessment process. NAEYC and NAECS/SDE indicate that assessment measures should be developmentally appropriate, educationally important, and linguistically and culturally responsive (Copple and Bredekamp, 2009, NAEYC and NAECS/SDE, 2003, NAEYC and NAECS/SDE, 2005). Assessment evidence should be gathered over time, from multiple sources including families, in naturally occurring settings so they accurately reflect and support children's development and learning. Young children's development and learning is uneven and changes rapidly. Although distinct developmental and learning domains are clearly identified in the literature, these areas are overlapping and interrelated (Scott-Little et al., 2006, Shonkoff and Phillips, 2000, Snow and Van Hemel, 2008). Each area influences and is influenced by other areas (Berk, 2009, Copple and Bredekamp, 2009). During the later preschool and kindergarten years, the more integrated and global abilities of infants and toddlers (Moreno & Klute, 2011) are replaced with those that are more differentiated (Kim & Smith, 2010). Various factors affect children's development and learning such as individual differences, culture, and the environment (Copple and Bredekamp, 2009, Hindman et al., 2010, Hyson, 2008, Shonkoff and Phillips, 2000). Because the child's immediate environment exerts tremendous influence on development (Bronfenbrenner & Morris, 2006), it can be limiting to assess young children without considering important contextual settings. Close communication with families provides teachers with useful information about the child and helps them bridge the developmental contexts (Bronfenbrenner, 1986, Dewey, 1897, Mooney, 2000).

In addition to the accepted indices of adequate psychometrics (Snow & Van Hemel, 2008), assessment instruments should also exhibit “empirical validity” (Hirsh-Pasek et al., 2005) so that what they measure is based on current empirical findings in the developmental domains with a focus on predictors of success. Instruments should emphasize the processes (how) of learning rather than just the products (what) of learning. Further, Messick (1995) has argued that the body of research supporting the use of an assessment should include evidence of associations between the information provided by the measure and the information provided by external measures designed to measure similar constructs. Assessment instruments should also be minimally intrusive for teachers and children (McDermott et al., 2009).

Many authorities, including those in early childhood intervention, support a type of informal assessment sometimes referred to as “authentic assessment” (Ackerman and Coley, 2012, Bagnato, 2005, Bagnato et al., 2011, Keilty et al., 2009, Macy and Bagnato, 2010). Although there is no universal definition of the term (Frey, Schmitt, & Allen, 2012), in early childhood, authentic assessment is commonly used to mean a type of performance-based assessment of a child's real-life practical and intellectual challenges that occur within typical daily contexts (McAfee & Leong, 2011). Teachers collect information on each child as they document what is said and done, select suitable examples and artifacts that illustrate particular abilities and knowledge, and incorporate relevant information from families and others who work with the child (McAfee and Leong, 2011, Meisels et al., 2010). Familial insights are especially useful in assessing children representing minority groups and children with disabilities. Teachers summarize and interpret the assessment information and use it for instructional planning, individualizing instruction, and communicating child progress with families and other stakeholders.

Because authentic assessment is curriculum embedded (i.e., integrated within typical, everyday activities), it is less intrusive for teachers and children (McAfee & Leong, 2011). Further, it can provide more complete information about a child's strengths and developmental needs than can other types of measures (Cabell, Justice, Zucker, & Kilday, 2009). Capturing a child's emerging abilities over time, and their performance as they engage in the active process of learning, provides insights that may not be obtained in a single assessment setting, as is typically the case with direct assessment measures.

As with any type of teacher report, authentic assessment measures have limitations that must be acknowledged (Ackerman and Coley, 2012, Südkamp et al., 2012, Waterman et al., 2012). Teacher-based observational assessments, including authentic assessments, are more subjective than standardized measures, which adhere to specific procedures (Cabell et al., 2009), and they rely on teachers’ abilities to collect relevant information, accurately observe, and effectively analyze and evaluate a range of evidence that represents each child's developmental status and performance.

Some authorities question if teachers can objectively and reliably assess children (Kilday et al., 2012, Mashburn and Henry, 2004, Phillips and Lonigan, 2010, Waterman et al., 2012), particularly when informal assessment measures rather than standardized instruments are used (Lonigan, Allan, & Lerner, 2011). Teachers’ evaluations sometimes do not align with those of parents or with external evaluators (Dinnebeil et al., 2013, Sims and Lonigan, 2012), and informant discrepancies may reflect teacher variability or bias rather than characteristics of the child (Ackerman and Coley, 2012, Konold and Pianta, 2007). Teachers’ assessments may be influenced by their preconceived ideas about children (Burchinal et al., 2011, Mashburn and Henry, 2004), value differences between themselves and families (Hauser-Cram et al., 2003, Sirin et al., 2009), training or education level (Mashburn and Henry, 2004, Meisels et al., 2010), and instrument training/experience (Ackerman and Coley, 2012, Meisels et al., 2010). Classroom factors including percentage of children with special needs (Gallagher & Lambert, 2006), percentage of infants and toddlers (Meisels et al., 2010), and classroom/school socioeconomic status context (Phillips and Lonigan, 2010, Ready and Wright, 2011) can also affect teachers’ assessments.

Despite the acknowledged limitations, authentic assessment instruments have become increasingly popular (Ackerman and Coley, 2012, Bagnato et al., 2014, Frey et al., 2012), and results from several studies suggest that teachers can accurately assess children using such measures (Meisels et al., 2001, Moreno and Klute, 2011). Relevance to curriculum, involvement with families, and the ability to consider contextual information and functional competencies as teachers regularly assess children during daily activities make such measures particularly appealing (Copple and Bredekamp, 2009, Gullo, 2006). Furthermore, authentic assessment tools utilize the principles of universal design for learning, especially important for children with disabilities and dual language learners because such assessment allows them to demonstrate what they know and can do using multiple means of representation, engagement, and expression (Buysse and Hollingsworth, 2009, Horn and Banerjee, 2009). Careful observation as children play and engage in different activities can help teachers better understand the child's thought processes and provide timely and appropriate instruction and support (Piaget, 1972, Vygotsky, 1978). This type of information is needed by teachers to guide their daily instruction and to plan future experiences to advance each child's development and learning.

Several authentic assessment instruments are currently used in early childhood classrooms, with a few being widely used for a number of years. In each of these measures teachers familiar with the child gather information on multiple occasions and document children's demonstration of their knowledge, skills, and behaviors across different domains (Hyson, 2008). Assessment measures differ in several respects including the domains measured and how they are organized, how teachers rate and evaluate each child, the age ranges for which they are intended, their psychometric properties, and the samples used in their validation studies. Overall, low-moderate to high psychometric properties are reported by instrument developers and/or other researchers. However, validation studies of the popular measures were generally conducted using relatively small and/or limited samples from Early Head Start, Head Start, or urban school districts located in restricted geographic areas. Described as follows are several of the most popular and/or new authentic observational measures for use in early childhood settings (see individual studies for detailed information on each instrument).

Meisels and colleagues developed several authentic, teacher-observation measures for use with children of varying ages. The Work Sampling System® (WSS) is a curriculum-embedded assessment instrument intended for use with children age 3 to grade 5 (Meisels, Liaw, Dorfman, & Nelson, 1995). Teachers evaluate children's performance in the areas of personal and social development, language and literacy, mathematical thinking, scientific thinking, social studies, the arts, and physical development (Meisels et al., 2001). Checklists (performance indicators rated as Not Yet, In Process, or Proficient), Portfolios (related to core and individual items), and Summary Reports (three times per year) are used to evaluate children's performance (Meisels, 1996). Moderate to high internal consistency reliability (α = .84–.95) and moderate inter-rater reliability (.68–.73) with a kindergarten sample (n = 100) were reported (Meisels et al., 1995). In a study of the language/literacy and mathematics portions of the WSS with K-3 children in Title I classrooms (n = 345), 92% of the correlations between a standardized measure and the WSS were between .50 and .75 (Meisels et al., 2001). The measure accurately discriminated first through third grade children who were at-risk and those not at-risk for language and literacy (84%) and mathematics (80%).
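As a rough illustration of the internal consistency index reported for these measures, Cronbach's α for k items can be computed as k/(k − 1) multiplied by one minus the ratio of summed item variances to the variance of the total scores. The sketch below is a minimal standard-library version; the function name `cronbach_alpha` and the small ratings matrix are hypothetical and are not data from any of the cited studies.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row of item ratings per child)."""
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two children")
    # Sample variance of each item (column) across children.
    item_vars = [variance([row[j] for row in scores]) for j in range(n_items)]
    # Sample variance of each child's total score.
    total_var = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings: 5 children x 3 items, coded 0 = Not Yet,
# 1 = In Process, 2 = Proficient (illustrative values only).
ratings = [
    [0, 1, 1],
    [1, 1, 2],
    [2, 2, 2],
    [0, 0, 1],
    [1, 2, 2],
]
alpha = cronbach_alpha(ratings)
print(f"alpha = {alpha:.2f}")
```

Values near 1 indicate that the items vary together across children; the published coefficients above would have been computed on the full item sets and samples of each study.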

Work Sampling for Head Start (WSHS) is a modification of the WSS for use with 3- and 4-year-old children in Head Start programs (Meisels, Xue, & Shamblott, 2008). While it incorporates observational and checklist components, it does not use the portfolio component of WSS. Children are rated on performance indicators as Not Yet, In Process, or Proficient for each of the 55 items on the measure. High internal consistency reliability (α = .90–.94) of the language, literacy, and mathematics subscales was noted using a sample of 112 older threes and four-year-old children attending Head Start and community-based programs (Meisels et al., 2008). Correlations between WSHS ratings and direct, standardized, norm-referenced tests ranged from .30 to .44. Predictive validity was reported for children at risk for learning problems in literacy and mathematics with the WSHS accounting for approximately 20% of the variance after controlling for demographic variables (Meisels et al., 2008).

The Child Observation Record (COR) is a 30-item observational measure intended to assess the cognitive, social, and motor development of children 2.5 to 6 years of age (High/Scope Educational Research Foundation, 1992). In a study of reliability and validity, teacher teams in one state collected ongoing anecdotal observations of approximately 2500 Head Start children for at least a month. They used the observations as the basis for evaluating child performance on a 5-level competency sequence of low to high. COR demonstrated acceptable internal consistency (α = .66–.93) for its six-factor structure (i.e., initiative, social relations, creative representation, music and movement, language and literacy, and logic and mathematics). Inter-rater reliability between teachers and assistants varied from .57 to .76. Concurrent validity with a subsample of 94 children with a standardized measure ranged from .27 to .66 with most correlations falling in the .30 to .40 range (Schweinhart, McNair, Barnes, & Larner, 1993). In a study conducted by Fantuzzo and colleagues including two samples of urban preschool Head Start (n = 733) and a mixed program sample (n = 1427), high internal consistency reliability (α = .86–.95) was reported (Fantuzzo, Hightower, Grim, & Montes, 2002) for a three-factor structure (i.e., Cognitive Skills, Social Engagement, and Coordinated Movement) of the measure. Convergent validity for the Cognitive Skills dimension and convergent and discriminant validity for the Social-Engagement dimension were also noted by the researchers. More recently, convergent and divergent validity of the COR was explored with a different sample of urban Head Start children (n = 242) (Sekino & Fantuzzo, 2005).

Designed for use with younger children, the Infant-Toddler COR (High/Scope Educational Research Foundation, 2002) is a 28-item measure assessing children six weeks to age three in six broad areas (i.e., sense of self, social relations, creative representation, movement, communication and language, exploration and early logic). Adults make anecdotal observations related to measurement items and then organize and analyze them to create a portrait of each child. In a small study of 50 infants, high internal consistency reliability was reported (α = .99 for the total 28 items; α = .92 or .93 for the items in the six areas). Inter-rater reliability between nine pairs of caregivers was .93 for the total scale and ranged from .83 to .91 for the six areas. High concurrent validity of the total Infant-Toddler COR with a standardized measure was also reported (i.e., .91 and .87); however, the correlations were reduced when the effects of age were removed (High/Scope Educational Research Foundation, 2002).
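A drop in correlation once age is removed is what a first-order partial correlation shows when both measures increase with age. The sketch below assumes the adjustment was of that form (the source does not specify the method), and the input correlations are hypothetical values chosen only to show the effect; `partial_corr` is an illustrative name.

```python
from math import sqrt


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z from both."""
    denom = sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denom == 0:
        raise ValueError("x or y is perfectly correlated with z")
    return (r_xy - r_xz * r_yz) / denom


# Hypothetical inputs (not from the cited study): two measures that correlate
# .90 with each other and .80 each with age in months.
raw = 0.90
adjusted = partial_corr(raw, 0.80, 0.80)
print(f"raw r = {raw:.2f}, age-adjusted r = {adjusted:.2f}")
```

Because scores on developmental measures rise steeply with age in infancy, a shared age trend can account for a large part of a raw validity coefficient in a mixed-age sample.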

In response to the call for additional options for appropriate, valid, and reliable tools for assessing young children, new authentic assessment measures have been developed. These new instruments fill a critical gap in the field (Snow & Van Hemel, 2008) as increasing numbers of young children are enrolled in out-of-home programs (Barnett et al., 2010). Many of these children are infants and/or represent diverse populations such as dual language learners, children with disabilities, or children living in poverty.

The Ounce Scale™ was developed by Meisels and colleagues for use with children from birth to 42 months (Meisels et al., 2010). It is used by caregivers/teachers to measure child progress in social–emotional, language, cognitive, and physical development. The Ounce™ includes three components. Observational Records are used to guide teacher/caregiver observations and documentations. Family Albums provide a way for families to learn more about development and to document their child's development. Developmental Profiles and Standards are used by teachers/staff to summarize information from the observational records. The profiles contain eight non-overlapping age ranges with 12–16 different items per age band. Using the appropriate chronological age range, teachers evaluate child performance using a two-point scale (i.e., “Developing as Expected” or “Needs Improvement”). A sample of 287 children and 124 teachers in Early Head Start programs in a large metropolitan area were included in a study of the measure's reliability and validity. Internal consistency as indicated by Cronbach's alpha ranged from .19 to .89 with most age groups showing reliabilities of more than .62. Correlations with standardized measures were higher for older children (i.e., 30–42 months) than for younger children. Receiver operating characteristic curve analyses supported teachers’ abilities to accurately identify children who were at risk with more than 70% of children identified correctly (Meisels et al., 2010).

The LTR-CAR is designed to assess children from birth to age three (Moreno & Klute, 2011). Unlike the previously mentioned measures, it takes a learning-topics approach to assessment (i.e., play, interacting in tune, expressive communication, receptive communication, emerging literacy) rather than a developmental domain approach. Three or four times a year each child is scored as “Got It” or “Open Opportunity” on 144 items on the checklist. In a study of 136 children enrolled in an Early Head Start program, internal consistency reliability of the measure ranged from α = .66 to .95 in the fall and from α = .89 to .97 in the spring. In general, the measure functioned as expected relative to item difficulty (i.e., only 5 of 144 items were out of order) and chronological age (fall r[135] = .68, p < .001; spring r[122] = .70, p < .001). Agreement between criterion measures and LTR-CAR for prediction of children at-risk was generally high (range, 77.6%–89.1%; average = 84.6%).

GOLD® is designed to measure a child's progress in the major developmental and content areas for children ages birth through kindergarten (Heroman et al., 2010). It is intended for use with typically developing children, children with disabilities, children who demonstrate competencies beyond typical developmental expectations, and dual language learners. The assessment tool can be used in early education programs that incorporate the Teaching Strategies curricula as well as in programs which do not use the curricula (Teaching Strategies LLC, n.d.).

The 38 GOLD® objectives and accompanying rating scale items help teachers focus the assessment process as they regularly gather child information through observations, conversations with children and families, samples of children's work, photos, video clips, recordings, etc. The assessment information is to be used in planning appropriate experiences, individualizing instruction, and monitoring and communicating child progress to families and other stakeholders. Data may also be used to help teachers ascertain when additional information or more specific evaluation is needed. As such, GOLD® is a formative assessment measure; it is not a test, nor is it intended to be used as a diagnostic, clinical, or high-stakes instrument.

Although GOLD® is similar in some ways to other authentic measures, the tool adds unique contributions to the validated authentic assessment measures currently used. For example, the ability to use a single instrument to assess children from birth to 71 months, rather than having several different measures (High/Scope Educational Research Foundation, 2002, Meisels et al., 1995, Meisels et al., 2008, Meisels et al., 2010, Schweinhart et al., 1993) has benefits. One instrument which assesses the same broad objectives throughout the early childhood period, with developmentally appropriate progressions, can be especially beneficial for tracking development and learning longitudinally. It can also assist with program continuity (Snow & Van Hemel, 2008) when children move from one classroom to the next because teachers are already familiar with the assessment system and objectives.

GOLD® offers a broader range of item-level rating scale points and behavioral anchors (i.e., 10 levels) than COR (5 levels), Work Sampling (3 levels), and CAR (2 categories), which helps to decrease the likelihood of floor and ceiling effects and can provide useful instructional information for teachers. Indicator and “in between levels” in GOLD® allow for additional rating scale points and steps in the progressions. The levels in between demonstrate that a child's knowledge, skills, and behaviors are emerging but are not fully established. They help teachers know when to provide support or scaffold child efforts (Early et al., 2010; Vygotsky, 1978). They are also especially helpful for documenting increments of progress for younger children, dual language learners, and children with disabilities. If used as intended, they can assist the teacher in providing supportive experiences for all children, in planning individualized instruction or specialized small group activities, and in knowing when additional information or more specific evaluation is needed (Lopez, Salas, & Flores, 2005).

Development of GOLD® occurred over several years. Its publishers originally proposed to revise the three developmental continua (Teaching Strategies LLC, 2001, Teaching Strategies LLC, 2005, Teaching Strategies LLC, 2006) which were being widely used in early childhood programs (Hyson, 2008). Upon further review of the existing measures and new research, the decision was made to develop a completely new assessment instrument.

Feedback from teachers, administrators, consultants, and Teaching Strategies, LLC professional-development and research personnel was used in the development process. Pilot studies with diverse populations were conducted, and a draft of the measure was sent to leading authorities in their respective fields for content review. Revisions were made based on results of the content validation and pilot studies. Final assessment items were selected on the basis of feedback received during the development process; state early learning standards and the Head Start Child Development and Early Learning Framework (U.S. Department of Health & Human Services, Administration of Children & Families, Office of Head Start, 2010); and current research and professional literature including literature that identifies which knowledge, skills, and behaviors are most predictive of school success. This process resulted in a measure having a total of 38 objectives with 23 of them in the areas of social–emotional, physical, language, cognitive, literacy, and mathematics. Although GOLD® includes objectives in other areas (i.e., science and technology, social studies, the arts, and English language acquisition), they are not included in the analyses reported in this paper.

Objectives in the social–emotional domain involve understanding, regulating, and expressing emotions; building relationships with others; and interacting appropriately in social situations. Social–emotional competence is critical to children's later academic, social, and psychological outcomes (McCabe & Altamura, 2011). When children's interactions and relationships are positive, they are more likely to have positive short- and long-term outcomes (Commodari, 2013, Peisner-Feinberg et al., 2001, Rubin et al., 1998, Smith and Hart, 2002). Self-regulation is a particularly important construct in the social–emotional domain and is related to academic achievement (McClelland and Cameron, 2011, Ponitz et al., 2009, Suchodoletz et al., 2009). Both self-regulation and social competence predict children's later reading and math skills (McClelland, Acock, & Morrison, 2006).

The physical domain objectives include gross-motor development (traveling, balancing, and gross-motor manipulative skills) and fine-motor strength and coordination. Physical development affects children's emotional development and their school performance (Rule and Stewart, 2002, Son and Meisels, 2006). It can also affect their social and language development as they interact with peers (Kim, 2005).

The language objectives include understanding and using language to communicate or express thoughts and needs. Language comprehension influences other areas of development such as the closeness of teacher–child relationships (Justice, Cottone, Mashburn, & Rimm-Kaufman, 2008). Language has been found to predict reading skills several years later (Snow, Burns, & Griffin, 1998), and aspects of oral language predict reading comprehension (Roth, Speece, & Cooper, 2002). Children without early experiences that support language development show substantial differences in language understanding and use by age three (Strickland & Shanahan, 2004). When dual language learners acquire English proficiency by the end of kindergarten, they have better cognitive and behavioral outcomes throughout the early school years and beyond than children who become English-proficient after kindergarten (Halle, Hair, Wandner, McNamara, & Chien, 2012).

Objectives in the cognitive domain include approaches to learning (e.g., attention, curiosity, initiative, flexibility, problem solving); memory; classification skills; and the use of symbols to represent objects, events, or persons not present. Symbolic thinking is necessary for language development, problem solving, reading, writing, and mathematical thinking (Deloache, 2004, Younger and Johnson, 2004), and children's symbolic substitution during sociodramatic play is related to their later reading and math skills (Hanline, Milton, & Phelps, 2008). Children's ability to classify is important for learning and remembering (Larkina, Guler, Kleinknecht, & Bauer, 2008), and the more knowledgeable they are about a topic, the more likely they are to categorize at a more mature level (Bjorklund, 2005, Gelman, 1998). The way children approach learning has received increased attention in recent years (Hyson, 2008). Children who have positive approaches to learning are more likely to succeed academically (Howse, Lange, Farran, & Boyles, 2003) and to have more positive interactions with peers (Fantuzzo et al., 2004, Hyson, 2008) than children who do not exhibit these characteristics.

The literacy objectives incorporate phonological awareness; alphabet, print, and book knowledge; comprehension; and emergent writing skills. Letter/name writing predicts later literacy, and phonological sensitivity; alphabet knowledge and knowledge of print concepts predict later reading, writing, and spelling success (National Early Literacy Panel [NELP], 2008). Preschool children's development in oral language, phonological awareness, and print knowledge is predictive of later reading abilities (Lonigan et al., 2011). Children who begin school with less phonological sensitivity, familiarity with the basic purposes and mechanisms of reading, and letter knowledge are especially likely to have difficulty learning to read in the primary grades (NELP, 2008, Snow et al., 1998). Letter knowledge and global phonological sensitivity have been found to be predictors of early reading abilities (Lonigan, Burgess, & Anthony, 2000), while vocabulary and print knowledge are predictive of later numeracy (Purpura, Hume, Sims, & Lonigan, 2011).

The mathematics objectives focus on number concepts and operations, spatial relationships and shapes, measurement and comparison, and pattern knowledge. Children enter school with “everyday” mathematics abilities (Ginsburg, Lee, & Boyd, 2008), and their mathematical skills upon entry to kindergarten are predictive of later reading and math achievement (Duncan et al., 2007). Children's spatial sense is important to other aspects of mathematics, and children with a strong spatial sense tend to do better in mathematics than children without a strong spatial sense (Clements, 2004). Their understandings about counting, number symbols, and number operations are fundamental to their success with more complex mathematics (Ginsburg and Baroody, 2003, Zur and Gelman, 2004).

Several researchers have examined the psychometric properties of GOLD® for its use with children representing different ethnic, racial, language, functional status, and age groups. These initial studies, summarized as follows, suggest that GOLD® is a psychometrically promising instrument which has utility for children representing diverse populations. Using a small sample of infants through children two years of age (n = 290), high internal consistency reliability of GOLD® (α = .95–.99) was found (Kim & Smith, 2010). Rasch reliability statistics were also high (person separation = 9.42, item separation = 19.20, person reliability = .99, item reliability = .99). Another study looked at the validity of GOLD® for assessing children with disabilities and those for whom English is not their first language. Assessment information was collected on 3-, 4-, and 5-year-old children at the fall (n = 79,324), winter (n = 132,693), and spring (n = 50,558) checkpoints. Differential Item Functioning (DIF) analysis indicated that in general, teachers’ ratings were similar for children of similar abilities, regardless of their subgroup membership (Kim, Lambert, & Burts, 2013).
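For readers less familiar with Rasch statistics, separation (G) and reliability (R) are conventionally two expressions of the same ratio of true spread to measurement error, linked by R = G² / (1 + G²). The sketch below assumes that conventional relation applies to the figures quoted above (the source does not state the formula or software used); the function names are illustrative.

```python
from math import sqrt


def reliability_from_separation(g):
    """Rasch reliability implied by a separation index G (true SD / error RMSE)."""
    return g * g / (1 + g * g)


def separation_from_reliability(r):
    """Inverse relation; undefined for r >= 1."""
    if not 0 <= r < 1:
        raise ValueError("reliability must be in [0, 1)")
    return sqrt(r / (1 - r))


# Person separation quoted in the text above.
person_rel = reliability_from_separation(9.42)
print(f"person reliability = {person_rel:.3f}")
```

Under this relation a reliability of .80 corresponds to a separation of only 2, so separation spreads out values that reliability compresses near its ceiling of 1.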

Associations of teacher ratings with child demographics (e.g., age, gender, disability status, and English language status) and classroom composition characteristics (e.g., class mean age and percentage ELLs, children with disabilities, and males) were examined with a sample of 21,592 children ages 12 months through 59 months. Using three-level growth curve modeling, findings indicated that teachers’ GOLD® ratings were associated in anticipated directions for both child and classroom characteristics (Lambert, Kim, & Burts, 2014).

The dimensionality, rating scale effectiveness, hierarchy of item difficulties, and the relationship of GOLD^® developmental scale scores to child age have also been examined. Data from a sample (n = 10,963) of children ages birth to 71 months were analyzed using the Rasch Rating Scale Model. Support was found for the unidimensionality of each domain (i.e., items in each scale measure one and only one underlying latent construct). Results further indicated that teachers can make valid ratings of the developmental progress of children across the measured age range. Correlations were moderately high between each of the scale scores and child age in months, with correlation coefficients ranging from .67 to .73 (Kim et al., 2013).
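The Rasch Rating Scale Model mentioned above assigns each rating category a probability that depends on the child's ability, the item's difficulty, and a set of category thresholds shared by all items. The Python sketch below shows the standard form of those category probabilities. The function name and all parameter values are hypothetical, and it does not reproduce the estimation carried out in the cited study.

```python
import math


def rsm_category_probs(theta, delta, thresholds):
    """Category probabilities under the Rasch Rating Scale Model.

    theta: person (child) ability in logits
    delta: item difficulty in logits
    thresholds: step thresholds tau_1..tau_M shared across items
    Returns probabilities for categories 0..M.
    """
    # Cumulative logit for category k is the sum over j <= k of (theta - delta - tau_j);
    # category 0 has an empty sum, i.e. 0.
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + (theta - delta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expected_rating(theta, delta, thresholds):
    """Model-implied mean rating for one child on one item."""
    probs = rsm_category_probs(theta, delta, thresholds)
    return sum(k * p for k, p in enumerate(probs))


# Invented values: a child whose ability equals the item difficulty, three categories.
probs = rsm_category_probs(theta=0.0, delta=0.0, thresholds=[-1.0, 1.0])
```

The expected rating rises monotonically with ability, which is the property that lets a summed scale score be read as a position on a single developmental continuum.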

Using a different sample of preschool children, researchers examined how GOLD^® scale scores related to teacher ratings of children's social functioning and learning behaviors and to child performance on individually administered direct assessments of academic skills (Lambert, Kim, & Burts, 2013). The sample (n = 299) was diverse and included children attending 51 different Head Start, public pre-k, and private school classrooms across 16 centers in the Northeast United States. In general, the correlations of the external measures with the GOLD^® domains were moderate and appeared in the expected, aligned areas. For example, scores from the Social–Emotional scale correlated moderately with measures of similar constructs (r = .42–.52). Similar results were found for the Language (r = .20–.48), Literacy (r = .18–.45), and Mathematics (r = .29–.52) scales.

Concurrent validity was also examined by researchers in Washington state (Soderberg et al., 2013). In a study using a modified version of GOLD^® (i.e., WaKIDS) with kindergarten children (n = 333), scores in the Language, Literacy, and Mathematics areas correlated moderately (r = .50–.64) with a battery of established norm-referenced achievement instruments.

The purpose of this article is to describe three studies presenting additional evidence for the reliability and validity of the scale scores that can be produced using teacher ratings elicited by the GOLD^® assessment system. Although previous studies provide some support for the use of GOLD^® with various groups, they have not addressed several issues of psychometric adequacy recommended by the National Research Council (Snow & Van Hemel, 2008). Specifically, this study addresses the following research questions: (a) Is there evidence that the information the measure provides can be used to measure the intended constructs? (b) Is there evidence that the factorial structure is invariant across the three measurements during the typical academic year? (c) Using both classical and modern measures of score reliability, is there evidence that the scale scores measure the intended constructs reliably? (d) Is there evidence of inter-rater reliability when the ratings of teachers and master raters are compared? The concurrent validity study addressed the following question: While accounting for teacher rating and clustering effects, is there evidence of associations between GOLD^® scale scores and child performance on a direct assessment of academic skills?

Section snippets

Main study

A total population of 111,059 children was rated using the GOLD^® assessment system for the fall 2010 checkpoint. These children received educational services in 735 different programs at 3792 different centers located in all regions of the United States. These programs and centers included Head Start, private child care, and school-based sites. All 50 states and the District of Columbia were represented. Most of the participating centers, although not all, used the curricula developed by

Confirmatory factor analysis of cross-sectional data

To address the first research question, the factorial structure of the GOLD^® was examined using confirmatory factor analysis (CFA) in Mplus (Muthén & Muthén, 1998–2010). The first sample data was used for CFA. Given its basis in developmental theory, a six-factor model at the item level that corresponds to the designed structure of the instrument was examined. The chi-square test can be used to evaluate model fit. However, given the sensitivity of this test to sample sizes, alternative

Discussion

Previous studies suggest that GOLD^® (Heroman et al., 2010) yields valid and reliable inferences for its intended populations (Kim et al., 2013, Kim et al., 2014). However, several important questions have remained unanswered. The studies reported in this paper were conducted to extend the initial research on the reliability and validity of the scale scores produced by teacher ratings elicited using GOLD^®.

Although it is a relatively new assessment instrument, GOLD^® is widely used in

Author note

This article is based on some of the same datasets and analyses contained in a previously released technical report entitled Technical Manual for the Teaching Strategies GOLD™ assessment system. Partial funding for this research was provided by Teaching Strategies, LLC. Opinions are those of the authors and do not necessarily reflect those of the funding agency.

References (120)

D.R. Becker et al. Behavioral self-regulation and executive function both predict visuomotor skills and early academic achievement. Early Childhood Research Quarterly (2014).

M. Bridges et al. Bien Educado: Measuring the social behaviors of Mexican American children. Early Childhood Research Quarterly (2012).

E. Commodari. Preschool teacher attachment, school readiness and risk of learning difficulties. Early Childhood Research Quarterly (2013).

J.S. DeLoache. Becoming symbol-minded. Trends in Cognitive Sciences (2004).

L.A. Dinnebeil et al. Influences on the congruence between parents’ and teachers’ ratings of young children's social skills and problem behaviors. Early Childhood Research Quarterly (2013).

D.M. Early et al. How do pre-kindergarteners spend their time? Gender, ethnicity, and income as predictors of experiences in pre-kindergarten classrooms. Early Childhood Research Quarterly (2010).

J.F. Fantuzzo et al. Generalization of the child observation record: A validity study for diverse samples of urban, low-income preschool children. Early Childhood Research Quarterly (2002).

T. Halle et al. Predictors and outcomes of early versus later English language proficiency among English language learners. Early Childhood Research Quarterly (2012).

A.H. Hindman et al. Ecological contexts and early learning: Contributions of child, family, and classroom factors during Head Start, to literacy and mathematics growth through first grade. Early Childhood Research Quarterly (2010).

M. Larkina et al. Maternal provision of structure in a deliberate memory task in relation to their preschool children's recall. Journal of Experimental Child Psychology (2008).

M.M. McClelland et al. The impact of kindergarten learning related skills on academic trajectories at the end of elementary school. Early Childhood Research Quarterly (2006).

P.A. McDermott et al. Measuring preschool cognitive growth while it's still happening: The learning express. Journal of School Psychology (2009).

S.J. Meisels et al. The work sampling system: Reliability and validity of a performance assessment for young children. Early Childhood Research Quarterly (1995).

A.J. Moreno et al. Infant-toddler teachers can successfully employ authentic assessment: The Learning Through Relating system. Early Childhood Research Quarterly (2011).

D.J. Purpura et al. Early literacy and early numeracy: The value of including early literacy skills in the prediction of numeracy. Journal of Experimental Child Psychology (2011).

C. Scott-Little et al. Conceptualization of readiness and the content of early learning standards: The intersection of policy and research? Early Childhood Research Quarterly (2006).

D.J. Ackerman et al. State Pre-K assessment policies: Issues and status. Policy Information Report, Educational Testing Service (2012).

S.J. Bagnato. The authentic alternative for assessment in early intervention: An emerging evidence-based practice. Journal of Early Intervention (2005).

S.J. Bagnato et al. Identifying instructional targets for early childhood via authentic assessment: Alignment of professional standards and practice-based evidence. Journal of Early Intervention (2011).

S.J. Bagnato et al. Authentic assessment as “Best Practice” for early childhood intervention: National consumer social validity research. Topics in Early Childhood Special Education (2014).

W.S. Barnett et al. The state of preschool 2010 (2010).

L.E. Berk. Child development (2009).

D.F. Bjorklund. Children's thinking: Cognitive development and individual differences (2005).

T.G. Bond et al. Applying the Rasch model: Fundamental measurement in the human sciences (2007).

C.M. Bordignon et al. The early assessment conundrum: Lessons from the past, implications for the future. Psychology in the Schools (2004).

U. Bronfenbrenner. Ecology of the family as a context for human development: Research perspectives. Developmental Psychology (1986).

U. Bronfenbrenner et al. The bioecological model of human development.

M.W. Browne et al. Alternative ways of assessing model fit.

M. Burchinal et al. Examining the Black–White achievement gap among low-income children using the NICHD study of early child care and youth development. Child Development (2011).

V. Buysse et al. Program quality and early childhood inclusion: Recommendations for professional development. Topics in Early Childhood Special Education Online First (2009).

S.Q. Cabell et al. Validity of teacher report for assessing the emergent literacy skills of at-risk preschoolers. Language, Speech, and Hearing Services in Schools (2009).

G.W. Cheung et al. Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling (2002).

D.H. Clements. Geometric and spatial thinking in early childhood education.

C.G. Decker. Teaching strategies GOLD: Testing reliability and validity using the Bracken School Readiness Assessment (2013).

J. Dewey. My pedagogic creed. School Journal (1897).

G.J. Duncan et al. School readiness and later achievement. Developmental Psychology (2007).

J. Fantuzzo et al. Preschool approaches to learning and their relationship to other relevant classroom competencies for low-income children. School Psychology Quarterly (2004).

B.B. Frey et al. Defining authentic classroom assessment. Practical Assessment, Research & Evaluation (2012).

P.A. Gallagher et al. Classroom quality, concentration of children with special needs, and child outcomes in Head Start. Exceptional Children (2006).

S.A. Gelman. Categories in young children's thinking. Young Children (1998).

H.P. Ginsburg et al. Test of early mathematics ability: Examiner's manual (2003).

H.P. Ginsburg et al. Mathematics education for young children: What it is and how to promote it. Social Policy Report (2008).

D.F. Gullo. Assessment in kindergarten.

R. Hallam et al. The effects of outcomes driven authentic assessment on classroom quality (2007).

M.F. Hanline et al. A longitudinal study exploring the relationship of representational levels of three aspects of preschool sociodramatic play and early academic skills (2008).

P. Hauser-Cram et al. When teachers’ and parents’ values differ: Teachers’ ratings of academic competence in children from low-income families. Journal of Educational Psychology (2003).

C. Heroman et al. The creative curriculum for preschool—Volume 5, objectives for development & learning: Birth through kindergarten (2010).

High/Scope Educational Research Foundation. High/Scope Child Observation Record (COR) for ages 2½–6 (1992).

High/Scope Educational Research Foundation. Infant–toddler Child Observation Record (COR) Appendix B: Development and validation (2002).

