The measurement properties of the Teaching Strategies GOLD® assessment system
Introduction
Appropriate child assessment plays a vital role in high-quality early care and education programs (NAEYC & NAECS/SDE, 2003). Assessment measures that are well designed, implemented effectively, and interpreted and used appropriately can inform teaching and contribute to better child outcomes (Snow & Van Hemel, 2008). On the other hand, inadequately designed measures or those that do not consider the rapidly changing demographics in early education programs (Barnett et al., 2010, Copple and Bredekamp, 2009) can result in children being mislabeled or not receiving appropriate support and optimal learning experiences. To ensure that all children, regardless of culture, language, or disability, are assessed fairly, scientifically informed assessment measures are needed (Hirsh-Pasek et al., 2005, Snow and Van Hemel, 2008, Qi and Marley, 2009). In addition, assessment measures should be validated using samples that are representative of the diversity of children with whom the measure will be used (Bordignon and Lam, 2004, Pena and Halle, 2011, Snow and Van Hemel, 2008). The present study adds to the limited research on one type of child assessment, authentic assessment (Hallam, Grisham-Brown, Gao, & Brookshire, 2007). More specifically, it explores the reliability and validity of the recently developed Teaching Strategies GOLD® (GOLD®) (Heroman, Burts, Berke, & Bickart, 2010) and its use with children birth through kindergarten.
National reports and public policy statements support the call for better accountability and the need to decrease disparities among subgroups of children. Simultaneously they issue cautions and identify standards related to the assessment process. NAEYC and NAECS/SDE indicate that assessment measures should be developmentally appropriate, educationally important, and linguistically and culturally responsive (Copple and Bredekamp, 2009, NAEYC and NAECS/SDE, 2003, NAEYC and NAECS/SDE, 2005). Assessment evidence should be gathered over time, from multiple sources including families, in naturally occurring settings so they accurately reflect and support children's development and learning. Young children's development and learning is uneven and changes rapidly. Although distinct developmental and learning domains are clearly identified in the literature, these areas are overlapping and interrelated (Scott-Little et al., 2006, Shonkoff and Phillips, 2000, Snow and Van Hemel, 2008). Each area influences and is influenced by other areas (Berk, 2009, Copple and Bredekamp, 2009). During the later preschool and kindergarten years, the more integrated and global abilities of infants and toddlers (Moreno & Klute, 2011) are replaced with those that are more differentiated (Kim & Smith, 2010). Various factors affect children's development and learning such as individual differences, culture, and the environment (Copple and Bredekamp, 2009, Hindman et al., 2010, Hyson, 2008, Shonkoff and Phillips, 2000). Because the child's immediate environment exerts tremendous influence on development (Bronfenbrenner & Morris, 2006), it can be limiting to assess young children without considering important contextual settings. Close communication with families provides teachers with useful information about the child and helps them bridge the developmental contexts (Bronfenbrenner, 1986, Dewey, 1897, Mooney, 2000).
In addition to the accepted indices of adequate psychometrics (Snow & Van Hemel, 2008), assessment instruments should also exhibit “empirical validity” (Hirsh-Pasek et al., 2005) so that what they measure is based on current empirical findings in the developmental domains with a focus on predictors of success. Instruments should emphasize the processes (how) of learning rather than just the products (what) of learning. Further, Messick (1995) has argued that the body of research supporting the use of an assessment should include evidence of associations between the information provided by the measure and the information provided by external measures designed to measure similar constructs. Assessment instruments should also be minimally intrusive for teachers and children (McDermott et al., 2009).
Many authorities, including those in early childhood intervention, support a type of informal assessment sometimes referred to as “authentic assessment” (Ackerman and Coley, 2012, Bagnato, 2005, Bagnato et al., 2011, Keilty et al., 2009, Macy and Bagnato, 2010). Although there is no universal definition of the term (Frey, Schmitt, & Allen, 2012), in early childhood, authentic assessment is commonly used to mean a type of performance-based assessment of a child's real-life practical and intellectual challenges that occur within typical daily contexts (Mcafee & Leong, 2011). Teachers collect information on each child as they document what is said and done, select suitable examples and artifacts that illustrate particular abilities and knowledge, and incorporate relevant information from families and others who work with the child (Mcafee and Leong, 2011, Meisels et al., 2010). Familial insights are especially useful in assessing children representing minority groups and children with disabilities. Teachers summarize and interpret the assessment information and use it for instructional planning, individualizing instruction, and communicating child progress with families and other stakeholders.
Because authentic assessment is curriculum embedded (i.e., integrated within typical, everyday activities), it is less intrusive for teachers and children (Mcafee & Leong, 2011). Further, it can provide more complete information about a child's strengths and developmental needs than can other types of measures (Cabell, Justice, Zucker, & Kilday, 2009). Capturing a child's emerging abilities over time and their performance as they engage in the active process of learning provides insights that may not be obtained in one assessment setting, as is typically the case with direct assessment measures.
As with any type of teacher report, authentic assessment measures have limitations that must be acknowledged (Ackerman and Coley, 2012, Südkamp et al., 2012, Waterman et al., 2012). Teacher-based observational assessments, including authentic assessments, are more subjective than standardized measures, which adhere to specific procedures (Cabell et al., 2009), and they rely on teachers’ abilities to collect relevant information, accurately observe, and effectively analyze and evaluate a range of evidence that represents each child's developmental status and performance.
Some authorities question if teachers can objectively and reliably assess children (Kilday et al., 2012, Mashburn and Henry, 2004, Phillips and Lonigan, 2010, Waterman et al., 2012), particularly when informal assessment measures rather than standardized instruments are used (Lonigan, Allan, & Lerner, 2011). Teachers’ evaluations sometimes do not align with those of parents or with external evaluators (Dinnebeil et al., 2013, Sims and Lonigan, 2012), and informant discrepancies may reflect teacher variability or bias rather than characteristics of the child (Ackerman and Coley, 2012, Konold and Pianta, 2007). Teachers’ assessments may be influenced by their preconceived ideas about children (Burchinal et al., 2011, Mashburn and Henry, 2004), value differences between themselves and families (Hauser-Cram et al., 2003, Sirin et al., 2009), training or education level (Mashburn and Henry, 2004, Meisels et al., 2010), and instrument training/experience (Ackerman and Coley, 2012, Meisels et al., 2010). Classroom factors including percentage of children with special needs (Gallagher & Lambert, 2006), percentage of infants and toddlers (Meisels et al., 2010), and classroom/school socioeconomic status context (Phillips and Lonigan, 2010, Ready and Wright, 2011) can also affect teachers’ assessments.
Despite the acknowledged limitations, authentic assessment instruments have become increasingly popular (Ackerman and Coley, 2012, Bagnato et al., 2014, Frey et al., 2012), and results from several studies suggest that teachers can accurately assess children using such measures (Meisels et al., 2001, Moreno and Klute, 2011). Relevance to curriculum, involvement with families, and the ability to consider contextual information and functional competencies as teachers regularly assess children during daily activities make such measures particularly appealing (Copple and Bredekamp, 2009, Gullo, 2006). Furthermore, authentic assessment tools utilize the principles of universal design for learning, especially important for children with disabilities and dual language learners because such assessment allows them to demonstrate what they know and can do using multiple means of representation, engagement, and expression (Buysse and Hollingsworth, 2009, Horn and Banerjee, 2009). Careful observation as children play and engage in different activities can help teachers better understand the child's thought processes and provide timely and appropriate instruction and support (Piaget, 1972, Vygotsky, 1978). This type of information is needed by teachers to guide their daily instruction and to plan future experiences to advance each child's development and learning.
Several authentic assessment instruments are currently used in early childhood classrooms, with a few being widely used for a number of years. In each of these measures teachers familiar with the child gather information on multiple occasions and document children's demonstration of their knowledge, skills, and behaviors across different domains (Hyson, 2008). Assessment measures differ in several respects including the domains measured and how they are organized, how teachers rate and evaluate each child, the age ranges for which they are intended, their psychometric properties, and the samples used in their validation studies. Overall, low-moderate to high psychometric properties are reported by instrument developers and/or other researchers. However, validation studies of the popular measures were generally conducted using relatively small and/or limited samples from Early Head Start, Head Start, or urban school districts located in restricted geographic areas. Described as follows are several of the most popular and/or new authentic observational measures for use in early childhood settings (see individual studies for detailed information on each instrument).
Meisels and colleagues developed several authentic, teacher-observation measures for use with children of varying ages. The Work Sampling System® (WSS) is a curriculum-embedded assessment instrument intended for use with children age 3 to grade 5 (Meisels, Liaw, Dorfman, & Nelson, 1995). Teachers evaluate children's performance in the areas of personal and social development, language and literacy, mathematical thinking, scientific thinking, social studies, the arts, and physical development (Meisels et al., 2001). Checklists (performance indicators rated as Not Yet, In Process, or Proficient), Portfolios (related to core and individual items), and Summary Reports (three times per year) are used to evaluate children's performance (Meisels, 1996). Moderate to high internal consistency reliability (α = .84–.95) and moderate inter-rater reliability (.68–.73) with a kindergarten sample (n = 100) were reported (Meisels et al., 1995). In a study of the language/literacy and mathematics portions of the WSS with K-3 children in Title I classrooms (n = 345), 92% of the correlations between a standardized measure and the WSS were between .50 and .75 (Meisels et al., 2001). The measure accurately discriminated first through third grade children who were at-risk and those not at-risk for language and literacy (84%) and mathematics (80%).
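The internal consistency values cited in this review are Cronbach's alpha coefficients. As a minimal sketch of how such a coefficient is computed from item-level ratings, the Python function below applies the standard alpha formula. The function name `cronbach_alpha` and the small `ratings` table are hypothetical illustrations, not data or code from any of the cited studies.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of ratings.

    scores: list of rows, one row per child, each row holding that
    child's ratings on the same k items (hypothetical layout).
    """
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two children")

    # Sample variance of each item (column) across children.
    item_var_sum = sum(variance(col) for col in zip(*scores))

    # Sample variance of each child's total score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    # alpha = k / (k - 1) * (1 - sum of item variances / total variance)
    return n_items / (n_items - 1) * (1 - item_var_sum / total_var)


# Hypothetical ratings: three children rated on two items.
ratings = [[1, 2], [2, 2], [3, 4]]
print(round(cronbach_alpha(ratings), 3))
```

Alpha approaches 1 when items rise and fall together across children, which is why the values in the .84 to .95 range reported for the WSS are read as moderate to high internal consistency.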
Work Sampling for Head Start (WSHS) is a modification of the WSS for use with 3- and 4-year-old children in Head Start programs (Meisels, Xue, & Shamblott, 2008). While it incorporates observational and checklist components, it does not use the portfolio component of WSS. Children are rated on performance indicators as Not Yet, In Process, or Proficient for each of the 55 items on the measure. High internal consistency reliability (α = .90–.94) of the language, literacy, and mathematics subscales was noted using a sample of 112 older 3-year-old and 4-year-old children attending Head Start and community-based programs (Meisels et al., 2008). Correlations between the WSHS and direct, standardized, norm-referenced tests ranged from .30 to .44. Predictive validity was reported for children at risk for learning problems in literacy and mathematics, with the WSHS accounting for approximately 20% of the variance after controlling for demographic variables (Meisels et al., 2008).
The Child Observation Record (COR) is a 30-item observational measure intended to assess the cognitive, social, and motor development of children 2.5 to 6 years of age (High/Scope Educational Research Foundation, 1992). In a study of reliability and validity, teacher teams in one state collected ongoing anecdotal observations of approximately 2500 Head Start children for at least a month. They used the observations as the basis for evaluating child performance on a 5-level competency sequence of low to high. COR demonstrated acceptable internal consistency (α = .66–.93) for its six-factor structure (i.e., initiative, social relations, creative representation, music and movement, language and literacy, and logic and mathematics). Inter-rater reliability between teachers and assistants varied from .57 to .76. Concurrent validity correlations with a standardized measure, based on a subsample of 94 children, ranged from .27 to .66, with most correlations falling in the .30 to .40 range (Schweinhart, McNair, Barnes, & Larner, 1993). In a study conducted by Fantuzzo and colleagues including two samples of urban preschool Head Start (n = 733) and a mixed program sample (n = 1427), high internal consistency reliability (α = .86–.95) was reported (Fantuzzo, Hightower, Grim, & Montes, 2002) for a three-factor structure (i.e., Cognitive Skills, Social Engagement, and Coordinated Movement) of the measure. Convergent validity for the Cognitive Skills dimension and convergent and discriminant validity for the Social Engagement dimension were also noted by the researchers. More recently, convergent and divergent validity of the COR was explored with a different sample of urban Head Start children (n = 242) (Sekino & Fantuzzo, 2005).
Designed for use with younger children, the Infant-Toddler COR (High/Scope Educational Research Foundation, 2002) is a 28-item measure assessing children six weeks to age three in six broad areas (i.e., sense of self, social relations, creative representation, movement, communication and language, exploration and early logic). Adults make anecdotal observations related to measurement items and then organize and analyze them to create a portrait of each child. In a small study of 50 infants, high internal consistency reliability was reported (α = .99 for the total 28 items; α = .92 or .93 for the items in the six areas). Inter-rater reliability between nine pairs of caregivers was .93 for the total scale and ranged from .83 to .91 for the six areas. High concurrent validity of the total Infant-Toddler COR with a standardized measure was also reported (i.e., .91 and .87); however, the correlations were reduced when the effects of age were removed (High/Scope Educational Research Foundation, 2002).
In response to the call for additional options for appropriate, valid, and reliable tools for assessing young children, new authentic assessment measures have been developed. These new instruments fill a critical gap in the field (Snow & Van Hemel, 2008) as increasing numbers of young children are enrolled in out-of-home programs (Barnett et al., 2010). Many of these children are infants and/or represent diverse populations such as dual language learners, children with disabilities, or children living in poverty.
The Ounce Scale™ was developed by Meisels and colleagues for use with children from birth to 42 months (Meisels et al., 2010). It is used by caregivers/teachers to measure child progress in social–emotional, language, cognitive, and physical development. The Ounce™ includes three components. Observational Records are used to guide teacher/caregiver observations and documentation. Family Albums provide a way for families to learn more about development and to document their child's development. Developmental Profiles and Standards are used by teachers/staff to summarize information from the observational records. The profiles contain eight non-overlapping age ranges with 12–16 different items per age band. Using the appropriate chronological age range, teachers evaluate child performance using a two-point scale (i.e., “Developing as Expected” or “Needs Improvement”). A sample of 287 children and 124 teachers in Early Head Start programs in a large metropolitan area was included in a study of the measure's reliability and validity. Internal consistency as indicated by Cronbach's alpha ranged from .19 to .89, with most age groups showing reliabilities of more than .62. Correlations with standardized measures were higher for older children (i.e., 30–42 months) than for younger children. Receiver operating characteristic curve analyses supported teachers’ abilities to accurately identify children who were at risk, with more than 70% of children identified correctly (Meisels et al., 2010).
The LTR-CAR (hereafter CAR) is designed to assess children from birth to three (Moreno & Klute, 2011). Unlike the previously mentioned measures, it takes a learning-topics approach to assessment (i.e., play, interacting in tune, expressive communication, receptive communication, emerging literacy) rather than a developmental domain approach. Three or four times a year each child is scored as “Got It” or “Open Opportunity” on 144 items on the checklist. In a study of 136 children enrolled in an Early Head Start program, internal consistency reliability of the measure ranged from α = .66 to .95 in the fall and from α = .89 to .97 in the spring. In general, the measure functioned as expected relative to item difficulty (i.e., only 5 of 144 items were out of order) and chronological age (fall r[135] = .68, p < .001; spring r[122] = .70, p < .001). Agreement between criterion measures and the CAR for prediction of children at risk was generally high (range, 77.6%–89.1%; average = 84.6%).
GOLD® is designed to measure a child's progress in the major developmental and content areas for children ages birth through kindergarten (Heroman et al., 2010). It is intended for use with typically developing children, children with disabilities, children who demonstrate competencies beyond typical developmental expectations, and dual language learners. The assessment tool can be used in early education programs that incorporate the Teaching Strategies curricula as well as in programs which do not use the curricula (Teaching Strategies LLC, n.d.).
The 38 GOLD® objectives and accompanying rating scale items help teachers focus the assessment process as they regularly gather child information through observations, conversations with children and families, samples of children's work, photos, video clips, recordings, etc. The assessment information is to be used in planning appropriate experiences, individualizing instruction, and monitoring and communicating child progress to families and other stakeholders. Data may also be used to help teachers ascertain when additional information or more specific evaluation is needed. As such, GOLD® is a formative assessment measure; it is not a test, nor is it intended to be used as a diagnostic, clinical, or high-stakes instrument.
Although GOLD® is similar in some ways to other authentic measures, the tool adds unique contributions to the validated authentic assessment measures currently used. For example, the ability to use a single instrument to assess children from birth to 71 months, rather than having several different measures (High/Scope Educational Research Foundation, 2002, Meisels et al., 1995, Meisels et al., 2008, Meisels et al., 2010, Schweinhart et al., 1993) has benefits. One instrument which assesses the same broad objectives throughout the early childhood period, with developmentally appropriate progressions, can be especially beneficial for tracking development and learning longitudinally. It can also assist with program continuity (Snow & Van Hemel, 2008) when children move from one classroom to the next because teachers are already familiar with the assessment system and objectives.
GOLD® offers a broader range of item-level rating scale points and behavioral anchors (i.e., 10 levels) than COR (5 levels), Work Sampling (3 levels), and CAR (2 categories), which helps to decrease the likelihood of floor and ceiling effects and can provide useful instructional information for teachers. Indicator and “in between levels” in GOLD® allow for additional rating scale points and steps in the progressions. The levels in between demonstrate that a child's knowledge, skills, and behaviors are emerging but are not fully established. They help teachers know when to provide support or scaffold child efforts (Early et al., 2010; Vygotsky, 1978). They are also especially helpful for documenting increments of progress for younger children, dual language learners, and children with disabilities. If used as intended, they can assist the teacher in providing supportive experiences for all children, planning individualized instruction or specialized small group activities, and knowing when additional information or more specific evaluation is needed (Lopez, Salas, & Flores, 2005).
Development of GOLD® occurred over several years. Its publishers originally proposed to revise the three developmental continua (Teaching Strategies LLC, 2001, Teaching Strategies LLC, 2005, Teaching Strategies LLC, 2006) which were being widely used in early childhood programs (Hyson, 2008). Upon further review of the existing measures and new research, the decision was made to develop a completely new assessment instrument.
Feedback from teachers, administrators, consultants, and Teaching Strategies, LLC professional-development and research personnel was used in the development process. Pilot studies with diverse populations were conducted, and a draft of the measure was sent to leading authorities in their respective fields for content review. Revisions were made based on results of the content validation and pilot studies. Final assessment items were selected on the basis of feedback received during the development process; state early learning standards and the Head Start Child Development and Early Learning Framework (U.S. Department of Health & Human Services, Administration of Children & Families, Office of Head Start, 2010); and current research and professional literature including literature that identifies which knowledge, skills, and behaviors are most predictive of school success. This process resulted in a measure having a total of 38 objectives with 23 of them in the areas of social–emotional, physical, language, cognitive, literacy, and mathematics. Although GOLD® includes objectives in other areas (i.e., science and technology, social studies, the arts, and English language acquisition), they are not included in the analyses reported in this paper.
Objectives in the social–emotional domain involve understanding, regulating, and expressing emotions; building relationships with others; and interacting appropriately in social situations. Social–emotional competence is critical to children's later academic, social, and psychological outcomes (McCabe & Altamura, 2011). When children's interactions and relationships are positive, they are more likely to have positive short- and long-term outcomes (Commodari, 2013, Peisner-Feinberg et al., 2001, Rubin et al., 1998, Smith and Hart, 2002). Self-regulation is a particularly important construct in the social–emotional domain and is related to academic achievement (McClelland and Cameron, 2011, Ponitz et al., 2009, Suchodoletz et al., 2009). Both self-regulation and social competence predict children's later reading and math skills (McClelland, Acock, & Morrison, 2006).
The physical domain objectives include gross-motor development (traveling, balancing, and gross-motor manipulative skills) and fine-motor strength and coordination. Physical development affects children's emotional development and their school performance (Rule and Stewart, 2002, Son and Meisels, 2006). It can also affect their social and language development as they interact with peers (Kim, 2005).
The language objectives include understanding and using language to communicate or express thoughts and needs. Language comprehension influences other areas of development such as the closeness of teacher–child relationships (Justice, Cottone, Mashburn, & Rimm-Kaufman, 2008). Language has been found to predict reading skills several years later (Snow, Burns, & Griffin, 1998), and aspects of oral language predict reading comprehension (Roth, Speece, & Cooper, 2002). Children without early experiences that support language development show substantial differences in language understanding and use by age three (Strickland & Shanahan, 2004). When dual language learners acquire English proficiency by the end of kindergarten, they have better cognitive and behavioral outcomes throughout the early school years and beyond than children who become English-proficient after kindergarten (Halle, Hair, Wandner, McNamara, & Chien, 2012).
Objectives in the cognitive domain include approaches to learning (e.g., attention, curiosity, initiative, flexibility, problem solving); memory; classification skills; and the use of symbols to represent objects, events, or persons not present. Symbolic thinking is necessary for language development, problem solving, reading, writing, and mathematical thinking (Deloache, 2004, Younger and Johnson, 2004), and children's symbolic substitution during sociodramatic play is related to their later reading and math skills (Hanline, Milton, & Phelps, 2008). Children's ability to classify is important for learning and remembering (Larkina, Guler, Kleinknecht, & Bauer, 2008), and the more knowledgeable they are about a topic, the more likely they are to categorize at a more mature level (Bjorklund, 2005, Gelman, 1998). The way children approach learning has received increased attention in recent years (Hyson, 2008). Children who have positive approaches to learning are more likely to succeed academically (Howse, Lange, Farran, & Boyles, 2003) and to have more positive interactions with peers (Fantuzzo et al., 2004, Hyson, 2008) than children who do not exhibit these characteristics.
The literacy objectives incorporate phonological awareness; alphabet, print, and book knowledge; comprehension; and emergent writing skills. Letter/name writing predicts later literacy, and phonological sensitivity; alphabet knowledge and knowledge of print concepts predict later reading, writing, and spelling success (National Early Literacy Panel [NELP], 2008). Preschool children's development in oral language, phonological awareness, and print knowledge is predictive of later reading abilities (Lonigan et al., 2011). Children who begin school with less phonological sensitivity, familiarity with the basic purposes and mechanisms of reading, and letter knowledge are especially likely to have difficulty learning to read in the primary grades (NELP, 2008, Snow et al., 1998). Letter knowledge and global phonological sensitivity have been found to be predictors of early reading abilities (Lonigan, Burgess, & Anthony, 2000), while vocabulary and print knowledge are predictive of later numeracy (Purpura, Hume, Sims, & Lonigan, 2011).
The mathematics objectives focus on number concepts and operations, spatial relationships and shapes, measurement and comparison, and pattern knowledge. Children enter school with “everyday” mathematics abilities (Ginsburg, Lee, & Boyd, 2008), and their mathematical skills upon entry to kindergarten are predictive of later reading and math achievement (Duncan et al., 2007). Children's spatial sense is important to other aspects of mathematics, and children with a strong spatial sense tend to do better in mathematics than children without a strong spatial sense (Clements, 2004). Their understandings about counting, number symbols, and number operations are fundamental to their success with more complex mathematics (Ginsburg and Baroody, 2003, Zur and Gelman, 2004).
Several researchers have examined the psychometric properties of GOLD® for its use with children representing different ethnic, racial, language, functional status, and age groups. These initial studies, summarized as follows, suggest that GOLD® is a psychometrically promising instrument which has utility for children representing diverse populations. Using a small sample of infants through children two years of age (n = 290), high internal consistency reliability of GOLD® (α = .95–.99) was found (Kim & Smith, 2010). Rasch reliability statistics were also high (person separation = 9.42, item separation = 19.20, person reliability = .99, item reliability = .99). Another study looked at the validity of GOLD® for assessing children with disabilities and those for whom English is not their first language. Assessment information was collected on 3-, 4-, and 5-year-old children at the fall (n = 79,324), winter (n = 132,693), and spring (n = 50,558) checkpoints. Differential Item Functioning (DIF) analysis indicated that in general, teachers’ ratings were similar for children of similar abilities, regardless of their subgroup membership (Kim, Lambert, & Burts, 2013).
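The Rasch separation and reliability statistics reported above are tied together by a standard relationship in Rasch measurement, shown here as a worked check. This assumes the conventional definition of the separation index G; the cited study may have used software-specific adjustments.

```latex
% Standard Rasch relationship between reliability R and separation G
R = \frac{G^{2}}{1 + G^{2}}

% Worked check with the reported person separation, G = 9.42
R_{\text{person}} = \frac{9.42^{2}}{1 + 9.42^{2}}
                  = \frac{88.7364}{89.7364}
                  \approx .99
```

Under this relationship, a person separation of 9.42 corresponds to the reported person reliability of .99.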
Associations of teacher ratings with child demographics (e.g., age, gender, disability status, and English language status) and classroom composition characteristics (e.g., class mean age and percentage ELLs, children with disabilities, and males) were examined with a sample of 21,592 children ages 12 months through 59 months. Using three-level growth curve modeling, findings indicated that teachers’ GOLD® ratings were associated in anticipated directions for both child and classroom characteristics (Lambert, Kim, & Burts, 2014).
The dimensionality, rating scale effectiveness, hierarchy of item difficulties, and the relationship of GOLD® developmental scale scores to child age have also been examined. Data from a sample (n = 10,963) of children ages birth to 71 months were analyzed using the Rasch Rating Scale Model. Support was found for the unidimensionality of each domain (i.e., items in each scale measure one and only one underlying latent construct). Results further indicated that teachers can make valid ratings of the developmental progress of children across the measured age range. Correlations were moderately high between each of the scale scores and child age in months, with correlation coefficients ranging from .67 to .73 (Kim et al., 2013).
Using a different sample of preschool children, researchers examined the relationships between GOLD® scale scores and teacher ratings of children's social functioning and learning behaviors and child performance on individually administered direct assessments of academic skills (Lambert, Kim, & Burts, 2013). The sample (n = 299) was diverse and included children attending 51 different Head Start, public pre-k, and private school classrooms across 16 centers in the Northeast United States. In general, the correlations of the external measures with the GOLD® domains were moderate and in expected, aligned areas. For example, scores from the Social–Emotional scale correlated moderately with measures of similar constructs (r = .42–.52). Similar results were found for the Language (r = .20–.48), Literacy (r = .18–.45), and Mathematics scales (r = .29–.52).
Concurrent validity was also examined by researchers in Washington state (Soderberg et al., 2013). Using a modified version of GOLD® (i.e., WaKIDS) with kindergarten children (n = 333), moderate correlations (r = .50–.64) with a battery of established norm-referenced achievement instruments were found for the Language, Literacy, and Mathematics areas.
The purpose of this article is to describe three studies presenting additional evidence for the reliability and validity of the scale scores that can be produced using teacher ratings elicited by the GOLD® assessment system. Although previous studies provide some support for the use of GOLD® with various groups, they have not addressed several issues of psychometric adequacy as recommended by the National Research Council (Snow and Van Hemel, 2008). Specifically, this study addresses the following research questions: (a) Is there evidence that the information the measure provides can be used to measure the intended constructs? (b) Is there evidence that the factorial structure is invariant across the three measurements during the typical academic year? (c) Using both classical and modern measures of score reliability, is there evidence that the scale scores measure the intended constructs reliably? (d) Is there evidence of inter-rater reliability when the ratings of teachers and master raters are compared? The concurrent validity study addressed the following question: While accounting for teacher rating and clustering effects, is there evidence of associations between GOLD® scale scores and child performance on a direct assessment of academic skills?
Main study
A total population of 111,059 children was rated using the GOLD® assessment system for the fall 2010 checkpoint. These children received educational services in 735 different programs at 3792 different centers located in all regions of the United States. These programs and centers included Head Start, private child care, and school-based sites. All 50 states and the District of Columbia were represented. Most of the participating centers, although not all, used the curricula developed by
Confirmatory factor analysis of cross-sectional data
To address the first research question, the factorial structure of the GOLD® was examined using confirmatory factor analysis (CFA) in Mplus (Muthén & Muthén, 1998–2010). Data from the first sample were used for the CFA. Given its basis in developmental theory, a six-factor model at the item level that corresponds to the designed structure of the instrument was examined. The chi-square test can be used to evaluate model fit. However, given the sensitivity of this test to sample size, alternative
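To illustrate why fit indices other than the chi-square test are often consulted with large samples, the sketch below computes the root mean square error of approximation (RMSEA), one commonly used index. This excerpt does not state which indices the authors relied on, so RMSEA is an assumption chosen for illustration; the function name `rmsea` and all numeric inputs are hypothetical, and the formula uses N − 1 in the denominator, whereas some software uses N.

```python
import math


def rmsea(chi_square, df, n):
    """Root mean square error of approximation for a fitted model.

    chi_square: model chi-square statistic
    df:         model degrees of freedom (must be positive)
    n:          sample size (must exceed 1)
    """
    if df <= 0:
        raise ValueError("df must be positive")
    if n <= 1:
        raise ValueError("n must exceed 1")
    # Excess of chi-square over its degrees of freedom, floored at zero,
    # scaled by degrees of freedom and (n - 1).
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


# Hypothetical inputs: the same degree of misfit per observation at two
# sample sizes. The chi-square statistic grows with n; RMSEA does not.
small_sample = rmsea(chi_square=300.0, df=100, n=201)
large_sample = rmsea(chi_square=20100.0, df=100, n=20001)
```

In this hypothetical comparison the chi-square statistic is far larger in the bigger sample even though the two RMSEA values agree, which is the kind of sample-size sensitivity the paragraph above describes.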
Discussion
Previous studies suggest that GOLD® (Heroman et al., 2010) yields valid and reliable inferences for its intended populations (Kim et al., 2013, Kim et al., 2014). However, several important questions have remained unanswered. The studies reported in this paper were conducted to expand initial research addressing the reliability and validity of the scale scores produced by teacher ratings elicited using GOLD®.
Although it is a relatively new assessment instrument, GOLD® is widely used in
Author note
This article is based on some of the same datasets and analyses contained in a previously released technical report entitled Technical Manual for the Teaching Strategies GOLD™ assessment system. Partial funding for this research was provided by Teaching Strategies, LLC. Opinions are those of the authors and do not necessarily reflect those of the funding agency.
References (120)
Behavioral self-regulation and executive function both predict visuomotor skills and early academic achievement
Early Childhood Research Quarterly
(2014)
Bien Educado: Measuring the social behaviors of Mexican American children
Early Childhood Research Quarterly
(2012)
Preschool teacher attachment, school readiness and risk of learning difficulties
Early Childhood Research Quarterly
(2013)
Becoming symbol-minded
Trends in Cognitive Sciences
(2004)
Influences on the congruence between parents’ and teachers’ ratings of young children's social skills and problem behaviors
Early Childhood Research Quarterly
(2013)
How do pre-kindergarteners spend their time? Gender, ethnicity, and income as predictors of experiences in pre-kindergarten classrooms
Early Childhood Research Quarterly
(2010)
Generalization of the child observation record: A validity study for diverse samples of urban, low-income preschool children
Early Childhood Research Quarterly
(2002)
Predictors and outcomes of early versus later English language proficiency among English language learners
Early Childhood Research Quarterly
(2012)
Ecological contexts and early learning: Contributions of child, family, and classroom factors during Head Start, to literacy and mathematics growth through first grade
Early Childhood Research Quarterly
(2010)
Maternal provision of structure in a deliberate memory task in relation to their preschool children's recall
Journal of Experimental Child Psychology
(2008)
The impact of kindergarten learning related skills on academic trajectories at the end of elementary school
Early Childhood Research Quarterly
Measuring preschool cognitive growth while it's still happening: The learning express
Journal of School Psychology
The work sampling system: Reliability and validity of a performance assessment for young children
Early Childhood Research Quarterly
Infant-toddler teachers can successfully employ authentic assessment: The Learning Through Relating system
Early Childhood Research Quarterly
Early literacy and early numeracy: The value of including early literacy skills in the prediction of numeracy
Journal of Experimental Child Psychology
Conceptualization of readiness and the content of early learning standards: The intersection of policy and research?
Early Childhood Research Quarterly
State Pre-K assessment policies: Issues and status
Policy information report. Educational testing service
The authentic alternative for assessment in early intervention: An emerging evidence-based practice
Journal of Early Intervention
Identifying instructional targets for early childhood via authentic assessment: Alignment of professional standards and practice-based evidence
Journal of Early Intervention
Authentic assessment as “Best Practice” for early childhood intervention: National consumer social validity research
Topics in Early Childhood Special Education
The state of preschool 2010
Child development
Children's thinking: Cognitive development and individual differences
Applying the Rasch model: Fundamental measurement in the human sciences
The early assessment conundrum: Lessons from the past, implications for the future
Psychology in the Schools
Ecology of the family as a context for human development: Research perspectives
Developmental Psychology
The bioecological model of human development
Alternative ways of assessing model fit
Examining the Black–White achievement gap among low-income children using the NICHD study of early child care and youth development
Child Development
Program quality and early childhood inclusion: Recommendations for professional development
Topics in Early Childhood Special Education Online First
Validity of teacher report for assessing the emergent literacy skills of at-risk preschoolers
Language and Speech & Hearing Services in Schools
Evaluating goodness-of-fit indexes for testing measurement invariance
Structural Equation Modeling
Geometric and spatial thinking in early childhood education
Teaching strategies GOLD: Testing reliability and validity using the Bracken School Readiness Assessment
My pedagogic creed
School Journal
School readiness and later achievement
Developmental Psychology
Preschool approaches to learning and their relationship to other relevant classroom competencies for low-income children
School Psychology Quarterly
Defining authentic classroom assessment
Practical Assessment, Research & Evaluation
Classroom quality, concentration of children with special needs, and child outcomes in Head Start
Exceptional Children
Categories in young children's thinking
Young Children
Test of early mathematics ability: Examiner's manual
Mathematics education for young children: What it is and how to promote it
Social Policy Report
Assessment in kindergarten
The effects of outcomes driven authentic assessment on classroom quality
A longitudinal study exploring the relationship of representational levels of three aspects of preschool sociodramatic play and early academic skills
When teachers’ and parents’ values differ: Teachers’ ratings of academic competence in children from low-income families
Journal of Educational Psychology
The creative curriculum for preschool—Volume 5, objectives for development & learning: Birth through kindergarten
High/scope Child Observation Record (COR) for ages 2½–6
Infant–toddler Child Observation Record (COR) Appendix B: Development and validation