Chapter 2 Practical Foundations

Note: Some material in this chapter is adapted from M. C. Frank et al. (2016) and Marchman and Dale (2017).

In general, a major issue of experimental psychology is that the constructs may depart from the ecological task that is of principal interest (Cronbach and Meehl 1955). Early language is a rare case where these problems are minimized. Measures of early language comprehension and production tend to be face valid and tightly linked to the construct of interest. And early language measures are often very closely related to the ecological task – linguistic communication – that is the theoretical target for explanation. Thus, early language is the rare case where, in principle, consistency and variability can both be explored in a single set of measurements (Bates et al. 1994).

Unfortunately, researchers interested in consistency have often measured theoretically important, carefully chosen phenomena using small convenience samples that suffice to show a proof of concept but do not provide information about variability. In contrast, work on variability between individuals has often focused on larger samples with more reliable tasks that – perhaps as a consequence of their reliability – are less tightly linked to a particular theoretical construct of interest. As we discuss in this chapter, the CDI is a rare case where the overall assessment of language is reliable enough to index variability, yet the individual items are detailed enough to be used to study consistency.

In this chapter, we begin by introducing the CDI and contrasting it with other methods of measuring early language. We then discuss how the cross-linguistic use of parent report methods creates both challenges and opportunities, ending with our development of Wordbank as a way of archiving data on cross-linguistic parent report.

2.1 Measuring early language

2.1.1 The logic of parent report and its strengths

How do you measure young children’s language? Parent report survey instruments like the MacArthur-Bates CDI (Fenson et al. 1994, 2007) and the Language Development Survey (LDS; Rescorla 1989) provide an inexpensive method for researchers to get a global picture of children’s language. The CDIs in particular were developed across a period of more than 40 years. Originally designed for use in a research study (Bates 1976), the instruments have evolved from a structured face-to-face interview to a paper-and-pencil format and are now increasingly administered online (e.g., the web-cdi project; Kristoffersen et al. (2013) for Norwegian; laboratorium.detskarec.sk for Slovak). While other assessment tools exist for slightly older children, to our knowledge, no other measure allows cost-effective global language assessment for children in the critical age ranges between the emergence of language and the period when children become more able to engage in structured, face-to-face activities (around 30 months).

Parent-report instruments, like the CDI and LDS, take advantage of the fact that parents (or other primary caregivers) are experts in the behavior of their own child. Parent reports are based on experiences with the child which are not only more extensive than any researcher or clinician can obtain, but are also more representative of the child’s ability. Parents have experience with their child at play, at meals, at bath and bedtime, at tantrums – in short, with the full range of the child’s life and therefore with the full range of language structures used in these contexts. Parents also have opportunities to hear the child interact with other people: other caregivers, grandparents, siblings, and friends. Because responses on these instruments represent an aggregation over much time and many situations, they are less influenced by factors that can mask a child’s true ability in the laboratory or clinic, such as shyness or compliance, or that can impact the validity of naturalistic sampling, such as word frequency. As Bates, Bretherton, and Snyder (1991) point out, “parental report is likely to reflect what a child knows, whereas [a sample of] free speech reflects those forms that she is more likely to use.” (p. 57).

Because of its format, parent report enables the collection of data from far larger samples of children than would be possible with standardized tests or naturalistic observation. Information from more adequate samples, especially in the form of norms, can benefit both clinical practice and research. Fenson et al. (1994), for example, used the norming data from English versions of the CDIs – a sample of 2,550 children aged 8 to 30 months – to address questions about variability in communicative development. Large samples are especially needed to provide an accurate statistical description of extreme scores. For example, what score corresponds to the 10th percentile? What does the most advanced child (e.g., > 90th percentile) look like at a given age?

Research on questions such as environmental influences on language development benefits substantially from large samples. Correlational research in general is hampered by the problem of multicollinearity: Predictor variables such as parental education, number of books in the home, family size, use of questions vs. imperatives, are likely to be intercorrelated, making it difficult to separate the effects of each of them individually. Large samples in which there is a substantial amount of non-overlapping variance are essential for addressing these questions.

The core of the CDIs is the vocabulary checklist. This list is essentially a “bag of words” which represents the set of words that best capture variation in lexical development across the full spectrum of child ages and abilities. Parents choose the words they believe that their child can currently “understand” (comprehension, measured for younger children) or “understand and say” (production, measured for both younger and older children). A child’s score on a vocabulary checklist represents their comprehension or production “vocabulary size,” indexing that child’s relative status against other children assessed with the same list. In their initial English and Spanish instantiations, the vocabulary checklists were developed in two versions: Words & Gestures (WG; 8–18 months) which contains about 400 words, and the Words & Sentences (WS; 16–30 months), which contains about 700 words. This structure has often been replicated across cross-linguistic adaptations, though there is some variation in form construction (see Chapter 3), and some forms include substantially different numbers of words or include/exclude other measures.
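As a minimal sketch of how a checklist yields a vocabulary size, the Python below scores one administration of a toy six-word form. The word list, the response coding, and the function name `vocabulary_scores` are illustrative assumptions, not part of any actual CDI form or scoring software.

```python
# Hypothetical miniature checklist; real WG/WS forms have hundreds of items.
CHECKLIST = ["ball", "dog", "milk", "shoe", "more", "because"]

# One parent's responses. Each checked item is coded "understands" or
# "produces" (i.e., "understands and says"); unchecked items are absent.
responses = {
    "ball": "produces",
    "dog": "produces",
    "milk": "understands",
    "more": "understands",
}

def vocabulary_scores(checklist, responses):
    """Return (comprehension, production) counts for one administration.

    Assumption: a word marked "produces" also counts toward comprehension,
    because the response option reads "understands and says".
    """
    production = sum(1 for word in checklist if responses.get(word) == "produces")
    comprehension = sum(
        1 for word in checklist
        if responses.get(word) in ("understands", "produces")
    )
    return comprehension, production

comprehension, production = vocabulary_scores(CHECKLIST, responses)
print(comprehension, production)  # 4 2
```

The scores are simple counts, so they are only meaningful relative to other children assessed with the same list.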

The vocabulary checklists contain words from many different semantic (e.g., animal names, household items) and syntactic (e.g., action words, connectives) categories, resulting in broader samples of lexical knowledge than are available from other methods. Importantly, however, these words are not chosen to create a complete list of all words understood or produced by a child. Instead, CDI word lists are constructed to include a set of words that most children will know as well as a sampling of intermediate and more difficult words that will be useful in assessing variability between children.⁵

An additional advantage of the parent-report method is that parents can also report on many different sub-components and correlates of early vocabulary development (limited, of course, by what parents can observe and can reliably report). In particular, the CDI instruments ask about use of communicative gestures, grammar, and symbolic play, in addition to vocabulary comprehension and production. Information about what early vocabulary development correlates with, and what it does not, can yield important theoretical information about the common mechanisms underlying learning.

Of course, parent report has substantial limitations that can lead to both measurement error and bias. These are addressed to some extent by design features of the CDI, and further addressed by evidence for the reliability and validity of the instrument. Because these concerns are so central to our enterprise here, we discuss these issues at length in Chapter 4 both from theoretical and analytic points of view.

2.1.2 Other methods of measuring early vocabulary

As we discuss throughout, parent report is an imperfect method. But often critics of parent report forms like the CDI fail to consider the weaknesses of the alternatives. Here we briefly consider two alternatives: naturalistic observation and experimental testing.

Since they are highly face valid and have the potential for tremendous ecological richness, naturalistic observations are the other leading candidate for measurement of early language. Unfortunately, such observations are extremely costly and time-consuming to transcribe and annotate. These difficulties lead to a trade-off where most observation-based studies either include dense data about a small number of children or smaller amounts of data with a larger sample size. Dense datasets currently provide the best method for in-depth study of the interaction between learning mechanisms and language input in individuals (e.g., Lieven, Salomo, and Tomasello 2009; Roy et al. 2015), but the generalizability of these studies is necessarily limited by their small sample sizes. And sample sizes for such studies are in turn limited by the costs and practicality of gathering and transcribing such data (see e.g., Bergelson and Aslin 2017 for the state of the art). At the other end of the spectrum, assessment of many individual language samples can yield information about individual variability (e.g., Dickinson and Tabors 2001; Cartmill et al. 2013; Weisleder and Fernald 2013), but at a cost in terms of depth.

Further, standardization and avoidance of confounds in naturalistic observation studies are challenging. Although parent report seems at first glance to be much more subject to the biases of individual parents, in fact many of the same confounds arise in other paradigms. For example, should an observation session be during play with the parent or an experimenter? Given that parents vary in their talkativeness during a play session, play with a parent is bound to measure parents’ ability to elicit language, as well as variation in children’s ability or knowledge. But for toddlers, temperamental variation is extreme, so an experimenter play session may simply be impractical for some children (and language use may be limited by shyness rather than a lack of ability). Another difficulty is that words will not occur in a laboratory play session in the same distributions as they would occur across other, more naturalistic and varied contexts: words’ frequency of occurrence will be biased by the particular activities and objects used in the play session. Although these difficulties can be navigated through careful procedural and statistical control, the point of this example is that no observational method offers a perfect solution.

Estimates of production vocabulary from naturalistic observation are highly correlated with the CDI within studies (e.g., Bornstein and Haynes 1998), but are likely to be affected substantially by length of the session, context, and interlocutor when comparing across studies (see e.g., Hidaka 2015 for discussion). And although there exist methods to extract insights about global vocabulary from naturalistic observation, these statistical extrapolations are relatively new and have not been validated extensively (Hidaka 2015). Finally, naturalistic observations do not measure children’s language comprehension, a variable of interest for many early language researchers.

Experimental testing, in contrast, is an excellent method for measuring individual aspects of children’s linguistic abilities, for example their comprehension of a handful of words or their speed of processing (e.g., Bergelson and Swingley 2012; Fernald, Perfors, and Marchman 2006). These methods are much less subject to the confounding of observational methods. But an infant or toddler can only provide a limited number of trials during a single measurement session, even in tasks using eye-movements. Thus, the ability to measure a particular child’s global language ability is limited. Further, the specific words to measure for children of different ages vary – those words that are appropriate for measuring a 14-month-old’s competence are trivially easy for a 24-month-old. And attrition can be quite high for a long measurement session, requiring repeated testing for many participants. Other comprehension vocabulary measures are also available across some range of languages (e.g., the Peabody Picture Vocabulary Test 4, Dunn and Dunn 2007; the Computerized Comprehension Task (CCT), Friend and Keplinger 2008), but most of these assessments are tailored for children older than 2 1/2 years.

In sum, despite some clear weaknesses, for breadth, depth, and ease of data gathering, parent report is unmatched. In Chapter 4, we provide a more extensive discussion of issues surrounding the reliability and validity of parent report.

2.2 Cross-linguistic use of the CDI

2.2.1 Adaptation, not translation!

The CDI was originally designed for English and quickly adapted to Spanish. Since these initial efforts, parallel CDI instruments have been adapted, or are underway, for more than 100 languages (mb-cdi.stanford.edu/adaptations.html), with data from 29 languages or dialects represented in this book. The ethic behind the development of these parallel instruments is “adaptation, not translation” – in other words, to create forms with the same spirit as the English form, but not simply by translating the items (Dale 2015). Instead, developers have been strongly encouraged to craft instruments that reflect the particular linguistic and cultural context that influences the early acquisition of vocabulary and other aspects of language.

The resulting forms vary widely, including differences in length and intended age range. Some forms include hundreds of items more than the original 680 words on the English Words & Sentences form; others are so-called “short forms” and include only a hundred or a few hundred carefully selected words. Some are designed to capture development from the emergence of language through ages 3–4 years, while others are focused on very early development (like the English Words & Gestures form, designed for ages 8–18 months). These differences also reflect differences in goals for the developers of adaptations – for example, some focus on research assessment while others are designed for clinical screening.

While many words on the English-language checklist may easily translate to other languages, some may not be present in another language, and still others will be present but be less relevant within the same developmental time frame, e.g., cheese in Japanese or snow in Arabic. Conversely, additional words may be needed in the new language that were not included on the English-language vocabulary checklist, e.g., tortilla in Mexican Spanish. In all languages, though, the vocabulary checklists include a range of words that appear earlier and later in normal development, as well as a roughly similar proportion of words from different lexical classes, for example, nouns, verbs, adjectives, and so on. Taken individually then, each adapted instrument captures key trends in vocabulary development when scores are aggregated across all items.

Due to variation in language structure, the interests of the developers of CDI adaptations, and the target age ranges of the forms, the CDI instruments vary in structure across languages. Most adaptations of the WG include gestures as well as vocabulary comprehension and production, although this is not always the case. Further, while adaptations of the WS always include vocabulary production, not all of them also contain measures of grammar, for example, early use of closed-class morphology or combinatorial syntax. Cross-linguistic differences also make the structure and format of many parts of the grammar sections very different, and hence not always amenable to comparisons across languages. Finally, a few instruments included in our dataset are pure vocabulary checklists, with no other sections included.

In sum, CDIs are a useful tool for many languages, but the forms differ substantially between languages. When these differences obviously confound our analysis, we present the relevant control or comparison analyses (e.g., as in Chapter 11).

2.2.2 Our approach to comparison

The wide cross-linguistic adoption of the CDI provides an opportunity for cross-linguistic comparison but it also creates many challenges that are not present in datasets that are designed from the start for such comparisons. Differences in instruments and items as well as differences in samples and administration conditions all make it potentially quite problematic to compare scores and score distributions across forms. We discuss differences in instruments and items here and defer discussion of differences in samples and administration to Chapter 3.

Differences in length between CDI forms mean that comparisons of raw scores across instruments are inappropriate. Dividing raw scores by the total number of items on a form results in proportions, which are somewhat more comparable but still potentially misleading. A more comprehensive form with more items on it will yield lower proportions for children with the same vocabulary size. Despite this weakness, we typically use proportions for visualizing differences across forms as it is cumbersome to compare raw scores with different totals. More discussion of absolute and relative vocabulary size differences between instruments can be found in Chapter 5.
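A small worked example may make the problem with raw scores and proportions concrete. The sketch below assumes a deliberately simplified world in which words are ranked from easiest to hardest and a child knows exactly the 300 easiest ones. The two forms, their sizes, and the sampling scheme are hypothetical.

```python
# Hypothetical setup: words are ranked 1, 2, 3, ... from easiest to hardest,
# and the child knows exactly the 300 easiest words.
known_words = set(range(1, 301))

# Two hypothetical forms that sample the same ranking differently.
short_form = list(range(1, 401, 4))  # 100 items drawn from ranks 1-400
long_form = list(range(1, 801, 2))   # 400 items drawn from ranks 1-800

def score(form, known):
    """Return (raw score, proportion of items checked) for one form."""
    raw = sum(1 for item in form if item in known)
    return raw, raw / len(form)

short_raw, short_prop = score(short_form, known_words)
long_raw, long_prop = score(long_form, known_words)
print(short_raw, short_prop)  # 75 0.75
print(long_raw, long_prop)    # 150 0.375
```

The same child receives a higher raw score but a lower proportion on the longer, more comprehensive form, so neither number can be compared across forms without caution.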

Like other psychometric instruments, CDI instruments can also be normed, and many of the most popular forms are. In the standard norming process, the form is administered to a large, typically-developing sample so that percentile ranks can be computed. Following norming, the percentile of a particular raw score for both children in the norming sample and new administrations can be computed and used in place of the proportions or raw scores. These percentile ranks can be useful for clinical purposes, but they also complicate comparison across instruments because of potential differences in the norming populations. In addition, the Wordbank dataset includes normed and un-normed forms, and for the normed forms, we sometimes have access to both the norming dataset and other data but sometimes only have access to the norming data. For these reasons, we do not employ normative percentile ranks in our analyses.
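To illustrate the basic idea of a percentile rank, the sketch below computes the percentage of a norming sample scoring at or below a given raw score. The ten norming scores and the function name `percentile_rank` are invented for illustration. Published norms may use smoothing or model fitting that this sketch omits.

```python
from bisect import bisect_right

def percentile_rank(raw_score, norming_scores):
    """Percentage of norming-sample scores at or below raw_score.

    This uses the empirical distribution directly, which is one simple
    definition of a percentile rank among several in use.
    """
    ordered = sorted(norming_scores)
    at_or_below = bisect_right(ordered, raw_score)
    return 100.0 * at_or_below / len(ordered)

# Hypothetical production scores for ten children of the same age.
norming_sample = [5, 12, 20, 33, 41, 58, 77, 90, 120, 210]

print(percentile_rank(41, norming_sample))  # 50.0
```

Because the result depends entirely on who is in `norming_sample`, two forms normed on different populations can assign different percentiles to children of similar ability.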

As the preceding discussion shows, there are serious difficulties that crop up immediately in comparisons across instruments. We will grapple with these difficulties throughout the book, but we generally adopt two approaches that help us navigate this complexity.

Our first approach to cross-linguistic and cross-instrument data is to provide standardized analyses within each instrument and language, without assuming equivalence across words, instruments, or populations. Thus, we will typically investigate a particular phenomenon (say the “noun bias” or the “female advantage”) independently and in parallel for each of the instruments available to us.⁶ We can then – still with caution – analyze and compare the magnitude of this phenomenon across languages, having abstracted away from the specifics of each particular instrument. We sometimes colloquially refer to this approach as “every form an island,” meaning that each instrument is analyzed separately and only the analytic results at the highest level are compared.
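The “every form an island” logic can be sketched in a few lines: compute a summary of the phenomenon separately within each form, and compare only those summaries. The records, the form labels, and the use of a simple difference in mean proportions are all assumptions made for illustration. The analyses in later chapters are more involved.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (form, child's sex, proportion of items produced).
records = [
    ("Form A", "female", 0.50), ("Form A", "female", 0.60),
    ("Form A", "male", 0.40), ("Form A", "male", 0.50),
    ("Form B", "female", 0.30), ("Form B", "female", 0.40),
    ("Form B", "male", 0.30), ("Form B", "male", 0.30),
]

def female_advantage_by_form(records):
    """Compute the female-minus-male difference in means within each form."""
    by_form = defaultdict(lambda: {"female": [], "male": []})
    for form, sex, proportion in records:
        by_form[form][sex].append(proportion)
    # Scores are never pooled across forms; only the per-form effect is kept.
    return {
        form: mean(groups["female"]) - mean(groups["male"])
        for form, groups in by_form.items()
    }

effects = female_advantage_by_form(records)
for form, effect in sorted(effects.items()):
    print(form, round(effect, 2))
```

Only the values in `effects` are compared across forms, which keeps item-level and score-level differences between instruments out of the comparison.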

Our second approach recognizes the necessity to make cross-linguistic and cross-form comparisons at the level of particular words. Cross-linguistic conceptual comparisons are fraught, both philosophically (e.g., Quine 1960) and practically. We refer to the practical issue as the tortilla problem: in American English, we have the word bread, which translates to pan in Mexican Spanish. But the Spanish word tortilla takes some of the cultural role of bread in English; thus, bread has two reasonable translations.⁷ To facilitate (cautious) cross-linguistic comparison, we developed a set of rough-and-ready translation equivalents. We call these “unilemmas” (short for “universal lemmas”). A “lemma” is a canonical form of a word, typically used for gathering frequency counts across different morphological variants (e.g., walk is the lemma for walks, walked, and walking). Unilemmas are used for mapping distinct lexical forms across languages. Unilemmas enable a number of desirable analyses, but more practically, they also provide consistent glosses that make it easier for researchers to work in languages with which they are not familiar. For convenience, our unilemmas are written in English, but they could of course have been written in any other language as well. Further details on the unilemmas are given in Chapter 3.
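One way to picture unilemmas is as a lookup table from language-specific items to shared English glosses. The sketch below is a toy illustration only: the table contents, the language labels, and the function names are assumptions, and it does not reproduce how unilemmas are actually stored in Wordbank.

```python
from collections import defaultdict

# Hypothetical mapping from (language, item) to a unilemma.
ITEM_TO_UNILEMMA = {
    ("English (American)", "dog"): "dog",
    ("Spanish (Mexican)", "perro"): "dog",
    ("English (American)", "bread"): "bread",
    ("Spanish (Mexican)", "pan"): "bread",
    # In this toy table, tortilla has no counterpart on the English form.
    ("Spanish (Mexican)", "tortilla"): "tortilla",
}

def items_by_unilemma(mapping):
    """Group items under their unilemma: {unilemma: {language: item}}."""
    grouped = defaultdict(dict)
    for (language, item), unilemma in mapping.items():
        grouped[unilemma][language] = item
    return dict(grouped)

def shared_unilemmas(mapping, languages):
    """Return the unilemmas that have an item in every listed language."""
    grouped = items_by_unilemma(mapping)
    return sorted(
        unilemma for unilemma, by_language in grouped.items()
        if all(language in by_language for language in languages)
    )

print(shared_unilemmas(ITEM_TO_UNILEMMA, ["English (American)", "Spanish (Mexican)"]))
# ['bread', 'dog']
```

Word-level comparisons are then restricted to the unilemmas returned by `shared_unilemmas`, while items such as tortilla remain available for within-language analyses.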

Even with the care we used here to construct a robust set of translation equivalents, individual items are likely to be only roughly equivalent cross-linguistically, and may have significantly different referential scopes for children learning the different languages. That is, if a parent indicates that a child can produce the word dog in English and another parent indicates the translation equivalent in, for example, Spanish (perro), it may nevertheless be the case that these words are produced with different frequencies and in different contexts by children speaking the two languages.

2.3 Wordbank

2.3.1 History

To take advantage of the opportunity posed by the broad use of CDI instruments in the child language community – and in particular the widespread cross-linguistic adaptation of the CDI – in 2014 we began constructing Wordbank, an open repository for CDI data. Our inspiration for Wordbank came from two successful projects for sharing data on children’s language acquisition. The first is the Child Language Data Exchange System (CHILDES; MacWhinney 2000). A database of transcripts of children’s speech and speech to children, CHILDES has grown into a robust and important tool for the community, with many contributors and affiliated projects. The second is the Cross-Linguistic Lexical Norms site (CLEX; Jørgensen et al. 2010), which is closer in content to Wordbank, and effectively our precursor. CLEX archived normative data from a range of CDI adaptations across languages, allowing browsing of acquisition trajectories for individual items or age groups.

Wordbank initially built on CLEX, offering the same functionality but allowing flexible and interactive visualization and analysis, as well as direct database access and data download. In addition, Wordbank’s goal was always to extend beyond normative data by dynamically incorporating data from many different researchers and projects of varying sizes and scopes. While the resulting datasets in Wordbank are much more heterogeneous than if they were just based on norming samples alone, they are also larger and more representative than the individual norming datasets (in some cases), and available in languages where no norms exist (for others).

We began the Wordbank project – our first large-scale, data-aggregation project – with a relatively naive attitude. We thought “if you build it, they will come”: that contributors would flock to the opportunity to share their data with the world. We were unprepared for the challenges of contacting academics around the world, asking them to volunteer their time and hard-won data to an unknown cause, and then understanding the myriad formats and conventions represented in the data we eventually received. For the first couple of years, our data were largely co-extensive with those gathered by CLEX.

Fortunately, in the years since the Wordbank project began, attitudes towards data sharing have been shifting rapidly (in part as a result of work on replication and reproducibility, e.g. Open Science Collaboration 2015). In addition, the credibility of the Wordbank project has gradually grown, in part due to the support of the MacArthur-Bates CDI Advisory Board. And, as we received successively more data, our expertise in dealing with heterogeneous datasets has grown. Thus, the dataset has grown quickly in recent years. We hope that in the future, authors see contribution to Wordbank as an aspirational endpoint for future studies using CDI instruments.

2.3.2 Gaps

From the perspective of the study of child language, there are a number of notable omissions from the datasets represented in Wordbank and the analyses reported in this book. We discuss three of these below: our focus on typical development, monolinguals, and (for the most part) WEIRD populations.

First, our analyses here focus exclusively on typical development. The study of atypical language development is an important part of characterizing the mechanisms of acquisition; further, characterizing language development in these circumstances can have important applied benefits. Relevant work includes studies of language in developmental disorders (Tager-Flusberg et al. 2009; Eigsti et al. 2011), in cases of sensory deficits (Landau, Gleitman, and Landau 2009), and in cases of abnormal input (Curtiss 1977), among others. However, many of these studies have made use of dense observations of individual children, an approach that is fundamentally different from our large-scale, statistical approach here. While CDI-type instruments are increasingly being used with atypical populations (e.g., Charman et al. 2003; Luyster, Lopez, and Lord 2007), in practice these datasets still tend to be smaller, concentrated on English-speaking children, and difficult to access publicly. Thus, Wordbank does not currently archive sufficient data from atypical populations to justify inclusion of these analyses in the current book.

Second, the Wordbank dataset focuses on monolingual acquisition. CDI instruments were initially developed to provide normative measurements of variation within a single language. Since then, however, they have increasingly been used for comparison between monolingual and bilingual groups based on the administration of CDIs in both languages (e.g., Pearson, Fernandez, and Oller 1993; Hoff et al. 2012). These studies initially focused on specific bilingual populations (e.g., Spanish/English bilinguals). Recent studies have moved beyond this strategy and have begun to examine general trends across multiple bilingual pairings (e.g., Bilson et al. 2015; Floccia et al. 2018). Questions of bilingual acquisition are fascinating and important from both a theoretical and practical perspective. But there are practical obstacles to applying our approach to bilingual data that mean that this book does not consider the bilingual acquisition situation. Practically speaking, because most of the largest CDI datasets were generated from monolingual norming studies, the vast majority of our data are not bilingual. Further, the combinatorics of bilingualism mean that data on nearly all language pairs will be non-existent. For these reasons, our book focuses on monolingual acquisition, though we recognize this as a limitation that must be addressed by future work.

Finally, the sample of languages we include is limited by our access to data. We have made efforts to include any large CDI datasets whose existence we are aware of – including extensive outreach to CDI authors, especially by professional networking through the CDI Advisory Board. Despite these efforts, our dataset is limited both by the sample of languages in which such studies have been conducted and by international attitudes towards data sharing. Thus, although we do cover many languages around the world (see Chapter 3 for a map), these languages are skewed towards Europe and the United States, as well as towards WEIRD – western, educated, rich, industrialized, and democratic – populations (Henrich, Heine, and Norenzayan 2010).

In sum, while there are inherent limitations in comparing different instruments across languages – limitations that we return to again and again throughout the book – our dataset is the first that allows the exploration of data on variability and consistency in early language within and across such a large and diverse set of languages. As such, the availability of cross-linguistic CDI adaptations remains at the core of the analyses that we offer within Wordbank.

⁵ While there have been some efforts to estimate the child’s total vocabulary from CDI scores by Mayor and Plunkett (2011), the resulting estimator is calibrated based on a small handful of diary studies, and cannot easily be extended across forms or languages.
⁶ An exception to this approach is that we do sometimes interpolate words’ trajectories across matched instruments for the same language, e.g. the proportion of children who say the word cat on both Words & Gestures and Words & Sentences forms for American English; see Appendix C.
⁷ These translation issues go the other way as well: reloj translates to two distinct words (clock and watch) in English.