Chapter 8 Consistency in Early Vocabulary Composition

Which words do children learn first? In spite of tremendous individual variation in rate of development (see Chapter 5; Fenson et al. 1994; Hart and Risley 1995), the first words that children utter are reported to be quite consistent. Based on diary studies, a number of early examinations of vocabulary composition noted the similarities in children’s first words across languages (e.g., Clark 1973; Slobin 1970). This observation formed the basis for a number of theories of usage (including, e.g., Clark’s influential semantic feature hypothesis). While we return briefly to the question of why we see similarity across languages in the contents of early vocabulary, in this chapter we are primarily concerned with establishing the empirical observation with more certainty.

Our approach primarily follows the lead of a more recent systematic examination. Tardif et al. (2008) used CDI data to examine early vocabulary in English, Mandarin, and Cantonese. They found that babies’ first 10 words tended to be about important people in their life (mom, dad), social routines (hi, uh oh), animals (dog, duck), and foods (milk, banana). Here we attempt to generalize this analysis, asking more broadly whether words tend to be learned in the same order across languages.

One challenge is that the precise words that children learn in different languages are (of course) language-specific. Thus, we really want to know whether the concepts that are being talked about are the same – or at least similar. As detailed in Chapter 2, the items on each language’s form are adaptations and not translations: they are intended to capture the spirit of the items on the English form rather than to replicate them exactly. So conceptual mappings are approximate, rather than exact. Nonetheless, when these approximate translation equivalents appear on multiple forms we can look at the variability in how quickly they are acquired across languages. Thus, we assume for simplicity that dog, chien, and perro name (roughly) the same concept.

To estimate the similarity of each item’s trajectory, we use a single measure of its difficulty: age of acquisition (AoA) – the age at which 50% of children in each language are estimated to have acquired it (Appendix D). We analyzed consistency in both comprehension and production, with production ages of acquisition estimated by stitching across both forms. Consequently, we analyzed only the 29 languages for which data for both forms was available.
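
As a rough illustration of this definition (the actual estimation procedure is the one described in Appendix D and may well differ), the sketch below fits a logistic curve to the proportion of children reported to know a word at each age and reads off the age at which the curve crosses 50%. The function name `estimate_aoa`, the least-squares fit on the logit scale, and the toy data are all assumptions made for illustration.

```python
import math

def estimate_aoa(ages, proportions, eps=1e-3):
    """Age at which a fitted logistic curve crosses 0.5 (illustrative only).

    Fits logit(p) = intercept + slope * age by ordinary least squares.
    """
    # Clip proportions away from 0 and 1 so that the logit stays finite.
    logits = []
    for p in proportions:
        p = min(max(p, eps), 1 - eps)
        logits.append(math.log(p / (1 - p)))

    n = len(ages)
    mean_age = sum(ages) / n
    mean_logit = sum(logits) / n
    sxx = sum((a - mean_age) ** 2 for a in ages)
    sxy = sum((a - mean_age) * (l - mean_logit) for a, l in zip(ages, logits))
    slope = sxy / sxx
    intercept = mean_logit - slope * mean_age

    # logit(0.5) = 0, so the 50% point solves intercept + slope * age = 0.
    return -intercept / slope

# Hypothetical toy data: proportions generated from a logistic curve
# whose midpoint is 18 months.
ages = list(range(12, 25))
props = [1 / (1 + math.exp(-0.5 * (a - 18))) for a in ages]
aoa = estimate_aoa(ages, props)
```

On real data the proportions would come from counts of children at each age, so a binomial (logistic regression) fit would usually be preferable to least squares on logits; the simpler fit is used here only to keep the sketch short.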

In total, we estimated ages of acquisition for 945 words spread across the 29 languages. Unfortunately, not every word appeared on all forms. Figure 8.1 shows, for each number of forms, the proportion of words that appear on at least that many forms. For our consistency analysis, we considered only the 335 words that appeared in at least 8 of the 15 languages.

Figure 8.1: The proportion of words found on at least a given number of languages’ CDI forms (e.g., all words appear on at least one form, 111 words appear on at least 2 forms, and so on, with 34 words appearing on all 15 languages’ forms). The dotted line shows the cutoff value we chose (8).
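
The coverage curve and the cutoff described above can be sketched as follows. This is a minimal illustration, assuming each form is represented as a list of concept labels; the function names and the toy forms are hypothetical.

```python
from collections import Counter

def form_counts(forms):
    """Count, for each concept, the number of forms on which it appears."""
    counts = Counter()
    for concepts in forms.values():
        counts.update(set(concepts))  # set() guards against repeated items
    return counts

def proportion_at_least(forms):
    """For k = 1..number of forms, proportion of concepts on at least k forms."""
    counts = form_counts(forms)
    total = len(counts)
    return {
        k: sum(1 for n in counts.values() if n >= k) / total
        for k in range(1, len(forms) + 1)
    }

def coverage_filter(forms, min_forms):
    """Concepts that appear on at least `min_forms` forms."""
    return {c for c, n in form_counts(forms).items() if n >= min_forms}

# Hypothetical toy forms (not the real CDI item lists).
forms = {
    "English": ["dog", "ball", "mommy"],
    "French": ["dog", "mommy", "bread"],
    "Spanish": ["dog", "bread"],
}
curve = proportion_at_least(forms)
kept = coverage_filter(forms, min_forms=2)
```

With the chapter's data, the analogous call would use a cutoff of 8 forms.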

8.1 The first 10 words

Following Tardif et al. (2008), we begin by examining the first 10 words acquired by children across the 15 languages we measured (Tables 8.1 and 8.2). Similar words appeared in the top 10 across languages, especially in children’s first productions. These words consist primarily of important family members (mommy, daddy, grandma), social routines (hi, bye, peekaboo), and sounds (yum yum, vroom, woof woof).

Table 8.1: The 10 earliest words that children produce in each language.

| Croatian | Danish | English (American) | French (French) | French (Quebecois) | Hebrew | Italian | Kiswahili | Korean | Norwegian | Russian | Slovak | Spanish (Mexican) | Swedish | Turkish |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mommy | hi | mommy | daddy | mommy | mommy | mommy | mommy | mommy | vroom | meow | mommy | mommy | mommy | mommy |
| daddy | woof woof | daddy | mommy | daddy | yum yum | daddy | daddy | daddy | mommy | daddy | daddy | daddy | daddy | yum yum |
| grandma | thank you | ball | baby | no | grandma | woof woof | car | peekaboo | yum yum | woof woof | woof woof | water (beverage) | thank you | brother |
| bye | mommy | bye | bye | bye | vroom | grandma | cat | woof woof | hi | grandpa | grandma | yum yum | woof woof | woof woof |
| woof woof | no | hi | thank you | baby | grandpa | water (beverage) | meow | cracker | daddy | aunt | vroom | woof woof | hi | baby |
| baby | bye | no | bread | ball | daddy | hi | motorcycle | water (beverage) | bye | mommy | food | bread | peekaboo | vroom |
| no | daddy | dog | peekaboo | vroom | banana | grandpa | baby | baby | thank you | grandma | yum yum | no | drawer | bye |
| yes | vroom | baby | ball | sock | this | meow | bug | yes | woof woof | bye | bye | bye | meow | water (beverage) |
| grandpa | yes | woof woof | sock | peekaboo | bye | no | banana | ball | yes | cereal | dog | baby | moo | ball |
| aunt | food | banana | shoe | moo | car | shoe | baa baa | no | peekaboo | ball | car | yes | no | doll |

Table 8.2: The 10 earliest words that children comprehend in each language.

| Croatian | Danish | English (American) | French (French) | French (Quebecois) | Hebrew | Italian | Kiswahili | Korean | Norwegian | Russian | Slovak | Spanish (Mexican) | Swedish | Turkish |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| grandma | daddy | bottle | no | milk | yum yum | mommy | baa baa | food | child’s own name | yum yum | woof woof | yum yum | child’s own name | water (beverage) |
| mommy | child’s own name | daddy | mommy | mommy | cat | daddy | meow | daddy | mommy | meow | yum yum | water (beverage) | mommy | ball |
| bye | mommy | mommy | daddy | daddy | doll | peekaboo | car | mommy | daddy | cat | dog | milk | daddy | up |
| daddy | peekaboo | child’s own name | peekaboo | child’s own name | balloon | child’s own name | bug | peekaboo | bye | eye | car | mommy | peekaboo | yum yum |
| peekaboo | yum yum | bye | bye | bye | ball | hi | cat | no | hi | daddy | ball | daddy | bye | peekaboo |
| vroom | no | no | bath | peekaboo | head | woof woof | doll | dirty | peekaboo | grandma | food | bye | bath | mommy |
| grandpa | bye | peekaboo | hi | no | light (object) | dog | ball | bath | no | grandpa | pacifier | no | hi | daddy |
| cat | hi | hi | good night | bathtub | daddy | water (beverage) | milk | ball | yum yum | hi | mommy | woof woof | lamp | child’s own name |
| child’s own name | woof woof | dog | yes | bath | mommy | bottle | medicine | cracker | good night | woof woof | daddy | dog | grandma | bye |
| woof woof | ball | ball | meow | ball | grandpa | grandma | spoon | water (beverage) | grandma | dog | grandma | cookie | no | bottle |

Unfortunately, we cannot determine if the greater consistency found in early production is a real regularity about children’s lexical development, or is instead a measurement artifact arising from the greater difficulty of reporting on a child’s comprehension (see Chapter 4).

8.2 Global cross-linguistic similarity

Figure 8.2: Correlation between the average age of acquisition in comprehension and production for each measured word.

Despite these differences between comprehension and production, words that are reported to be acquired early in one measure are also generally reported to be acquired early in the other. Figure 8.2 shows the relationship between the mean age of acquisition in production and the mean age of acquisition in comprehension for each of these 335 words across the 15 languages. The correlation between the two measures was quite high: r = 0.8 (p < 0.001). Taken together, these analyses suggest that the rate at which the words on the CDI are acquired, and by inference the processes that underpin their acquisition, are highly similar across languages.
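
The comparison in Figure 8.2 amounts to averaging each word's age of acquisition over languages within each measure and then correlating the two sets of means. The sketch below shows one way to do this, assuming AoAs are stored as nested dictionaries (language, then concept); the data layout, function names, and toy values are assumptions rather than the analysis code behind the figure.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def mean_aoa(aoa_by_language):
    """Average each concept's AoA over the languages in which it was measured."""
    values = {}
    for concept_aoas in aoa_by_language.values():
        for concept, aoa in concept_aoas.items():
            values.setdefault(concept, []).append(aoa)
    return {c: sum(v) / len(v) for c, v in values.items()}

# Hypothetical AoAs in months (not real estimates).
production = {
    "English": {"mommy": 12, "dog": 16, "how": 30},
    "French": {"mommy": 14, "dog": 18, "how": 28},
}
comprehension = {
    "English": {"mommy": 8, "dog": 11, "how": 27},
    "French": {"mommy": 9, "dog": 13, "how": 25},
}
prod_mean = mean_aoa(production)
comp_mean = mean_aoa(comprehension)
concepts = sorted(set(prod_mean) & set(comp_mean))
r = pearson([prod_mean[c] for c in concepts], [comp_mean[c] for c in concepts])
```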

The source of this similarity is hard to pin down. One possibility is that the difficulty of learning a word is determined predominantly by the complexity of the concept denoted by that word, and thus that variability in linguistic (e.g., phonological and syntactic complexity) and cultural (e.g., styles of parental interaction with children) features plays a relatively small role in determining the difficulty of learning a word (Gentner and Boroditsky 2001). Alternatively, the primary driver of difficulty could be linguistic, but the dimensions of linguistic variability could be orthogonal to the difficulty of learning. For instance, verbs may be more difficult than nouns because they are relational, so that knowing nouns helps with learning verbs more than knowing verbs helps with learning nouns (Gleitman 1990). In this case, the linguistically relevant dimensions would be relatively invariant across languages (Snedeker, Geren, and Shafto 2007). (Finally, because the words on the CDI are not a random sample of words in each language, it could be that generalization from these analyses overestimates the degree of cross-linguistic similarity.)

In Chapter 10 we begin to take up these questions using predictive models. Prior to taking this step, however, we consider cross-linguistic ordering more holistically. In the remainder of the chapter, we address this problem from two directions: (1) Does similarity in order of acquisition vary with language-to-language similarity, and (2) Does similarity in order of acquisition change over development?

8.3 Acquisition similarity and linguistic similarity

Unfortunately, the 15 languages in our analyses are both a small and non-representative sample of the world’s languages, and thus we do not have sufficient power to detect typological features of language that might be responsible for differences in the similarity of acquisition across languages. Nonetheless, the languages do come from different language families, and do vary in their phylogenetic distance. We leverage this variability to ask whether the similarity between two languages is related to similarity in how quickly words for the same concepts are learned in those two languages.

Instead of correlating the average similarity of age of acquisition across all languages, we consider the pairwise similarities in the age of acquisition of each of the 335 words in each language. Figure 8.3 shows these pairwise correlations for both comprehension and production as matrices in which each cell shows a single pairwise correlation. These correlation matrices appear to contain a significant amount of structure, with languages that are from the same language family (e.g., Norwegian and Danish) showing higher correlations between the ages of acquisition for the same concepts. Perhaps unsurprisingly given the high average correlation between production and comprehension, pairwise correlations were nearly identical for production and comprehension (r = 0.98, p < 0.001). Figures 8.4 and 8.5 respectively show dendrograms produced by hierarchically clustering these pairwise correlations.
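
A pairwise analysis of this kind can be sketched as below: for every pair of languages, correlate the ages of acquisition of the concepts that both forms share. The nested-dictionary layout, the `min_shared` threshold, and the toy values are assumptions made for illustration.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def pairwise_aoa_correlations(aoa_by_language, min_shared=3):
    """Correlate AoAs over shared concepts for every pair of languages."""
    result = {}
    for a, b in combinations(sorted(aoa_by_language), 2):
        shared = sorted(set(aoa_by_language[a]) & set(aoa_by_language[b]))
        if len(shared) < min_shared:
            continue  # too few shared concepts for a meaningful correlation
        xs = [aoa_by_language[a][c] for c in shared]
        ys = [aoa_by_language[b][c] for c in shared]
        result[(a, b)] = pearson(xs, ys)
    return result

# Hypothetical AoAs in months (not real estimates).
aoa = {
    "Danish": {"mommy": 12, "dog": 15, "ball": 16, "how": 30},
    "Norwegian": {"mommy": 13, "dog": 15, "ball": 17, "how": 29},
    "Korean": {"mommy": 12, "dog": 20, "ball": 14, "how": 26},
}
corr = pairwise_aoa_correlations(aoa)
```

Each entry of `corr` corresponds to one cell of a matrix like those in Figure 8.3.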

Figure 8.3: Correlation matrices showing pairwise correlations in words’ age of acquisition. Languages that are more similar have more similar acquisition orders.

Figure 8.4: A hierarchical clustering of the similarity in the ages of words’ first production cross-linguistically.

Figure 8.5: A hierarchical clustering of the similarity in the ages of words’ first comprehension cross-linguistically.
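
To make the clustering step concrete, the sketch below runs a simple agglomerative clustering over pairwise correlations, using 1 - r as the distance. The text does not say which linkage rule was used for the dendrograms, so average linkage is an assumption here, as are the function name and the toy correlations.

```python
def average_linkage(labels, dist):
    """Agglomerative clustering with average linkage (illustrative choice).

    `dist` maps frozenset({a, b}) to the distance between leaves a and b.
    Returns the merges in order as (cluster_1, cluster_2, distance).
    """
    clusters = [(label,) for label in labels]
    merges = []

    def cluster_distance(c1, c2):
        # Average distance over all cross-cluster pairs of leaves.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical pairwise AoA correlations (not the chapter's values).
correlations = {
    ("Danish", "Norwegian"): 0.90,
    ("Italian", "Spanish"): 0.85,
    ("Danish", "Italian"): 0.60,
    ("Danish", "Spanish"): 0.58,
    ("Norwegian", "Italian"): 0.62,
    ("Norwegian", "Spanish"): 0.60,
}
dist = {frozenset(pair): 1 - r for pair, r in correlations.items()}
merges = average_linkage(["Danish", "Norwegian", "Italian", "Spanish"], dist)
```

In this toy example the two most correlated pairs merge first, which is the kind of within-family grouping a dendrogram would display.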

These dendrograms show high similarity within the North Germanic, Slavic, and Romance language families. Some relationships resist straightforward linguistic explanations (e.g., the relationship of Quebec French to other languages). These may be due to non-uniform sparsity of data across these languages, or may instead reflect interesting cultural or other sources of variability. Despite these cases, the structure of ages of acquisition appears to a high degree to reflect the structure of the languages that children learning these words speak. To confirm this observation quantitatively, we borrowed an established measure of linguistic similarity: the lexical similarity of words for the same meaning (Wichmann et al. 2010).

Using a set of 40 words for meanings common to all of the world’s languages, Holman et al. (2008) were able to use a string-edit distance metric to recover linguistic similarity estimates that correlated highly with geographic distance and also with several typological systems. This method is appealing for our purposes as it is relatively agnostic as to the processes of language contact and change that have produced modern-day languages and instead tracks the similarity of word forms themselves. The language distance measures produced by this method were reliably correlated with pairwise correlations in acquisition trajectories for both production (r = -0.44, p < 0.001) and comprehension (r = -0.41, p < 0.001); the correlations are negative because greater distance between two languages goes with lower similarity in their acquisition trajectories.

We also applied this same analysis to the words on the CDI themselves. For each pair of languages, we computed the average normalized Levenshtein (1966) distance between the words for each of the 335 common concepts in our analyses.18 This measure was even more highly correlated with pairwise acquisition trajectories than similarity computed using the 40 words identified by Holman et al. (2008), with relatively high correlations for both production (r = -0.57, p < 0.001) and comprehension (r = -0.52, p < 0.001).
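
The distance measure can be sketched as follows, using the normalization described in the footnote (edit distance divided by the length of the longer word). The averaging over shared concepts in `language_distance` and the dictionary layout are assumptions about how the per-language-pair score might be assembled.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, or substitutions from a to b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

def normalized_levenshtein(a, b):
    """Edit distance per character of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def language_distance(words_a, words_b):
    """Mean normalized distance over the concepts two word lists share."""
    shared = set(words_a) & set(words_b)
    total = sum(normalized_levenshtein(words_a[c], words_b[c]) for c in shared)
    return total / len(shared)

# The footnote's example: Italian "cane" and Norwegian "hund" for dog.
raw = levenshtein("cane", "hund")
per_char = normalized_levenshtein("cane", "hund")

# Small illustrative word lists (two concepts only).
italian = {"dog": "cane", "cat": "gatto"}
norwegian = {"dog": "hund", "cat": "katt"}
avg = language_distance(italian, norwegian)
```

Two words that share no characters at any aligned position, as happens across different scripts, receive a normalized distance of 1 under this measure.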

Because this analysis likely overestimates the dissimilarity of languages written in different scripts – as every word receives a normalized Levenshtein distance of 1 in this case – we replicated this analysis at the phonemic level. We used eSpeak to compute phonetic transcriptions of each word and repeated the same analysis on the distance between words’ phonetic units in the International Phonetic Alphabet (IPA; Decker and others 1999). These correlations between IPA distance and pairwise age of acquisition trajectories were again reliable, although slightly attenuated, for both production (r = -0.29, p = 0.009) and comprehension (r = -0.26, p = 0.020). The robustness of these correlations across a variety of methods suggests that in addition to the high degree of general cross-linguistic similarity in the order of acquisition of words, the dissimilarities between them likely reflect differences in the target languages being learned.

8.4 Consistency across development

In the next analysis, we ask whether similarities in ages of acquisition are constant over the course of acquisition, or whether the similarity across languages changes over development. If variability in acquisition trajectories across languages reflects variability in those languages, we might expect that children’s trajectories diverge over the course of language acquisition as the structure of their target language or their cultural milieu plays a stronger role in guiding which words are easy or important to learn. Our analysis of the first 10 words above shows striking similarity in the earliest words. Does this similarity decrease for the next 300 words?

In order to measure change in cross-linguistic consistency over development, we extend the age of acquisition correlation approach we have used throughout this chapter. For each concept that appeared in at least 8 languages, we computed its average age of acquisition across all languages in whose CDIs it appeared, in both comprehension and production. We then ordered these words from the earliest learned word on average (mommy) to the latest learned word (how). For each measure, we then computed the average cross-linguistic correlation in age of acquisition for each increasingly large set of words, from the first 5 words up to all 335 words. If the correlation increases over acquisition, we can infer that acquisition trajectories become more similar as more words are learned, that is, the hardest to learn words are learned more similarly across languages. In contrast, if the correlation decreases, we can infer that children start out learning similar concepts regardless of their native language, but that linguistic and cultural variability plays a greater role in the learning of later words.
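
The procedure in this paragraph can be sketched as below: order concepts by mean age of acquisition, then compute the average pairwise cross-linguistic correlation over the first k concepts for growing k. The data layout, the `min_shared` threshold, the handling of zero-variance cases, and the toy values are all assumptions for illustration.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation, or None if either sequence has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def mean_pairwise_correlation(aoa_by_language, concepts, min_shared=3):
    """Average AoA correlation over language pairs, restricted to `concepts`."""
    rs = []
    for a, b in combinations(sorted(aoa_by_language), 2):
        shared = [c for c in concepts
                  if c in aoa_by_language[a] and c in aoa_by_language[b]]
        if len(shared) < min_shared:
            continue
        r = pearson([aoa_by_language[a][c] for c in shared],
                    [aoa_by_language[b][c] for c in shared])
        if r is not None:
            rs.append(r)
    return sum(rs) / len(rs) if rs else None

def order_by_mean_aoa(aoa_by_language):
    """Concepts sorted from earliest to latest mean AoA across languages."""
    values = {}
    for concept_aoas in aoa_by_language.values():
        for concept, aoa in concept_aoas.items():
            values.setdefault(concept, []).append(aoa)
    means = {c: sum(v) / len(v) for c, v in values.items()}
    return sorted(means, key=means.get)

def cumulative_consistency(aoa_by_language, start=5):
    """(k, mean correlation over the first k concepts) for k = start..all."""
    ordered = order_by_mean_aoa(aoa_by_language)
    return [(k, mean_pairwise_correlation(aoa_by_language, ordered[:k]))
            for k in range(start, len(ordered) + 1)]

# Hypothetical AoAs in months for three unnamed languages (not real estimates).
concepts = ["mommy", "daddy", "hi", "dog", "ball", "shoe", "blue", "how"]
aoa = {
    "A": dict(zip(concepts, [10, 11, 12, 14, 15, 18, 24, 28])),
    "B": dict(zip(concepts, [11, 10, 13, 15, 14, 20, 22, 27])),
    "C": dict(zip(concepts, [10, 12, 11, 13, 16, 17, 26, 25])),
}
ordered = order_by_mean_aoa(aoa)
curve = cumulative_consistency(aoa, start=5)
```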

Figure 8.6: Cross-linguistic correlation in words’ ages of acquisition over the course of language development. Colored lines show empirical correlations; the gray area shows a 95 percent confidence interval for a randomly shuffled baseline. Especially in production, cross-linguistic similarity declines over the course of language development.

Figure 8.6 shows these correlations for both comprehension and production over the course of acquisition. In addition, the gray shaded region shows a 95% confidence interval for a random baseline in which the concepts were ordered randomly, rather than in average acquisition order. This baseline is important to control for changes in measurement error that arise from changing numbers of concepts in the correlation. For both comprehension and production, the trajectories are reliably above the shuffled baseline, and decrease over the course of acquisition. These results suggest that, indeed, there is significantly more similarity in the earliest learned words than in later learned words cross-linguistically, especially in production. This is exactly the pattern of results we would predict if language and culture produce more variability in the forms, frequencies, and contexts of use of later learned words.
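
A shuffled baseline of this kind can be sketched as follows: for a given set size k, repeatedly draw k concepts at random (equivalent to taking the first k concepts of a random ordering), recompute the average pairwise correlation, and take the central 95% of the resulting values. The number of shuffles, the seed, the simple percentile rule, and the toy data are assumptions; the chapter does not specify these details.

```python
import math
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation, or None if either sequence has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def mean_pairwise_correlation(aoa_by_language, concepts):
    """Average AoA correlation over all language pairs for `concepts`."""
    rs = []
    for a, b in combinations(sorted(aoa_by_language), 2):
        shared = [c for c in concepts
                  if c in aoa_by_language[a] and c in aoa_by_language[b]]
        if len(shared) >= 3:
            r = pearson([aoa_by_language[a][c] for c in shared],
                        [aoa_by_language[b][c] for c in shared])
            if r is not None:
                rs.append(r)
    return sum(rs) / len(rs) if rs else None

def shuffled_baseline(aoa_by_language, concepts, k, n_shuffles=100, seed=0):
    """Approximate 95% interval of the statistic for k randomly chosen concepts."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    stats = []
    for _ in range(n_shuffles):
        subset = rng.sample(concepts, k)
        r = mean_pairwise_correlation(aoa_by_language, subset)
        if r is not None:
            stats.append(r)
    stats.sort()
    # Simple percentile rule: index into the sorted values.
    lower = stats[int(0.025 * (len(stats) - 1))]
    upper = stats[int(0.975 * (len(stats) - 1))]
    return lower, upper

# Hypothetical AoAs in months for three unnamed languages (not real estimates).
concepts = ["mommy", "daddy", "hi", "dog", "ball", "shoe", "blue", "how"]
aoa = {
    "A": dict(zip(concepts, [10, 11, 12, 14, 15, 18, 24, 28])),
    "B": dict(zip(concepts, [11, 10, 13, 15, 14, 20, 22, 27])),
    "C": dict(zip(concepts, [10, 12, 11, 13, 16, 17, 26, 25])),
}
lower, upper = shuffled_baseline(aoa, concepts, k=5)
```

Comparing the empirical value for the first k concepts against this interval, for each k, gives the kind of contrast drawn between the colored lines and the gray band.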

8.5 Conclusions

Children in all languages and cultures learn language, but the languages they learn vary, and the cultures into which they are born may have quite different practices around both language and cognitive development. Nonetheless, the order in which children learn the words for specific concepts in their own language shows a great degree of cross-linguistic similarity, and dissimilarities are well explained by measurable linguistic dissimilarity. In addition, we found that the degree of cross-linguistic similarity decreases over the course of acquisition. While the first ten words acquired in each language were highly consistent, later words were substantially more different.

As we noted in the introduction, this general observation of similarity in early vocabulary has been taken as evidence for a wide variety of different theoretical claims. Our view is that these results indicate a shared core of concepts – e.g., social routines, important people, and some early foods and household animals – that are perhaps especially communicatively important independent of their linguistic realization.

We acknowledge, however, that there are likely many reasons for consistency of early words. One intriguing suggestion is that the phonological forms of words used with children actually evolve (or are adapted by parents) to be easier for children to say. One version of this hypothesis comes from Jakobson (1962), who hypothesized that parents adapt the word forms for mother and father to be easy for children to say or even to babble. Thus, the sound convergence across languages in the forms of words for these concepts (which is quite substantial) is due to convergence in what sounds are easy for children to say. This same mechanism could operate over other important early vocabulary as well, though note that this account presupposes some notion of cognitive importance!

Regardless of the precise reason for this phenomenon, the similarity in early vocabulary is undeniable (ratifying suggestions by Clark 1973 and others). As acquisition unfolds, however, the features that make languages (and cultures) different from one another play an ever increasing role in driving vocabulary development. In Chapter 9, we explore demographic differences in acquisition that help to explain why two children learning the same language may acquire different words at different rates.


  1. Levenshtein distance is a measure of the minimum number of insertions, deletions, or substitutions required to transform one string into another. For instance, the distance between the Italian and Norwegian words for dog (cane and hund) is 3. We computed this measure pairwise for all words, and then divided it by the number of characters in the longest word in order to get the edit distance per character (0.75 for cane and hund).