The wordbankr package allows you to access data in the Wordbank database from R. This vignette shows some examples of how to use the data loading functions and what the resulting data look like.

There are three different data views that you can pull out of Wordbank: by-administration, by-item, and administration-by-item. Additionally, you can get metadata about the datasets and instruments underlying the data. Advanced functionality lets you get estimates of words’ age of acquisition and word mappings across languages.

Administrations

The get_administration_data() function gives by-administration information, either for a specific language and/or form or for all instruments.

get_administration_data(language = "English (American)", form = "WS")
## # A tibble: 7,601 × 12
##    data_id date_of…¹   age compr…² produ…³ is_no…⁴ datas…⁵ datas…⁶ langu…⁷ form 
##      <dbl> <chr>     <int>   <int>   <int> <lgl>   <chr>   <chr>   <chr>   <chr>
##  1  245518 1996-11-…    28     497     497 TRUE    Marchm… Marchm… Englis… WS   
##  2  245519 1996-10-…    22     369     369 TRUE    Marchm… Marchm… Englis… WS   
##  3  245520 1996-10-…    26     190     190 TRUE    Marchm… Marchm… Englis… WS   
##  4  245521 1996-11-…    27     264     264 TRUE    Marchm… Marchm… Englis… WS   
##  5  245522 1996-10-…    19     159     159 TRUE    Marchm… Marchm… Englis… WS   
##  6  245523 1996-10-…    30     513     513 TRUE    Marchm… Marchm… Englis… WS   
##  7  245524 1996-10-…    25     444     444 TRUE    Marchm… Marchm… Englis… WS   
##  8  245525 1996-11-…    24     582     582 TRUE    Marchm… Marchm… Englis… WS   
##  9  245526 1996-10-…    28     558     558 TRUE    Marchm… Marchm… Englis… WS   
## 10  245527 1991-10-…    18       7       7 TRUE    Marchm… Marchm… Englis… WS   
## # … with 7,591 more rows, 2 more variables: form_type <chr>, child_id <int>,
## #   and abbreviated variable names ¹​date_of_test, ²​comprehension, ³​production,
## #   ⁴​is_norming, ⁵​dataset_name, ⁶​dataset_origin_name, ⁷​language
get_administration_data()
## # A tibble: 90,897 × 12
##    data_id date_of…¹   age compr…² produ…³ is_no…⁴ datas…⁵ datas…⁶ langu…⁷ form 
##      <dbl> <chr>     <int>   <int>   <int> <lgl>   <chr>   <chr>   <chr>   <chr>
##  1   26372 NA           13     293      88 TRUE    CLEX    CLEX__… Croati… WG   
##  2   26373 NA           16     122      12 TRUE    CLEX    CLEX__… Croati… WG   
##  3   26374 NA            9       3       0 TRUE    CLEX    CLEX__… Croati… WG   
##  4   26375 NA           12       0       0 TRUE    CLEX    CLEX__… Croati… WG   
##  5   26376 NA           12      44       0 TRUE    CLEX    CLEX__… Croati… WG   
##  6   26377 NA            8      14       5 TRUE    CLEX    CLEX__… Croati… WG   
##  7   26378 NA            9       2       1 TRUE    CLEX    CLEX__… Croati… WG   
##  8   26379 NA           10      44       1 TRUE    CLEX    CLEX__… Croati… WG   
##  9   26380 NA           13     172      51 TRUE    CLEX    CLEX__… Croati… WG   
## 10   26381 NA           16     241      68 TRUE    CLEX    CLEX__… Croati… WG   
## # … with 90,887 more rows, 2 more variables: form_type <chr>, child_id <int>,
## #   and abbreviated variable names ¹​date_of_test, ²​comprehension, ³​production,
## #   ⁴​is_norming, ⁵​dataset_name, ⁶​dataset_origin_name, ⁷​language

Items

The get_item_data() function gives by-item information, either for a specific language and/or form or for all instruments.

get_item_data(language = "Italian", form = "WG")
## # A tibble: 505 × 11
##    item_id langu…¹ form  form_…² item_…³ categ…⁴ item_…⁵ engli…⁶ uni_l…⁷ lexic…⁸
##    <chr>   <chr>   <chr> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
##  1 item_1  Italian WG    WG      first_… NA      Rispon… Respon… NA      NA     
##  2 item_2  Italian WG    WG      first_… NA      Rispon… Respon… NA      NA     
##  3 item_3  Italian WG    WG      first_… NA      Reagis… Look f… NA      NA     
##  4 item_4  Italian WG    WG      phrases NA      Vuoi l… Are yo… NA      NA     
##  5 item_5  Italian WG    WG      phrases NA      Hai so… Are yo… NA      NA     
##  6 item_6  Italian WG    WG      phrases NA      Vuoi b… Do you… NA      NA     
##  7 item_7  Italian WG    WG      phrases NA      Stai a… Be car… NA      NA     
##  8 item_8  Italian WG    WG      phrases NA      Stai b… NA      NA      NA     
##  9 item_9  Italian WG    WG      phrases NA      Batti … Clap y… NA      NA     
## 10 item_10 Italian WG    WG      phrases NA      Cambia… Change… NA      NA     
## # … with 495 more rows, 1 more variable: complexity_category <chr>, and
## #   abbreviated variable names ¹​language, ²​form_type, ³​item_kind, ⁴​category,
## #   ⁵​item_definition, ⁶​english_gloss, ⁷​uni_lemma, ⁸​lexical_category
get_item_data()
## # A tibble: 40,959 × 11
##    item_id langu…¹ form  form_…² item_…³ categ…⁴ item_…⁵ engli…⁶ uni_l…⁷ lexic…⁸
##    <chr>   <chr>   <chr> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
##  1 item_1  Britis… WG    WG      phrases NA      be car… be car… NA      NA     
##  2 item_2  Britis… WG    WG      phrases NA      bring … bring … NA      NA     
##  3 item_3  Britis… WG    WG      phrases NA      change… change… NA      NA     
##  4 item_4  Britis… WG    WG      phrases NA      come h… come h… NA      NA     
##  5 item_5  Britis… WG    WG      phrases NA      daddy/… daddy/… NA      NA     
##  6 item_6  Britis… WG    WG      phrases NA      dontto… dontto… NA      NA     
##  7 item_7  Britis… WG    WG      phrases NA      finish  finish  NA      NA     
##  8 item_8  Britis… WG    WG      phrases NA      get up  get up  NA      NA     
##  9 item_9  Britis… WG    WG      phrases NA      give m… give m… NA      NA     
## 10 item_10 Britis… WG    WG      phrases NA      give m… give m… NA      NA     
## # … with 40,949 more rows, 1 more variable: complexity_category <chr>, and
## #   abbreviated variable names ¹​language, ²​form_type, ³​item_kind, ⁴​category,
## #   ⁵​item_definition, ⁶​english_gloss, ⁷​uni_lemma, ⁸​lexical_category

Administrations x Items

If you are only looking at total vocabulary size, the administration data is all you need, since it already includes both productive and receptive vocabulary sizes. If you are looking at specific items or subsets of items, you need to load instrument data using the get_instrument_data() function. Pass it an instrument language and form, along with a vector of the items you want to extract (by item_id).

get_instrument_data(
  language = "English (American)",
  form = "WS",
  items = c("item_26", "item_46")
)
## # A tibble: 17,098 × 5
##    data_id item_id value      produces understands
##      <dbl> <chr>   <chr>      <lgl>    <lgl>      
##  1  245518 item_26 "produces" TRUE     NA         
##  2  245519 item_26 "produces" TRUE     NA         
##  3  245520 item_26 "produces" TRUE     NA         
##  4  245521 item_26 "produces" TRUE     NA         
##  5  245522 item_26 ""         FALSE    NA         
##  6  245523 item_26 "produces" TRUE     NA         
##  7  245524 item_26 "produces" TRUE     NA         
##  8  245525 item_26 "produces" TRUE     NA         
##  9  245526 item_26 "produces" TRUE     NA         
## 10  245527 item_26 ""         FALSE    NA         
## # … with 17,088 more rows

By default get_instrument_data() returns a data frame with columns for the administration’s data_id, the item’s item_id, and the corresponding value, along with the logical columns produces and understands derived from it. To include administration information, you can set the administration_info argument to TRUE, or pass the result of get_administration_data() as administration_info (that way you can prevent the administration data from being loaded multiple times). Similarly, you can set the item_info argument to TRUE, or pass it the result of get_item_data().
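Conceptually, attaching administration information to the instrument data amounts to a join on the shared data_id column. The following is only a minimal sketch on made-up toy data, not the package's implementation; the data frames instrument_toy and admins_toy are hypothetical stand-ins for the two tables.

```r
# Toy stand-ins (made-up values) for instrument data and administration data.
instrument_toy <- data.frame(
  data_id  = c(1, 1, 2, 2),
  item_id  = c("item_26", "item_46", "item_26", "item_46"),
  produces = c(TRUE, FALSE, TRUE, TRUE)
)
admins_toy <- data.frame(
  data_id = c(1, 2),
  age     = c(19L, 28L)
)

# Attach each administration's age to its item-level rows via the shared key.
joined_toy <- merge(instrument_toy, admins_toy, by = "data_id")
joined_toy
```

Each item-level row now carries the age of the administration it belongs to, which is what makes by-age summaries of specific items possible.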

Loading the data is fast if you need only a handful of items, but the time scales about linearly with the number of items, and can get quite slow if you need many or all of them. So, it’s a good idea to filter down to only the items you need before calling get_instrument_data().

As an example, let’s say we want to look at the production of animal words on English Words & Sentences over age. First we get the items we want:

library(dplyr)  # provides %>% and filter()
animals <- get_item_data(language = "English (American)", form = "WS") %>%
  filter(category == "animals")

Then we get the instrument data for those items:

animal_data <- get_instrument_data(language = "English (American)",
                                   form = "WS",
                                   items = animals$item_id,
                                   administration_info = TRUE,
                                   item_info = TRUE)

Finally, we calculate how many animal words each child produces and the median number of animal words for each age bin:

# Count the animal words each child produces, then take the median per age.
animal_summary <- animal_data %>%
  group_by(age, data_id) %>%
  summarise(num_animals = sum(produces, na.rm = TRUE)) %>%
  group_by(age) %>%
  summarise(median_num_animals = median(num_animals, na.rm = TRUE))

library(ggplot2)
ggplot(animal_summary, aes(x = age, y = median_num_animals)) +
  geom_point() +
  labs(x = "Age (months)", y = "Median number of animal words produced")
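To make the two-step aggregation concrete, here is a small base R sketch on made-up data (the toy data frame and its values are purely illustrative). It first counts words per child and then takes the median of those counts within each age.

```r
# Toy administration-by-item data: three children, three items each.
toy <- data.frame(
  data_id  = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  age      = c(20, 20, 20, 20, 20, 20, 24, 24, 24),
  produces = c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
)

# Step 1: number of words produced by each child (one row per child).
per_child <- aggregate(produces ~ age + data_id, data = toy, FUN = sum)
names(per_child)[names(per_child) == "produces"] <- "num_animals"

# Step 2: median of the per-child counts within each age.
per_age <- aggregate(num_animals ~ age, data = per_child, FUN = median)
names(per_age)[names(per_age) == "num_animals"] <- "median_num_animals"
per_age
```

Grouping by both age and data_id in the first step matters: taking the median directly over item-level rows would summarise individual TRUE/FALSE responses rather than per-child vocabulary counts.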

Metadata

Instruments

The get_instruments() function gives information on all the CDI instruments in Wordbank.

get_instruments()
## # A tibble: 78 × 8
##    instrument_id language          form  form_…¹ age_min age_max has_g…² unile…³
##            <int> <chr>             <chr> <chr>     <int>   <int>   <int>   <dbl>
##  1             1 British Sign Lan… WG    WG            8      36       0       1
##  2             2 Cantonese         WS    WS           16      30       0       1
##  3             3 Croatian          WG    WG            8      16       0       1
##  4             4 Croatian          WS    WS           16      30       0       1
##  5             5 Danish            WG    WG            8      20       0       1
##  6             6 Danish            WS    WS           16      36       1       1
##  7             7 English (America… WG    WG            8      18       0       1
##  8             8 English (America… WS    WS           16      30       1       1
##  9             9 French (Quebecoi… WG    WG            8      16       0       1
## 10            10 French (Quebecoi… WS    WS           16      30       1       1
## # … with 68 more rows, and abbreviated variable names ¹​form_type, ²​has_grammar,
## #   ³​unilemma_coverage

Datasets

The get_datasets() function gives information on all the datasets in Wordbank, either for a specific language and/or form or for all instruments. If the admin_data argument is set to TRUE, the results will also include the number of administrations in the database from that dataset.

get_datasets(form = "WG")
## # A tibble: 37 × 10
##    datas…¹ datas…² datas…³ contr…⁴ citat…⁵ license longi…⁶ langu…⁷ form  form_…⁸
##      <int> <chr>   <chr>   <chr>   <chr>   <chr>   <lgl>   <chr>   <chr> <chr>  
##  1       5 Marchm… Marchm… Larry … "Fenso… CC-BY   FALSE   Englis… WG    WG     
##  2       6 Byers   Byers_… Krista… ""      CC-BY   FALSE   Englis… WG    WG     
##  3       7 Thal    Thal    Donna … "Thal,… CC-BY   TRUE    Englis… WG    WG     
##  4       9 Marchm… Marchm… Donna … "Jacks… CC-BY   FALSE   Spanis… WG    WG     
##  5      12 Kristo… Kristo… Hanne … "Simon… CC-BY   TRUE    Norweg… WG    WG     
##  6      13 CLEX    CLEX__… Melita… "Kovac… CC-BY   FALSE   Croati… WG    WG     
##  7      17 CLEX    CLEX__… Stella… "Е.А.В… CC-BY   FALSE   Russian WG    WG     
##  8      19 CLEX    CLEX__… Mårten… "Eriks… CC-BY   FALSE   Swedish WG    WG     
##  9      21 CLEX    CLEX__… Aylin … "Acarl… CC-BY   FALSE   Turkish WG    WG     
## 10      23 Shalev  Shalev… Hila G… "Gendl… CC-BY   FALSE   Hebrew  WG    WG     
## # … with 27 more rows, and abbreviated variable names ¹​dataset_id,
## #   ²​dataset_name, ³​dataset_origin_name, ⁴​contributor, ⁵​citation,
## #   ⁶​longitudinal, ⁷​language, ⁸​form_type
get_datasets(language = "Spanish (Mexican)", admin_data = TRUE)
## # A tibble: 6 × 11
##   datase…¹ datas…² datas…³ contr…⁴ citat…⁵ license longi…⁶ langu…⁷ form  form_…⁸
##      <int> <chr>   <chr>   <chr>   <chr>   <chr>   <lgl>   <chr>   <chr> <chr>  
## 1        8 Marchm… Marchm… Donna … Marchm… CC-BY   FALSE   Spanis… WS    WS     
## 2        9 Marchm… Marchm… Donna … Jackso… CC-BY   FALSE   Spanis… WG    WG     
## 3       55 Fernald Fernal… Anne F… Weisle… CC-BY   TRUE    Spanis… WG    WG     
## 4       56 Fernald Fernal… Anne F… Weisle… CC-BY   TRUE    Spanis… WS    WS     
## 5       76 Marchm… Marchm… Donna … Jackso… CC-BY   FALSE   Spanis… WS    WS     
## 6       87 Hoff    Hoff_E… Hoff, E Hoff, … CC-BY   TRUE    Spanis… WS    WS     
## # … with 1 more variable: n_admins <dbl>, and abbreviated variable names
## #   ¹​dataset_id, ²​dataset_name, ³​dataset_origin_name, ⁴​contributor, ⁵​citation,
## #   ⁶​longitudinal, ⁷​language, ⁸​form_type

Advanced functionality: Age of acquisition

The fit_aoa() function computes estimates of items’ age of acquisition (AoA). It needs to be provided with a data frame returned by get_instrument_data(), with one row per administration x item combination and at minimum the columns age and item_id. It returns a data frame with one row per item and an aoa column with the estimate, preserving any item-level columns in the input data. The AoA is estimated by computing the proportion of administrations for which the child understands or produces each word (set by the measure argument), smoothing the proportions using the model given by method, and taking the age at which the smoothed value is greater than proportion.

fit_aoa(animal_data)
## # A tibble: 43 × 8
##      aoa item_id item_kind item_definition category lexical_ca…¹ uni_l…² compl…³
##    <dbl> <chr>   <chr>     <chr>           <chr>    <chr>        <chr>   <chr>  
##  1    26 item_13 word      alligator       animals  nouns        alliga… NA     
##  2    25 item_14 word      animal          animals  nouns        animal  NA     
##  3    26 item_15 word      ant             animals  nouns        ant     NA     
##  4    20 item_16 word      bear            animals  nouns        bear    NA     
##  5    22 item_17 word      bee             animals  nouns        bee     NA     
##  6    18 item_18 word      bird            animals  nouns        bird    NA     
##  7    23 item_19 word      bug             animals  nouns        bug     NA     
##  8    22 item_20 word      bunny           animals  nouns        bunny   NA     
##  9    24 item_21 word      butterfly       animals  nouns        butter… NA     
## 10    19 item_22 word      cat             animals  nouns        cat     NA     
## # … with 33 more rows, and abbreviated variable names ¹​lexical_category,
## #   ²​uni_lemma, ³​complexity_category
fit_aoa(animal_data, method = "glmrob", proportion = 1/3)
## # A tibble: 43 × 8
##      aoa item_id item_kind item_definition category lexical_ca…¹ uni_l…² compl…³
##    <dbl> <chr>   <chr>     <chr>           <chr>    <chr>        <chr>   <chr>  
##  1    24 item_13 word      alligator       animals  nouns        alliga… NA     
##  2    23 item_14 word      animal          animals  nouns        animal  NA     
##  3    24 item_15 word      ant             animals  nouns        ant     NA     
##  4    18 item_16 word      bear            animals  nouns        bear    NA     
##  5    19 item_17 word      bee             animals  nouns        bee     NA     
##  6    NA item_18 word      bird            animals  nouns        bird    NA     
##  7    20 item_19 word      bug             animals  nouns        bug     NA     
##  8    19 item_20 word      bunny           animals  nouns        bunny   NA     
##  9    22 item_21 word      butterfly       animals  nouns        butter… NA     
## 10    NA item_22 word      cat             animals  nouns        cat     NA     
## # … with 33 more rows, and abbreviated variable names ¹​lexical_category,
## #   ²​uni_lemma, ³​complexity_category
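The estimation procedure described in this section can be sketched for a single item in base R. This is an illustrative approximation under stated assumptions, not the code fit_aoa() actually runs: the counts in toy are made up, logistic regression via glm() stands in for the smoothing step, and edge cases (such as a curve that never crosses the threshold) may be handled differently by the package.

```r
# Toy per-age counts for one hypothetical item (made-up values):
# at each age there are 10 administrations, n_produces of which report the word.
toy <- data.frame(
  age        = 16:30,
  n_produces = c(0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 8, 9, 9, 10, 10),
  n_total    = 10
)

# Step 1: proportion of administrations producing the word at each age.
toy$prop <- toy$n_produces / toy$n_total

# Step 2: smooth the proportions with a logistic regression on age.
fit <- glm(cbind(n_produces, n_total - n_produces) ~ age,
           family = binomial, data = toy)
toy$smoothed <- predict(fit, newdata = toy, type = "response")

# Step 3: youngest age at which the smoothed curve exceeds the threshold.
proportion <- 0.5
above <- toy$age[toy$smoothed > proportion]
toy_aoa <- if (length(above) > 0) min(above) else NA_integer_
toy_aoa
```

Lowering proportion moves the estimate to a younger age, since the smoothed curve crosses a lower threshold earlier.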

Advanced functionality: Cross-linguistic data

One of the item-level fields is uni_lemma (“universal lemma”), which is intended to be an approximate semantic mapping between words across the languages in Wordbank. The function get_crossling_items() simply gives all the available uni_lemma values.

get_crossling_items()
## # A tibble: 2,552 × 2
##       id uni_lemma                
##    <int> <chr>                    
##  1  1739 (hair)brush              
##  2  1552 (play)pen                
##  3  1494 (sheep)                  
##  4  1783 (to be in) pain          
##  5  1777 (to be) hungry           
##  6  1775 (to be) thirsty          
##  7  1769 (to have) breakfast      
##  8  1272 [possessive]             
##  9  1593 [to splash in the water?]
## 10  1951 1PL                      
## # … with 2,542 more rows

The function get_crossling_data() takes a vector of uni_lemmas and returns a data frame of summary statistics for each item mapped to that uni_lemma in any language (on WG forms). Each row is a combination of item and age, and the columns indicate the number of children (n_children), means (comprehension, production), standard deviations (comprehension_sd, production_sd), and item-level fields.

get_crossling_data(uni_lemmas = c("hat", "nose")) %>%
  select(language, uni_lemma, item_definition, age, n_children, comprehension,
         production, comprehension_sd, production_sd) %>%
  arrange(uni_lemma)
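To illustrate the shape of these summary rows, the sketch below computes a count, mean, and standard deviation per item and age from made-up child-level responses. It is only an assumption about how such statistics could be derived; the toy data and the use of the sample standard deviation are illustrative, and Wordbank may compute its values differently.

```r
# Toy child-level responses for one item at two ages (made-up values).
toy <- data.frame(
  item_definition = rep("hat", 6),
  age             = c(12, 12, 12, 13, 13, 13),
  produces        = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
)

# One summary row per item x age combination.
groups <- split(toy, list(toy$item_definition, toy$age), drop = TRUE)
summary_toy <- do.call(rbind, lapply(groups, function(g) {
  data.frame(
    item_definition = g$item_definition[1],
    age             = g$age[1],
    n_children      = nrow(g),
    production      = mean(g$produces),  # proportion of children producing
    production_sd   = sd(g$produces)     # sample standard deviation
  )
}))
rownames(summary_toy) <- NULL
summary_toy
```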