Accessing the Wordbank database

Mika Braginsky

2018-05-05

The wordbankr package allows you to access data in the Wordbank database from R. This vignette shows some examples of how to use the data loading functions and what the resulting data look like.

There are three different data views that you can pull out of Wordbank: by-administration, by-item, and administration-by-item. Additionally, you can get metadata about the sources and instruments underlying the data. Advanced functionality lets you get estimates of words’ age of acquisition and word mappings across languages.

Administrations

The get_administration_data() function gives by-administration information, either for a specific language and/or form or for all instruments.

get_administration_data(language = "English (American)", form = "WS")
## # A tibble: 5,520 x 15
##    data_id   age comprehension production language       form  birth_order
##      <dbl> <int>         <int>      <int> <chr>          <chr> <fct>      
##  1  129242    27           497        497 English (Amer… WS    Fourth     
##  2  129243    21           369        369 English (Amer… WS    Second     
##  3  129244    26           190        190 English (Amer… WS    Fourth     
##  4  129245    27           264        264 English (Amer… WS    Second     
##  5  129246    19           159        159 English (Amer… WS    Second     
##  6  129247    30           513        513 English (Amer… WS    Second     
##  7  129248    25           444        444 English (Amer… WS    Second     
##  8  129249    24           582        582 English (Amer… WS    Second     
##  9  129250    28           558        558 English (Amer… WS    Second     
## 10  129251    18             7          7 English (Amer… WS    Fourth     
## # ... with 5,510 more rows, and 8 more variables: ethnicity <fct>,
## #   sex <fct>, zygosity <chr>, norming <lgl>, mom_ed <fct>,
## #   longitudinal <lgl>, source_name <chr>, license <chr>
get_administration_data()
## # A tibble: 81,961 x 15
##    data_id   age comprehension production language form  birth_order
##      <dbl> <int>         <int>      <int> <chr>    <chr> <fct>      
##  1   29821    13           293         88 Croatian WG    <NA>       
##  2   29822    16           122         12 Croatian WG    <NA>       
##  3   29823     9             3          0 Croatian WG    <NA>       
##  4   29824    12             0          0 Croatian WG    <NA>       
##  5   29825    12            44          0 Croatian WG    <NA>       
##  6   29826     8            14          5 Croatian WG    <NA>       
##  7   29827     9             2          1 Croatian WG    <NA>       
##  8   29828    10            44          1 Croatian WG    <NA>       
##  9   29829    13           172         51 Croatian WG    <NA>       
## 10   29830    16           241         68 Croatian WG    <NA>       
## # ... with 81,951 more rows, and 8 more variables: ethnicity <fct>,
## #   sex <fct>, zygosity <chr>, norming <lgl>, mom_ed <fct>,
## #   longitudinal <lgl>, source_name <chr>, license <chr>
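
Because each administration row already carries comprehension and production totals, vocabulary-size summaries need nothing beyond this view. The sketch below computes the median production score per age in base R. Both the median_production_by_age() helper and the toy_admins data frame are made up for illustration; in practice you would pass the result of get_administration_data() instead of the toy data.

```r
# Hypothetical helper (not part of wordbankr): median production score per age.
# `admins` needs the `age` and `production` columns shown in the output above.
median_production_by_age <- function(admins) {
  out <- aggregate(production ~ age, data = admins, FUN = median)
  names(out)[names(out) == "production"] <- "median_production"
  out
}

# Toy stand-in for the result of get_administration_data(); values are invented.
toy_admins <- data.frame(
  data_id    = 1:6,
  age        = c(18, 18, 18, 24, 24, 24),
  production = c(7, 40, 95, 190, 369, 444)
)

vocab_by_age <- median_production_by_age(toy_admins)
vocab_by_age
```

The same pattern works for the comprehension column on forms that measure it.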

Items

The get_item_data() function gives by-item information, either for a specific language and/or form or for all instruments.

get_item_data(language = "Italian", form = "WG")
## # A tibble: 505 x 11
##    item_id definition      language form  type   category lexical_category
##    <chr>   <chr>           <chr>    <chr> <chr>  <chr>    <chr>           
##  1 item_1  Risponde quand… Italian  WG    first… <NA>     <NA>            
##  2 item_2  Risponde ad un… Italian  WG    first… <NA>     <NA>            
##  3 item_3  Reagisce ad un… Italian  WG    first… <NA>     <NA>            
##  4 item_4  Vuoi la pappa   Italian  WG    phras… <NA>     <NA>            
##  5 item_5  Hai sonno? Sei… Italian  WG    phras… <NA>     <NA>            
##  6 item_6  Vuoi bere?      Italian  WG    phras… <NA>     <NA>            
##  7 item_7  Stai attento    Italian  WG    phras… <NA>     <NA>            
##  8 item_8  Stai buono      Italian  WG    phras… <NA>     <NA>            
##  9 item_9  Batti le manine Italian  WG    phras… <NA>     <NA>            
## 10 item_10 Cambiamo il pa… Italian  WG    phras… <NA>     <NA>            
## # ... with 495 more rows, and 4 more variables: lexical_class <chr>,
## #   uni_lemma <chr>, complexity_category <chr>, num_item_id <dbl>
get_item_data()
## # A tibble: 31,811 x 11
##    item_id  definition language form  type  category     lexical_category
##    <chr>    <chr>      <chr>    <chr> <chr> <chr>        <chr>           
##  1 item_81  gristi     Croatian WG    word  action_words predicates      
##  2 item_264 puhati     Croatian WG    word  action_words predicates      
##  3 item_269 razbiti    Croatian WG    word  action_words predicates      
##  4 item_64  donijeti   Croatian WG    word  action_words predicates      
##  5 item_153 kupiti     Croatian WG    word  action_words predicates      
##  6 item_36  čistiti    Croatian WG    word  action_words predicates      
##  7 item_384 zatvoriti  Croatian WG    word  action_words predicates      
##  8 item_243 plakati    Croatian WG    word  action_words predicates      
##  9 item_246 plesati    Croatian WG    word  action_words predicates      
## 10 item_42  crtati     Croatian WG    word  action_words predicates      
## # ... with 31,801 more rows, and 4 more variables: lexical_class <chr>,
## #   uni_lemma <chr>, complexity_category <chr>, num_item_id <dbl>

Administrations x Items

If you are only looking at total vocabulary size, the administration data is all you need, since it already includes both productive and receptive vocabulary sizes. If you are looking at specific items or subsets of items, you need to load instrument data using the get_instrument_data() function. Pass it an instrument language and form, along with a vector of the items you want to extract (by item_id).

get_instrument_data(
  language = "English (American)",
  form = "WS",
  items = c("item_26", "item_46")
)
## # A tibble: 11,692 x 3
##    data_id value    num_item_id
##      <dbl> <chr>          <dbl>
##  1  129242 produces          26
##  2  129243 produces          26
##  3  129244 produces          26
##  4  129245 produces          26
##  5  129246 ""                26
##  6  129247 produces          26
##  7  129248 produces          26
##  8  129249 produces          26
##  9  129250 produces          26
## 10  129251 ""                26
## # ... with 11,682 more rows

By default, get_instrument_data() returns a data frame with columns for the administration’s data_id, the item’s num_item_id (numerical item_id), and the corresponding value. To include administration information, you can set the administrations argument to TRUE, or pass the result of get_administration_data() as administrations (which prevents the administration data from being loaded multiple times). Similarly, you can set the iteminfo argument to TRUE, or pass it the result of get_item_data().

Loading the data is fast if you need only a handful of items, but the time scales about linearly with the number of items, and can get quite slow if you need many or all of them. So, it’s a good idea to filter down to only the items you need before calling get_instrument_data().
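
One way to follow both pieces of advice, loading the administration and item tables only once and requesting only the items you need, is sketched below. load_category_data() is a hypothetical helper, not part of wordbankr. It relies only on the functions and arguments described in this section, and those argument names may differ in other versions of the package.

```r
# Hypothetical helper (not part of wordbankr). Calling it requires the
# wordbankr package and a connection to the Wordbank database.
load_category_data <- function(language, form, category) {
  # Load the per-administration and per-item tables once each.
  admins <- wordbankr::get_administration_data(language = language, form = form)
  items  <- wordbankr::get_item_data(language = language, form = form)

  # Narrow down to the category of interest before requesting instrument data.
  # which() drops rows whose category is NA.
  wanted <- items[which(items$category == category), ]

  # Reuse the tables loaded above instead of setting the arguments to TRUE.
  wordbankr::get_instrument_data(
    language = language,
    form = form,
    items = wanted$item_id,
    administrations = admins,
    iteminfo = items
  )
}
```

A call such as load_category_data("English (American)", "WS", "animals") would then request only the animal items, with administration and item columns attached.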

As an example, let’s say we want to look at the production of animal words on English Words & Sentences over age. First we get the items we want:

library(dplyr)  # provides %>% and filter()
animals <- get_item_data(language = "English (American)", form = "WS") %>%
  filter(category == "animals")

Then we get the instrument data for those items:

animal_data <- get_instrument_data(language = "English (American)",
                                   form = "WS",
                                   items = animals$item_id,
                                   administrations = TRUE)

Finally, we calculate how many animal words each child produces and the median number of animal words produced at each age:

animal_summary <- animal_data %>%
  mutate(produces = value == "produces") %>%
  group_by(age, data_id) %>%
  summarise(num_animals = sum(produces, na.rm = TRUE)) %>%
  group_by(age) %>%
  summarise(median_num_animals = median(num_animals, na.rm = TRUE))
  
library(ggplot2)
ggplot(animal_summary, aes(x = age, y = median_num_animals)) +
  geom_point() +
  labs(x = "Age (months)", y = "Median animal words produced")

Metadata

Instruments

The get_instruments() function gives information on all the CDI instruments in Wordbank.

get_instruments()
## # A tibble: 56 x 7
##    instrument_id language              form  age_min age_max has_grammar
##            <int> <chr>                 <chr>   <int>   <int>       <int>
##  1             1 British Sign Language WG          8      36           0
##  2             2 Cantonese             WS         16      30           0
##  3             3 Croatian              WG          8      16           0
##  4             4 Croatian              WS         16      30           0
##  5             5 Danish                WS         16      36           1
##  6             6 English (American)    WG          8      18           0
##  7             7 English (American)    WS         16      30           1
##  8             8 German                WS         18      30           0
##  9             9 Hebrew                WG         11      25           0
## 10            10 Hebrew                WS         25      36           1
## # ... with 46 more rows, and 1 more variable: unilemma_coverage <dbl>

Sources

The get_sources() function gives information on all the data sources in Wordbank, either for a specific language and/or form or for all instruments. If the admin_data argument is set to TRUE, the results will also include the number of administrations in the database from that source and the minimum and maximum ages of those administrations.

get_sources(form = "WG")
## # A tibble: 29 x 9
##    source_id name   dataset instrument_lang… instrument_form contributor  
##        <int> <chr>  <chr>   <chr>            <fct>           <chr>        
##  1         9 March… Norming English (Americ… Words & Gestur… Larry Fenson…
##  2        10 Byers  ""      English (Americ… Words & Gestur… Krista Byers…
##  3        11 Thal   13      English (Americ… Words & Gestur… Donna Thal, …
##  4        12 Thal   16      English (Americ… Words & Gestur… Donna Thal, …
##  5        14 March… Norming Spanish (Mexica… Words & Gestur… Donna Jackso…
##  6        18 Krist… ""      Norwegian        Words & Gestur… Hanne Simons…
##  7        19 Krist… longit… Norwegian        Words & Gestur… Hanne Simons…
##  8        20 CLEX   ""      Croatian         Words & Gestur… Melita Kovac…
##  9        24 CLEX   ""      Russian          Words & Gestur… Stella Ceytl…
## 10        26 CLEX   ""      Swedish          Words & Gestur… Mårten Eriks…
## # ... with 19 more rows, and 3 more variables: citation <chr>,
## #   longitudinal <lgl>, license <fct>
get_sources(language = "Spanish (Mexican)", admin_data = TRUE) %>%
  select(source_id, name, dataset, instrument_form, n_admins, age_min, age_max)
## # A tibble: 4 x 7
##   source_id name     dataset  instrument_form   n_admins age_min age_max
##       <int> <chr>    <chr>    <fct>                <int>   <int>   <int>
## 1        13 Marchman Norming  Words & Sentences     1094      15      30
## 2        14 Marchman Norming  Words & Gestures       778       8      19
## 3        65 Fernald  Outreach Words & Gestures        55      16      22
## 4        66 Fernald  Outreach Words & Sentences       80      18      38
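
Since admin_data = TRUE adds an n_admins column, per-form totals are a one-line aggregation. As a small illustration, the sketch below re-enters the four rows printed above by hand (so it runs without a database connection) and sums the administrations per form. The mexican_sources and admins_per_form names are illustrative only.

```r
# The four rows printed above, re-entered by hand for illustration.
mexican_sources <- data.frame(
  source_id       = c(13, 14, 65, 66),
  name            = c("Marchman", "Marchman", "Fernald", "Fernald"),
  instrument_form = c("Words & Sentences", "Words & Gestures",
                      "Words & Gestures", "Words & Sentences"),
  n_admins        = c(1094, 778, 55, 80)
)

# Total number of administrations per form, summed across sources.
admins_per_form <- aggregate(n_admins ~ instrument_form,
                             data = mexican_sources, FUN = sum)
admins_per_form
```

For these four sources that comes to 833 Words & Gestures and 1,174 Words & Sentences administrations.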

Advanced functionality: Age of acquisition

The fit_aoa() function computes estimates of items’ age of acquisition (AoA). It must be given a data frame returned by get_instrument_data(), with one row per administration x item combination and at least the columns age and num_item_id. It returns a data frame with one row per item and an aoa column containing the estimate, preserving any item-level columns in the input data. The AoA is estimated by computing the proportion of administrations for which the child understands/produces (measure) each word, smoothing the proportion using method, and taking the age at which the smoothed value is greater than proportion.

eng_ws_data <- get_instrument_data(language = "English (American)",
                                   form = "WS",
                                   items = c("item_1", "item_42"),
                                   administrations = TRUE,
                                   iteminfo = TRUE)
fit_aoa(eng_ws_data)
## # A tibble: 2 x 10
##   num_item_id   aoa item_id definition type  category lexical_category
##         <dbl> <dbl> <chr>   <chr>      <chr> <chr>    <chr>           
## 1           1    16 item_1  baa baa    word  sounds   other           
## 2          42    24 item_42 owl        word  animals  nouns           
## # ... with 3 more variables: lexical_class <chr>, uni_lemma <chr>,
## #   complexity_category <chr>
fit_aoa(eng_ws_data, measure = "understands", method = "glmrob", proportion = 0.7)
## # A tibble: 2 x 10
##   num_item_id   aoa item_id definition type  category lexical_category
##         <dbl> <dbl> <chr>   <chr>      <chr> <chr>    <chr>           
## 1           1    21 item_1  baa baa    word  sounds   other           
## 2          42    27 item_42 owl        word  animals  nouns           
## # ... with 3 more variables: lexical_class <chr>, uni_lemma <chr>,
## #   complexity_category <chr>
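
To make the three steps concrete (proportion per age, smoothing, threshold), here is a minimal base-R sketch for a single item on invented data. It is not the code that fit_aoa() runs: estimate_aoa() and the toy data are hypothetical, the smoother is assumed to be a plain logistic regression, the 0.5 threshold is an illustrative choice, and taking the youngest age above the threshold is also an assumption.

```r
# Sketch of the AoA logic for a single item; NOT the code fit_aoa() runs.
# `data` has one row per administration, with `age` in months and a logical
# `produces` column.
estimate_aoa <- function(data, proportion = 0.5) {
  # Step 1: proportion of administrations producing the word at each age.
  by_age <- data.frame(
    age  = sort(unique(data$age)),
    prop = as.vector(tapply(data$produces, data$age, mean)),
    n    = as.vector(tapply(data$produces, data$age, length))
  )
  # Step 2: smooth the proportions with a logistic regression on age
  # (assumed smoother), weighting each age by its number of administrations.
  model <- glm(prop ~ age, family = binomial, weights = n, data = by_age)
  ages <- seq(min(by_age$age), max(by_age$age))
  smoothed <- predict(model, newdata = data.frame(age = ages), type = "response")
  # Step 3: youngest age at which the smoothed curve is above `proportion`
  # (NA if it never gets there within the observed age range).
  above <- ages[smoothed > proportion]
  if (length(above) == 0) NA_real_ else min(above)
}

# Invented data: 10 children per month of age, more of them producing the
# word as age increases.
ages <- 16:30
n_producing <- c(0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 8, 9, 9, 10, 10)
toy <- data.frame(
  age      = rep(ages, each = 10),
  produces = unlist(lapply(n_producing,
                           function(k) rep(c(TRUE, FALSE), c(k, 10 - k))))
)

toy_aoa <- estimate_aoa(toy)
toy_aoa
```

Raising the proportion argument moves the estimate to an older age, which matches the direction of the change between the two fit_aoa() calls above (although that second call also changes the measure and the method).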

Advanced functionality: Cross-linguistic data

One of the item-level fields is uni_lemma (“universal lemma”), which is intended to be an approximate semantic mapping between words across the languages in Wordbank. The function get_crossling_items() simply gives all the available uni_lemma values.

get_crossling_items()
## # A tibble: 1,380 x 1
##    uni_lemma      
##    <chr>          
##  1 a              
##  2 a little       
##  3 a lot          
##  4 able           
##  5 about          
##  6 above          
##  7 after          
##  8 afternoon      
##  9 again          
## 10 air conditioner
## # ... with 1,370 more rows

The function get_crossling_data() takes a vector of uni_lemmas and returns a data frame of summary statistics for each item mapped to that uni_lemma in any language (on WG forms). Each row is a combination of item and age, and the columns indicate the number of children (n_children), means (comprehension, production), standard deviations (comprehension_sd, production_sd), and item-level fields.

get_crossling_data(uni_lemmas = c("hat", "nose")) %>%
  ungroup() %>%
  select(language, uni_lemma, definition, age, n_children, comprehension,
         production, comprehension_sd, production_sd) %>%
  arrange(uni_lemma)
## # A tibble: 381 x 9
##    language uni_lemma definition   age n_children comprehension production
##    <chr>    <chr>     <chr>      <int>      <int>         <dbl>      <dbl>
##  1 British… hat       hat            8          4         0          0    
##  2 British… hat       hat            9          4         0          0    
##  3 British… hat       hat           10          4         0          0    
##  4 British… hat       hat           11          6         0.167      0    
##  5 British… hat       hat           12          6         0          0    
##  6 British… hat       hat           13          6         0          0    
##  7 British… hat       hat           14          7         0.143      0    
##  8 British… hat       hat           15          6         0          0    
##  9 British… hat       hat           16          7         0.143      0.143
## 10 British… hat       hat           17          7         0.286      0.143
## # ... with 371 more rows, and 2 more variables: comprehension_sd <dbl>,
## #   production_sd <dbl>
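
Because each row is an item-by-age summary, simple cross-language comparisons follow directly from this table. The sketch below finds, for each language and uni_lemma, the youngest age at which mean production reaches a threshold. first_age_reaching() is a hypothetical helper, and toy_crossling is invented data with the same column names as the output above; in practice you would pass the result of get_crossling_data().

```r
# Hypothetical helper (not part of wordbankr): for each language and uni_lemma,
# the youngest age at which mean production reaches `threshold`.
first_age_reaching <- function(crossling, threshold = 0.5) {
  # which() drops rows with a missing production value.
  reached <- crossling[which(crossling$production >= threshold), ]
  if (nrow(reached) == 0) {
    return(data.frame(language = character(), uni_lemma = character(),
                      age = numeric()))
  }
  aggregate(age ~ language + uni_lemma, data = reached, FUN = min)
}

# Invented stand-in for the result of get_crossling_data().
toy_crossling <- data.frame(
  language   = rep(c("Toy language A", "Toy language B"), each = 5),
  uni_lemma  = "hat",
  age        = rep(12:16, times = 2),
  production = c(0.1, 0.3, 0.5, 0.6, 0.8,
                 0.0, 0.2, 0.4, 0.45, 0.7)
)

hat_ages <- first_age_reaching(toy_crossling)
hat_ages
```

Language and uni_lemma combinations that never reach the threshold are simply absent from the result.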