The wordbankr package allows you to access data in the Wordbank database from R. This vignette shows some examples of how to use the data loading functions and what the resulting data look like.
There are three different data views that you can pull out of Wordbank: by-administration, by-item, and administration-by-item. Additionally, you can get metadata about the sources and instruments underlying the data. Advanced functionality lets you get estimates of words’ age of acquisition and word mappings across languages.
The get_administration_data() function gives by-administration information, either for a specific language and/or form or for all instruments.
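The calls that produced the two outputs below are not shown. A plausible reconstruction is sketched here, assuming the language and form arguments work as described above. The choice of "English (American)" and "WS" for the first output is an inference: its data_id values match those returned later for that instrument.

```r
library(wordbankr)

# Assumed call for the first output: a single language and form.
eng_ws_admins <- get_administration_data(language = "English (American)",
                                         form = "WS")

# Assumed call for the second output: no arguments, so all instruments.
all_admins <- get_administration_data()
```

Both calls query the Wordbank database, so they need a network connection, and row counts may differ from the ones shown as the database changes.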
## # A tibble: 5,520 x 15
##    data_id   age comprehension production language form  birth_order
##      <dbl> <int>         <int>      <int> <chr>    <chr> <fct>
##  1  129242    27           497        497 English… WS    Fourth
##  2  129243    21           369        369 English… WS    Second
##  3  129244    26           190        190 English… WS    Fourth
##  4  129245    27           264        264 English… WS    Second
##  5  129246    19           159        159 English… WS    Second
##  6  129247    30           513        513 English… WS    Second
##  7  129248    25           444        444 English… WS    Second
##  8  129249    24           582        582 English… WS    Second
##  9  129250    28           558        558 English… WS    Second
## 10  129251    18             7          7 English… WS    Fourth
## # … with 5,510 more rows, and 8 more variables: ethnicity <fct>,
## #   sex <fct>, zygosity <chr>, norming <lgl>, mom_ed <fct>,
## #   longitudinal <lgl>, source_name <chr>, license <chr>
## # A tibble: 82,055 x 15
##    data_id   age comprehension production language form  birth_order
##      <dbl> <int>         <int>      <int> <chr>    <chr> <fct>
##  1   29821    13           293         88 Croatian WG    <NA>
##  2   29822    16           122         12 Croatian WG    <NA>
##  3   29823     9             3          0 Croatian WG    <NA>
##  4   29824    12             0          0 Croatian WG    <NA>
##  5   29825    12            44          0 Croatian WG    <NA>
##  6   29826     8            14          5 Croatian WG    <NA>
##  7   29827     9             2          1 Croatian WG    <NA>
##  8   29828    10            44          1 Croatian WG    <NA>
##  9   29829    13           172         51 Croatian WG    <NA>
## 10   29830    16           241         68 Croatian WG    <NA>
## # … with 82,045 more rows, and 8 more variables: ethnicity <fct>,
## #   sex <fct>, zygosity <chr>, norming <lgl>, mom_ed <fct>,
## #   longitudinal <lgl>, source_name <chr>, license <chr>
The get_item_data() function gives by-item information, either for a specific language and/or form or for all instruments.
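The calls behind the two item outputs below are not shown either. Judging by the language and form columns, they were probably along these lines. The Italian/WG choice is inferred from the first output and is not stated in the text.

```r
library(wordbankr)

# Assumed call for the first output: items on the Italian WG form.
italian_wg_items <- get_item_data(language = "Italian", form = "WG")

# Assumed call for the second output: items for all instruments.
all_items <- get_item_data()
```

As with the administration data, these calls need a network connection to reach the Wordbank database.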
## # A tibble: 505 x 11
##    item_id definition language form  type  category lexical_category
##    <chr>   <chr>      <chr>    <chr> <chr> <chr>    <chr>
##  1 item_1  Risponde … Italian  WG    firs… <NA>     <NA>
##  2 item_2  Risponde … Italian  WG    firs… <NA>     <NA>
##  3 item_3  Reagisce … Italian  WG    firs… <NA>     <NA>
##  4 item_4  Vuoi la p… Italian  WG    phra… <NA>     <NA>
##  5 item_5  Hai sonno… Italian  WG    phra… <NA>     <NA>
##  6 item_6  Vuoi bere? Italian  WG    phra… <NA>     <NA>
##  7 item_7  Stai atte… Italian  WG    phra… <NA>     <NA>
##  8 item_8  Stai buono Italian  WG    phra… <NA>     <NA>
##  9 item_9  Batti le … Italian  WG    phra… <NA>     <NA>
## 10 item_10 Cambiamo … Italian  WG    phra… <NA>     <NA>
## # … with 495 more rows, and 4 more variables: lexical_class <chr>,
## #   uni_lemma <chr>, complexity_category <chr>, num_item_id <dbl>
## # A tibble: 31,811 x 11
##    item_id definition language form  type  category lexical_category
##    <chr>   <chr>      <chr>    <chr> <chr> <chr>    <chr>
##  1 item_81 gristi     Croatian WG    word  action_… predicates
##  2 item_2… puhati     Croatian WG    word  action_… predicates
##  3 item_2… razbiti    Croatian WG    word  action_… predicates
##  4 item_64 donijeti   Croatian WG    word  action_… predicates
##  5 item_1… kupiti     Croatian WG    word  action_… predicates
##  6 item_36 čistiti    Croatian WG    word  action_… predicates
##  7 item_3… zatvoriti  Croatian WG    word  action_… predicates
##  8 item_2… plakati    Croatian WG    word  action_… predicates
##  9 item_2… plesati    Croatian WG    word  action_… predicates
## 10 item_42 crtati     Croatian WG    word  action_… predicates
## # … with 31,801 more rows, and 4 more variables: lexical_class <chr>,
## #   uni_lemma <chr>, complexity_category <chr>, num_item_id <dbl>
If you are only looking at total vocabulary size, the by-administration data is all you need, since it includes both productive and receptive vocabulary sizes. If you are looking at specific items or subsets of items, you need to load instrument data using the get_instrument_data() function. Pass it an instrument language and form, along with a vector of the items you want to extract (by item_id).
library(wordbankr)

get_instrument_data(
  language = "English (American)",
  form = "WS",
  items = c("item_26", "item_46")
)
## # A tibble: 11,692 x 3
##    data_id value    num_item_id
##      <dbl> <chr>          <dbl>
##  1  129242 produces          26
##  2  129243 produces          26
##  3  129244 produces          26
##  4  129245 produces          26
##  5  129246 ""                26
##  6  129247 produces          26
##  7  129248 produces          26
##  8  129249 produces          26
##  9  129250 produces          26
## 10  129251 ""                26
## # … with 11,682 more rows
By default, get_instrument_data() returns a data frame with columns for the administration’s data_id, the item’s num_item_id (numerical item_id), and the corresponding value. To include administration information, you can set the administrations argument to TRUE, or pass the result of get_administration_data() as administrations (which prevents the administration data from being loaded multiple times). Similarly, you can set the iteminfo argument to TRUE, or pass it the result of get_item_data().
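The reuse pattern described above might look like the following sketch. It assumes the argument names used in this vignette (administrations, iteminfo), which other versions of the package may name differently. The variable names are illustrative only.

```r
library(wordbankr)

# Load the administration and item tables once.
eng_ws_admins <- get_administration_data(language = "English (American)",
                                         form = "WS")
eng_ws_items <- get_item_data(language = "English (American)", form = "WS")

# Pass the loaded tables instead of TRUE, so they are not fetched again.
two_items <- get_instrument_data(
  language = "English (American)",
  form = "WS",
  items = c("item_26", "item_46"),
  administrations = eng_ws_admins,
  iteminfo = eng_ws_items
)
```

The same eng_ws_admins and eng_ws_items objects can then be passed to any further get_instrument_data() calls for this instrument.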
Loading the data is fast if you need only a handful of items, but the time scales about linearly with the number of items, and can get quite slow if you need many or all of them. So, it’s a good idea to filter down to only the items you need before calling get_instrument_data().
As an example, let’s say we want to look at the production of animal words on English Words & Sentences over age. First we get the items we want:
library(dplyr)  # provides %>% and filter()

animals <- get_item_data(language = "English (American)", form = "WS") %>%
  filter(category == "animals")
Then we get the instrument data for those items:
animal_data <- get_instrument_data(language = "English (American)",
                                   form = "WS",
                                   items = animals$item_id,
                                   administrations = TRUE)
Finally, we calculate how many animal words each child produces and the median number of animal words in each age bin:
animal_summary <- animal_data %>%
  mutate(produces = value == "produces") %>%
  group_by(age, data_id) %>%
  summarise(num_animals = sum(produces, na.rm = TRUE)) %>%
  group_by(age) %>%
  summarise(median_num_animals = median(num_animals, na.rm = TRUE))

library(ggplot2)  # provides ggplot(), aes(), geom_point(), labs()
ggplot(animal_summary, aes(x = age, y = median_num_animals)) +
  geom_point() +
  labs(x = "Age (months)", y = "Median animal words producing")
The get_instruments() function gives information on all the CDI instruments in Wordbank.
## # A tibble: 56 x 7
##    instrument_id language form  age_min age_max has_grammar
##            <int> <chr>    <chr>   <int>   <int>       <int>
##  1             1 British… WG          8      36           0
##  2             2 Cantone… WS         16      30           0
##  3             3 Croatian WG          8      16           0
##  4             4 Croatian WS         16      30           0
##  5             5 Danish   WS         16      36           1
##  6             6 English… WG          8      18           0
##  7             7 English… WS         16      30           1
##  8             8 German   WS         18      30           0
##  9             9 Hebrew   WG         11      25           0
## 10            10 Hebrew   WS         25      36           1
## # … with 46 more rows, and 1 more variable: unilemma_coverage <dbl>
The get_sources() function gives information on all the data sources in Wordbank, either for a specific language and/or form or for all instruments. If the admin_data argument is set to TRUE, the results will also include the number of administrations in the database from that source and the minimum and maximum ages of those administrations.
## # A tibble: 29 x 9
##    source_id name  dataset instrument_lang… instrument_form contributor
##        <int> <chr> <chr>   <chr>            <fct>           <chr>
##  1         9 Marc… Norming English (Americ… Words & Gestur… Larry Fens…
##  2        10 Byers ""      English (Americ… Words & Gestur… Krista Bye…
##  3        11 Thal  13      English (Americ… Words & Gestur… Donna Thal…
##  4        12 Thal  16      English (Americ… Words & Gestur… Donna Thal…
##  5        14 Marc… Norming Spanish (Mexica… Words & Gestur… Donna Jack…
##  6        18 Kris… ""      Norwegian        Words & Gestur… Hanne Simo…
##  7        19 Kris… longit… Norwegian        Words & Gestur… Hanne Simo…
##  8        20 CLEX  ""      Croatian         Words & Gestur… Melita Kov…
##  9        24 CLEX  ""      Russian          Words & Gestur… Stella Cey…
## 10        26 CLEX  ""      Swedish          Words & Gestur… Mårten Eri…
## # … with 19 more rows, and 3 more variables: citation <chr>,
## #   longitudinal <lgl>, license <fct>
get_sources(language = "Spanish (Mexican)", admin_data = TRUE) %>%
  select(source_id, name, dataset, instrument_form, n_admins, age_min, age_max)
## # A tibble: 4 x 7
##   source_id name     dataset  instrument_form   n_admins age_min age_max
##       <int> <chr>    <chr>    <fct>                <int>   <int>   <int>
## 1        13 Marchman Norming  Words & Sentences     1094      15      30
## 2        14 Marchman Norming  Words & Gestures       778       8      19
## 3        65 Fernald  Outreach Words & Gestures        55      16      22
## 4        66 Fernald  Outreach Words & Sentences       80      18      38
The fit_aoa() function computes estimates of items’ age of acquisition (AoA). It needs to be provided with a data frame returned by get_instrument_data(), with one row per administration x item combination and at minimum the columns age and num_item_id. It returns a data frame with one row per item and an aoa column with the estimate, preserving any item-level columns in the input data. The AoA is estimated by computing the proportion of administrations for which the child understands/produces (measure) each word, smoothing the proportion using method, and taking the age at which the smoothed value is greater than proportion.
eng_ws_data <- get_instrument_data(language = "English (American)",
                                   form = "WS",
                                   items = c("item_1", "item_42"),
                                   administrations = TRUE,
                                   iteminfo = TRUE)
fit_aoa(eng_ws_data)
## # A tibble: 2 x 10
## # Groups:   num_item_id [2]
##   num_item_id   aoa item_id definition type  category lexical_category
##         <dbl> <dbl> <chr>   <chr>      <chr> <chr>    <chr>
## 1           1    NA item_1  baa baa    word  sounds   other
## 2          42    24 item_42 owl        word  animals  nouns
## # … with 3 more variables: lexical_class <chr>, uni_lemma <chr>,
## #   complexity_category <chr>
Calling fit_aoa() with different settings for measure, method, or proportion can change the estimates, as in the following output:
## # A tibble: 2 x 10
## # Groups:   num_item_id [2]
##   num_item_id   aoa item_id definition type  category lexical_category
##         <dbl> <dbl> <chr>   <chr>      <chr> <chr>    <chr>
## 1           1    21 item_1  baa baa    word  sounds   other
## 2          42    27 item_42 owl        word  animals  nouns
## # … with 3 more variables: lexical_class <chr>, uni_lemma <chr>,
## #   complexity_category <chr>
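The three estimation steps described for fit_aoa() can be illustrated offline with base R. This is a minimal sketch on made-up data, not the package's implementation. The toy counts, the use of logistic regression as the smoother, and taking the earliest age above the threshold are all assumptions made for illustration.

```r
# Toy data for one hypothetical item: 20 administrations at each age,
# with a made-up number of children producing the word at each age.
ages <- 16:31
n_per_age <- 20
n_producing <- c(0, 0, 0, 2, 4, 6, 8, 9, 11, 12, 14, 16, 18, 20, 20, 20)

toy <- data.frame(
  age = rep(ages, each = n_per_age),
  num_item_id = 42,
  value = unlist(lapply(n_producing, function(k) {
    rep(c("produces", ""), times = c(k, n_per_age - k))
  }))
)

# Step 1: proportion of administrations producing the word at each age.
toy$produces <- as.integer(toy$value == "produces")
by_age <- aggregate(produces ~ age, data = toy, FUN = mean)

# Step 2: smooth the proportions, here with a logistic regression on age
# (an assumed choice of smoother).
model <- glm(produces ~ age, data = toy, family = binomial)
by_age$smoothed <- predict(model, newdata = by_age, type = "response")

# Step 3: take the earliest age whose smoothed value exceeds the threshold.
proportion <- 0.5
toy_aoa <- min(by_age$age[by_age$smoothed > proportion])
toy_aoa
```

The toy counts are symmetric around 23.5 months, so the fitted curve crosses 0.5 between 23 and 24 months and the sketch yields an estimate of 24.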
One of the item-level fields is uni_lemma (“universal lemma”), which is intended to be an approximate semantic mapping between words across the languages in Wordbank. The function get_crossling_items() simply gives all the available uni_lemma values.
## # A tibble: 1,380 x 1
##    uni_lemma
##    <chr>
##  1 a
##  2 a little
##  3 a lot
##  4 able
##  5 about
##  6 above
##  7 after
##  8 afternoon
##  9 again
## 10 air conditioner
## # … with 1,370 more rows
The function get_crossling_data() takes a vector of uni_lemmas and returns a data frame of summary statistics for each item mapped to that uni_lemma in any language (on WG forms). Each row is a combination of item and age, and the columns indicate the number of children (n_children), means (comprehension, production), standard deviations (comprehension_sd, production_sd), and item-level fields.
get_crossling_data(uni_lemmas = c("hat", "nose")) %>%
  ungroup() %>%
  select(language, uni_lemma, definition, age, n_children, comprehension,
         production, comprehension_sd, production_sd) %>%
  arrange(uni_lemma)
## # A tibble: 381 x 9
##    language uni_lemma definition   age n_children comprehension production
##    <chr>    <chr>     <chr>      <int>      <int>         <dbl>      <dbl>
##  1 British… hat       hat            8          4         0          0
##  2 British… hat       hat            9          4         0          0
##  3 British… hat       hat           10          4         0          0
##  4 British… hat       hat           11          6         0.167      0
##  5 British… hat       hat           12          6         0          0
##  6 British… hat       hat           13          6         0          0
##  7 British… hat       hat           14          7         0.143      0
##  8 British… hat       hat           15          6         0          0
##  9 British… hat       hat           16          7         0.143      0.143
## 10 British… hat       hat           17          7         0.286      0.143
## # … with 371 more rows, and 2 more variables: comprehension_sd <dbl>,
## #   production_sd <dbl>
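To make the meaning of these summary columns concrete, here is a small base-R sketch that computes the same kind of per-age statistics for a single made-up item. The toy responses, the 0/1 coding, and the use of the sample standard deviation are assumptions for illustration. It does not show how get_crossling_data() is implemented.

```r
# Hypothetical responses for one item: one row per child, coded 0/1.
toy <- data.frame(
  age         = c(16, 16, 16, 16, 17, 17, 17, 17),
  understands = c(1, 0, 0, 0, 1, 1, 0, 0),
  produces    = c(0, 0, 0, 0, 1, 0, 0, 0)
)

# Summarise one age group into a single row of counts, means, and SDs.
summarise_age <- function(d) {
  data.frame(
    age              = d$age[1],
    n_children       = nrow(d),
    comprehension    = mean(d$understands),
    production       = mean(d$produces),
    comprehension_sd = sd(d$understands),
    production_sd    = sd(d$produces)
  )
}

# Apply the summary to each age group and stack the results.
toy_summary <- do.call(rbind, lapply(split(toy, toy$age), summarise_age))
toy_summary
```

Each row of toy_summary corresponds to one age, mirroring the item-by-age layout of the output above.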