Accessing the Wordbank database

2017-12-21

The wordbankr package allows you to access data in the Wordbank database from R. This vignette shows some examples of how to use the data loading functions and what the resulting data look like.

There are three different data views that you can pull out of Wordbank: by-administration, by-item, and administration-by-item.

The get_administration_data function gives by-administration information, for either a specific language and form or for all instruments:

library(wordbankr)

english_ws_admins <- get_administration_data("English (American)", "WS")
head(english_ws_admins)
## # A tibble: 6 x 15
##   data_id   age comprehension production           language  form
##     <dbl> <int>         <int>      <int>              <chr> <chr>
## 1  129242    27           497        497 English (American)    WS
## 2  129243    21           369        369 English (American)    WS
## 3  129244    26           190        190 English (American)    WS
## 4  129245    27           264        264 English (American)    WS
## 5  129246    19           159        159 English (American)    WS
## 6  129247    30           513        513 English (American)    WS
## # ... with 9 more variables: birth_order <fctr>, ethnicity <fctr>,
## #   sex <fctr>, zygosity <chr>, norming <lgl>, mom_ed <fctr>,
## #   longitudinal <lgl>, source_name <chr>, license <chr>
all_admins <- get_administration_data()
head(all_admins)
## # A tibble: 6 x 15
##   data_id   age comprehension production language  form birth_order
##     <dbl> <int>         <int>      <int>    <chr> <chr>      <fctr>
## 1   29821    13           293         88 Croatian    WG        <NA>
## 2   29822    16           122         12 Croatian    WG        <NA>
## 3   29823     9             3          0 Croatian    WG        <NA>
## 4   29824    12             0          0 Croatian    WG        <NA>
## 5   29825    12            44          0 Croatian    WG        <NA>
## 6   29826     8            14          5 Croatian    WG        <NA>
## # ... with 8 more variables: ethnicity <fctr>, sex <fctr>, zygosity <chr>,
## #   norming <lgl>, mom_ed <fctr>, longitudinal <lgl>, source_name <chr>,
## #   license <chr>
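A common first step with by-administration data is to check how many administrations each instrument has. The sketch below is one way to do that, written in base R so it runs without any packages or a database connection. The `toy_admins` data frame is a hypothetical stand-in with invented rows that mimics only a few of the columns shown above. With real data you would pass `all_admins` instead.

```r
# Toy stand-in for the result of get_administration_data(); the rows are
# invented for illustration and mimic only a few of the real columns.
toy_admins <- data.frame(
  data_id = c(1, 2, 3, 4, 5),
  age = c(13L, 16L, 27L, 21L, 26L),
  production = c(88L, 12L, 497L, 369L, 190L),
  language = c("Croatian", "Croatian", "English (American)",
               "English (American)", "English (American)"),
  form = c("WG", "WG", "WS", "WS", "WS"),
  stringsAsFactors = FALSE
)

# Count administrations for each instrument (language and form pair)
admin_counts <- aggregate(data_id ~ language + form, data = toy_admins,
                          FUN = length)
names(admin_counts)[names(admin_counts) == "data_id"] <- "n_admins"
admin_counts
```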

The get_item_data function gives by-item information, for either a specific language and form or for all instruments:

spanish_wg_items <- get_item_data("Spanish", "WG")
head(spanish_wg_items)
## # A tibble: 6 x 11
##   item_id definition language  form  type category lexical_category
##     <chr>      <chr>    <chr> <chr> <chr>    <chr>            <chr>
## 1  item_1       ¡am!  Spanish    WG  word   sounds            other
## 2  item_2       ¡ay!  Spanish    WG  word   sounds            other
## 3  item_3    bee/mee  Spanish    WG  word   sounds            other
## 4  item_4     cuacuá  Spanish    WG  word   sounds            other
## 5  item_5     guaguá  Spanish    WG  word   sounds            other
## 6  item_6       miau  Spanish    WG  word   sounds            other
## # ... with 4 more variables: lexical_class <chr>, uni_lemma <chr>,
## #   complexity_category <chr>, num_item_id <dbl>
all_items <- get_item_data()
head(all_items)
## # A tibble: 6 x 11
##    item_id definition language  form  type     category lexical_category
##      <chr>      <chr>    <chr> <chr> <chr>        <chr>            <chr>
## 1  item_81     gristi Croatian    WG  word action_words       predicates
## 2 item_264     puhati Croatian    WG  word action_words       predicates
## 3 item_269    razbiti Croatian    WG  word action_words       predicates
## 4  item_64   donijeti Croatian    WG  word action_words       predicates
## 5 item_153     kupiti Croatian    WG  word action_words       predicates
## 6  item_36    čistiti Croatian    WG  word action_words       predicates
## # ... with 4 more variables: lexical_class <chr>, uni_lemma <chr>,
## #   complexity_category <chr>, num_item_id <dbl>

If you are only looking at total vocabulary size, the administration data is all you need, since it already includes both productive and receptive vocabulary sizes. If you are looking at specific items or subsets of items, you need to load instrument data using the get_instrument_data function. Pass it an instrument language and form, along with a vector of the items you want to extract (by item_id).

eng_ws_canines <- get_instrument_data(
  language = "English (American)",
  form = "WS",
  items = c("item_26", "item_46")
)
head(eng_ws_canines)
## # A tibble: 6 x 3
##   data_id    value num_item_id
##     <dbl>    <chr>       <dbl>
## 1  129242 produces          26
## 2  129243 produces          26
## 3  129244 produces          26
## 4  129245 produces          26
## 5  129246                   26
## 6  129247 produces          26

By default, get_instrument_data returns a data frame with columns for the administration’s data_id, the item’s num_item_id (numerical item_id), and the corresponding value. To include administration information, you can set the administrations argument to TRUE, or pass the result of get_administration_data as administrations (which prevents the administration data from being loaded multiple times). Similarly, you can set the iteminfo argument to TRUE, or pass it the result of get_item_data.
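To make concrete what attaching administration and item information means, here is a minimal base R sketch that performs comparable joins by hand on toy data. All three data frames are hypothetical stand-ins with invented values. The join keys (data_id for administrations, num_item_id for items) are an assumption drawn from the column names shown above, not a description of the package's internals.

```r
# Toy stand-in for the default output of get_instrument_data()
toy_instrument <- data.frame(
  data_id = c(1, 2, 1, 2),
  value = c("produces", "", "produces", "produces"),
  num_item_id = c(26, 26, 46, 46),
  stringsAsFactors = FALSE
)

# Toy stand-ins for administration and item data (invented values)
toy_admins <- data.frame(data_id = c(1, 2), age = c(27L, 19L))
toy_items <- data.frame(
  num_item_id = c(26, 46),
  item_id = c("item_26", "item_46"),
  definition = c("word_a", "word_b"),
  stringsAsFactors = FALSE
)

# Attach administration columns by data_id, keeping every response row
with_admins <- merge(toy_instrument, toy_admins, by = "data_id", all.x = TRUE)

# Attach item columns by num_item_id, again keeping every response row
with_both <- merge(with_admins, toy_items, by = "num_item_id", all.x = TRUE)
with_both
```

Each response row now carries the child's age and the item's definition, which is the shape you need for grouping by age or by item category.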

Loading the data is fast if you need only a handful of items, but the time scales about linearly with the number of items, and can get quite slow if you need many or all of them. So, it’s a good idea to filter down to only the items you need before calling get_instrument_data.

As an example, let’s say we want to look at the production of animal words on English Words & Sentences over age. First we get the items we want:

library(dplyr)

animals <- get_item_data("English (American)", "WS") %>%
  filter(category == "animals")

Then we get the instrument data for those items:

animal_data <- get_instrument_data(language = "English (American)",
                                   form = "WS",
                                   items = animals$item_id,
                                   administrations = english_ws_admins)

Finally, we calculate how many animal words each child produces and the median number of animal words produced at each age:

animal_summary <- animal_data %>%
  mutate(produces = value == "produces") %>%
  group_by(age, data_id) %>%
  summarise(num_animals = sum(produces, na.rm = TRUE)) %>%
  group_by(age) %>%
  summarise(median_num_animals = median(num_animals, na.rm = TRUE))
  
library(ggplot2)

ggplot(animal_summary, aes(x = age, y = median_num_animals)) +
  geom_point() +
  labs(x = "Age (months)", y = "Median number of animal words produced")
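To see what the two-step summary computes, the following sketch repeats the same logic in base R on a small toy data set, so it can be checked by hand. The `toy_animal_data` rows are invented: three children, each with responses to three hypothetical animal items.

```r
# Toy stand-in for animal_data: one row per child (data_id) per item
toy_animal_data <- data.frame(
  data_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  age = c(18L, 18L, 18L, 18L, 18L, 18L, 24L, 24L, 24L),
  value = c("produces", "", "produces",
            "", "", "produces",
            "produces", "produces", "produces"),
  stringsAsFactors = FALSE
)

# Step 1: flag produced items, then count them per child
toy_animal_data$produces <- toy_animal_data$value == "produces"
per_child <- aggregate(produces ~ age + data_id, data = toy_animal_data,
                       FUN = sum)
names(per_child)[names(per_child) == "produces"] <- "num_animals"

# Step 2: take the median of the per-child counts within each age
per_age <- aggregate(num_animals ~ age, data = per_child, FUN = median)
names(per_age)[names(per_age) == "num_animals"] <- "median_num_animals"
per_age
```

Here the two 18-month-olds produce 2 and 1 animal words, giving a median of 1.5, and the single 24-month-old produces 3.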