Overview

The childesr package allows you to access data in the childes-db from R. This removes the need to write complex SQL queries in order to get the information you want from the database. This vignette shows some examples of how to use the data loading functions and what the resulting data look like.

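There are several get_ functions that you can use to extract different types of data from the childes-db:

  • get_transcripts()
  • get_participants()
  • get_tokens()
  • get_types()
  • get_utterances()
  • get_speaker_statistics()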

Technical note 1: You do not have to explicitly establish a connection to the childes-db since the childesr functions will manage these connections. But if you would like to establish your own connection, you can pass it as an argument to any of the get_ functions.

Technical note 2: We have tried to minimize the time it takes to get data from the database, but a query that requests all of the tokens will still take a long time.

# load the library
library(childesr)
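
As Technical note 1 mentions, you can manage the connection yourself. The following is a minimal sketch, assuming that connect_to_childes() returns a DBI connection and that the get_ functions accept it through a connection argument, as in the released childesr package. It needs network access to the childes-db.

```r
library(childesr)

# Open one connection and reuse it for several queries
# (assumes connect_to_childes() is available in your childesr version).
con <- connect_to_childes()

d_brown <- get_transcripts(corpus = "Brown", connection = con)
d_clark <- get_transcripts(corpus = "Clark", connection = con)

# A connection that you opened yourself should also be closed by you.
DBI::dbDisconnect(con)
```

Reusing one connection can be convenient when you run many small queries in a row, but for occasional queries the default behavior is simpler.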

Get transcripts

The get_transcripts function returns high-level information about the transcripts that are available in the database. You can filter your query to get the transcripts for a specific collection, corpus, or child.

For example, you can run get_transcripts with all the arguments set to NULL to return all of the transcripts in the database.

d_transcripts <- get_transcripts(collection = NULL, 
                                 corpus = NULL, 
                                 child = NULL)

head(d_transcripts)
## # A tibble: 6 x 11
##   transcript_id languages       date         filename corpus_id
##           <int>     <chr>      <chr>            <chr>     <int>
## 1             1       eng 1976-04-21 Clark/shem01.xml         1
## 2             2       eng 1976-04-28 Clark/shem02.xml         1
## 3             3       eng 1976-05-07 Clark/shem03.xml         1
## 4             4       eng 1976-05-21 Clark/shem04.xml         1
## 5             5       eng 1976-05-26 Clark/shem05.xml         1
## 6             6       eng 1976-06-02 Clark/shem06.xml         1
## # ... with 6 more variables: target_child_id <int>,
## #   target_child_age <dbl>, target_child_name <chr>, corpus_name <chr>,
## #   collection_id <int>, collection_name <chr>

If you only want information about a specific collection, such as the North American English (Eng-NA) transcripts, then you can specify this in the collection argument.

d_eng_na <- get_transcripts(collection = "Eng-NA", 
                            corpus = NULL, 
                            child = NULL)

head(d_eng_na)
## # A tibble: 6 x 11
##   transcript_id languages       date         filename corpus_id
##           <int>     <chr>      <chr>            <chr>     <int>
## 1             1       eng 1976-04-21 Clark/shem01.xml         1
## 2             2       eng 1976-04-28 Clark/shem02.xml         1
## 3             3       eng 1976-05-07 Clark/shem03.xml         1
## 4             4       eng 1976-05-21 Clark/shem04.xml         1
## 5             5       eng 1976-05-26 Clark/shem05.xml         1
## 6             6       eng 1976-06-02 Clark/shem06.xml         1
## # ... with 6 more variables: target_child_id <int>,
## #   target_child_age <dbl>, target_child_name <chr>, corpus_name <chr>,
## #   collection_id <int>, collection_name <chr>

If you know the corpus that you want to analyze, then you can specify this in the corpus argument. The following function call will return information about all of the transcripts in the Brown corpus.

# returns all transcripts in the Brown corpus
d_brown_transcripts <- get_transcripts(collection = NULL, 
                                       corpus = "Brown", 
                                       child = NULL)
# print the number of rows
nrow(d_brown_transcripts)
## [1] 214

If you want more than one corpus, then you can pass a vector of corpus names. You can also pass more than one name to the collection and child arguments.

d_many_corpora <- get_transcripts(collection = NULL, 
                                  corpus = c("Brown", "Clark"), 
                                  child = NULL)
# print the number of rows
nrow(d_many_corpora)
## [1] 261
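
To see how those rows are split between the two corpora, one option is to tabulate the corpus_name column that appears in the transcript output above. This is only a sketch: it assumes that the collection and child arguments default to NULL, and it needs network access to the childes-db.

```r
library(childesr)

# Same query as above, leaving the other filters at their (assumed) NULL defaults.
d_many_corpora <- get_transcripts(corpus = c("Brown", "Clark"))

# Count the number of transcripts that come from each corpus.
transcripts_per_corpus <- table(d_many_corpora$corpus_name)
transcripts_per_corpus
```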

If you want transcript information about a specific child from a corpus, then you can pass the child's name to the child argument. Note that the following function call will not return any of the transcripts from the Brown corpus because the child Shem is not present in that corpus.

d_shem <- get_transcripts(collection = NULL, 
                          corpus = c("Brown", "Clark"), 
                          child = "Shem")
# print the number of rows
nrow(d_shem)
## [1] 47

Get participants

The get_participants function returns background information about the speakers (both the children and the adults) in the database. This includes information about:

  • the speaker’s role in the conversation
  • language
  • sex
  • SES
  • youngest age of transcript
  • oldest age of transcript

Again, if you run the function with all arguments set to NULL (the default), then you get all the background information for all speakers in the database.

d_participants <- get_participants(collection = NULL, 
                                   corpus = NULL, 
                                   child = NULL,
                                   role = NULL, 
                                   age = NULL, 
                                   sex = NULL)
head(d_participants)
## # A tibble: 6 x 18
##      id  code  name         role language   group   sex   ses education
##   <int> <chr> <chr>        <chr>    <chr>   <chr> <chr> <chr>     <chr>
## 1     1   CHI  Shem Target_Child      eng typical  male    UC      <NA>
## 2     2   INV Cindy Investigator      eng    <NA>  <NA>  <NA>      <NA>
## 3     3   MOT  <NA>       Mother      eng    <NA>  <NA>  <NA>      <NA>
## 4     4   FAT  <NA>       Father      eng    <NA>  <NA>  <NA>      <NA>
## 5     5   GUY  <NA>        Adult      eng    <NA>  <NA>  <NA>      <NA>
## 6     6   LEN   Len        Child      eng    <NA>  <NA>  <NA>      <NA>
## # ... with 9 more variables: custom <chr>, corpus_id <int>, max_age <dbl>,
## #   min_age <dbl>, target_child_id <int>, corpus_name <chr>,
## #   collection_id <int>, collection_name <chr>, target_child_name <chr>

The get_participants function introduces three new arguments: role, age, and sex. The role argument allows you to get information about a specific kind of speaker, such as “target_child”.

d_target_child <- get_participants(collection = NULL, 
                                   corpus = NULL, 
                                   child = NULL,
                                   role = "target_child", 
                                   age = NULL, 
                                   sex = NULL)
head(d_target_child)
## # A tibble: 6 x 18
##      id  code        name         role language   group    sex   ses
##   <int> <chr>       <chr>        <chr>    <chr>   <chr>  <chr> <chr>
## 1     1   CHI        Shem Target_Child      eng typical   male    UC
## 2    17   CHI Christopher Target_Child      eng typical   male  <NA>
## 3    20   CHI    Margaret Target_Child      eng typical female  <NA>
## 4    23   CHI      Andrew Target_Child      eng typical   male  <NA>
## 5    26   CHI        Erin Target_Child      eng typical female  <NA>
## 6    29   CHI   Christina Target_Child      eng typical female  <NA>
## # ... with 10 more variables: education <chr>, custom <chr>,
## #   corpus_id <int>, max_age <dbl>, min_age <dbl>, target_child_id <int>,
## #   corpus_name <chr>, collection_id <int>, collection_name <chr>,
## #   target_child_name <chr>

The age argument takes a number indicating the age(s) of children (in months) that you want to analyze. You can use this argument in two ways:

  1. Pass a single number to get information about all participants who have a transcript at that age.
  2. Pass a range of ages to get information about all participants who have a transcript within that age range.

For example, you can get the participant information for all of the children who had transcripts between the ages of 24 and 36 months.

d_age_range <- get_participants(collection = NULL, 
                                corpus = NULL, 
                                child = NULL,
                                role = NULL, 
                                age = c(24, 36), 
                                sex = NULL)
head(d_age_range)
## # A tibble: 6 x 18
##      id  code        name         role language   group    sex   ses
##   <int> <chr>       <chr>        <chr>    <chr>   <chr>  <chr> <chr>
## 1     1   CHI        Shem Target_Child      eng typical   male    UC
## 2    17   CHI Christopher Target_Child      eng typical   male  <NA>
## 3    20   CHI    Margaret Target_Child      eng typical female  <NA>
## 4    23   CHI      Andrew Target_Child      eng typical   male  <NA>
## 5    26   CHI        Erin Target_Child      eng typical female  <NA>
## 6    29   CHI   Christina Target_Child      eng typical female  <NA>
## # ... with 10 more variables: education <chr>, custom <chr>,
## #   corpus_id <int>, max_age <dbl>, min_age <dbl>, target_child_id <int>,
## #   corpus_name <chr>, collection_id <int>, collection_name <chr>,
## #   target_child_name <chr>
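
The single-age form of the age argument described earlier could look like the following sketch. It assumes that the arguments left out default to NULL, and it needs network access to the childes-db.

```r
library(childesr)

# Target children who have a transcript at 24 months of age.
d_age_24 <- get_participants(role = "target_child", age = 24)

# Number of matching participants.
nrow(d_age_24)
```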

Get tokens

The get_tokens function returns a table with a row for each token based on a set of filtering criteria. The token argument allows you to pass a vector of one or more tokens that you want to analyze.

For example, if you wanted to get all of the production data for specific tokens, then you could run the following call to get all instances of “dog” and “ball” for Adam in the Brown corpus.

d_adam_prod <- get_tokens(collection = NULL, 
                          corpus = "Brown", 
                          role = "target_child",
                          age = NULL, 
                          sex = NULL, 
                          child = "Adam",
                          token = c("dog", "ball"))
## Getting data from 1 child in 1 corpus ...
# view the structure of the data
str(d_adam_prod)
## Classes 'tbl_df', 'tbl' and 'data.frame':    262 obs. of  20 variables:
##  $ id               : int  760381 760413 760454 760627 760781 760966 761123 761183 761859 762760 ...
##  $ gloss            : chr  "ball" "ball" "ball" "ball" ...
##  $ replacement      : chr  "" "" "" "" ...
##  $ stem             : chr  "ball" "ball" "ball" "ball" ...
##  $ part_of_speech   : chr  "n" "n" "n" "n" ...
##  $ relation         : chr  "2|1|OBJ" "2|1|OBJ" "3|1|OBJ" "3|1|OBJ" ...
##  $ speaker_id       : int  532 532 532 532 532 532 532 532 532 532 ...
##  $ utterance_id     : int  276803 276814 276828 276900 276957 277031 277097 277121 277376 277669 ...
##  $ token_order      : int  2 2 3 3 2 3 2 2 2 2 ...
##  $ corpus_id        : int  9 9 9 9 9 9 9 9 9 9 ...
##  $ transcript_id    : int  1111 1111 1111 1111 1111 1122 1111 1111 1111 1125 ...
##  $ speaker_age      : num  838 838 838 838 838 ...
##  $ speaker_code     : chr  "CHI" "CHI" "CHI" "CHI" ...
##  $ speaker_name     : chr  "Adam" "Adam" "Adam" "Adam" ...
##  $ speaker_role     : chr  "Target_Child" "Target_Child" "Target_Child" "Target_Child" ...
##  $ speaker_sex      : chr  "male" "male" "male" "male" ...
##  $ target_child_id  : int  532 532 532 532 532 532 532 532 532 532 ...
##  $ target_child_age : num  838 838 838 838 838 ...
##  $ target_child_name: chr  "Adam" "Adam" "Adam" "Adam" ...
##  $ target_child_sex : chr  "male" "male" "male" "male" ...
# print the unique tokens
unique(d_adam_prod$gloss)
## [1] "ball" "dog"
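
A common next step is to tally how often each token occurs. The sketch below does this on a small hand-made data frame that has the gloss and transcript_id columns of the get_tokens output. The values are invented for illustration and are not real CHILDES data.

```r
# Toy stand-in for a get_tokens() result; the values are made up.
toy_tokens <- data.frame(
  gloss = c("ball", "ball", "dog", "ball", "dog"),
  transcript_id = c(1111L, 1111L, 1111L, 1122L, 1122L),
  stringsAsFactors = FALSE
)

# Number of occurrences of each token overall.
token_counts <- table(toy_tokens$gloss)

# Number of occurrences of each token within each transcript.
counts_by_transcript <- table(toy_tokens$transcript_id, toy_tokens$gloss)
```

With real data you would replace toy_tokens with the data frame returned by get_tokens.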

Get types

The get_types() function works like the get_tokens() function, returning a table with a row for each type based on a set of filtering criteria. The type argument allows you to pass a vector of one or more types that you want to analyze. The main difference is that you now have a single row for each type (i.e., a unique word) and a count variable that tracks the number of times that type appeared in a particular transcript.

For example, if you wanted to get all of the production data for specific types, then you could run the following call to get counts of “dog” and “ball” for all of Adam’s transcripts in the Brown corpus.

d_adam_types <- get_types(collection = NULL, 
                          corpus = "Brown", 
                          child = "Adam", 
                          role = "target_child",
                          role_exclude = NULL, 
                          age = NULL, 
                          sex = NULL, 
                          type = c("dog", "ball"))
## Getting data from 1 child in 1 corpus ...
# print the number of times ball appears in the first transcript
c(d_adam_types$gloss[1], d_adam_types$count[1])
## [1] "ball" "3"
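
Note that the count is printed as the string "3" because c() converts its inputs to a single common type. The sketch below shows one way to keep the count numeric and to total the counts per type. It uses a hand-made data frame with the gloss and count columns of the get_types output. The values are invented for illustration.

```r
# Toy stand-in for a get_types() result; the values are made up.
toy_types <- data.frame(
  gloss = c("ball", "dog", "ball", "dog"),
  count = c(3L, 1L, 5L, 2L),
  transcript_id = c(1111L, 1111L, 1122L, 1122L),
  stringsAsFactors = FALSE
)

# c() coerces mixed inputs to one type, so the count becomes a string.
mixed <- c(toy_types$gloss[1], toy_types$count[1])

# Subsetting the data frame keeps count as an integer column.
first_row <- toy_types[1, c("gloss", "count")]

# Total count for each type, summed over transcripts.
totals <- aggregate(count ~ gloss, data = toy_types, FUN = sum)
```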

Get utterances

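The get_utterances function returns a table with a row for each utterance based on user-defined filtering criteria. For example, the following function call will get you all of the utterances in Adam’s transcripts in the Brown corpus. Because role is NULL, the result includes utterances from every speaker in those transcripts, not only from Adam.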

d_adam_utts <- get_utterances(collection = NULL, 
                              corpus = "Brown", 
                              role = NULL,
                              age = NULL, 
                              sex = NULL, 
                              child = "Adam")
## Getting data from 1 child in 1 corpus ...
# view the structure of the data
str(d_adam_utts)
## Classes 'tbl_df', 'tbl' and 'data.frame':    72087 obs. of  19 variables:
##  $ id               : int  276156 276166 276173 276175 276179 276185 276187 276192 276196 276198 ...
##  $ speaker_id       : int  532 532 532 532 532 536 532 532 532 533 ...
##  $ order            : int  1 2 1 3 2 3 4 5 4 6 ...
##  $ transcript_id    : int  1115 1115 1113 1115 1113 1113 1115 1115 1113 1115 ...
##  $ corpus_id        : int  9 9 9 9 9 9 9 9 9 9 ...
##  $ gloss            : chr  "dog bit me bit me dog" "don't dog" "book" "bitten me dog nineteen twelve" ...
##  $ length           : int  6 2 1 5 2 4 1 2 1 3 ...
##  $ relation         : chr  "1|2|SUBJ 2|0|ROOT 3|2|OBJ 4|2|COMP 5|4|OBJ 6|4|OBJ" "1|3|AUX 2|1|NEG 3|0|ROOT" "1|0|INCROOT" "1|0|INCROOT 2|1|OBJ 3|1|XCOMP 4|5|QUANT 5|3|OBJ" ...
##  $ stem             : chr  "dog bite-PAST me bite-PAST me dog" "do not dog" "book" "bite-PASTP me dog nineteen twelve" ...
##  $ part_of_speech   : chr  "n v pro v pro n" "mod v" "n" "part pro v det det" ...
##  $ speaker_age      : num  892 892 866 892 866 ...
##  $ speaker_code     : chr  "CHI" "CHI" "CHI" "CHI" ...
##  $ speaker_name     : chr  "Adam" "Adam" "Adam" "Adam" ...
##  $ speaker_role     : chr  "Target_Child" "Target_Child" "Target_Child" "Target_Child" ...
##  $ speaker_sex      : chr  "male" "male" "male" "male" ...
##  $ target_child_id  : int  532 532 532 532 532 532 532 532 532 532 ...
##  $ target_child_age : num  892 892 866 892 866 ...
##  $ target_child_name: chr  "Adam" "Adam" "Adam" "Adam" ...
##  $ target_child_sex : chr  "male" "male" "male" "male" ...
# print the first five utterances
d_adam_utts$gloss[1:5]
## [1] "dog bit me bit me dog"         "don't dog"                    
## [3] "book"                          "bitten me dog nineteen twelve"
## [5] "read book"
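
Once the utterances are loaded, you can separate speakers with the speaker_role column and summarize the length column. The sketch below uses a hand-made data frame with those columns. The utterances and values are invented for illustration and are not real CHILDES data.

```r
# Toy stand-in for a get_utterances() result; the values are made up.
toy_utts <- data.frame(
  gloss = c("read book", "what is that", "book", "yes that is a book"),
  length = c(2L, 3L, 1L, 5L),
  speaker_role = c("Target_Child", "Mother", "Target_Child", "Mother"),
  stringsAsFactors = FALSE
)

# Keep only the utterances produced by the target child.
child_utts <- toy_utts[toy_utts$speaker_role == "Target_Child", ]

# Mean utterance length for each speaker role.
mean_len <- tapply(toy_utts$length, toy_utts$speaker_role, mean)
```

If you only need one kind of speaker, passing a role to get_utterances filters at the database level and avoids downloading the other rows.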

Get speaker statistics

The get_speaker_statistics() function returns a table with a row for each transcript and columns that contain a set of summary statistics for that transcript. The summary statistics include:

  • number of utterances (num_utterances)
  • mean length of utterances (mlu)
  • number of types (num_types)
  • number of tokens (num_tokens)

For example, if you wanted to get the summary statistics for Adam’s production data, you could run the following call.

d_adam_stats <- get_speaker_statistics(collection = NULL, 
                                       corpus = "Brown", 
                                       child = "Adam", 
                                       role = "target_child",
                                       role_exclude = NULL, 
                                       age = NULL, 
                                       sex = NULL)

# get the average mlu across all Adam's transcripts
mean(d_adam_stats$mlu)
## [1] 3.591833
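
The plain mean above gives every transcript the same weight, regardless of how many utterances it contains. If you would rather weight each transcript by its size, one option is weighted.mean with the num_utterances column. The sketch below illustrates the difference on a hand-made data frame. The values are invented for illustration.

```r
# Toy stand-in for a get_speaker_statistics() result; the values are made up.
toy_stats <- data.frame(
  mlu = c(2.0, 4.0),
  num_utterances = c(100L, 300L)
)

# Every transcript counts equally.
unweighted <- mean(toy_stats$mlu)

# Transcripts with more utterances contribute more to the average.
weighted <- weighted.mean(toy_stats$mlu, w = toy_stats$num_utterances)
```

Here the unweighted mean is 3 and the weighted mean is 3.5, because the longer transcript has the higher MLU.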