Overview

The childesr package allows you to access data in the childes-db from R. This removes the need to write complex SQL queries in order to get the information you want from the database. This vignette shows some examples of how to use the data loading functions and what the resulting data look like.

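There are several get_ functions that you can use to extract different types of data from the childes-db. The ones covered in this vignette are:

  • get_transcripts()
  • get_participants()
  • get_tokens()
  • get_types()
  • get_utterances()
  • get_speaker_statistics()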

Technical note 1: You do not have to explicitly establish a connection to the childes-db since the childesr functions will manage these connections. But if you would like to establish your own connection, you can do so with connect_to_childes() and pass it as an argument to any of the get_ functions.
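
As a rough sketch of managing your own connection: this assumes the object returned by connect_to_childes() is a standard DBI connection, that the get_ functions accept it through an argument named connection, and that it can be closed with DBI::dbDisconnect(). Check the documentation of your installed childesr version for all three.

```r
library(childesr)

# Open one connection and reuse it across several queries.
con <- connect_to_childes()

# Assumption: the get_ functions take the connection via `connection =`.
d_brown <- get_transcripts(corpus = "Brown", connection = con)
d_brown_participants <- get_participants(corpus = "Brown", connection = con)

# Close the connection yourself when you are done with it.
DBI::dbDisconnect(con)
```

Reusing one connection can be convenient when you run many small queries in a row.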

Technical note 2: We have tried to optimize the time it takes to get data from the database, but a query that requests all of the tokens at once will still take a long time. Filter by collection, corpus, or child wherever you can.

Get transcripts

The get_transcripts function returns high-level information about the transcripts that are available in the database. You can filter your query to get the transcripts for a specific collection, corpus, or child.

For example, you can run get_transcripts without any arguments to return all of the transcripts in the database.
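
A minimal sketch of such a call, assuming the package is loaded with library(childesr). The variable name d_transcripts is illustrative, and head() is used only to preview the first rows.

```r
library(childesr)

# No filters: one row per transcript in the whole database.
d_transcripts <- get_transcripts()

# Preview the first six rows.
head(d_transcripts)
```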

## Using current database version: '2018.1'.
## # A tibble: 6 x 13
##   transcript_id language date      filename      corpus_id target_child_id
##           <int> <chr>    <chr>     <chr>             <int>           <int>
## 1             1 spa      1999-05-… Hess/d12a1ex…         1              NA
## 2             2 spa      1999-05-… Hess/d12a1ex…         1              NA
## 3             3 spa      1999-05-… Hess/d12a1ex…         1              NA
## 4             4 spa      1999-05-… Hess/d12a1ex…         1              NA
## 5             5 spa      1999-05-… Hess/d12a1ex…         1              NA
## 6             6 spa      1999-05-… Hess/d12a2ex…         1              NA
## # ... with 7 more variables: target_child_age <dbl>,
## #   target_child_name <chr>, target_child_sex <chr>, collection_id <int>,
## #   collection_name <chr>, pid <chr>, corpus_name <chr>

If you only want information about a specific collection, such as the North American English (Eng-NA) transcripts, then you can specify this in the collection argument.

d_eng_na <- get_transcripts(collection = "Eng-NA")
## Using current database version: '2018.1'.
## # A tibble: 6 x 13
##   transcript_id language date      filename      corpus_id target_child_id
##           <int> <chr>    <chr>     <chr>             <int>           <int>
## 1          2765 eng      1976-04-… Clark/020216…        29            2454
## 2          2766 eng      1976-04-… Clark/020223…        29            2454
## 3          2767 eng      1976-05-… Clark/020302…        29            2454
## 4          2768 eng      1976-05-… Clark/020316…        29            2454
## 5          2769 eng      1976-05-… Clark/020321…        29            2454
## 6          2770 eng      1976-06-… Clark/020328…        29            2454
## # ... with 7 more variables: target_child_age <dbl>,
## #   target_child_name <chr>, target_child_sex <chr>, collection_id <int>,
## #   collection_name <chr>, pid <chr>, corpus_name <chr>

If you know the corpus that you want to analyze, then you can specify this in the corpus argument. The following function call will return information about all of the transcripts in the Brown corpus.
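
A minimal sketch of that call. The variable name d_brown is illustrative, and nrow() is used here to count the matching transcripts.

```r
library(childesr)

# Restrict the query to a single corpus.
d_brown <- get_transcripts(corpus = "Brown")

# Number of transcripts returned for the Brown corpus.
nrow(d_brown)
```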

## Using current database version: '2018.1'.
## [1] 214

If you want more than one corpus, then you can pass a vector of corpus names. You can also pass more than one name to the collection and child arguments.

d_many_corpora <- get_transcripts(corpus = c("Brown", "Clark"))
## Using current database version: '2018.1'.
## [1] 261

If you want transcript information about a specific child from a corpus, then you pass their name to the child argument. Note that the following function call will not return any of the transcripts from the Brown corpus because the child Shem is not present in that corpus.
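
A sketch of such a call follows. One assumption to flag: the child-name argument is written as target_child, which is the name used in recent childesr releases, while the text above calls it child. Rename it if your installed version expects child. The variable name d_shem is illustrative.

```r
library(childesr)

# Assumption: the argument is `target_child` (recent releases);
# older releases may expect `child = "Shem"` instead.
d_shem <- get_transcripts(corpus = c("Brown", "Clark"), target_child = "Shem")

# Number of transcripts for Shem.
nrow(d_shem)

# Shem is not in the Brown corpus, so Brown should not appear here.
unique(d_shem$corpus_name)
```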

## Using current database version: '2018.1'.
## [1] 47

Get participants

The get_participants function returns background information about the speakers (both the children and the adults) in the database. This includes information about:

  • the speaker’s role in the conversation
  • language
  • sex
  • SES
  • youngest age of transcript
  • oldest age of transcript

Again, if you run the function with no arguments, then you get all the background information for all speakers in the database.
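
A minimal sketch of the unfiltered call. The variable name d_participants is illustrative, and head() only previews the first rows.

```r
library(childesr)

# No filters: background information for every speaker in the database.
d_participants <- get_participants()

# Preview the first six rows.
head(d_participants)
```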

## Using current database version: '2018.1'.
## # A tibble: 6 x 18
##      id code  name   role  language group sex   ses   education custom
##   <int> <chr> <chr>  <chr> <chr>    <chr> <chr> <chr> <chr>     <chr> 
## 1     1 KAR   Karina Adult spa      <NA>  <NA>  <NA>  <NA>      <NA>  
## 2     2 DIA   Diana  Child spa      <NA>  <NA>  <NA>  <NA>      <NA>  
## 3     3 NAY   Nayely Child spa      <NA>  <NA>  <NA>  <NA>      <NA>  
## 4     4 EDG   Edgar  Child spa      <NA>  <NA>  <NA>  <NA>      <NA>  
## 5     5 OSC   Oscar  Child spa      <NA>  <NA>  <NA>  <NA>      <NA>  
## 6     6 ABR   Abril  Child spa      <NA>  <NA>  <NA>  <NA>      <NA>  
## # ... with 8 more variables: corpus_id <int>, max_age <dbl>,
## #   min_age <dbl>, target_child_id <int>, collection_id <int>,
## #   collection_name <chr>, corpus_name <chr>, target_child_name <chr>

The get_participants function introduces three new arguments: role, age, and sex. The role argument allows you to get information about a specific kind of speaker, such as "target_child".

d_target_child <- get_participants(role = "target_child")
## Using current database version: '2018.1'.
## # A tibble: 6 x 18
##      id code  name    role     language group sex   ses   education custom
##   <int> <chr> <chr>   <chr>    <chr>    <chr> <chr> <chr> <chr>     <chr> 
## 1    25 CHI   Niño_J… Target_… spa      <NA>  <NA>  <NA>  <NA>      <NA>  
## 2    28 CHI   Juan    Target_… spa      <NA>  <NA>  <NA>  <NA>      <NA>  
## 3    37 CHI   Niño    Target_… spa      <NA>  <NA>  <NA>  <NA>      <NA>  
## 4    42 CHI   emilio  Target_… spa      <NA>  <NA>  <NA>  <NA>      <NA>  
## 5    63 CHI   Eduard  Target_… spa      <NA>  <NA>  <NA>  <NA>      <NA>  
## 6    78 CHI   Brayan  Target_… spa      <NA>  <NA>  <NA>  <NA>      <NA>  
## # ... with 8 more variables: corpus_id <int>, max_age <dbl>,
## #   min_age <dbl>, target_child_id <int>, collection_id <int>,
## #   collection_name <chr>, corpus_name <chr>, target_child_name <chr>

The age argument takes a number indicating the age(s) of children (in months) that you want to analyze. You can use this argument in two ways:

  1. Pass a single number to get information about all participants who have a transcript at that age.
  2. Pass a range of ages to get information about all participants who have a transcript within a certain age range.

For example, you can get the participant information for all of the children who had transcripts between the ages of 24 and 36 months.
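
Both uses of the age argument can be sketched as follows, assuming a range is passed as a length-two numeric vector of the minimum and maximum age in months. The variable names are illustrative.

```r
library(childesr)

# A range: participants with a transcript between 24 and 36 months.
d_age_range <- get_participants(age = c(24, 36))
head(d_age_range)

# A single number: participants with a transcript at 24 months.
d_age_24 <- get_participants(age = 24)
head(d_age_24)
```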

## Using current database version: '2018.1'.
## # A tibble: 6 x 18
##      id code  name    role  language group sex   ses   education custom
##   <int> <chr> <chr>   <chr> <chr>    <chr> <chr> <chr> <chr>     <chr> 
## 1     2 DIA   Diana   Child spa      <NA>  <NA>  <NA>  <NA>      <NA>  
## 2     3 NAY   Nayely  Child spa      <NA>  <NA>  <NA>  <NA>      <NA>  
## 3     4 EDG   Edgar   Child spa      <NA>  <NA>  <NA>  <NA>      <NA>  
## 4     5 OSC   Oscar   Child spa      <NA>  <NA>  <NA>  <NA>      <NA>  
## 5     6 ABR   Abril   Child spa      <NA>  <NA>  <NA>  <NA>      <NA>  
## 6     7 XOC   Xóchitl Child spa      <NA>  <NA>  <NA>  <NA>      <NA>  
## # ... with 8 more variables: corpus_id <int>, max_age <dbl>,
## #   min_age <dbl>, target_child_id <int>, collection_id <int>,
## #   collection_name <chr>, corpus_name <chr>, target_child_name <chr>

Get tokens

The get_tokens function returns a table with a row for each token based on a set of filtering criteria. The token argument allows you to pass a vector of one or more tokens that you want to analyze.

For example, if you wanted to get all of the production data for a specific token(s), then you could run the following call to get all instances of “dog” and “ball” for Adam in the Brown corpus.
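
A sketch of such a call. It assumes that "production data" means tokens spoken by the target child (role = "target_child") and that the child-name argument is target_child, as in recent childesr releases. Older releases may expect child instead. The variable name d_adam_prod is illustrative.

```r
library(childesr)

# Tokens "dog" and "ball" produced by Adam in the Brown corpus.
# Assumption: `target_child` is the child-name argument in your version.
d_adam_prod <- get_tokens(
  corpus = "Brown",
  role = "target_child",
  target_child = "Adam",
  token = c("dog", "ball")
)

# Inspect the structure of the result, then list the distinct tokens found.
str(d_adam_prod)
unique(d_adam_prod$gloss)
```

Filtering by corpus, child, and token keeps the query small, which matters because unfiltered token queries are slow.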

## Using current database version: '2018.1'.
## Getting data from 1 child in 1 corpus ...
## Classes 'tbl_df', 'tbl' and 'data.frame':    265 obs. of  26 variables:
##  $ id               : int  3816766 3816797 3816850 3816998 3817149 3817469 3817520 3818128 3819292 3821303 ...
##  $ gloss            : chr  "ball" "ball" "ball" "ball" ...
##  $ stem             : chr  "ball" "ball" "ball" "ball" ...
##  $ part_of_speech   : chr  "n" "n" "n" "n" ...
##  $ speaker_id       : int  2949 2949 2949 2949 2949 2949 2949 2949 2949 2949 ...
##  $ utterance_id     : int  965272 965276 965284 965326 965364 965448 965466 965624 965984 966620 ...
##  $ token_order      : int  2 2 3 3 2 2 2 2 2 3 ...
##  $ corpus_id        : int  36 36 36 36 36 36 36 36 36 36 ...
##  $ transcript_id    : int  3273 3273 3273 3273 3273 3273 3273 3273 3272 3284 ...
##  $ speaker_code     : chr  "CHI" "CHI" "CHI" "CHI" ...
##  $ speaker_name     : chr  "Adam" "Adam" "Adam" "Adam" ...
##  $ speaker_role     : chr  "Target_Child" "Target_Child" "Target_Child" "Target_Child" ...
##  $ target_child_id  : int  2949 2949 2949 2949 2949 2949 2949 2949 2949 2949 ...
##  $ target_child_age : num  27.6 27.6 27.6 27.6 27.6 ...
##  $ target_child_name: chr  "Adam" "Adam" "Adam" "Adam" ...
##  $ target_child_sex : chr  "male" "male" "male" "male" ...
##  $ utterance_type   : chr  "declarative" "declarative" "declarative" "declarative" ...
##  $ collection_id    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ collection_name  : chr  "Eng-NA" "Eng-NA" "Eng-NA" "Eng-NA" ...
##  $ english          : chr  "" "" "" "" ...
##  $ prefix           : chr  "" "" "" "" ...
##  $ suffix           : chr  "" "" "" "" ...
##  $ num_morphemes    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ language         : chr  "eng" "eng" "eng" "eng" ...
##  $ corpus_name      : chr  "Brown" "Brown" "Brown" "Brown" ...
##  $ clitic           : chr  "" "" "" "" ...
## [1] "ball" "dog"

Get types

The get_types() function works like the get_tokens() function, returning a table with a row for each type based on a set of filtering criteria. The type argument allows you to pass a vector of one or more types that you want to analyze. The main difference is that you now have a single row for each type (i.e., a concept) and a count variable that tracks the number of times that type appeared in a particular transcript.

For example, if you wanted to get all of the production data for a specific type(s), then you could run the following call to get counts of “dog” and “ball” for all of Adam’s transcripts in the Brown corpus.
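
A sketch of such a call, under the same assumptions as for tokens: role = "target_child" selects the child's own productions, and the child-name argument is target_child in recent childesr releases (older releases may expect child). The variable name d_adam_types is illustrative.

```r
library(childesr)

# Per-transcript counts of the types "dog" and "ball" for Adam.
# Assumption: `target_child` is the child-name argument in your version.
d_adam_types <- get_types(
  corpus = "Brown",
  role = "target_child",
  target_child = "Adam",
  type = c("dog", "ball")
)

head(d_adam_types)

# Total occurrences of both types combined, summed over all transcripts.
sum(d_adam_types$count)
```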

## Using current database version: '2018.1'.
## Getting data from 1 child in 1 corpus ...
## [1] "ball" "3"

Get utterances

The get_utterances function returns a table with a row for each utterance based on user-defined filtering criteria. For example, the following function call will get you all of the utterances in the Brown corpus for the child Adam.
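
A sketch of such a call, assuming the child-name argument is target_child, as in recent childesr releases (older releases may expect child). No role filter is applied, so utterances from other speakers in Adam's transcripts are returned as well. The variable name d_adam_utts is illustrative.

```r
library(childesr)

# All utterances from transcripts in which Adam is the target child.
# Assumption: `target_child` is the child-name argument in your version.
d_adam_utts <- get_utterances(corpus = "Brown", target_child = "Adam")

# Inspect the structure, then look at the first five utterances.
str(d_adam_utts)
head(d_adam_utts$gloss, 5)
```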

## Using current database version: '2018.1'.
## Getting data from 1 child in 1 corpus ...
## Classes 'tbl_df', 'tbl' and 'data.frame':    73431 obs. of  25 variables:
##  $ id               : int  964592 964598 964606 964617 964627 964633 964643 964649 964652 964657 ...
##  $ speaker_id       : int  2949 2949 2953 2949 2949 2949 2949 2953 2949 2953 ...
##  $ utterance_order  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ transcript_id    : int  3272 3272 3272 3272 3272 3272 3272 3272 3272 3272 ...
##  $ corpus_id        : int  36 36 36 36 36 36 36 36 36 36 ...
##  $ gloss            : chr  "play checkers" "big drum" "big drum" "big drum" ...
##  $ num_tokens       : int  2 2 2 2 2 2 1 1 2 3 ...
##  $ stem             : chr  "play checker" "big drum" "big drum" "big drum" ...
##  $ part_of_speech   : chr  "n n" "adj n" "adj n" "adj n" ...
##  $ speaker_code     : chr  "CHI" "CHI" "MOT" "CHI" ...
##  $ speaker_name     : chr  "Adam" "Adam" NA "Adam" ...
##  $ speaker_role     : chr  "Target_Child" "Target_Child" "Mother" "Target_Child" ...
##  $ target_child_id  : int  2949 2949 2949 2949 2949 2949 2949 2949 2949 2949 ...
##  $ target_child_age : num  27.1 27.1 27.1 27.1 27.1 ...
##  $ target_child_name: chr  "Adam" "Adam" "Adam" "Adam" ...
##  $ target_child_sex : chr  "male" "male" "male" "male" ...
##  $ type             : chr  "declarative" "declarative" "question" "declarative" ...
##  $ media_end        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ media_start      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ media_unit       : chr  NA NA NA NA ...
##  $ collection_id    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ collection_name  : chr  "Eng-NA" "Eng-NA" "Eng-NA" "Eng-NA" ...
##  $ num_morphemes    : int  3 2 2 2 2 2 1 1 2 4 ...
##  $ language         : chr  "eng" "eng" "eng" "eng" ...
##  $ corpus_name      : chr  "Brown" "Brown" "Brown" "Brown" ...
## [1] "play checkers" "big drum"      "big drum"      "big drum"     
## [5] "big drum"

Get speaker statistics

The get_speaker_statistics() function returns a table with a row for each transcript and columns that contain a set of summary statistics for that transcript. The summary statistics include:

  • number of utterances (num_utterances)
  • number of types (num_types)
  • number of tokens (num_tokens)
  • number of morphemes (num_morphemes)
  • mean length of utterances in words (mlu_w)
  • mean length of utterances in morphemes (mlu_m)

For example, if we wanted to get the summary statistics for Adam’s production data, we could run the following call.
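
A sketch of such a call, assuming role = "target_child" selects Adam's own productions and that the child-name argument is target_child, as in recent childesr releases (older releases may expect child). Averaging mlu_w is just one illustrative summary, and mlu_m could be averaged in the same way. The variable name d_adam_stats is illustrative.

```r
library(childesr)

# Summary statistics for Adam's own speech in the Brown corpus.
# Assumption: `target_child` is the child-name argument in your version.
d_adam_stats <- get_speaker_statistics(
  corpus = "Brown",
  role = "target_child",
  target_child = "Adam"
)

# Average mean length of utterance in words across the returned rows.
mean(d_adam_stats$mlu_w)
```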

## Using current database version: '2018.1'.
## [1] 3.567691