Create and count bigrams — calc_bigrams_network • experienceAnalysis

For a given labelled text, create bigrams (stop words excluded) and count the most frequently occurring ones for the given class(es).

calc_bigrams_network(
  x,
  target_col_name,
  text_col_name,
  filter_class = NULL,
  bigrams_prop
)

Arguments

x	A data frame with one or more columns: the column with the classes (if `target_col_name` is not `NULL`); and the column with the text. Any other columns will be ignored.
target_col_name	A string with the column name of the target variable, or `NULL` when `x` has no class column. The usage above shows no default value, so supply it explicitly.
text_col_name	A string with the column name of the text variable.
filter_class	A string or vector of strings with the name(s) of the class(es) for which bigrams are to be created and counted. Defaults to `NULL` (all rows).
bigrams_prop	A numeric in (0, 100] indicating the percentage of the most frequent bigrams to keep.

Value

A data frame with three columns: `word1` (first word of the bigram), `word2` (second word of the bigram) and `n` (bigram count).
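To make the shape of the returned data frame concrete, here is a minimal base R sketch of bigram counting. It is not the package's implementation: the function name `count_bigrams_sketch`, the tokenizer, the caller-supplied `stop_words` vector, and the reading of `prop` as "keep the top `prop` percent of distinct bigrams, ranked by count" are all assumptions made for illustration.

```r
# Illustrative sketch only (hypothetical helper, not part of experienceAnalysis).
# `text` is a single string; `stop_words` is a character vector supplied by
# the caller; `prop` is a percentage in (0, 100].
count_bigrams_sketch <- function(text, stop_words, prop) {
  empty <- data.frame(word1 = character(0), word2 = character(0),
                      n = integer(0), stringsAsFactors = FALSE)

  # Lower-case, then split on anything that is not a letter or an apostrophe
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nzchar(words)]
  if (length(words) < 2) return(empty)

  # Pair every word with its successor
  word1 <- words[-length(words)]
  word2 <- words[-1]

  # Drop bigrams in which either word is a stop word
  keep <- !(word1 %in% stop_words) & !(word2 %in% stop_words)
  if (!any(keep)) return(empty)
  pairs <- data.frame(word1 = word1[keep], word2 = word2[keep], n = 1L,
                      stringsAsFactors = FALSE)

  # Count identical bigrams and sort by descending count
  counts <- aggregate(n ~ word1 + word2, data = pairs, FUN = sum)
  counts <- counts[order(-counts$n, counts$word1, counts$word2), ]

  # Assumed interpretation of the percentage: top share of distinct bigrams
  n_keep <- max(1L, ceiling(nrow(counts) * prop / 100))
  counts <- counts[seq_len(n_keep), ]
  rownames(counts) <- NULL
  counts
}

toy <- count_bigrams_sketch(
  "Miss Bates saw Miss Bates and Frank Churchill",
  stop_words = c("and", "saw"),
  prop = 100
)
```

In this toy call two distinct bigrams survive the stop-word filter, and the result has the same `word1`, `word2`, `n` layout as the output shown in the examples. The real function presumably uses a full stop-word lexicon and may apply the percentage differently.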

Note

When more than one class is supplied in `filter_class`, the returned data frame does NOT separate the results by class. If separation is needed, either run the function once per class, or split the data by class and map the function over the pieces:

# Assuming that the class and text columns are called "label" and
# "feedback" respectively
library(magrittr) # provides %>%

x %>%
    split(.$label) %>%
    purrr::map(
        ~ calc_bigrams_network(., target_col_name = NULL,
                               text_col_name = "feedback",
                               filter_class = NULL, bigrams_prop = 50)
    )
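The other option mentioned in the note, running the function once per class, can be sketched as a small wrapper that also stacks the per-class results into one data frame. The helper name `bigrams_by_class` and the added `class` column are hypothetical. The `bigram_fun` argument exists only so that the sketch can be tried with any function that has the same arguments as `calc_bigrams_network`.

```r
# Hypothetical helper (not part of experienceAnalysis): call the bigram
# function once per class via `filter_class`, tag each result with its class,
# and stack the results.
bigrams_by_class <- function(x, target_col_name, text_col_name, bigrams_prop,
                             bigram_fun = experienceAnalysis::calc_bigrams_network) {
  classes <- unique(as.character(x[[target_col_name]]))

  per_class <- lapply(classes, function(cl) {
    res <- bigram_fun(
      x,
      target_col_name = target_col_name,
      text_col_name = text_col_name,
      filter_class = cl,
      bigrams_prop = bigrams_prop
    )
    # rep() keeps the assignment valid even if a class yields zero rows
    res$class <- rep(cl, nrow(res))
    res
  })

  do.call(rbind, per_class)
}

# Usage, assuming columns "label" and "feedback" as in the note above:
# bigrams_by_class(x, "label", "feedback", bigrams_prop = 50)
```

Compared with the `split()` approach, this keeps everything in a single data frame, which can be easier to filter or plot by class afterwards.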

Examples

library(experienceAnalysis)
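books <- janeaustenr::austen_books() # Jane Austen books
# Note: paste() runs on every column, so the `book` factor is collapsed too; its
# integer codes cause the "4 4" and "2 2" bigrams in the output below. Pasting
# books$text[books$book == "Emma"] instead would keep the text only.
emma <- paste(books[books$book == "Emma", ], collapse = " ") # String with whole book
pp <- paste(books[books$book == "Pride & Prejudice", ], collapse = " ") # String with whole book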

# Make data frame with books Emma and Pride & Prejudice
x <- data.frame(
  text = c(emma, pp),
  book = c("Emma", "Pride & Prejudice")
)

# Bigrams for both books
calc_bigrams_network(x, target_col_name = "book", text_col_name = "text",
                     filter_class = NULL, bigrams_prop = 3)
#>       word1       word2     n
#> 1         4           4 16234
#> 2         2           2 13029
#> 3      miss   woodhouse   162
#> 4     frank   churchill   132
#> 5      miss     fairfax   109
#> 6      miss       bates   103
#> 7      lady   catherine   100
#> 8      jane     fairfax    96
#> 9      miss     bingley    72
#> 10     miss      bennet    60
#> 11     john   knightley    56
#> 12     miss       smith    51
#> 13     miss      taylor    40
#> 14      sir     william    38
#> 15       de      bourgh    35
#> 16     miss       darcy    34
#> 17     dear        emma    31
#> 18     dear        miss    31
#> 19    maple       grove    31
#> 20    cried        emma    27
#> 21  harriet       smith    27
#> 22   robert      martin    27
#> 23  colonel     forster    26
#> 24  colonel fitzwilliam    25
#> 25    cried   elizabeth    24
#> 26     dear         sir    23
#> 27     miss       lucas    23
#> 28 thousand      pounds    23
#> 29     dear        jane    22
#> 30  colonel    campbell    21
#> 31     miss          de    20
#> 32    frank churchill's    19
#> 33      box        hill    18
#> 34     lady       lucas    18
#> 35  replied   elizabeth    18

# Bigrams for Emma
calc_bigrams_network(x, target_col_name = "book", text_col_name = "text",
                     filter_class = "Emma", bigrams_prop = 3)
#>      word1       word2     n
#> 1        4           4 16234
#> 2     miss   woodhouse   162
#> 3    frank   churchill   132
#> 4     miss     fairfax   109
#> 5     miss       bates   103
#> 6     jane     fairfax    96
#> 7     john   knightley    56
#> 8     miss       smith    51
#> 9     miss      taylor    40
#> 10    dear        emma    31
#> 11   maple       grove    31
#> 12   cried        emma    27
#> 13 harriet       smith    27
#> 14  robert      martin    27
#> 15    dear        miss    25
#> 16 colonel    campbell    21
#> 17   frank churchill's    19

# Bigrams for Pride & Prejudice
calc_bigrams_network(x, target_col_name = "book", text_col_name = "text",
                     filter_class = "Pride & Prejudice", bigrams_prop = 3)
#>       word1       word2     n
#> 1         2           2 13029
#> 2      lady   catherine   100
#> 3      miss     bingley    72
#> 4      miss      bennet    60
#> 5       sir     william    38
#> 6        de      bourgh    35
#> 7      miss       darcy    34
#> 8   colonel     forster    26
#> 9   colonel fitzwilliam    25
#> 10    cried   elizabeth    24
#> 11     miss       lucas    23
#> 12     miss          de    20
#> 13 thousand      pounds    20
#> 14     lady       lucas    18
#> 15  replied   elizabeth    18