For a given labelled text, return the unigrams or bigrams with the largest TF-IDFs for the given class(es).

calc_tfidf_ngrams(
  x,
  target_col_name,
  text_col_name,
  filter_class = NULL,
  ngrams_type = c("Unigrams", "Bigrams"),
  number_of_ngrams = NULL
)

Arguments

x

A data frame with two columns: one with the classes and one with the text.

target_col_name

A string with the column name of the target variable (the classes). It plays the role of the document argument in tidytext::bind_tf_idf.

text_col_name

A string with the column name of the text variable.

filter_class

A string or vector of strings with the name(s) of the class(es) for which TF-IDFs are to be calculated. Defaults to NULL (all classes).

ngrams_type

A string. Either "Unigrams" for unigrams or "Bigrams" for bigrams.

number_of_ngrams

Integer. Number of n-grams to return. Defaults to NULL (all n-grams).

Value

A data frame with six columns: class; n-gram (word or bigram); count; term frequency; inverse document frequency; and TF-IDF.

Note

Unlike other functions in experienceAnalysis (e.g. calc_net_sentiment_nrc), here it does not make sense to set target_col_name to NULL. The TF-IDF of an n-gram depends on the number of "documents" containing it (see Silge and Robinson, 2017), so there must be at least two classes (or "documents") to use in the calculations.
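
The dependence on the number of documents can be sketched in base R. This is an illustration only, not the package's implementation: it assumes the usual definitions (term frequency as an n-gram's share of its document, inverse document frequency as the natural log of the number of documents divided by the number of documents containing the n-gram), and the data frame counts and its columns doc, term and n are made up for the example.

```r
# Toy counts, one row per (document, n-gram) pair. All names are illustrative.
counts <- data.frame(
  doc  = c("A", "A", "B", "B"),
  term = c("miss woodhouse", "of the", "lady catherine", "of the"),
  n    = c(3, 7, 2, 8)
)

# Term frequency: each n-gram's share of all n-grams in its document
counts$tf <- counts$n / ave(counts$n, counts$doc, FUN = sum)

# Inverse document frequency: ln(number of documents / documents containing the n-gram)
n_docs <- length(unique(counts$doc))
docs_with_term <- ave(counts$n, counts$term, FUN = length)
counts$idf <- log(n_docs / docs_with_term)

counts$tf_idf <- counts$tf * counts$idf
print(counts)

# With a single document, every n-gram appears in 1 of 1 documents,
# so every IDF is ln(1) = 0 and every TF-IDF is zero as well
single <- counts[counts$doc == "A", ]
single_idf <- log(1 / ave(single$n, single$term, FUN = length))
print(single_idf)
```

With two documents, an n-gram found in only one of them gets an IDF of ln(2), roughly 0.693, which is the value shown in the idf column of the example output. An n-gram found in both gets an IDF of zero.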

When filter_class is not NULL, the TF-IDFs will still be calculated using all classes/documents and then filtered by filter_class.
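
The order of operations matters, as the following base R sketch suggests. The helper tfidf() and the toy data are hypothetical and are not part of experienceAnalysis; they assume the same TF and IDF definitions as tidytext::bind_tf_idf.

```r
# Hypothetical helper (not part of experienceAnalysis): TF-IDF from a data
# frame with columns doc (class), term (n-gram) and n (count)
tfidf <- function(counts) {
  tf  <- counts$n / ave(counts$n, counts$doc, FUN = sum)
  idf <- log(length(unique(counts$doc)) /
               ave(counts$n, counts$term, FUN = length))
  cbind(counts, tf_idf = tf * idf)
}

counts <- data.frame(
  doc  = c("A", "A", "B", "B"),
  term = c("miss woodhouse", "of the", "lady catherine", "of the"),
  n    = c(3, 7, 2, 8)
)

# Score against all classes first, then keep only class "A"
all_then_filter <- tfidf(counts)
all_then_filter <- all_then_filter[all_then_filter$doc == "A", ]
print(all_then_filter)

# Filtering first would change the document count used for the IDF:
# only one document remains, so every score collapses to zero
filter_then_all <- tfidf(counts[counts$doc == "A", ])
print(filter_then_all)
```

Scoring before filtering keeps the other classes in the IDF calculation, so the n-grams that are distinctive for the selected class retain non-zero scores.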

References

Silge J. & Robinson D. (2017). Text Mining with R: A Tidy Approach. Sebastopol, CA: O’Reilly Media. ISBN 978-1-491-98165-8.

Examples

library(experienceAnalysis)

books <- janeaustenr::austen_books() # Jane Austen books

# Note: subsetting whole rows pastes both columns (text and book), so each
# string also contains the deparsed codes of the book factor. That is where
# the "4 4" and "2 2" bigrams at the top of the output come from. Use
# books$text[books$book == "Emma"] to paste the text alone.
emma <- paste(books[books$book == "Emma", ], collapse = " ") # String with whole book
pp <- paste(books[books$book == "Pride & Prejudice", ], collapse = " ") # String with whole book

# Make data frame with books Emma and Pride & Prejudice
x <- data.frame(
  text = c(emma, pp),
  book = c("Emma", "Pride & Prejudice")
)

# Get a few of the bigram counts, TFs, IDFs and highest TF-IDFs for each book
calc_tfidf_ngrams(x,
  target_col_name = "book",
  text_col_name = "text",
  filter_class = NULL,
  ngrams_type = "Bigrams",
  number_of_ngrams = 30
) %>%
  split(.$book)
#> $Emma
#> # A tibble: 15 x 6
#>    book  ngram                n       tf   idf   tf_idf
#>    <chr> <chr>            <int>    <dbl> <dbl>    <dbl>
#>  1 Emma  4 4              16234 0.630    0.693 0.437
#>  2 Emma  miss woodhouse     162 0.00629  0.693 0.00436
#>  3 Emma  frank churchill    132 0.00512  0.693 0.00355
#>  4 Emma  miss fairfax       109 0.00423  0.693 0.00293
#>  5 Emma  miss bates         103 0.00400  0.693 0.00277
#>  6 Emma  jane fairfax        96 0.00373  0.693 0.00258
#>  7 Emma  john knightley      56 0.00217  0.693 0.00151
#>  8 Emma  miss smith          51 0.00198  0.693 0.00137
#>  9 Emma  miss taylor         40 0.00155  0.693 0.00108
#> 10 Emma  dear emma           31 0.00120  0.693 0.000834
#> 11 Emma  maple grove         31 0.00120  0.693 0.000834
#> 12 Emma  cried emma          27 0.00105  0.693 0.000727
#> 13 Emma  harriet smith       27 0.00105  0.693 0.000727
#> 14 Emma  robert martin       27 0.00105  0.693 0.000727
#> 15 Emma  colonel campbell    21 0.000815 0.693 0.000565
#>
#> $`Pride & Prejudice`
#> # A tibble: 15 x 6
#>    book              ngram                   n       tf   idf   tf_idf
#>    <chr>             <chr>               <int>    <dbl> <dbl>    <dbl>
#>  1 Pride & Prejudice 2 2                 13029 0.643    0.693 0.445
#>  2 Pride & Prejudice lady catherine        100 0.00493  0.693 0.00342
#>  3 Pride & Prejudice miss bingley           72 0.00355  0.693 0.00246
#>  4 Pride & Prejudice miss bennet            60 0.00296  0.693 0.00205
#>  5 Pride & Prejudice sir william            38 0.00187  0.693 0.00130
#>  6 Pride & Prejudice de bourgh              35 0.00173  0.693 0.00120
#>  7 Pride & Prejudice miss darcy             34 0.00168  0.693 0.00116
#>  8 Pride & Prejudice colonel forster        26 0.00128  0.693 0.000889
#>  9 Pride & Prejudice colonel fitzwilliam    25 0.00123  0.693 0.000855
#> 10 Pride & Prejudice cried elizabeth        24 0.00118  0.693 0.000821
#> 11 Pride & Prejudice miss lucas             23 0.00113  0.693 0.000786
#> 12 Pride & Prejudice miss de                20 0.000987 0.693 0.000684
#> 13 Pride & Prejudice lady lucas             18 0.000888 0.693 0.000615
#> 14 Pride & Prejudice replied elizabeth      18 0.000888 0.693 0.000615
#> 15 Pride & Prejudice lady catherine's       16 0.000789 0.693 0.000547
#>