Calculate TF-IDFs for unigrams or bigrams — calc_tfidf

For a given labelled text, return the unigrams or bigrams with the largest TF-IDFs for the given class(es).

calc_tfidf_ngrams(
  x,
  target_col_name,
  text_col_name,
  filter_class = NULL,
  ngrams_type = c("Unigrams", "Bigrams"),
  number_of_ngrams = NULL
)

Arguments

x	A data frame with two columns: the column with the classes; and the column with the text.
target_col_name	A string with the column name of the target variable (the classes). It plays the role of argument `document` in `tidytext::bind_tf_idf`.
text_col_name	A string with the column name of the text variable.
filter_class	A string or vector of strings with the name(s) of the class(es) for which TF-IDFs are to be calculated. Defaults to `NULL` (all classes).
ngrams_type	A string. Either "Unigrams" or "Bigrams".
number_of_ngrams	Integer. Number of n-grams to return. Defaults to `NULL` (all n-grams).

Value

A data frame with six columns: class; n-gram (word or bigram); count; term frequency (TF); inverse document frequency (IDF); and TF-IDF.

Note

Unlike other functions in experienceAnalysis (e.g. calc_net_sentiment_nrc), it does not make sense to set target_col_name to NULL here. The TF-IDF of an n-gram depends on the number of "documents" containing it (see Silge and Robinson, 2017), so there must be at least two classes (or "documents") to use in the calculations.
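To make the dependence on the number of classes concrete, here is a minimal sketch of the IDF arithmetic. It assumes the natural-log definition used by `tidytext::bind_tf_idf`, i.e. idf = ln(number of documents / number of documents containing the term). The helper `idf` is hypothetical and is not part of experienceAnalysis.

```r
# Hypothetical helper: IDF with the natural log, as in tidytext::bind_tf_idf
idf <- function(n_docs, n_docs_with_term) log(n_docs / n_docs_with_term)

idf(2, 1) # n-gram found in one of two classes: log(2), about 0.693
idf(2, 2) # n-gram found in both classes: 0, so its TF-IDF is 0
idf(1, 1) # a single class: always 0, so every TF-IDF would be 0
```

The first case corresponds to the idf column in the Examples output, where every listed bigram occurs in only one of the two books.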

When filter_class is not NULL, the TF-IDFs are still calculated using all classes/documents; the result is then filtered by filter_class.
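The order of operations matters because filtering first would change the document count that the IDF is based on. The following is a rough sketch of "calculate, then filter" using dplyr and tidytext directly. It is an illustration on hypothetical toy data, not necessarily how calc_tfidf_ngrams is implemented internally; the object names (`tfidf_all`, `tfidf_a`) are made up for this example.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: three classes, one text each
x <- data.frame(
  class = c("A", "B", "C"),
  text = c("good care good staff", "good food", "long wait"),
  stringsAsFactors = FALSE
)

# Step 1: TF-IDFs over ALL classes (unigrams here, for brevity)
tfidf_all <- x %>%
  unnest_tokens(ngram, text) %>% # one row per word
  count(class, ngram) %>%        # n-gram counts per class
  bind_tf_idf(ngram, class, n)   # IDF is based on all three classes

# Step 2: only now keep the class(es) of interest
tfidf_a <- filter(tfidf_all, class %in% "A")
tfidf_a
```

In this sketch, "good" occurs in two of the three classes, so its IDF in class A is log(3/2). Had the data been filtered to class A before the calculation, there would be a single document and every IDF would be zero.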

References

Silge J. & Robinson D. (2017). Text Mining with R: A Tidy Approach. Sebastopol, CA: O’Reilly Media. ISBN 978-1-491-98165-8.

Examples

library(experienceAnalysis)
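books <- janeaustenr::austen_books() # Jane Austen books
# NOTE: the row subsets below keep both columns, so paste() also collapses the
# `book` factor (as integer level codes). This appears to be the source of the
# "4 4" and "2 2" rows in the output; use books$text[books$book == "Emma"] to paste text only.
emma <- paste(books[books$book == "Emma", ], collapse = " ") # String with whole book
pp <- paste(books[books$book == "Pride & Prejudice", ], collapse = " ") # String with whole book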

# Make data frame with books Emma and Pride & Prejudice
x <- data.frame(
  text = c(emma, pp),
  book = c("Emma", "Pride & Prejudice")
)

# Get a few of the bigram counts, TFs, IDFs and highest TF-IDFs for each book
calc_tfidf_ngrams(x, target_col_name = "book", text_col_name = "text",
                  filter_class = NULL,
                  ngrams_type = "Bigrams",
                  number_of_ngrams = 30
) %>%
  split(.$book)
#> $Emma
#> # A tibble: 15 x 6
#>    book  ngram                n       tf   idf   tf_idf
#>    <chr> <chr>            <int>    <dbl> <dbl>    <dbl>
#>  1 Emma  4 4              16234 0.630    0.693 0.437   
#>  2 Emma  miss woodhouse     162 0.00629  0.693 0.00436 
#>  3 Emma  frank churchill    132 0.00512  0.693 0.00355 
#>  4 Emma  miss fairfax       109 0.00423  0.693 0.00293 
#>  5 Emma  miss bates         103 0.00400  0.693 0.00277 
#>  6 Emma  jane fairfax        96 0.00373  0.693 0.00258 
#>  7 Emma  john knightley      56 0.00217  0.693 0.00151 
#>  8 Emma  miss smith          51 0.00198  0.693 0.00137 
#>  9 Emma  miss taylor         40 0.00155  0.693 0.00108 
#> 10 Emma  dear emma           31 0.00120  0.693 0.000834
#> 11 Emma  maple grove         31 0.00120  0.693 0.000834
#> 12 Emma  cried emma          27 0.00105  0.693 0.000727
#> 13 Emma  harriet smith       27 0.00105  0.693 0.000727
#> 14 Emma  robert martin       27 0.00105  0.693 0.000727
#> 15 Emma  colonel campbell    21 0.000815 0.693 0.000565
#> 
#> $`Pride & Prejudice`
#> # A tibble: 15 x 6
#>    book              ngram                   n       tf   idf   tf_idf
#>    <chr>             <chr>               <int>    <dbl> <dbl>    <dbl>
#>  1 Pride & Prejudice 2 2                 13029 0.643    0.693 0.445   
#>  2 Pride & Prejudice lady catherine        100 0.00493  0.693 0.00342 
#>  3 Pride & Prejudice miss bingley           72 0.00355  0.693 0.00246 
#>  4 Pride & Prejudice miss bennet            60 0.00296  0.693 0.00205 
#>  5 Pride & Prejudice sir william            38 0.00187  0.693 0.00130 
#>  6 Pride & Prejudice de bourgh              35 0.00173  0.693 0.00120 
#>  7 Pride & Prejudice miss darcy             34 0.00168  0.693 0.00116 
#>  8 Pride & Prejudice colonel forster        26 0.00128  0.693 0.000889
#>  9 Pride & Prejudice colonel fitzwilliam    25 0.00123  0.693 0.000855
#> 10 Pride & Prejudice cried elizabeth        24 0.00118  0.693 0.000821
#> 11 Pride & Prejudice miss lucas             23 0.00113  0.693 0.000786
#> 12 Pride & Prejudice miss de                20 0.000987 0.693 0.000684
#> 13 Pride & Prejudice lady lucas             18 0.000888 0.693 0.000615
#> 14 Pride & Prejudice replied elizabeth      18 0.000888 0.693 0.000615
#> 15 Pride & Prejudice lady catherine's       16 0.000789 0.693 0.000547
#>