calc_tfidf_ngrams.Rd
For a given labelled text, return the unigrams or bigrams with the largest TF-IDFs for the given class(es).
calc_tfidf_ngrams( x, target_col_name, text_col_name, filter_class = NULL, ngrams_type = c("Unigrams", "Bigrams"), number_of_ngrams = NULL )
Argument | Description |
---|---|
x | A data frame with two columns: the column with the classes and the column with the text. |
target_col_name | A string with the column name of the target variable. It is equivalent to argument |
text_col_name | A string with the column name of the text variable. |
filter_class | A string or vector of strings with the name(s) of the class(es) for which TF-IDFs are to be calculated. Defaults to NULL. |
ngrams_type | A string. Should be "Unigrams" for unigrams and "Bigrams" for bigrams. |
number_of_ngrams | Integer. Number of n-grams to return. Defaults to all. |
A data frame with six columns: class; n-gram (word or bigram); count; term-frequency; inverse document frequency; and TF-IDF.
Unlike other functions in experienceAnalysis (e.g. calc_net_sentiment_nrc), here it does not make sense to set target_col_name to NULL: the TF-IDF of an n-gram depends on the number of "documents" containing it (see Silge and Robinson, 2017), so there must be at least two classes (or "documents") to use in the calculations.
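The dependence on the number of documents can be made explicit with the usual definition. The block below is a sketch that assumes the natural-log form of the inverse document frequency described in Silge and Robinson (2017); that assumption is consistent with the idf value of 0.693 (that is, ln 2) shown in the two-book example output on this page, but the symbols themselves are introduced here for illustration only.

```latex
% n_{t,d}: count of n-gram t in document (class) d
% N: total number of documents (classes)
\[
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\lvert\{d : t \in d\}\rvert}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\,\mathrm{idf}(t)
\]
% Two classes, n-gram found in only one of them:
\[
\mathrm{idf} = \ln\frac{2}{1} \approx 0.693, \qquad
\text{tf-idf} \approx 0.00629 \times 0.693 \approx 0.00436
\]
% A single class (N = 1) gives idf = ln(1/1) = 0 for every n-gram.
```

The second line reproduces the "miss woodhouse" row of the example output. With only one class every idf is zero, which is why at least two classes are needed.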
When filter_class is not NULL, the TF-IDFs are still calculated using all classes/documents, and the result is then filtered by filter_class.
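The order of operations matters, as the following base-R sketch illustrates. It is not the package's implementation: `toy_tfidf` is a hypothetical helper written only for this illustration, the corpus is a made-up pair of pre-tokenised "documents", and the natural-log IDF is an assumption.

```r
# Toy corpus: two "documents" (classes), already tokenised into words.
docs <- list(
  Emma  = c("miss", "woodhouse", "miss", "bates"),
  Pride = c("miss", "bingley", "lady", "catherine")
)

# Hypothetical helper (not part of experienceAnalysis): TF-IDF with a
# natural-log IDF, returning one row per (class, term).
toy_tfidf <- function(docs) {
  n_docs <- length(docs)
  # Number of documents that contain each term at least once
  doc_freq <- table(unlist(lapply(docs, unique)))
  rows <- lapply(names(docs), function(cls) {
    counts <- table(docs[[cls]])
    terms <- names(counts)
    tf <- as.numeric(counts) / sum(counts)
    idf <- log(n_docs / as.numeric(doc_freq[terms]))
    data.frame(class = cls, ngram = terms, n = as.integer(counts),
               tf = tf, idf = idf, tf_idf = tf * idf)
  })
  do.call(rbind, rows)
}

# Calculate over all classes first, then keep only the class of interest.
all_classes <- toy_tfidf(docs)
emma_after <- all_classes[all_classes$class == "Emma", ]

# Filtering before the calculation leaves a single document,
# so every IDF (and hence every TF-IDF) collapses to zero.
emma_before <- toy_tfidf(docs["Emma"])
```

In `emma_after`, terms unique to the class keep an IDF of log(2), whereas in `emma_before` every TF-IDF is zero. Calculating first and filtering afterwards keeps the scores meaningful.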
Silge J. & Robinson D. (2017). Text Mining with R: A Tidy Approach. Sebastopol, CA: O’Reilly Media. ISBN 978-1-491-98165-8.
library(experienceAnalysis)

books <- janeaustenr::austen_books() # Jane Austen books

# Note: paste() is applied to whole data-frame rows here, so the integer codes
# of the book factor column end up in the text. The "4 4" and "2 2" bigrams at
# the top of the output below appear to come from this. Pasting only
# books$text[books$book == "Emma"] would avoid them.
emma <- paste(books[books$book == "Emma", ], collapse = " ") # String with whole book
pp <- paste(books[books$book == "Pride & Prejudice", ], collapse = " ") # String with whole book

# Make data frame with books Emma and Pride & Prejudice
x <- data.frame(
  text = c(emma, pp),
  book = c("Emma", "Pride & Prejudice")
)

# Get a few of the bigram counts, TFs, IDFs and highest TF-IDFs for each book
calc_tfidf_ngrams(x, target_col_name = "book", text_col_name = "text",
  filter_class = NULL,
  ngrams_type = "Bigrams",
  number_of_ngrams = 30
) %>%
  split(.$book)
#> $Emma
#> # A tibble: 15 x 6
#>    book  ngram                n       tf   idf   tf_idf
#>    <chr> <chr>            <int>    <dbl> <dbl>    <dbl>
#>  1 Emma  4 4              16234 0.630    0.693 0.437
#>  2 Emma  miss woodhouse     162 0.00629  0.693 0.00436
#>  3 Emma  frank churchill    132 0.00512  0.693 0.00355
#>  4 Emma  miss fairfax       109 0.00423  0.693 0.00293
#>  5 Emma  miss bates         103 0.00400  0.693 0.00277
#>  6 Emma  jane fairfax        96 0.00373  0.693 0.00258
#>  7 Emma  john knightley      56 0.00217  0.693 0.00151
#>  8 Emma  miss smith          51 0.00198  0.693 0.00137
#>  9 Emma  miss taylor         40 0.00155  0.693 0.00108
#> 10 Emma  dear emma           31 0.00120  0.693 0.000834
#> 11 Emma  maple grove         31 0.00120  0.693 0.000834
#> 12 Emma  cried emma          27 0.00105  0.693 0.000727
#> 13 Emma  harriet smith       27 0.00105  0.693 0.000727
#> 14 Emma  robert martin       27 0.00105  0.693 0.000727
#> 15 Emma  colonel campbell    21 0.000815 0.693 0.000565
#>
#> $`Pride & Prejudice`
#> # A tibble: 15 x 6
#>    book              ngram                   n       tf   idf   tf_idf
#>    <chr>             <chr>               <int>    <dbl> <dbl>    <dbl>
#>  1 Pride & Prejudice 2 2                 13029 0.643    0.693 0.445
#>  2 Pride & Prejudice lady catherine        100 0.00493  0.693 0.00342
#>  3 Pride & Prejudice miss bingley           72 0.00355  0.693 0.00246
#>  4 Pride & Prejudice miss bennet            60 0.00296  0.693 0.00205
#>  5 Pride & Prejudice sir william            38 0.00187  0.693 0.00130
#>  6 Pride & Prejudice de bourgh              35 0.00173  0.693 0.00120
#>  7 Pride & Prejudice miss darcy             34 0.00168  0.693 0.00116
#>  8 Pride & Prejudice colonel forster        26 0.00128  0.693 0.000889
#>  9 Pride & Prejudice colonel fitzwilliam    25 0.00123  0.693 0.000855
#> 10 Pride & Prejudice cried elizabeth        24 0.00118  0.693 0.000821
#> 11 Pride & Prejudice miss lucas             23 0.00113  0.693 0.000786
#> 12 Pride & Prejudice miss de                20 0.000987 0.693 0.000684
#> 13 Pride & Prejudice lady lucas             18 0.000888 0.693 0.000615
#> 14 Pride & Prejudice replied elizabeth      18 0.000888 0.693 0.000615
#> 15 Pride & Prejudice lady catherine's       16 0.000789 0.693 0.000547
#>