R : word frequency in a dataframe
In this quick tutorial we'll calculate word frequencies and visualize them.
It's a comparatively easy task.
BUT when it comes to stopwords and a language other than English, there can be some difficulties.
I have a dataframe whose text field is in Russian.
Step 0 : Install required libraries
install.packages("tidyverse")
install.packages("tidytext")
install.packages("tm")
library(tidyverse)
library(tidytext)
library(tm)
Step 1 : Create stopwords dataframe
#create stopwords DF (tm ships a built-in Russian stopword list)
rus_stopwords = data.frame(word = stopwords("ru"))
Step 2 : Tokenize
new_df <- video %>%
  unnest_tokens(word, text) %>%
  anti_join(rus_stopwords)
# anti_join() removes the rows that match the stopwords
# video is the name of the dataframe
# word is the name of the new column of tokens
# text is simply the field that holds our text
Step 3 : Count words
frequency_dataframe = new_df %>%
  count(word) %>%
  arrange(desc(n))
Step 4 (Optional) : Take only the first 20 rows of the dataframe
short_dataframe = head(frequency_dataframe, 20)
Step 5 : Visualize with ggplot
ggplot(short_dataframe, aes(x = word, y = n, fill = word)) + geom_col()
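If the bars come out in alphabetical order with cramped x-axis labels, a small optional tweak is to reorder the bars by frequency and flip the axes. This is just a sketch of one possible variation, reusing the same short_dataframe from the previous step:

```r
ggplot(short_dataframe, aes(x = reorder(word, n), y = n, fill = word)) +
  geom_col(show.legend = FALSE) +  # the fill legend only repeats the axis labels
  coord_flip() +                   # horizontal bars keep long Cyrillic words readable
  labs(x = NULL, y = "n")
```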
So in my case it looked like this: