R : word frequency in a dataframe


Alright, so in this quick tutorial we'll calculate word frequency and visualize it.

It is a comparatively easy task.
BUT when it comes to stopwords and a language different from English, there might be some difficulties.

I have a dataframe with a text field in Russian.

Step 0 : Install required libraries

install.packages("tidyverse")
install.packages("tidytext")
install.packages("tm")
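Installing the packages only downloads them; each R session also has to load them before the functions below are available. A minimal sketch:

```r
# Load the installed packages (needed once per R session)
library(tidyverse)   # pipes, dplyr verbs, ggplot2
library(tidytext)    # unnest_tokens()
library(tm)          # stopwords()
```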

Step 1 : Create stopwords dataframe

# create stopwords DF
rus_stopwords = data.frame(word = stopwords("ru"))

Step 2 : Tokenize

new_df <- video %>% unnest_tokens(word, text) %>% anti_join(rus_stopwords)

# anti_join - function to remove stopwords
# video - the name of the dataframe
# word - the name of the new field
# text - the field with our text
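As a side note, `anti_join()` will print a "Joining, by = ..." message because it guesses the join column. Passing `by = "word"` explicitly (assuming the column names from the step above) keeps the output quiet and makes the intent clear:

```r
# Equivalent, with the join key stated explicitly
new_df <- video %>%
  unnest_tokens(word, text) %>%
  anti_join(rus_stopwords, by = "word")
```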

Step 3 : Count words

frequency_dataframe = new_df %>% count(word) %>% arrange(desc(n))

Step 4 (Optional) : Take only the first 20 items from the dataframe

short_dataframe = head(frequency_dataframe, 20)

Step 5 : Visualize with ggplot

ggplot(short_dataframe, aes(x = word, y = n, fill = word)) + geom_col()
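A small optional tweak, sketched here as a suggestion rather than part of the original tutorial: ordering the bars by frequency and flipping the axes usually makes a word-frequency chart easier to read, especially with longer Russian words on the axis.

```r
# Optional: order bars by frequency and flip axes for readability.
# reorder() and coord_flip() are standard ggplot2/base R features;
# column names (word, n) are taken from the steps above.
ggplot(short_dataframe, aes(x = reorder(word, n), y = n, fill = word)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(x = "word", y = "frequency")
```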

So in my case it looked like this:

[Screenshot: the resulting bar chart]

