News headlines text analysis | DataScience+


In this tutorial, I present an introductory text analysis of an ABC News headlines dataset. I will take a look at the most common words it contains and run a sentiment analysis on the headlines by taking advantage of the following sentiment lexicons:

  • NRC
  • Bing
  • AFINN

The NRC sentiment lexicon from Saif Mohammad and Peter Turney categorizes words into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise and trust.

The Bing sentiment lexicon from Bing Liu and collaborators categorizes words into positive or negative sentiment classes.

The AFINN sentiment lexicon from Finn Arup Nielsen assigns words a score between -5 and +5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.

For more information about these sentiment lexicons, see the references listed at the bottom. For a quick preview of how differently the three lexicons encode sentiment, see the short sketch right below.
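
Here is a minimal sketch (assuming the lexicons have already been downloaded through the textdata package, as described in the Note further down) that looks up one and the same word in each lexicon.

    library(tidytext)
    library(dplyr)

    # The same word is encoded differently by the three lexicons:
    get_sentiments("nrc")   %>% filter(word == "abandon")   # one row per emotion category
    get_sentiments("bing")  %>% filter(word == "abandon")   # a single positive/negative label
    get_sentiments("afinn") %>% filter(word == "abandon")   # a numeric value between -5 and 5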

Packages

I am going to take advantage of the following R packages.

    suppressPackageStartupMessages(library(stringr))
    suppressPackageStartupMessages(library(dplyr))
    suppressPackageStartupMessages(library(tidytext))
    suppressPackageStartupMessages(library(tidyr))
    suppressPackageStartupMessages(library(textdata))
    suppressPackageStartupMessages(library(widyr))
    suppressPackageStartupMessages(library(ggplot2))
    

The package versions in use are listed below.

    packages <- c("stringr", "dplyr", "tidytext", "tidyr", "textdata", "widyr", "ggplot2")
    version <- lapply(packages, packageVersion)
    version_c <- do.call(c, version)
    data.frame(packages = packages, version = as.character(version_c))
    ##   packages version
    ## 1  stringr   1.4.0
    ## 2    dplyr   0.8.4
    ## 3 tidytext   0.2.2
    ## 4    tidyr   1.0.2
    ## 5 textdata   0.3.0
    ## 6    widyr   0.1.2
    ## 7  ggplot2   3.2.1
    

I am running the following R version on Windows 10.

    R.version
    ##                _                           
    ## platform       x86_64-w64-mingw32          
    ## arch           x86_64                      
    ## os             mingw32                     
    ## system         x86_64, mingw32             
    ## status                                     
    ## major          3                           
    ## minor          5.3                         
    ## year           2019                        
    ## month          03                          
    ## day            11                          
    ## svn rev        76217                       
    ## language       R                           
    ## version.string R version 3.5.3 (2019-03-11)
    ## nickname       Great Truth
    

Note

Before running this code, make sure you have downloaded the sentiment lexicons in use by executing the following operations:

    get_sentiments("nrc")
    get_sentiments("bing")
    get_sentiments("afinn")
    

and accepting all the prompts presented by the interactive menu that shows up.

Getting Data

I then download our news dataset, containing millions of headlines, from:

    https://www.kaggle.com/therohk/million-headlines/downloads/million-headlines.zip/7

Uncompressing it produces the abcnews-date-text.csv file. I load it into the news_data data frame and take a look.
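
If you would rather script the extraction step too, here is a minimal sketch; the file name is an assumption, and since the Kaggle page above normally requires a logged-in session, the zip archive is assumed to have been downloaded manually beforehand.

    # Hypothetical file name; adjust it to wherever the Kaggle archive was saved.
    zip_file <- "million-headlines.zip"
    if (file.exists(zip_file) && !file.exists("abcnews-date-text.csv")) {
      unzip(zip_file)
    }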

    news_data <- read.csv("abcnews-date-text.csv", header = TRUE, stringsAsFactors = FALSE)
    dim(news_data)
    ## [1] 1103663       2
    
    head(news_data)
    ##   publish_date                                       headline_text
    ## 1     20030219 aba decides against community broadcasting licence
    ## 2     20030219     act fire witnesses must be aware of defamation
    ## 3     20030219     a g calls for infrastructure protection summit
    ## 4     20030219           air nz staff in aust strike for pay rise
    ## 5     20030219      air nz strike to affect australian travellers
    ## 6     20030219                   ambitious olsson wins triple jump
    
    tail(news_data)
    ##         publish_date                                               headline_text
    ## 1103658     20171231            stunning photos from the sydney to hobart yacht
    ## 1103659     20171231 the ashes smiths warners near miss enliven boxing day test
    ## 1103660     20171231                    timelapse: brisbanes new year fireworks
    ## 1103661     20171231                   what 2017 meant to the kids of australia
    ## 1103662     20171231           what the papodopoulos meeting may mean for ausus
    ## 1103663     20171231  who is george papadopoulos the former trump campaign aide
    

Tokens Analysis

It is time to extract the tokens from our dataset. I select the column named headline_text and unnest the word tokens, which determines the following.

    news_df <- news_data %>% select(headline_text)
    news_tokens <- news_df %>% unnest_tokens(word, headline_text)
    head(news_tokens, 10)
    ##             word
    ## 1            aba
    ## 1.1      decides
    ## 1.2      against
    ## 1.3    community
    ## 1.4 broadcasting
    ## 1.5      licence
    ## 2            act
    ## 2.1         fire
    ## 2.2    witnesses
    ## 2.3         must
    
    tail(news_tokens, 10)
    ##                   word
    ## 1103662.7        ausus
    ## 1103663            who
    ## 1103663.1           is
    ## 1103663.2       george
    ## 1103663.3 papadopoulos
    ## 1103663.4          the
    ## 1103663.5       former
    ## 1103663.6        trump
    ## 1103663.7     campaign
    ## 1103663.8         aide
    

It is interesting to generate and inspect a table reporting how many times each token shows up within the headlines and its proportion with respect to the total.

    news_tokens_count <- news_tokens %>% count(word, sort = TRUE) %>% mutate(proportion = n / sum(n))
    

The top 10 words that appear most frequently.

    head(news_tokens_count, 10)
    ## # A tibble: 10 x 3
    ##    word        n proportion
    ##    <chr>   <int>      <dbl>
    ##  1 to     214201    0.0303 
    ##  2 in     135981    0.0192 
    ##  3 for    130239    0.0184 
    ##  4 of      80759    0.0114 
    ##  5 on      73037    0.0103 
    ##  6 over    50306    0.00711
    ##  7 the     49810    0.00704
    ##  8 police  35984    0.00509
    ##  9 at      31723    0.00449
    ## 10 with    29676    0.00420
    

And those that appear least frequently:

    tail(news_tokens_count, 10)
    ## # A tibble: 10 x 3
    ##    word           n  proportion
    ##    <chr>      <int>       <dbl>
    ##  1 zweli          1 0.000000141
    ##  2 zwitkowsky     1 0.000000141
    ##  3 zydelig        1 0.000000141
    ##  4 zygar          1 0.000000141
    ##  5 zygiefs        1 0.000000141
    ##  6 zylvester      1 0.000000141
    ##  7 zynga          1 0.000000141
    ##  8 zyngier        1 0.000000141
    ##  9 zz             1 0.000000141
    ## 10 zzz            1 0.000000141
    

There is an issue with proceeding this way. The issue is that there are words that do not play a relevant role for the sentiment analysis, the so-called stop words. The stop words provided by the tidytext package are shown here below.

    data(stop_words)
    head(stop_words, 10)
    ## # A tibble: 10 x 2
    ##    word        lexicon
    ##    <chr>       <chr>  
    ##  1 a           SMART  
    ##  2 a's         SMART  
    ##  3 able        SMART  
    ##  4 about       SMART  
    ##  5 above       SMART  
    ##  6 according   SMART  
    ##  7 accordingly SMART  
    ##  8 across      SMART  
    ##  9 actually    SMART  
    ## 10 after       SMART
    

To remove the stop words as required, we take advantage of the anti_join operation.

    news_tokens_no_sp <- news_tokens %>% anti_join(stop_words)
    head(news_tokens_no_sp, 10)
    ##            word
    ## 1           aba
    ## 2       decides
    ## 3     community
    ## 4  broadcasting
    ## 5       licence
    ## 6           act
    ## 7          fire
    ## 8     witnesses
    ## 9         aware
    ## 10   defamation
    

Then, I count the news tokens again after having removed the stop words.

    news_tokens_count <- news_tokens_no_sp %>% count(word, sort = TRUE) %>% mutate(proportion = n / sum(n))
    head(news_tokens_count, 10)
    ## # A tibble: 10 x 3
    ##    word          n proportion
    ##    <chr>     <int>      <dbl>
    ##  1 police    35984    0.00673
    ##  2 govt      16923    0.00317
    ##  3 court     16380    0.00306
    ##  4 council   16343    0.00306
    ##  5 interview 15025    0.00281
    ##  6 fire      13910    0.00260
    ##  7 nsw       12912    0.00242
    ##  8 australia 12353    0.00231
    ##  9 plan      12307    0.00230
    ## 10 water     11874    0.00222
    
    tail(news_tokens_count)
    ## # A tibble: 6 x 3
    ##   word          n  proportion
    ##   <chr>     <int>       <dbl>
    ## 1 zygiefs       1 0.000000187
    ## 2 zylvester     1 0.000000187
    ## 3 zynga         1 0.000000187
    ## 4 zyngier       1 0.000000187
    ## 5 zz            1 0.000000187
    ## 6 zzz           1 0.000000187
    

Then, I keep only the tokens having more than 8,000 occurrences.

    news_token_over8000 <- news_tokens_count %>% filter(n > 8000) %>% mutate(word = reorder(word, n))
    nrow(news_token_over8000)
    ## [1] 32
    
    head(news_token_over8000, 10) 
    ## # A tibble: 10 x 3
    ##    word          n proportion
    ##    <fct>     <int>      <dbl>
    ##  1 police    35984    0.00673
    ##  2 govt      16923    0.00317
    ##  3 court     16380    0.00306
    ##  4 council   16343    0.00306
    ##  5 interview 15025    0.00281
    ##  6 fire      13910    0.00260
    ##  7 nsw       12912    0.00242
    ##  8 australia 12353    0.00231
    ##  9 plan      12307    0.00230
    ## 10 water     11874    0.00222
    
    tail(news_token_over8000, 10) 
    ## # A tibble: 10 x 3
    ##    word         n proportion
    ##    <fct>    <int>      <dbl>
    ##  1 day       8818    0.00165
    ##  2 hospital  8815    0.00165
    ##  3 car       8690    0.00163
    ##  4 coast     8411    0.00157
    ##  5 calls     8401    0.00157
    ##  6 win       8315    0.00156
    ##  7 girl      8213    0.00154
    ##  8 killed    8129    0.00152
    ##  9 accused   8094    0.00151
    ## 10 world     8087    0.00151
    

It is interesting to show the proportion in parts per thousand by means of a histogram plot.

    news_token_over8000 %>%  
      ggplot(aes(word, proportion*1000, fill = ceiling(proportion*1000))) +
      geom_col() + xlab(NULL) + coord_flip() + theme(legend.position = "none")
    

News Sentiment Analysis

In this section, I focus on each single headline to evaluate its specific sentiment as determined by each lexicon. Hence, the output determines whether each specific headline carries positive or negative sentiment.

    head(news_df, 10)
    ##                                         headline_text
    ## 1  aba decides against community broadcasting licence
    ## 2      act fire witnesses must be aware of defamation
    ## 3      a g calls for infrastructure protection summit
    ## 4            air nz staff in aust strike for pay rise
    ## 5       air nz strike to affect australian travellers
    ## 6                    ambitious olsson wins triple jump
    ## 7          antic delighted with record breaking barca
    ## 8   aussie qualifier stosur wastes four memphis match
    ## 9        aust addresses un security council over iraq
    ## 10         australia is locked into war timetable opp
    

I will analyse only the first 1,000 headlines, just for computational time reasons. The token list for them is computed as follows.

    news_df_subset <- news_df[1:1000, , drop = FALSE]
    tkn_l <- apply(news_df_subset, 1, function(x) { data.frame(headline_text = x, stringsAsFactors = FALSE) %>% unnest_tokens(word, headline_text)})
    

I then remove the stop words from each token list.

    single_news_tokens <- lapply(tkn_l, function(x) { anti_join(x, stop_words) })
    
    str(single_news_tokens, list.len = 5)
    ## List of 1000
    ##  $ 1   :'data.frame':    5 obs. of  1 variable:
    ##   ..$ word: chr [1:5] "aba" "decides" "community" "broadcasting" ...
    ##  $ 2   :'data.frame':    5 obs. of  1 variable:
    ##   ..$ word: chr [1:5] "act" "fire" "witnesses" "aware" ...
    ##  $ 3   :'data.frame':    4 obs. of  1 variable:
    ##   ..$ word: chr [1:4] "calls" "infrastructure" "protection" "summit"
    ##  $ 4   :'data.frame':    7 obs. of  1 variable:
    ##   ..$ word: chr [1:7] "air" "nz" "staff" "aust" ...
    ##  $ 5   :'data.frame':    6 obs. of  1 variable:
    ##   ..$ word: chr [1:6] "air" "nz" "strike" "affect" ...
    ##   [list output truncated]
    

As we can see, each headline is associated with a list of tokens. The sentiment of a headline is computed based on the sum of the positive/negative scores of its tokens.

    single_news_tokens[[1]]
    ##           word
    ## 1          aba
    ## 2      decides
    ## 3    community
    ## 4 broadcasting
    ## 5      licence
    

Bing lexicon

In this section, the computation of the sentiment associated with each token list is shown for the Bing lexicon. I first define a function named compute_sentiment() whose purpose is to output the positivity score of a specific headline.

    compute_sentiment <- function(d) {
      # headlines with no lexicon match get an NA score
      if (nrow(d) == 0) {
        return(NA)
      }
      neg_score <- d %>% filter(sentiment == "negative") %>% nrow()
      pos_score <- d %>% filter(sentiment == "positive") %>% nrow()
      pos_score - neg_score
    } 
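
For example, here is a minimal usage sketch of compute_sentiment() on the second headline's tokens joined with the Bing lexicon; it returns that headline's net score (compare it with the score column shown further below).

    single_news_tokens[[2]] %>%
      inner_join(get_sentiments("bing"), by = "word") %>%
      compute_sentiment()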
    

The inner join of each single headline's token list with the Bing lexicon is given as input to the compute_sentiment() function to determine the sentiment score of each specific headline.

    sentiments_bing <- get_sentiments("bing")
    str(sentiments_bing)
    ## Classes 'tbl_df', 'tbl' and 'data.frame':    6786 obs. of  2 variables:
    ##  $ word     : chr  "2-faces" "abnormal" "abolish" "abominable" ...
    ##  $ sentiment: chr  "negative" "negative" "negative" "negative" ...
    
    single_news_sentiment_bing <- sapply(single_news_tokens, function(x) { x %>% inner_join(sentiments_bing) %>% compute_sentiment()})
    

The result is a vector of integers whose element at the i-th position is the sentiment score associated with the i-th headline.

    str(single_news_sentiment_bing)
    ##  Named int [1:1000] NA -1 1 -1 -1 2 0 NA NA NA ...
    ##  - attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...
    

Here is the summary; please note that:

  • the median is negative
  • NA values show up

    summary(single_news_sentiment_bing)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    ##  -3.000  -1.000  -1.000  -0.475   1.000   2.000     520
    

I gather the results into a data frame as follows.

    single_news_sentiment_bing_df <- data.frame(headline_text = news_df_subset$headline_text, score = single_news_sentiment_bing)
    head(single_news_sentiment_bing_df, 10)
    ##                                         headline_text score
    ## 1  aba decides against community broadcasting licence    NA
    ## 2      act fire witnesses must be aware of defamation    -1
    ## 3      a g calls for infrastructure protection summit     1
    ## 4            air nz staff in aust strike for pay rise    -1
    ## 5       air nz strike to affect australian travellers    -1
    ## 6                    ambitious olsson wins triple jump     2
    ## 7          antic delighted with record breaking barca     0
    ## 8   aussie qualifier stosur wastes four memphis match    NA
    ## 9        aust addresses un security council over iraq    NA
    ## 10         australia is locked into war timetable opp    NA
    

NRC lexicon

In this section, the computation of the sentiment associated with each token list is shown for the NRC lexicon. With respect to the previous analysis based on the Bing lexicon, some additional pre-processing is needed, as explained in what follows. First, we get the NRC sentiment lexicon and see which sentiments are present in it.

    sentiments_nrc <- get_sentiments("nrc")
    (unique_sentiments_nrc <- unique(sentiments_nrc$sentiment))
    ##  [1] "trust"        "fear"         "negative"     "sadness"      "anger"        "surprise"    
    ##  [7] "positive"     "disgust"      "joy"          "anticipation"
    

To obtain a positive/negative sentiment result as output, I define a mapping of the above-listed sentiments to a positive/negative string value as follows.

    compute_pos_neg_sentiments_nrc <- function(the_sentiments_nrc) {
      # map each NRC emotion to a positive/negative label, in the same order
      # as returned by unique() above
      s <- unique(the_sentiments_nrc$sentiment)
      df_sentiments <- data.frame(sentiment = s, 
                                  mapped_sentiment = c("positive", "negative", "negative", "negative",
                                                       "negative", "positive", "positive", "negative", 
                                                       "positive", "positive"))
      ss <- sentiments_nrc %>% inner_join(df_sentiments)
      the_sentiments_nrc$sentiment <- ss$mapped_sentiment
      the_sentiments_nrc
    }
    
    nrc_sentiments_pos_neg_scale <- compute_pos_neg_sentiments_nrc(sentiments_nrc)
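
As a quick sanity check, the mapped lexicon should now contain only the two labels "positive" and "negative"; a one-line sketch to verify it:

    unique(nrc_sentiments_pos_neg_scale$sentiment)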
    

The above function is used to produce the single-headline sentiment results. Such results are then given as input to the compute_sentiment() function.

    single_news_sentiment_nrc <- sapply(single_news_tokens, function(x) { x %>% inner_join(nrc_sentiments_pos_neg_scale) %>% compute_sentiment()})
    
    str(single_news_sentiment_nrc)
    ##  Named int [1:1000] 1 -4 1 2 -2 2 4 NA 5 -2 ...
    ##  - attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...
    

Here is the summary; please note that:

  • the median is equal to zero
  • NA values show up

    summary(single_news_sentiment_nrc)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    ## -9.0000 -2.0000  0.0000 -0.3742  2.0000  9.0000     257
    
    single_news_sentiment_nrc_df <- data.frame(headline_text = news_df_subset$headline_text, score = single_news_sentiment_nrc)
    head(single_news_sentiment_nrc_df, 10)
    ##                                         headline_text score
    ## 1  aba decides against community broadcasting licence     1
    ## 2      act fire witnesses must be aware of defamation    -4
    ## 3      a g calls for infrastructure protection summit     1
    ## 4            air nz staff in aust strike for pay rise     2
    ## 5       air nz strike to affect australian travellers    -2
    ## 6                    ambitious olsson wins triple jump     2
    ## 7          antic delighted with record breaking barca     4
    ## 8   aussie qualifier stosur wastes four memphis match    NA
    ## 9        aust addresses un security council over iraq     5
    ## 10         australia is locked into war timetable opp    -2
    

AFINN lexicon

In this section, the computation of the sentiment associated with each token list is shown for the AFINN lexicon.

    sentiments_afinn <- get_sentiments("afinn")
    colnames(sentiments_afinn) <- c("word", "sentiment")
    str(sentiments_afinn)
    ## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2477 obs. of  2 variables:
    ##  $ word     : chr  "abandon" "abandoned" "abandons" "abducted" ...
    ##  $ sentiment: num  -2 -2 -2 -2 -2 -2 -3 -3 -3 -3 ...
    ##  - attr(*, "spec")=
    ##   .. cols(
    ##   ..   word = col_character(),
    ##   ..   value = col_double()
    ##   .. )
    

As we can see, the AFINN lexicon provides a score for each token. We just need to sum up the scores of each headline's tokens to obtain the sentiment score of the headline under analysis.

    single_news_sentiment_afinn_df <- lapply(single_news_tokens, function(x) { x %>% inner_join(sentiments_afinn)})
    single_news_sentiment_afinn <- sapply(single_news_sentiment_afinn_df, function(x) { 
          ifelse(nrow(x) > 0, sum(x$sentiment), NA)
      })
    
    str(single_news_sentiment_afinn)
    ##  Named num [1:1000] NA -2 NA -2 -1 6 3 NA NA -2 ...
    ##  - attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...
    

Here is the summary; please note that:

  • the median is negative
  • NA values show up

    summary(single_news_sentiment_afinn)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    ##  -9.000  -3.000  -2.000  -1.148   1.000   7.000     508
    
    single_news_sentiment_afinn_df <- data.frame(headline_text = news_df_subset$headline_text, score = single_news_sentiment_afinn)
    head(single_news_sentiment_afinn_df, 10)
    ##                                         headline_text score
    ## 1  aba decides against community broadcasting licence    NA
    ## 2      act fire witnesses must be aware of defamation    -2
    ## 3      a g calls for infrastructure protection summit    NA
    ## 4            air nz staff in aust strike for pay rise    -2
    ## 5       air nz strike to affect australian travellers    -1
    ## 6                    ambitious olsson wins triple jump     6
    ## 7          antic delighted with record breaking barca     3
    ## 8   aussie qualifier stosur wastes four memphis match    NA
    ## 9        aust addresses un security council over iraq    NA
    ## 10         australia is locked into war timetable opp    -2
    

Comparing results

Having obtained three potential results of the sentiment evaluation for each headline, we would like to compare their congruence.
By congruence we mean the fact that all three lexicons express the same positive or negative outcome, in other words the same score sign, independently of its magnitude. If NA values are present, the congruence is computed as long as at least two non-NA values are available; otherwise it is equal to NA.

Moreover, we compute the final news sentiment based upon the sum of each lexicon's sentiment score.

    compute_congruence <- function(x, y, z) {
      v <- c(sign(x), sign(y), sign(z))
      # if only one lexicon reports a score, we cannot check for congruence
      if (sum(is.na(v)) >= 2) {
        return (NA)
      }
      # removing NA values
      v <- na.omit(v)
      v_sum <- sum(v)
      abs(v_sum) == length(v)
    }
    
    compute_final_sentiment <- function(x, y, z) {
      if (is.na(x) && is.na(y) && is.na(z)) {
        return (NA)
      }
    
      s <- sum(x, y, z, na.rm = TRUE)
      # positive sentiments have a score strictly greater than zero
      # negative sentiments have a score strictly less than zero
      # neutral sentiments have a score equal to zero 
      ifelse(s > 0, "positive", ifelse(s < 0, "negative", "neutral"))
    }
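    
    # A quick illustration of the two helpers (a sketch with made-up scores):
    compute_congruence(1, 2, -1)        # signs differ            -> FALSE
    compute_congruence(1, 2, NA)        # two non-NA, same sign   -> TRUE
    compute_congruence(NA, NA, 3)       # only one score          -> NA
    compute_final_sentiment(1, -3, NA)  # sums to -2              -> "negative"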
    
    news_sentiments_results <- data.frame(headline_text = news_df_subset$headline_text, 
                                          bing_score = single_news_sentiment_bing, 
                                          nrc_score = single_news_sentiment_nrc, 
                                          afinn_score = single_news_sentiment_afinn,
                                          stringsAsFactors = FALSE)
    
    news_sentiments_results <- news_sentiments_results %>% rowwise() %>% 
      mutate(final_sentiment = compute_final_sentiment(bing_score, nrc_score, afinn_score),
             congruence = compute_congruence(bing_score, nrc_score, afinn_score))
    
    head(news_sentiments_results, 40)
    ## Source: local data frame [40 x 6]
    ## Groups: <by row>
    ## 
    ## # A tibble: 40 x 6
    ##    headline_text                           bing_score nrc_score afinn_score final_sentiment congruence
    ##    <chr>                                        <int>     <int>       <dbl> <chr>           <lgl>     
    ##  1 aba decides against community broadcas~         NA         1          NA positive        NA        
    ##  2 act fire witnesses must be aware of de~         -1        -4          -2 negative        TRUE      
    ##  3 a g calls for infrastructure protectio~          1         1          NA positive        TRUE      
    ##  4 air nz staff in aust strike for pay ri~         -1         2          -2 negative        FALSE     
    ##  5 air nz strike to affect australian tra~         -1        -2          -1 negative        TRUE      
    ##  6 ambitious olsson wins triple jump                2         2           6 positive        TRUE      
    ##  7 antic delighted with record breaking b~          0         4           3 positive        FALSE     
    ##  8 aussie qualifier stosur wastes four me~         NA        NA          NA <NA>            NA        
    ##  9 aust addresses un security council ove~         NA         5          NA positive        NA        
    ## 10 australia is locked into war timetable~         NA        -2          -2 negative        TRUE      
    ## # ... with 30 more rows
    

It would be helpful to replace the numeric scores with the same {negative, neutral, positive} scale.

    replace_score_with_sentiment <- function(v_score) {
      v_score[v_score > 0] <- "positive"
      v_score[v_score < 0] <- "negative"
      v_score[v_score == 0] <- "neutral"
      v_score
    } 
    
    news_sentiments_results$bing_score <- replace_score_with_sentiment(news_sentiments_results$bing_score)
    news_sentiments_results$nrc_score <- replace_score_with_sentiment(news_sentiments_results$nrc_score)
    news_sentiments_results$afinn_score <- replace_score_with_sentiment(news_sentiments_results$afinn_score)
    
    news_sentiments_results[,2:5] <- lapply(news_sentiments_results[,2:5], as.factor)
    
    head(news_sentiments_results, 40)
    ## Source: local data frame [40 x 6]
    ## Groups: <by row>
    ## 
    ## # A tibble: 40 x 6
    ##    headline_text                           bing_score nrc_score afinn_score final_sentiment congruence
    ##    <chr>                                   <fct>      <fct>     <fct>       <fct>           <lgl>     
    ##  1 aba decides against community broadcas~ <NA>       positive  <NA>        positive        NA        
    ##  2 act fire witnesses must be aware of de~ negative   negative  negative    negative        TRUE      
    ##  3 a g calls for infrastructure protectio~ positive   positive  <NA>        positive        TRUE      
    ##  4 air nz staff in aust strike for pay ri~ negative   positive  negative    negative        FALSE     
    ##  5 air nz strike to affect australian tra~ negative   negative  negative    negative        TRUE      
    ##  6 ambitious olsson wins triple jump       positive   positive  positive    positive        TRUE      
    ##  7 antic delighted with record breaking b~ neutral    positive  positive    positive        FALSE     
    ##  8 aussie qualifier stosur wastes four me~ <NA>       <NA>      <NA>        <NA>            NA        
    ##  9 aust addresses un security council ove~ <NA>       positive  <NA>        positive        NA        
    ## 10 australia is locked into war timetable~ <NA>       negative  negative    negative        TRUE      
    ## # ... with 30 more rows
    

Tabulations of each lexicon's resulting sentiment against the final sentiment are shown below.

    table(news_sentiments_results$bing_score, news_sentiments_results$final_sentiment, dnn = c("bing", "final"))
    ##           final
    ## bing       negative neutral positive
    ##   negative      278      15       14
    ##   neutral        16       6       11
    ##   positive        6       7      127
    
    table(news_sentiments_results$nrc_score, news_sentiments_results$final_sentiment, dnn = c("nrc", "final"))
    ##           final
    ## nrc        negative neutral positive
    ##   negative      353      10        4
    ##   neutral        18      13        6
    ##   positive       25      16      298
    
    table(news_sentiments_results$afinn_score, news_sentiments_results$final_sentiment, dnn = c("afinn", "final"))
    ##           final
    ## afinn      negative neutral positive
    ##   negative      326      10       12
    ##   neutral         3       1        6
    ##   positive        4       9      121
    

The tabulation of congruence against the final sentiment is shown below.

    table(news_sentiments_results$congruence, news_sentiments_results$final_sentiment, dnn = c("congruence", "final"))
    ##           final
    ## congruence negative neutral positive
    ##      FALSE       67      33       45
    ##      TRUE       292       0      132
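
From the last table we can also compute, as a quick sketch, the overall share of congruent evaluations among the headlines for which congruence is defined.

    mean(news_sentiments_results$congruence, na.rm = TRUE)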
    

Conclusions

We analyzed the news headlines to determine their sentiment while taking advantage of three sentiment lexicons. We outlined some basics of the methodology for this purpose. We also had the chance to compare the results obtained by means of the three lexicons and to set forth a final sentiment evaluation. If you are interested in learning much more about text analysis, see ref. [4].

References

[1] NRC sentiment lexicon
[2] Bing sentiment lexicon
[3] AFINN sentiment lexicon
[4] Text mining with R


