I'm struggling with a for loop in R. I have the following data frame with sentences, and two dictionaries of positive and negative words:
library(stringr)
library(plyr)
library(dplyr)
library(stringi)
library(qdap)
library(qdapRegex)
library(reshape2)
library(zoo)
# Create data.frame with sentences
sent <- data.frame(words = c("great just great right size and i love this notebook",
                             "benefits great laptop at the top",
                             "wouldnt bad notebook and very good",
                             "very good quality",
                             "bad orgtop but great",
                             "great improvement for that great improvement bad product but overall is not good",
                             "notebook is not good but i love batterytop"),
                   user = c(1, 2, 3, 4, 5, 6, 7),
                   number = c(1, 1, 1, 1, 1, 1, 1),
                   stringsAsFactors = FALSE)
# Create pos/negWords
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
"extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great",
"wouldnt bad")
negWords <- c("hate","bad","not good","horrible")
Now I replicate the original data frame to simulate big data:
# Replicate original data.frame - big data simulation (700,000 rows of sentences)
df.expanded <- as.data.frame(replicate(100000, sent$words))
sent <- coredata(sent)[rep(seq(nrow(sent)), 100000), ]  # zoo::coredata
sent$words <- paste("", sent$words, "")  # pad with spaces so " word " patterns also match at sentence edges
rownames(sent) <- NULL
For my further approach, I have to order the dictionary words by decreasing length, each with its sentiment score (pos word = 1, neg word = -1), so that longer phrases like "not good" are matched before the words they contain, like "good".
# Ordering words in pos/negWords
wordsDF <- data.frame(words = posWords, value = 1, stringsAsFactors = FALSE)
wordsDF <- rbind(wordsDF, data.frame(words = negWords, value = -1))
wordsDF$lengths <- nchar(wordsDF$words)        # nchar is already vectorized
wordsDF <- wordsDF[order(-wordsDF$lengths), ]  # longest phrases first
wordsDF$words <- paste("", wordsDF$words, "")  # pad with spaces to match the padded sentences
rownames(wordsDF) <- NULL
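To make the point of the ordering concrete, here is a minimal, self-contained sketch with a cut-down dictionary (the `demo` data frame is just for illustration, not part of my pipeline):

```r
# Ordering by decreasing length ensures multiword phrases are matched
# (and removed) before their substrings.
demo <- data.frame(words = c("good", "not good", "great", "great improvement"),
                   value = c(1, -1, 1, 1), stringsAsFactors = FALSE)
demo <- demo[order(-nchar(demo$words)), ]
demo$words
# longest first: "great improvement", "not good", "great", "good" -
# so "not good" scores -1 instead of +1 for the embedded "good"
```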
Then I have the following function with a for loop, which 1) matches exact words, 2) counts them, 3) computes the score, and 4) removes matched words from the sentence before the next iteration:
scoreSentence_new <- function(sentence) {
  score <- 0
  for (x in 1:nrow(wordsDF)) {
    countHits <- function(text) stri_count(text, regex = wordsDF[x, 1])  # count matched words (renamed from sd, which shadows stats::sd)
    results <- sapply(sentence, countHits, USE.NAMES = FALSE)
    score <- score + results * wordsDF[x, 2]                   # compute score
    sentence <- str_replace_all(sentence, wordsDF[x, 1], " ")  # remove matched words before the next iteration
  }
  score
}
When I call the function:
SentimentScore_new <- scoreSentence_new(sent$words)
sent_new <- cbind(sent, SentimentScore_new)
sent_new$words <- str_trim(sent_new$words, side = "both")
it produces the desired output:
words user SentimentScore_new
great just great right size and i love this notebook 1 4
benefits great laptop at the top 2 2
wouldnt bad notebook and very good 3 2
very good quality 4 1
bad orgtop but great 5 0
great improvement for that great improvement bad product but overall is not good 6 0
notebook is not good but i love batterytop 7 0
In reality I'm using dictionaries of about 7,000 pos/neg words and 200,000 sentences. With my approach, scoring just 1,000 sentences takes 45 minutes. Could anyone help me with a faster approach using vectorization or a parallel solution? Because of my beginner R programming skills I'm at the end of my efforts :-( Thank you very much in advance for any advice or solutions.
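One direction I've been considering (just a sketch, not verified at scale): since the dictionary entries are plain literal strings, the fixed-string matchers in stringi should be cheaper than the regex ones, and each call already processes the whole sentence vector at once. `scoreSentence_fixed` is just a working name:

```r
library(stringi)

# Same count-score-remove logic as scoreSentence_new, but with the
# *_fixed (literal string) variants, avoiding regex-engine overhead.
# Assumes wordsDF is ordered by decreasing length and space-padded,
# and that the sentences are space-padded, as above.
scoreSentence_fixed <- function(sentence, wordsDF) {
  score <- numeric(length(sentence))
  for (x in seq_len(nrow(wordsDF))) {
    hits     <- stri_count_fixed(sentence, wordsDF$words[x])             # count matches
    score    <- score + hits * wordsDF$value[x]                          # accumulate score
    sentence <- stri_replace_all_fixed(sentence, wordsDF$words[x], " ")  # remove matches
  }
  score
}
```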
I was wondering about something like this:
n <- 1:nrow(wordsDF)
score <- 0
try_1 <- function(ttt) {
  sd <- function(text) stri_count(text, regex = wordsDF[ttt, 1])
  results <- sapply(sent$words, sd, USE.NAMES = FALSE)
  score <- score + results * wordsDF[ttt, 2]  # compute score (count * sentValue)
  sent$words <- str_replace_all(sent$words, wordsDF[ttt, 1], " ")
  score
}
a <- unlist(sapply(n, try_1))
apply(a, 1, sum)
But it doesn't work :-(
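As far as I can tell, it fails for two reasons: each `try_1` call only modifies a local copy of `sent$words`, so matched words are never actually removed between iterations, and `unlist()` flattens the per-word score vectors, so `apply(a, 1, sum)` has no matrix to sum over. A sketch of one way to carry that state across the dictionary is base R's `Reduce()` (the function name `score_with_reduce` is just a working name):

```r
library(stringi)
library(stringr)

# Reduce() threads an accumulator (the shrinking sentences plus the
# running score) through the dictionary entries - exactly the state
# that plain sapply() cannot carry between iterations.
score_with_reduce <- function(sentences, wordsDF) {
  state <- Reduce(function(st, x) {
    hits <- stri_count(st$sentence, regex = wordsDF$words[x])             # count matches
    list(sentence = str_replace_all(st$sentence, wordsDF$words[x], " "),  # remove them
         score    = st$score + hits * wordsDF$value[x])                   # accumulate
  }, seq_len(nrow(wordsDF)), init = list(sentence = sentences, score = 0))
  state$score
}
# usage, given sent and wordsDF as above:
# SentimentScore_new <- score_with_reduce(sent$words, wordsDF)
```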
sentence <- str_replace_all(sentence, wordsDF[x,1], " ")
? I can't see why it's necessary, and it's definitely slowing things down. – shadowtalker Mar 10 '15 at 15:58