I'm struggling with a for loop in R. I have the following data frame with sentences, and two dictionaries of positive and negative words:
library(stringr)
library(plyr)
library(dplyr)
library(stringi)
library(qdap)
library(qdapRegex)
library(reshape2)
library(zoo)
# Create data.frame with sentences
sent <- data.frame(words = c("great just great right size and i love this notebook",
                             "benefits great laptop at the top",
                             "wouldnt bad notebook and very good",
                             "very good quality",
                             "bad orgtop but great",
                             "great improvement for that great improvement bad product but overall is not good",
                             "notebook is not good but i love batterytop"),
                   user = c(1, 2, 3, 4, 5, 6, 7),
                   number = c(1, 1, 1, 1, 1, 1, 1),
                   stringsAsFactors = FALSE)
# Create pos/negWords
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
"extra","benefit","top","extraordinarily","extraordinary","super","benefits super","good","benefits great",
"wouldnt bad")
negWords <- c("hate","bad","not good","horrible")
Now I replicate the original data frame to simulate big data:
# Replicate original data.frame - big data simulation (700,000 rows of sentences)
df.expanded <- as.data.frame(replicate(100000, sent$words))
sent <- coredata(sent)[rep(seq(nrow(sent)), 100000), ]  # zoo::coredata
sent$words <- paste("", sent$words, "")  # pad with spaces so " word " patterns also match at sentence edges
rownames(sent) <- NULL
For my further approach, I have to order the dictionary words by decreasing length, each with its sentiment score (pos word = 1, neg word = -1), so that longer phrases like "not good" are matched before the words they contain, like "good".
# Ordering words in pos/negWords
wordsDF <- data.frame(words = posWords, value = 1, stringsAsFactors = FALSE)
wordsDF <- rbind(wordsDF, data.frame(words = negWords, value = -1))
wordsDF$lengths <- nchar(wordsDF$words)        # nchar is already vectorized
wordsDF <- wordsDF[order(-wordsDF$lengths), ]  # longest phrases first
wordsDF$words <- paste("", wordsDF$words, "")  # pad with spaces to match the padded sentences
rownames(wordsDF) <- NULL
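To make the point of the ordering concrete, here is a minimal, self-contained sketch with a cut-down dictionary (the `demo` data frame is just for illustration, not part of my pipeline):

```r
# Ordering by decreasing length ensures multiword phrases are matched
# (and removed) before their substrings.
demo <- data.frame(words = c("good", "not good", "great", "great improvement"),
                   value = c(1, -1, 1, 1), stringsAsFactors = FALSE)
demo <- demo[order(-nchar(demo$words)), ]
demo$words
# longest first: "great improvement", "not good", "great", "good" -
# so "not good" scores -1 instead of +1 for the embedded "good"
```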
Then I have the following function with a for loop, which 1) matches exact words, 2) counts them, 3) computes the score, and 4) removes matched words from the sentence before the next iteration:
scoreSentence_new <- function(sentence) {
  score <- 0
  for (x in 1:nrow(wordsDF)) {
    countHits <- function(text) stri_count(text, regex = wordsDF[x, 1])  # count matched words (renamed from sd, which shadows stats::sd)
    results <- sapply(sentence, countHits, USE.NAMES = FALSE)
    score <- score + results * wordsDF[x, 2]                   # compute score
    sentence <- str_replace_all(sentence, wordsDF[x, 1], " ")  # remove matched words before the next iteration
  }
  score
}
When I call the function:
SentimentScore_new <- scoreSentence_new(sent$words)
sent_new <- cbind(sent, SentimentScore_new)
sent_new$words <- str_trim(sent_new$words, side = "both")
it produces the desired output:
words user SentimentScore_new
great just great right size and i love this notebook 1 4
benefits great laptop at the top 2 2
wouldnt bad notebook and very good 3 2
very good quality 4 1
bad orgtop but great 5 0
great improvement for that great improvement bad product but overall is not good 6 0
notebook is not good but i love batterytop 7 0
In reality I'm using dictionaries of about 7,000 pos/neg words and 200,000 sentences. With my approach, scoring just 1,000 sentences takes 45 minutes. Could anyone help me with a faster approach using vectorization or a parallel solution? Because of my beginner R programming skills I'm at the end of my efforts :-( Thank you very much in advance for any advice or solutions.
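One direction I've been considering (just a sketch, not verified at scale): since the dictionary entries are plain literal strings, the fixed-string matchers in stringi should be cheaper than the regex ones, and each call already processes the whole sentence vector at once. `scoreSentence_fixed` is just a working name:

```r
library(stringi)

# Same count-score-remove logic as scoreSentence_new, but with the
# *_fixed (literal string) variants, avoiding regex-engine overhead.
# Assumes wordsDF is ordered by decreasing length and space-padded,
# and that the sentences are space-padded, as above.
scoreSentence_fixed <- function(sentence, wordsDF) {
  score <- numeric(length(sentence))
  for (x in seq_len(nrow(wordsDF))) {
    hits     <- stri_count_fixed(sentence, wordsDF$words[x])             # count matches
    score    <- score + hits * wordsDF$value[x]                          # accumulate score
    sentence <- stri_replace_all_fixed(sentence, wordsDF$words[x], " ")  # remove matches
  }
  score
}
```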
I was wondering about something like this:
n <- 1:nrow(wordsDF)
score <- 0
try_1 <- function(ttt) {
  sd <- function(text) stri_count(text, regex = wordsDF[ttt, 1])
  results <- sapply(sent$words, sd, USE.NAMES = FALSE)
  score <- score + results * wordsDF[ttt, 2]  # compute score (count * sentValue)
  sent$words <- str_replace_all(sent$words, wordsDF[ttt, 1], " ")
  score
}
a <- unlist(sapply(n, try_1))
apply(a, 1, sum)
But it doesn't work :-(
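As far as I can tell, it fails for two reasons: each `try_1` call only modifies a local copy of `sent$words`, so matched words are never actually removed between iterations, and `unlist()` flattens the per-word score vectors, so `apply(a, 1, sum)` has no matrix to sum over. A sketch of one way to carry that state across the dictionary is base R's `Reduce()` (the function name `score_with_reduce` is just a working name):

```r
library(stringi)
library(stringr)

# Reduce() threads an accumulator (the shrinking sentences plus the
# running score) through the dictionary entries - exactly the state
# that plain sapply() cannot carry between iterations.
score_with_reduce <- function(sentences, wordsDF) {
  state <- Reduce(function(st, x) {
    hits <- stri_count(st$sentence, regex = wordsDF$words[x])             # count matches
    list(sentence = str_replace_all(st$sentence, wordsDF$words[x], " "),  # remove them
         score    = st$score + hits * wordsDF$value[x])                   # accumulate
  }, seq_len(nrow(wordsDF)), init = list(sentence = sentences, score = 0))
  state$score
}
# usage, given sent and wordsDF as above:
# SentimentScore_new <- score_with_reduce(sent$words, wordsDF)
```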
sentence <- str_replace_all(sentence, wordsDF[x,1], " ")
? I can't see why it's necessary, and it's definitely slowing things down. – shadowtalker Mar 10 '15 at 15:58