I'm trying to extract the sentiment of Italian words using reviews that users wrote on the Italian Amazon site. After some cleaning (removing punctuation, stop words, etc.), I computed a sentiment score for each word like this:

s = (tf_1/tf)*(-2) + (tf_2/tf)*(-1) + (tf_4/tf)*(1) + (tf_5/tf)*(2)

Here tf is the term's frequency in the whole corpus of reviews, tf_1 is its frequency in one-star reviews, and so on for the other ratings. The problem is that the vast majority of reviews (more than 80%) have 4 or 5 stars, so a lot of neutral words end up with positive scores. How can I normalize for this effect? I tried the following method, but the results got worse:

s = ((tf_1/n1)*(-2) + (tf_2/n2)*(-1) + (tf_4/n4)*(1) + (tf_5/n5)*(2)) * (n/tf)

Here n is the total number of reviews, n1 is the number of one-star reviews, and so on.
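To make the two formulas concrete, here is a minimal Python sketch of both scores. The function names (`raw_score`, `normalized_score`), the dictionary layout, and all the counts are made up for illustration and are not from my real data. It assumes tf counts occurrences across all star ratings, including three-star reviews, which carry weight 0.

```python
# Weight per star rating, as in the formulas above (three stars count as 0).
WEIGHTS = {1: -2, 2: -1, 3: 0, 4: 1, 5: 2}

def raw_score(tf_by_star):
    """First formula: weighted share of a word's occurrences per star rating."""
    tf = sum(tf_by_star.values())  # frequency in the whole corpus
    return sum(w * tf_by_star.get(star, 0) / tf for star, w in WEIGHTS.items())

def normalized_score(tf_by_star, n_by_star):
    """Second formula: per-rating frequencies divided by the number of
    reviews with that rating, then rescaled by n / tf."""
    tf = sum(tf_by_star.values())
    n = sum(n_by_star.values())    # total number of reviews
    weighted = sum(w * tf_by_star.get(star, 0) / n_by_star[star]
                   for star, w in WEIGHTS.items())
    return weighted * n / tf

# Hypothetical corpus: 80% of the reviews have 4 or 5 stars.
N_BY_STAR = {1: 50, 2: 50, 3: 100, 4: 300, 5: 500}

# Hypothetical neutral word: appears at the same rate (1 in 10 reviews) everywhere.
NEUTRAL_TF = {1: 5, 2: 5, 3: 10, 4: 30, 5: 50}

# Hypothetical rare word: appears exactly once, in a one-star review.
RARE_TF = {1: 1}

if __name__ == "__main__":
    print(raw_score(NEUTRAL_TF))                    # roughly 1.15
    print(normalized_score(NEUTRAL_TF, N_BY_STAR))  # roughly 0, up to floating-point error
    print(normalized_score(RARE_TF, N_BY_STAR))     # roughly -40
```

With these toy counts, the evenly distributed word scores about 1.15 under the first formula and about 0 under the second. The second score is no longer confined to the range [-2, 2], though: a word seen once in a one-star review gets about -40, so rare words in the under-represented ratings can dominate the ranking.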
