Getting Twitter sentiment of emoticons and emoji

A data scientist friend and I talked briefly about how hard it is to parse text into words. He mentioned that he felt that not enough attention was paid to emoticons or emoji in the Twitter sentiment analysis papers he reads.

In this context, a sentiment score is a measure of emotion associated with a word or phrase. Negative scores mean the word is associated with negative emotions, and positive scores mean the word is associated with positive emotions.

Coincidentally, I’d been taking a data science course, and the assignment I was working on concerned Twitter sentiment analysis. It wasn’t too hard to adapt the homework to estimate sentiment for non-alphanumeric characters. The regular expression I used to grab emoticons and emoji was:

[^A-Za-z0-9\s]+

Clearly, the above regex captures more than emoticons and emoji: it also captures punctuation and any non-ASCII characters, including letters of words written in non-ASCII scripts. Nonetheless, I thought it’d be an interesting start.
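To see what the pattern actually grabs, here is a minimal Python sketch. The function name `extract_terms` and the sample strings are made up for illustration; only the regex itself comes from the post.

```python
import re

# The pattern from the post: one or more consecutive characters that are
# not ASCII letters, ASCII digits, or whitespace.
TERM_RE = re.compile(r"[^A-Za-z0-9\s]+")


def extract_terms(text):
    """Return every run of non-alphanumeric, non-whitespace characters."""
    return TERM_RE.findall(text)


# Emoticons are captured, but so are punctuation and URL fragments.
print(extract_terms("so happy!!! :) http://example.com"))
# → ['!!!', ':)', '://', '.']

# Accented letters inside ordinary words are captured too.
print(extract_terms("naïve café :-("))
# → ['ï', 'é', ':-(']
```

This is why terms such as `://` and `.` rank so high in the results: every link and every sentence contributes them.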

To calculate the sentiment score of each non-alphanumeric term, the formula I used was:

let pos = [cumulative score of the positive sentiment words in the tweets in which the term appeared]
let neg = [cumulative score of the negative sentiment words in the tweets in which the term appeared; a negative number]
let count = [number of times the term appeared across all tweets]
let sentiment = pos / count + neg / count

The sentiment words and their scores are taken from the AFINN list.
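The calculation can be sketched in Python roughly as follows. This is not the original homework code. It assumes a tiny stand-in lexicon (`TOY_LEXICON`, with illustrative scores rather than the real AFINN values), a simple ASCII word tokenizer, and that a tweet's positive and negative totals are credited once per occurrence of the term. As in the results table, `neg` is kept as a negative number, so the two parts are added.

```python
import re
from collections import defaultdict

TERM_RE = re.compile(r"[^A-Za-z0-9\s]+")  # regex from the post
WORD_RE = re.compile(r"[A-Za-z]+")        # assumed word tokenizer

# Hypothetical stand-in for the AFINN list; scores are illustrative only.
TOY_LEXICON = {"love": 3, "great": 3, "bad": -3, "hate": -3}


def term_sentiments(tweets, lexicon):
    """Accumulate pos, neg and count per term, then compute the score."""
    totals = defaultdict(lambda: {"pos": 0, "neg": 0, "count": 0})
    for tweet in tweets:
        scores = [lexicon.get(w.lower(), 0) for w in WORD_RE.findall(tweet)]
        pos = sum(s for s in scores if s > 0)  # positive words in this tweet
        neg = sum(s for s in scores if s < 0)  # negative words (sum is <= 0)
        for term in TERM_RE.findall(tweet):
            entry = totals[term]
            entry["pos"] += pos
            entry["neg"] += neg
            entry["count"] += 1
    result = {}
    for term, entry in totals.items():
        sentiment = (entry["pos"] + entry["neg"]) / entry["count"]
        result[term] = dict(entry, sentiment=sentiment)
    return result


sample = ["I love this !", "bad day !", "great , just great !"]
stats = term_sentiments(sample, TOY_LEXICON)
print(stats["!"])  # → {'pos': 9, 'neg': -3, 'count': 3, 'sentiment': 2.0}
```

The same arithmetic reproduces the published numbers: for the `!` row, (1111 + (-190)) / 356 ≈ 2.59.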

I did a quick and dirty run through about 6K English tweets collected from the Twitter sprinkler API. Not a representative sample by any means, but again, my aim was to get a quick estimate, not to write a scientific paper. The results are below; terms that appeared fewer than 10 times are omitted.

Term    Sentiment Score  Occurrences  Positive Score  Negative Score
@       1.41             2629         6337            -2630
.       1.19             2585         5816            -2733
:       0.99             1746         3770            -2033
/       1.14             1470         3014            -1341
://     1.07             1421         2871            -1349
        0.61             1119         2503            -1816
#       1.74             753          1795            -483
,       1.22             692          1723            -877
_       1.56             421          1013            -355
!       2.59             356          1111            -190
        1.67             315          835             -310
        1.27             265          538             -201
        0.57             255          498             -353
?       1.55             197          482             -177
;       0.97             188          447             -265
&       1.13             151          400             -230
        0.60             122          234             -161
!!      2.17             114          352             -105
(       1.09             76           133             -50
)       1.01             68           112             -43
_:      -0.28            58           95              -111
@_      0.19             57           117             -106
..      0.75             52           115             -76
        2.11             46           131             -34
?       0.33             43           84              -70
;&      0.66             38           53              -28
❤️      3.24             37           129             -9
*       0.79             34           65              -38
!!!     2.73             30           100             -18
🙂      2.93             27           98              -19
?       4.04             25           106             -5
??      1.58             24           64              -26
[       0.95             22           40              -19
]       0.64             22           36              -22
$       1.95             22           62              -19
???     0.05             21           22              -21
….      -0.05            21           30              -31
.”      0.05             19           23              -22
?       4.78             18           86              0
        5.00             18           90              0
|       2.41             17           44              -3
??      -1.12            16           25              -43
        0.27             15           33              -29
__      1.29             14           34              -16
%       1.43             14           26              -6
?       2.54             13           42              -9
        0.69             13           24              -15
+       2.25             12           33              -6
?       4.33             12           54              -2
:…      2.45             11           32              -5
?       0.18             11           20              -18
???     4.00             11           44              0
        0.36             11           20              -16
?       -0.40            10           11              -15
__:     0.90             10           23              -14
?       2.80             10           32              -4
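The cutoff and ordering used for the table above (at least 10 occurrences, most frequent first) could be sketched like this. The function name and the dictionary layout are assumptions; the `!` and `+` entries reuse counts from the table, while `~` is a hypothetical rare term added only to show the filter.

```python
def publishable_rows(stats, min_occurrences=10):
    """Keep terms seen at least min_occurrences times, most frequent first."""
    kept = [(term, s) for term, s in stats.items() if s["count"] >= min_occurrences]
    return sorted(kept, key=lambda item: item[1]["count"], reverse=True)


example_stats = {
    "+": {"sentiment": 2.25, "count": 12},
    "~": {"sentiment": 1.00, "count": 3},   # hypothetical, below the cutoff
    "!": {"sentiment": 2.59, "count": 356},
}

for term, s in publishable_rows(example_stats):
    print(term, s["sentiment"], s["count"])
```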