Using tf-idf to measure emoji sentiment

My data scientist friend suggested two changes to the emoji/emoticon script I wrote: sort the list by score, and use tf-idf to measure the significance of each detected emoticon or emoji, then filter on it.

tf-idf stands for “term frequency, inverse document frequency.” The idea is that if a term appears a lot in a document (tweet), then that term must be important. But if the term also appears in a lot of documents, then it must be not-so-important. Here’s the pseudocode to calculate a tf-idf score for each given word in a document:

let tf = number of times the word appears in the document / number of words in a document
let n_containing = the number of documents where the word appears
let idf = math.log(number of documents / (1 + n_containing))
let tf-idf = tf * idf
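
As a rough illustration, the pseudocode above might translate to Python as follows. This is a sketch, not necessarily the original script: it assumes the idf is the logarithm of the ratio (number of documents divided by one plus the document count), splits tweets on whitespace, and uses made-up tweets and helper names.

```python
import math

def tf(word, words):
    # Share of the document's words that are `word`.
    return words.count(word) / len(words)

def n_containing(word, docs):
    # Number of documents in which `word` appears at least once.
    return sum(1 for words in docs if word in words)

def idf(word, docs):
    # Assumption: log of the ratio; the +1 avoids division by zero.
    return math.log(len(docs) / (1 + n_containing(word, docs)))

def tf_idf(word, words, docs):
    return tf(word, words) * idf(word, docs)

# Made-up tweets, split on whitespace so that ":)" counts as a word.
tweets = ["good morning :)", "so tired today", "rain again today", "coffee first"]
docs = [tweet.split() for tweet in tweets]

rare = tf_idf(":)", docs[0], docs)       # ":)" appears in 1 of 4 tweets
common = tf_idf("today", docs[1], docs)  # "today" appears in 2 of 4 tweets
```

Both terms make up a third of their tweet, so the term that shows up in fewer tweets (`rare`) ends up with the higher score. Note that with the `+1` in the denominator, a term found in every document gets a slightly negative idf.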

To get a single “average” tf-idf for a term across all tweets, I took the median of its tf-idf scores over only the tweets that contained the term.
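
One way to sketch that aggregation in Python is shown below. The function name `median_tf_idf`, the sample tweets, and the log-of-ratio idf are assumptions made for the example; the only part taken from the text is that the median is computed over tweets that contain the term.

```python
import math
from collections import defaultdict
from statistics import median

def median_tf_idf(docs):
    """Map each term to the median of its tf-idf over the docs containing it."""
    n_docs = len(docs)

    # Document frequency: in how many docs does each term appear?
    doc_freq = defaultdict(int)
    for words in docs:
        for term in set(words):
            doc_freq[term] += 1

    # Collect one tf-idf score per (term, containing doc) pair.
    scores = defaultdict(list)
    for words in docs:
        for term in set(words):
            tf = words.count(term) / len(words)
            idf = math.log(n_docs / (1 + doc_freq[term]))  # assumed form
            scores[term].append(tf * idf)

    return {term: median(vals) for term, vals in scores.items()}

# Made-up tweets, split on whitespace.
docs = [t.split() for t in ["yay :) :)", "ok :)", "meh", "rain", "sun", "snow"]]
result = median_tf_idf(docs)
```

Tweets that lack the term never contribute a zero to its list, so a rare but locally prominent term is not dragged toward zero.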

To define a “word” so that emoji count as words in a tweet, I matched the following regular expression and counted how often each match occurred within the tweet:

\S+
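
A small Python sketch of that counting step, assuming the standard `re` module and a made-up tweet (`count_terms` is a name invented for the example):

```python
import re
from collections import Counter

def count_terms(tweet):
    # Every run of non-whitespace characters is treated as one "word".
    return Counter(re.findall(r"\S+", tweet))

# "\U0001F609" is the winking-face emoji.
counts = count_terms("so happy \U0001F609 \U0001F609 today!")
```

Here the emoji is counted twice and the tweet has five words in total. Punctuation stays attached to its neighbor, so `today!` is a different word from `today`.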

I expanded the tweet list to 16K, and then filtered for terms that appeared more than 20 times and had an average tf-idf of at least 0.01. The only pre-processing I did within the tweet was to replace any whitespace, including newlines, with a single space.
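
The pre-processing and the filter could look roughly like this in Python. The function names, the sample statistics, and the exact comparisons (strictly more than 20 occurrences, tf-idf of 0.01 or higher) are assumptions for illustration.

```python
import re

def normalize_whitespace(tweet):
    # Collapse each run of whitespace (spaces, tabs, newlines) into one space.
    return re.sub(r"\s+", " ", tweet)

def keep_term(occurrences, avg_tf_idf, min_occurrences=20, min_tf_idf=0.01):
    # Assumed reading of the thresholds: "more than 20" and "at least 0.01".
    return occurrences > min_occurrences and avg_tf_idf >= min_tf_idf

clean = normalize_whitespace("line one\n\nline   two\t:)")

# Hypothetical per-term statistics: term -> (occurrences, average tf-idf).
stats = {":)": (28, 0.040), "!": (500, 0.004), "^^": (5, 0.080)}
kept = [term for term, (occ, score) in stats.items() if keep_term(occ, score)]
```

In this toy data only `:)` survives: `!` is frequent but too generic, and `^^` is distinctive but too rare.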

Here’s the new list — much nicer! Thanks for the suggestions, Michael!

Term  Sentiment Score  TF-IDF  Occurrences  Positive Score  Negative Score
      5.32             0.040   28           151             -2
      5.09             0.023   22           114             -2
?     4.58             0.059   26           119             -0
      4.52             0.099   21           99              -4
?     3.73             0.032   37           138             -0
?     3.50             0.026   42           173             -26
?     3.39             0.012   101          358             -16
?     3.07             0.027   30           112             -20
‘.    2.56             0.016   32           90              -8
?     2.44             0.022   50           145             -23
😉     2.36             0.035   22           52              -0
??    2.23             0.044   22           63              -14
£     2.18             0.026   28           67              -6
?.    2.00             0.020   29           58              -0
?     2.00             0.020   29           58              -0
?     2.00             0.040   58           116             -0
?     2.00             0.020   29           58              -0
?     2.00             0.020   29           58              -0
?     2.00             0.020   29           58              -0
?     2.00             0.040   58           116             -0
=     1.72             0.031   29           89              -39
?     -1.58            0.023   31           47              -96
|     1.56             0.012   79           175             -52
.@    1.52             0.017   40           96              -35
@__   1.46             0.033   28           79              -38
“@    1.41             0.023   27           65              -27
?!    1.39             0.026   28           67              -28
?     -1.37            0.035   30           22              -63
~     1.36             0.024   39           84              -31
      1.21             0.017   38           77              -31
?     -1.18            0.042   22           26              -52
!!!!  1.08             0.036   36           75              -36
??    1.06             0.020   51           120             -66
?     1.05             0.049   20           33              -12
??    -0.99            0.014   70           97              -166
__:   -0.98            0.023   41           47              -87
]     0.94             0.013   67           114             -51
?     -0.87            0.037   23           26              -46
[     0.79             0.014   61           96              -48
__    0.70             0.016   60           129             -87
???   -0.66            0.013   85           101             -157
🙁     0.61             0.026   31           64              -45
???   0.55             0.051   20           44              -33
.”    0.51             0.018   37           53              -34
*     0.51             0.013   184          295             -202
?     -0.42            0.023   52           77              -99
…..   0.41             0.020   37           68              -53
?     -0.39            0.043   23           46              -55
????  0.33             0.041   21           40              -33
?     0.29             0.017   51           84              -69
?     -0.29            0.039   31           52              -61
;&    -0.25            0.016   116          126             -155
      0.25             0.033   36           57              -48
      -0.22            0.021   27           44              -50
?”    0.03             0.018   30           48              -47
Posted in Programming | Tagged | Leave a comment

Getting Twitter sentiment of emoticons and emoji

A data scientist friend and I talked briefly about how hard it is to parse text into words. He mentioned that he felt that not enough attention was paid to emoticons or emoji in the Twitter sentiment analysis papers he reads.

In this context, a sentiment score is a measure of emotion associated with a word or phrase. Negative scores mean the word is associated with negative emotions, and positive scores mean the word is associated with positive emotions.

Coincidentally, I’d been taking a data science course, and the assignment I was working on concerned Twitter sentiment analysis. It wasn’t too hard to adapt the homework to estimate sentiment for non-alphanumeric characters. The regular expression I used to grab emoticons and emoji was:

[^A-Za-z0-9\s]+

Clearly, the above regex captures more than emoticons and emoji: it also captures punctuation and valid words written in non-ASCII scripts. Nonetheless, I thought it’d be an interesting start.
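
To make the over-capture concrete, here is a short Python sketch that applies the same pattern to a made-up tweet. The name `extract_terms` and the sample strings are invented for the example.

```python
import re

# One or more characters that are not ASCII letters, digits, or whitespace.
NON_ALNUM = re.compile(r"[^A-Za-z0-9\s]+")

def extract_terms(tweet):
    return NON_ALNUM.findall(tweet)

# "\u2764" is the heavy black heart character.
terms = extract_terms("Check http://t.co/abc :) so good!! \u2764")

# "\u00e9" is e with an acute accent, as in "cafe" with the accent.
accented = extract_terms("caf\u00e9")
```

The emoticon and the heart are picked up, but so are the URL fragments `://`, `.`, and `/`, and the accented letter is split off from the rest of its word.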

To calculate the sentiment score of each non-alphanumeric term, I used this formula:

let pos = [cumulative score of positive sentiment words in the tweets in which the term appeared]
let neg = [cumulative score of negative sentiment words in the tweets in which the term appeared; a negative number]
let count = [number of times the term appeared across all tweets]
let sentiment = pos / count + neg / count

The sentiment words and their scores are taken from the AFINN list.
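
A minimal Python sketch of this scoring follows. Several things here are assumptions rather than details from the original script: the tiny `LEXICON` is a toy stand-in for the AFINN list with illustrative values, words are matched as lowercase ASCII letter runs, every occurrence of a term adds its tweet's totals, and the negative total is kept as a negative number (as in the results table), so the two averages are added.

```python
import re
from collections import defaultdict

# Toy stand-in for the AFINN list (word -> score); values are illustrative.
LEXICON = {"good": 3, "love": 3, "bad": -3}

TERM_RE = re.compile(r"[^A-Za-z0-9\s]+")

def term_sentiment(tweets, lexicon):
    pos = defaultdict(int)    # summed positive word scores per term
    neg = defaultdict(int)    # summed negative word scores per term (<= 0)
    count = defaultdict(int)  # occurrences of each term

    for tweet in tweets:
        scores = [lexicon.get(w, 0) for w in re.findall(r"[a-z]+", tweet.lower())]
        tweet_pos = sum(s for s in scores if s > 0)
        tweet_neg = sum(s for s in scores if s < 0)
        # Assumption: each occurrence of a term adds the tweet's totals.
        for term in TERM_RE.findall(tweet):
            pos[term] += tweet_pos
            neg[term] += tweet_neg
            count[term] += 1

    return {t: pos[t] / count[t] + neg[t] / count[t] for t in count}

tweets = ["good day :)", "love it :) :)", "bad bad day :("]
result = term_sentiment(tweets, LEXICON)
```

With these three tweets, `:)` averages out positive and `:(` negative, since each term inherits the sentiment of the words around it.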

I did a quick and dirty run through about 6K English tweets collected from the Twitter sprinkler API. This is not a representative sample by any means, but again, my aim was a quick estimate, not a scientific paper. The results are below; terms that appeared fewer than 10 times are omitted.

Term  Sentiment Score  Occurrences  Positive Score  Negative Score
@     1.41             2629         6337            -2630
.     1.19             2585         5816            -2733
:     0.99             1746         3770            -2033
/     1.14             1470         3014            -1341
://   1.07             1421         2871            -1349
      0.61             1119         2503            -1816
#     1.74             753          1795            -483
,     1.22             692          1723            -877
_     1.56             421          1013            -355
!     2.59             356          1111            -190
      1.67             315          835             -310
      1.27             265          538             -201
      0.57             255          498             -353
?     1.55             197          482             -177
;     0.97             188          447             -265
&     1.13             151          400             -230
      0.60             122          234             -161
!!    2.17             114          352             -105
(     1.09             76           133             -50
)     1.01             68           112             -43
_:    -0.28            58           95              -111
@_    0.19             57           117             -106
..    0.75             52           115             -76
      2.11             46           131             -34
?     0.33             43           84              -70
;&    0.66             38           53              -28
❤️    3.24             37           129             -9
*     0.79             34           65              -38
!!!   2.73             30           100             -18
🙂     2.93             27           98              -19
?     4.04             25           106             -5
??    1.58             24           64              -26
[     0.95             22           40              -19
]     0.64             22           36              -22
$     1.95             22           62              -19
???   0.05             21           22              -21
….    -0.05            21           30              -31
.”    0.05             19           23              -22
?     4.78             18           86              0
      5.00             18           90              0
|     2.41             17           44              -3
??    -1.12            16           25              -43
      0.27             15           33              -29
__    1.29             14           34              -16
%     1.43             14           26              -6
?     2.54             13           42              -9
      0.69             13           24              -15
+     2.25             12           33              -6
?     4.33             12           54              -2
:…    2.45             11           32              -5
?     0.18             11           20              -18
???   4.00             11           44              0
      0.36             11           20              -16
?     -0.40            10           11              -15
__:   0.90             10           23              -14
?     2.80             10           32              -4