Using tf-idf to measure emoji sentiment

My data scientist friend suggested two changes to the emoji/emoticon script I wrote: sort the list by score, and use tf-idf to calculate the significance of each detected emoticon or emoji, then filter on that.

tf-idf stands for “term frequency, inverse document frequency.” The idea is that if a term appears a lot in a document (tweet), then that term must be important. But if the term also appears in a lot of documents, then it must be not-so-important. Here’s the pseudocode to calculate a tf-idf score for each given word in a document:

let tf = number of times the word appears in the document / number of words in a document
let n_containing = the number of documents where the word appears
let idf = math.log(number of documents / (1 + n_containing))
let tf-idf = tf * idf
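
As a rough illustration, the pseudocode could be written in Python along these lines. This is a minimal sketch, not the script used for this post: the function names and the toy corpus are made up, and each tweet is assumed to be already split into a list of words.

```python
import math

def tf(word, doc_words):
    # Share of the document's words that are this word.
    return doc_words.count(word) / len(doc_words)

def n_containing(word, docs):
    # Number of documents in which the word appears at least once.
    return sum(1 for doc_words in docs if word in doc_words)

def idf(word, docs):
    # The 1 in the denominator avoids dividing by zero for unseen words.
    return math.log(len(docs) / (1 + n_containing(word, docs)))

def tf_idf(word, doc_words, docs):
    return tf(word, doc_words) * idf(word, docs)

# Hypothetical toy corpus: four tweets, already tokenized.
docs = [
    ["good", "morning", ":)"],
    ["good", "night"],
    ["so", "tired", ":("],
    ["coffee", "time"],
]
score = tf_idf(":)", docs[0], docs)
```

One side effect of the `1 +` in the denominator: a word that appears in every document gets a slightly negative idf, which only matters for very common terms.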

To get an average tf-idf for a word across all tweets, I took the median of its tf-idf scores over only the tweets that contained that term.
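
Here is one way that per-term median might look in Python. It is a sketch under assumptions: `median_tf_idf` is a hypothetical helper, tweets are lists of words, and the idf uses the same `1 + n_containing` form as the pseudocode.

```python
import math
from statistics import median

def median_tf_idf(term, docs):
    # Only tweets that actually contain the term contribute a score.
    containing = [d for d in docs if term in d]
    if not containing:
        return None
    idf = math.log(len(docs) / (1 + len(containing)))
    scores = [d.count(term) / len(d) * idf for d in containing]
    return median(scores)

# Hypothetical toy corpus: six tweets, two of which contain ":)".
docs = [
    [":)", "hi"],
    [":)", "a", "b", "c"],
    ["x"], ["y"], ["z"], ["w"],
]
result = median_tf_idf(":)", docs)
```

Leaving out tweets that lack the term matters: their tf would be zero, so including them would pull the median of nearly every term down to zero.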

So that emojis would count as words in a tweet, I defined a word as any run of non-whitespace characters, using the following regular expression, and counted the occurrences of each match within the tweet:

\S+
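
A small Python sketch of counting terms with that pattern, using the standard `re` and `collections` modules. The function name and the sample tweet are made up for illustration.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\S+")

def count_terms(tweet):
    # Every run of non-whitespace characters is one term,
    # so emojis and emoticons count the same way words do.
    return Counter(WORD_RE.findall(tweet))

counts = count_terms("so happy 😉 😉 :( today")
```

With this definition, an emoji attached to a word with no space between them (for example `happy😉`) is counted as one combined term rather than two.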

I expanded the tweet list to 16K, and then filtered for terms that appeared more than 20 times and had an average tf-idf of at least 0.01. The only pre-processing I did within the tweet was to replace any whitespace, including newlines, with a single space.
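
The pre-processing and filtering steps might be sketched like this in Python. The details are assumptions rather than the original script: runs of whitespace are collapsed into one space, "more than 20" is taken literally as strictly greater, the tf-idf cutoff is read as a lower bound, and the `stats` dictionary is invented sample data.

```python
import re

def normalize_whitespace(tweet):
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space.
    return re.sub(r"\s+", " ", tweet)

def keep_term(occurrences, avg_tf_idf, min_occurrences=20, min_tf_idf=0.01):
    # Assumed reading of the thresholds: strictly more than 20
    # occurrences, and an average tf-idf of at least 0.01.
    return occurrences > min_occurrences and avg_tf_idf >= min_tf_idf

# Hypothetical per-term stats: term -> (occurrences, average tf-idf).
stats = {
    "😉": (22, 0.035),
    "rare": (5, 0.200),
    "common": (300, 0.002),
}
kept = {term: s for term, s in stats.items() if keep_term(*s)}
```

Here only `😉` survives: `rare` appears too seldom, and `common` has too low a tf-idf.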

Here’s the new list — much nicer! Thanks for the suggestions, Michael!

Term Sentiment Score TF-IDF Occurrences Positive Score Negative Score
5.32 0.040 28 151 -2
5.09 0.023 22 114 -2
? 4.58 0.059 26 119 -0
4.52 0.099 21 99 -4
? 3.73 0.032 37 138 -0
? 3.50 0.026 42 173 -26
? 3.39 0.012 101 358 -16
? 3.07 0.027 30 112 -20
‘. 2.56 0.016 32 90 -8
? 2.44 0.022 50 145 -23
😉 2.36 0.035 22 52 -0
?? 2.23 0.044 22 63 -14
£ 2.18 0.026 28 67 -6
?. 2.00 0.020 29 58 -0
? 2.00 0.020 29 58 -0
? 2.00 0.040 58 116 -0
? 2.00 0.020 29 58 -0
? 2.00 0.020 29 58 -0
? 2.00 0.020 29 58 -0
? 2.00 0.040 58 116 -0
= 1.72 0.031 29 89 -39
? -1.58 0.023 31 47 -96
| 1.56 0.012 79 175 -52
.@ 1.52 0.017 40 96 -35
@__ 1.46 0.033 28 79 -38
“@ 1.41 0.023 27 65 -27
?! 1.39 0.026 28 67 -28
? -1.37 0.035 30 22 -63
~ 1.36 0.024 39 84 -31
1.21 0.017 38 77 -31
? -1.18 0.042 22 26 -52
!!!! 1.08 0.036 36 75 -36
?? 1.06 0.020 51 120 -66
? 1.05 0.049 20 33 -12
?? -0.99 0.014 70 97 -166
__: -0.98 0.023 41 47 -87
] 0.94 0.013 67 114 -51
? -0.87 0.037 23 26 -46
[ 0.79 0.014 61 96 -48
__ 0.70 0.016 60 129 -87
??? -0.66 0.013 85 101 -157
:( 0.61 0.026 31 64 -45
??? 0.55 0.051 20 44 -33
.” 0.51 0.018 37 53 -34
* 0.51 0.013 184 295 -202
? -0.42 0.023 52 77 -99
….. 0.41 0.020 37 68 -53
? -0.39 0.043 23 46 -55
???? 0.33 0.041 21 40 -33
? 0.29 0.017 51 84 -69
? -0.29 0.039 31 52 -61
;& -0.25 0.016 116 126 -155
0.25 0.033 36 57 -48
-0.22 0.021 27 44 -50
?” 0.03 0.018 30 48 -47