Weibo senti 100k is very likely labelled by the emoticons #1

ThiagoSousa · 2018-08-06T08:53:49Z

I downloaded this dataset (ChineseNlpCorpus/datasets/weibo_senti_100k) to train a model for Chinese sentiment analysis. While preprocessing it, I observed that 100% of the posts contain emoticons. Here is the distribution of the top 10 emoticons, overall and then split by positive and negative polarity:

1013 emoticons in total. They are: [('泪', 44489), ('哈哈', 40510), ('嘻嘻', 22370), ('抓狂', 17262), ('鼓掌', 15923), ('爱你', 12685), ('怒', 12011), ('衰', 10466), ('晕', 9440), ('偷笑', 8375)]

710 emoticons in the positive set. They are: [('哈哈', 35764), ('嘻嘻', 20115), ('鼓掌', 14836), ('爱你', 11349), ('偷笑', 5223), ('太开心', 3820), ('可爱', 3809), ('心', 2122), ('赞', 1991), ('给力', 1976)]

695 emoticons in the negative set. They are: [('泪', 43248), ('抓狂', 16643), ('怒', 11830), ('衰', 10202), ('晕', 9022), ('哈哈', 4746), ('偷笑', 3152), ('蜡烛', 2887), ('汗', 2456), ('嘻嘻', 2255)]
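For anyone who wants to reproduce counts like these, here is a minimal sketch. It assumes the emoticons appear in the post text as short bracketed names such as `[哈哈]`, and that each row is a `(label, text)` pair; the regex, the function name `emoticon_counts`, and the sample rows are all illustrative, not taken from the dataset.

```python
import re
from collections import Counter

# Assumption: emoticons are written as short bracketed names, e.g. [哈哈].
EMOTICON_RE = re.compile(r"\[([^\[\]\s]{1,10})\]")

def emoticon_counts(rows):
    """Count emoticon names per label for an iterable of (label, text) pairs."""
    counts = {}
    for label, text in rows:
        counts.setdefault(label, Counter()).update(EMOTICON_RE.findall(text))
    return counts

# Made-up rows for illustration only (1 = positive, 0 = negative).
sample = [
    (1, "今天天气真好[哈哈][哈哈]"),
    (0, "又加班了[泪][抓狂]"),
    (1, "谢谢大家[爱你]"),
]
counts = emoticon_counts(sample)
print(counts[1].most_common(10))
print(counts[0].most_common(10))
```

Emoticons that rank high under both labels (like 哈哈 and 偷笑 in the lists above) are the ones to look at first when judging how noisy the labels are.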

I trained a very simple classifier and obtained 98% accuracy in 2 epochs, so the emoticons strongly bias the classification. This led me to conclude that the dataset was not manually annotated. Probably whoever built it manually classified some frequent emoticons and then used them to tag the posts. Just a heads-up for anyone who wants to use this data: you'd probably want to clean the emoticons out of it to avoid this bias.
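A rough sketch of that cleaning step, again assuming the emoticons show up as short bracketed names like `[泪]` (the pattern and the helper name `strip_emoticons` are my own, so adjust them to whatever the raw text actually looks like):

```python
import re

# Assumption: emoticons are written as short bracketed names, e.g. [泪].
EMOTICON_RE = re.compile(r"\[[^\[\]\s]{1,10}\]")

def strip_emoticons(text):
    """Remove bracketed emoticon tokens and trim surrounding whitespace."""
    return EMOTICON_RE.sub("", text).strip()

cleaned = strip_emoticons("又加班了[泪][抓狂]")
print(cleaned)
```

Posts that consist only of emoticons become empty strings after this step, so it is probably worth dropping those rows before training.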

Peace!

OYE93 · 2019-01-04T06:12:22Z

lol, the findings are really interesting! @ThiagoSousa

jinhuakst · 2019-01-15T11:59:44Z

@ThiagoSousa Yeah. Thank you for your comments.

arsentiii · 2019-08-02T07:45:22Z

thx for your work.

easywaytodo · 2019-10-24T06:48:58Z

Could I use it with BERT, and how should I preprocess the data? Are the emoticons out of vocabulary?
