Using Reddit data to detect sarcasm in comments

The aim of our project is to detect whether comments in Reddit threads are sarcastic. A sarcastic expression or comment can be defined as one that is caustic, bitter, or cutting. Sarcasm detection is a difficult task because it depends heavily on context, prior knowledge, and the tone in which a sentence is spoken or written. It is crucial to pin down what exactly counts as sarcasm, since its boundaries are not well defined. In sentiment analysis, by contrast, the sentiment categories are clearly defined ("love" has a positive sentiment and "hate" a negative one, no matter who you ask or what language you speak).
This dataset contains 1.3 million sarcastic comments from the Internet commentary website Reddit. It was generated by scraping Reddit comments containing the /s (sarcasm) tag. Redditors often use this tag to indicate that a comment is in jest and not meant to be taken seriously, so it is generally a reliable indicator of sarcastic content. The dataset is balanced across the two classes.

Attribute information:
- label: Whether the comment is sarcastic or not
- comment: The comment to be classified as sarcastic or not
- author: Author of the comment
- subreddit: The subreddit in which the comment was posted
- score: The net score (upvotes minus downvotes)
- ups: The number of upvotes
- downs: The number of downvotes
- date: The date the comment was posted
- created_utc: The UTC timestamp when the comment was posted
- parent_comment: The parent comment to which this comment was posted as a response
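As a quick sanity check on the attributes listed above, the following is a minimal sketch of reading the data and counting the labels. It assumes the file is a CSV whose header uses underscore-separated column names, and that label 1 means sarcastic; the two sample rows and the helper names `read_comments` and `label_counts` are made up for illustration and are not part of the dataset or our pipeline.

```python
import csv
import io
from collections import Counter

# Assumption: the CSV header uses these names (underscores instead of spaces).
EXPECTED_COLUMNS = [
    "label", "comment", "author", "subreddit", "score",
    "ups", "downs", "date", "created_utc", "parent_comment",
]


def read_comments(handle):
    """Read rows from an open CSV handle and check the header against the schema."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        # Convert the numeric fields we rely on; the rest stay as strings.
        row["label"] = int(row["label"])
        row["score"] = int(row["score"])
        rows.append(row)
    return rows


def label_counts(rows):
    """Count how many rows carry each label (assumed: 1 = sarcastic, 0 = not)."""
    return Counter(row["label"] for row in rows)


# Two invented rows standing in for the real file.
SAMPLE = io.StringIO(
    "label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment\n"
    "1,Yeah that will totally work,user_a,AskReddit,5,5,0,2016-10,2016-10-16 23:55:23,Will it work?\n"
    "0,I think it will work,user_b,AskReddit,3,3,0,2016-10,2016-10-17 08:12:01,Will it work?\n"
)

rows = read_comments(SAMPLE)
counts = label_counts(rows)
print(counts)
```

On the real file, roughly equal counts for the two labels would confirm that the dataset is balanced.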
The table below shows that we started with Logistic Regression on TF-IDF features, which gave us an accuracy of 68.32%. Of the other models we tried, only FastText and the bidirectional LSTM with one_hot improved on this baseline.
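To make the TF-IDF features behind the baseline concrete, here is a small pure-Python sketch of how such weights can be computed. The whitespace tokenisation, the smoothed IDF formula, and the L2 normalisation are assumptions chosen for illustration (they match the default behaviour of scikit-learn's `TfidfVectorizer`); the settings actually used for our baseline are not restated here, and the three example sentences are invented.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


docs = [
    "great job breaking it",
    "great job fixing it",
    "thanks for the help",
]
vectors = tfidf(docs)

# Terms that appear in fewer documents receive a larger weight.
print(vectors[0]["breaking"] > vectors[0]["great"])
```

Each comment becomes a sparse vector of such weights, and a linear classifier such as logistic regression is then trained on those vectors.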
Of the models listed in the table above, the bidirectional LSTM with one_hot performs best, with an accuracy of 72.15%. We therefore move forward with this model and show its confusion matrix below.
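For readers less familiar with how a confusion matrix relates to accuracy, the following is a minimal sketch for the binary case. The labels and predictions are made up purely for illustration and are not our model's output; the helper names are hypothetical, and label 1 is assumed to mean sarcastic.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = sarcastic)."""
    counts = Counter(zip(y_true, y_pred))
    return {
        "tn": counts[(0, 0)],
        "fp": counts[(0, 1)],
        "fn": counts[(1, 0)],
        "tp": counts[(1, 1)],
    }


def accuracy(cm):
    """Accuracy is the share of predictions on the matrix diagonal."""
    total = sum(cm.values())
    return (cm["tp"] + cm["tn"]) / total if total else 0.0


# Invented labels and predictions, not results from our experiments.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
print(accuracy(cm))
```

Beyond accuracy, the off-diagonal cells show whether a model tends to miss sarcastic comments (false negatives) or to flag sincere ones (false positives).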