Fake News Detection with Multinomial Naive Bayes and K-Nearest Neighbors on Twitter focused on COVID in Brazil
- Extract tweets in Portuguese related to the COVID-19 pandemic for further inspection
- Use a pre-determined dataset for training both algorithms
- Clean the training dataset
- Train a model on the dataset with each of the algorithms
- Use the trained models to classify the previously extracted tweets
- Measure accuracy and precision for both algorithms so they can be compared
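As a rough illustration of the train-and-compare steps above, here is a minimal sketch using scikit-learn. It is not the code from the notebooks: the toy texts and labels are made up, and the bag-of-words features (`CountVectorizer`), `n_neighbors=3`, and the `evaluate` helper are all assumptions chosen for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Made-up toy data for illustration only (1 = fake, 0 = real).
train_texts = [
    "miracle tea cures covid overnight",
    "secret cure hidden by doctors",
    "vaccine contains tracking chip",
    "health ministry publishes vaccination schedule",
    "hospital reports new covid cases today",
    "study confirms masks reduce transmission",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = [
    "doctors hide miracle cure",
    "ministry reports vaccination numbers today",
]
test_labels = [1, 0]


def evaluate(classifier):
    """Fit a bag-of-words pipeline and score it on the held-out toy tweets."""
    model = make_pipeline(CountVectorizer(), classifier)
    model.fit(train_texts, train_labels)
    predicted = model.predict(test_texts)
    return {
        "accuracy": accuracy_score(test_labels, predicted),
        # zero_division=0 avoids a warning if no tweet is predicted as fake.
        "precision": precision_score(test_labels, predicted, zero_division=0),
    }


# Same data and features for both algorithms, so the scores are comparable.
results = {
    "MNB": evaluate(MultinomialNB()),
    "KNN": evaluate(KNeighborsClassifier(n_neighbors=3)),
}

for name, scores in results.items():
    print(name, scores)
```

Keeping the vectorizer inside the pipeline means each classifier sees features fitted only on the training texts, which keeps the comparison fair.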
- scikit-learn
- NumPy
- Pandas
- Seaborn
- NLTK
- Tweepy
Access to the Twitter API is required to run this project. All the retrieval code is in tweets_retrieval; you must create a txt file containing the necessary keys, in this order:
consumer_key
consumer_secret
access_token
access_secret
bearer_token
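For reference, a keys file in that order could be read as sketched below. This assumes one key per line, which the project does not state explicitly; `parse_keys`, `load_keys`, and the file name `keys.txt` are hypothetical names used only for illustration.

```python
from pathlib import Path

# Order taken from the list above.
KEY_NAMES = [
    "consumer_key",
    "consumer_secret",
    "access_token",
    "access_secret",
    "bearer_token",
]


def parse_keys(text):
    """Map each non-empty line to its key name, in order (assumes one key per line)."""
    values = [line.strip() for line in text.splitlines() if line.strip()]
    if len(values) != len(KEY_NAMES):
        raise ValueError(
            f"expected {len(KEY_NAMES)} keys, found {len(values)}"
        )
    return dict(zip(KEY_NAMES, values))


def load_keys(path):
    """Read the keys file from disk; the path is whatever you named your txt file."""
    return parse_keys(Path(path).read_text(encoding="utf-8"))


# Example (hypothetical file name):
# keys = load_keys("keys.txt")
# keys["bearer_token"]
```

Failing early on a wrong line count makes a misordered or incomplete file easier to spot than an authentication error later on.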
Run real_tweets.ipynb to retrieve live tweets about COVID-19. To change the language retrieved, replace 'pt' with the corresponding BCP 47 language identifier on this line:
if json_response['data']['lang'] != 'pt':
See more information about the tweet lang operator here
You can also change the context annotation in the tweet query to retrieve tweets about other subjects. Once you have enough tweets, stop the script by pressing CTRL + C in the terminal running it.
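The language check above can be pulled into a small helper so the language code lives in one place. This is only a sketch: `keep_tweet` and `wanted_lang` are hypothetical names, and the only thing taken from the project is the `json_response['data']['lang']` shape.

```python
def keep_tweet(json_response, wanted_lang="pt"):
    """Return True when the tweet's language matches the wanted BCP 47 code."""
    data = json_response.get("data", {})
    # A missing 'lang' field is treated as a non-match instead of raising KeyError.
    return data.get("lang") == wanted_lang


# Sample payloads shaped like the response used in the notebook (made-up text).
portuguese = {"data": {"lang": "pt", "text": "exemplo"}}
english = {"data": {"lang": "en", "text": "example"}}

print(keep_tweet(portuguese))
print(keep_tweet(english))
print(keep_tweet(english, wanted_lang="en"))
```

In the retrieval loop, the original condition would then read `if not keep_tweet(json_response):` under these assumptions.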
Before you execute fake_news_MNB.ipynb, install pandas and NumPy:
pip install pandas && pip install numpy
After that, before you can use the Tweepy object, create an auth object like this:
import tweepy
auth = tweepy.OAuthHandler(your_consumer_key, your_consumer_secret)
auth.set_access_token(your_access_token, your_access_secret)
With this, setup is complete and you can run the MNB and KNN files to see and compare the results of the two algorithms.
Note that the training dataframe is arranged only in the MNB file. The KNN file uses the final CSV with all the necessary data.