Add files for Week 20

robertopreste · Aug 14, 2018 · b7cef3c · b7cef3c
1 parent 4b18807
commit b7cef3c
Show file tree

Hide file tree

Showing 9 changed files with 3,523 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -4,3 +4,5 @@
 .Ruserdata
 *.Rproj
 .DS_Store
+Week_20/data
+Week_20/russian-troll-tweets
diff --git a/Week_20/Week_20.Rmd b/Week_20/Week_20.Rmd
@@ -0,0 +1,174 @@
+---
+title: "TidyTuesday Week 20 - Russian Troll Tweets"
+author: "Roberto Preste"
+date: "`r Sys.Date()`"
+output: github_document
+---
+
+```{r}
+library(tidyverse)
+library(lubridate)
+```
+
+___ 
+
+The original data are available on fivethirtyeight's [GitHub](https://github.com/fivethirtyeight/russian-troll-tweets) and are detailed in a specific [article](https://fivethirtyeight.com/features/why-were-sharing-3-million-russian-troll-tweets/).  
+
+```{r}
+tweet_1 <- read_csv("russian-troll-tweets/IRAhandle_tweets_1.csv", col_types = "ccccccciiicciic")
+tweet_2 <- read_csv("russian-troll-tweets/IRAhandle_tweets_2.csv", col_types = "ccccccciiicciic") # issues
+tweet_3 <- read_csv("russian-troll-tweets/IRAhandle_tweets_3.csv", col_types = "ccccccciiicciic")
+tweet_4 <- read_csv("russian-troll-tweets/IRAhandle_tweets_4.csv", col_types = "ccccccciiicciic")
+tweet_5 <- read_csv("russian-troll-tweets/IRAhandle_tweets_5.csv", col_types = "ccccccciiicciic")
+tweet_6 <- read_csv("russian-troll-tweets/IRAhandle_tweets_6.csv", col_types = "ccccccciiicciic") # issues
+tweet_7 <- read_csv("russian-troll-tweets/IRAhandle_tweets_7.csv", col_types = "ccccccciiicciic")
+tweet_8 <- read_csv("russian-troll-tweets/IRAhandle_tweets_8.csv", col_types = "ccccccciiicciic") # issues
+tweet_9 <- read_csv("russian-troll-tweets/IRAhandle_tweets_9.csv", col_types = "ccccccciiicciic") # issues
+```
+
+Let's merge all the dataset together.  
+
+```{r}
+tweets <- bind_rows(list(tweet_1, tweet_2, tweet_3, tweet_4, tweet_5, tweet_6, tweet_7, tweet_8, tweet_9))
+```
+
+___
+
+## Data exploration  
+
+First of all, let's find the top 20 tweeters and distinguish them based on their assigned account category.  
+
+```{r}
+top_20_handles <- tweets %>% 
+    group_by(author, account_category) %>% 
+    summarise(n = n()) %>% 
+    arrange(desc(n)) 
+top_20_handles <- top_20_handles[1:20, ]
+```
+
+```{r, fig.width=4, dpi=200}
+top_20_handles %>% 
+    ggplot(aes(x = reorder(author, n), y = n)) + 
+    geom_col(aes(fill = account_category)) + 
+    coord_flip() + 
+    labs(x = "Authors", y = "Tweets", title = "Top 20 tweeters", fill = "Category: ") + 
+    theme(legend.position = "bottom")
+```
+
+The top tweeters are commercial, (fake) news feeds and Right trolls.  
+
+However, even though these are the most active account categories, we can see that the first two groups represent only a small part of the total IRA handles.  
+
+```{r}
+categ_count <- tweets %>% 
+    select(author, account_category) %>% 
+    distinct() %>% 
+    group_by(account_category) %>% 
+    summarise(n = n())
+```
+
+```{r, fig.width=4, dpi=200}
+categ_count %>% 
+    ggplot(aes(x = reorder(account_category, n), y = n)) + 
+    geom_col(aes(fill = account_category)) + 
+    coord_flip() + 
+    labs(x = "Category", y = "Accounts", title = "Number of handles per category") + 
+    # theme(legend.position = "bottom")
+    guides(fill = FALSE)
+```
+
+We might want to know the tweeting frequency of these top 20 accounts, to check if there is some tweeting trend.  
+
+```{r}
+top_tweets <- tweets %>% 
+    filter(author %in% top_20_handles$author) %>% 
+    separate(col = publish_date, into = c("pub_date", "pub_time"), sep = " ") %>% 
+    mutate(pub_date = as_date(pub_date, tz = "UTC", format = "%d/%m/%Y"))
+```
+
+```{r}
+top_tweets_grouped <- top_tweets %>% 
+    group_by(pub_date, account_category) %>% 
+    summarise(n = n()) %>% 
+    filter(!is.na(pub_date))
+```
+
+```{r, fig.width=4, dpi=200}
+top_tweets_grouped %>% 
+    ggplot(aes(x = pub_date, y = n)) + 
+    geom_line(aes(color = account_category)) + 
+    labs(x = "Date", y = "Tweets", color = "", title = "Tweeting activity", subtitle = "Top 20 handles")
+```
+
+From this plot we can see that, among the top 20 tweeters, the commercial ones posted a number of tweets 4 times greater than the other categories, but suddenly stopped tweeting right before 2016.  
+Let's check if this is true taking into account all handles, not just the top 20.  
+
+```{r}
+tweets_dates <- tweets %>% 
+    separate(col = publish_date, into = c("pub_date", "pub_time"), sep = " ") %>% 
+    mutate(pub_date = as_date(pub_date, tz = "UTC", format = "%d/%m/%Y"))
+```
+
+```{r}
+tweets_dates_grouped <- tweets_dates %>% 
+    filter(pub_date >= "2015/01/01", pub_date <= "2018/01/01") %>% 
+    group_by(pub_date, account_category) %>% 
+    summarise(n = n()) %>% 
+    filter(!is.na(pub_date))
+```
+
+```{r, fig.width=4, dpi=200}
+tweets_dates_grouped %>% 
+    ggplot(aes(x = pub_date, y = n)) + 
+    geom_line(aes(color = account_category)) + 
+    labs(x = "Date", y = "Tweets", title = "Tweeting activity", subtitle = "All handles") + 
+    facet_wrap(~ account_category, nrow = 4) + 
+    guides(color = FALSE)
+```
+
+What we found previously might hold true concerning commercial tweeters: their number of tweets suddenly drops since 2016. However, we can also discover some more interesting insights:  
+* fearmonger accounts (those spreading fake crisis news) seem to disappear right before 2017 (luckily, I would add);  
+* Left trolls had a peak in their activity in mid 2016: I'm no expert in American politics, but this seem to overlap with the Democrats presidential primaries;  
+* Right trolls, although having a fairly constant tweeting rate initially, seem to show some sort of exponential increase starting from the first months of 2016;  
+* all the other categories do not show any particular trend.  
+In addition, almost all categories have an almost identical monthly trend in tweets.  
+
+
+Let's see how the number of followers and followed accounts changed for each of these categories over time.  
+
+```{r}
+tweets_foll_grouped <- tweets_dates %>% 
+    filter(pub_date >= "2015/01/01", pub_date <= "2018/01/01") %>% 
+    group_by(pub_date, account_category) %>% 
+    summarise(followers = sum(followers), following = sum(following)) %>% 
+    filter(!is.na(pub_date))
+```
+
+```{r, fig.width=4, dpi=200}
+tweets_foll_grouped %>% 
+    ggplot(aes(x = pub_date)) + 
+    geom_line(aes(y = log(followers), color = "red")) + 
+    geom_line(aes(y = log(following), color = "blue")) + 
+    labs(x = "Date", y = "Accounts (log)", title = "Followers and Followed accounts") + 
+    facet_wrap(~ account_category, nrow = 4) + 
+    scale_colour_manual(name = "", values = c("red" = "red", "blue" = "blue"), labels = c("Followed Accounts", "Followers")) + 
+    theme(legend.position = "bottom")
+```
+
+The log transformation allows to better appreciate fluctuations in these numbers. The number of followers and followed accounts is mostly equal within each category; however, particular trends can be found characterizing 2015, 2016 and 2017, with 2016 being the year with less variation in these numbers, among all categories.  
+
+___
+
+### Disclaimer  
+
+These are just some basic insights, created for simple data exploration and visualization purposes. Furthermore, some of the results shown come from filtering the data to some extent.  
+No conclusions should be drawn from what is reported here; more appropriate analysis are being conducted by more qualified people than me, like [Darren Linvill](https://www.clemson.edu/cbshs/faculty-staff/profiles/darrenl) and [Patrick Warren](http://pwarren.people.clemson.edu/) and you should get in touch with them if interested in these data.  
+
+All in all, this was an interesting dataset to work with and I'm keen to come back to it in the future, to explore it more deeply with some text and sentiment analysis techniques.  
+
+___
+
+```{r}
+sessionInfo()
+```
+