Twitter Scraper

Finding the Scraper

Twitter data is essential to the OGBV detection project. Many GitHub repositories for Twitter scrapers were explored, and the few that looked workable were tried.

https://github.com/bisguzar/twitter-scraper
- not working; it is an old project and would need refactoring
Official Twitter API
- limits scrolling while browsing the user timeline, which means that with .Profile or .Favorites only ~3200 tweets can be fetched
- can only fetch tweets with hashtags, keywords and replies from the past 7 days
https://github.com/taspinar/twitterscraper
- similar issue: it is an old project and does not work with the updated Twitter API
https://github.com/Altimis/Scweet
- limited functionality
  - fetches only a few data fields
- requires the Chrome browser driver
- uses XPath to scrape the data
Omni-Sci tweet map
- only fetches tweets since March 2021
- less responsive
https://github.com/twintproject/twint
- robust API
- no limit on the number of API requests
  - but an interval (~30 mins) should be kept between sessions of requests
    - some workarounds were used to bypass the interval
- can be used for mass scraping of handles, hashtags, keywords and replies

Using Twint

Features

Twint can be used as a command-line tool as well as from a Python script, using Config.

The scraper is currently implemented in a script using Config, for more flexibility and readability.
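
A minimal sketch of the Config-in-a-script approach, under stated assumptions: the search term and output file are hypothetical, and the project's actual script may set different options. `Search`, `Store_json`, `Output` and `Hide_output` are attributes of `twint.Config`. The `twint` import is deferred into the function that needs it, so the options can be inspected without the package installed.

```python
def keyword_options(term, output_path):
    """Return twint Config attribute names mapped to values for a keyword search."""
    return {
        "Search": term,         # free-text query passed to Twitter search
        "Store_json": True,     # write each tweet as a JSON line
        "Output": output_path,  # file the JSON lines are appended to
        "Hide_output": True,    # do not echo tweets to the terminal
    }


def run_search(options):
    """Copy the options onto a twint.Config object and run the search."""
    import twint  # deferred import; needs twint installed

    config = twint.Config()
    for name, value in options.items():
        setattr(config, name, value)
    twint.run.Search(config)
```

Calling `run_search(keyword_options("example keyword", "tweets_keyword.json"))` would start a scrape; keeping the options in a plain dictionary makes it easy to log or review them before a run.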

Tweaks made on Twint

A few changes were made to twint to make it work (found through the project's issues tab).

Although twint is available on pip, the authors suggest installing it from the GitHub source.
For the since and until functionality to work:
- modified line 93 of url.py in the twint folder
  - https://github.com/twintproject/twint/issues/1266

Data Collection Strategy

Tweets were scraped using different search types:
- hashtags, keywords, handles, replies
Data was scraped for the years 2018, 2019, 2020 and 2021.
The since and until functionality did not work initially in the latest version of Twint; after the tweaks mentioned on the issues page (see the section above), until works fine.
Tweets for certain keywords were not collected, because those keywords pull in many irrelevant tweets.
In a few instances, a keyword also matched tweets from accounts whose username is the same as the keyword.
Only replies to the trolled accounts were scraped.
Initially only a few important data fields were chosen for scraping, but it was later decided to scrape all the fields fetched by twint (except the cashtags field).
New fields were also created from the scraped metadata:
- content_type - identifies whether the tweet contains text, image, video or gif; derived from the thumbnail field
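
The source does not say how content_type is derived from the thumbnail field. The sketch below is one possible rule, assuming that an empty thumbnail means a text-only tweet and that the path of the thumbnail URL distinguishes media kinds (`tweet_video_thumb` for gifs, `ext_tw_video_thumb` or `amplify_video_thumb` for videos, anything else an image). These URL patterns are an assumption about Twitter's media hosts, not something documented by twint.

```python
def content_type(tweet):
    """Classify a scraped tweet record as text, image, video or gif."""
    thumbnail = tweet.get("thumbnail") or ""
    if not thumbnail:
        return "text"                      # no media attached
    if "tweet_video_thumb" in thumbnail:   # checked first: it also contains "video_thumb"
        return "gif"
    if "video_thumb" in thumbnail:         # ext_tw_video_thumb, amplify_video_thumb
        return "video"
    return "image"
```
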
Additional fields were added while uploading the metadata to MongoDB:
- type - whether the tweet was scraped by hashtag, handle, keyword or reply
- search - the search term used while scraping
- timestampofscraping - the time at which the scraping was done
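
The three upload-time fields above could be attached to each record along these lines. This is a sketch, not the project's uploader: the function name is hypothetical, and the timestamp format (ISO 8601 in UTC) is an assumption, since the source does not specify one.

```python
from datetime import datetime, timezone

SEARCH_TYPES = {"hashtag", "handle", "keyword", "reply"}


def annotate(tweet, search_type, search_term, scraped_at=None):
    """Return a copy of the tweet record with the upload-time fields added."""
    if search_type not in SEARCH_TYPES:
        raise ValueError(f"unknown search type: {search_type!r}")
    if scraped_at is None:
        scraped_at = datetime.now(timezone.utc)
    record = dict(tweet)  # shallow copy, so the scraped record is left unchanged
    record["type"] = search_type
    record["search"] = search_term
    record["timestampofscraping"] = scraped_at.isoformat()
    return record
```

The annotated records could then be handed to a MongoDB client, for example pymongo's `insert_many`; the database and collection names are not given in the source.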

How the scraper collects the data

One of the search types is given as input to the script.
A year is also given as input; the scraper fetches data for that year.
The scraper iterates over every date in the given year (i.e. from Jan 1 to Dec 31), assigning each date to the until parameter.
It fetches the tweets until that date, then moves on to the next date, and so on.
It stores the tweet metadata in a JSON file, which is used for uploading to MongoDB.
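
The loop described above can be sketched as follows. Each date of the year is assigned to until, as in the description; pairing it with a since value one day earlier is an assumption made here to bound each request, since the source does not say how since is set. `Since`, `Until`, `Search`, `Store_json`, `Output` and `Hide_output` are `twint.Config` attributes; the function names are hypothetical.

```python
from datetime import date, timedelta


def daily_windows(year):
    """Yield (since, until) date strings, with until covering every day of the year."""
    until = date(year, 1, 1)
    while until.year == year:
        since = until - timedelta(days=1)  # assumed one-day window
        yield since.isoformat(), until.isoformat()
        until += timedelta(days=1)


def scrape_year(term, year, output_path):
    """Run one twint search per day of the year, appending to a JSON file."""
    import twint  # deferred import; needs twint installed

    for since, until in daily_windows(year):
        config = twint.Config()
        config.Search = term
        config.Since = since
        config.Until = until
        config.Store_json = True
        config.Output = output_path
        config.Hide_output = True
        twint.run.Search(config)
```

Splitting the year into small windows keeps each request short, and a failed day can be retried without repeating the whole year.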

Handles

Many of the handles were created in the last 2 years ('20, '21).
Given a username, the scraper scrapes the tweets made by that user.

Hashtags

English has more hashtags than Hindi or Tamil.
Hashtags are fewer in number than the other search types.

Keywords

Keywords are more numerous than the other search types.
They are scraped in the same manner as hashtags.

Replies

A couple of ways were found to scrape replies to tweets:

using twarc
- given a conversation id, it tries to fetch the replies in that conversation, but only if the conversation is from the past 7 days
using tweepy
- given a username, it fetches replies to that user, but again only if the conversation is from the past 7 days
- all the required data fields can be fetched
using twint
- it was found later that there is a parameter in Twint Config that helps find the replies to a given user
- given a user handle, it fetches the tweets/replies made to that handle
- along with replies, the scraper also scrapes mentions and quotes of the user handle
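
The source does not name the Config parameter used for replies. The sketch below assumes it is `To`, which twint describes as returning tweets sent to the specified user; the helper names and the output path are hypothetical.

```python
def normalize_handle(handle):
    """Strip whitespace and a leading '@' so the handle can be passed to twint."""
    return handle.strip().lstrip("@")


def scrape_replies(handle, output_path):
    """Fetch tweets addressed to the given handle and store them as JSON lines."""
    import twint  # deferred import; needs twint installed

    config = twint.Config()
    config.To = normalize_handle(handle)  # assumption: the parameter meant in the text
    config.Store_json = True
    config.Output = output_path
    config.Hide_output = True
    twint.run.Search(config)
```

Because the results mix replies with mentions and quotes, as noted above, a later filtering step would be needed if only direct replies are wanted.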
