Twitter Scraper
Twitter data is essential for the OGBV detection project. We explored many GitHub repositories for Twitter scrapers and tried a few that looked like they might work.
- https://github.com/bisguzar/twitter-scraper
  - not working; it is an old project and would require refactoring
- Official Twitter API
  - limits scrolls while browsing the user timeline, which means that with `.Profile` or with `.Favorites` you will be able to get ~3200 tweets.
  - can only fetch tweets with hashtags, keywords and replies from the past 7 days.
- https://github.com/taspinar/twitterscraper
  - similar issue: it is an old project and doesn't work with the updated Twitter API
- https://github.com/Altimis/Scweet
  - limited functionality
  - fetches quite few data fields
  - requires the Chrome browser driver
  - uses XPath for scraping the data
- Omni-Sci tweet map
  - only fetches tweets since March 2021
  - less responsive
- https://github.com/twintproject/twint
  - robust API
  - no limits on the number of API requests
    - but an interval (~30 mins) should be maintained between sessions of requests
    - did some hacks to bypass the intervals
  - can be used for mass scraping of handles, hashtags, keywords and replies
  - can be used as a command-line tool
  - as well as in a Python script, using `Config`
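As an illustration of the `Config` route, here is a minimal sketch, assuming twint is installed from the GitHub source. The helper names `build_twint_options` and `run_search`, and the mapping of search types to attributes, are hypothetical and not taken from the project's script; `Search`, `Username`, `To`, `Since`, `Until`, `Store_json`, `Output` and `Hide_output` are twint `Config` attributes.

```python
def build_twint_options(search_type, term, since, until, output_path):
    """Map one search request to twint Config attribute names (hypothetical helper)."""
    options = {
        "Since": since,          # e.g. "2020-01-01"
        "Until": until,          # e.g. "2020-01-02"
        "Store_json": True,      # write results as JSON lines
        "Output": output_path,
        "Hide_output": True,     # do not print every tweet to the console
    }
    if search_type == "hashtag":
        options["Search"] = term if term.startswith("#") else "#" + term
    elif search_type == "keyword":
        options["Search"] = term
    elif search_type == "handle":
        # Tweets made by the given user.
        options["Username"] = term.lstrip("@")
    elif search_type == "reply":
        # Tweets addressed to the given user (assumed mapping for replies).
        options["To"] = term.lstrip("@")
    else:
        raise ValueError(f"unknown search type: {search_type!r}")
    return options


def run_search(options):
    """Copy the options onto a twint Config object and run the search."""
    import twint  # imported lazily so the helper above works without twint installed

    config = twint.Config()
    for name, value in options.items():
        setattr(config, name, value)
    twint.run.Search(config)
```

A call such as `run_search(build_twint_options("keyword", "example", "2020-01-01", "2020-01-02", "tweets.json"))` would then perform one search; whether it returns data depends on the twint version and on the tweaks described below.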
Currently, the scraper is implemented using Config in the script, for more flexibility and readability.
A few changes were made to twint to make it work (found through their issues tab):
- Though twint is available on *pip*, the authors suggest installing it from the GitHub source.
- For the since and until functionality to work:
  - Modified line 93 in url.py in the twint folder
- Scraped tweets using different search types
  - Hashtags, Keywords, Handles, Replies
- Data was scraped for the years 2018, 2019, 2020 and 2021.
- The since and until functionality in twint initially did not work in the latest version; the tweaks mentioned on their issues page (see the section above) made it work.
- Tweets for certain keywords were not collected, because those keywords pull in many irrelevant tweets.
- In a few instances, a keyword search also returns tweets from accounts whose username matches the keyword.
- Replies are scraped only for the trolled accounts.
- Initially only a few important data fields were chosen for scraping, but it was later decided to scrape all the fields fetched by twint (except the cashtags field).
- New fields were also created using the scraped metadata
  - content_type - helps to identify whether the tweet contains text, image, video or gif; found using the thumbnail field
- Additional fields were added while uploading the metadata to Mongo
  - type - whether the tweet was scraped using hashtag, handle, keyword or reply
  - search - search term used while scraping
  - timestampofscraping - timestamp at which the scraping was done
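The derived and additional fields listed above could be attached to each record roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the thumbnail URL patterns used to tell gif, video and image apart are a guess at how the thumbnail field can be read, and the timestamp format is arbitrary. Only the field names `content_type`, `type`, `search`, `timestampofscraping`, `thumbnail` and `cashtags` come from the notes.

```python
from datetime import datetime, timezone


def infer_content_type(tweet):
    """Guess the media type from the thumbnail URL (assumed URL patterns)."""
    thumbnail = tweet.get("thumbnail") or ""
    if not thumbnail:
        return "text"
    if "tweet_video_thumb" in thumbnail:
        return "gif"
    if "video_thumb" in thumbnail:  # e.g. ext_tw_video_thumb, amplify_video_thumb
        return "video"
    return "image"


def annotate(tweet, search_type, search_term, scraped_at=None):
    """Return a copy of a scraped tweet dict with the extra fields added."""
    scraped_at = scraped_at or datetime.now(timezone.utc)
    enriched = dict(tweet)
    enriched.pop("cashtags", None)  # the notes say cashtags is not kept
    enriched["content_type"] = infer_content_type(tweet)
    enriched["type"] = search_type            # hashtag, handle, keyword or reply
    enriched["search"] = search_term          # term used while scraping
    enriched["timestampofscraping"] = scraped_at.isoformat()
    return enriched
```

The function returns a new dict rather than mutating its input, so the raw JSON written by the scraper stays unchanged and the upload step can be re-run.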
- One of the search types is given as input to the script.
- The year for which the scraper fetches data is also given as input.
- The scraper iterates over all the dates in the given year (i.e. from Jan 1 to Dec 31), assigning each date to the until parameter.
- The scraper fetches the tweets until that date, then continues to the next date, and so on.
- It stores the tweet metadata in a JSON file, which is used for uploading to MongoDB.
- Many handles were created in the last 2 years ('21, '20)
- Given a username, the scraper scrapes the tweets made by that user.
- English has more hashtags compared to Hindi and Tamil.
- Less in number compared to other search types
- More in number compared to other search types
- scraped in similar manner as hashtags
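The per-day iteration over the input year described above can be sketched with the standard library alone. The function name `daily_until_dates` is hypothetical; the notes only state that each date from Jan 1 to Dec 31 is assigned to the until parameter, so how (or whether) a matching since value is set is left out here.

```python
from datetime import date, timedelta


def daily_until_dates(year):
    """Yield every date of the given year, Jan 1 to Dec 31, as YYYY-MM-DD strings."""
    day = date(year, 1, 1)
    while day.year == year:
        yield day.isoformat()
        day += timedelta(days=1)
```

Each yielded string would be assigned to the until parameter for one run of the scraper, for example in a loop `for until in daily_until_dates(2020): ...`. Leap years are handled by `datetime.date` arithmetic, so no day is skipped or repeated.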
Found a couple of ways to scrape replies to tweets
- using twarc
  - given a conversation id, it tries to fetch the replies in the conversation, BUT only if the conversation is from the past 7 days.
- using tweepy
  - given a username, it fetches replies to that user; AGAIN, only if the conversation is from the past 7 days.
  - ALL the required data fields can be fetched
- using twint
  - Found recently that there is a parameter in twint Config that helps to find the replies to a given user.
  - Given a user handle, it fetches the tweets/replies that were made to that user handle
  - Along with replies, the scraper also scrapes mentions of and quotes of the user handle.
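Because the twint results mix replies, mentions and quotes, a post-processing step may be useful to tell them apart. The sketch below is an assumption, not part of the project's scraper: the function names are hypothetical, and it assumes the scraped records carry `reply_to`, `mentions` and `quote_url` fields, whose exact shape varies between twint versions.

```python
def _screen_name(entry):
    """Accept either a dict with a screen_name key or a plain string."""
    if isinstance(entry, dict):
        entry = entry.get("screen_name", "")
    return str(entry).lstrip("@").lower()


def classify_interaction(tweet, handle):
    """Label a scraped tweet as a quote, reply or mention of the given handle."""
    handle = handle.lstrip("@").lower()

    # A quote links to a status posted by the handle.
    quote_url = (tweet.get("quote_url") or "").lower()
    if f"twitter.com/{handle}/status/" in quote_url:
        return "quote"

    # A reply lists the handle among the accounts it replies to.
    reply_names = {_screen_name(e) for e in tweet.get("reply_to") or []}
    if handle in reply_names:
        return "reply"

    # Otherwise the handle may just be mentioned in the text.
    mention_names = {_screen_name(e) for e in tweet.get("mentions") or []}
    if handle in mention_names:
        return "mention"

    return "other"
```

The order of the checks is a design choice: a tweet that both quotes and mentions the handle is counted as a quote.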