
Twitter Scraper

Rahul Dev edited this page Jan 7, 2022 · 3 revisions

Finding the Scraper

Twitter data is essential for the OGBV detection project. We explored many GitHub repositories for Twitter scrapers and tried the few that looked like they might work.

  • https://github.com/bisguzar/twitter-scraper
    • not working; it is an old project and would need refactoring
  • Official Twitter API
    • limits scrolling while browsing the user timeline, which means that with .Profile or with .Favorites you will be able to get only ~3200 tweets.

    • can only fetch tweets with hashtags, keywords and replies from the past 7 days.

  • https://github.com/taspinar/twitterscraper
    • similar issue: it is an old project and doesn’t work with the updated Twitter API
  • https://github.com/Altimis/Scweet
    • limited functionality
      • fetches only a few data fields
    • requires the Chrome browser driver
    • uses XPath to scrape the data
  • Omni-Sci tweet map
    • only fetches tweets since March 2021
    • less responsive
  • https://github.com/twintproject/twint
    • robust API
    • no limit on the number of API requests
      • but an interval (~30 mins) should be kept between sessions of requests
        • did some hacks to bypass the intervals
    • can be used for mass scraping of handles, hashtags, keywords & replies

Using Twint

Features

  • can be used as a command-line tool
  • as well as from a Python script, using Config

Currently, the scraper is implemented using Config in a script, for more flexibility and readability.
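
A minimal sketch of how Config might be driven from a script. The Config attributes used here (Search, Username, To, Since, Until, Store_json, Output, Hide_output) are Twint's own. The mapping of this page's search types onto them, the choice of To for replies, and the helper names build_settings and run_search are assumptions made for illustration, not the project's actual code.

```python
from typing import Dict

# Which twint.Config attribute carries the search term for each search type.
# The type names are this page's labels; the mapping itself is an assumption.
ATTRIBUTE_FOR_TYPE: Dict[str, str] = {
    "hashtag": "Search",
    "keyword": "Search",
    "handle": "Username",
    "reply": "To",
}


def build_settings(search_type: str, term: str, since: str, until: str,
                   output_path: str) -> Dict[str, object]:
    """Return Config attribute names and values for one scraping run."""
    if search_type not in ATTRIBUTE_FOR_TYPE:
        raise ValueError(f"unknown search type: {search_type!r}")
    if search_type == "hashtag" and not term.startswith("#"):
        term = "#" + term
    if search_type in ("handle", "reply"):
        term = term.lstrip("@")
    return {
        ATTRIBUTE_FOR_TYPE[search_type]: term,
        "Since": since,          # "YYYY-MM-DD"
        "Until": until,          # "YYYY-MM-DD"
        "Store_json": True,      # write results as JSON
        "Output": output_path,
        "Hide_output": True,     # do not echo every tweet to the terminal
    }


def run_search(settings: Dict[str, object]) -> None:
    """Copy the settings onto a twint.Config and run the search."""
    import twint  # imported here so build_settings works without twint installed

    config = twint.Config()
    for name, value in settings.items():
        setattr(config, name, value)
    twint.run.Search(config)
```

A run would then look like run_search(build_settings("keyword", "some keyword", "2020-01-01", "2020-01-02", "tweets.json")). Keeping the settings in a plain dictionary makes them easy to log or inspect before any request is sent.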

Tweaks made on Twint

A few changes were made to Twint to make it work (found via its issues tab).


Data Collection Strategy

  • Scraped tweets using different search types

    • Hashtags, Keywords, Handles, Replies
  • Data was scraped for the years 2018, 2019, 2020 and 2021.

  • The since and until options in Twint initially did not work in the latest version; applying a few tweaks mentioned on their issues page (see the section above) made until work correctly.

  • Tweets for certain keywords were not collected, because those keywords pull in many irrelevant tweets.

  • Found a few instances where a keyword also scrapes tweets from accounts whose username is the same as the keyword.

  • Only replies made to the trolled accounts are scraped.

  • Initially only a few important data fields were chosen for scraping, but it was later decided to scrape all the fields fetched by Twint (except the cashtags field).

  • New fields were also created from the scraped metadata

    • content_type - identifies whether the tweet contains text, image, video or gif; derived from the thumbnail field
  • Additional fields were added while uploading the metadata to MongoDB

    • type - whether the tweet was scraped using a hashtag, handle, keyword or reply
    • search - search term used while scraping
    • timestampofscraping - timestamp at which the scraping was done
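
The derived and added fields above could be attached to each scraped tweet roughly as follows. This is a sketch under stated assumptions: the exact rule the project uses to derive content_type from thumbnail is not given here, so the URL patterns checked below (tweet_video_thumb for gifs, other *video_thumb paths for videos, anything else for images) are a guess based on how Twitter thumbnail URLs commonly look. The function names detect_content_type and enrich are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def detect_content_type(tweet: Dict[str, Any]) -> str:
    """Guess the media type from the thumbnail URL (assumed URL patterns)."""
    thumbnail = tweet.get("thumbnail") or ""
    if not thumbnail:
        return "text"
    # Check the gif pattern first: it also contains "video_thumb".
    if "tweet_video_thumb" in thumbnail:
        return "gif"
    if "video_thumb" in thumbnail:
        return "video"
    return "image"


def enrich(tweet: Dict[str, Any], search_type: str, search_term: str,
           scraped_at: Optional[datetime] = None) -> Dict[str, Any]:
    """Return a copy of the tweet with the extra fields, minus cashtags."""
    if scraped_at is None:
        scraped_at = datetime.now(timezone.utc)
    record = {key: value for key, value in tweet.items() if key != "cashtags"}
    record["content_type"] = detect_content_type(tweet)
    record["type"] = search_type            # hashtag, handle, keyword or reply
    record["search"] = search_term          # term used while scraping
    record["timestampofscraping"] = scraped_at.isoformat()
    return record
```

Returning a new dictionary rather than mutating the input keeps the raw scraped record available for comparison if the derivation rule needs to change later.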

How the scraper collects the data

  • One of the search types is given as input to the script.
  • A year is also given as input; the scraper fetches data for that year.
  • The scraper iterates over all the dates in the given year (i.e. from Jan 1 to Dec 31), assigning each date to the until parameter.
  • For each date, the scraper fetches the tweets until that date, then continues to the next date, and so on.
  • It stores the tweet metadata in a JSON file, which is used for uploading to MongoDB.
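
The per-date loop described above can be sketched as below. The fetch callable is a hypothetical stand-in for the actual Twint call, and writing one JSON object per line is an assumption about the file layout. Whether a since bound is also set for each date is not stated above, so that detail is left to fetch.

```python
import json
from datetime import date, timedelta
from typing import Any, Callable, Dict, Iterable, Iterator, TextIO


def until_dates(year: int) -> Iterator[str]:
    """Yield every date of the year, Jan 1 to Dec 31, as YYYY-MM-DD."""
    current = date(year, 1, 1)
    while current.year == year:
        yield current.isoformat()
        current += timedelta(days=1)


def scrape_year(year: int,
                fetch: Callable[[str], Iterable[Dict[str, Any]]],
                out: TextIO) -> int:
    """Call fetch(until) for each date and write the tweets as JSON lines.

    Returns the number of tweets written.
    """
    written = 0
    for until in until_dates(year):
        for tweet in fetch(until):
            out.write(json.dumps(tweet, ensure_ascii=False) + "\n")
            written += 1
    return written
```

With a real fetch function this might be used as: with open("tweets_2020.json", "a", encoding="utf-8") as out: scrape_year(2020, fetch, out). Appending to the file means an interrupted run keeps what it has already collected.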

Handles

  • Many handles were created in the last 2 years ('20, '21)
  • Given a username, the scraper scrapes the tweets made by that user.

Hashtags

  • English has more hashtags than Hindi or Tamil.
  • Fewer in number compared to other search types

Keywords

  • More in number compared to other search types
  • scraped in a similar manner to hashtags

Replies

Found a couple of ways to scrape replies to tweets

  • using twarc

    • given a conversation id, it tries to fetch the replies in that conversation, BUT only succeeds if the conversation is from the past 7 days.
  • using tweepy

    • given a username, it fetches replies to that user; AGAIN, only if the conversation is from the past 7 days.
    • ALL the required data fields can be fetched
  • using twint

    • Found later that there is a parameter in the Twint Config that helps find the replies to a given user.
    • Given a user handle, it fetches the tweets/replies that were made to that user handle
    • Along with replies, the scraper also scrapes mentions and quotes of the user handle.
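
Since the Twint results mix replies with mentions and quotes, a post-processing step could separate them. The sketch below is an assumption, not something the project is stated to do: it relies on the id, conversation_id and quote_url fields of Twint's output and on the heuristic that a reply has a conversation_id different from its own id. The function name interaction_kind is hypothetical.

```python
from typing import Any, Dict


def interaction_kind(tweet: Dict[str, Any]) -> str:
    """Classify a scraped tweet as 'reply', 'quote' or 'mention' (heuristic)."""
    # Compare as strings: the two ids may be stored with different types.
    tweet_id = str(tweet.get("id", ""))
    conversation_id = str(tweet.get("conversation_id", ""))
    # A reply belongs to a conversation started by another tweet.
    if conversation_id and tweet_id != conversation_id:
        return "reply"
    # A quote tweet starts its own conversation but links the quoted tweet.
    if tweet.get("quote_url"):
        return "quote"
    return "mention"
```

A reply that also quotes another tweet is counted as a reply here, because the reply check runs first.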