twittercrawler
is a simple Python crawler built on top of the popular Twython package. Its main objective is to provide an API that eases Twitter data collection for events that span multiple days. The key features of this package are as follows:
- collect tweets over several days (online or offline)
- respect Twitter API rate limits during search
- search for people
- collect friend or follower network
- easily export search results to multiple output channels (File, Socket, Kafka queues)
git clone https://github.com/ferencberes/twitter-crawler.git
cd twitter-crawler
python setup.py install
NOTE: If you want to push the collected data to Kafka queues then you need to execute a few additional steps.
You must provide your Twitter API credentials to collect data from Twitter. First, generate your Twitter API keys on the Twitter developer portal. Then, choose from the available options to configure your crawler.
- Set the following environment variables:
export API_KEY="YOUR_API_KEY";
export API_SECRET="YOUR_API_SECRET";
export ACCESS_TOKEN="YOUR_ACCESS_TOKEN";
export ACCESS_TOKEN_SECRET="YOUR_ACCESS_TOKEN_SECRET";
- Authenticate your crawler:
from twittercrawler.crawlers import StreamCrawler
crawler = StreamCrawler()
crawler.authenticate()
...
- Create a JSON file (e.g. "api_key.json") in the root folder with the following content:
{
    "api_key": "YOUR_API_KEY",
    "api_secret": "YOUR_API_SECRET",
    "access_token": "YOUR_ACCESS_TOKEN",
    "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET"
}
- Authenticate your crawler:
from twittercrawler.crawlers import StreamCrawler
crawler = StreamCrawler()
crawler.authenticate("PATH_TO_API_KEY_JSON")
...
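If the credentials are already exported as environment variables, the JSON file can also be generated from them instead of being written by hand. The following is a minimal sketch using only the standard library; `write_api_key_json` and `ENV_TO_JSON_KEY` are hypothetical names introduced for illustration and are not part of twittercrawler.

```python
import json
import os

# Environment variable names mapped to the JSON keys used in "api_key.json".
ENV_TO_JSON_KEY = {
    "API_KEY": "api_key",
    "API_SECRET": "api_secret",
    "ACCESS_TOKEN": "access_token",
    "ACCESS_TOKEN_SECRET": "access_token_secret",
}


def write_api_key_json(path, env=None):
    """Hypothetical helper: dump the four credentials into a JSON file.

    Raises KeyError if any of the four variables is missing or empty,
    so a misconfigured environment is caught before crawling starts.
    """
    env = os.environ if env is None else env
    missing = [name for name in ENV_TO_JSON_KEY if not env.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    config = {json_key: env[name] for name, json_key in ENV_TO_JSON_KEY.items()}
    with open(path, "w") as f:
        json.dump(config, f, indent=4)
    return config
```

Calling `write_api_key_json("api_key.json")` after exporting the four variables should produce a file with the layout shown above, which can then be passed to `authenticate`. Keep the generated file out of version control, since it contains secrets.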
- Initialize and authenticate the crawler:
from twittercrawler.crawlers import StreamCrawler
stream = StreamCrawler()
stream.authenticate("PATH_TO_API_KEY_JSON")
- Connect a FileWriter that will export the collected tweets:
from twittercrawler.data_io import FileWriter
stream.connect_output([FileWriter("stream_results.txt")])
- Set search parameters:
search_params = {
    "q": "#bitcoin OR #ethereum OR blockchain",
    "result_type": "recent",
    "lang": "en",
    "count": 100
}
stream.set_search_arguments(search_args=search_params)
- Initialize a termination function that will collect tweets from the last 5 minutes:
from twittercrawler.search import get_time_termination
import datetime
# the format string hard-codes a "+0000" offset, so compute the cutoff in UTC
now = datetime.datetime.now(datetime.timezone.utc)
time_str = (now-datetime.timedelta(seconds=300)).strftime("%a %b %d %H:%M:%S +0000 %Y")
time_terminator = get_time_termination(time_str)
- Run search:
- First, tweets from the last 5 minutes are collected
- Then, new tweets are collected every 15 seconds
try:
    stream.search(15, time_terminator)
finally:
    # always close the crawler, even if the search is interrupted or fails
    stream.close()
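The time string passed to `get_time_termination` follows the same layout as the `created_at` field of tweets returned by the Twitter API (for example `Wed Oct 10 20:19:24 +0000 2018`). Because the `+0000` offset is written literally in the format string, the timestamp itself has to be expressed in UTC. The sketch below shows one way to build and parse such strings with the standard library; `utc_time_str` and `parse_time_str` are hypothetical helper names, not functions of twittercrawler.

```python
import datetime

# Layout of Twitter "created_at" timestamps; the offset is a literal "+0000".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"


def utc_time_str(seconds_back, now=None):
    """Hypothetical helper: format the moment `seconds_back` seconds ago in UTC."""
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    # Convert to UTC so the literal "+0000" in the format is truthful.
    # (A naive datetime is interpreted as local time by astimezone.)
    now = now.astimezone(datetime.timezone.utc)
    cutoff = now - datetime.timedelta(seconds=seconds_back)
    return cutoff.strftime(TWITTER_TIME_FORMAT)


def parse_time_str(time_str):
    """Hypothetical helper: parse such a string back into an aware UTC datetime."""
    naive = datetime.datetime.strptime(time_str, TWITTER_TIME_FORMAT)
    return naive.replace(tzinfo=datetime.timezone.utc)
```

With these helpers, `utc_time_str(300)` yields a cutoff five minutes in the past that does not depend on the time zone of the machine running the crawler. Day and month names are produced according to the active locale, so this assumes the default English (C) locale.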
With a few modifications (e.g. socket programming) the collected Twitter data can be transformed into a graph stream.
- Load the collected data into a pandas DataFrame:
from twittercrawler.data_io import FileReader
results_df = FileReader("stream_results.txt").read()
print(results_df.head())
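For a quick sanity check without pandas, the output file can also be inspected directly. The sketch below assumes that the file holds one JSON-encoded tweet per line; this is an assumption about the file layout and should be verified against your own output before relying on it. `count_by_field` is a hypothetical helper, not part of twittercrawler.

```python
import json
from collections import Counter


def count_by_field(path, field="lang"):
    """Hypothetical helper: count records per value of a top-level field.

    Assumes one JSON object per line; blank lines are skipped.
    Records that lack the field are counted under the key None.
    """
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            counts[record.get(field)] += 1
    return counts
```

For example, `count_by_field("stream_results.txt")` would return the number of collected tweets per language code, which is a cheap way to confirm that the `lang` search parameter took effect.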
This package contains crawlers for various Twitter data collection tasks. Before executing the provided sample scripts, make sure to prepare your Twitter API keys.
Before executing the provided tests make sure to prepare your Twitter API keys.
python setup.py test