A scraper that retrieves the conversations in tweets for a particular Twitter user.
“Twitter Scraper” (or simply “scraper”) is available under the MIT License. It runs in two steps: Step 1 and Step 2.
The two steps must be run one after the other, in order. The behaviour is controlled by a properties file named application.properties.
application.properties is the configuration file, and it specifies many important properties. For all available options, see the file itself.
Property | Type | Details |
---|---|---|
target.username | String, required | The Twitter handle to scrape conversations for. |
target.step | int, required | The step to run. Possible values are 1 and 2. |
concurrent-threads | int, required | Number of threads used to fetch conversations. Must be greater than 0. Takes effect only when step 2 is running. |
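As an illustration of the table above, the following shell snippet writes a minimal application.properties with the three properties set. The values (example_handle, step 1, 4 threads) are placeholders. Writing the file to the current directory is also an assumption, so place it wherever the scraper actually reads its configuration from.

```shell
# Writes a sample configuration to ./application.properties.
# This overwrites any file of that name in the current directory.
# All values below are placeholders, not defaults shipped with the scraper.
cat > application.properties <<'EOF'
# Twitter handle to scrape conversations for (placeholder)
target.username=example_handle
# Step to run: 1 or 2
target.step=1
# Threads used to fetch conversations in step 2; must be greater than 0
concurrent-threads=4
EOF
```

After Step 1 has finished, change target.step to 2 and run the scraper again.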
Step 1 fetches the tweet IDs of the specified user (target.username). More details will be added later.
Step 2 fetches the conversations for the tweet IDs fetched in Step 1. More details will be added later.
The following software is needed to run the built JAR.
- Oracle JRE 1.8
To build from source, the following software must be installed.
- git client
- maven 3
- Oracle JDK 1.8
To install these on an Ubuntu-based EC2 instance, run the following commands in order.
sudo apt-get update
sudo apt-get install git
sudo apt-get install maven
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo nano /etc/environment
Then append the following line at the bottom of the opened file and save it.
JAVA_HOME="/usr/lib/jvm/java-8-oracle"
Then run the following command, without sudo because source is a shell builtin, to load the variable into the current shell.
source /etc/environment
The following command should now print the path set above.
echo $JAVA_HOME
You can check the installed JDK by running the following command.
java -version
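Before building, it can help to confirm that all three tools are on the PATH. The snippet below is a small sketch for that purpose. check_tools is a hypothetical helper name, not something provided by the scraper.

```shell
# Hypothetical helper: report which of the given commands are on the PATH.
# Returns 0 if all are found and 1 if at least one is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# The three tools needed to build from source.
check_tools git mvn java || echo "Install the missing tools before building."
```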
To get the scraper onto an EC2 instance (or any other Ubuntu machine), follow the steps below. These steps need to be run only once.
- Decide on a directory to hold the “twitter-scraper” repository. Let’s call this directory base_directory. Issue the following command to change to it, replacing “base_directory” with the actual path of the chosen directory.
cd base_directory
- The following steps assume that the commands are run from base_directory. Issue the following command.
git clone https://github.com/clayfish/twitter-scraper
- You now have a clone (the source code) of the scraper on your machine. Note that this git command does not need a username or password.
Running the scraper consists of three high-level steps.
- Update the code
- Compile the code
- Run the built JAR file
To update the code with the latest changes, run the following commands.
cd base_directory/twitter-scraper
git pull
Before compiling the code, check that application.properties is configured as needed. See the property table above for more information about the configuration. To compile the code, run the following commands.
cd base_directory/twitter-scraper
mvn package
These commands create the file base_directory/twitter-scraper/bin/twitter-scraper-0.1.0.one-jar.jar.
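The update and compile steps above can be chained so that a failed git pull or a failed build stops the sequence instead of leaving a stale JAR in use. This is a minimal sketch, not part of the repository. update_and_build is a hypothetical helper name, and its argument stands for base_directory.

```shell
# Hypothetical helper: update the clone, build it, and confirm the JAR exists.
# The body runs in a subshell (parentheses), so the caller's working
# directory is left unchanged.
update_and_build() (
  cd "$1/twitter-scraper" || exit 1
  git pull || exit 1
  mvn package || exit 1
  # JAR name as given above; adjust it if the version number changes.
  [ -f bin/twitter-scraper-0.1.0.one-jar.jar ]
)

# Usage (replace the path with your actual base_directory):
# update_and_build /path/to/base_directory && echo "build OK"
```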
To run the built JAR file, run the following commands.
cd base_directory
java -jar twitter-scraper/bin/twitter-scraper-0.1.0.one-jar.jar
This starts the scraper. To stop it, press control + c (Mac) or ctrl + c (Windows/Linux). If you want to leave the terminal while the scraper keeps running, use the following commands instead of the ones above.
cd base_directory
java -jar twitter-scraper/bin/twitter-scraper-0.1.0.one-jar.jar &
Currently all logs are directed to the terminal, so running the scraper in the background does not spare you from the constant log output, but you can close the terminal without stopping the scraper.
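Depending on the shell and its settings, a job started with & can still receive a hangup signal when the terminal closes. One common way to guard against that, and to keep the logs out of the terminal, is nohup with output redirected to a file. The sketch below is an assumption about how you might want to run the scraper, not something it provides. run_detached and scraper.log are hypothetical names.

```shell
# Hypothetical helper: start a command that ignores hangups, send its
# stdout and stderr to a log file, and print the process ID.
run_detached() {
  log_file="$1"
  shift
  nohup "$@" >"$log_file" 2>&1 &
  echo "$!"
}

# Usage, from base_directory (left commented out so that pasting this
# block does not start the scraper):
# pid=$(run_detached scraper.log java -jar twitter-scraper/bin/twitter-scraper-0.1.0.one-jar.jar)
# tail -f scraper.log   # follow the logs
# kill "$pid"           # stop the scraper
```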