Collect massive amounts of tweets using the Twitter streaming APIs and index them with Elasticsearch.
Deal with indexing delay using a RabbitMQ queue.
We combine the Twitter Sample and Filter streaming APIs to maximize the number of collected tweets.
- Getting started
- Order of OAuth keys
- Check if streamer is running
- Check if tweets are indexed
- Text files
- Turn off the stack
- Deploy a large number of streamers
- Paper
## Getting started

We run this infrastructure on a server with 250 GB of RAM, 16 cores, and RAID 5 storage. You may not be able to run intensive indexing jobs on a desktop computer.
- Clone this repo:

  ```bash
  git clone https://github.com/ina-foss/stream-index-massive-tweets.git
  cd stream-index-massive-tweets
  ```

- Install Docker and docker-compose. Minimum Docker version: 17.12.1.

- Create a Twitter developer account and a Twitter App, then get the app's OAuth access tokens. If you have several accounts, you can get several tokens. Copy them into the `ACCESS` list in `config.py` (see the sketch after this list).

- Set up the other parameters in `config.py`: `KEYWORDS`, the keywords you want to track, and `LANG`, the language of the tweets.

- Increase the system limit on mmap counts if needed:

  ```bash
  sudo sysctl -w vm.max_map_count=262144
  ```

- Add the current user to the docker group:

  ```bash
  sudo groupadd docker
  sudo usermod -aG docker $USER
  ```

- Build the docker infrastructure:

  ```bash
  docker-compose build
  docker swarm init
  docker network create --driver overlay net0 --attachable
  mkdir -p data/es_nodes/master data/es_nodes/slave1 data/es_nodes/slave2
  ```

- Deploy the docker stack:

  ```bash
  docker stack deploy stream-index --compose-file docker-compose.yml --with-registry-auth
  ```

  The stack may take a few minutes to be fully deployed. Check that all services are deployed using:

  ```bash
  docker service ls
  ```

  In the REPLICAS column, all lines should read 1/1.
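For reference, here is a minimal sketch of what `config.py` could contain. The per-entry shape of `ACCESS` (an OAuth 1.0a four-tuple here) and the exact value formats are assumptions; follow the template shipped in the repository.

```python
# Hypothetical config.py sketch; the entry formats are assumptions,
# not the repository's actual schema.

# One entry per Twitter app/account. The first entry feeds streamer_0
# (Sample API); the following ones feed streamer_1, streamer_2, etc.
ACCESS = [
    ("CONSUMER_KEY_0", "CONSUMER_SECRET_0", "ACCESS_TOKEN_0", "TOKEN_SECRET_0"),
    ("CONSUMER_KEY_1", "CONSUMER_SECRET_1", "ACCESS_TOKEN_1", "TOKEN_SECRET_1"),
]

# Keywords tracked by the Filter API streamers (stop-words; see the Paper section).
KEYWORDS = ["je", "tu", "il", "elle", "nous"]

# Language of the tweets to collect.
LANG = "fr"
```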
## Order of OAuth keys

The order in which you copy OAuth access keys into `config.py` is important: the first key is used by the service streamer_0, which always streams from the Sample API. If you don't want to use this API, leave the first entry of the `ACCESS` list empty and copy your access key(s) into the next position(s). You may also remove the services streamer_0 and sample_runner from the `docker-compose.yml` file, since they will not be used; the stack will not fail if you don't, however.
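For example, an `ACCESS` list that skips the Sample API might look like this (same caveat as above: the empty-entry convention shown here is an assumption):

```python
# Hypothetical: first slot left empty so streamer_0 / the Sample API is unused.
ACCESS = [
    (),
    ("CONSUMER_KEY_1", "CONSUMER_SECRET_1", "ACCESS_TOKEN_1", "TOKEN_SECRET_1"),
]
```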
## Check if streamer is running

To check if a streamer is running, you can simply curl it using its port number (5050 for the sample streamer; 5051, 5052, etc. for the next streamers):

```bash
curl localhost:5050
```

The response should look like:

```json
{"key": 0, "lang": null, "track": null}
```

for the sample streamer, and:

```json
{"key": 2, "lang": "fr", "track": ["je", "tu", "il", "elle", "nous"]}
```

for the other streamers.
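To poll every streamer at once, a small shell loop does the job (adjust the port list to however many streamers you deployed):

```bash
# Query each streamer's status endpoint in turn.
for port in 5050 5051; do
  echo "streamer on port $port:"
  curl -s "localhost:$port" && echo || echo "not responding"
done
```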
## Check if tweets are indexed

You can visualize the indexed tweets using Kibana: open localhost:5656 in your browser. If it is the first time you connect to Kibana, you are redirected to the Configure an index pattern page. Type `tweets-index*` as the index name and choose `created_at` as the time-field name. You can then open localhost:5656/app/kibana#/discover in your browser to see the collected tweets.
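Alternatively, if Elasticsearch's HTTP port is published on the host (an assumption; check the port mapping in docker-compose.yml, the default would be 9200), you can count the indexed tweets from the command line:

```bash
# Count documents matching the tweets-index* pattern; assumes Elasticsearch
# is reachable on the host at port 9200.
curl -s 'localhost:9200/tweets-index*/_count?pretty'
```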
## Text files

The tweets are both indexed in Elasticsearch and stored as json.gz files under:

```
stream-index-massive-tweets/data/tweets/$year/$month/$day/$hour/$minutes
```
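To peek at what was written, assuming each archive holds one JSON tweet per line (an assumption about the file layout):

```bash
# Show the newest archive and preview its first two tweets;
# assumes one JSON document per line inside each json.gz file.
latest=$(find data/tweets -name '*.json.gz' | sort | tail -n 1)
echo "$latest"
zcat "$latest" | head -n 2
```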
## Turn off the stack

```bash
docker stack rm stream-index
```
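If you also want to undo the swarm and overlay network created during setup (optional; skip this if anything else uses them):

```bash
# Optional cleanup of the one-time setup steps.
docker network rm net0
docker swarm leave --force
```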
## Deploy a large number of streamers

The current configuration provides two streaming services, streamer_0 and streamer_1, and therefore needs two different Twitter Developer access tokens in the `config.py` file. If you have more access tokens, add them to `config.py`, add more streaming services (streamer_2, streamer_3, etc.) similar to streamer_1 in the `docker-compose.yml` file (see the sketch below), then redeploy the stack. You may then want to increase the number of consumers to speed up indexing:

```bash
docker service scale stream-index_consumer=2
```
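A purely illustrative compose fragment for an extra streamer; copy your actual streamer_1 definition and adjust what differs. The image name and port mapping below are assumptions, not the repository's actual values; only the 505x port sequence comes from this README.

```yaml
# Hypothetical docker-compose.yml fragment: everything except the service
# name and the 505x published-port convention is an assumption.
streamer_2:
  image: stream-index_streamer   # assumed: whatever image streamer_1 uses
  ports:
    - "5052:5052"                # next port after 5050 (sample) and 5051
  networks:
    - net0
```

After editing, redeploy with the same `docker stack deploy` command as in the Getting started section.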
## Paper

The reason why we use stop-words to collect tweets is explained in:

Mazoyer, B., Cagé, J., Hudelot, C., & Viaud, M.-L. (2018). "Real-time collection of reliable and representative tweets datasets related to news events". In Proceedings of the First International Workshop on Analysis of Broad Dynamic Topics over Social Media (BroDyn 2018).

Please cite this paper if you use this infrastructure for research purposes.