Skip to content
This repository has been archived by the owner on Aug 8, 2023. It is now read-only.

A game where you need to guess whether a tweet comes from a human, or from a neural network language model trained on a category of tweets.

Notifications You must be signed in to change notification settings

model-zoo/twitter-turing-test

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

40 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Twitter Turing Test

A game where you need to guess whether a tweet comes from a human, or from a neural network language model trained on a category of tweets.

Deployed on Model Zoo

Overview

This project uses the following open source projects for developing our models:

Models are deployed to Model Zoo for a realtime HTTP endpoint.

The frontend React application is a wrapper around a dataset of static tweets and the Model Zoo HTTP endpoints. See frontend documentation for more details.

How to train and deploy your own language model

Preparing the dataset

The script download-tweets.sh contains our methodology for scraping a twitter dataset with twint. The script accepts a single argument $name, and searches for a file sources/$name.txt which should specify a list of twitter handles to scrape from. The scraping script applies some crude heuristics to attempt to filter out tweets that (1) are replies to other tweets and (2) that have links in them. These heuristics aren't perfect, but I found they worked good enough.

In preparing several datasets with twint, I found that Twitter was often blacklisting our IP a few minutes into the scraping process. To get around this, I used scrapoxy on EC2 to scrape from five different EC2 proxy instances at once.

Several datasets have been prepared and released publically on AWS that you are free to use:

Dataset Sources File AWS S3 URL
Venture Capital https://github.com/model-zoo/twitter-turing-test/blob/master/sources/vc.txt s3://modelzoo-datasets/text-generation/vc
Republicans https://github.com/model-zoo/twitter-turing-test/blob/master/sources/republicans.txt s3://modelzoo-datasets/text-generation/republicans
Democrats https://github.com/model-zoo/twitter-turing-test/blob/master/sources/democrats.txt s3://modelzoo-datasets/text-generation/democrats
COVID-19 https://github.com/model-zoo/twitter-turing-test/blob/master/sources/covid19.txt s3://modelzoo-datasets/text-generation/covid19

Thank you to minimaxir/download-tweets-ai-textgen for supplying a list of republican and democrat twitter handles.

Training the model

The script train.py includes code to load pretrained weights and fine-tune the model, largely adapted from the Hugging Face Language Modeling example. Each model was trained on a single K80 GPU. All of the models were trained for a single epoch except for COVID-19, which was a slightly smaller dataset and trained for two epochs. Model training took somewhere between 6 - 16 hours for each model.

The training process could be improved further with hyperparameter optimization and additional experimentation. This is left as an exercise to the reader :)

Deploying the model

The model is automatically deployed after training using the Model Zoo transformers support and a few lines of code:

textgen = pipeline("text-generation", model=model, tokenizer=tokenizer)
modelzoo.transformers.deploy(
    textgen,
    model_name="..."
    resources_config=modelzoo.ResourcesConfig(memory_mb=2048, cpu_units=1024),
)

Please ensure you run $ modelzoo auth to create or login to a Model Zoo account in your training environment. Alternatively, you can set the environment variable MODELZOO_API_KEY accordingly.

See modelzoo.transformers.deploy for more details.