TF Stylometry

Stylometry demo in TensorFlow

By Thomas Wood, https://www.fastdatascience.com

You need:

Python 3 (I recommend Anaconda)
TensorFlow 1.4+
GenSim word vectors file GoogleNews-vectors-negative300.bin. Current links are https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit and https://github.com/mmihaltz/word2vec-GoogleNews-vectors. This is a huge file, so unfortunately I can't host it; please let me know if the links break.
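
Because the word vectors file is so large, it can be worth checking that the download is intact before starting the notebooks. The sketch below is not part of this repo. It relies on the word2vec binary format starting with one ASCII line that holds the vocabulary size and the vector size, and it assumes the file is already decompressed (decompress it first if your download is gzip-compressed). The function name read_word2vec_header and the example path are hypothetical.

```python
from pathlib import Path


def read_word2vec_header(path):
    """Return (vocab_size, vector_size) from a word2vec binary file.

    The binary format begins with one ASCII line containing the two
    integers; the raw vectors follow it.
    """
    with open(path, "rb") as f:
        # Cap the read so a file that is not word2vec is not read in full.
        header = f.readline(64)
    vocab_size, vector_size = (int(x) for x in header.split())
    return vocab_size, vector_size


if __name__ == "__main__":
    # Hypothetical location: use wherever you saved the download.
    path = Path("GoogleNews-vectors-negative300.bin")
    if path.is_file():
        print(read_word2vec_header(path))
    else:
        print(f"{path} not found")
```

For this file the vector size should come out as 300, as the filename suggests. A ValueError here usually means the file is truncated, still compressed, or not a word2vec binary at all.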

Instructions

These instructions run the basic toy example: the works of Anne, Charlotte and Emily Brontë, which are in the folder data/raw.

You may want to put your own texts for classification into that folder. I was only able to store works in the repo that are already out of copyright.

Download the GenSim word vectors file from one of the above links
Launch Jupyter Notebook
Open Preprocess_data_1_Determine_vocabulary.ipynb and change the absolute path to the path to your downloaded word vectors file. Run the notebook. It will write some gz files to the data folder and also write the preprocessed (tokenised) texts to data/processed.
Kill the Jupyter kernel if it's still running, otherwise you'll run out of memory.
Open and run Preprocess_data_2_Convert_texts_to_token_IDs.ipynb. Again, kill the kernel at the end.
Open and run Train.ipynb. I suggest running it for about 30 minutes.
Make a note of the last file in the folder runs/checkpoints. This is your model at the point where you stopped training.
Correct the path passed to saver.restore in Execute.ipynb so that it points to the latest model, then run Execute.ipynb. The output is an array of probabilities that Anne, Charlotte and Emily Brontë (in alphabetical order) wrote the given text, for example:

array([[0.43884012, 0.35928553, 0.20187436]], dtype=float32)
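
The last two steps can be sketched with two small helpers. This is not code from the repo. The first helper assumes TensorFlow 1.x checkpoint naming, where each checkpoint has a file such as model-1200.index and the path to restore is that name without .index (the model- prefix is an assumption, so any name ending in a step number is accepted). The second pairs the probabilities with the authors in the alphabetical order described above. The names latest_checkpoint_prefix, rank_authors and AUTHORS are hypothetical.

```python
import re
from pathlib import Path

# Assumed author order: alphabetical, as stated in the instructions.
AUTHORS = ["Anne", "Charlotte", "Emily"]


def latest_checkpoint_prefix(checkpoint_dir):
    """Return the path prefix of the highest-step checkpoint, or None.

    Looks for files named '<name>-<step>.index' and returns the path of
    the one with the largest step, with '.index' removed.
    """
    folder = Path(checkpoint_dir)
    if not folder.is_dir():
        return None
    best_step, best_prefix = -1, None
    for index_file in folder.glob("*.index"):
        match = re.fullmatch(r"(.+)-(\d+)\.index", index_file.name)
        if match and int(match.group(2)) > best_step:
            best_step = int(match.group(2))
            best_prefix = str(index_file.with_suffix(""))
    return best_prefix


def rank_authors(probabilities, authors=AUTHORS):
    """Pair each probability with its author, most likely first."""
    pairs = zip(authors, probabilities)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(latest_checkpoint_prefix("runs/checkpoints"))
    # The single row of the example output shown above.
    for author, p in rank_authors([0.43884012, 0.35928553, 0.20187436]):
        print(f"{author}: {p:.1%}")
```

For the example output, Anne ranks first, but the three probabilities are close, so this is a weak attribution rather than a confident one.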

To run as a web server, edit author_inference.py to point to the correct model and run

python webserver.py

and go to localhost:5000 in your browser.

Acknowledgement

I've taken the demo training data from https://github.com/mikekestemont/pystyl; it originally comes from Project Gutenberg.

I based the text classification CNN on https://github.com/dennybritz/cnn-text-classification-tf.
