Sefaria Topics aims to leverage Artificial Intelligence to find semantic connections between topics across our entire corpus of texts!
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
You will need three data files to run this project:
- sefaria-export_prefix_refs.txt
- cleaned_docs_for_doc2vec.txt
- Hebrew_Wiki_Dicta.txt
All other necessary files are included within this GitHub repo.
Three scripts need to be run, in a particular order, to produce and test the Doc2Vec model:
- create_docs_for_doc2vec.py
- Combs through the entire Sefaria corpus to clean, preprocess, and prepare the text for training a Doc2Vec model. Stopwords, punctuation, and other trivial information are removed, and the individual Docs are defined. Multi-word phrases are also handled in this file. Lastly, there is an option to include Hebrew Wikipedia in the corpus as well (via a boolean set in the Constants file). See the preprocessing sketch after this list.
- Doc2Vec.py
- Trains a Doc2Vec model on the Docs created in the previous step (see the training sketch below).
- Doc2Vec_test_model.py
- Tests the model on a predefined list of key topics (see the query sketch below).
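
The snippet below is a minimal sketch of the kind of preprocessing `create_docs_for_doc2vec.py` performs, assuming the project uses gensim. The `HEBREW_STOPWORDS` set, the `raw_docs` input format, and the phrase-detection parameters are illustrative placeholders, not the project's actual values.

```python
# Hypothetical preprocessing sketch; names and parameters are assumptions.
import re
from gensim.models.phrases import Phrases, Phraser
from gensim.models.doc2vec import TaggedDocument

# Placeholder stopword list, not the project's actual Hebrew stopwords.
HEBREW_STOPWORDS = {"של", "את", "על", "לא"}

def clean_line(line):
    """Strip punctuation and stopwords from a single line of text."""
    tokens = re.sub(r"[^\w\s]", " ", line).split()
    return [t for t in tokens if t not in HEBREW_STOPWORDS]

# Each entry is assumed to be (doc_id, text) for one "Doc" (e.g. a ref).
raw_docs = [("Genesis 1:1", "בראשית ברא אלהים את השמים ואת הארץ")]
tokenized = [(doc_id, clean_line(text)) for doc_id, text in raw_docs]

# Detect multi-word phrases so frequent word pairs become single tokens.
phrases = Phraser(Phrases([tokens for _, tokens in tokenized],
                          min_count=5, threshold=10))
tagged_docs = [TaggedDocument(words=phrases[tokens], tags=[doc_id])
               for doc_id, tokens in tokenized]
```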
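
Training in `Doc2Vec.py` presumably follows the standard gensim pattern; the hyperparameters and output filename below are illustrative assumptions, not the project's settings.

```python
# Minimal training sketch over the tagged docs produced above.
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=100, window=5, min_count=5, workers=4, epochs=40)
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)
model.save("sefaria_doc2vec.model")  # hypothetical output filename
```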
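
Finally, a rough sketch of how `Doc2Vec_test_model.py` might query the trained model for its predefined key topics; the topic keys and filename are placeholders, and `model.dv` assumes the gensim 4.x API.

```python
# Load the saved model and print the most similar Docs for each key topic.
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("sefaria_doc2vec.model")
for topic in ["Genesis 1:1"]:  # stand-in for the predefined list of key topics
    if topic in model.dv:
        print(topic, model.dv.most_similar(topic, topn=5))
```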
- Noah Santacruz - Project Manager / Chief Data Scientist
- Joshua Goldmeier - Data Scientist