Sefaria Topics aims to leverage Artificial Intelligence to find semantic connections between topics across our entire corpus of texts!
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
You will need three data files to run this project:
- sefaria-export_prefix_refs.txt
- cleaned_docs_for_doc2vec.txt
- Hebrew_Wiki_Dicta.txt
All other necessary files are included within this GitHub repo.
Three scripts need to be run, in a particular order, to produce and test the Doc2Vec model:
- create_docs_for_doc2vec.py
- Combs through the entire Sefaria corpus to clean, preprocess, and prepare the text for training a Doc2Vec model. Stopwords, punctuation, and other trivial information are removed, and the individual Docs are defined. Multi-word phrases are also handled in this file. Lastly, there is an option to include Hebrew Wikipedia in the corpus as well (via a boolean set in the Constants file). See the preprocessing sketch after this list.
- Doc2Vec.py
- Trains a Doc2Vec model on the Docs created in the previous step (see the training sketch below).
- Doc2Vec_test_model.py
- Tests the model on a predefined list of key topics (see the query sketch below).
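
The snippet below is a minimal sketch of the kind of preprocessing `create_docs_for_doc2vec.py` performs, assuming the project uses gensim. The `HEBREW_STOPWORDS` set, the `raw_docs` input format, and the phrase-detection parameters are illustrative placeholders, not the project's actual values.

```python
# Hypothetical preprocessing sketch; names and parameters are assumptions.
import re
from gensim.models.phrases import Phrases, Phraser
from gensim.models.doc2vec import TaggedDocument

# Placeholder stopword list, not the project's actual Hebrew stopwords.
HEBREW_STOPWORDS = {"של", "את", "על", "לא"}

def clean_line(line):
    """Strip punctuation and stopwords from a single line of text."""
    tokens = re.sub(r"[^\w\s]", " ", line).split()
    return [t for t in tokens if t not in HEBREW_STOPWORDS]

# Each entry is assumed to be (doc_id, text) for one "Doc" (e.g. a ref).
raw_docs = [("Genesis 1:1", "בראשית ברא אלהים את השמים ואת הארץ")]
tokenized = [(doc_id, clean_line(text)) for doc_id, text in raw_docs]

# Detect multi-word phrases so frequent word pairs become single tokens.
phrases = Phraser(Phrases([tokens for _, tokens in tokenized],
                          min_count=5, threshold=10))
tagged_docs = [TaggedDocument(words=phrases[tokens], tags=[doc_id])
               for doc_id, tokens in tokenized]
```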
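
Training in `Doc2Vec.py` presumably follows the standard gensim pattern; the hyperparameters and output filename below are illustrative assumptions, not the project's settings.

```python
# Minimal training sketch over the tagged docs produced above.
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=100, window=5, min_count=5, workers=4, epochs=40)
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)
model.save("sefaria_doc2vec.model")  # hypothetical output filename
```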
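
Finally, a rough sketch of how `Doc2Vec_test_model.py` might query the trained model for its predefined key topics; the topic keys and filename are placeholders, and `model.dv` assumes the gensim 4.x API.

```python
# Load the saved model and print the most similar Docs for each key topic.
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("sefaria_doc2vec.model")
for topic in ["Genesis 1:1"]:  # stand-in for the predefined list of key topics
    if topic in model.dv:
        print(topic, model.dv.most_similar(topic, topn=5))
```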
- Noah Santacruz - Project Manager / Chief Data Scientist
- Joshua Goldmeier - Data Scientist