README

Introduction

In recent times, the novel coronavirus has spread across the globe and put the world under lockdown. In response, medical professionals have begun looking for solutions to the pandemic through all potential means, new and old. However, progress has been slow because little is known about the virus. There is an abundance of information in the form of academic papers, including those related to COVID-19, but filtering through it quickly has been difficult. A multitude of organizations have come together to release CORD-19, an easy-to-parse data set containing thousands of research papers that are potentially connected to COVID-19.

The goal of this project was to explore the contents of the data set, establish similarities and connections between academic papers, and identify what kinds of topics are present in CORD-19. Various topic modeling techniques were used to estimate the number of topics within the data set and how coherent those topics were. The result was an improvement over the benchmark topic coherence score, to 0.663, with a relative standard deviation of topic coherence of 0.144.
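
As a rough illustration of the two summary statistics mentioned above, the sketch below computes a mean coherence and its relative standard deviation (standard deviation divided by the mean) from a list of per-topic coherence scores. The function name `coherence_summary`, the example scores, and the use of the population standard deviation are all assumptions made for illustration; the write-up does not say which variant was used.

```python
from statistics import mean, pstdev


def coherence_summary(per_topic_coherence):
    """Return (mean, relative standard deviation) of per-topic coherence scores."""
    avg = mean(per_topic_coherence)
    # Relative standard deviation: the spread expressed as a fraction of the mean.
    # Assumption: population standard deviation (pstdev), not the sample version.
    rsd = pstdev(per_topic_coherence) / avg
    return avg, rsd


# Hypothetical per-topic scores, not results from this project.
scores = [0.5, 0.75, 0.25, 0.5]
avg, rsd = coherence_summary(scores)
print(f"mean coherence = {avg:.3f}, relative std dev = {rsd:.3f}")
```

A lower relative standard deviation means the topics are more uniformly coherent, rather than a few strong topics masking several weak ones.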

Data

This project falls within natural language processing and natural language understanding. CORD-19 (version 8) was chosen as the data set. It was created by the US Government and large technology companies and distributed through Kaggle in an easy-to-parse format.

Implementation

For the models themselves, the hyperparameters did not change much from the early models. The most important parameter for topic modeling is the number of topics. As with many unsupervised clustering algorithms, the best way to determine the correct number of clusters is to generate many models with different cluster counts and compare them.
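
The comparison described above can be sketched as a small sweep: fit one model per candidate topic count, score each, and keep the best. This is a minimal sketch, not the project's actual code. The names `sweep_topic_counts`, `fit`, and `score` are hypothetical, and the toy `fit`/`score` pair below is a stand-in so the example runs without a corpus.

```python
from typing import Callable, Dict, Iterable, Tuple, TypeVar

Model = TypeVar("Model")


def sweep_topic_counts(
    candidates: Iterable[int],
    fit: Callable[[int], Model],
    score: Callable[[Model], float],
) -> Tuple[int, Dict[int, float]]:
    """Fit one model per candidate topic count and return the best count and all scores."""
    scores: Dict[int, float] = {}
    for num_topics in candidates:
        model = fit(num_topics)            # train a model with this many topics
        scores[num_topics] = score(model)  # e.g. its topic coherence
    # Higher score is assumed to be better (true for coherence measures).
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins (assumptions): the "model" is just its topic count, and the
# score is a synthetic curve that peaks at 8 topics.
def toy_fit(num_topics: int) -> int:
    return num_topics


def toy_score(model: int) -> float:
    return -float((model - 8) ** 2)


best_k, all_scores = sweep_topic_counts(range(2, 21, 2), toy_fit, toy_score)
print(best_k)
```

With a topic modeling library such as gensim (an assumption here), `fit` could wrap training an LDA model with `num_topics` set to the candidate value, and `score` could wrap a coherence computation on the trained model. Keeping both as callables lets the same sweep be reused for the benchmark and for later models.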

kmirijan/ML-COVID-CORD