Building a Search Engine

Warning: work in progress!

A basic search engine that indexes a document corpus, then searches it and ranks the matching documents. Built in Python using object-oriented programming principles to keep the project extendable and maintainable.

Features:

  • Inverted Index - maps each term to the documents that contain it, so a query does not have to scan the whole corpus.
  • Results Ranking - uses term frequency–inverse document frequency (TF-IDF) to order results by relevance.
  • Query Expansion - automatically adds related query terms (such as synonyms) to improve result relevance (see my testing analysis).
  • Result Evaluation - tests and compares results against human-evaluated relevance scores to gauge performance.
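
To give a feel for how the first two features fit together, here is a minimal sketch of an inverted index with TF-IDF ranking. It is an illustration only, not the project's actual code: the `InvertedIndex` class, its method names, the whitespace tokenizer, and the plain `tf * log(N / df)` weighting are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


class InvertedIndex:
    """Hypothetical index: maps each term to the documents containing it."""

    def __init__(self):
        # term -> {doc_id: number of times the term occurs in that document}
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add_document(self, doc_id, text):
        # Assumption: naive tokenization (lowercase, split on whitespace).
        for term, count in Counter(text.lower().split()).items():
            self.postings[term][doc_id] = count
        self.doc_count += 1

    def search(self, query):
        """Return (doc_id, score) pairs sorted by descending TF-IDF score."""
        scores = defaultdict(float)
        for term in query.lower().split():
            docs = self.postings.get(term, {})
            if not docs:
                continue  # term does not occur anywhere in the corpus
            # Terms that appear in fewer documents get a higher weight.
            idf = math.log(self.doc_count / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)


index = InvertedIndex()
index.add_document("d1", "the cat sat on the mat")
index.add_document("d2", "the dog chased the cat")
index.add_document("d3", "dogs and cats make good pets")

results = index.search("dog cat")
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

In this toy corpus `d2` ranks first because it matches both query terms, followed by `d1`. `d3` is not returned at all, since "dogs" and "cats" are different tokens without stemming or query expansion.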

This started out as a course project, and I'm now extending it with more features. I'm also planning to build a front-end web interface so I can demo the project better.

ToDo:

  • Split up files and organize into packages.
  • Write Documentation!
  • Finish implementing stop words functionality.
  • Build a frontend web interface to demo the project.
  • Result snippet generation.
  • Implement advanced search operators (OR, NOT).
  • Improve query normalization.
  • Ranking improvements.
  • Add caching and on-demand loading to improve memory efficiency.

I hope to write some more comprehensive documentation for this project in the near future.

Stay tuned :)