A simple library for calculating the distance between two documents through the cosine similarity algorithm

adrianosferreira/document-distance

Document Distance - Cosine Similarity

Document Distance / Similarity is measured based on the content overlap between documents.

One of the most common algorithms for this problem is cosine similarity, a vector-based similarity measure, and it is the algorithm this library implements.

The cosine distance of two documents is defined by the angle between their feature vectors, which in our case are word frequency vectors. The word frequency distribution of a document is a mapping from each word to its frequency count.
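Written out, the definition above takes the standard cosine similarity form. Here $A$ and $B$ are the word frequency vectors of the two documents and $A_w$ is the count of word $w$ in the first document. The symbol names are chosen for illustration and do not come from the library.

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{w} A_w B_w}{\sqrt{\sum_{w} A_w^2}\,\sqrt{\sum_{w} B_w^2}},
\qquad
\theta = \arccos\!\left(\frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}\right)
```

For example, assuming words are split on whitespace, the strings 'test 123 456' and 'test 678 000' each contain three distinct words and share exactly one of them. This gives $A \cdot B = 1$ and $\lVert A \rVert = \lVert B \rVert = \sqrt{3}$, so $\cos\theta = 1/3$ and $\theta \approx 1.23$ radians. Identical documents give $\theta = 0$, and documents with no words in common give $\theta = \pi/2$.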

Cosine Similarity
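The computation can be sketched in a few lines of PHP. This is a minimal illustration of the general technique, not the library's actual source: the function names `wordFrequencies`, `innerProduct` and `documentAngle` are hypothetical, and the tokenization rule (lowercase, split on non-alphanumeric characters) is an assumption.

```php
<?php
// Hypothetical helpers for illustration only; they are not part of the library's API.

/** Map each lowercased word in $text to the number of times it occurs. */
function wordFrequencies(string $text): array
{
    // Assumption: words are runs of letters and digits; everything else separates them.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words);
}

/** Dot product of two sparse frequency vectors keyed by word. */
function innerProduct(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $word => $count) {
        // Words missing from $b contribute zero.
        $sum += $count * ($b[$word] ?? 0);
    }
    return $sum;
}

/** Angle in radians between the word frequency vectors of two texts. */
function documentAngle(string $textA, string $textB): float
{
    $a = wordFrequencies($textA);
    $b = wordFrequencies($textB);
    $norms = sqrt(innerProduct($a, $a) * innerProduct($b, $b));
    if ($norms == 0.0) {
        // At least one text has no words, so the angle is undefined.
        // This sketch treats that case as "nothing in common".
        return M_PI / 2;
    }
    // Clamp to [0, 1] to guard against floating-point drift before acos().
    $cosine = min(1.0, max(0.0, innerProduct($a, $b) / $norms));
    return acos($cosine);
}

echo documentAngle('test 123 456', 'test 678 000');
```

An angle of 0 means the two texts have the same word distribution, and an angle of pi/2 means they share no words. How the library converts this angle into the value returned by `getPercent()` is not stated here, so the sketch stops at the angle.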

Installation

It's recommended that you use Composer to install this library.

$ composer require adrianoferreira/document-distance:dev-master

Usage

Calculating similarity percentage between two remote files:

echo ( new \AdrianoFerreira\DD\File( 'http://test.com/test.txt', 'http://test.com/test2.txt' ) )->getPercent();

Calculating arc size between two local files:

echo ( new \AdrianoFerreira\DD\File( __DIR__ . '/test.txt', __DIR__ . '/test2.txt' ) )->getArcSize();

Calculating similarity percentage between two arbitrary strings:

echo ( new \AdrianoFerreira\DD\Text( 'test 123 456', 'test 678 000' ) )->getPercent();

Calculating arc size between arbitrary strings:

echo ( new \AdrianoFerreira\DD\Text( 'test 123 456', 'test 678 000' ) )->getArcSize();

References

This implementation is based on an MIT document: https://courses.csail.mit.edu/6.006/fall11/rec/rec02.pdf
