msa-string-similarity

Various algorithms to measure the similarity of N strings.

Built with Harry, licensed under the GPLv3. From the Harry documentation:

The tool supports several common distance and kernel functions for strings as well as some exotic similarity measures. The focus of Harry lies on implicit similarity measures, that is, comparison functions that do not give rise to an explicit vector space. Examples of such similarity measures are the Levenshtein distance, the Jaro-Winkler distance or the spectrum kernel.

Quick start

Run the microservice container with the following command:

docker run -ti -p 9906:80 msagency/msa-string-similarity

Examples

If no algorithm is specified, the service uses dist_levenshtein by default. From Wikipedia:

In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.

curl -XPOST 'localhost:9906/similarity' \
-H 'Content-Type: application/json' \
-d '[ "string1", "string2" ]'

[[0.0, 1.0], [1.0, 0.0]]

The result is a JSON matrix of the computed values: the entry at row i, column j is the value computed for strings i and j of the input array.
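To make the matrix layout concrete, here is a minimal local sketch in Python. It is not the service's implementation (the service delegates to Harry), and the function names `levenshtein` and `distance_matrix` are hypothetical. It builds the same kind of symmetric matrix for the default Levenshtein distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def distance_matrix(strings):
    """Entry [i][j] is the distance between strings[i] and strings[j]."""
    return [[float(levenshtein(s, t)) for t in strings] for s in strings]


print(distance_matrix(["string1", "string2"]))  # [[0.0, 1.0], [1.0, 0.0]]
```

The diagonal is always zero because each string is compared with itself.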

To get the list of supported algorithms, send a GET request to /similarity/algorithms:

curl http://localhost:9906/similarity/algorithms

{
  "algorithms": [
    {
      "algorithm": "kern_subsequence",
      "name": "Subsequence kernel",
      "reference": "Lodhi, Saunders, Shawe-Taylor, Cristianini, and Watkins...",
      "reference-url": "/similarity/references/kern_subsequence.pdf"
    },
    {
      "algorithm": "dist_damerau",
      "name": "Damerau-Levenshtein distance for strings",
      "reference": "Damerau. A technique for computer detection...",
      "reference-url": "/similarity/references/dist_damerau.pdf"
    },
  ...
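As a rough illustration of how a client might consume this response, the Python sketch below maps each algorithm identifier to its display name. The sample payload holds only the two entries shown above, with the truncated "reference" field left out; the real list is longer. The helper name `algorithm_index` is hypothetical.

```python
import json

# Sample payload copied from the truncated response above; the real list is longer.
SAMPLE = """
{
  "algorithms": [
    {
      "algorithm": "kern_subsequence",
      "name": "Subsequence kernel",
      "reference-url": "/similarity/references/kern_subsequence.pdf"
    },
    {
      "algorithm": "dist_damerau",
      "name": "Damerau-Levenshtein distance for strings",
      "reference-url": "/similarity/references/dist_damerau.pdf"
    }
  ]
}
"""


def algorithm_index(payload: str) -> dict:
    """Map each algorithm identifier to its human-readable name."""
    entries = json.loads(payload)["algorithms"]
    return {entry["algorithm"]: entry["name"] for entry in entries}


index = algorithm_index(SAMPLE)
for identifier, name in index.items():
    print(f"{identifier}: {name}")
```

The identifiers (the keys of `index`) are the values accepted by the algorithm request parameter.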

To change the algorithm, add the algorithm name as a query parameter to the request:

curl -XPOST 'localhost:9906/similarity?algorithm=dist_hamming' \
-H 'Content-Type: application/json' \
-d '[ "this is a string", "this is another string" ]'

[[0.0, 12.0], [12.0, 0.0]]
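The value 12.0 above can be reproduced locally under one assumption: for strings of unequal length, the length difference is added to the number of mismatched positions. The Python sketch below illustrates that reading. It is not Harry's code, and the function name `hamming_with_length_penalty` is hypothetical.

```python
def hamming_with_length_penalty(a: str, b: str) -> int:
    """Count mismatched positions over the common length, plus the length difference."""
    mismatches = sum(1 for ca, cb in zip(a, b) if ca != cb)
    return mismatches + abs(len(a) - len(b))


# 6 mismatches within the first 16 characters + 6 extra characters = 12
print(hamming_with_length_penalty("this is a string", "this is another string"))
```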

The next example uses the Jaro–Winkler distance together with the granularity parameter, which selects the unit of comparison. See Bytes, Bits and Tokens in the Harry documentation.

curl -XPOST 'localhost:9906/similarity?algorithm=dist_jarowinkler&granularity=bits' \
-H 'Content-Type: application/json' \
-d '[ "this is a test string", "this is also a test string" ]'

[[0.0, 0.0035714285913854837], [0.0035714285913854837, 0.0]]
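The curl calls above could also be issued from Python. The sketch below only builds the request with the standard library, so it runs without the container. It assumes the port mapping from the Quick start; the helper name `build_similarity_request` and the `BASE_URL` constant are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:9906"  # assumes the port mapping from the Quick start


def build_similarity_request(strings, algorithm=None, granularity=None):
    """Build a POST /similarity request; optional parameters go in the query string."""
    params = {}
    if algorithm is not None:
        params["algorithm"] = algorithm
    if granularity is not None:
        params["granularity"] = granularity
    url = f"{BASE_URL}/similarity"
    if params:
        url += "?" + urlencode(params)
    return Request(
        url,
        data=json.dumps(strings).encode("utf-8"),  # JSON array of input strings
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_similarity_request(
    ["this is a test string", "this is also a test string"],
    algorithm="dist_jarowinkler",
    granularity="bits",
)
print(request.full_url)
```

With the container running, passing `request` to `urllib.request.urlopen` and decoding the response body with `json.load` should return the matrix.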

Endpoints

POST /similarity : computes the similarity between N strings
GET /similarity/algorithms : lists the supported algorithms
GET /similarity/references/:algorithm : returns the reference documentation available for a given algorithm

Standard endpoints

GET /ms/version : returns the version number
GET /ms/name : returns the name
GET /ms/readme.md : returns the readme (this file)
GET /ms/readme.html : returns the readme as html
GET /swagger/swagger.json : returns the swagger api documentation
GET /swagger/#/ : returns swagger-ui displaying the api documentation
GET /nginx/stats.json : returns stats about Nginx
GET /nginx/stats.html : returns a dashboard displaying the stats from Nginx

About

A project by the Microservices Agency.
