audy/nearproteins

nearproteins

Store for finding similar amino acid sequences using Locality Sensitive Hashing and Approximate Nearest Neighbors.

Installation

Python, with dependencies:

  • Annoy
  • BioPython

Install them with:

$ pip install -r requirements.txt

Redis is also required.

Instructions

  1. Load proteins into the database:
$ ./load-proteins < data/proteins.fasta
  2. Query the database:
$ ./query-proteins < data/proteins.fasta # returns JSON for each record

Python API

The API is very basic for now; I plan to add more configuration options.

Loading data into store

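from Bio import SeqIO
import nearproteins

store = nearproteins.SimilarStringStore()

# Empty all existing buckets before loading
store.engine.clean_all_buckets()

# Add each FASTA record to the store, keyed by its record ID
with open('data/proteins.fasta') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        store.add(str(record.seq), record.id)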

Retrieving records from store

# returns array of vectors, match IDs, similarities
results = store.get(str(record.seq)) 
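The comment above lists vectors, match IDs and similarities but does not spell out the exact layout of the results. As a rough sketch only, assuming each result is a (vector, match_id, score) tuple and that a lower score means a closer match (the server example further down reports 0.0 for an exact match), a hypothetical helper for ranking matches might look like this. `best_matches` is not part of nearproteins.

```python
def best_matches(results, limit=5):
    """Return up to `limit` (match_id, score) pairs, lowest score first.

    Assumes each result is a (vector, match_id, score) tuple. This layout
    is an assumption for illustration, not a documented guarantee.
    """
    ranked = sorted(results, key=lambda result: result[2])
    return [(match_id, score) for _vector, match_id, score in ranked[:limit]]
```

If the store actually returns similarities where higher is better, pass `reverse=True` to `sorted` instead.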

Use as a server

You can add records to the database and query it over a plain TCP socket.

$ ./server # start the server, listens on port 1234

In another window...

$ nc 127.0.0.1 1234 # connect
SET 1 AUSTIN
SET 2 BOSTON
GET AUSTIN
{"1": 0.0}
GET BOSTON
{"2": 0.0}
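The same exchange can be scripted. The following is a minimal Python sketch based only on the session above: commands are sent as newline-terminated lines, and each GET is answered with one line of JSON. The helper names (`set_command`, `get_command`, `parse_matches`, `query`) are made up for this example, and since the session does not show whether SET sends a reply, the sketch only reads a response after GET.

```python
import json
import socket

HOST = "127.0.0.1"  # assumed: the server runs on the local machine
PORT = 1234         # port the server listens on, per the example above


def set_command(record_id, sequence):
    """Build a SET line, e.g. 'SET 1 AUSTIN\\n'."""
    return f"SET {record_id} {sequence}\n"


def get_command(sequence):
    """Build a GET line, e.g. 'GET AUSTIN\\n'."""
    return f"GET {sequence}\n"


def parse_matches(line):
    """Parse one JSON reply line into a dict mapping record ID to score."""
    return json.loads(line)


def query(sequence, host=HOST, port=PORT):
    """Send one GET over a fresh connection and return the parsed reply."""
    with socket.create_connection((host, port)) as conn:
        reader = conn.makefile("r", encoding="utf-8")
        conn.sendall(get_command(sequence).encode("utf-8"))
        return parse_matches(reader.readline())

# Example (requires ./server to be running):
#   print(query("AUSTIN"))
```

Opening a new connection per query keeps the sketch short; a real client would probably reuse one connection, as the `nc` session does.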

You can use this to build a simple client in another language, such as Ruby:

require 'dna'
require 'socket'
require 'json'

HOSTNAME = '127.0.0.1'
PORT = 1234

socket = TCPSocket.open HOSTNAME, PORT

File.open('proteins.fasta') do |handle|
  records = Dna.new handle, :format => :fasta

  records.each do |record|
    socket.puts "GET #{record.sequence}"
    resp = JSON.parse(socket.gets)
    p resp
  end
end
