pqai snippet
This service contains code for extracting the most relevant passages from the full text of search results. These passages let users assess a result's relevance quickly.

In the context of prior-art searching, it is generally not possible to judge the relevance of a search result from its title and abstract alone, the fields most commonly shown on result pages. This service therefore contains ML models that extract the passages of text (also called snippets) most likely to be relevant to a given query from the full text of the document.

These passages can be extracted, for example, by splitting the text of the document into sentences and then ranking them with a text similarity model. For long queries containing multiple sentences or paragraphs, the service also provides sentence-level snippets, thereby creating a "mapping" between the query elements and the document.
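To make the split-and-rank idea concrete, here is a toy sketch. It uses simple term overlap as a stand-in for the service's actual similarity model (which is a trained ML model, not shown here); the function names are illustrative, not part of the service's API:

```python
import re

def split_sentences(text):
    """Naive sentence splitter: breaks on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def overlap_score(query, sentence):
    """Toy similarity: fraction of query terms that appear in the sentence."""
    q = set(query.lower().split())
    s = set(sentence.lower().split())
    return len(q & s) / len(q) if q else 0.0

def best_snippet(query, text):
    """Return the sentence most similar to the query."""
    sentences = split_sentences(text)
    return max(sentences, key=lambda s: overlap_score(query, s))
```

In the real service, `overlap_score` would be replaced by an embedding-based similarity model, but the overall pipeline (split, score, rank) is the same.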
```
root/
|-- core/
|   |-- snippet.py
|   |-- highlighter.py
|   |-- subsent_span_extractor.py
|   |-- encoder_srv.py
|   |-- reranking_srv.py
|-- assets/                # files (e.g. ML models) used by core modules
|-- tests/
|   |-- test_snippet.py
|   |-- test_subsent_span_extractor.py
|   |-- test_srvs.py       # tests for dependency services
|   |-- test_server.py     # tests for the REST API
|-- main.py                # defines the REST API
|-- requirements.txt       # list of Python dependencies
|-- Dockerfile             # Docker files
|-- docker-compose.yml
|-- env                    # .env file template
|-- deploy.sh              # script for setting up the service locally
```
This module provides the following classes:

- `SnippetExtractor`: extracts the most relevant passage for a query from a given document through its `.extract_snippet()` method. Typical usage is as follows:

  ```python
  from core.snippet import SnippetExtractor

  query = "fluid formation sampling"
  text = "A fluid sampling system retrieves a formation fluid sample from a formation surrounding a wellbore extending along a wellbore axis, wherein the formation has a virgin fluid and a contaminated fluid therein. The system includes a sample inlet, a first guard inlet positioned adjacent to the sample inlet and spaced from the sample inlet in a first direction along the wellbore axis, and a second guard inlet positioned adjacent to the sample inlet and spaced from the sample inlet in a second, opposite direction along the wellbore axis."
  snippet = SnippetExtractor.extract_snippet(query, text)
  print(snippet)
  ```

  For a complex query with multiple elements, it provides element-wise snippets through its `.map()` method. Typical usage is as follows:

  ```python
  from core.snippet import SnippetExtractor

  longquery = "A method of sampling formation fluids. The method includes lowering a sampling apparatus into a borewell."
  text = "A fluid sampling system retrieves a formation fluid sample from a formation surrounding a wellbore extending along a wellbore axis, wherein the formation has a virgin fluid and a contaminated fluid therein. The system includes a sample inlet, a first guard inlet positioned adjacent to the sample inlet and spaced from the sample inlet in a first direction along the wellbore axis, and a second guard inlet positioned adjacent to the sample inlet and spaced from the sample inlet in a second, opposite direction along the wellbore axis."
  mapping = SnippetExtractor.map(longquery, text)
  print(mapping)
  ```

  Note that both of these are static methods and do not require `SnippetExtractor` to be instantiated.
- `CombinationalMapping`: creates mappings for multiple documents against a single query. It is effectively the same as running `SnippetExtractor.map` individually against each document, but the implementation of `CombinationalMapping.map` is optimized for performance.
- `SubsentSnippetExtractor`: used by `SnippetExtractor` to find word sequences within a sentence that are likely to make sense without neighboring context. Within the thresholds of a minimum and a maximum sequence length, it returns all possible sequences ranked by their likelihood of making independent sense. Under the hood, it uses a trained neural network for this task. It is not used directly by any module other than `SnippetExtractor` and can therefore be considered an implementation detail of that class.
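The enumeration part of this idea can be sketched as follows. This is an illustration only, not the repository's implementation: candidate spans within the length thresholds are enumerated exhaustively, and a scoring function ranks them (the real class uses a trained neural network as the scorer; here any callable works):

```python
def candidate_spans(sentence, min_len=3, max_len=8):
    """Enumerate all contiguous word sequences whose length in words
    lies within [min_len, max_len]."""
    words = sentence.split()
    spans = []
    for i in range(len(words)):
        for j in range(i + min_len, min(i + max_len, len(words)) + 1):
            spans.append(" ".join(words[i:j]))
    return spans

def rank_spans(sentence, score, min_len=3, max_len=8):
    """Rank candidate spans by a scoring function; the real model scores
    each span's likelihood of making sense on its own."""
    return sorted(candidate_spans(sentence, min_len, max_len),
                  key=score, reverse=True)
```

Note that the number of candidates grows quickly with sentence length, which is one reason the thresholds exist.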
The assets required to run this service are stored in the `/assets` directory. When you clone the GitHub repository, the `/assets` directory will contain nothing but a README file. You will need to download the actual asset files as a zip archive from the following link:

https://s3.amazonaws.com/pqai.s3/public/assets-pqai-reranker.zip

After downloading, extract the zip file into the `/assets` directory. (Alternatively, you can use the `deploy.sh` script to do this step automatically; see the next section.)
The assets contain the following files/directories:

- `span_extractor_dictionary.json`: term-to-index mapping for the vocabulary
- `span_extractor_model.hdf5`: model weights for the sub-sentence span extractor
- `span_extractor_vectors.txt`: term embeddings for the sub-sentence span extractor model
- `span_extractor_vocab.json`: vocabulary for the sub-sentence span extractor model
- `stopwords.txt`: list of patent-specific stopwords
Prerequisites
The following deployment steps assume that you are running a Linux distribution and have Git and Docker installed on your system.
Setup
The easiest way to get this service up and running on your local system is to follow these steps:

1. Clone the repository:

   ```bash
   git clone https://github.com/pqaidevteam/pqai-[service].git
   ```

2. Using the `env` template in the repository, create a `.env` file and set the environment variables:

   ```bash
   cd pqai-[service]
   cp env .env
   nano .env
   ```

3. Run the `deploy.sh` script:

   ```bash
   chmod +x deploy.sh
   bash ./deploy.sh
   ```
This will create a Docker image and run it as a Docker container on the port number you specified in the `.env` file.
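For reference, a `.env` file produced from the `env` template defines at least the port the service listens on; the variable name below is illustrative only — check the `env` template in the repository for the actual names:

```
PORT=80
```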
Alternatively, after following steps (1) and (2) above, you can run `python main.py` to start the service in a terminal.
This service is dependent on the following other services:
- pqai-encoder
- pqai-reranker
The following services depend on this service:
- pqai-gateway