pqai snippet
This service contains code for extracting the most relevant passages from the full text of search results. These passages let users assess a result's relevance quickly.

In the context of prior-art searching, it is generally not possible to judge the relevance of a search result from its title and abstract alone, the fields most commonly shown on result pages. This service therefore contains ML models that extract the passages of text (also called snippets) most likely to be relevant to a given query from the full text of the document.

These passages can be extracted, for example, by splitting the text of the document into sentences and then ranking them with a text similarity model. For long queries containing multiple sentences or paragraphs, the service also provides sentence-level snippets, thereby creating a "mapping" between the query elements and the document.
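To make the split-and-rank idea concrete, here is a toy sketch. It uses simple term overlap as a stand-in for the service's actual similarity model (which is a trained ML model, not shown here); the function names are illustrative, not part of the service's API:

```python
import re

def split_sentences(text):
    """Naive sentence splitter: breaks on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def overlap_score(query, sentence):
    """Toy similarity: fraction of query terms that appear in the sentence."""
    q = set(query.lower().split())
    s = set(sentence.lower().split())
    return len(q & s) / len(q) if q else 0.0

def best_snippet(query, text):
    """Return the sentence most similar to the query."""
    sentences = split_sentences(text)
    return max(sentences, key=lambda s: overlap_score(query, s))
```

In the real service, `overlap_score` would be replaced by an embedding-based similarity model, but the overall pipeline (split, score, rank) is the same.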
```
root/
|-- core/
|   |-- snippet.py
|   |-- highlighter.py
|   |-- subsent_span_extractor.py
|   |-- encoder_srv.py
|   |-- reranking_srv.py
|-- assets/                # files (e.g. ML models) used by core modules
|-- tests/
|   |-- test_snippet.py
|   |-- test_subsent_span_extractor.py
|   |-- test_srvs.py       # tests for dependency services
|   |-- test_server.py     # tests for the REST API
|-- main.py                # defines the REST API
|-- requirements.txt       # list of Python dependencies
|-- Dockerfile             # Docker files
|-- docker-compose.yml
|-- env                    # .env file template
|-- deploy.sh              # script for setting up the service locally
```
This module provides the following classes:

- `SnippetExtractor`: extracts the most relevant passage for a query from a given document through its `.extract_snippet()` method. Typical usage is as follows:

  ```python
  from core.snippet import SnippetExtractor

  query = "fluid formation sampling"
  text = "A fluid sampling system retrieves a formation fluid sample from a formation surrounding a wellbore extending along a wellbore axis, wherein the formation has a virgin fluid and a contaminated fluid therein. The system includes a sample inlet, a first guard inlet positioned adjacent to the sample inlet and spaced from the sample inlet in a first direction along the wellbore axis, and a second guard inlet positioned adjacent to the sample inlet and spaced from the sample inlet in a second, opposite direction along the wellbore axis."
  snippet = SnippetExtractor.extract_snippet(query, text)
  print(snippet)
  ```

  For a complex query with multiple elements, it provides element-wise snippets through its `.map()` method. Typical usage is as follows:

  ```python
  from core.snippet import SnippetExtractor

  longquery = "A method of sampling formation fluids. The method includes lowering a sampling apparatus into a borewell."
  text = "A fluid sampling system retrieves a formation fluid sample from a formation surrounding a wellbore extending along a wellbore axis, wherein the formation has a virgin fluid and a contaminated fluid therein. The system includes a sample inlet, a first guard inlet positioned adjacent to the sample inlet and spaced from the sample inlet in a first direction along the wellbore axis, and a second guard inlet positioned adjacent to the sample inlet and spaced from the sample inlet in a second, opposite direction along the wellbore axis."
  mapping = SnippetExtractor.map(longquery, text)
  print(mapping)
  ```

  Note that both of these are static methods and do not require `SnippetExtractor` to be instantiated.
- `CombinationalMapping`: creates mappings for multiple documents against a single query. It is effectively the same as running `SnippetExtractor.map` individually against each document, but the implementation of `CombinationalMapping.map` is optimized for performance.
- `SubsentSnippetExtractor`: used by `SnippetExtractor` to find word sequences within a sentence that are likely to make sense without neighboring context. Within the thresholds of a minimum and a maximum sequence length, it returns all possible sequences ranked by their likelihood of making independent sense. Under the hood, it uses a trained neural network for this task. It is not used directly by any module other than `SnippetExtractor` and can therefore be considered an implementation detail of that class.
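The enumeration part of this idea can be sketched as follows. This is an illustration only, not the repository's implementation: candidate spans within the length thresholds are enumerated exhaustively, and a scoring function ranks them (the real class uses a trained neural network as the scorer; here any callable works):

```python
def candidate_spans(sentence, min_len=3, max_len=8):
    """Enumerate all contiguous word sequences whose length in words
    lies within [min_len, max_len]."""
    words = sentence.split()
    spans = []
    for i in range(len(words)):
        for j in range(i + min_len, min(i + max_len, len(words)) + 1):
            spans.append(" ".join(words[i:j]))
    return spans

def rank_spans(sentence, score, min_len=3, max_len=8):
    """Rank candidate spans by a scoring function; the real model scores
    each span's likelihood of making sense on its own."""
    return sorted(candidate_spans(sentence, min_len, max_len),
                  key=score, reverse=True)
```

Note that the number of candidates grows quickly with sentence length, which is one reason the thresholds exist.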
The assets required to run this service are stored in the `/assets` directory. When you clone the GitHub repository, the `/assets` directory will contain nothing but a README file. You will need to download the actual asset files as a zip archive from the following link:

https://s3.amazonaws.com/pqai.s3/public/assets-pqai-reranker.zip

After downloading, extract the zip file into the `/assets` directory. (Alternatively, you can use the `deploy.sh` script to do this step automatically; see the next section.)
The assets contain the following files/directories:

- `span_extractor_dictionary.json`: term-to-index mapping for the vocabulary
- `span_extractor_model.hdf5`: model weights for the sub-sentence span extractor
- `span_extractor_vectors.txt`: term embeddings for the sub-sentence span extractor model
- `span_extractor_vocab.json`: vocabulary for the sub-sentence span extractor model
- `stopwords.txt`: list of patent-specific stopwords
Prerequisites
The following deployment steps assume that you are running a Linux distribution and have Git and Docker installed on your system.
Setup
The easiest way to get this service up and running on your local system is to follow these steps:

1. Clone the repository:

   ```bash
   git clone https://github.com/pqaidevteam/pqai-[service].git
   ```

2. Using the `env` template in the repository, create a `.env` file and set the environment variables:

   ```bash
   cd pqai-[service]
   cp env .env
   nano .env
   ```

3. Run the `deploy.sh` script:

   ```bash
   chmod +x deploy.sh
   bash ./deploy.sh
   ```
This will create a Docker image and run it as a Docker container on the port number you specified in the `.env` file.
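For reference, a `.env` file produced from the `env` template defines at least the port the service listens on; the variable name below is illustrative only — check the `env` template in the repository for the actual names:

```
PORT=80
```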
Alternatively, after following steps (1) and (2) above, you can run `python main.py` to start the service in a terminal.
This service is dependent on the following other services:
- pqai-encoder
- pqai-reranker
The following services depend on this service:
- pqai-gateway