Open Source REST-API for Named Entity Extraction, Normalization, Reconciliation, Recommendation, Named Entity Disambiguation and Named Entity Linking
https://opensemanticsearch.org/doc/datamanagement/named_entity_recognition
REST API and Python library for search, suggestion, recommendation, normalization, reconciliation, named entity extraction, named entity linking & named entity disambiguation of named entities like persons, organizations and places for (semi)automatic semantic tagging, analysis & semantic enrichment of documents by linked data knowledge graph like SKOS thesaurus, Wikidata or RDF ontologies, SQL database(s) or spreadsheets like CSV, TSV or Excel table(s).
By integration of Open Standards for structured data formats (SKOS, RDF, JSON) and REST APIs (HTTP, REST) and entity linking/disambiguation/reconciliation (Open Refine Reconciliation Service API specs) and Open Source tools for natural language processing and text analytics this Free Software provides an Open Refine Reconciliation Service API (extended with automatic entity extraction, so you can post a full text instead of entity queries) for your own SKOS thesaurus, RDF ontologies and lists of names from lists, spreadsheets or databases as an independent service which can run on your own server or laptop, so you have not to send sensitive content data or names to external cloud service, and you can independent setup additional / own named entities or names.
After configuration (see section "Web User Interfaces (UI) for configuration and management of named entities") of your named entities from ontologies, thesaurus, database(s) or lists of names (see section "Import entities") you can call the REST API with a plain text / full-text (or even document files like PDF or Word, see section "Rich document formats") as parameter (see section "REST API request parameters") to extract and/or recommend Named Entities like persons, organizations or places and link them (see section "JSON response") in/to your Linked Data Knowledge Graph Database, Linked Open Data and Semantic Web or as facets for faceted search or navigation by Solr or Elastic Search.
Links plain text names/labels to ID/URI and normalizes alias or alternate label to preferred label and recommends all found entities for disambiguation and reconciliation.
The Entity Linker based Named Entity Linking and Normalization REST-API in entity_rest_api provides normalized entities in Open Refine Reconciliation API standard result format (Specification: https://reconciliation-api.github.io/specs/latest/).
Additional to the specified Open Refine Reconciliation API query parameters, you can POST a full text / context additionally or instead of entity queries of yet extracted entities or structured data field with entities (which in other Open Refine Reconciliation Service APIs are required), so the context will not only be used for disambiguation scoring but entities will be extracted automatically from the text.
Automatic named entity extraction of all known/imported entities / names / labels (extension only in Open Semantic Entity Search API, not Open Refine Reconciliation Service API query standard / not supported by other Open Refine Reconciliation Services / APIs):
HTTP POST a plain text as parameter "text" to http://localhost/search-apps/entity_rest_api/reconcile so all known entities will be extracted automatically.
Example for posting a full text to the REST-API by Python requests:
import requests
text = "Mr. Jon Doe lives in Berlin."
openrefine_server = 'http://localhost/search-apps/entity_rest_api/reconcile'
params = {'text': text}
r = requests.post(openrefine_server, params=params)
results = r.json()
print(results)
Query for IDs and normalized data of your entity labels (Open Refine Reconciliation Service API query standard)
Query for entities (structured by Open Refine Reconciliation Service API query standard specified in https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API#query-request):
http://localhost/search-apps/entity_rest_api/reconcile?queries={ "q0" : { "query" : "Jon Doe" }, "q1" : { "query" : "Berlin" } }
Optionally you can HTTP POST a context as parameter "text" like in the example before to provide more context to improve scoring of disambiguation (scoring of ambiguous entities by context not implemented yet).
In both cases the response (structured by Open Refine Reconciliation Service API response standard specified in https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API#query-response) is a JSON literal object with the same keys as in the request
{
"q0" : {
"result" : { ... }
},
"q1" : {
"result" : { ... }
}
}
Each result consists of a JSON literal object with the structure
{
"result" : [
{
"id" : ... string, URI or database ID ...
"name" : ... string ...
"type" : ... array of strings ...
"score" : ... double ...
"match" : ... boolean, true if the service is quite confident about the match ...
},
... more results ...
]
}
The REST-API uses the Python and Dictionary Matcher based and Apache Solr search index powered module Entity_Linker from the library entity_linking for Entity Linking and Normalization.
Example:
from entity_linking.entity_linker import Entity_Linker
linker = Entity_Linker()
# extract and normalize/link all known entities/names/labels
results = linker.entities( text = "Mr. Jon Doe lives in Berlin." )
print (results)
# normalize/link only the queried entities
results = linker.entities( queries = { 'q1': {'query': 'Berlin'}, 'q2': { 'query': 'Jon Doe' } } )
print (results)
For extraction of named entities from full-text the Entity Linker extracts named entities by dictionaries / lists of names / labels stored in the Solr index / core for named entities.
Apache Lucene/Solr powered dictionary based named entity extraction of indexed names and labels in the full text is done by Solr Text Tagger https://lucene.apache.org/solr/guide/7_4/the-tagger-handler.html by Finite State Transducer (FST) algorithm.
To add a named entity to the entities index and dictionary use the method add() with the parameter "id" for the unique ID, URI or URL, "preferred_label" for the normalized name / preferred label and "prefLabels" (higher score) and/or "labels" with all aliases or alternate labels/names.
from entity_manager.manager import Entity_Manager
entity_manager = Entity_Manager()
entity_manager.add( id = "https://entities/1", types=['Person'], preferred_label = "Jon Doe", prefLabels = ["Jon Doe"], labels = ["J. Doe", "Doe, Jon", "Doe, J."] )
Web user interfaces for setup/configuration of dictionaries, lists of names, thesauri and ontologies in Django based Open Semantic Search Apps (Git repository: https://github.com/opensemanticsearch/open-semantic-search-apps):
Web user interface for management of a SKOS thesaurus https://opensemanticsearch.org/doc/datamanagement/thesaurus
Ontologies Manager is a web user interface to import RDF ontologies or SPARQL results https://opensemanticsearch.org/doc/datamanagement/ontologies
Import Linked Open Data https://opensemanticsearch.org/doc/datamanagement/opendata
Example: Import lists of names from Wikidata https://opensemanticsearch.org/doc/datamanagement/opendata/wikidata
Until implementation of the SQL database importer command line tool and UI to import named entities from database(s), that are not available as SKOS thesaurus or RDF ontology and can be imported by the other user interfaces, use the module Entity Manager.
How to import lists of names from plain text files with one name per line:
Web User Interface:
The web user interface Ontologies Manager can not only import RDF ontologies, but also lists of names from plain text files https://opensemanticsearch.org/doc/datamanagement/ontologies
Data format (Example: politicians.txt):
Barack Obama
Vladimir Putin
Angela Merkel
Command line tool:
entity_import_list.py politicians.txt --type Politician
Python:
from entity_import.entity_import_list import Entity_Importer_List
list_importer = Entity_Importer_List()
list_importer.import_entities(filename="polititians.txt", types=['Person','Politician'])
For named entity recognition, named entity extraction and named entity linking and disambiguation of entities from other file formats like PDF documents, Word documents, scanned Documents (needing OCR) and many other file formats you can use Open Semantic ETL tools and user interfaces for crawling filesystems, using Apache Tika for text extraction, Tesseract for OCR and many other Open Source tools for data enrichment and analysis and calling this Open Semantic Entity Search API for named entity extraction and named entity linking to your Linked Data Knowledge Graph and the Semantic Web.
If you do not want to use the preconfigured Debian packages providing all components out of the box, you have to install the following dependencies:
Python 3.x
Apache Solr (>= 7.4) https://lucene.apache.org/solr/ (preconfigured Solr core / schema in this Git repository in /src/solr)
Open Semantic ETL https://opensemanticsearch.org/etl (Git repository: https://github.com/opensemanticsearch/open-semantic-etl)
Solr Ontology Tagger (Git repository: https://github.com/opensemanticsearch/solr-ontology-tagger)