JATE architecture
JATE 2.0 builds on top of Apache Solr (currently 7.2.1). It uses Solr's analyzer chain to index input document collections: the chain extracts candidate terms, generates basic statistical information, and saves them in specifically defined fields. The majority of existing Solr text processing libraries can therefore be used with JATE 2.0 in a plug-and-play fashion. After the indexing process is complete, candidate terms are collected from the index, and individual ATE algorithms are invoked to score and rank the candidates. The output can then either be written back into the index or exported to a different format.
The figure below shows the general architecture of JATE 2.0 and its workflow, which consists of four phases: (1) data pre-processing; (2) term candidate extraction and indexing; (3) candidate scoring and filtering; and (4) final term indexing and export.
Data pre-processing parses an input document into raw text content and performs text normalization to reduce 'noise' in irregular data. Users may configure their Solr instance to benefit from the powerful Solr Content Extraction Library (Solr Cell) to extract textual content from files of different formats, such as plain text, HTML, PDF, and Microsoft Office documents.
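For illustration, the Content Extraction Library is typically enabled through a request handler entry in solrconfig.xml. The snippet below is a minimal sketch; the target field name ('jate_text') is an assumed example, not a name mandated by JATE:

```xml
<!-- solrconfig.xml: enable Apache Tika-based content extraction (Solr Cell) -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- map the extracted document body into the field JATE will analyze
         ('jate_text' is an example field name) -->
    <str name="fmap.content">jate_text</str>
    <!-- prefix unknown Tika metadata fields so they do not break indexing -->
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
```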
Further, a recommended practice is to apply Solr char filters for character-level text normalization. For example, HTMLStripCharFilterFactory strips HTML tags and replaces HTML entities with their corresponding characters. This can reduce errors in downstream processors (e.g., a PoS tagger) in the analyzer. Char filters are configured as part of the analyzer in the Solr schema and are applied before tokenization.
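In the Solr schema this placement looks as follows; the field type name is an assumed example. Char filters are declared inside the analyzer, ahead of the tokenizer, so markup is removed from the character stream before tokens are produced:

```xml
<!-- schema.xml: a field type whose analyzer strips HTML before tokenization -->
<fieldType name="jate_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- removes tags and replaces HTML entities with their characters -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```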
Candidate extraction: the pre-processed text content then passes through the candidate extraction component, which extracts and normalizes term candidates from each document. This is realized as part of the Solr document indexing process by defining JATE-specific 'analyzers'. An analyzer is composed of a series of processors ('tokenizers' and 'filters') that form a pipeline: it examines the text content of each document, generates a token stream, and records statistical information (i.e., inverted indexing). Depending on individual needs, users may assemble customized analyzers for term candidate generation. The analyzer is then applied to a 'field' defined in the Solr schema, breaking text content into term candidates and storing them in that field.
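Binding the analyzer to a field is then a one-line schema entry; the field name 'jate_cterms' below is an assumed example:

```xml
<!-- schema.xml: the analyzer (via its field type) is applied to this field,
     so indexing a document fills it with normalized term candidates -->
<field name="jate_cterms" type="jate_text" indexed="true" stored="true"
       multiValued="true" termVectors="true"/>
```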
Despite the large collection of text processing libraries already available for Solr, JATE 2.0 further extends them by supporting three types of term candidate extraction: (1) a PoS-pattern-based chunker that extracts candidates matching user-specified patterns; (2) a token n-gram extractor that extends Solr's built-in one; and (3) a noun phrase chunker. All of these can normalize noisy candidates, for example by removing leading and trailing stop words and non-alphanumeric tokens. In addition, JATE 2.0 implements an English lemmatizer, a recommended replacement for stemmers, which can be too aggressive for ATE. Together, these offer great flexibility, enabling almost any customized form of term candidate extraction.
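As a sketch, a PoS-pattern-based candidate extraction chain might be assembled as below. The JATE-specific factory class names and their parameters are placeholders for illustration; the schema shipped with the JATE release defines the exact names:

```xml
<!-- schema.xml: sketch of a PoS-pattern-based candidate extraction chain;
     the 'jate.*' factory classes and their attributes are placeholders -->
<fieldType name="jate_text_2_terms" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- placeholder: PoS-tagging filter wrapping an OpenNLP model -->
    <filter class="jate.OpenNLPPOSTaggerFactory" posModel="en-pos-maxent.bin"/>
    <!-- placeholder: keeps token sequences matching user-specified PoS patterns
         and trims leading/trailing stop words and non-alphanumeric tokens -->
    <filter class="jate.PoSPatternChunkerFactory" patterns="pos-patterns.txt"/>
    <!-- placeholder: JATE's English lemmatizer, used instead of a stemmer -->
    <filter class="jate.EnglishLemmatisationFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```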
An example configuration of the analyzer used by JATE 2.0 is shown in the diagram below.
Scoring, ranking and filtering: after the indexing process, candidate terms for the entire corpus are collected from the index and processed by the subsequent filtering components. Here, different ATE algorithms can be configured to score and rank the candidates and make the final selection. Users can configure filters to discard candidates that are too short or too long (character length) or too infrequent (minimum frequency), and to select only the top-ranked terms.
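One way such thresholds could be supplied, assuming JATE 2.0 runs in its embedded mode as a Solr request handler, is via handler defaults in solrconfig.xml. The handler class and parameter names below are assumptions for illustration; consult the JATE distribution for the exact ones:

```xml
<!-- solrconfig.xml: sketch of scoring/filtering configuration;
     the handler class and parameter names are illustrative placeholders -->
<requestHandler name="/termRecognise"
                class="uk.ac.shef.dcs.jate.solr.TermRecognitionRequestHandler">
  <lst name="defaults">
    <str name="algorithm">CValue</str>             <!-- which ATE algorithm to run -->
    <int name="prefilter.min.term.freq">2</int>    <!-- drop very infrequent candidates -->
    <int name="prefilter.min.char.length">2</int>  <!-- drop very short candidates -->
    <int name="prefilter.max.char.length">40</int> <!-- drop very long candidates -->
    <float name="cutoff.topK.percent">0.99</float> <!-- keep only top-ranked terms -->
  </lst>
</requestHandler>
```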
Most ATE algorithms (i.e., for scoring and ranking) are complex and require corpus-level statistics for every sub-component of a multi-word candidate term, or context statistics (e.g., the 'TermEx' algorithm). JATE 2.0 therefore adds a supplementary n-gram analyzer chain that indexes all n-grams alongside the candidate term indexing chain. For each candidate consisting of n consecutive words w_1 ... w_n, the n-gram field indexes every individual token frequency f(w_1), ..., f(w_n) as well as the frequency of the whole multi-word candidate, f(w_1 ... w_n). The diagram below presents a close-up of the candidate extraction and indexing solution that supports the subsequent 'Candidate Filtering' process.
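A minimal sketch of such a supplementary chain can be built from standard Solr components, for example with ShingleFilterFactory; the field type name is an assumed example:

```xml
<!-- schema.xml: supplementary n-gram field type; indexing a document records
     frequencies for unigrams and for every 2..5-token shingle -->
<fieldType name="jate_text_2_ngrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emits token n-grams so that f(w_1), ..., f(w_n) and f(w_1 ... w_n)
         can all be read from the index -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="5"
            outputUnigrams="true"/>
  </analyzer>
</fieldType>
```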