Input
The JATE2.0 properties file serves as an integration configuration file that enables JATE2.0 to work with Apache Solr.
The Solr schema.xml file is the main place to configure your custom text processing pipeline. Basically, two pipelines (i.e., Solr fieldTypes) are needed for JATE2.0 to process your documents and generate data into two fields: an n-gram field (see the example setting for jate_text_2_ngrams) and a candidate term field (see the example setting for jate_text_2_terms). The n-gram processing pipeline stores the exhaustive set of possible term collocations and their frequency information, which JATE2.0 retrieves to score and rank the candidate terms stored in the candidate term field.
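As a minimal sketch of what such an n-gram fieldType might look like (the tokenizer choice and shingle sizes here are illustrative assumptions, not JATE2.0's shipped settings):

```xml
<!-- Sketch of an n-gram fieldType: word n-grams (shingles) are indexed so that
     JATE2.0 can later look up collocation frequencies for candidate scoring -->
<fieldType name="jate_text_2_ngrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emit word n-grams of 2 to 5 tokens plus unigrams; sizes are illustrative -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="5" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```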
To feed an ATE algorithm, text should first be processed by various NLP components. A generic pipeline consists of tokenisation, sentence splitting, normalisation, stemming/lemmatisation, term candidate generation and stop words filtering. Both the order and the quality of these NLP components significantly affect the accuracy of the final result.
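This generic pipeline maps naturally onto a Solr analyzer chain. The following is a sketch only, following the stage order above (sentence splitting is omitted; the jate.* factory is named on this page, but its required attributes, e.g., PoS model paths, are omitted here as unknowns):

```xml
<!-- Sketch of a candidate-term fieldType following the generic pipeline order -->
<fieldType name="jate_text_2_terms" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- tokenisation -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- normalisation -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stemming/lemmatisation (alternatives are discussed below) -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <!-- term candidate generation (required attributes omitted in this sketch) -->
    <filter class="jate.OpenNLPRegexChunkerFactory"/>
    <!-- stop words filtering -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```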
Noise in text and variation in term surface forms are two of the major challenges for ATE. Normalisation converts different textual representations into a single normalised form and thereby reduces the vocabulary size. For example, 'U.S.A' can be normalised into 'USA', and 'St. Louis' can be converted to 'Saint Louis'. There are normally two methods for text normalisation:
- Rule-based (regex filter, lower case)
- Dictionary-based (e.g., synonym normalisation: car -> "automobile, vehicle")
Solr has many filters that can easily be configured for normalisation with either of the two methods. For example, a charFilter can be configured with domain- or corpus-specific regex rules to filter and normalise your text. Typically, the PatternReplaceCharFilterFactory can be used to cleanse unicode characters (e.g., unusual periods and hyphens) and entities in HTML text. Additionally, you can configure a MappingCharFilter to change one string into another; typically, solr.MappingCharFilterFactory can be used to convert characters above ASCII. Please see more details in [Solr CharFilterFactories](https://cwiki.apache.org/confluence/display/solr/CharFilterFactories). In addition, solr.SynonymFilterFactory can be configured with your domain-specific synonym dictionary to conflate multiple equivalent term surface forms into a single representation.
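As a sketch combining both methods (the regex pattern and file names are illustrative; mapping-FoldToASCII.txt ships with Solr's example configs):

```xml
<analyzer>
  <!-- rule-based: regex clean-up, here stripping HTML-style entities (pattern is illustrative) -->
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="&amp;[a-z]+;" replacement=" "/>
  <!-- dictionary-based: fold characters above ASCII via a mapping file -->
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- dictionary-based: conflate equivalent surface forms listed in synonyms.txt -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
</analyzer>
```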
Stemming is also a kind of text normalisation: it reduces inflected or derived words to their root form, e.g., 'forgotten' -> 'forget', 'ladies' -> 'lady'. However, stemmers are mostly language-specific, and stemming carries the risk of losing the precise meaning of words (e.g., 'lay' -> 'lie'). You can configure solr.EnglishMinimalStemFilterFactory or solr.PorterStemFilterFactory in your pipeline. We implemented a jate.EnglishLemmatisationFilterFactory based on the Dragon Tool Kit English lemmatiser, and our experiments indicate that using proper lemmatisation can improve accuracy.
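A sketch of the alternatives (pick one; any attributes the jate lemmatiser may require are omitted here as unknowns):

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- option 1: light English stemmer (mainly strips plural endings) -->
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
  <!-- option 2: the more aggressive Porter stemmer
  <filter class="solr.PorterStemFilterFactory"/>
  -->
  <!-- option 3: JATE2.0's lemmatiser
  <filter class="jate.EnglishLemmatisationFilterFactory"/>
  -->
</analyzer>
```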
Stop words filtering helps to remove non-informative words and can reduce the vocabulary size significantly (refer to Zipf's law). Stop words filtering is usually performed after candidate term extraction in the pipeline. More complex filtering strategies can be applied to improve recall, e.g., stripping trailing stop words, removing leading symbolic tokens, etc. Please see the example setting in our jate.OpenNLPRegexChunkerFactory.
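As a placement sketch (the JATE factory's required attributes are again omitted), the stop filter sits after candidate generation so that multi-word candidates are formed before filtering:

```xml
<!-- candidate generation first, so multi-word candidates are formed intact -->
<filter class="jate.OpenNLPRegexChunkerFactory"/>
<!-- then remove candidates matching entries in stopwords.txt -->
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
```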
A contrastive corpus can be used for contrastive analysis to study and measure the domain-specificity of term candidates; Weirdness is a typical ATE algorithm of this kind. We use the BNC corpus as a generic reference corpus for contrastive scoring and ranking. You can provide your own reference corpus, which is essentially a word frequency list, via the optional runtime parameter (-r). An alternative BNC frequency list for contrastive ATE algorithms, compiled by Adam Kilgarriff, is available via https://www.kilgarriff.co.uk/bnc-readme.html
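The reference corpus passed via -r is a plain word frequency list. The exact file format expected by JATE2.0 is not documented on this page; as an assumed illustration (both the layout and the counts below are placeholders, not real BNC figures), entries pair a word with its corpus frequency:

```
the 1000000
of 800000
and 750000
```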
<TO BE CONTINUED>