Text_Indexing_Configuration_Options
The ConfigurableAnalyzerFactory class can be used with the blazegraph properties file to specify which Analyzers are used for which languages. Languages are specified by the language tag on RDF literals, which conform to RFC 5646. Within blazegraph, plain literals are assigned to the default locale's language. The blazegraph properties map language ranges, as specified by RFC 4647, to classes that extend Analyzer. Supported classes include all the natural-language-specific classes from Lucene, and also:
- PatternAnalyzer
- TermCompletionAnalyzer
- KeywordAnalyzer
- SimpleAnalyzer
- StopAnalyzer
- WhitespaceAnalyzer
- StandardAnalyzer
More generally, any subclass of Analyzer that has at least one constructor matching one of the supported signatures is usable. If the class has a static method named getDefaultStopSet(), it is assumed to return the default stop word set; some of the Lucene analyzers store their default stop words elsewhere, and such stop words are also usable by this class. If no stop word set can be found, and there is a constructor without stop words and a constructor with stop words, then the former is assumed to use a default stop word set.
Configuration is by means of the blazegraph properties file. All relevant properties start with com.bigdata.search.ConfigurableAnalyzerFactory, which we abbreviate to c.b.s.C in this documentation. Properties from Options apply to the factory. Other properties, from AnalyzerOptions, start with c.b.s.C.analyzer.language-range, where language-range conforms to the extended language range construct from RFC 4647, section 2.2. Because bigdata does not allow '*' in property names, the character '_' substitutes for '*' in extended language ranges in property names. These properties specify an analyzer for the given language range. If no analyzer is specified for the language range *, then the StandardAnalyzer is used.
Given any specific language, getAnalyzer(String, boolean) returns the analyzer matching the longest configured language range, measured in number of subtags. In the event of a tie, the alphabetically first language range is used. The algorithm to find a match is "Extended Filtering", as defined in section 3.3.2 of RFC 4647.
Some useful analyzers are as follows:
KeywordAnalyzer | This treats every lexical value as a single search token |
WhitespaceAnalyzer | This uses whitespace to tokenize |
PatternAnalyzer | This uses a regular expression to tokenize |
TermCompletionAnalyzer | This uses up to three regular expressions to specify multiple tokens for each word, to address term completion use cases. |
EmptyAnalyzer | This suppresses the functionality, by treating every expression as a stop word. |
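The language range selection rules described before this table (Extended Filtering, longest matching range by subtag count, alphabetical tie-break) can be sketched in Java roughly as follows. This is an illustration of the rules as stated, not the blazegraph implementation: the class and method names are hypothetical, ranges are written with '*' rather than the '_' used in property names, a wildcard is assumed to count as one subtag, and "alphabetically first" is assumed to mean plain String ordering.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Illustrative sketch only; names are hypothetical, not blazegraph API. */
final class LanguageRangeSketch {

    /** Extended Filtering, RFC 4647 section 3.3.2: does the range match the tag? */
    static boolean matches(String range, String tag) {
        String[] r = range.toLowerCase(Locale.ROOT).split("-");
        String[] t = tag.toLowerCase(Locale.ROOT).split("-");
        // The first subtags must agree unless the range starts with the wildcard.
        if (!r[0].equals("*") && !r[0].equals(t[0])) {
            return false;
        }
        int ri = 1;
        int ti = 1;
        while (ri < r.length) {
            if (r[ri].equals("*")) {
                ri++;                    // wildcard: move on in the range only
            } else if (ti >= t.length) {
                return false;            // the tag has no subtags left
            } else if (r[ri].equals(t[ti])) {
                ri++;                    // subtags agree: advance both
                ti++;
            } else if (t[ti].length() == 1) {
                return false;            // a singleton subtag cannot be skipped
            } else {
                ti++;                    // skip this subtag of the tag
            }
        }
        return true;
    }

    /** Number of subtags in a range; assumption: a wildcard counts as one. */
    static int subtagCount(String range) {
        return range.split("-").length;
    }

    /**
     * Returns the matching configured range with the most subtags, breaking
     * ties by taking the alphabetically first range, or null if none matches.
     */
    static String bestRange(List<String> configured, String tag) {
        String best = null;
        for (String range : configured) {
            if (!matches(range, tag)) {
                continue;
            }
            if (best == null
                    || subtagCount(range) > subtagCount(best)
                    || (subtagCount(range) == subtagCount(best)
                        && range.compareTo(best) < 0)) {
                best = range;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> configured = Arrays.asList("*", "en", "en-GB", "de-*-DE");
        System.out.println(bestRange(configured, "en-GB-oxendict"));
        System.out.println(bestRange(configured, "de-Latn-DE"));
        System.out.println(bestRange(configured, "fr"));
    }
}
```

A null result here corresponds to the case where not even the range * is configured, for which the documentation says the StandardAnalyzer is used.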
In addition, there are the language-specific analyzers, which are included by using the option Options#NATURAL_LANGUAGE_SUPPORT. Setting this option to true causes all the known Lucene Analyzers for natural languages to be used for a range of language tags. These settings may then be overridden by the user's settings. Specifically, the following properties are loaded prior to loading the user's specification (with c.b.s.C expanding to com.bigdata.search.ConfigurableAnalyzerFactory):
c.b.s.C.analyzer._.like=eng
c.b.s.C.analyzer.por.analyzerClass=org.apache.lucene.analysis.br.BrazilianAnalyzer
c.b.s.C.analyzer.pt.like=por
c.b.s.C.analyzer.zho.analyzerClass=org.apache.lucene.analysis.cn.ChineseAnalyzer
c.b.s.C.analyzer.chi.like=zho
c.b.s.C.analyzer.zh.like=zho
c.b.s.C.analyzer.jpn.analyzerClass=org.apache.lucene.analysis.cjk.CJKAnalyzer
c.b.s.C.analyzer.ja.like=jpn
c.b.s.C.analyzer.kor.like=jpn
c.b.s.C.analyzer.ko.like=kor
c.b.s.C.analyzer.ces.analyzerClass=org.apache.lucene.analysis.cz.CzechAnalyzer
c.b.s.C.analyzer.cze.like=ces
c.b.s.C.analyzer.cs.like=ces
c.b.s.C.analyzer.dut.analyzerClass=org.apache.lucene.analysis.nl.DutchAnalyzer
c.b.s.C.analyzer.nld.like=dut
c.b.s.C.analyzer.nl.like=dut
c.b.s.C.analyzer.deu.analyzerClass=org.apache.lucene.analysis.de.GermanAnalyzer
c.b.s.C.analyzer.ger.like=deu
c.b.s.C.analyzer.de.like=deu
c.b.s.C.analyzer.gre.analyzerClass=org.apache.lucene.analysis.el.GreekAnalyzer
c.b.s.C.analyzer.ell.like=gre
c.b.s.C.analyzer.el.like=gre
c.b.s.C.analyzer.rus.analyzerClass=org.apache.lucene.analysis.ru.RussianAnalyzer
c.b.s.C.analyzer.ru.like=rus
c.b.s.C.analyzer.tha.analyzerClass=org.apache.lucene.analysis.th.ThaiAnalyzer
c.b.s.C.analyzer.th.like=tha
c.b.s.C.analyzer.eng.analyzerClass=org.apache.lucene.analysis.standard.StandardAnalyzer
c.b.s.C.analyzer.en.like=eng
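The like entries in the list above form chains: for example, ko is like kor, which is like jpn, which names an analyzer class. A minimal Java sketch of following such a chain, and of rejecting loops, might look like this. The class and method names are hypothetical, the two maps stand in for the parsed like and analyzerClass properties, and the sketch ignores the case where an analyzer class is set implicitly by another option.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; names are hypothetical, not blazegraph API. */
final class LikeResolverSketch {

    /**
     * Follows like aliases from the given language range until a range with
     * an analyzerClass is reached. Throws if the chain loops or dead-ends.
     */
    static String resolveAnalyzerClass(String range,
                                       Map<String, String> like,
                                       Map<String, String> analyzerClass) {
        Set<String> seen = new LinkedHashSet<>();
        String current = range;
        while (true) {
            if (!seen.add(current)) {
                throw new IllegalStateException("like loop: " + seen);
            }
            String cls = analyzerClass.get(current);
            if (cls != null) {
                return cls;              // this range defines its own analyzer
            }
            String next = like.get(current);
            if (next == null) {
                throw new IllegalStateException(
                        "neither analyzerClass nor like set for: " + current);
            }
            current = next;              // follow the alias one step
        }
    }

    public static void main(String[] args) {
        // A subset of the default properties listed above.
        Map<String, String> like = new HashMap<>();
        like.put("ko", "kor");
        like.put("kor", "jpn");
        Map<String, String> analyzerClass = new HashMap<>();
        analyzerClass.put("jpn", "org.apache.lucene.analysis.cjk.CJKAnalyzer");
        System.out.println(resolveAnalyzerClass("ko", like, analyzerClass));
    }
}
```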
analyzerClass | If specified, this is the fully qualified name of a subclass of Analyzer that has appropriate constructors. This is set implicitly if some of the options below are selected (for example PATTERN). For each configured language range, if it is not set, either explicitly or implicitly, then LIKE must be specified. |
like | The value of this property is a language range for which an analyzer is defined. Treat this language range in the same way as the specified language range. Loops are not permitted. If this option is specified for a language range, then no other option is permitted. |
stopwords | The value of this property is one of: |
The value of the pattern parameter to [http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/analysis/miscellaneous/PatternAnalyzer.html?is-external=true#PatternAnalyzer(org.apache.lucene.util.Version,%20java.util.regex.Pattern,%20boolean,%20java.util.Set) PatternAnalyzer(Version, Pattern, boolean, Set)] (Note the Pattern#UNICODE_CHARACTER_CLASS flag is enabled). It is an error if a different analyzer class is specified. |
The value of the wordBoundary parameter to [https://www.blazegraph.com/docs/api/com/bigdata/search/TermCompletionAnalyzer.html#TermCompletionAnalyzer(java.util.regex.Pattern,%20java.util.regex.Pattern,%20java.util.regex.Pattern,%20boolean) TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean)] (Note the Pattern#UNICODE_CHARACTER_CLASS flag is enabled). It is an error if a different analyzer class is specified. |
The value of the subWordBoundary parameter to [https://www.blazegraph.com/docs/api/com/bigdata/search/TermCompletionAnalyzer.html#TermCompletionAnalyzer(java.util.regex.Pattern,%20java.util.regex.Pattern,%20java.util.regex.Pattern,%20boolean) TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean)] (Note the Pattern#UNICODE_CHARACTER_CLASS flag is enabled). It is an error if a different analyzer class is specified. The default sub-word boundary is a pattern that never matches, i.e. there are no sub-word boundaries. |
The value of the softHyphens parameter to [https://www.blazegraph.com/docs/api/com/bigdata/search/TermCompletionAnalyzer.html#TermCompletionAnalyzer(java.util.regex.Pattern,%20java.util.regex.Pattern,%20java.util.regex.Pattern,%20boolean) TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean)] (Note the Pattern#UNICODE_CHARACTER_CLASS flag is enabled). It is an error if a different analyzer class is specified. |
The value of the alwaysRemoveSoftHypens parameter to [https://www.blazegraph.com/docs/api/com/bigdata/search/TermCompletionAnalyzer.html#TermCompletionAnalyzer(java.util.regex.Pattern,%20java.util.regex.Pattern,%20java.util.regex.Pattern,%20boolean) TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean)] (Note the Pattern#UNICODE_CHARACTER_CLASS flag is enabled). It is an error if a different analyzer class is specified. Default value is |
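The note that the Pattern#UNICODE_CHARACTER_CLASS flag is enabled matters for regular expressions that use predefined classes such as \W. The following Java sketch shows the difference using only java.util.regex; it assumes the pattern acts as a token separator in the manner of String.split, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative sketch only; it does not use Lucene or blazegraph classes. */
final class UnicodePatternSketch {

    /** Splits the text on the given regular expression compiled with the given flags. */
    static List<String> split(String regex, int flags, String text) {
        return Arrays.asList(Pattern.compile(regex, flags).split(text));
    }

    public static void main(String[] args) {
        // "naïve café", written with escapes so the source file encoding does not matter.
        String text = "na\u00efve caf\u00e9";
        // Without the flag, \W treats the accented letters as separators.
        System.out.println(split("\\W+", 0, text));
        // With the flag, \W follows Unicode definitions and keeps the words whole.
        System.out.println(split("\\W+", Pattern.UNICODE_CHARACTER_CLASS, text));
    }
}
```

Without the flag the text splits into na, ve and caf; with the flag it splits into the two whole words.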
Use the following parameters in the blazegraph properties file to completely disable stopwords:
com.bigdata.search.FullTextIndex.analyzerFactoryClass=com.bigdata.search.ConfigurableAnalyzerFactory
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.eng.analyzerClass=org.apache.lucene.analysis.standard.StandardAnalyzer
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.eng.stopwords=none
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer._.like=eng
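The property names in this example follow the scheme described earlier: the factory prefix, then analyzer, then the language range with '_' in place of '*', then the option name. A small Java helper that builds such names could be sketched as below; the class and method names are hypothetical and are not part of blazegraph.

```java
/** Illustrative sketch only; names are hypothetical, not blazegraph API. */
final class AnalyzerPropertyKeySketch {

    /** The common prefix, abbreviated to c.b.s.C in this documentation. */
    static final String PREFIX = "com.bigdata.search.ConfigurableAnalyzerFactory";

    /**
     * Builds the property name for an option of a language range,
     * substituting '_' for '*' because '*' is not allowed in property names.
     */
    static String key(String languageRange, String option) {
        return PREFIX + ".analyzer." + languageRange.replace('*', '_') + "." + option;
    }

    public static void main(String[] args) {
        // Reproduces three of the property names used in the example above.
        System.out.println(key("eng", "analyzerClass"));
        System.out.println(key("eng", "stopwords"));
        System.out.println(key("*", "like"));
    }
}
```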