Text_Indexing_Configuration_Options

The ConfigurableAnalyzerFactory class can be used with the blazegraph properties file to specify which Analyzers are used for which languages. Languages are identified by the language tag on RDF literals, which conforms to RFC 5646. Within blazegraph, plain literals are assigned to the default locale's language. The blazegraph properties map language ranges, as specified by RFC 4647, to classes that extend Analyzer. Supported classes include all the natural-language-specific classes from Lucene, and also the general-purpose analyzers listed later in this section.

More generally, any subclass of Analyzer that has at least one constructor matching:

no arguments
Version
Version, Set

is usable. If the class has a static method named getDefaultStopSet(), it is assumed to return the default stop word set; some of the Lucene analyzers store their default stop words elsewhere, and such stopwords are usable by this class. If no stop word set can be found, and there is a constructor without stopwords and a constructor with stopwords, then the former is assumed to use a default stop word set.

Configuration is by means of the blazegraph properties file. All relevant properties start with com.bigdata.search.ConfigurableAnalyzerFactory, which is abbreviated to c.b.s.C in this documentation. Properties from Options apply to the factory. Other properties, from AnalyzerOptions, start with c.b.s.C.analyzer.language-range, where language-range conforms to the extended language range construct from RFC 4647 section 2.2. Because bigdata does not allow '*' in property names, the character '_' is substituted for '*' in extended language ranges in property names. These properties specify an analyzer for the given language range. If no analyzer is specified for the language range *, then the StandardAnalyzer is used.

Given any specific language, getAnalyzer(String, boolean) returns the analyzer matching the longest configured language range, measured in number of subtags. In the event of a tie, the alphabetically first language range is used. The algorithm to find a match is "Extended Filtering" as defined in section 3.3.2 of RFC 4647.

Some useful analyzers are as follows:


KeywordAnalyzer	This treats every lexical value as a single search token
WhitespaceAnalyzer	This uses whitespace to tokenize
PatternAnalyzer	This uses a regular expression to tokenize
TermCompletionAnalyzer	This uses up to three regular expressions to specify multiple tokens for each word, to address term completion use cases.
EmptyAnalyzer	This suppresses the functionality, by treating every expression as a stop word.
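
To illustrate the language-range matching rules described earlier, the following fragment is a minimal sketch that configures three ranges using the full property prefix. The `de` and `de-CH` ranges, and the analyzers assigned to them, are hypothetical choices made for illustration only; the comments describe the outcome expected under the longest-range rule, not verified behaviour.

```properties
# Fallback range '*', written as '_' because '*' is not allowed in property names.
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer._.analyzerClass=org.apache.lucene.analysis.standard.StandardAnalyzer

# One-subtag range (hypothetical choice): expected to apply to tags such as de, de-AT, de-DE.
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.de.analyzerClass=org.apache.lucene.analysis.de.GermanAnalyzer

# Two-subtag range (hypothetical choice): for de-CH and de-CH-1996 it is longer than 'de',
# so it is expected to be preferred.
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.de-CH.analyzerClass=org.apache.lucene.analysis.standard.StandardAnalyzer
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.de-CH.stopwords=none
```

With this configuration, a literal tagged `fr` matches no range other than `_`, so it would fall through to the fallback analyzer.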

In addition, there are the language-specific analyzers, which are included by using the option Options#NATURAL_LANGUAGE_SUPPORT. When this option is set to true, all the known Lucene Analyzers for natural languages are used for a range of language tags. These settings may then be overridden by the user's settings. Specifically, the following properties are loaded prior to loading the user's specification (with c.b.s.C expanding to com.bigdata.search.ConfigurableAnalyzerFactory):

c.b.s.C.analyzer._.like=eng
c.b.s.C.analyzer.por.analyzerClass=org.apache.lucene.analysis.br.BrazilianAnalyzer
c.b.s.C.analyzer.pt.like=por
c.b.s.C.analyzer.zho.analyzerClass=org.apache.lucene.analysis.cn.ChineseAnalyzer
c.b.s.C.analyzer.chi.like=zho
c.b.s.C.analyzer.zh.like=zho
c.b.s.C.analyzer.jpn.analyzerClass=org.apache.lucene.analysis.cjk.CJKAnalyzer
c.b.s.C.analyzer.ja.like=jpn
c.b.s.C.analyzer.kor.like=jpn
c.b.s.C.analyzer.ko.like=kor
c.b.s.C.analyzer.ces.analyzerClass=org.apache.lucene.analysis.cz.CzechAnalyzer
c.b.s.C.analyzer.cze.like=ces
c.b.s.C.analyzer.cs.like=ces
c.b.s.C.analyzer.dut.analyzerClass=org.apache.lucene.analysis.nl.DutchAnalyzer
c.b.s.C.analyzer.nld.like=dut
c.b.s.C.analyzer.nl.like=dut
c.b.s.C.analyzer.deu.analyzerClass=org.apache.lucene.analysis.de.GermanAnalyzer
c.b.s.C.analyzer.ger.like=deu
c.b.s.C.analyzer.de.like=deu
c.b.s.C.analyzer.gre.analyzerClass=org.apache.lucene.analysis.el.GreekAnalyzer
c.b.s.C.analyzer.ell.like=gre
c.b.s.C.analyzer.el.like=gre
c.b.s.C.analyzer.rus.analyzerClass=org.apache.lucene.analysis.ru.RussianAnalyzer
c.b.s.C.analyzer.ru.like=rus
c.b.s.C.analyzer.tha.analyzerClass=org.apache.lucene.analysis.th.ThaiAnalyzer
c.b.s.C.analyzer.th.like=tha
c.b.s.C.analyzer.eng.analyzerClass=org.apache.lucene.analysis.standard.StandardAnalyzer
c.b.s.C.analyzer.en.like=eng
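
As a sketch of how a user setting might override one of these defaults, the fragment below enables natural language support and then replaces the analyzer for Portuguese. The property key `naturalLanguageSupport` is an assumption about how Options#NATURAL_LANGUAGE_SUPPORT is spelled in the properties file and should be checked against the Options Javadoc; the choice of StandardAnalyzer for `por` is purely illustrative.

```properties
com.bigdata.search.FullTextIndex.analyzerFactoryClass=com.bigdata.search.ConfigurableAnalyzerFactory

# ASSUMPTION: key name for Options#NATURAL_LANGUAGE_SUPPORT; verify before use.
com.bigdata.search.ConfigurableAnalyzerFactory.naturalLanguageSupport=true

# User override (illustrative): replaces the default BrazilianAnalyzer for 'por'.
# The default 'pt.like=por' is loaded first, so 'pt' is expected to follow this override.
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.por.analyzerClass=org.apache.lucene.analysis.standard.StandardAnalyzer
```

Note that the abbreviation c.b.s.C is used only in this documentation; the properties file needs the full prefix, as in the Disable Stopwords example.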

List of Options

analyzerClass	If specified this is the fully qualified name of a subclass of Analyzer that has appropriate constructors. This is set implicitly if some of the options below are selected (for example PATTERN). For each configured language range, if it is not set, either explicitly or implicitly, then LIKE must be specified.
like	The value of this property is a language range for which an analyzer is defined. The language range being configured is treated in the same way as that range. Loops are not permitted. If this option is specified for a language range, then no other option is permitted.
stopwords	The value of this property is one of: `none` (the analyzer is used without stop words); `default` (use the default stop word setting for this analyzer; it is an error to set this value on analyzers, such as SimpleAnalyzer, that do not support stop words); or a fully qualified class name ... of a subclass of Analyzer which has a static method `getDefaultStopSet()`, in which case the returned set of stop words is used. If the analyzerClass does not support stop words, then any value other than STOPWORDS_VALUE_NONE is an error. If the analyzerClass does support stop words, then the default value is STOPWORDS_VALUE_DEFAULT.
pattern	The value of the pattern parameter to [http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/analysis/miscellaneous/PatternAnalyzer.html?is-external=true#PatternAnalyzer(org.apache.lucene.util.Version,%20java.util.regex.Pattern,%20boolean,%20java.util.Set) PatternAnalyzer(Version, Pattern, boolean, Set)] (Note the Pattern#UNICODE_CHARACTER_CLASS flag is enabled). It is an error if a different analyzer class is specified.
wordBoundary	The value of the wordBoundary parameter to [https://www.blazegraph.com/docs/api/com/bigdata/search/TermCompletionAnalyzer.html#TermCompletionAnalyzer(java.util.regex.Pattern,%20java.util.regex.Pattern,%20java.util.regex.Pattern,%20boolean) TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean)] (Note the Pattern#UNICODE_CHARACTER_CLASS flag is enabled). It is an error if a different analyzer class is specified.
subWordBoundary	The value of the subWordBoundary parameter to [https://www.blazegraph.com/docs/api/com/bigdata/search/TermCompletionAnalyzer.html#TermCompletionAnalyzer(java.util.regex.Pattern,%20java.util.regex.Pattern,%20java.util.regex.Pattern,%20boolean) TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean)] (Note the Pattern#UNICODE_CHARACTER_CLASS flag is enabled). It is an error if a different analyzer class is specified. The default sub-word boundary is a pattern that never matches, i.e. there are no sub-word boundaries. `Pattern.compile("(?!)")`
softHyphens	The value of the softHyphens parameter to [https://www.blazegraph.com/docs/api/com/bigdata/search/TermCompletionAnalyzer.html#TermCompletionAnalyzer(java.util.regex.Pattern,%20java.util.regex.Pattern,%20java.util.regex.Pattern,%20boolean) TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean)] (Note the Pattern#UNICODE_CHARACTER_CLASS flag is enabled). It is an error if a different analyzer class is specified.
alwaysRemoveSoftHyphens	The value of the alwaysRemoveSoftHypens parameter to [https://www.blazegraph.com/docs/api/com/bigdata/search/TermCompletionAnalyzer.html#TermCompletionAnalyzer(java.util.regex.Pattern,%20java.util.regex.Pattern,%20java.util.regex.Pattern,%20boolean) TermCompletionAnalyzer#TermCompletionAnalyzer(Pattern, Pattern, Pattern, boolean)] (Note the Pattern#UNICODE_CHARACTER_CLASS flag is enabled). It is an error if a different analyzer class is specified. Default value is `false`
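
The options above might be combined as in the following sketch. The private-use ranges `x-splits`, `x-tokens` and `x-terms` and the regular expressions are hypothetical, chosen only to show the shape of the configuration. The sketch also assumes that TermCompletionAnalyzer options left unset (subWordBoundary, softHyphens, alwaysRemoveSoftHyphens) fall back to usable defaults, which the table confirms only for subWordBoundary and alwaysRemoveSoftHyphens.

```properties
# 'pattern' implicitly selects PatternAnalyzer, so analyzerClass is omitted.
# The backslash is doubled because this is a Java properties file; the regex is \s+.
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.x-splits.pattern=\\s+

# 'like' reuses the configuration of another range; no other option may accompany it.
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.x-tokens.like=x-splits

# Term completion with words separated by runs of spaces (illustrative regex).
# analyzerClass is given explicitly here; specifying a different class would be an error.
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.x-terms.analyzerClass=com.bigdata.search.TermCompletionAnalyzer
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.x-terms.wordBoundary=[ ]+
```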

Disable Stopwords

Use the following parameters in the blazegraph properties file to completely disable stopwords:

com.bigdata.search.FullTextIndex.analyzerFactoryClass=com.bigdata.search.ConfigurableAnalyzerFactory
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.eng.analyzerClass=org.apache.lucene.analysis.standard.StandardAnalyzer
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer.eng.stopwords=none
com.bigdata.search.ConfigurableAnalyzerFactory.analyzer._.like=eng
