# GERMAN COMPOUND SPLITTER
A German decomposition filter that breaks complex words into other forms, written in **JAVA**.

**Features 4 components:**
* CompoundSplitter build class for composing an FST from dictionaries.
* CompoundSplitter class, for splitting German words (see the usage sketch below).
* Lucene TokenFilter and TokenFilterFactory that generate split tokens.
* Lucene Analyzer that wraps the TokenFilter for test cases.

**Example decompositions:**
* Versicherungskaufmann → [versicherung, kauf, mann]
* Anwendungsbetreuer → [anwendung, betreuer]
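
For use outside of an analysis chain, the splitter can presumably be driven directly. The sketch below is **hypothetical**: this README does not document the ``CompoundSplitter`` API, so the no-argument constructor and the ``split`` method name are assumptions for illustration only (in practice the splitter is backed by an FST composed from dictionaries by the build class).

```java
import java.util.List;

public class SplitterExample {
    public static void main(String[] args) throws Exception {
        // HYPOTHETICAL API: the constructor and split() are assumed names,
        // not confirmed by this README; real construction likely requires
        // the FST built from dictionaries by the companion build class.
        CompoundSplitter splitter = new CompoundSplitter();
        List<String> parts = splitter.split("Versicherungskaufmann");
        System.out.println(parts); // expected: [versicherung, kauf, mann]
    }
}
```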

## Lucene, Solr, Elasticsearch Analysis
The Lucene analyzer that comes with this package is a graph filter: when used on the query side of the analysis chain, it properly generates query graphs for overlapping terms. This is accomplished by correctly setting the ``posIncrement`` and ``posLength`` attributes of each token.
For this to work in **Solr**, you must not split on whitespace, which is done by setting ``sow=false``. This has the unfortunate downside of preventing phrase queries from working correctly, so you will be unable to use the ``pf`` and ``pf2`` parameters to generate phrase queries.

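To see these attributes in action, you can consume the analyzer's ``TokenStream`` directly. A minimal sketch using the standard Lucene attribute APIs (the field name ``"text"`` is arbitrary):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.de.compounds.GraphGermanCompoundAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

public class GraphInspector {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new GraphGermanCompoundAnalyzer();
        try (TokenStream ts = analyzer.tokenStream("text", "Versicherungskaufmann")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
            PositionLengthAttribute posLen = ts.addAttribute(PositionLengthAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // posInc == 0 stacks a token on the previous position;
                // posLen says how many positions the token spans.
                System.out.printf("%s posInc=%d posLen=%d%n",
                        term, posInc.getPositionIncrement(), posLen.getPositionLength());
            }
            ts.end();
        }
    }
}
```
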
For more information about token graphs in modern versions of Lucene, Solr, or Elasticsearch, see the following:
- https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/
- https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch

**Solr Analyzer Configuration:**
```xml
<fieldType name="field" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.de.compounds.GraphGermanCompoundAnalyzer"/>
</fieldType>
```

**Solr TokenFilterFactory Configuration:**

```xml
<filter class="org.apache.lucene.analysis.de.compounds.GraphGermanCompoundTokenFilterFactory"
        minWordSize="5"
        onlyLongestMatch="false"
        preserveOriginal="true" />
```
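
The same factory can also be wired up programmatically with Lucene's ``CustomAnalyzer`` builder. A minimal sketch, assuming the factory requires no resource arguments beyond the parameters above (the whitespace tokenizer is an arbitrary choice here):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceTokenizerFactory;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.de.compounds.GraphGermanCompoundTokenFilterFactory;

public class FactoryExample {
    public static void main(String[] args) throws Exception {
        // Mirrors the XML configuration above; parameters are passed as
        // alternating key/value strings.
        Analyzer analyzer = CustomAnalyzer.builder()
                .withTokenizer(WhitespaceTokenizerFactory.class)
                .addTokenFilter(GraphGermanCompoundTokenFilterFactory.class,
                        "minWordSize", "5",
                        "onlyLongestMatch", "false",
                        "preserveOriginal", "true")
                .build();
        System.out.println("analyzer built: " + analyzer);
    }
}
```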

**TokenFilter Parameters:** the parameters shown in the example above are ``minWordSize``, ``onlyLongestMatch``, and ``preserveOriginal``.

## Tests
All tests should be run as part of the build process and be discovered by the Maven build.
**Example Test Case** using ``BaseTokenStreamTestCase``

```java
public void testPassthrough() throws Exception {
    Analyzer analyzer = new GraphGermanCompoundAnalyzer();
    final String input = "eins zwei drei";

    // WORD is assumed to be a constant holding the default token type ("word").
    assertAnalyzesTo(analyzer, input,          // input string to tokenize
        new String[] {"eins", "zwei", "drei"}, // expected output tokens
        new int[] {0, 5, 10},                  // startOffsets
        new int[] {4, 9, 14},                  // endOffsets
        new String[] {WORD, WORD, WORD},       // types
        new int[] {1, 1, 1},                   // positionIncrements
        new int[] {1, 1, 1});                  // posLengths
}
```
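
A decomposition test can be sketched along the same lines with the terms-only overload of ``assertAnalyzesTo``. The expected parts come from the examples above; the exact token set depends on settings such as ``preserveOriginal``, so treat the expectations as illustrative:

```java
public void testDecomposition() throws Exception {
    Analyzer analyzer = new GraphGermanCompoundAnalyzer();

    // Expected parts taken from the README examples; the output may also
    // include the original token depending on preserveOriginal.
    assertAnalyzesTo(analyzer, "Versicherungskaufmann",
        new String[] {"versicherung", "kauf", "mann"});
}
```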

**Debug Output on Test Failure**
A custom class (``TokenStreamDebug``), included in the test modules, outputs debugging information to help you understand the token output. Below is an example failure and the debug output generated from the TokenStream:

    original:  Finanzgrundsatzangelegenheiten
    start-end: 1:[0-30], 1:[0-30], 2:[0-30], 3:[0-30]

    term 0 expected:<[finanz]> but was:<[Finanzgrundsatzangelegenheiten]>

## Future Improvements
There are many improvements that could be made to the splitter; the algorithm is fairly simple as it stands.
* statistical / probabilistic splitter - use a statistical data set to guide the decomposition.
* next-word awareness - by looking ahead at the next word in the stream, we can decide whether a word should be decomposed.
It is temporarily not used because I've tried this, for example:

    abbrenne 45

This is from the Google 1-gram corpus covering 1980-200(7?), and it is clear that this prefix is not a compound-forming one at all.
