Class DictionaryCompoundWordTokenFilter
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.TokenFilter
org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter
- All Implemented Interfaces:
Closeable, AutoCloseable, Unwrappable<TokenStream>
A TokenFilter that decomposes compound words found in many Germanic languages.
"Donaudampfschiff" becomes Donau, dampf, schiff, so a search for "schiff" alone still finds "Donaudampfschiff". The filter uses a brute-force algorithm: it tests substrings of each token against the dictionary.
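The brute-force idea can be illustrated with a minimal sketch in plain Java. It is not Lucene's implementation: the class name BruteForceDecompounder, the use of a Set<String> in place of CharArraySet, and the lower-casing step are assumptions made for illustration. The sketch tests every substring of the word against the dictionary and collects the hits in order of start position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene.
class BruteForceDecompounder {

    // Returns every dictionary word found inside `word`, ordered by start offset.
    static List<String> decompose(String word, Set<String> dictionary) {
        // Assumption: the dictionary holds lower-case entries, so normalize the input.
        String term = word.toLowerCase(Locale.ROOT);
        List<String> subwords = new ArrayList<>();
        for (int start = 0; start < term.length(); start++) {
            for (int end = start + 1; end <= term.length(); end++) {
                String candidate = term.substring(start, end);
                if (dictionary.contains(candidate)) {
                    subwords.add(candidate);
                }
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("donau", "dampf", "schiff");
        // prints [donau, dampf, schiff]
        System.out.println(decompose("Donaudampfschiff", dictionary));
    }
}
```

The nested loop is quadratic in the token length, which is the cost of the brute-force approach; the size limits accepted by the longer constructors narrow the range of substrings that are tried.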
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
CompoundWordTokenFilterBase.CompoundToken
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
Field Summary
Fields inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, maxSubwordSize, minSubwordSize, minWordSize, offsetAtt, onlyLongestMatch, termAtt, tokens
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
Constructors
Constructor and Description
DictionaryCompoundWordTokenFilter(TokenStream input, CharArraySet dictionary)
Creates a new DictionaryCompoundWordTokenFilter.
DictionaryCompoundWordTokenFilter(TokenStream input, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatchIgnoreSubwords)
Creates a new DictionaryCompoundWordTokenFilter.
DictionaryCompoundWordTokenFilter(TokenStream input, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch, boolean onlyLongestMatchIgnoreSubwords)
Deprecated.
-
Method Summary
Modifier and Type, Method, and Description
protected void decompose()
Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list.
Methods inherited from class org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase
incrementToken, reset
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end, unwrap
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Constructor Details
-
DictionaryCompoundWordTokenFilter
public DictionaryCompoundWordTokenFilter(TokenStream input, CharArraySet dictionary)
Creates a new DictionaryCompoundWordTokenFilter.
- Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against
-
DictionaryCompoundWordTokenFilter
@Deprecated public DictionaryCompoundWordTokenFilter(TokenStream input, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch, boolean onlyLongestMatchIgnoreSubwords)
Deprecated.
Creates a new DictionaryCompoundWordTokenFilter.
- Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - deprecated; use the onlyLongestMatchIgnoreSubwords parameter instead
onlyLongestMatchIgnoreSubwords - if true, subwords contained in a longer match are ignored; e.g. if a word contains 'schwein', only the longer word 'schwein' is extracted and the subword 'wein' is ignored. Supersedes the onlyLongestMatch parameter.
-
DictionaryCompoundWordTokenFilter
public DictionaryCompoundWordTokenFilter(TokenStream input, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatchIgnoreSubwords)
Creates a new DictionaryCompoundWordTokenFilter.
- Parameters:
input - the TokenStream to process
dictionary - the word dictionary to match against
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatchIgnoreSubwords - if true, subwords contained in a longer match are ignored; e.g. if a word contains 'schwein', only the longer word 'schwein' is extracted and the subword 'wein' is ignored. Supersedes the onlyLongestMatch parameter.
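How the subword size limits and onlyLongestMatchIgnoreSubwords interact can be sketched in plain Java. This illustrates the documented behavior and is not Lucene's code: the class SubwordSelectionSketch is hypothetical, a Set<String> stands in for CharArraySet, the size bounds are treated as inclusive, and skipping past a longest match is one assumed way of ignoring the subwords it contains. The minWordSize check is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene.
class SubwordSelectionSketch {

    static List<String> decompose(String word, Set<String> dictionary,
                                  int minSubwordSize, int maxSubwordSize,
                                  boolean onlyLongestMatchIgnoreSubwords) {
        List<String> subwords = new ArrayList<>();
        int len = word.length();
        int start = 0;
        while (start + minSubwordSize <= len) {
            String longest = null;
            // Assumption: both size bounds are inclusive in this sketch.
            for (int size = minSubwordSize;
                 size <= maxSubwordSize && start + size <= len; size++) {
                String candidate = word.substring(start, start + size);
                if (!dictionary.contains(candidate)) {
                    continue;
                }
                if (onlyLongestMatchIgnoreSubwords) {
                    longest = candidate; // sizes grow, so the last hit is the longest
                } else {
                    subwords.add(candidate);
                }
            }
            if (longest != null) {
                subwords.add(longest);
                start += longest.length(); // skip subwords inside the longest match
            } else {
                start++;
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("schwein", "wein", "fleisch");
        // prints [schwein, wein, fleisch]
        System.out.println(decompose("schweinefleisch", dictionary, 2, 15, false));
        // prints [schwein, fleisch]
        System.out.println(decompose("schweinefleisch", dictionary, 2, 15, true));
    }
}
```

In this sketch, raising minSubwordSize to 5 also drops 'wein', because the four-letter candidate is never tested.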
-
-
Method Details
-
decompose
protected void decompose()
Description copied from class: CompoundWordTokenFilterBase
Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list. The original token must not be placed in the list, because this filter passes it through automatically.
- Specified by:
decompose in class CompoundWordTokenFilterBase
-