java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.util.SegmentingTokenizerBase
org.apache.lucene.analysis.th.ThaiTokenizer
- All Implemented Interfaces:
Closeable, AutoCloseable
Tokenizer that uses BreakIterator to tokenize Thai text.
WARNING: this tokenizer may not be supported by all JREs. It is known to work with Sun/Oracle and Harmony JREs. If your application needs to be fully portable, consider using ICUTokenizer instead, which uses an ICU Thai BreakIterator that will always be available.
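One way to check whether a given JRE can segment Thai text is to ask its word BreakIterator for boundaries inside an unspaced Thai phrase. The sketch below uses only java.text.BreakIterator from the JDK. The class name ThaiBreakProbe and its methods are hypothetical and are not part of Lucene, and the check is an illustration of the idea rather than a statement of how DBBI_AVAILABLE is computed.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: probes the JRE's Thai word BreakIterator.
public class ThaiBreakProbe {

    /** Returns every boundary offset the Thai-locale word BreakIterator reports for the text. */
    static List<Integer> boundaries(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        words.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = words.first(); b != BreakIterator.DONE; b = words.next()) {
            result.add(b);
        }
        return result;
    }

    /** True if the iterator places at least one boundary strictly inside an unspaced Thai phrase. */
    static boolean splitsThai() {
        // "Thai language" written in Thai script: two words, seven characters, no space.
        String sample = "\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22";
        // Start and end are always boundaries; anything beyond those two is an internal break.
        return boundaries(sample).size() > 2;
    }

    public static void main(String[] args) {
        System.out.println("Thai word breaking available: " + splitsThai());
    }
}
```

If the probe reports false on a target JRE, that is a sign the portable ICUTokenizer mentioned above is the safer choice.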
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
Field Summary
Fields
static final boolean DBBI_AVAILABLE: True if the JRE supports a working dictionary-based BreakIterator for Thai.
private final OffsetAttribute offsetAtt
private static final BreakIterator proto
(package private) int sentenceEnd
private static final BreakIterator sentenceProto: Used for breaking the text into sentences.
(package private) int sentenceStart
private final CharTermAttribute termAtt
private final BreakIterator wordBreaker
private final CharArrayIterator wrapper
Fields inherited from class org.apache.lucene.analysis.util.SegmentingTokenizerBase
buffer, BUFFERMAX, offset
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
Constructors
ThaiTokenizer(): Creates a new ThaiTokenizer.
ThaiTokenizer(AttributeFactory factory): Creates a new ThaiTokenizer, supplying the AttributeFactory.
-
Method Summary
protected boolean incrementWord(): Returns true if another word is available.
protected void setNextSentence(int sentenceStart, int sentenceEnd): Provides the next input sentence for analysis.
Methods inherited from class org.apache.lucene.analysis.util.SegmentingTokenizerBase
end, incrementToken, isSafeEnd, reset
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader, setReaderTestPoint
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Field Details
-
DBBI_AVAILABLE
public static final boolean DBBI_AVAILABLE
True if the JRE supports a working dictionary-based BreakIterator for Thai. If this is false, this tokenizer will not work at all!
-
proto
-
sentenceProto
Used for breaking the text into sentences.
-
wordBreaker
-
wrapper
-
sentenceStart
int sentenceStart
-
sentenceEnd
int sentenceEnd
-
termAtt
-
offsetAtt
-
-
Constructor Details
-
ThaiTokenizer
public ThaiTokenizer()
Creates a new ThaiTokenizer.
-
ThaiTokenizer
public ThaiTokenizer(AttributeFactory factory)
Creates a new ThaiTokenizer, supplying the AttributeFactory.
-
-
Method Details
-
setNextSentence
protected void setNextSentence(int sentenceStart, int sentenceEnd)
Description copied from class: SegmentingTokenizerBase
Provides the next input sentence for analysis.
Specified by: setNextSentence in class SegmentingTokenizerBase
-
incrementWord
protected boolean incrementWord()
Description copied from class: SegmentingTokenizerBase
Returns true if another word is available.
Specified by: incrementWord in class SegmentingTokenizerBase
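The two methods above suggest a two-level scheme: a sentence is handed over first, then words are pulled from it one at a time. The following is a minimal sketch of that pattern using only the JDK's java.text.BreakIterator. The class SentenceWordSketch, its nextWord and tokenize methods, and the rule of skipping segments that contain no letter or digit are assumptions made for illustration; this is not the ThaiTokenizer source and it produces plain strings instead of Lucene attributes.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the sentence-then-word pattern; not Lucene code.
public class SentenceWordSketch {
    private final BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
    private final BreakIterator words = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
    private String text;
    private int sentenceStart;
    private int sentenceEnd;

    /** Counterpart of setNextSentence: record the span and point the word breaker at it. */
    void setNextSentence(int start, int end) {
        sentenceStart = start;
        sentenceEnd = end;
        words.setText(text.substring(sentenceStart, sentenceEnd));
    }

    /** Counterpart of incrementWord: next segment containing a letter or digit, or null if none. */
    String nextWord() {
        int start = words.current();
        int end = words.next();
        while (end != BreakIterator.DONE) {
            // Word offsets are relative to the sentence, so shift them back into the full text.
            String candidate = text.substring(sentenceStart + start, sentenceStart + end);
            if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
                return candidate;
            }
            start = end; // whitespace or punctuation only: skip it
            end = words.next();
        }
        return null;
    }

    /** Drives the two methods over a whole input string. */
    List<String> tokenize(String input) {
        text = input;
        sentences.setText(input);
        List<String> out = new ArrayList<>();
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE; start = end, end = sentences.next()) {
            setNextSentence(start, end);
            for (String w = nextWord(); w != null; w = nextWord()) {
                out.add(w);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(new SentenceWordSketch().tokenize("Hello world"));
    }
}
```

Splitting by sentence first keeps each word-breaking pass short, which matters for dictionary-based iterators whose cost grows with the length of the text they are given.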
-