Class HMMChineseTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.util.SegmentingTokenizerBase
org.apache.lucene.analysis.cn.smart.HMMChineseTokenizer
All Implemented Interfaces:
Closeable, AutoCloseable
Tokenizer for Chinese or mixed Chinese-English text.
The analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.
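The first stage described above, breaking text into sentences, can be sketched with the JDK alone. This is a minimal illustration and not the tokenizer's own code. It assumes a `java.text.BreakIterator` sentence instance, which matches the type of the `sentenceProto` field listed below. The class name `SentenceSplitSketch` and the method `splitSentences` are hypothetical, and the probabilistic word segmentation stage is not reproduced here.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Lucene.
final class SentenceSplitSketch {

    // Returns the sentences of `text` in order, as reported by the JDK
    // sentence BreakIterator. Concatenating the result restores `text`.
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk consecutive boundaries; each [start, end) span is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(text.substring(start, end));
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String sentence : splitSentences("Hello world. How are you?")) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

In the real tokenizer each sentence span would then be handed to the word segmenter; here the spans are only collected.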
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
Field Summary
Fields
Modifier and Type                    Field            Description
private final OffsetAttribute        offsetAtt
private static final BreakIterator   sentenceProto    used for breaking the text into sentences
private final CharTermAttribute      termAtt
private final TypeAttribute          typeAtt
private final WordSegmenter          wordSegmenter
Fields inherited from class org.apache.lucene.analysis.util.SegmentingTokenizerBase
buffer, BUFFERMAX, offset
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor Summary
Constructors
Constructor                                     Description
HMMChineseTokenizer()                           Creates a new HMMChineseTokenizer
HMMChineseTokenizer(AttributeFactory factory)   Creates a new HMMChineseTokenizer, supplying the AttributeFactory
Method Summary
Modifier and Type    Method                                                Description
protected boolean    incrementWord()                                       Returns true if another word is available
void                 reset()                                               This method is called by a consumer before it begins consumption using TokenStream.incrementToken().
protected void       setNextSentence(int sentenceStart, int sentenceEnd)   Provides the next input sentence for analysis
Methods inherited from class org.apache.lucene.analysis.util.SegmentingTokenizerBase
end, incrementToken, isSafeEnd
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader, setReaderTestPoint
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
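Field Details
sentenceProto
private static final BreakIterator sentenceProto
used for breaking the text into sentences
termAtt
private final CharTermAttribute termAtt
offsetAtt
private final OffsetAttribute offsetAtt
typeAtt
private final TypeAttribute typeAtt
wordSegmenter
private final WordSegmenter wordSegmenter
tokens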
Constructor Details
HMMChineseTokenizer
public HMMChineseTokenizer()
Creates a new HMMChineseTokenizer
HMMChineseTokenizer
public HMMChineseTokenizer(AttributeFactory factory)
Creates a new HMMChineseTokenizer, supplying the AttributeFactory
Method Details
setNextSentence
protected void setNextSentence(int sentenceStart, int sentenceEnd)
Description copied from class: SegmentingTokenizerBase
Provides the next input sentence for analysis
Specified by:
setNextSentence in class SegmentingTokenizerBase
incrementWord
protected boolean incrementWord()
Description copied from class: SegmentingTokenizerBase
Returns true if another word is available
Specified by:
incrementWord in class SegmentingTokenizerBase
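The two callbacks work as a pair: the base class hands over one sentence span at a time, and the subclass then yields that sentence's words until none remain. The following is a simplified model of that contract, written against the JDK only. It is not Lucene's implementation. The classes `SegmentingSketchBase` and `WhitespaceWordsSketch` are hypothetical, the driver loop is an assumption about how such a base class could be organised, and whitespace splitting stands in for the real word segmenter.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical miniature of a sentence/word callback contract; not Lucene code.
abstract class SegmentingSketchBase {
    private final BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
    private boolean haveSentence = false;

    // Called with the [sentenceStart, sentenceEnd) span of the next sentence.
    protected abstract void setNextSentence(int sentenceStart, int sentenceEnd);

    // Returns true if another word of the current sentence is available.
    protected abstract boolean incrementWord();

    final void setText(String text) {
        sentences.setText(text);
        sentences.first();
        haveSentence = false;
    }

    // Drains the current sentence first, then advances to the next sentence.
    final boolean incrementToken() {
        if (haveSentence && incrementWord()) {
            return true;
        }
        while (true) {
            int start = sentences.current();
            int end = sentences.next();
            if (end == BreakIterator.DONE) {
                return false; // no sentences left
            }
            setNextSentence(start, end);
            haveSentence = true;
            if (incrementWord()) {
                return true;
            }
        }
    }
}

// Stand-in segmenter: splits each sentence on whitespace (assumption for the demo).
final class WhitespaceWordsSketch extends SegmentingSketchBase {
    private final String text;
    private final List<String> pending = new ArrayList<>();
    private int next = 0;
    String term; // the most recently produced word

    WhitespaceWordsSketch(String text) {
        this.text = text;
        setText(text);
    }

    @Override
    protected void setNextSentence(int sentenceStart, int sentenceEnd) {
        // Segment the whole sentence up front and queue its words.
        pending.clear();
        next = 0;
        for (String word : text.substring(sentenceStart, sentenceEnd).trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pending.add(word);
            }
        }
    }

    @Override
    protected boolean incrementWord() {
        if (next >= pending.size()) {
            return false;
        }
        term = pending.get(next++);
        return true;
    }

    static List<String> tokenize(String text) {
        WhitespaceWordsSketch tokenizer = new WhitespaceWordsSketch(text);
        List<String> words = new ArrayList<>();
        while (tokenizer.incrementToken()) {
            words.add(tokenizer.term);
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello world. How are you?"));
    }
}
```

Segmenting the whole sentence inside `setNextSentence` and letting `incrementWord` only advance a cursor keeps the per-word call cheap; whether the actual class is organised this way is not stated on this page.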
reset
public void reset() throws IOException
Description copied from class: TokenStream
This method is called by a consumer before it begins consumption using TokenStream.incrementToken().
Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh.
If you override this method, always call super.reset(), otherwise some internal state will not be correctly reset (e.g., Tokenizer will throw IllegalStateException on further usage).
Overrides:
reset in class SegmentingTokenizerBase
Throws:
IOException