Class ICUTokenizerFactory
java.lang.Object
org.apache.lucene.analysis.AbstractAnalysisFactory
org.apache.lucene.analysis.TokenizerFactory
org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory
All Implemented Interfaces: ResourceLoaderAware
Factory for ICUTokenizer. Words are broken across script boundaries, then segmented
according to the BreakIterator and typing provided by the DefaultICUTokenizerConfig.
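The first step described above, breaking text at script boundaries, can be illustrated with a minimal sketch. It uses the JDK's Character.UnicodeScript rather than ICU, and the class name ScriptRunSketch, the method scriptRuns, and the rule that Common and Inherited characters simply join the current run are all assumptions made for illustration. It is not the ICUTokenizer implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not part of Lucene or ICU.
class ScriptRunSketch {

    /** Splits text into maximal runs of code points that share one script. */
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Character.UnicodeScript runScript = null;

        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);

            // Assumption: script-neutral characters (spaces, punctuation,
            // combining marks) never start a new run.
            boolean neutral = script == Character.UnicodeScript.COMMON
                    || script == Character.UnicodeScript.INHERITED;

            if (!neutral) {
                if (runScript != null && script != runScript) {
                    // Script changed: close the current run and start a new one.
                    runs.add(current.toString());
                    current.setLength(0);
                }
                runScript = script;
            }
            current.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        // "hello" followed by three Cyrillic letters, written as escapes.
        // Expect two runs: the Latin letters, then the Cyrillic letters.
        System.out.println(scriptRuns("hello\u043c\u0438\u0440"));
    }
}
```

In the real tokenizer each such run is then segmented by the BreakIterator chosen for its script. The sketch stops at the run boundaries.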
To use the default set of per-script rules:
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
  </analyzer>
</fieldType>
You can customize this tokenizer's behavior by specifying per-script rule files, which are compiled by the ICU RuleBasedBreakIterator. See the ICU RuleBasedBreakIterator syntax reference.
To add per-script rules, add a "rulefiles" argument, which should contain a comma-separated
list of code:rulefile pairs in the following format: four-letter ISO 15924 script code,
followed by a colon, then a resource path. E.g. to specify rules for Latin (script code "Latn")
and Cyrillic (script code "Cyrl"):
<fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" cjkAsWords="true"
               rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
  </analyzer>
</fieldType>
Since: 3.1
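The "rulefiles" format described above (comma-separated pairs of a four-letter ISO 15924 script code, a colon, and a resource path) can be sketched as a small parser. The class RuleFilesArgSketch and its parse method are hypothetical names, and resolving the code with the JDK's Character.UnicodeScript.forName is an assumption chosen to keep the example free of dependencies. The factory itself keeps its per-script entries in the tailored field and may resolve script codes differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: not the factory's own parsing code.
class RuleFilesArgSketch {

    /** Parses e.g. "Latn:a.rbbi,Cyrl:b.rbbi" into a script-to-resource-path map. */
    static Map<Character.UnicodeScript, String> parse(String rulefiles) {
        Map<Character.UnicodeScript, String> result = new LinkedHashMap<>();
        for (String pair : rulefiles.split(",")) {
            String trimmed = pair.trim();
            int colon = trimmed.indexOf(':');
            // The script code must be exactly four letters, followed by a colon.
            if (colon != 4 || colon == trimmed.length() - 1) {
                throw new IllegalArgumentException(
                        "Expected <four-letter script code>:<resource path>, got: " + trimmed);
            }
            String code = trimmed.substring(0, colon);
            String path = trimmed.substring(colon + 1);
            // forName accepts ISO 15924 aliases such as "Latn" and throws
            // IllegalArgumentException for an unknown script.
            result.put(Character.UnicodeScript.forName(code), path);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"));
    }
}
```

Rejecting malformed pairs early, as the sketch does, surfaces a bad configuration value when the field type is loaded rather than when the first document is analyzed.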
-
Field Summary
Fields
Modifier and Type                        Field            Description
private final boolean                    cjkAsWords
private ICUTokenizerConfig               config
private final boolean                    myanmarAsWords
static final String                      NAME             SPI name
(package private) static final String    RULEFILES
private final IntObjectHashMap<String>   tailored
Fields inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
-
Constructor Summary
Constructors
Constructor                                      Description
ICUTokenizerFactory()                            Default ctor for compatibility with SPI
ICUTokenizerFactory(Map<String, String> args)    Creates a new ICUTokenizerFactory
-
Method Summary
Modifier and Type                        Method                                               Description
                                         create(AttributeFactory factory)                     Creates a TokenStream of the specified input using the given AttributeFactory
void                                     inform(ResourceLoader loader)                        Initializes this component with the provided ResourceLoader (used for loading classes, files, etc).
private com.ibm.icu.text.BreakIterator   parseRules(String filename, ResourceLoader loader)
Methods inherited from class org.apache.lucene.analysis.TokenizerFactory
availableTokenizers, create, findSPIName, forName, lookupClass, reloadTokenizers
Methods inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
defaultCtorException, get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
-
Field Details
-
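NAME
public static final String NAME
SPI name
-
RULEFILES
static final String RULEFILES
-
tailored
private final IntObjectHashMap<String> tailored
-
config
private ICUTokenizerConfig config
-
cjkAsWords
private final boolean cjkAsWords
-
myanmarAsWords
private final boolean myanmarAsWords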
-
-
Constructor Details
-
ICUTokenizerFactory
public ICUTokenizerFactory(Map<String, String> args)
Creates a new ICUTokenizerFactory
-
ICUTokenizerFactory
public ICUTokenizerFactory()
Default ctor for compatibility with SPI
-
-
Method Details
-
inform
Description copied from interface: ResourceLoaderAware
Initializes this component with the provided ResourceLoader (used for loading classes, files, etc).
Specified by: inform in interface ResourceLoaderAware
Throws: IOException
-
parseRules
private com.ibm.icu.text.BreakIterator parseRules(String filename, ResourceLoader loader) throws IOException
Throws: IOException
-
create
Description copied from class: TokenizerFactory
Creates a TokenStream of the specified input using the given AttributeFactory
Specified by: create in class TokenizerFactory
-