Class AbstractDictionary
java.lang.Object
org.apache.lucene.analysis.cn.smart.hhmm.AbstractDictionary
Direct Known Subclasses:
BigramDictionary, WordDictionary
SmartChineseAnalyzer abstract dictionary implementation.
Contains methods for dealing with GB2312 encoding.
-
Field Summary
Fields
Modifier and Type | Field | Description
static final int | CHAR_NUM_IN_FILE | Dictionary data contains 6768 Chinese characters with frequency statistics.
static final int | GB2312_CHAR_NUM | Last Chinese character in GB2312 (87 * 94).
static final int | GB2312_FIRST_CHAR | First Chinese character in GB2312 (15 * 94). Characters in GB2312 are arranged in a grid of 94 * 94; rows 0-14 are unassigned or punctuation.
Constructor Summary
Constructors
AbstractDictionary()
Method Summary
Modifier and Type | Method | Description
String | getCCByGB2312Id(int ccid) | Transcode from GB2312 ID to Unicode.
short | getGB2312Id(char ch) | Transcode from Unicode to GB2312.
long | hash1(char c) | 32-bit FNV hash function.
long | hash1(char[] carray) | 32-bit FNV hash function.
int | hash2(char c) | djb2 hash algorithm. This algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c.
int | hash2(char[] carray) | djb2 hash algorithm. This algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c.
-
Field Details
-
GB2312_FIRST_CHAR
public static final int GB2312_FIRST_CHAR
First Chinese character in GB2312 (15 * 94 = 1410). Characters in GB2312 are arranged in a grid of 94 * 94; rows 0-14 are unassigned or punctuation.
-
GB2312_CHAR_NUM
public static final int GB2312_CHAR_NUM
Last Chinese character in GB2312 (87 * 94 = 8178). Characters in GB2312 are arranged in a grid of 94 * 94; rows 88-94 are unassigned.
-
CHAR_NUM_IN_FILE
public static final int CHAR_NUM_IN_FILE
Dictionary data contains 6768 Chinese characters with frequency statistics.
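The grid arithmetic behind these constants can be sketched as follows. This is an illustrative example only, not code taken from the class: the name `Gb2312GridSketch` and its helper methods are hypothetical, and the zero-based row/cell convention and half-open range are assumptions inferred from the `15 * 94` and `87 * 94` expressions in the field descriptions.

```java
/** Hypothetical helper illustrating the 94 * 94 GB2312 grid arithmetic. */
final class Gb2312GridSketch {
    // These values restate the expressions given in the field descriptions.
    static final int GRID_WIDTH = 94;
    static final int FIRST_CHAR = 15 * GRID_WIDTH; // 1410
    static final int CHAR_NUM = 87 * GRID_WIDTH;   // 8178

    /** Zero-based row of a linear grid position (assumed convention). */
    static int rowOf(int id) {
        return id / GRID_WIDTH;
    }

    /** Zero-based cell within the row. */
    static int cellOf(int id) {
        return id % GRID_WIDTH;
    }

    /** Builds the linear position from a zero-based row and cell. */
    static int idOf(int row, int cell) {
        return row * GRID_WIDTH + cell;
    }

    /** True if the position lies in the assumed range [FIRST_CHAR, CHAR_NUM). */
    static boolean inChineseRange(int id) {
        return id >= FIRST_CHAR && id < CHAR_NUM;
    }
}
```

Under this convention the range covers 72 rows, that is 72 * 94 = 6768 positions, which matches the count given for CHAR_NUM_IN_FILE.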
-
-
Constructor Details
-
AbstractDictionary
AbstractDictionary()
-
-
Method Details
-
getCCByGB2312Id
public String getCCByGB2312Id(int ccid)
Transcode from GB2312 ID to Unicode.
GB2312 is divided into a 94 * 94 grid containing 7445 characters: 6763 Chinese characters and 682 symbols. Some regions are unassigned (reserved).
Parameters:
ccid - GB2312 id
Returns:
Unicode String
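One way to picture this transcoding is the minimal sketch below. It is not the class's own implementation: the name `Gb2312IdToUnicodeSketch` is hypothetical, and it assumes that the ID is a zero-based linear position in the 94 * 94 grid, that the bytes follow the EUC-CN layout (row and cell each offset by 0xA1), that out-of-range IDs yield an empty string, and that the running JVM provides a `GB2312` charset.

```java
import java.nio.charset.Charset;

/** Hypothetical sketch of mapping a GB2312 grid position to a Unicode string. */
final class Gb2312IdToUnicodeSketch {
    static final int GB2312_CHAR_NUM = 87 * 94;

    static String idToUnicode(int ccid) {
        if (ccid < 0 || ccid >= GB2312_CHAR_NUM) {
            return ""; // assumed behavior for positions outside the grid range
        }
        // Assumed EUC-CN layout: each byte is the zero-based row/cell plus 0xA1.
        byte[] bytes = {
            (byte) (ccid / 94 + 0xA1),
            (byte) (ccid % 94 + 0xA1)
        };
        // Charset.forName throws if the JVM does not ship a GB2312 charset.
        return new String(bytes, Charset.forName("GB2312"));
    }
}
```

For example, position 15 * 94 (the value of GB2312_FIRST_CHAR) maps to the bytes 0xB0 0xA1, which decode to U+554A.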
-
getGB2312Id
public short getGB2312Id(char ch)
Transcode from Unicode to GB2312.
Parameters:
ch - input character in Unicode, or character in Basic Latin range.
Returns:
position in GB2312
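The reverse direction can be sketched in the same spirit. Again this is an assumption-laden illustration, not the documented implementation: `UnicodeToGb2312IdSketch` is a hypothetical name, the EUC-CN byte layout (0xA1 offsets) is assumed, and returning -1 for characters that do not encode to exactly two bytes (such as Basic Latin) is a choice made for this example.

```java
import java.nio.charset.Charset;

/** Hypothetical sketch of mapping a Unicode char to its GB2312 grid position. */
final class UnicodeToGb2312IdSketch {
    static short unicodeToId(char ch) {
        // Charset.forName throws if the JVM does not ship a GB2312 charset.
        byte[] bytes = String.valueOf(ch).getBytes(Charset.forName("GB2312"));
        if (bytes.length != 2) {
            return -1; // assumed sentinel: single-byte or unmappable input
        }
        // Undo the assumed 0xA1 offset to recover zero-based row and cell.
        int row = (bytes[0] & 0xFF) - 0xA1;
        int cell = (bytes[1] & 0xFF) - 0xA1;
        return (short) (row * 94 + cell);
    }
}
```

The largest possible position, 93 * 94 + 93, fits comfortably in a `short`, which is consistent with the declared return type.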
-
hash1
public long hash1(char c)
32-bit FNV hash function.
Parameters:
c - input character
Returns:
hashcode
-
hash1
public long hash1(char[] carray)
32-bit FNV hash function.
Parameters:
carray - character array
Returns:
hashcode
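For readers unfamiliar with FNV, the following is a generic sketch of a 32-bit FNV-1a hash. It is not claimed to reproduce the constants or mixing steps that hash1 actually uses: the class `Fnv32Sketch` is hypothetical, the FNV-1a variant (XOR, then multiply) is an assumption, and so is the choice to feed each `char` as its low byte followed by its high byte. The result is held in a `long` to mirror the declared return type.

```java
/** Hypothetical sketch of a 32-bit FNV-1a hash over bytes and Java chars. */
final class Fnv32Sketch {
    static final long OFFSET_BASIS = 0x811C9DC5L; // standard 32-bit FNV offset basis
    static final long PRIME = 0x01000193L;        // standard 32-bit FNV prime
    static final long MASK_32 = 0xFFFFFFFFL;

    /** One FNV-1a step: XOR in one octet, multiply by the prime, keep 32 bits. */
    static long mix(long hash, int octet) {
        return ((hash ^ (octet & 0xFF)) * PRIME) & MASK_32;
    }

    static long hash(byte[] data) {
        long hash = OFFSET_BASIS;
        for (byte b : data) {
            hash = mix(hash, b);
        }
        return hash;
    }

    /** Assumed byte order for a char: low byte first, then high byte. */
    static long hash(char c) {
        return mix(mix(OFFSET_BASIS, c & 0xFF), c >> 8);
    }

    static long hash(char[] carray) {
        long hash = OFFSET_BASIS;
        for (char c : carray) {
            hash = mix(mix(hash, c & 0xFF), c >> 8);
        }
        return hash;
    }
}
```

Masking after every multiplication keeps the value within 32 bits, so the intermediate product never overflows the 64-bit `long`.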
-
hash2
public int hash2(char c)
djb2 hash algorithm. This algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c. Another version of this algorithm (now favored by Bernstein) uses XOR: hash(i) = hash(i - 1) * 33 ^ str[i]. The magic of the number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
Parameters:
c - character
Returns:
hashcode
-
hash2
public int hash2(char[] carray)
djb2 hash algorithm. This algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c. Another version of this algorithm (now favored by Bernstein) uses XOR: hash(i) = hash(i - 1) * 33 ^ str[i]. The magic of the number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
Parameters:
carray - character array
Returns:
hashcode
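The two djb2 variants mentioned in the description can be sketched as below. This is a textbook-style illustration rather than the code behind hash2: the class `Djb2Sketch` is hypothetical, the starting value 5381 is the conventional djb2 seed and is an assumption here (the description does not state it), and each `char` is fed in whole rather than byte by byte.

```java
/** Hypothetical sketch of the additive and XOR forms of djb2 (k = 33). */
final class Djb2Sketch {
    static final int SEED = 5381; // conventional djb2 seed; assumed, not stated in this document

    /** Additive form: hash(i) = hash(i - 1) * 33 + str[i]. */
    static int hashAdd(char[] carray) {
        int hash = SEED;
        for (char c : carray) {
            hash = hash * 33 + c; // int overflow wraps silently, as is usual for this hash
        }
        return hash;
    }

    /** XOR form: hash(i) = hash(i - 1) * 33 ^ str[i]. */
    static int hashXor(char[] carray) {
        int hash = SEED;
        for (char c : carray) {
            hash = (hash * 33) ^ c; // multiplication binds tighter than XOR
        }
        return hash;
    }
}
```

For the single character 'a' (code 97), the additive form gives 5381 * 33 + 97 = 177670, while the XOR form gives 177604.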
-