Class UAX29URLEmailTokenizerImpl
Tokens produced are of the following types:
- <ALPHANUM>: A sequence of alphabetic and numeric characters
- <NUM>: A number
- <URL>: A URL
- <EMAIL>: An email address
- <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
- <IDEOGRAPHIC>: A single CJKV ideographic character
- <HIRAGANA>: A single hiragana character
- <KATAKANA>: A sequence of katakana characters
- <HANGUL>: A sequence of Hangul characters
- <EMOJI>: A sequence of Emoji characters
-
Field Summary
Fields (modifier and type, field, description):
- static final int AVOID_BAD_URL
- static final int EMAIL_TYPE: Email token type
- static final int EMOJI_TYPE: Emoji token type
- static final int HANGUL_TYPE: Hangul token type
- static final int HIRAGANA_TYPE: Hiragana token type
- static final int IDEOGRAPHIC_TYPE: Ideographic token type
- static final int KATAKANA_TYPE: Katakana token type
- static final int NUMERIC_TYPE: Numbers
- static final int SOUTH_EAST_ASIAN_TYPE: Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.).
- static final int URL_TYPE: URL token type
- static final int WORD_TYPE: Alphanumeric sequences
- private long yychar: Number of characters up to the start of the matched text.
- private int yycolumn: Number of characters from the last newline up to the start of the matched text.
- static final int YYEOF: This character denotes the end of file.
- static final int YYINITIAL: Lexical States.
- private int yyline: Number of newlines encountered up to the start of the matched text.
- private static final int[] ZZ_ACTION: Translates DFA states to action switch labels.
- private static final String ZZ_ACTION_PACKED_0
- private static final int[] ZZ_ATTRIBUTE: ZZ_ATTRIBUTE[aState] contains the attributes of state aState
- private static final String ZZ_ATTRIBUTE_PACKED_0
- private int ZZ_BUFFERSIZE: Initial size of the lookahead buffer.
- private static final int[] ZZ_CMAP_BLOCKS: Second-level tables for translating characters to character classes
- private static final String ZZ_CMAP_BLOCKS_PACKED_0
- private static final int[] ZZ_CMAP_TOP: Top-level table for translating characters to character classes
- private static final String ZZ_CMAP_TOP_PACKED_0
- private static final String[] ZZ_ERROR_MSG
- private static final int[] ZZ_LEXSTATE: ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l. ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line. l is of the form l = 2*k, where k is a non-negative integer.
- private static final int ZZ_NO_MATCH: Error code for "could not match input".
- private static final int ZZ_PUSHBACK_2BIG: Error code for "pushback value was too large".
- private static final int[] ZZ_ROWMAP: Translates a state to a row index in the transition table
- private static final String ZZ_ROWMAP_PACKED_0
- private static final int[] ZZ_TRANS: The transition table of the DFA
- private static final String ZZ_TRANS_PACKED_0
- private static final String ZZ_TRANS_PACKED_1
- private static final String ZZ_TRANS_PACKED_2
- private static final String ZZ_TRANS_PACKED_3
- private static final String ZZ_TRANS_PACKED_4
- private static final String ZZ_TRANS_PACKED_5
- private static final String ZZ_TRANS_PACKED_6
- private static final String ZZ_TRANS_PACKED_7
- private static final String ZZ_TRANS_PACKED_8
- private static final String ZZ_TRANS_PACKED_9
- private static final String ZZ_TRANS_PACKED_10
- private static final int ZZ_UNKNOWN_ERROR: Error code for "Unknown internal scanner error".
- private boolean zzAtBOL: Whether the scanner is currently at the beginning of a line.
- private boolean zzAtEOF: Whether the scanner is at the end of file.
- private char[] zzBuffer: This buffer contains the current text to be matched and is the source of the yytext() string.
- private int zzCurrentPos: Current text position in the buffer.
- private int zzEndRead: Marks the last character in the buffer that has been read from input.
- private boolean zzEOFDone: Whether the user-EOF-code has already been executed.
- private int zzFinalHighSurrogate
- private int zzLexicalState: Current lexical state.
- private int zzMarkedPos: Text position at the last accepting state.
- private Reader zzReader: Input device.
- private int zzStartRead: Marks the beginning of the yytext() string in the buffer.
- private int zzState: Current state of the DFA.
-
Constructor Summary
Constructors
- UAX29URLEmailTokenizerImpl(Reader in): Creates a new scanner
-
Method Summary
Methods (modifier and type, method, description):
- int getNextToken(): Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
- final void getText(CharTermAttribute t): Fills CharTermAttribute with the current token text.
- final void setBufferSize(int numChars): Sets the scanner buffer size in chars
- final boolean yyatEOF(): Returns whether the scanner has reached the end of the reader it reads from.
- final void yybegin(int newState): Enters a new lexical state.
- final int yychar(): Character count processed so far
- final char yycharat(int position): Returns the character at the given position from the matched text.
- final void yyclose(): Closes the input reader.
- final int yylength(): How many characters were matched.
- void yypushback(int number): Pushes the specified number of characters back into the input stream.
- final void yyreset(Reader reader): Resets the scanner to read from a new input stream.
- private final void yyResetPosition(): Resets the input position.
- final int yystate(): Returns the current lexical state.
- final String yytext(): Returns the text matched by the current regular expression.
- private static int zzCMap(int input): Translates raw input code points to DFA table row
- private boolean zzRefill(): Refills the input buffer.
- private static void zzScanError(int errorCode): Reports an error that occurred while scanning.
- private static int[] zzUnpackAction()
- private static int zzUnpackAction(String packed, int offset, int[] result)
- private static int[] zzUnpackAttribute()
- private static int zzUnpackAttribute(String packed, int offset, int[] result)
- private static int[] zzUnpackcmap_blocks()
- private static int zzUnpackcmap_blocks(String packed, int offset, int[] result)
- private static int[] zzUnpackcmap_top()
- private static int zzUnpackcmap_top(String packed, int offset, int[] result)
- private static int[] zzUnpackRowMap()
- private static int zzUnpackRowMap(String packed, int offset, int[] result)
- private static int[] zzUnpackTrans()
- private static int zzUnpackTrans(String packed, int offset, int[] result)
-
Field Details
-
YYEOF
public static final int YYEOF
This character denotes the end of file.
-
ZZ_BUFFERSIZE
private int ZZ_BUFFERSIZE
Initial size of the lookahead buffer.
-
YYINITIAL
public static final int YYINITIAL
Lexical States.
-
AVOID_BAD_URL
public static final int AVOID_BAD_URL
-
ZZ_LEXSTATE
private static final int[] ZZ_LEXSTATE
ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l. ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line. l is of the form l = 2*k, where k is a non-negative integer.
-
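The ZZ_LEXSTATE indexing rule (even index l for the lexical state, l+1 for the same state at the beginning of a line) can be sketched as follows. This is a minimal illustration, not the generated code: the class name LexStateSketch, the helper startState, the numeric values of the two lexical states, and the table contents are all assumptions made for the example.

```java
// Sketch only: state numbers and table contents are illustrative assumptions,
// not the values generated for UAX29URLEmailTokenizerImpl.
final class LexStateSketch {
    // Lexical state identifiers are even numbers (l = 2*k).
    static final int YYINITIAL = 0;
    static final int AVOID_BAD_URL = 2;

    // Entry l: DFA start state for lexical state l.
    // Entry l+1: DFA start state for lexical state l at the beginning of a line.
    static final int[] ZZ_LEXSTATE = {0, 0, 1, 1};

    // Hypothetical helper: picks the DFA start state for a lexical state.
    static int startState(int lexicalState, boolean atBeginningOfLine) {
        return ZZ_LEXSTATE[lexicalState + (atBeginningOfLine ? 1 : 0)];
    }

    public static void main(String[] args) {
        System.out.println(startState(YYINITIAL, false));
        System.out.println(startState(AVOID_BAD_URL, true));
    }
}
```

Keeping the identifiers even means the beginning-of-line variant is always one slot to the right, so no second table is needed.
-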
ZZ_CMAP_TOP
private static final int[] ZZ_CMAP_TOP
Top-level table for translating characters to character classes
-
ZZ_CMAP_TOP_PACKED_0
private static final String ZZ_CMAP_TOP_PACKED_0
-
ZZ_CMAP_BLOCKS
private static final int[] ZZ_CMAP_BLOCKS
Second-level tables for translating characters to character classes
-
ZZ_CMAP_BLOCKS_PACKED_0
private static final String ZZ_CMAP_BLOCKS_PACKED_0
-
ZZ_ACTION
private static final int[] ZZ_ACTION
Translates DFA states to action switch labels.
-
ZZ_ACTION_PACKED_0
private static final String ZZ_ACTION_PACKED_0
-
ZZ_ROWMAP
private static final int[] ZZ_ROWMAP
Translates a state to a row index in the transition table
-
ZZ_ROWMAP_PACKED_0
private static final String ZZ_ROWMAP_PACKED_0
-
ZZ_TRANS
private static final int[] ZZ_TRANS
The transition table of the DFA
-
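How a row map and a flat transition table work together can be illustrated with a toy DFA. The sketch below is not the generated scanner: the class TransitionSketch, its two-class alphabet, the use of -1 for "no transition", and the ACCEPTING array are assumptions chosen to keep the example small. It only shows the general pattern of looking up the next state at ROWMAP[state] + characterClass and remembering the last accepting position.

```java
// Toy DFA that matches the longest run of leading digits.
// All tables here are hypothetical and unrelated to the real ZZ_TRANS contents.
final class TransitionSketch {
    // Character classes: 0 = anything else, 1 = digit.
    // ROWMAP[state] is the index of that state's row inside TRANS.
    static final int[] ROWMAP = {0, 2};

    // Next state per (state, class); -1 means "no transition".
    static final int[] TRANS = {
        -1, 1,   // state 0: digit -> state 1
        -1, 1    // state 1: digit -> state 1
    };

    static final boolean[] ACCEPTING = {false, true};

    // Returns the length of the longest digit prefix of the input.
    static int longestMatch(String input) {
        int state = 0;
        int lastAccept = 0; // plays the role of a "marked position"
        for (int i = 0; i < input.length(); i++) {
            int cls = Character.isDigit(input.charAt(i)) ? 1 : 0;
            int next = TRANS[ROWMAP[state] + cls];
            if (next == -1) {
                break;
            }
            state = next;
            if (ACCEPTING[state]) {
                lastAccept = i + 1;
            }
        }
        return lastAccept;
    }

    public static void main(String[] args) {
        System.out.println(longestMatch("2024-01"));
    }
}
```

Storing the table as one flat array with a separate row map lets several states share or offset rows, which is why the two structures are documented as separate fields.
-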
ZZ_TRANS_PACKED_0
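private static final String ZZ_TRANS_PACKED_0
-
ZZ_TRANS_PACKED_1
private static final String ZZ_TRANS_PACKED_1
-
ZZ_TRANS_PACKED_2
private static final String ZZ_TRANS_PACKED_2
-
ZZ_TRANS_PACKED_3
private static final String ZZ_TRANS_PACKED_3
-
ZZ_TRANS_PACKED_4
private static final String ZZ_TRANS_PACKED_4
-
ZZ_TRANS_PACKED_5
private static final String ZZ_TRANS_PACKED_5
-
ZZ_TRANS_PACKED_6
private static final String ZZ_TRANS_PACKED_6
-
ZZ_TRANS_PACKED_7
private static final String ZZ_TRANS_PACKED_7
-
ZZ_TRANS_PACKED_8
private static final String ZZ_TRANS_PACKED_8
-
ZZ_TRANS_PACKED_9
private static final String ZZ_TRANS_PACKED_9
-
ZZ_TRANS_PACKED_10
private static final String ZZ_TRANS_PACKED_10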
-
ZZ_UNKNOWN_ERROR
private static final int ZZ_UNKNOWN_ERROR
Error code for "Unknown internal scanner error".
-
ZZ_NO_MATCH
private static final int ZZ_NO_MATCH
Error code for "could not match input".
-
ZZ_PUSHBACK_2BIG
private static final int ZZ_PUSHBACK_2BIG
Error code for "pushback value was too large".
-
ZZ_ERROR_MSG
private static final String[] ZZ_ERROR_MSG
-
ZZ_ATTRIBUTE
private static final int[] ZZ_ATTRIBUTE
ZZ_ATTRIBUTE[aState] contains the attributes of state aState
-
ZZ_ATTRIBUTE_PACKED_0
private static final String ZZ_ATTRIBUTE_PACKED_0
-
zzReader
private Reader zzReader
Input device.
-
zzState
private int zzState
Current state of the DFA.
-
zzLexicalState
private int zzLexicalState
Current lexical state.
-
zzBuffer
private char[] zzBuffer
This buffer contains the current text to be matched and is the source of the yytext() string.
-
zzMarkedPos
private int zzMarkedPos
Text position at the last accepting state.
-
zzCurrentPos
private int zzCurrentPos
Current text position in the buffer.
-
zzStartRead
private int zzStartRead
Marks the beginning of the yytext() string in the buffer.
-
zzEndRead
private int zzEndRead
Marks the last character in the buffer that has been read from input.
-
zzAtEOF
private boolean zzAtEOF
Whether the scanner is at the end of file.
-
zzFinalHighSurrogate
private int zzFinalHighSurrogate
-
yyline
private int yyline
Number of newlines encountered up to the start of the matched text.
-
yycolumn
private int yycolumn
Number of characters from the last newline up to the start of the matched text.
-
yychar
private long yychar
Number of characters up to the start of the matched text.
-
zzAtBOL
private boolean zzAtBOL
Whether the scanner is currently at the beginning of a line.
-
zzEOFDone
private boolean zzEOFDone
Whether the user-EOF-code has already been executed.
-
WORD_TYPE
public static final int WORD_TYPE
Alphanumeric sequences
-
NUMERIC_TYPE
public static final int NUMERIC_TYPE
Numbers
-
SOUTH_EAST_ASIAN_TYPE
public static final int SOUTH_EAST_ASIAN_TYPE
Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29.
See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
-
IDEOGRAPHIC_TYPE
public static final int IDEOGRAPHIC_TYPE
Ideographic token type
-
HIRAGANA_TYPE
public static final int HIRAGANA_TYPE
Hiragana token type
-
KATAKANA_TYPE
public static final int KATAKANA_TYPE
Katakana token type
-
HANGUL_TYPE
public static final int HANGUL_TYPE
Hangul token type
-
EMAIL_TYPE
public static final int EMAIL_TYPE
Email token type
-
URL_TYPE
public static final int URL_TYPE
URL token type
-
EMOJI_TYPE
public static final int EMOJI_TYPE
Emoji token type
-
-
Constructor Details
-
UAX29URLEmailTokenizerImpl
Creates a new scanner.
Parameters:
in - the java.io.Reader to read input from.
-
-
Method Details
-
zzUnpackcmap_top
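private static int[] zzUnpackcmap_top()
-
zzUnpackcmap_top
private static int zzUnpackcmap_top(String packed, int offset, int[] result)
-
zzUnpackcmap_blocks
private static int[] zzUnpackcmap_blocks()
-
zzUnpackcmap_blocks
private static int zzUnpackcmap_blocks(String packed, int offset, int[] result)
-
zzUnpackAction
private static int[] zzUnpackAction()
-
zzUnpackAction
private static int zzUnpackAction(String packed, int offset, int[] result)
-
zzUnpackRowMap
private static int[] zzUnpackRowMap()
-
zzUnpackRowMap
private static int zzUnpackRowMap(String packed, int offset, int[] result)
-
zzUnpackTrans
private static int[] zzUnpackTrans()
-
zzUnpackTrans
private static int zzUnpackTrans(String packed, int offset, int[] result)
-
zzUnpackAttribute
private static int[] zzUnpackAttribute()
-
zzUnpackAttribute
private static int zzUnpackAttribute(String packed, int offset, int[] result)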
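This page does not describe the packed string format. As a rough sketch, assuming the run-length scheme commonly seen in JFlex-generated scanners (pairs of characters read as a repeat count followed by a value), an unpack pair with the same shape as the signatures above could look like the following. The class UnpackSketch and the packed data are hypothetical, and the individual tables in the real class may each use a different encoding.

```java
// Sketch of run-length unpacking, assuming (count, value) character pairs.
// Not the generated code; data and names are illustrative only.
final class UnpackSketch {
    // Decodes (count, value) pairs from 'packed' into 'result', starting at
    // 'offset'. Returns the index just past the last entry written, so that
    // several packed chunks can be unpacked back to back.
    static int unpack(String packed, int offset, int[] result) {
        int i = 0;       // index in the packed string
        int j = offset;  // index in the unpacked array
        while (i < packed.length()) {
            int count = packed.charAt(i++);
            int value = packed.charAt(i++);
            do {
                result[j++] = value;
            } while (--count > 0);
        }
        return j;
    }

    // No-argument variant: allocates the table and fills it from the chunk.
    static int[] unpack() {
        String packed0 = "\3\0\2\7\1\5"; // hypothetical data: 3 x 0, 2 x 7, 1 x 5
        int[] result = new int[6];
        unpack(packed0, 0, result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(unpack()));
    }
}
```

Returning the next free index from the three-argument form is what allows a table split across several strings, such as ZZ_TRANS_PACKED_0 through ZZ_TRANS_PACKED_10, to be filled chunk by chunk.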
-
yychar
public final int yychar()
Character count processed so far
-
getText
Fills CharTermAttribute with the current token text.
-
setBufferSize
public final void setBufferSize(int numChars)
Sets the scanner buffer size in chars
-
zzCMap
private static int zzCMap(int input)
Translates raw input code points to DFA table row
-
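The two-level character class lookup suggested by ZZ_CMAP_TOP and ZZ_CMAP_BLOCKS can be sketched as below. This is an assumption-laden illustration rather than the body of zzCMap: the class CMapSketch, the 256-entry block size, and the table contents are invented for the example, and the toy tables only cover code points below 0x200.

```java
// Sketch of a two-level code point -> character class lookup.
// Hypothetical tables; only code points 0x000..0x1FF are covered.
final class CMapSketch {
    // Top level: block offset, indexed by (codePoint >> 8).
    static final int[] CMAP_TOP = {0, 256};

    // Second level: character class, indexed by (block offset | low byte).
    static final int[] CMAP_BLOCKS = new int[512];

    static {
        for (int c = '0'; c <= '9'; c++) {
            CMAP_BLOCKS[c] = 1;            // class 1: ASCII digits
        }
        for (int c = 'a'; c <= 'z'; c++) {
            CMAP_BLOCKS[c] = 2;            // class 2: ASCII lowercase letters
        }
        CMAP_BLOCKS[256 + 0x31] = 2;       // U+0131 (dotless i) treated as a letter
    }

    static int charClass(int codePoint) {
        int low = codePoint & 0xFF;
        return CMAP_BLOCKS[CMAP_TOP[codePoint >> 8] | low];
    }

    public static void main(String[] args) {
        System.out.println(charClass('7'));
        System.out.println(charClass(0x131));
    }
}
```

Splitting the map into a small top table plus shared blocks keeps the table compact when large ranges of code points fall into the same class.
-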
zzRefill
private boolean zzRefill()
Refills the input buffer.
Returns:
false, iff there was new input.
Throws:
IOException - if any I/O error occurs
-
yyclose
Closes the input reader.
Throws:
IOException - if the reader could not be closed.
-
yyreset
Resets the scanner to read from a new input stream. Does not close the old reader.
All internal variables are reset; the old input stream cannot be reused (the internal buffer is discarded and lost). The lexical state is set to YYINITIAL.
The internal scan buffer is resized down to its initial length if it has grown.
Parameters:
reader - The new input stream.
-
yyResetPosition
private final void yyResetPosition()
Resets the input position.
-
yyatEOF
public final boolean yyatEOF()
Returns whether the scanner has reached the end of the reader it reads from.
Returns:
whether the scanner has reached EOF.
-
yystate
public final int yystate()
Returns the current lexical state.
Returns:
the current lexical state.
-
yybegin
public final void yybegin(int newState)
Enters a new lexical state.
Parameters:
newState - the new lexical state
-
yytext
Returns the text matched by the current regular expression.
Returns:
the matched text.
-
yycharat
public final char yycharat(int position)
Returns the character at the given position from the matched text. It is equivalent to yytext().charAt(position), but faster.
Parameters:
position - the position of the character to fetch. A value from 0 to yylength()-1.
Returns:
the character at position.
-
yylength
public final int yylength()
How many characters were matched.
Returns:
the length of the matched text region.
-
zzScanError
private static void zzScanError(int errorCode)
Reports an error that occurred while scanning.
In a well-formed scanner (no or only correct usage of yypushback(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen". If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner etc.).
Usual syntax/scanner level error handling should be done in error fallback rules.
Parameters:
errorCode - the code of the error message to display.
-
yypushback
public void yypushback(int number)
Pushes the specified number of characters back into the input stream. They will be read again by the next call of the scanning method.
Parameters:
number - the number of characters to be read again. This number must not be greater than yylength().
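The relationship between yytext(), yylength(), yycharat(int), and yypushback(int) can be sketched in terms of the buffer positions documented under Field Details (zzBuffer, zzStartRead, zzMarkedPos). The class MatchWindowSketch below is a hypothetical stand-in, not the generated scanner. In particular, throwing IllegalArgumentException on an oversized pushback is an assumption made for the example; the documented class reports that condition through zzScanError with ZZ_PUSHBACK_2BIG.

```java
// Sketch of the "matched text window" kept by a scanner.
// Hypothetical class; field names mirror the documented ones for readability.
final class MatchWindowSketch {
    private final char[] zzBuffer;
    private final int zzStartRead; // start of the current match
    private int zzMarkedPos;       // end of the current match (exclusive)

    MatchWindowSketch(String buffered, int start, int end) {
        zzBuffer = buffered.toCharArray();
        zzStartRead = start;
        zzMarkedPos = end;
    }

    String yytext() {
        return new String(zzBuffer, zzStartRead, zzMarkedPos - zzStartRead);
    }

    int yylength() {
        return zzMarkedPos - zzStartRead;
    }

    // Same result as yytext().charAt(position), without building a String.
    char yycharat(int position) {
        return zzBuffer[zzStartRead + position];
    }

    // Shrinks the match from the right; the dropped characters are scanned again.
    void yypushback(int number) {
        if (number > yylength()) {
            // Assumption: the real scanner signals this via zzScanError instead.
            throw new IllegalArgumentException("pushback value was too large");
        }
        zzMarkedPos -= number;
    }

    public static void main(String[] args) {
        MatchWindowSketch m = new MatchWindowSketch("see a@b.org now", 4, 11);
        System.out.println(m.yytext());
        m.yypushback(4);
        System.out.println(m.yytext());
    }
}
```

Because pushback only moves the end marker, it cannot exceed the current match length, which is the constraint stated for the number parameter.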
-
getNextToken
Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
Returns:
the next token.
Throws:
IOException - if any I/O error occurs.
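A typical calling pattern for a scanner of this shape is to call getNextToken() repeatedly until it returns YYEOF, reading the match through yytext() and yychar() after each call. The sketch below shows only that loop. ToyScanner is a hypothetical whitespace splitter standing in for the real class so that the example runs without Lucene on the classpath; the numeric values of YYEOF, WORD_TYPE, and NUMERIC_TYPE are assumptions, and the toy does not implement UAX#29 segmentation or throw IOException.

```java
// Sketch of the scan loop only; ToyScanner is NOT UAX29URLEmailTokenizerImpl.
final class ScanLoopSketch {
    static final int YYEOF = -1;        // assumed value
    static final int WORD_TYPE = 0;     // hypothetical value
    static final int NUMERIC_TYPE = 1;  // hypothetical value

    // Hypothetical stand-in: whitespace-separated tokens, typed NUMERIC_TYPE
    // when every character is a digit and WORD_TYPE otherwise.
    static final class ToyScanner {
        private final String input;
        private int start = 0;
        private int end = 0;

        ToyScanner(String input) {
            this.input = input;
        }

        int getNextToken() {
            start = end;
            while (start < input.length() && Character.isWhitespace(input.charAt(start))) {
                start++;
            }
            end = start;
            if (start == input.length()) {
                return YYEOF;
            }
            while (end < input.length() && !Character.isWhitespace(input.charAt(end))) {
                end++;
            }
            return yytext().chars().allMatch(Character::isDigit) ? NUMERIC_TYPE : WORD_TYPE;
        }

        String yytext() {
            return input.substring(start, end);
        }

        int yychar() {
            return start;
        }
    }

    // Collects "type:text@offset" for every token until YYEOF is returned.
    static String scan(String text) {
        StringBuilder out = new StringBuilder();
        ToyScanner scanner = new ToyScanner(text);
        int type;
        while ((type = scanner.getNextToken()) != YYEOF) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(type).append(':').append(scanner.yytext()).append('@').append(scanner.yychar());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(scan("call 911 now"));
    }
}
```

With the documented class, the same loop would additionally need to handle the IOException declared by getNextToken().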
-