The sloppy frequency contribution of a match depends on the distance:
- highest freq for distance=0 (exact match).
- freq gets lower as distance gets higher.
Example: for query "a b"~2, a document "x a b a y" can be matched twice: once for "a b" (distance=0), and once for "b a" (distance=2).
Possibly not all valid combinations are encountered, because for efficiency we always propagate the least PhrasePosition. This allows to base on PriorityQueue and move forward faster. As result, for example, document "a b c b a" would score differently for queries "a b c"~4 and "c b a"~4, although they really are equivalent. Similarly, for doc "a b c b a f g", query "c b"~2 would get same score as "g f"~2, although "c b"~2 could be matched twice. We may want to fix this in the future (currently not, for performance reasons).
-
Field Summary
FieldsModifier and TypeFieldDescriptionprivate final DocIdSetIteratorprivate final booleanprivate booleanprivate intprivate booleanprivate booleanprivate final ImpactsDISIprivate intprivate intprivate intprivate intprivate intprivate final intprivate final PhrasePositions[]private booleanprivate final PhraseQueueprivate PhrasePositions[][]private PhrasePositions[]private final int -
Constructor Summary
ConstructorsConstructorDescriptionSloppyPhraseMatcher(PhraseQuery.PostingsAndFreq[] postings, int slop, ScoreMode scoreMode, Similarity.SimScorer scorer, float matchCost, boolean captureLeadMatch) -
Method Summary
Modifier and TypeMethodDescriptionprivate booleanadvance a PhrasePosition and update 'end', return false if exhaustedprivate booleanAt initialization (each doc), each repetition group is sorted by (query) offset.private booleanpp was just advanced.(package private) DocIdSetIteratorApproximation that only matches documents that have all terms.private voidprivate intindex of a pp2 colliding with pp, or -1 if noneintThe end offset of the current matchintThe end position of the current matchprivate voidFill the queue (all pps are already placedprivate ArrayList<ArrayList<PhrasePositions>> gatherRptGroups(LinkedHashMap<Term, Integer> rptTerms) Detect repetition groups.(package private) ImpactsDISIApproximation that is aware of impacts.private booleanwith repeats: not so simple.private booleaninitialize with checking for repeats.private booleanInitialize PhrasePositions in place.private voidno repeats: simplest case, and most common.private PhrasePositionslesser(PhrasePositions pp, PhrasePositions pp2) compare two pps, but only by position and offset(package private) floatmaxFreq()An upper bound on the number of possible matches on this documentbooleanFind the next match on the current document, returningfalseif there are none.private voidmove all PPs to their first positionprivate ArrayList<FixedBitSet> ppTermsBitSets(PhrasePositions[] rpp, HashMap<Term, Integer> tord) bit-sets - for each repeating pp, for each of its repeating terms, the term ordinal values is setprivate PhrasePositions[]repeatingPPs(HashMap<Term, Integer> rptTerms) find repeating pps, and for each, if has multi-terms, update this.hasMultiTermRptsprivate LinkedHashMap<Term, Integer> find repeating terms and assign them ordinal valuesvoidreset()Called afterPhraseMatcher.approximation()has been advanced(package private) floatThe slop-adjusted weight of the current matchprivate voidsort each repetition group by (query) offset.intThe start offset of the current matchintThe start position of the current matchtermGroups(LinkedHashMap<Term, Integer> tord, ArrayList<FixedBitSet> bb) map each term to the single group that contains itprivate final inttpPos(PhrasePositions pp) Actual position in doc of a PhrasePosition, relies on that position = tpPos - offset)private voidunion (term group) bit-sets until they are disjoint (O(n^^2)), and each group have different termsMethods inherited from class org.apache.lucene.search.PhraseMatcher
getMatchCost
-
Field Details
-
phrasePositions
-
slop
private final int slop -
numPostings
private final int numPostings -
pq
-
captureLeadMatch
private final boolean captureLeadMatch -
approximation
-
impactsApproximation
-
end
private int end -
leadPosition
private int leadPosition -
leadOffset
private int leadOffset -
leadEndOffset
private int leadEndOffset -
leadOrd
private int leadOrd -
hasRpts
private boolean hasRpts -
checkedRpts
private boolean checkedRpts -
hasMultiTermRpts
private boolean hasMultiTermRpts -
rptGroups
-
rptStack
-
positioned
private boolean positioned -
matchLength
private int matchLength
-
-
Constructor Details
-
SloppyPhraseMatcher
public SloppyPhraseMatcher(PhraseQuery.PostingsAndFreq[] postings, int slop, ScoreMode scoreMode, Similarity.SimScorer scorer, float matchCost, boolean captureLeadMatch)
-
-
Method Details
-
approximation
DocIdSetIterator approximation()Description copied from class:PhraseMatcherApproximation that only matches documents that have all terms.- Specified by:
approximationin classPhraseMatcher
-
impactsApproximation
ImpactsDISI impactsApproximation()Description copied from class:PhraseMatcherApproximation that is aware of impacts.- Specified by:
impactsApproximationin classPhraseMatcher
-
maxFreq
Description copied from class:PhraseMatcherAn upper bound on the number of possible matches on this document- Specified by:
maxFreqin classPhraseMatcher- Throws:
IOException
-
reset
Description copied from class:PhraseMatcherCalled afterPhraseMatcher.approximation()has been advanced- Specified by:
resetin classPhraseMatcher- Throws:
IOException
-
sloppyWeight
float sloppyWeight()Description copied from class:PhraseMatcherThe slop-adjusted weight of the current matchThe sum of the slop-adjusted weights is used as the freq for scoring
- Specified by:
sloppyWeightin classPhraseMatcher
-
nextMatch
Description copied from class:PhraseMatcherFind the next match on the current document, returningfalseif there are none.- Specified by:
nextMatchin classPhraseMatcher- Throws:
IOException
-
captureLead
- Throws:
IOException
-
startPosition
public int startPosition()Description copied from class:PhraseMatcherThe start position of the current match- Specified by:
startPositionin classPhraseMatcher
-
endPosition
public int endPosition()Description copied from class:PhraseMatcherThe end position of the current match- Specified by:
endPositionin classPhraseMatcher
-
startOffset
Description copied from class:PhraseMatcherThe start offset of the current match- Specified by:
startOffsetin classPhraseMatcher- Throws:
IOException
-
endOffset
Description copied from class:PhraseMatcherThe end offset of the current match- Specified by:
endOffsetin classPhraseMatcher- Throws:
IOException
-
advancePP
advance a PhrasePosition and update 'end', return false if exhausted- Throws:
IOException
-
advanceRpts
pp was just advanced. If that caused a repeater collision, resolve by advancing the lesser of the two colliding pps. Note that there can only be one collision, as by the initialization there were no collisions before pp was advanced.- Throws:
IOException
-
lesser
compare two pps, but only by position and offset -
collide
index of a pp2 colliding with pp, or -1 if none -
initPhrasePositions
Initialize PhrasePositions in place. A one time initialization for this scorer (on first doc matching all terms):- Check if there are repetitions
- If there are, find groups of repetitions.
- no repetitions: "ho my"~2
- repetitions: "ho my my"~2
- repetitions: "my ho my"~2
- Returns:
- false if PPs are exhausted (and so current doc will not be a match)
- Throws:
IOException
-
initSimple
no repeats: simplest case, and most common. It is important to keep this piece of the code simple and efficient- Throws:
IOException
-
initComplex
with repeats: not so simple.- Throws:
IOException
-
placeFirstPositions
move all PPs to their first position- Throws:
IOException
-
fillQueue
private void fillQueue()Fill the queue (all pps are already placed -
advanceRepeatGroups
At initialization (each doc), each repetition group is sorted by (query) offset. This provides the start condition: no collisions.Case 1: no multi-term repeats
It is sufficient to advance each pp in the group by one less than its group index. So lesser pp is not advanced, 2nd one advance once, 3rd one advanced twice, etc.Case 2: multi-term repeats
- Returns:
- false if PPs are exhausted.
- Throws:
IOException
-
initFirstTime
initialize with checking for repeats. Heavy work, but done only for the first candidate doc.If there are repetitions, check if multi-term postings (MTP) are involved.
Without MTP, once PPs are placed in the first candidate doc, repeats (and groups) are visible.
With MTP, a more complex check is needed, up-front, as there may be "hidden collisions".
For example P1 has {A,B}, P1 has {B,C}, and the first doc is: "A C B". At start, P1 would point to "A", p2 to "C", and it will not be identified that P1 and P2 are repetitions of each other.The more complex initialization has two parts:
(1) identification of repetition groups.
(2) advancing repeat groups at the start of the doc.
For (1), a possible solution is to just create a single repetition group, made of all repeating pps. But this would slow down the check for collisions, as all pps would need to be checked. Instead, we compute "connected regions" on the bipartite graph of postings and terms.- Throws:
IOException
-
sortRptGroups
sort each repetition group by (query) offset. Done only once (at first doc) and allows to initialize faster for each doc. -
gatherRptGroups
private ArrayList<ArrayList<PhrasePositions>> gatherRptGroups(LinkedHashMap<Term, Integer> rptTerms) throws IOExceptionDetect repetition groups. Done once - for first doc- Throws:
IOException
-
tpPos
Actual position in doc of a PhrasePosition, relies on that position = tpPos - offset) -
repeatingTerms
find repeating terms and assign them ordinal values -
repeatingPPs
find repeating pps, and for each, if has multi-terms, update this.hasMultiTermRpts -
ppTermsBitSets
bit-sets - for each repeating pp, for each of its repeating terms, the term ordinal values is set -
unionTermGroups
union (term group) bit-sets until they are disjoint (O(n^^2)), and each group have different terms -
termGroups
private HashMap<Term,Integer> termGroups(LinkedHashMap<Term, Integer> tord, ArrayList<FixedBitSet> bb) throws IOExceptionmap each term to the single group that contains it- Throws:
IOException
-