Class BM25LSimilarity
java.lang.Object
org.apache.lucene.search.similarities.Similarity
com.atlassian.confluence.internal.search.v2.lucene.BM25LSimilarity
public class BM25LSimilarity
extends org.apache.lucene.search.similarities.Similarity
Extension of BM25 which shifts the term frequency normalization formula
to boost scores of very long documents.
Moved from the confluence-search plugin into core
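This page does not give the shifted formula itself. The sketch below assumes the published BM25L formulation, in which the shift d is added to the length-normalized term frequency before saturation. The class and method names are hypothetical and are not part of this API.

```java
// Minimal sketch of a BM25L-style term frequency component.
// Assumption: follows the published BM25L formula, not the verified source of this class.
final class Bm25lTfSketch {

    /** Length-normalized term frequency: tf / (1 - b + b * docLen / avgDocLen). */
    static double normalizedTf(double tf, double b, double docLen, double avgDocLen) {
        return tf / (1.0 - b + b * docLen / avgDocLen);
    }

    /** Saturated term frequency with shift d: ((k1 + 1) * (c + d)) / (k1 + c + d). */
    static double shiftedTf(double tf, double k1, double b, double d,
                            double docLen, double avgDocLen) {
        double c = normalizedTf(tf, b, docLen, avgDocLen);
        return ((k1 + 1.0) * (c + d)) / (k1 + c + d);
    }

    public static void main(String[] args) {
        // Default parameter values listed on this page.
        double k1 = 1.25, b = 0.4, d = 0.5;
        // Same raw term frequency in a document ten times longer than average.
        double withoutShift = shiftedTf(3, k1, b, 0.0, 10_000, 1_000);
        double withShift = shiftedTf(3, k1, b, d, 10_000, 1_000);
        System.out.println("d = 0.0: " + withoutShift);
        System.out.println("d = 0.5: " + withShift);
    }
}
```

With d = 0 the expression reduces to the usual BM25 saturation term, so the shift is the only difference between the two printed values.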
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.search.similarities.Similarity
org.apache.lucene.search.similarities.Similarity.SimScorer, org.apache.lucene.search.similarities.Similarity.SimWeight
-
Field Summary
Fields
Modifier and Type, Field, Description
protected boolean discountOverlaps
    True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
-
Constructor Summary
Constructors
Constructor, Description
BM25LSimilarity()
    BM25L with these default values: k1 = 1.25, b = 0.4, d = 0.5.
BM25LSimilarity(float k1, float b, float d)
    BM25L with the supplied parameter values.
-
Method Summary
Modifier and Type, Method, Description
protected float avgFieldLength(org.apache.lucene.search.CollectionStatistics collectionStats)
final long computeNorm(org.apache.lucene.index.FieldInvertState state)
final org.apache.lucene.search.similarities.Similarity.SimWeight computeWeight(float queryBoost, org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics... termStats)
protected float decodeNormValue(byte b)
    The default implementation returns 1 / f^2, where f is SmallFloat.byte315ToFloat(byte).
protected byte encodeNormValue(float boost, int fieldLength)
    The default implementation encodes boost / sqrt(length) with SmallFloat.floatToByte315(float).
float getB()
float getDelta()
boolean getDiscountOverlaps()
float getK1()
protected float idf(long docFreq, long numDocs)
    Implemented as log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
org.apache.lucene.search.Explanation idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics termStats)
    Computes a score factor for a simple term and returns an explanation for that score factor.
org.apache.lucene.search.Explanation idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics[] termStats)
    Computes a score factor for a phrase.
protected float scorePayload(int doc, int start, int end, org.apache.lucene.util.BytesRef payload)
    The default implementation returns 1.
void setDiscountOverlaps(boolean v)
    Sets whether overlap tokens (tokens with 0 position increment) are ignored when computing norm.
final org.apache.lucene.search.similarities.Similarity.SimScorer simScorer(org.apache.lucene.search.similarities.Similarity.SimWeight stats, org.apache.lucene.index.AtomicReaderContext context)
protected float sloppyFreq(int distance)
    Implemented as 1 / (distance + 1).
String toString()
Methods inherited from class org.apache.lucene.search.similarities.Similarity
coord, queryNorm
-
Field Details
-
discountOverlaps
protected boolean discountOverlaps
True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
-
-
Constructor Details
-
BM25LSimilarity
public BM25LSimilarity(float k1, float b, float d)
BM25L with the supplied parameter values.
Parameters:
k1 - Controls non-linear term frequency normalization (saturation).
b - Controls to what degree document length normalizes tf values.
d - Shift parameter (delta).
-
BM25LSimilarity
public BM25LSimilarity()
BM25L with these default values: k1 = 1.25, b = 0.4, d = 0.5.
-
-
Method Details
-
idf
protected float idf(long docFreq, long numDocs)
Implemented as log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
Parameters:
docFreq - the number of documents containing the term
numDocs - the number of documents in the collection
Returns:
log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5))
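As a worked illustration of the idf formula above, the following standalone sketch evaluates it for a rare and a common term. The class name is hypothetical, and the code only mirrors the documented expression; it is not the source of this class.

```java
// Minimal sketch of the documented idf expression.
// Assumption: natural logarithm, as Math.log computes.
final class IdfSketch {

    /** log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5)) */
    static float idf(long docFreq, long numDocs) {
        return (float) Math.log(1 + (numDocs - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    public static void main(String[] args) {
        long numDocs = 1_000;
        // A term found in few documents gets a larger idf than a common one.
        System.out.println("rare term (docFreq = 1):     " + idf(1, numDocs));
        System.out.println("common term (docFreq = 500): " + idf(500, numDocs));
    }
}
```

Because 1 is added inside the logarithm, the result stays positive even when a term occurs in every document.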
-
sloppyFreq
protected float sloppyFreq(int distance)
Implemented as 1 / (distance + 1).
Parameters:
distance - distance
Returns:
1 / (distance + 1)
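A short sketch of the documented sloppy-frequency expression, showing how the contribution of a sloppy phrase match falls off with distance. The class name is hypothetical.

```java
// Minimal sketch of the documented expression 1 / (distance + 1).
final class SloppyFreqSketch {

    static float sloppyFreq(int distance) {
        // Float division; an exact match (distance 0) contributes 1.
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        for (int distance = 0; distance <= 3; distance++) {
            System.out.println("distance " + distance + " -> " + sloppyFreq(distance));
        }
    }
}
```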
-
scorePayload
protected float scorePayload(int doc, int start, int end, org.apache.lucene.util.BytesRef payload)
The default implementation returns 1.
Parameters:
doc - doc
start - start index
end - end index
payload - payload
Returns:
1
-
avgFieldLength
protected float avgFieldLength(org.apache.lucene.search.CollectionStatistics collectionStats)
Parameters:
collectionStats - collectionStats
Returns:
the average as sumTotalTermFreq / maxDoc, or 1 if the index does not store sumTotalTermFreq (Lucene 3.x indexes or any field that omits frequency information).
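The fallback rule above can be sketched as follows. The sketch takes the two statistics as plain long values rather than a CollectionStatistics object, and it assumes a non-positive sumTotalTermFreq means the value is not stored. The class name is hypothetical.

```java
// Minimal sketch of the documented average field length rule.
// Assumption: sumTotalTermFreq <= 0 signals that the index does not store the statistic.
final class AvgFieldLengthSketch {

    static float avgFieldLength(long sumTotalTermFreq, long maxDoc) {
        if (sumTotalTermFreq <= 0) {
            return 1f; // statistic unavailable: fall back to 1
        }
        return (float) (sumTotalTermFreq / (double) maxDoc);
    }

    public static void main(String[] args) {
        System.out.println("stored:     " + avgFieldLength(1_000, 10));
        System.out.println("not stored: " + avgFieldLength(-1, 10));
    }
}
```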
-
encodeNormValue
protected byte encodeNormValue(float boost, int fieldLength)
The default implementation encodes boost / sqrt(length) with SmallFloat.floatToByte315(float). This is compatible with Lucene's default implementation. If you change this, you should change decodeNormValue(byte) to match.
Parameters:
boost - boost
fieldLength - fieldLength
Returns:
the byte encoding of boost / sqrt(length)
-
decodeNormValue
protected float decodeNormValue(byte b)
The default implementation returns 1 / f^2, where f is SmallFloat.byte315ToFloat(byte).
Parameters:
b - byte
Returns:
1 / f^2
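To see why encodeNormValue and decodeNormValue must be changed together, the sketch below applies the two documented expressions without the SmallFloat byte quantization step. That omission is a deliberate simplification: the real single-byte encoding is lossy, so decoded values are only approximate. The class and method names are hypothetical.

```java
// Minimal sketch of the norm round trip, skipping byte quantization.
// Assumption: SmallFloat.floatToByte315 / byte315ToFloat are left out, so no precision is lost here.
final class NormRoundTripSketch {

    /** The value that would be passed to the byte encoder: boost / sqrt(fieldLength). */
    static float unquantizedNorm(float boost, int fieldLength) {
        return boost / (float) Math.sqrt(fieldLength);
    }

    /** The documented decode expression: 1 / f^2. */
    static float decode(float f) {
        return 1f / (f * f);
    }

    public static void main(String[] args) {
        // With boost = 1 the round trip recovers the field length.
        System.out.println(decode(unquantizedNorm(1f, 16)));
        // In general it recovers fieldLength / boost^2.
        System.out.println(decode(unquantizedNorm(2f, 16)));
    }
}
```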
-
setDiscountOverlaps
public void setDiscountOverlaps(boolean v)
Sets whether overlap tokens (tokens with 0 position increment) are ignored when computing norm. By default this is true, meaning overlap tokens do not count when computing norms.
Parameters:
v - discountOverlaps
-
getDiscountOverlaps
public boolean getDiscountOverlaps()
Returns:
true if overlap tokens are discounted from the document's length.
-
computeNorm
public final long computeNorm(org.apache.lucene.index.FieldInvertState state)
Specified by:
computeNorm in class org.apache.lucene.search.similarities.Similarity
-
idfExplain
public org.apache.lucene.search.Explanation idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics termStats)
Computes a score factor for a simple term and returns an explanation for that score factor.
The default implementation uses:
idf(docFreq, searcher.maxDoc());
Note that CollectionStatistics.maxDoc() is used instead of IndexReader#numDocs() because TermStatistics.docFreq() is also used, and when the latter is inaccurate, so is CollectionStatistics.maxDoc(), and in the same direction. In addition, CollectionStatistics.maxDoc() is more efficient to compute.
Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the term
Returns:
an Explanation object that includes both an idf score factor and an explanation for the term.
-
idfExplain
public org.apache.lucene.search.Explanation idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics[] termStats)
Computes a score factor for a phrase.
The default implementation sums the idf factor for each term in the phrase.
Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the terms in the phrase
Returns:
an Explanation object that includes both an idf score factor for the phrase and an explanation for each term.
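The summation described for phrases can be sketched as below. The sketch works on plain document frequencies rather than TermStatistics objects and omits the Explanation that the real method builds. The class and method names are hypothetical.

```java
// Minimal sketch: the phrase idf as the sum of per-term idf values.
// Assumption: each term uses the documented idf expression with maxDoc as the document count.
final class PhraseIdfSketch {

    static float idf(long docFreq, long numDocs) {
        return (float) Math.log(1 + (numDocs - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    static float phraseIdf(long[] docFreqs, long maxDoc) {
        float sum = 0f;
        for (long docFreq : docFreqs) {
            sum += idf(docFreq, maxDoc); // one idf factor per term in the phrase
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical two-term phrase: one rare term, one common term.
        long[] docFreqs = {5, 400};
        System.out.println("phrase idf: " + phraseIdf(docFreqs, 1_000));
    }
}
```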
-
computeWeight
public final org.apache.lucene.search.similarities.Similarity.SimWeight computeWeight(float queryBoost, org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics... termStats)
Specified by:
computeWeight in class org.apache.lucene.search.similarities.Similarity
-
simScorer
public final org.apache.lucene.search.similarities.Similarity.SimScorer simScorer(org.apache.lucene.search.similarities.Similarity.SimWeight stats, org.apache.lucene.index.AtomicReaderContext context) throws IOException
Specified by:
simScorer in class org.apache.lucene.search.similarities.Similarity
Throws:
IOException
-
toString
-
getK1
public float getK1()
Returns:
the k1 parameter
-
getB
public float getB()
Returns:
the b parameter
-
getDelta
public float getDelta()
-