Class BM25LSimilarity
java.lang.Object
org.apache.lucene.search.similarities.Similarity
com.atlassian.confluence.internal.search.v2.lucene.BM25LSimilarity
public class BM25LSimilarity
extends org.apache.lucene.search.similarities.Similarity
Extension of BM25 which shifts the term frequency normalization formula
to boost scores of very long documents.
Moved from the confluence-search plugin into core.
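The page does not spell out the shifted formula. As a minimal sketch, assuming this class follows the commonly published BM25L formulation (an assumption, not confirmed by this page), the term-frequency component could look like the code below. The class and method names are hypothetical and exist only for illustration.

```java
// Illustrative sketch only; not the Confluence or Lucene source.
// Assumes the published BM25L formulation of the shifted tf component.
final class Bm25lTfSketch {

    // Length-normalized term frequency: c' = tf / (1 - b + b * docLen / avgDocLen).
    static double normalizedTf(double tf, double docLen, double avgDocLen, double b) {
        return tf / (1.0 - b + b * docLen / avgDocLen);
    }

    // Shifted tf component: (k1 + 1) * (c' + d) / (k1 + c' + d) when the term occurs, else 0.
    static double tfComponent(double tf, double docLen, double avgDocLen,
                              double k1, double b, double d) {
        if (tf <= 0) {
            return 0.0;
        }
        double c = normalizedTf(tf, docLen, avgDocLen, b);
        return (k1 + 1.0) * (c + d) / (k1 + c + d);
    }

    public static void main(String[] args) {
        double k1 = 1.25, b = 0.4, d = 0.5; // defaults listed for the no-arg constructor
        // One occurrence in an average-length document and in a very long one.
        System.out.println(tfComponent(1, 100, 100, k1, b, d));
        System.out.println(tfComponent(1, 100_000, 100, k1, b, d));
        // Same long document without the shift (d = 0), for comparison.
        System.out.println(tfComponent(1, 100_000, 100, k1, b, 0.0));
    }
}
```

Under this assumption, a matching term in a very long document never contributes less than (k1 + 1) * d / (k1 + d), whereas with d = 0 its contribution shrinks toward zero as the document grows.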
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.search.similarities.Similarity
org.apache.lucene.search.similarities.Similarity.SimScorer, org.apache.lucene.search.similarities.Similarity.SimWeight
-
Field Summary
Fields
Modifier and Type    Field    Description
protected boolean    discountOverlaps    True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
-
Constructor Summary
Constructors
Constructor    Description
BM25LSimilarity()    BM25 with these default values: k1 = 1.25, b = 0.4, d = 0.5.
BM25LSimilarity(float k1, float b, float d)    BM25 with the supplied parameter values.
-
Method Summary
Modifier and Type    Method    Description
protected float    avgFieldLength(org.apache.lucene.search.CollectionStatistics collectionStats)
final long    computeNorm(org.apache.lucene.index.FieldInvertState state)
final org.apache.lucene.search.similarities.Similarity.SimWeight    computeWeight(float queryBoost, org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics... termStats)
protected float    decodeNormValue(byte b)    The default implementation returns 1 / f^2 where f is SmallFloat.byte315ToFloat(byte).
protected byte    encodeNormValue(float boost, int fieldLength)    The default implementation encodes boost / sqrt(length) with SmallFloat.floatToByte315(float).
float    getB()
float    getDelta()
boolean    getDiscountOverlaps()
float    getK1()
protected float    idf(long docFreq, long numDocs)    Implemented as log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
org.apache.lucene.search.Explanation    idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics termStats)    Computes a score factor for a simple term and returns an explanation for that score factor.
org.apache.lucene.search.Explanation    idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics[] termStats)    Computes a score factor for a phrase.
protected float    scorePayload(int doc, int start, int end, org.apache.lucene.util.BytesRef payload)    The default implementation returns 1.
void    setDiscountOverlaps(boolean v)    Sets whether overlap tokens (tokens with 0 position increment) are ignored when computing norm.
final org.apache.lucene.search.similarities.Similarity.SimScorer    simScorer(org.apache.lucene.search.similarities.Similarity.SimWeight stats, org.apache.lucene.index.AtomicReaderContext context)
protected float    sloppyFreq(int distance)    Implemented as 1 / (distance + 1).
String    toString()
Methods inherited from class org.apache.lucene.search.similarities.Similarity
coord, queryNorm
-
Field Details
-
discountOverlaps
protected boolean discountOverlaps
True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
-
-
Constructor Details
-
BM25LSimilarity
public BM25LSimilarity(float k1, float b, float d)
BM25 with the supplied parameter values.
Parameters:
k1 - Controls non-linear term frequency normalization (saturation).
b - Controls to what degree document length normalizes tf values.
d - Shift parameter.
-
BM25LSimilarity
public BM25LSimilarity()
BM25 with these default values: k1 = 1.25, b = 0.4, d = 0.5.
-
-
Method Details
-
idf
protected float idf(long docFreq, long numDocs)
Implemented as log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
Parameters:
docFreq - docFreq
numDocs - numDocs
Returns:
log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5))
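The idf expression above can be sketched as a standalone function. This is an illustration of the stated formula only; the class name is hypothetical and the body is not copied from the actual implementation.

```java
// Illustrative sketch of the documented idf formula; not the class's source.
final class IdfSketch {

    // log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5))
    static float idf(long docFreq, long numDocs) {
        return (float) Math.log(1 + (numDocs - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    public static void main(String[] args) {
        // A rare term gets a larger factor than a common one.
        System.out.println(idf(1, 1000));
        System.out.println(idf(900, 1000));
    }
}
```

Because of the leading `1 +` inside the logarithm, the result stays positive even for terms that occur in most documents.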
-
sloppyFreq
protected float sloppyFreq(int distance)
Implemented as 1 / (distance + 1).
Parameters:
distance - distance
Returns:
1 / (distance + 1)
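A minimal sketch of the stated expression, with a hypothetical class name, shows how the factor falls off as the distance grows.

```java
// Illustrative sketch of the documented sloppyFreq formula; not the class's source.
final class SloppyFreqSketch {

    // 1 / (distance + 1): an exact match (distance 0) counts fully,
    // larger distances count progressively less.
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        for (int distance = 0; distance <= 3; distance++) {
            System.out.println(distance + " -> " + sloppyFreq(distance));
        }
    }
}
```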
-
scorePayload
protected float scorePayload(int doc, int start, int end, org.apache.lucene.util.BytesRef payload)
The default implementation returns 1.
Parameters:
doc - doc
start - start index
end - end index
payload - payload
Returns:
1
-
avgFieldLength
protected float avgFieldLength(org.apache.lucene.search.CollectionStatistics collectionStats)
Parameters:
collectionStats - collectionStats
Returns:
the average as sumTotalTermFreq / maxDoc, or 1 if the index does not store sumTotalTermFreq (Lucene 3.x indexes or any field that omits frequency information).
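The fallback rule can be sketched with plain numbers in place of a CollectionStatistics object. The sketch assumes that a non-positive sumTotalTermFreq signals that the statistic is unavailable; that sentinel, like the class name, is an assumption made for illustration.

```java
// Illustrative sketch of the documented averaging rule; not the class's source.
final class AvgFieldLengthSketch {

    // Assumption: sumTotalTermFreq <= 0 means the index does not store the statistic.
    static float avgFieldLength(long sumTotalTermFreq, long maxDoc) {
        if (sumTotalTermFreq <= 0) {
            return 1f; // documented fallback when the statistic is missing
        }
        return (float) (sumTotalTermFreq / (double) maxDoc);
    }

    public static void main(String[] args) {
        System.out.println(avgFieldLength(1000, 10)); // statistic present
        System.out.println(avgFieldLength(-1, 10));   // statistic missing
    }
}
```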
-
encodeNormValue
protected byte encodeNormValue(float boost, int fieldLength)
The default implementation encodes boost / sqrt(length) with SmallFloat.floatToByte315(float). This is compatible with Lucene's default implementation. If you change this, then you should change decodeNormValue(byte) to match.
Parameters:
boost - boost
fieldLength - fieldLength
Returns:
boost / sqrt(length)
-
decodeNormValue
protected float decodeNormValue(byte b)
The default implementation returns 1 / f^2 where f is SmallFloat.byte315ToFloat(byte).
Parameters:
b - byte
Returns:
1 / f^2
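The two norm methods are inverses of each other in the following sense: if f = boost / sqrt(length), then 1 / (f * f) = length / boost^2, so with a boost of 1 the decoded value is the field length again. The sketch below shows this round trip while deliberately leaving out the single-byte quantization done by SmallFloat, so the real methods recover the length only approximately. All names are hypothetical.

```java
// Illustrative sketch of the encode/decode relationship without byte quantization.
final class NormRoundTripSketch {

    // The value that is encoded at index time: boost / sqrt(length).
    static float normValue(float boost, int fieldLength) {
        return boost / (float) Math.sqrt(fieldLength);
    }

    // The documented decoding rule: 1 / (f * f).
    static float decodedLength(float f) {
        return 1.0f / (f * f);
    }

    public static void main(String[] args) {
        float f = normValue(1.0f, 100);
        // With boost 1, decoding gives back approximately the field length.
        System.out.println(decodedLength(f));
    }
}
```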
-
setDiscountOverlaps
public void setDiscountOverlaps(boolean v)
Sets whether overlap tokens (tokens with 0 position increment) are ignored when computing norm. By default this is true, meaning overlap tokens do not count when computing norms.
Parameters:
v - discountOverlaps
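To make the effect of this flag concrete, the sketch below counts the length of a field from a list of token position increments. It is a simplified model with hypothetical names, not the norm computation itself; a typical source of zero-increment tokens is a synonym emitted at the same position as the original word.

```java
// Illustrative model of overlap discounting; not the class's source.
final class DiscountOverlapsSketch {

    // Counts tokens, skipping those with a position increment of 0 when discounting is on.
    static int fieldLength(int[] positionIncrements, boolean discountOverlaps) {
        int length = 0;
        for (int increment : positionIncrements) {
            if (discountOverlaps && increment == 0) {
                continue; // overlap token, e.g. a synonym stacked on the previous token
            }
            length++;
        }
        return length;
    }

    public static void main(String[] args) {
        int[] increments = {1, 1, 0, 1}; // third token overlaps the second
        System.out.println(fieldLength(increments, true));
        System.out.println(fieldLength(increments, false));
    }
}
```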
-
getDiscountOverlaps
public boolean getDiscountOverlaps()
Returns:
true if overlap tokens are discounted from the document's length.
-
computeNorm
public final long computeNorm(org.apache.lucene.index.FieldInvertState state)
Specified by:
computeNorm in class org.apache.lucene.search.similarities.Similarity
-
idfExplain
public org.apache.lucene.search.Explanation idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics termStats)
Computes a score factor for a simple term and returns an explanation for that score factor.
The default implementation uses:
idf(docFreq, searcher.maxDoc());
Note that CollectionStatistics.maxDoc() is used instead of IndexReader#numDocs() because TermStatistics.docFreq() is also used, and when the latter is inaccurate, so is CollectionStatistics.maxDoc(), and in the same direction. In addition, CollectionStatistics.maxDoc() is more efficient to compute.
Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the term
Returns:
an Explain object that includes both an idf score factor and an explanation for the term.
-
idfExplain
public org.apache.lucene.search.Explanation idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics[] termStats)
Computes a score factor for a phrase.
The default implementation sums the idf factor for each term in the phrase.
Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the terms in the phrase
Returns:
an Explain object that includes both an idf score factor for the phrase and an explanation for each term.
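The phrase rule can be sketched with plain document frequencies standing in for the TermStatistics array. The sketch reuses the idf formula documented for this class and passes maxDoc as the collection size; the class and method names are hypothetical, and no Explanation object is built.

```java
// Illustrative sketch of summing per-term idf factors for a phrase; not the class's source.
final class PhraseIdfSketch {

    // Same formula as documented for idf(long, long).
    static float idf(long docFreq, long numDocs) {
        return (float) Math.log(1 + (numDocs - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    // Phrase factor: the sum of the idf factors of all terms in the phrase.
    static float phraseIdf(long[] docFreqs, long maxDoc) {
        float sum = 0f;
        for (long docFreq : docFreqs) {
            sum += idf(docFreq, maxDoc);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Two-term phrase in a collection of 1000 documents.
        System.out.println(phraseIdf(new long[] {10, 20}, 1000));
    }
}
```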
-
computeWeight
public final org.apache.lucene.search.similarities.Similarity.SimWeight computeWeight(float queryBoost, org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics... termStats)
Specified by:
computeWeight in class org.apache.lucene.search.similarities.Similarity
-
simScorer
public final org.apache.lucene.search.similarities.Similarity.SimScorer simScorer(org.apache.lucene.search.similarities.Similarity.SimWeight stats, org.apache.lucene.index.AtomicReaderContext context) throws IOException
Specified by:
simScorer in class org.apache.lucene.search.similarities.Similarity
Throws:
IOException
-
toString
-
getK1
public float getK1()
Returns:
the k1 parameter
-
getB
public float getB()
Returns:
the b parameter
-
getDelta
public float getDelta()
-