java.lang.Object
org.apache.lucene.search.similarities.Similarity
com.atlassian.confluence.internal.search.v2.lucene.BM25LSimilarity

public class BM25LSimilarity extends org.apache.lucene.search.similarities.Similarity
Extension of BM25 which shifts the term frequency normalization formula to boost scores of very long documents.

Moved from the confluence-search plugin into core.

  • Nested Class Summary

    Nested classes/interfaces inherited from class org.apache.lucene.search.similarities.Similarity

    org.apache.lucene.search.similarities.Similarity.SimScorer, org.apache.lucene.search.similarities.Similarity.SimWeight
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    protected boolean
    discountOverlaps
    True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
  • Constructor Summary

    Constructors
    Constructor
    Description
    BM25LSimilarity()
    BM25 with these default values: k1 = 1.25, b = 0.4, d = 0.5.
    BM25LSimilarity(float k1, float b, float d)
    BM25 with the supplied parameter values.
  • Method Summary

    Modifier and Type
    Method
    Description
    protected float
    avgFieldLength(org.apache.lucene.search.CollectionStatistics collectionStats)
     
    final long
    computeNorm(org.apache.lucene.index.FieldInvertState state)
     
    final org.apache.lucene.search.similarities.Similarity.SimWeight
    computeWeight(float queryBoost, org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics... termStats)
     
    protected float
    decodeNormValue(byte b)
    The default implementation returns 1 / f^2 where f is SmallFloat.byte315ToFloat(byte).
    protected byte
    encodeNormValue(float boost, int fieldLength)
    The default implementation encodes boost / sqrt(length) with SmallFloat.floatToByte315(float).
    float
    getB()
     
    float
    getDelta()
     
    boolean
    getDiscountOverlaps()
     
    float
    getK1()
     
    protected float
    idf(long docFreq, long numDocs)
    Implemented as log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
    org.apache.lucene.search.Explanation
    idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics termStats)
    Computes a score factor for a simple term and returns an explanation for that score factor.
    org.apache.lucene.search.Explanation
    idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics[] termStats)
    Computes a score factor for a phrase.
    protected float
    scorePayload(int doc, int start, int end, org.apache.lucene.util.BytesRef payload)
    The default implementation returns 1.
    void
    setDiscountOverlaps(boolean v)
    Sets whether overlap tokens (tokens with a position increment of zero) are ignored when computing the norm.
    final org.apache.lucene.search.similarities.Similarity.SimScorer
    simScorer(org.apache.lucene.search.similarities.Similarity.SimWeight stats, org.apache.lucene.index.AtomicReaderContext context)
     
    protected float
    sloppyFreq(int distance)
    Implemented as 1 / (distance + 1).
    String
    toString()
     

    Methods inherited from class org.apache.lucene.search.similarities.Similarity

    coord, queryNorm

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Field Details

    • discountOverlaps

      protected boolean discountOverlaps
      True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
  • Constructor Details

    • BM25LSimilarity

      public BM25LSimilarity(float k1, float b, float d)
      BM25 with the supplied parameter values.
      Parameters:
      k1 - Controls non-linear term frequency normalization (saturation).
      b - Controls to what degree document length normalizes tf values.
      d - Shift applied to the term frequency normalization formula; this is what boosts the scores of very long documents.
    • BM25LSimilarity

      public BM25LSimilarity()
      BM25 with these default values:
      • k1 = 1.25
      • b = 0.4
      • d = 0.5
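
      A minimal sketch of how the three parameters might interact, assuming this class follows the published BM25L formulation: length-normalize the term frequency, add d, then apply k1 saturation. The class name Bm25lTfSketch and the method tf are hypothetical, and the actual Confluence implementation may differ in detail.

```java
/** Hypothetical sketch of a BM25L term-frequency component; not the Confluence source. */
final class Bm25lTfSketch {
    // Defaults documented for the no-argument constructor.
    static final float K1 = 1.25f;
    static final float B = 0.4f;
    static final float D = 0.5f;

    /**
     * @param freq         raw term frequency in the document
     * @param docLength    length of the document field, in tokens
     * @param avgDocLength average field length across the collection
     */
    static float tf(float freq, float docLength, float avgDocLength,
                    float k1, float b, float d) {
        if (freq <= 0f) {
            return 0f; // assumption: the shift only applies to terms that occur
        }
        // Length-normalized term frequency (b controls how strongly length matters).
        float normalized = freq / (1f - b + b * docLength / avgDocLength);
        // Shift by d before applying k1 saturation.
        float shifted = normalized + d;
        return (k1 + 1f) * shifted / (k1 + shifted);
    }

    public static void main(String[] args) {
        float avg = 100f;
        System.out.println("average-length doc:   " + tf(1f, 100f, avg, K1, B, D));
        System.out.println("long doc, d = 0.5:    " + tf(1f, 10_000f, avg, K1, B, D));
        System.out.println("long doc, d = 0 (BM25): " + tf(1f, 10_000f, avg, K1, B, 0f));
    }
}
```

      With d = 0 the sketch reduces to the plain BM25 term-frequency component, so comparing the last two printed values shows how much the shift lifts a matching term in a very long document.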
  • Method Details

    • idf

      protected float idf(long docFreq, long numDocs)
      Implemented as log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
      Parameters:
      docFreq - the number of documents that contain the term
      numDocs - the total number of documents in the collection
      Returns:
      log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
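
      The formula can be tried in isolation. The following is a minimal sketch in plain Java; the class name IdfSketch is hypothetical and the method only mirrors the documented expression.

```java
/** Hypothetical stand-alone version of the documented idf expression. */
final class IdfSketch {
    /** log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5)), using the natural logarithm. */
    static float idf(long docFreq, long numDocs) {
        return (float) Math.log(1 + (numDocs - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    public static void main(String[] args) {
        // Rare terms receive a larger factor than common ones.
        System.out.println("rare term:   " + idf(1, 1000));
        System.out.println("common term: " + idf(500, 1000));
    }
}
```

      Because of the leading 1 inside the logarithm, the result stays positive even when every document contains the term.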
    • sloppyFreq

      protected float sloppyFreq(int distance)
      Implemented as 1 / (distance + 1).
      Parameters:
      distance - the edit distance of the sloppy phrase match
      Returns:
      1 / (distance + 1).
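
      As an illustration, the expression can be written as follows. SloppyFreqSketch is a hypothetical name used only for this sketch.

```java
/** Hypothetical stand-alone version of the documented sloppyFreq expression. */
final class SloppyFreqSketch {
    /** An exact match (distance 0) counts fully; looser matches count less. */
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        for (int distance = 0; distance <= 3; distance++) {
            System.out.println(distance + " -> " + sloppyFreq(distance));
        }
    }
}
```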
    • scorePayload

      protected float scorePayload(int doc, int start, int end, org.apache.lucene.util.BytesRef payload)
      The default implementation returns 1.
      Parameters:
      doc - doc
      start - start index
      end - end index
      payload - payload
      Returns:
      1
    • avgFieldLength

      protected float avgFieldLength(org.apache.lucene.search.CollectionStatistics collectionStats)
      Parameters:
      collectionStats - collection-level statistics
      Returns:
      the average as sumTotalTermFreq / maxDoc, or returns 1 if the index does not store sumTotalTermFreq (Lucene 3.x indexes or any field that omits frequency information).
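
      A minimal sketch of the fallback described above, using plain long values in place of CollectionStatistics. The class name AvgFieldLengthSketch is hypothetical, and treating a non-positive sumTotalTermFreq as "not stored" is an assumption based on Lucene's convention of reporting -1 for unavailable statistics.

```java
/** Hypothetical sketch of the documented average-field-length fallback. */
final class AvgFieldLengthSketch {
    static float avgFieldLength(long sumTotalTermFreq, long maxDoc) {
        if (sumTotalTermFreq <= 0) {
            return 1f; // statistic unavailable: fall back to 1
        }
        // Divide in double precision, then narrow to float.
        return (float) (sumTotalTermFreq / (double) maxDoc);
    }

    public static void main(String[] args) {
        System.out.println(avgFieldLength(1_000, 10));
        System.out.println(avgFieldLength(-1, 10));
    }
}
```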
    • encodeNormValue

      protected byte encodeNormValue(float boost, int fieldLength)
      The default implementation encodes boost / sqrt(length) with SmallFloat.floatToByte315(float). This is compatible with Lucene's default implementation. If you change this, then you should change decodeNormValue(byte) to match.
      Parameters:
      boost - index-time boost of the field
      fieldLength - length of the field, in tokens
      Returns:
      boost / sqrt(length), encoded as a single byte
    • decodeNormValue

      protected float decodeNormValue(byte b)
      The default implementation returns 1 / f^2 where f is SmallFloat.byte315ToFloat(byte).
      Parameters:
      b - the encoded norm byte
      Returns:
      1 / f^2
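
      The two methods are inverses of each other up to quantization: if f = boost / sqrt(length), then 1 / f^2 = length / boost^2, which is the field length when boost is 1. The sketch below shows that relationship while deliberately leaving out the lossy single-byte encoding done by SmallFloat; the class and method names are hypothetical.

```java
/** Hypothetical sketch of the encode/decode relationship, without byte quantization. */
final class NormRoundTripSketch {
    /** The value that would be handed to SmallFloat.floatToByte315(float). */
    static float encodeUnquantized(float boost, int fieldLength) {
        return boost / (float) Math.sqrt(fieldLength);
    }

    /** Applied to the float recovered by SmallFloat.byte315ToFloat(byte). */
    static float decodeUnquantized(float f) {
        return 1f / (f * f);
    }

    public static void main(String[] args) {
        float f = encodeUnquantized(1f, 16);
        System.out.println("encoded: " + f);
        System.out.println("decoded: " + decodeUnquantized(f));
    }
}
```

      In the real methods the intermediate value is squeezed into one byte, so the decoded length is only an approximation of the original.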
    • setDiscountOverlaps

      public void setDiscountOverlaps(boolean v)
      Sets whether overlap tokens (Tokens with 0 position increment) are ignored when computing norm. By default this is true, meaning overlap tokens do not count when computing norms.
      Parameters:
      v - true to discount overlap tokens from the document's length
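
      To make the effect concrete, here is a minimal sketch of how the flag could change the length used for the norm. It assumes this class follows the same approach as Lucene's BM25Similarity, which subtracts FieldInvertState.getNumOverlap() from FieldInvertState.getLength(); the class and method names below are hypothetical.

```java
/** Hypothetical sketch of how discountOverlaps affects the counted field length. */
final class DiscountOverlapsSketch {
    /**
     * @param length     total number of tokens in the field
     * @param numOverlap number of tokens with a position increment of zero
     */
    static int countedLength(int length, int numOverlap, boolean discountOverlaps) {
        return discountOverlaps ? length - numOverlap : length;
    }

    public static void main(String[] args) {
        // Example: 10 tokens, 2 of them synonyms injected at the same position.
        System.out.println(countedLength(10, 2, true));
        System.out.println(countedLength(10, 2, false));
    }
}
```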
    • getDiscountOverlaps

      public boolean getDiscountOverlaps()
      Returns:
      true if overlap tokens are discounted from the document's length.
    • computeNorm

      public final long computeNorm(org.apache.lucene.index.FieldInvertState state)
      Specified by:
      computeNorm in class org.apache.lucene.search.similarities.Similarity
    • idfExplain

      public org.apache.lucene.search.Explanation idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics termStats)
      Computes a score factor for a simple term and returns an explanation for that score factor.

      The default implementation uses:

       idf(docFreq, searcher.maxDoc());
       

      Note that CollectionStatistics.maxDoc() is used instead of IndexReader#numDocs() because TermStatistics.docFreq() is also used: when docFreq() is inaccurate, so is CollectionStatistics.maxDoc(), and in the same direction. In addition, CollectionStatistics.maxDoc() is more efficient to compute.

      Parameters:
      collectionStats - collection-level statistics
      termStats - term-level statistics for the term
      Returns:
      an Explanation object that includes both an idf score factor and an explanation for the term.
    • idfExplain

      public org.apache.lucene.search.Explanation idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics[] termStats)
      Computes a score factor for a phrase.

      The default implementation sums the idf factor for each term in the phrase.

      Parameters:
      collectionStats - collection-level statistics
      termStats - term-level statistics for the terms in the phrase
      Returns:
      an Explanation object that includes both an idf score factor for the phrase and an explanation for each term.
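
      The summation for a phrase can be sketched as follows, with an array of document frequencies standing in for TermStatistics[] and a single maxDoc value standing in for CollectionStatistics. The class name PhraseIdfSketch is hypothetical, and the per-term formula is the one documented for idf(long, long).

```java
/** Hypothetical sketch of summing per-term idf factors for a phrase. */
final class PhraseIdfSketch {
    static float idf(long docFreq, long numDocs) {
        return (float) Math.log(1 + (numDocs - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    /** Sums the idf factor of every term in the phrase. */
    static float phraseIdf(long[] docFreqs, long maxDoc) {
        float sum = 0f;
        for (long docFreq : docFreqs) {
            sum += idf(docFreq, maxDoc);
        }
        return sum;
    }

    public static void main(String[] args) {
        // A phrase made of one rare and one common term.
        System.out.println(phraseIdf(new long[] {3, 400}, 1000));
    }
}
```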
    • computeWeight

      public final org.apache.lucene.search.similarities.Similarity.SimWeight computeWeight(float queryBoost, org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics... termStats)
      Specified by:
      computeWeight in class org.apache.lucene.search.similarities.Similarity
    • simScorer

      public final org.apache.lucene.search.similarities.Similarity.SimScorer simScorer(org.apache.lucene.search.similarities.Similarity.SimWeight stats, org.apache.lucene.index.AtomicReaderContext context) throws IOException
      Specified by:
      simScorer in class org.apache.lucene.search.similarities.Similarity
      Throws:
      IOException
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • getK1

      public float getK1()
      Returns:
      the k1 parameter
    • getB

      public float getB()
      Returns:
      the b parameter
    • getDelta

      public float getDelta()