java.lang.Object

org.apache.lucene.search.similarities.Similarity

com.atlassian.confluence.internal.search.v2.lucene.BM25LSimilarity

public class BM25LSimilarity extends org.apache.lucene.search.similarities.Similarity

Extension of BM25 which shifts the term frequency normalization formula to boost scores of very long documents.

Moved from the confluence-search plugin into core.
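
To illustrate the shift, here is a minimal sketch of a BM25L-style term-frequency factor. It assumes the class follows the commonly published BM25L formulation (Lv and Zhai), which this reference does not confirm. The class name `Bm25lTfSketch` and its method are hypothetical and are not part of this API. In that formulation the shift is only applied to terms that occur in the document.

```java
// Hypothetical sketch, not the Confluence implementation.
final class Bm25lTfSketch {
    /**
     * BM25L-style term-frequency factor (assumed formulation):
     *   c      = tf / (1 - b + b * docLen / avgDocLen)
     *   factor = (k1 + 1) * (c + d) / (k1 + c + d)
     * With d = 0 this reduces to the plain BM25 factor.
     */
    static double tfFactor(double tf, double docLen, double avgDocLen,
                           double k1, double b, double d) {
        double c = tf / (1.0 - b + b * docLen / avgDocLen);
        return (k1 + 1.0) * (c + d) / (k1 + c + d);
    }

    public static void main(String[] args) {
        // Documented defaults: k1 = 1.25, b = 0.4, d = 0.5.
        double k1 = 1.25, b = 0.4;
        // One occurrence in a document 50 times longer than average.
        double plain = tfFactor(1, 5000, 100, k1, b, 0.0);
        double shifted = tfFactor(1, 5000, 100, k1, b, 0.5);
        System.out.println("BM25  tf factor: " + plain);
        System.out.println("BM25L tf factor: " + shifted);
    }
}
```

Because the factor grows with `c + d`, a positive shift keeps the contribution of a matching term in a very long document from collapsing toward zero.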

Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.search.similarities.Similarity
org.apache.lucene.search.similarities.Similarity.SimScorer, org.apache.lucene.search.similarities.Similarity.SimWeight
Field Summary

Fields

Modifier and Type

Field

Description

protected boolean

discountOverlaps

True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
Constructor Summary

Constructors

Constructor

Description

BM25LSimilarity()

BM25 with these default values: k1 = 1.25, b = 0.4, d = 0.5.

BM25LSimilarity(float k1, float b, float d)

BM25 with the supplied parameter values.
Method Summary

Modifier and Type

Method

Description

protected float

avgFieldLength(org.apache.lucene.search.CollectionStatistics collectionStats)

final long

computeNorm(org.apache.lucene.index.FieldInvertState state)

final org.apache.lucene.search.similarities.Similarity.SimWeight

computeWeight(float queryBoost, org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics... termStats)

protected float

decodeNormValue(byte b)

The default implementation returns 1 / f² where f is SmallFloat.byte315ToFloat(byte).

protected byte

encodeNormValue(float boost, int fieldLength)

The default implementation encodes boost / sqrt(length) with SmallFloat.floatToByte315(float).

float

getB()

float

getDelta()

boolean

getDiscountOverlaps()

float

getK1()

protected float

idf(long docFreq, long numDocs)

Implemented as log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).

org.apache.lucene.search.Explanation

idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics termStats)

Computes a score factor for a simple term and returns an explanation for that score factor.

org.apache.lucene.search.Explanation

idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics[] termStats)

Computes a score factor for a phrase.

protected float

scorePayload(int doc, int start, int end, org.apache.lucene.util.BytesRef payload)

The default implementation returns 1.

void

setDiscountOverlaps(boolean v)

Sets whether overlap tokens (tokens with a position increment of zero) are ignored when computing the norm.

final org.apache.lucene.search.similarities.Similarity.SimScorer

simScorer(org.apache.lucene.search.similarities.Similarity.SimWeight stats, org.apache.lucene.index.AtomicReaderContext context)

protected float

sloppyFreq(int distance)

Implemented as 1 / (distance + 1).

String

toString()

Methods inherited from class org.apache.lucene.search.similarities.Similarity
coord, queryNorm

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Field Details
- discountOverlaps
  
  protected boolean discountOverlaps
  
  True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
Constructor Details
- BM25LSimilarity
  
  public BM25LSimilarity(float k1, float b, float d)
  
  BM25 with the supplied parameter values.
  
  Parameters:
  
  k1 - Controls non-linear term frequency normalization (saturation).
  
  b - Controls to what degree document length normalizes tf values.
  
  d - shift parameter.
- BM25LSimilarity
  
  public BM25LSimilarity()
  
  BM25 with these default values: k1 = 1.25, b = 0.4, d = 0.5.
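
To make the roles of k1 and b concrete, the following sketch evaluates the two standard BM25 building blocks these parameters control: the length-normalization factor 1 - b + b * docLen / avgDocLen and the saturation curve tf / (tf + k1). The class and method names are hypothetical, and how BM25LSimilarity combines these pieces internally is not shown in this reference.

```java
// Hypothetical sketch of the standard BM25 building blocks.
final class Bm25ParamSketch {
    // b = 0 ignores document length; b = 1 normalizes fully by relative length.
    static double lengthNorm(double b, double docLen, double avgDocLen) {
        return 1.0 - b + b * docLen / avgDocLen;
    }

    // Approaches 1 as tf grows; a larger k1 makes the curve saturate more slowly.
    static double saturation(double tf, double k1) {
        return tf / (tf + k1);
    }

    public static void main(String[] args) {
        // A document five times longer than average, with the default b = 0.4.
        System.out.println("lengthNorm: " + lengthNorm(0.4, 500, 100));
        // Saturation at increasing term frequencies, with the default k1 = 1.25.
        for (int tf : new int[] {1, 2, 5, 20}) {
            System.out.println("tf=" + tf + " -> " + saturation(tf, 1.25));
        }
    }
}
```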
Method Details
- idf
  
  protected float idf(long docFreq, long numDocs)
  
  Implemented as log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
  
  Parameters:
  
  docFreq - the number of documents that contain the term
  
  numDocs - the total number of documents in the collection
  
  Returns:
  
  log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
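
The formula above can be sketched as a standalone function. `IdfSketch` is a hypothetical name used only for illustration; the arithmetic is taken directly from the documented expression.

```java
// Hypothetical standalone version of the documented idf formula.
final class IdfSketch {
    static float idf(long docFreq, long numDocs) {
        return (float) Math.log(1 + (numDocs - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    public static void main(String[] args) {
        // Rarer terms receive a larger idf; the value stays positive even
        // when every document contains the term.
        System.out.println("rare term:     " + idf(1, 1000));
        System.out.println("common term:   " + idf(500, 1000));
        System.out.println("in every doc:  " + idf(1000, 1000));
    }
}
```

The leading `1 +` inside the logarithm keeps the result positive for very common terms, where the classic BM25 idf could turn negative.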
- sloppyFreq
  
  protected float sloppyFreq(int distance)
  
  Implemented as 1 / (distance + 1).
  
  Parameters:
  
  distance - the edit distance of the sloppy phrase match
  
  Returns:
  
  1 / (distance + 1).
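
As a quick worked example of the expression above, the sketch below evaluates it for a few distances. `SloppyFreqSketch` is a hypothetical name, not part of this API.

```java
// Hypothetical standalone version of the documented sloppyFreq formula.
final class SloppyFreqSketch {
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        System.out.println(sloppyFreq(0)); // exact phrase match: 1.0
        System.out.println(sloppyFreq(1)); // 0.5
        System.out.println(sloppyFreq(3)); // 0.25
    }
}
```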
- scorePayload
  
  protected float scorePayload(int doc, int start, int end, org.apache.lucene.util.BytesRef payload)
  
  The default implementation returns 1.
  
  Parameters:
  
  doc - doc
  
  start - start index
  
  end - end index
  
  payload - payload
  
  Returns:
  
  1
- avgFieldLength
  
  protected float avgFieldLength(org.apache.lucene.search.CollectionStatistics collectionStats)
  
  Parameters:
  
  collectionStats - collection-level statistics
  
  Returns:
  
  the average field length, computed as sumTotalTermFreq / maxDoc, or 1 if the index does not store sumTotalTermFreq (Lucene 3.x indexes or any field that omits frequency information).
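
The fallback rule can be sketched with plain numbers in place of a CollectionStatistics object. This assumes a missing sumTotalTermFreq is reported as a non-positive value, which is how Lucene 4.x signals it as far as I know; `AvgFieldLengthSketch` is a hypothetical name.

```java
// Hypothetical sketch; takes raw statistics instead of CollectionStatistics.
final class AvgFieldLengthSketch {
    static float avgFieldLength(long sumTotalTermFreq, long maxDoc) {
        if (sumTotalTermFreq <= 0) {
            // Assumption: a non-positive value means the statistic is not stored.
            return 1f;
        }
        return (float) (sumTotalTermFreq / (double) maxDoc);
    }

    public static void main(String[] args) {
        System.out.println(avgFieldLength(1000, 10)); // 100.0
        System.out.println(avgFieldLength(-1, 10));   // falls back to 1.0
    }
}
```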
- encodeNormValue
  
  protected byte encodeNormValue(float boost, int fieldLength)
  
  The default implementation encodes boost / sqrt(length) with SmallFloat.floatToByte315(float). This is compatible with Lucene's default implementation. If you change this, then you should change decodeNormValue(byte) to match.
  
  Parameters:
  
  boost - boost
  
  fieldLength - fieldLength
  
  Returns:
  
  boost / sqrt(length)
- decodeNormValue
  
  protected float decodeNormValue(byte b)
  
  The default implementation returns 1 / f² where f is SmallFloat.byte315ToFloat(byte).
  
  Parameters:
  
  b - byte
  
  Returns:
  
  1 / f²
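
The two formulas are inverses of each other up to the boost: with f = boost / sqrt(length), decoding gives 1 / f² = length / boost². The sketch below shows that arithmetic only. It deliberately skips the single-byte SmallFloat quantization, so the real methods recover the length only approximately. All names are hypothetical.

```java
// Hypothetical sketch of the encode/decode arithmetic without byte quantization.
final class NormRoundTripSketch {
    // The value that encodeNormValue would pass to SmallFloat.floatToByte315.
    static float encodeUnquantized(float boost, int fieldLength) {
        return boost / (float) Math.sqrt(fieldLength);
    }

    // The value that decodeNormValue derives from the decoded float f.
    static float decode(float f) {
        return 1f / (f * f);
    }

    public static void main(String[] args) {
        // boost = 1 recovers the field length: prints 16.0
        System.out.println(decode(encodeUnquantized(1f, 16)));
        // boost = 2 divides the recovered length by 4: prints 4.0
        System.out.println(decode(encodeUnquantized(2f, 16)));
    }
}
```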
- setDiscountOverlaps
  
  public void setDiscountOverlaps(boolean v)
  
  Sets whether overlap tokens (tokens with a position increment of zero) are ignored when computing the norm. By default this is true, meaning overlap tokens do not count when computing norms.
  
  Parameters:
  
  v - discountOverlaps
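
To show what the flag changes, the sketch below computes the field length that would feed the norm. It assumes the length is derived as in Lucene's stock BM25Similarity (total token count minus the overlap count when discounting is on); `DiscountOverlapsSketch` and its method are hypothetical.

```java
// Hypothetical sketch of how discountOverlaps affects the counted field length.
final class DiscountOverlapsSketch {
    static int effectiveLength(int length, int numOverlap, boolean discountOverlaps) {
        return discountOverlaps ? length - numOverlap : length;
    }

    public static void main(String[] args) {
        // A field with 10 tokens, 2 of which are synonyms injected at the
        // same position as another token (position increment of zero).
        System.out.println(effectiveLength(10, 2, true));  // 8
        System.out.println(effectiveLength(10, 2, false)); // 10
    }
}
```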
- getDiscountOverlaps
  
  public boolean getDiscountOverlaps()
  Returns:
  
  true if overlap tokens are discounted from the document's length.
  
  See Also:
  
  setDiscountOverlaps(boolean)
- computeNorm
  
  public final long computeNorm(org.apache.lucene.index.FieldInvertState state)
  
  Specified by:
  
  computeNorm in class org.apache.lucene.search.similarities.Similarity
- idfExplain
  
  public org.apache.lucene.search.Explanation idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics termStats)
  Computes a score factor for a simple term and returns an explanation for that score factor.
  
  The default implementation uses:
  
  idf(termStats.docFreq(), collectionStats.maxDoc());
  
  Note that CollectionStatistics.maxDoc() is used instead of IndexReader#numDocs() because TermStatistics.docFreq() is also used, and when the latter is inaccurate, so is CollectionStatistics.maxDoc(), and in the same direction. In addition, CollectionStatistics.maxDoc() is more efficient to compute.
  
  Parameters:
  
  collectionStats - collection-level statistics
  
  termStats - term-level statistics for the term
  
  Returns:
  
  an Explanation object that includes both an idf score factor and an explanation for the term.
- idfExplain
  
  public org.apache.lucene.search.Explanation idfExplain(org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics[] termStats)
  
  Computes a score factor for a phrase.
  
  The default implementation sums the idf factor for each term in the phrase.
  
  Parameters:
  
  collectionStats - collection-level statistics
  
  termStats - term-level statistics for the terms in the phrase
  
  Returns:
  
  an Explanation object that includes both an idf score factor for the phrase and an explanation for each term.
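
The summation for a phrase can be sketched as follows, using document frequencies directly instead of TermStatistics objects. `PhraseIdfSketch` is a hypothetical name, and the per-term idf is the formula documented for idf(long, long).

```java
// Hypothetical sketch: phrase idf as the sum of per-term idf values.
final class PhraseIdfSketch {
    static float idf(long docFreq, long numDocs) {
        return (float) Math.log(1 + (numDocs - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    static float phraseIdf(long[] docFreqs, long maxDoc) {
        float sum = 0f;
        for (long docFreq : docFreqs) {
            sum += idf(docFreq, maxDoc); // one contribution per phrase term
        }
        return sum;
    }

    public static void main(String[] args) {
        // A two-term phrase: one rare term and one common term.
        System.out.println(phraseIdf(new long[] {10, 200}, 1000));
    }
}
```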
- computeWeight
  
  public final org.apache.lucene.search.similarities.Similarity.SimWeight computeWeight(float queryBoost, org.apache.lucene.search.CollectionStatistics collectionStats, org.apache.lucene.search.TermStatistics... termStats)
  
  Specified by:
  
  computeWeight in class org.apache.lucene.search.similarities.Similarity
- simScorer
  
  public final org.apache.lucene.search.similarities.Similarity.SimScorer simScorer(org.apache.lucene.search.similarities.Similarity.SimWeight stats, org.apache.lucene.index.AtomicReaderContext context) throws IOException
  
  Specified by:
  
  simScorer in class org.apache.lucene.search.similarities.Similarity
  
  Throws:
  
  IOException
- toString
  
  public String toString()
  
  Overrides:
  
  toString in class Object
- getK1
  
  public float getK1()
  Returns:
  
  the k1 parameter
  
  See Also:
  
  BM25LSimilarity(float, float, float)
- getB
  
  public float getB()
  Returns:
  
  the b parameter
  
  See Also:
  
  BM25LSimilarity(float, float, float)
- getDelta
  
  public float getDelta()
