Feature: Add naive hybrid search (FTS + vector search) #5068

Open
prrao87 opened this issue Mar 18, 2025 · 3 comments
Labels
feature New features or missing components of existing features

Comments


prrao87 commented Mar 18, 2025

API

Python

Description

With the upcoming vector index, we should consider adding support for hybrid search, where the user may want to combine FTS (keyword-based) and vector search results to improve overall relevance.

A naive, simple implementation is shown in this blog post and would be a good starting point.

Here's how it would look in Python:

import bm25s

# Assumes `model` (a sentence-transformers encoder), `chunk_table` (the vector
# index's table handle), and `chunks_df` (a DataFrame of chunk contents) are
# defined elsewhere.

def hybrid_search(query, top_k=5, vector_weight=0.7):
    """Perform hybrid search using both vector similarity and BM25 keyword matching."""
    # Vector search: over-fetch 2x top_k candidates so reranking has room to work
    query_embedding = model.encode(query, normalize_embeddings=True)
    vector_results = chunk_table.search(query_embedding).metric('cosine').limit(top_k * 2).to_pandas()
    # Convert cosine distance into a similarity score in [0, 1]
    vector_results['vector_score'] = 1 - vector_results['_distance']

    # Keyword search with bm25s: create the corpus from our chunks
    corpus = chunks_df['content'].tolist()

    # Tokenize the corpus and index it; with no corpus stored on the retriever,
    # retrieve() returns document indices rather than the documents themselves
    corpus_tokens = bm25s.tokenize(corpus)
    retriever = bm25s.BM25()
    retriever.index(corpus_tokens)

    # Tokenize the query and retrieve BM25 scores for all documents
    query_tokens = bm25s.tokenize(query)
    docs, scores = retriever.retrieve(query_tokens, k=len(corpus))

    # Map BM25 scores to our dataframe indices (assumes the vector table's row
    # indices line up with positions in chunks_df)
    bm25_scores = {i: scores[0, idx] for idx, i in enumerate(docs[0])}
    vector_results['bm25_score'] = vector_results.index.map(lambda x: bm25_scores.get(x, 0))

    # Normalize BM25 scores to [0, 1] so they are comparable with the vector scores
    if vector_results['bm25_score'].max() > 0:
        vector_results['bm25_score'] = vector_results['bm25_score'] / vector_results['bm25_score'].max()

    # Combine scores with the user-specified weighting
    vector_results['combined_score'] = (
        vector_weight * vector_results['vector_score'] +
        (1 - vector_weight) * vector_results['bm25_score'])

    return vector_results.sort_values('combined_score', ascending=False).head(top_k)

This maps the retrieved BM25 results onto the vector search indices, normalizes the BM25 scores to align with the vector search scores, and then applies the user-specified vector_weight, which controls how much weight the vector search results get relative to the FTS results. 0.7 is a sensible default, because giving too much weight to FTS can result in fewer matches overall (vector search is more forgiving).
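For illustration, a call might look like the following (a sketch only; it assumes the model, chunk_table, and chunks_df objects above are set up, and that the returned rows carry a content column):

# Hypothetical usage, assuming the setup above
results = hybrid_search("graph query optimization", top_k=5, vector_weight=0.7)
print(results[['content', 'vector_score', 'bm25_score', 'combined_score']])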

I think this would be a decent base implementation at the C++ level, which can then be exposed to the higher-level clients with the specified arguments without too much work. Alternatively, we could implement this only in Python for an initial release, so that users at least have an option to get relevant keyword-based results alongside vector search results from their graphs.


prrao87 commented Mar 18, 2025

@acquamarin what do you think?

acquamarin commented

I think we can implement this naive solution for the upcoming release.

if vector_results['bm25_score'].max() > 0:
    vector_results['bm25_score'] = vector_results['bm25_score'] / vector_results['bm25_score'].max()

I am not sure why the bm25_score can be smaller than 0, and why they normalize the bm25_score. We should do a bit of research on the algorithm there.


prrao87 commented Mar 19, 2025

I am not sure why the bm25_score can be smaller than 0, and why they normalize the bm25_score.

I'm also not sure if BM25 is ever less than zero -- the if condition might just be overcautious. But the BM25 scores need to be normalized because they are typically large numbers greater than 1, while the vector search scores (typically cosine similarity) are between 0 and 1; to put them on the same scale, the BM25 scores must also be brought into the 0-1 range.
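Here's a minimal sketch of the scale mismatch (illustrative numbers only):

import numpy as np

# Illustrative only: raw BM25 scores are unbounded positive numbers,
# while cosine similarities already live in [0, 1]
bm25_raw = np.array([12.4, 7.9, 0.0, 3.1])
cosine_sim = np.array([0.83, 0.41, 0.77, 0.65])

# Max-normalization brings BM25 onto the same [0, 1] scale before mixing
bm25_norm = bm25_raw / bm25_raw.max() if bm25_raw.max() > 0 else bm25_raw
combined = 0.7 * cosine_sim + 0.3 * bm25_norm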

One more thing to note is that our vector search returns distance metrics, so we'd have to transform those into a "score" by subtracting from 1, and then use that to compare against BM25.
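In other words (assuming normalized embeddings, so cosine distance = 1 - cosine similarity):

# Flip the distance returned by the vector index back into a similarity score
cosine_distance = 0.23               # illustrative value from the vector index
vector_score = 1 - cosine_distance   # 0.77, comparable to normalized BM25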

All this assumes cosine similarity only (with a dot product metric the meaning of the scores changes, so this function would only support the cosine similarity metric and not custom distance metrics).
