Feature: Add naive hybrid search (FTS + vector search) #5068

Open
prrao87 opened this issue Mar 18, 2025 · 3 comments
Labels
feature New features or missing components of existing features

Comments


prrao87 commented Mar 18, 2025

API

Python

Description

With the upcoming vector index, we should consider adding support for hybrid search, where the user may want to combine FTS (keyword-based) and vector search results to improve overall relevance.

A naive, simple implementation is shown in this blog post and would be a good starting point.

Here's how it would look in Python:

import bm25s

# Assumes `model` (a sentence-transformers encoder), `chunk_table` (the vector
# index's table handle), and `chunks_df` (a DataFrame of chunk contents) are
# defined elsewhere.

def hybrid_search(query, top_k=5, vector_weight=0.7):
    """Perform hybrid search using both vector similarity and BM25 keyword matching."""
    # Vector search: over-fetch 2x top_k candidates so reranking has room to work
    query_embedding = model.encode(query, normalize_embeddings=True)
    vector_results = chunk_table.search(query_embedding).metric('cosine').limit(top_k * 2).to_pandas()
    # Convert cosine distance into a similarity score in [0, 1]
    vector_results['vector_score'] = 1 - vector_results['_distance']

    # Keyword search with bm25s: create the corpus from our chunks
    corpus = chunks_df['content'].tolist()

    # Tokenize the corpus and index it; with no corpus stored on the retriever,
    # retrieve() returns document indices rather than the documents themselves
    corpus_tokens = bm25s.tokenize(corpus)
    retriever = bm25s.BM25()
    retriever.index(corpus_tokens)

    # Tokenize the query and retrieve BM25 scores for all documents
    query_tokens = bm25s.tokenize(query)
    docs, scores = retriever.retrieve(query_tokens, k=len(corpus))

    # Map BM25 scores to our dataframe indices (assumes the vector table's row
    # indices line up with positions in chunks_df)
    bm25_scores = {i: scores[0, idx] for idx, i in enumerate(docs[0])}
    vector_results['bm25_score'] = vector_results.index.map(lambda x: bm25_scores.get(x, 0))

    # Normalize BM25 scores to [0, 1] so they are comparable with the vector scores
    if vector_results['bm25_score'].max() > 0:
        vector_results['bm25_score'] = vector_results['bm25_score'] / vector_results['bm25_score'].max()

    # Combine scores with the user-specified weighting
    vector_results['combined_score'] = (
        vector_weight * vector_results['vector_score'] +
        (1 - vector_weight) * vector_results['bm25_score'])

    return vector_results.sort_values('combined_score', ascending=False).head(top_k)

This maps the retrieved BM25 results onto the vector search indices, normalizes the BM25 scores to align with the vector search scores, and then applies the user-specified vector_weight, which controls how much weight the vector search results get relative to the FTS results. 0.7 is a sensible default, because giving too much weight to FTS can result in fewer matches overall (vector search is more forgiving).
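For illustration, a call might look like the following (a sketch only; it assumes the model, chunk_table, and chunks_df objects above are set up, and that the returned rows carry a content column):

# Hypothetical usage, assuming the setup above
results = hybrid_search("graph query optimization", top_k=5, vector_weight=0.7)
print(results[['content', 'vector_score', 'bm25_score', 'combined_score']])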

I think this would be a decent base implementation at the C++ level, which can then be exposed to the higher-level clients with the specified arguments without too much work. Alternatively, we could implement this only in Python for an initial release, so that users at least have an option to get relevant keyword-based results alongside vector search results from their graphs.


prrao87 commented Mar 18, 2025

@acquamarin what do you think?

acquamarin commented

I think we can implement this naive solution for the upcoming release.

if vector_results['bm25_score'].max() > 0:
    vector_results['bm25_score'] = vector_results['bm25_score'] / vector_results['bm25_score'].max()

I am not sure why the bm25_score can be smaller than 0, and why they normalize the bm25_score. We should do a bit of research on the algorithm there.


prrao87 commented Mar 19, 2025

I am not sure why the bm25_score can be smaller than 0, and why they normalize the bm25_score.

I'm also not sure if BM25 is ever less than zero -- the if condition might just be overcautious. But the BM25 scores need to be normalized because they are typically large numbers greater than 1, while the vector search scores (typically cosine similarity) are between 0 and 1; to put them on the same scale, the BM25 scores must also be brought into the 0-1 range.
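Here's a minimal sketch of the scale mismatch (illustrative numbers only):

import numpy as np

# Illustrative only: raw BM25 scores are unbounded positive numbers,
# while cosine similarities already live in [0, 1]
bm25_raw = np.array([12.4, 7.9, 0.0, 3.1])
cosine_sim = np.array([0.83, 0.41, 0.77, 0.65])

# Max-normalization brings BM25 onto the same [0, 1] scale before mixing
bm25_norm = bm25_raw / bm25_raw.max() if bm25_raw.max() > 0 else bm25_raw
combined = 0.7 * cosine_sim + 0.3 * bm25_norm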

One more thing to note is that our vector search returns distance metrics, so we'd have to transform those into a "score" by subtracting from 1, and then use that to compare against BM25.
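In other words (assuming normalized embeddings, so cosine distance = 1 - cosine similarity):

# Flip the distance returned by the vector index back into a similarity score
cosine_distance = 0.23               # illustrative value from the vector index
vector_score = 1 - cosine_distance   # 0.77, comparable to normalized BM25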

All this assumes cosine similarity only (with a dot product metric the meaning of the scores changes, so this function would only support the cosine similarity metric and not custom distance metrics).
