With the upcoming vector index, we should consider adding support for hybrid search, where the user may want to combine FTS (keyword-based) and vector search results to improve the relevance of results.
A naive, simple implementation is shown in this blog post and would be a good starting point.
Here's how it would look in Python:
    import bm25s

    # `model`, `chunk_table`, and `chunks_df` are assumed to be defined elsewhere, as in the blog post.
    def hybrid_search(query, top_k=5, vector_weight=0.7):
        """Perform hybrid search using both vector similarity and BM25 keyword matching."""
        # Vector search
        query_embedding = model.encode(query, normalize_embeddings=True)
        vector_results = chunk_table.search(query_embedding).metric('cosine').limit(top_k * 2).to_pandas()
        vector_results['vector_score'] = 1 - vector_results['_distance']
        # Keyword search with BM25S: create the corpus from our chunks
        corpus = chunks_df['content'].tolist()
        # Tokenize the corpus and index it
        corpus_tokens = bm25s.tokenize(corpus)
        # No corpus is attached to the retriever, so retrieve() returns document indices rather than texts
        retriever = bm25s.BM25()
        retriever.index(corpus_tokens)
        # Tokenize the query and retrieve scores for all documents
        query_tokens = bm25s.tokenize(query)
        docs, scores = retriever.retrieve(query_tokens, k=len(corpus))
        # Map BM25 scores to our dataframe indices
        # (assumes the index of vector_results lines up with row positions in chunks_df)
        bm25_scores = {i: scores[0, idx] for idx, i in enumerate(docs[0])}
        vector_results['bm25_score'] = vector_results.index.map(lambda x: bm25_scores.get(x, 0))
        # Normalize BM25 scores (the check avoids dividing by zero when nothing matched)
        if vector_results['bm25_score'].max() > 0:
            vector_results['bm25_score'] = vector_results['bm25_score'] / vector_results['bm25_score'].max()
        # Combine scores with weighting
        vector_results['combined_score'] = (
            vector_weight * vector_results['vector_score']
            + (1 - vector_weight) * vector_results['bm25_score']
        )
        return vector_results.sort_values('combined_score', ascending=False).head(top_k)
This maps the retrieved results from BM25 to the vector search indices, normalizes the BM25 scores to align with the vector search results, and then applies the user-specified vector_weight, which instructs the algorithm how much to weight the vector search results over the FTS results. 0.7 is a sensible default (because too high a weight given to FTS can result in fewer matches overall - vector search is more forgiving).
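To make the effect of vector_weight concrete, here is a minimal sketch of the weighted sum on its own. The chunk names and scores are made up for illustration, and `combine_scores` and `rank` are hypothetical helpers, not part of any existing API.

```python
def combine_scores(vector_score, bm25_score, vector_weight=0.7):
    """Weighted sum of two scores that are both already on a 0-to-1 scale."""
    return vector_weight * vector_score + (1 - vector_weight) * bm25_score


# Hypothetical (vector_score, normalized bm25_score) pairs for two chunks
candidates = {
    "semantic_match": (0.90, 0.10),  # close in embedding space, few shared keywords
    "keyword_match": (0.60, 1.00),   # best keyword hit, weaker embedding match
}


def rank(vector_weight):
    """Return chunk names ordered by combined score, best first."""
    scored = {
        name: combine_scores(v, b, vector_weight)
        for name, (v, b) in candidates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


ranking_vector_heavy = rank(0.9)   # vector search dominates
ranking_keyword_heavy = rank(0.3)  # keyword search dominates
```

With these toy numbers the top result flips depending on the weight, which is why the weight probably ought to be a user-facing argument rather than a fixed constant.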
I think this would be a decent base implementation to do at the C++ level, that can then be exposed to the higher level clients with the specified arguments without too much work. Alternatively, we could only implement this in Python for an initial release, so that users at least have an option to get relevant keyword-based results alongside vector search results from their graphs.
I am not sure why the bm25_score can be smaller than 0, and why they normalize the bm25_score.
I'm also not sure if BM25 is ever less than zero -- the if condition might just be overcautious. But the BM25 scores need to be normalized because they are typically large numbers > 1, whereas vector search scores (typically cosine similarity) fall between 0 and 1, so to put both on the same scale, the BM25 scores must also be brought into the 0 to 1 range.
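For what it's worth, the `max() > 0` check in the snippet also covers the case where no chunk matches any query term: every score is then 0, and dividing by the max would be a division by zero. As far as I know, the classic Okapi IDF term can go negative for very common terms, while several common variants keep it non-negative. A rough sketch of the normalization step in isolation, with a hypothetical helper name and made-up scores:

```python
def normalize_by_max(scores):
    """Scale non-negative scores into [0, 1] by dividing by the largest one."""
    top = max(scores, default=0.0)
    if top <= 0:
        # Nothing matched (or the list is empty): return zeros instead of dividing by zero
        return [0.0 for _ in scores]
    return [s / top for s in scores]


raw_bm25 = [7.5, 2.5, 0.0]  # hypothetical raw BM25 scores
normalized = normalize_by_max(raw_bm25)
no_matches = normalize_by_max([0.0, 0.0])
```

Dividing by the max keeps the best keyword hit at exactly 1, so it can compete with a near-perfect vector match. Note that this makes the scores relative to each query's candidate set rather than comparable across queries.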
One more thing to note is that we return distance metrics in our vector search, so we'd have to transform the distance into a "score" by subtracting it from 1 and then use that to compare against BM25.
All this assumes cosine similarity only (with a dot product metric the meaning of the scores changes, so this function would support only the cosine similarity metric and not custom distance metrics).
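A small sketch of why the conversion only works for cosine, assuming the distance returned is cosine distance defined as 1 minus cosine similarity (that definition is an assumption here, as are the helper names and vectors):

```python
import math


def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return 1 - dot_product(a, b) / (norm_a * norm_b)


def distance_to_score(distance):
    """Turn a cosine distance back into a similarity-style score."""
    return 1 - distance


a = [1.0, 0.0]
b = [3.0, 4.0]
b_scaled = [30.0, 40.0]  # same direction as b, ten times the length

cosine_score = distance_to_score(cosine_distance(a, b))
cosine_score_scaled = distance_to_score(cosine_distance(a, b_scaled))
dot_score = dot_product(a, b)
dot_score_scaled = dot_product(a, b_scaled)
```

The cosine-based score is unchanged when a vector is rescaled, while the dot product grows with vector length and has no fixed upper bound, so `1 - distance` would not give a value comparable to a normalized BM25 score.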