
[BUG] Retrieval quality is very poor #5274

Open
FanZhang91 opened this issue Mar 21, 2025 · 11 comments
Labels
bug Something isn't working

Comments

@FanZhang91

FanZhang91 commented Mar 21, 2025

After building a knowledge base from PDF documents (containing both text and table data), I used Langchain-Chatchat for retrieval. Even without multi-turn dialogue involved, retrieval quality is very poor. What could be the cause, and how can it be improved?


embedding model: quentinz/bge-large-zh-v1.5
llm model: qwen2:7b

FanZhang91 added the bug (Something isn't working) label Mar 21, 2025

dosubot bot commented Mar 21, 2025

I found some similar threads related to retrieval issues in Langchain-Chatchat:

  1. An open issue discussing retrieval problems with the rank-bm25 algorithm affecting overall retrieval performance [1].
  2. An open issue about retrieval capabilities being worse in version 3.0 compared to version 2.0, particularly with case sensitivity in search terms [2].

For your specific issue with poor retrieval performance on PDF documents containing text and tables, consider the following:

  • Ensure that the PDF_OCR_THRESHOLD settings are appropriately configured. This setting controls the OCR process by determining which images within a PDF should undergo OCR based on their size relative to the page. Adjusting these thresholds might improve the processing of documents with mixed content types [3].

  • Review the configuration options in config_work_space.py and _model_config.py to optimize retrieval performance. You can adjust settings like set_vector_search_top_k, set_score_threshold, and set_default_search_engine to fine-tune the retrieval process [4][5].

These adjustments might help improve the retrieval results for your PDF documents; a hedged sketch of the relevant settings follows.
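
A minimal sketch of the settings mentioned above, assuming a kb_config.py-style configuration file; the file location, setting names, and defaults may differ across Langchain-Chatchat versions:

# Hypothetical excerpt from a kb_config.py-style settings file; names and
# defaults vary across Langchain-Chatchat versions.

# PDF_OCR_THRESHOLD is (width_ratio, height_ratio): an image embedded in a
# PDF page is sent to OCR only if its width/height exceed these fractions
# of the page size. Lowering the values OCRs more (smaller) images, which
# helps when tables are embedded as images.
PDF_OCR_THRESHOLD = (0.3, 0.3)

# Retrieval tuning: return more candidates and relax the score cut-off.
VECTOR_SEARCH_TOP_K = 5
SCORE_THRESHOLD = 0.5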


@FanZhang91

@dosubot Thanks for your reply. With your help, I found that adjusting the PDF_OCR_THRESHOLD parameter can improve the retrieval results. However, retrieval for questions related to table data is still not ideal. Do you have any better suggestions?


dosubot bot commented Mar 21, 2025

To improve retrieval results for questions related to table data in PDF documents using Langchain-Chatchat, you can utilize the EnsembleRetrieverService, which combines different retrieval methods. Specifically, using a combination of BM25 and FAISS retrievers can enhance retrieval performance. Here's how you can set it up:

from langchain.retrievers import EnsembleRetriever
from langchain.vectorstores import VectorStore
from langchain_community.retrievers import BM25Retriever
import jieba  # Chinese word segmentation for the BM25 tokenizer

# BaseRetrieverService is Langchain-Chatchat's retriever wrapper base class
# (defined under chatchat/server/file_rag/retrievers in the 0.3.x layout).
from chatchat.server.file_rag.retrievers.base import BaseRetrieverService


class EnsembleRetrieverService(BaseRetrieverService):
    @staticmethod
    def from_vectorstore(
        vectorstore: VectorStore,
        top_k: int,
        score_threshold: int | float,
    ):
        # Dense retriever: FAISS similarity search with a score cut-off.
        faiss_retriever = vectorstore.as_retriever(
            search_type="similarity_score_threshold",
            search_kwargs={"score_threshold": score_threshold, "k": top_k},
        )
        # Sparse retriever: BM25 over the same documents, tokenized with
        # jieba so Chinese keywords are matched at word granularity.
        docs = list(vectorstore.docstore._dict.values())
        bm25_retriever = BM25Retriever.from_documents(
            docs,
            preprocess_func=jieba.lcut_for_search,
        )
        bm25_retriever.k = top_k
        # Fuse both ranked lists (reciprocal rank fusion) with equal weights.
        ensemble_retriever = EnsembleRetriever(
            retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
        )
        return EnsembleRetrieverService(retriever=ensemble_retriever, top_k=top_k)

This setup uses both BM25 and FAISS retrievers with equal weights to balance between keyword-based and vector-based retrieval, potentially improving the accuracy of retrieving table data from PDF documents. Adjust the top_k and score_threshold parameters to fine-tune the retrieval results based on your specific needs [1].
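
A hypothetical usage sketch, assuming `my_faiss_store` is a FAISS vectorstore already built from your PDF chunks and that the wrapper exposes its retriever attribute (the query string is a placeholder):

# Hypothetical usage; my_faiss_store and the query are placeholders.
service = EnsembleRetrieverService.from_vectorstore(
    vectorstore=my_faiss_store,
    top_k=5,
    score_threshold=0.5,
)
docs = service.retriever.get_relevant_documents("报错码 E1234 的原因")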


@FanZhang91

@dosubot I found that the latest version of Langchain-Chatchat uses both BM25 and FAISS retrievers by default when retrieving knowledge. However, it still cannot retrieve content related to table information. Is there any other solution?


dosubot bot commented Mar 24, 2025

To enhance the retrieval of table data in Langchain-Chatchat, consider the following approaches:

  1. Configuration Options: Utilize specific configuration options designed to improve table data retrieval. For instance, you can specify table_names and table_comments to guide the model in selecting the correct tables and understanding their context. This can help in managing resources and avoiding errors related to token length [1][2].

  2. SQLDatabaseSequentialChain: If table_names are not specified, Langchain-Chatchat uses SQLDatabaseSequentialChain to predict which tables are needed before inputting them into SQLDatabaseChain. This method helps in efficiently handling structured data by avoiding unnecessary data processing [3].

  3. Preprocessing Steps: Ensure that the preprocessing steps for PDF documents are optimized. The text extraction process might not preserve the tabular structure, and OCR might not accurately capture complex table structures. Adjusting these processes can potentially improve the retrieval of table data [4] (see the preprocessing sketch after this list).

These strategies can help improve the retrieval of table data in Langchain-Chatchat. Adjusting configurations and preprocessing methods can lead to better handling of structured data like tables.
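
For point 3, one possible preprocessing step (not part of Langchain-Chatchat; pdfplumber is an assumed extra dependency) is to extract tables separately and serialize each row as "header: value" text, so an error-code/cause pairing survives chunking and embedding:

import pdfplumber
from langchain_core.documents import Document

def table_rows_to_documents(pdf_path: str) -> list[Document]:
    """Turn each table row into its own Document so a row is retrieved whole."""
    docs = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables():
                if not table or len(table) < 2:
                    continue  # skip empty or header-only tables
                header = [h or "" for h in table[0]]
                for row in table[1:]:
                    # "column: cell" pairs keep row semantics in the embedding.
                    text = "; ".join(f"{h}: {c}" for h, c in zip(header, row) if c)
                    docs.append(Document(
                        page_content=text,
                        metadata={"source": pdf_path, "page": page_no},
                    ))
    return docs

The resulting documents can then be indexed into the same vector store as the plain-text chunks.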


@FanZhang91

FanZhang91 commented Mar 24, 2025

@dosubot I will try these constructive suggestions, thank you very much for your reply. One last question: how can the Langchain-Chatchat framework be extended to multimodal scenarios?


dosubot bot commented Mar 24, 2025

To extend the Langchain-Chatchat framework to multimodal scenarios, you can leverage its existing support for multimodal data processing. Here are some steps and considerations:

  1. Utilize Existing Modules: Langchain-Chatchat already includes modules for multimodal data processing, such as the text2images tool in the tools_factory module, which indicates support for converting text to images [1].

  2. Configure Multimodal Models: You can configure the platform to support various multimodal models. Options include image2text_models, text2image_models, speech2text_models, and text2speech_models. These can be set to specific models or "auto" to automatically detect available models [2][3].

  3. Adjust Platform Settings: Use the MODEL_PLATFORMS configuration to specify platforms and models for different modalities. This includes setting up API details and concurrency options for each platform [2] (a hypothetical excerpt follows after this list).

  4. Tool Configuration: Enable or disable specific tools related to multimodal processing, such as vqa_processor for visual question answering and aqa_processor for audio question answering. Configure paths for models and tokenizers as needed [2].

  5. Dependencies: Ensure that necessary dependencies for multimodal capabilities are installed, such as opencv-python for image processing, PyMuPDF for PDF handling, and rapidocr_onnxruntime for OCR tasks [4].

By configuring these options and utilizing the existing tools, you can extend Langchain-Chatchat to handle multimodal scenarios effectively.
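
A hypothetical excerpt of a 0.3.x-style model_settings.yaml showing where these multimodal model options live; the platform name, URL, and values below are placeholders, not defaults:

MODEL_PLATFORMS:
  - platform_name: xinference        # placeholder platform
    platform_type: xinference
    api_base_url: http://127.0.0.1:9997/v1
    api_key: EMPTY
    api_concurrencies: 5
    llm_models: auto
    embed_models: auto
    image2text_models: auto          # e.g. a vision-language model for VQA
    text2image_models: auto
    speech2text_models: auto
    text2speech_models: auto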


@2069348297

Bro, where did you see the retrieval flow? Looking at the code, I only see a call to a client API and no retrieval flow after that. I didn't install from source; could that be the reason?

@FanZhang91

> Bro, where did you see the retrieval flow? Looking at the code, I only see a call to a client API and no retrieval flow after that. I didn't install from source; could that be the reason?

[screenshot of the method attached] For knowledge-base chat, the server-side entry point is this method in libs/chatchat-server/chatchat/server/api_server/kb_routes.py.

@2069348297

2069348297 commented Mar 24, 2025 via email

@FanZhang91

FanZhang91 commented Mar 24, 2025

@dosu The text entered in the UI is: "Based on the error message in this image, help me analyze the cause."
Given that input, I need to retrieve the cause of the error message shown in the red box of the image (the cause corresponds to a table entry in the knowledge base). However, every retrieval returns log entries that are structurally similar to the image content but unrelated to the error message, instead of the error cause I want.

My current approach is to first run OCR on the image, then use the LLM to process the OCR output together with the user's prompt from the UI (hoping the model outputs only the error-related text), and finally feed the key error message together with the user's input into bge-large-zh-v1.5 for knowledge retrieval. When using the LLM on the OCR output and the UI input, I ran into some problems:
(1) The model outputs a summary of the OCR text and ignores the key error message I actually want (see the extraction sketch after this comment);
(2) In practice this pipeline is fragile for VQA and depends heavily on the LLM's comprehension ability (before the retrieval step);
(3) Based on the existing Langchain-Chatchat framework, is there a good way to extend knowledge retrieval to VQA-style retrieval?
(4) For pure-text retrieval, does the existing Langchain-Chatchat framework support this kind of causal (error → cause) retrieval?

[screenshot attached]
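
A minimal sketch for problem (1), assuming qwen2:7b is served behind an OpenAI-compatible endpoint (e.g. Ollama; the base_url is a placeholder): constrain the prompt so the model copies out the error lines verbatim instead of summarizing.

from openai import OpenAI

# Placeholder endpoint; point this at whatever serves qwen2:7b.
client = OpenAI(base_url="http://127.0.0.1:11434/v1", api_key="EMPTY")

def extract_error_text(ocr_text: str) -> str:
    prompt = (
        "Below is OCR output from a screenshot. Copy out, verbatim, only the "
        "lines that are error messages or error codes. Do not summarize, do "
        "not explain, and output nothing else.\n\nOCR text:\n" + ocr_text
    )
    resp = client.chat.completions.create(
        model="qwen2:7b",
        temperature=0,  # deterministic extraction, discourages paraphrasing
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

The extracted error text, rather than the full OCR dump, is then embedded with bge-large-zh-v1.5 and used as the retrieval query.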
