
[BUG] Retrieval quality is very poor #5274

Open
FanZhang91 opened this issue Mar 21, 2025 · 11 comments
Labels
bug Something isn't working

Comments

@FanZhang91

FanZhang91 commented Mar 21, 2025

After building a knowledge base from PDF documents (containing both text and table data), I used Langchain-Chatchat for retrieval. Even without multi-turn dialogue involved, retrieval quality is very poor. What could be the cause, and how can it be improved?


embedding model: quentinz/bge-large-zh-v1.5
llm model: qwen2:7b

FanZhang91 added the bug (Something isn't working) label Mar 21, 2025

dosubot bot commented Mar 21, 2025

I found some similar threads related to retrieval issues in Langchain-Chatchat:

  1. An open issue discussing retrieval problems with the rank-bm25 algorithm affecting overall retrieval performance [1].
  2. An open issue about retrieval capabilities being worse in version 3.0 compared to version 2.0, particularly with case sensitivity in search terms [2].

For your specific issue with poor retrieval performance on PDF documents containing text and tables, consider the following:

  • Ensure that the PDF_OCR_THRESHOLD settings are appropriately configured. This setting controls the OCR process by determining which images within a PDF should undergo OCR based on their size relative to the page. Adjusting these thresholds might improve the processing of documents with mixed content types [3].

  • Review the configuration options in config_work_space.py and _model_config.py to optimize retrieval performance. You can adjust settings like set_vector_search_top_k, set_score_threshold, and set_default_search_engine to fine-tune the retrieval process [4][5].

These adjustments might help improve the retrieval results for your PDF documents; a hedged sketch of the relevant settings follows.
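
A minimal sketch of the settings mentioned above, assuming a kb_config.py-style configuration file; the file location, setting names, and defaults may differ across Langchain-Chatchat versions:

# Hypothetical excerpt from a kb_config.py-style settings file; names and
# defaults vary across Langchain-Chatchat versions.

# PDF_OCR_THRESHOLD is (width_ratio, height_ratio): an image embedded in a
# PDF page is sent to OCR only if its width/height exceed these fractions
# of the page size. Lowering the values OCRs more (smaller) images, which
# helps when tables are embedded as images.
PDF_OCR_THRESHOLD = (0.3, 0.3)

# Retrieval tuning: return more candidates and relax the score cut-off.
VECTOR_SEARCH_TOP_K = 5
SCORE_THRESHOLD = 0.5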


@FanZhang91

@dosubot Thanks for your reply. With your help, I found that adjusting the PDF_OCR_THRESHOLD parameter can improve the retrieval results. However, retrieval for questions related to table data is still not ideal. Do you have any better suggestions?


dosubot bot commented Mar 21, 2025

To improve retrieval results for questions related to table data in PDF documents using Langchain-Chatchat, you can utilize the EnsembleRetrieverService, which combines different retrieval methods. Specifically, using a combination of BM25 and FAISS retrievers can enhance retrieval performance. Here's how you can set it up:

from langchain.retrievers import EnsembleRetriever
from langchain.vectorstores import VectorStore
from langchain_community.retrievers import BM25Retriever
import jieba  # Chinese word segmentation for the BM25 tokenizer

# BaseRetrieverService is Langchain-Chatchat's retriever wrapper base class
# (defined under chatchat/server/file_rag/retrievers in the 0.3.x layout).
from chatchat.server.file_rag.retrievers.base import BaseRetrieverService


class EnsembleRetrieverService(BaseRetrieverService):
    @staticmethod
    def from_vectorstore(
        vectorstore: VectorStore,
        top_k: int,
        score_threshold: int | float,
    ):
        # Dense retriever: FAISS similarity search with a score cut-off.
        faiss_retriever = vectorstore.as_retriever(
            search_type="similarity_score_threshold",
            search_kwargs={"score_threshold": score_threshold, "k": top_k},
        )
        # Sparse retriever: BM25 over the same documents, tokenized with
        # jieba so Chinese keywords are matched at word granularity.
        docs = list(vectorstore.docstore._dict.values())
        bm25_retriever = BM25Retriever.from_documents(
            docs,
            preprocess_func=jieba.lcut_for_search,
        )
        bm25_retriever.k = top_k
        # Fuse both ranked lists (reciprocal rank fusion) with equal weights.
        ensemble_retriever = EnsembleRetriever(
            retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
        )
        return EnsembleRetrieverService(retriever=ensemble_retriever, top_k=top_k)

This setup uses both BM25 and FAISS retrievers with equal weights to balance between keyword-based and vector-based retrieval, potentially improving the accuracy of retrieving table data from PDF documents. Adjust the top_k and score_threshold parameters to fine-tune the retrieval results based on your specific needs [1].
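
A hypothetical usage sketch, assuming `my_faiss_store` is a FAISS vectorstore already built from your PDF chunks and that the wrapper exposes its retriever attribute (the query string is a placeholder):

# Hypothetical usage; my_faiss_store and the query are placeholders.
service = EnsembleRetrieverService.from_vectorstore(
    vectorstore=my_faiss_store,
    top_k=5,
    score_threshold=0.5,
)
docs = service.retriever.get_relevant_documents("报错码 E1234 的原因")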


@FanZhang91

@dosubot I found that the latest version of Langchain-Chatchat uses both BM25 and FAISS retrievers by default when retrieving knowledge. However, it still cannot retrieve content related to table information. Is there any other solution?


dosubot bot commented Mar 24, 2025

To enhance the retrieval of table data in Langchain-Chatchat, consider the following approaches:

  1. Configuration Options: Utilize specific configuration options designed to improve table data retrieval. For instance, you can specify table_names and table_comments to guide the model in selecting the correct tables and understanding their context. This can help in managing resources and avoiding errors related to token length [1][2].

  2. SQLDatabaseSequentialChain: If table_names are not specified, Langchain-Chatchat uses SQLDatabaseSequentialChain to predict which tables are needed before inputting them into SQLDatabaseChain. This method helps in efficiently handling structured data by avoiding unnecessary data processing [3].

  3. Preprocessing Steps: Ensure that the preprocessing steps for PDF documents are optimized. The text extraction process might not preserve the tabular structure, and OCR might not accurately capture complex table structures. Adjusting these processes can potentially improve the retrieval of table data [4] (see the preprocessing sketch after this list).

These strategies can help improve the retrieval of table data in Langchain-Chatchat. Adjusting configurations and preprocessing methods can lead to better handling of structured data like tables.
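
For point 3, one possible preprocessing step (not part of Langchain-Chatchat; pdfplumber is an assumed extra dependency) is to extract tables separately and serialize each row as "header: value" text, so an error-code/cause pairing survives chunking and embedding:

import pdfplumber
from langchain_core.documents import Document

def table_rows_to_documents(pdf_path: str) -> list[Document]:
    """Turn each table row into its own Document so a row is retrieved whole."""
    docs = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables():
                if not table or len(table) < 2:
                    continue  # skip empty or header-only tables
                header = [h or "" for h in table[0]]
                for row in table[1:]:
                    # "column: cell" pairs keep row semantics in the embedding.
                    text = "; ".join(f"{h}: {c}" for h, c in zip(header, row) if c)
                    docs.append(Document(
                        page_content=text,
                        metadata={"source": pdf_path, "page": page_no},
                    ))
    return docs

The resulting documents can then be indexed into the same vector store as the plain-text chunks.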


@FanZhang91

FanZhang91 commented Mar 24, 2025

@dosubot I will try these constructive suggestions, thank you very much for your reply. One last question: how can the Langchain-Chatchat framework be extended to multimodal scenarios?


dosubot bot commented Mar 24, 2025

To extend the Langchain-Chatchat framework to multimodal scenarios, you can leverage its existing support for multimodal data processing. Here are some steps and considerations:

  1. Utilize Existing Modules: Langchain-Chatchat already includes modules for multimodal data processing, such as the text2images tool in the tools_factory module, which indicates support for converting text to images [1].

  2. Configure Multimodal Models: You can configure the platform to support various multimodal models. Options include image2text_models, text2image_models, speech2text_models, and text2speech_models. These can be set to specific models or "auto" to automatically detect available models [2][3].

  3. Adjust Platform Settings: Use the MODEL_PLATFORMS configuration to specify platforms and models for different modalities. This includes setting up API details and concurrency options for each platform [2] (a hypothetical excerpt follows after this list).

  4. Tool Configuration: Enable or disable specific tools related to multimodal processing, such as vqa_processor for visual question answering and aqa_processor for audio question answering. Configure paths for models and tokenizers as needed [2].

  5. Dependencies: Ensure that necessary dependencies for multimodal capabilities are installed, such as opencv-python for image processing, PyMuPDF for PDF handling, and rapidocr_onnxruntime for OCR tasks [4].

By configuring these options and utilizing the existing tools, you can extend Langchain-Chatchat to handle multimodal scenarios effectively.
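
A hypothetical excerpt of a 0.3.x-style model_settings.yaml showing where these multimodal model options live; the platform name, URL, and values below are placeholders, not defaults:

MODEL_PLATFORMS:
  - platform_name: xinference        # placeholder platform
    platform_type: xinference
    api_base_url: http://127.0.0.1:9997/v1
    api_key: EMPTY
    api_concurrencies: 5
    llm_models: auto
    embed_models: auto
    image2text_models: auto          # e.g. a vision-language model for VQA
    text2image_models: auto
    speech2text_models: auto
    text2speech_models: auto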


@2069348297

Bro, where did you see the retrieval flow? Looking at the code, I only see a call to a client API and no retrieval flow after that. I didn't install from source; could that be the reason?

@FanZhang91

> Bro, where did you see the retrieval flow? Looking at the code, I only see a call to a client API and no retrieval flow after that. I didn't install from source; could that be the reason?

[screenshot of the method attached] For knowledge-base chat, the server-side entry point is this method in libs/chatchat-server/chatchat/server/api_server/kb_routes.py.

@2069348297

2069348297 commented Mar 24, 2025 via email

@FanZhang91

FanZhang91 commented Mar 24, 2025

@dosu The text entered in the UI is: "Based on the error message in this image, help me analyze the cause."
Given that input, I need to retrieve the cause of the error message shown in the red box of the image (the cause corresponds to a table entry in the knowledge base). However, every retrieval returns log entries that are structurally similar to the image content but unrelated to the error message, instead of the error cause I want.

My current approach is to first run OCR on the image, then use the LLM to process the OCR output together with the user's prompt from the UI (hoping the model outputs only the error-related text), and finally feed the key error message together with the user's input into bge-large-zh-v1.5 for knowledge retrieval. When using the LLM on the OCR output and the UI input, I ran into some problems:
(1) The model outputs a summary of the OCR text and ignores the key error message I actually want (see the extraction sketch after this comment);
(2) In practice this pipeline is fragile for VQA and depends heavily on the LLM's comprehension ability (before the retrieval step);
(3) Based on the existing Langchain-Chatchat framework, is there a good way to extend knowledge retrieval to VQA-style retrieval?
(4) For pure-text retrieval, does the existing Langchain-Chatchat framework support this kind of causal (error → cause) retrieval?

[screenshot attached]
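
A minimal sketch for problem (1), assuming qwen2:7b is served behind an OpenAI-compatible endpoint (e.g. Ollama; the base_url is a placeholder): constrain the prompt so the model copies out the error lines verbatim instead of summarizing.

from openai import OpenAI

# Placeholder endpoint; point this at whatever serves qwen2:7b.
client = OpenAI(base_url="http://127.0.0.1:11434/v1", api_key="EMPTY")

def extract_error_text(ocr_text: str) -> str:
    prompt = (
        "Below is OCR output from a screenshot. Copy out, verbatim, only the "
        "lines that are error messages or error codes. Do not summarize, do "
        "not explain, and output nothing else.\n\nOCR text:\n" + ocr_text
    )
    resp = client.chat.completions.create(
        model="qwen2:7b",
        temperature=0,  # deterministic extraction, discourages paraphrasing
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

The extracted error text, rather than the full OCR dump, is then embedded with bge-large-zh-v1.5 and used as the retrieval query.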
