
Fix the performance problem when the total number of segments approaches 100 million. #16643

Open
5 tasks done
wanghan5 opened this issue Mar 24, 2025 · 2 comments · May be fixed by #17083
Labels
🐞 bug Something isn't working

Comments

wanghan5 commented Mar 24, 2025

Self Checks

  • This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please submit issues in English, otherwise they will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

1.1.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

After importing about 300 thousand files into a knowledge base, queries became very slow.

Image

✔️ Expected Behavior

The query result should return within 1 second.

❌ Actual Behavior

Image
It took about 11 seconds.

@dosubot dosubot bot added the 🐞 bug Something isn't working label Mar 24, 2025
dosubot bot commented Mar 24, 2025

Hello @wanghan5! I'm here to help you with any bugs, questions, or contributions. Let's work together to solve this issue.

I found a similar open issue related to performance problems in the Dify codebase. The suggested solution involves optimizing the IndexingRunner().run() method, particularly the _update_segment_keywords() function, by:

  1. Extracting keywords during segment insertion.
  2. Implementing batch processing to reduce database operation overhead.
  3. Utilizing multi-threading or multi-processing to handle multiple segments concurrently.

Additionally, it suggests modifying the DatasetDocumentStore.add_documents method to extract keywords during document insertion, ensuring keywords are inserted into document_segments rather than documents [1].
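A minimal sketch of the batching-plus-concurrency idea suggested above. The `extract_keywords` stand-in and all names here are hypothetical placeholders, not Dify's actual implementation; the real extractor and bulk-write logic would live in `IndexingRunner` / `DatasetDocumentStore`:

```python
from concurrent.futures import ThreadPoolExecutor


def extract_keywords(text: str) -> list[str]:
    # Placeholder for a real keyword extractor (Dify uses its own logic);
    # here we just take the unique lowercased words as a stand-in.
    return sorted(set(text.lower().split()))


def index_segments(segments: list[str],
                   batch_size: int = 100,
                   workers: int = 4) -> list[list[str]]:
    """Extract keywords for many segments concurrently, batch by batch,
    so keyword rows could be written with one bulk insert per batch
    instead of one database round-trip per segment."""
    results: list[list[str]] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, len(segments), batch_size):
            batch = segments[start:start + batch_size]
            # In a real fix, a single bulk UPDATE/INSERT per batch would
            # replace per-segment writes here.
            results.extend(pool.map(extract_keywords, batch))
    return results
```

The point of the sketch is the shape of the change: keyword extraction happens at insertion time, work is chunked into batches, and the batches are processed by a thread pool rather than one segment at a time.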


wanghan5 (Author) commented

Documents have already been indexed. The performance problem happens when trying to retrieve them.

I found some issues involving incorrect indexes and misuse of properties.

There is a knowledge-base rate limit in the cloud version: https://docs.dify.ai/guides/knowledge-base/knowledge-request-rate-limit

I would like to submit a pull request later. Hope it helps.
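To illustrate the kind of index problem described above (the actual fix is in the linked PR, and these table and column names are invented for the demo): a retrieval query that filters segments by dataset and status does a full table scan unless a matching composite index exists. A self-contained SQLite sketch:

```python
import sqlite3

# Hypothetical schema loosely modeled on a "document_segments" table;
# names are illustrative only, not Dify's real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document_segments ("
    " id INTEGER PRIMARY KEY, dataset_id TEXT, status TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO document_segments (dataset_id, status, content)"
    " VALUES (?, ?, ?)",
    [(f"ds{i % 10}", "completed", f"segment {i}") for i in range(1000)],
)

# Composite index covering the retrieval query's WHERE clause.
conn.execute(
    "CREATE INDEX idx_segments_dataset_status"
    " ON document_segments (dataset_id, status)"
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM document_segments"
    " WHERE dataset_id = ? AND status = ?",
    ("ds1", "completed"),
).fetchall()
# The query plan should report a search using the composite index
# rather than a scan of the whole table.
print(plan)
```

With tens of millions of segment rows, the difference between an index search and a full scan is exactly the gap between sub-second and multi-second retrieval reported in this issue.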

@wanghan5 wanghan5 linked a pull request Mar 28, 2025 that will close this issue