
Fix the performance problem when the total number of segments approaches 100 million. #16643

Open
5 tasks done
wanghan5 opened this issue Mar 24, 2025 · 2 comments · May be fixed by #17083
Labels
🐞 bug Something isn't working

Comments

wanghan5 commented Mar 24, 2025

Self Checks

  • This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please submit issues in English, otherwise they will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

1.1.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

After importing about 300 thousand files into a knowledge base, queries became very slow.

Image

✔️ Expected Behavior

The query result should return within 1 second.

❌ Actual Behavior

Image
It took about 11 seconds.

@dosubot dosubot bot added the 🐞 bug Something isn't working label Mar 24, 2025
dosubot bot commented Mar 24, 2025

Hello @wanghan5! I'm here to help you with any bugs, questions, or contributions. Let's work together to solve this issue.

I found a similar open issue related to performance problems in the Dify codebase. The suggested solution involves optimizing the IndexingRunner().run() method, particularly the _update_segment_keywords() function, by:

  1. Extracting keywords during segment insertion.
  2. Implementing batch processing to reduce database operation overhead.
  3. Utilizing multi-threading or multi-processing to handle multiple segments concurrently.

Additionally, it suggests modifying the DatasetDocumentStore.add_documents method to extract keywords during document insertion, ensuring keywords are inserted into document_segments rather than documents [1].
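A minimal sketch of the batching-plus-concurrency idea suggested above. The `extract_keywords` stand-in and all names here are hypothetical placeholders, not Dify's actual implementation; the real extractor and bulk-write logic would live in `IndexingRunner` / `DatasetDocumentStore`:

```python
from concurrent.futures import ThreadPoolExecutor


def extract_keywords(text: str) -> list[str]:
    # Placeholder for a real keyword extractor (Dify uses its own logic);
    # here we just take the unique lowercased words as a stand-in.
    return sorted(set(text.lower().split()))


def index_segments(segments: list[str],
                   batch_size: int = 100,
                   workers: int = 4) -> list[list[str]]:
    """Extract keywords for many segments concurrently, batch by batch,
    so keyword rows could be written with one bulk insert per batch
    instead of one database round-trip per segment."""
    results: list[list[str]] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, len(segments), batch_size):
            batch = segments[start:start + batch_size]
            # In a real fix, a single bulk UPDATE/INSERT per batch would
            # replace per-segment writes here.
            results.extend(pool.map(extract_keywords, batch))
    return results
```

The point of the sketch is the shape of the change: keyword extraction happens at insertion time, work is chunked into batches, and the batches are processed by a thread pool rather than one segment at a time.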


wanghan5 (Author) commented

Documents have already been indexed. The performance problem happens when trying to retrieve them.

I found some issues involving incorrect indexes and misuse of properties.

There is a knowledge-base rate limit in the cloud version: https://docs.dify.ai/guides/knowledge-base/knowledge-request-rate-limit

I would like to submit a pull request later. Hope it helps.
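To illustrate the kind of index problem described above (the actual fix is in the linked PR, and these table and column names are invented for the demo): a retrieval query that filters segments by dataset and status does a full table scan unless a matching composite index exists. A self-contained SQLite sketch:

```python
import sqlite3

# Hypothetical schema loosely modeled on a "document_segments" table;
# names are illustrative only, not Dify's real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document_segments ("
    " id INTEGER PRIMARY KEY, dataset_id TEXT, status TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO document_segments (dataset_id, status, content)"
    " VALUES (?, ?, ?)",
    [(f"ds{i % 10}", "completed", f"segment {i}") for i in range(1000)],
)

# Composite index covering the retrieval query's WHERE clause.
conn.execute(
    "CREATE INDEX idx_segments_dataset_status"
    " ON document_segments (dataset_id, status)"
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM document_segments"
    " WHERE dataset_id = ? AND status = ?",
    ("ds1", "completed"),
).fetchall()
# The query plan should report a search using the composite index
# rather than a scan of the whole table.
print(plan)
```

With tens of millions of segment rows, the difference between an index search and a full scan is exactly the gap between sub-second and multi-second retrieval reported in this issue.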

@wanghan5 wanghan5 linked a pull request Mar 28, 2025 that will close this issue