-
Notifications
You must be signed in to change notification settings - Fork 2.9k
embeddings: add support for prefixes in embeddings #4524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for continuedev canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eliransin thanks for the great work here! Everything written so far looks good to me. There's one thing we should add, which is support for the config.yaml format. This would be added in the config-yaml/src/schemas/models.ts
embedOptionsSchema
, and then should be passed through here: https://github.com/continuedev/continue/blob/c1699c975fc252d2ed14b8c4a9770d802852cd3b/core/config/yaml/models.ts
611c16b
to
cf83f62
Compare
Changes in this version:
|
@sestinj I have a bit of a noob qusetion/s, I noticed that |
Summary:
My question is: I am going to go ahead and submit one unified PR and will assume that if this is the wrong thing to do, checks will fail 🙂 Probably will only do it in a day or so, so maybe I'll get an answer by then... |
Some of the embedding models (e.g nomic-embed-text) are trained to embed for several purposes, this on the flip side, means that if not given the purpose for the embedding the retrieval results might and probably will be suboptimal. This change aims to deal with it by adding a configuration option for chunk and query prefixes with the option to support more embedding purposes in the future (i.e clustering). Some refferences to justify this change: 1. https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#task-instruction-prefixes The nomic-embed-text prefixes section 2. https://www.youtube.com/watch?v=76EIC_RaDNw (Don’t Embed Wrong! by Matt Williams) Signed-off-by: Eliran Sinvani <[email protected]>
After adding embedding prefix model support we also need to add support for it in config-yaml, this in turn require us to bump up the version of config-yaml. Signed-off-by: Eliran Sinvani <[email protected]>
cf83f62
to
10e72a6
Compare
after adding zod schema definitions for the newly added embeddings prefix in config-yaml package. let's integrate it. Signed-off-by: Eliran Sinvani <[email protected]>
10e72a6
to
51090ef
Compare
@sestinj I digged a bit in the previous commits, it seams like I need to first have the package in registry so probably it should occure in 2 separate stages. I will break this PR into two. |
@sestinj converted this to draft until the config-yaml package is published and proper adjutments are made as a result. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry again for being a bit slow to review. All looks good to me at this point except for the one docs change. I would go ahead and make the update myself but since we're in Draft state I want to leave it for you
@@ -84,7 +84,11 @@ We recommend configuring **Nomic Embed Text** as your embeddings model. | |||
{ | |||
"embeddingsProvider": { | |||
"provider": "ollama", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove the "embeddingPrefixes" from the JSON in the docs here. I know it's correct, but it ends up looking very complicated for a beginner setup. The way that we will solve this for all users going forward is by having the blocks on hub.continue.dev include the necessary embedding prefixes and then suggest in the docs that users use uses: ollama/nomic-embed-text
for example
Some of the embedding models (e.g nomic-embed-text) are trained to embed for several purposes, this on the flip side, means that if not given the purpose for the embedding the retrieval results might and probably will be suboptimal. This change aims to deal with it by adding a configuration option for chunk and query prefixes with the option to support more embedding purposes in the future (i.e clustering).
Some refferences to justify this change:
Description
[ What changed? Feel free to be brief. ]
Checklist
Screenshots
[ For visual changes, include screenshots. ]
Testing instructions
Nothing special just adding some prefixes and making sure that a different index is being created. Some better retrievals might be observed but I guess it depend on the size of the index (the smaller the index the less the effect I assume).