embeddings: add support for prefixes in embeddings #4524

eliransin · 2025-03-07T10:16:36Z

Some embedding models (e.g., nomic-embed-text) are trained to embed text for several purposes. The flip side is that, if the purpose of the embedding is not given, retrieval results will probably be suboptimal. This change addresses that by adding a configuration option for chunk and query prefixes, leaving room to support more embedding purposes (e.g., clustering) in the future.

Some references to justify this change:

https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#task-instruction-prefixes The nomic-embed-text prefixes section
https://www.youtube.com/watch?v=76EIC_RaDNw (Don’t Embed Wrong! by Matt Williams)
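To illustrate the idea described above, here is a minimal sketch of how a per-task prefix could be prepended before text is sent to the embedding model. The names `EmbeddingTask`, `EmbeddingPrefixes`, and `applyPrefix` are hypothetical and are not taken from this PR's diff; the prefix strings are the ones documented in the nomic-embed-text model card linked above.

```typescript
// Hypothetical names throughout: EmbeddingTask, EmbeddingPrefixes, and
// applyPrefix are illustrative and not taken from the PR's diff.
type EmbeddingTask = "chunk" | "query";
type EmbeddingPrefixes = Partial<Record<EmbeddingTask, string>>;

// Prepend the task-specific prefix, if one is configured, to each input.
function applyPrefix(
  inputs: string[],
  task: EmbeddingTask,
  prefixes?: EmbeddingPrefixes,
): string[] {
  const prefix = prefixes?.[task];
  if (!prefix) {
    return inputs; // No prefix configured: embed the text unchanged.
  }
  return inputs.map((text) => prefix + text);
}

// Prefix strings documented on the nomic-embed-text model card.
const nomicPrefixes: EmbeddingPrefixes = {
  chunk: "search_document: ",
  query: "search_query: ",
};

// Indexed code chunks and user queries get different prefixes.
const chunkInputs = applyPrefix(
  ["function add(a, b) { return a + b; }"],
  "chunk",
  nomicPrefixes,
);
const queryInputs = applyPrefix(
  ["how do I add two numbers?"],
  "query",
  nomicPrefixes,
);
```

Keeping the prefixes in a map keyed by task is one way to leave room for further purposes such as clustering: supporting a new task would mean adding a key rather than a new option.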

Checklist

The relevant docs, if any, have been updated or created
The relevant tests, if any, have been updated or created


Testing instructions

Nothing special: add some prefixes and make sure that a different index is created. Somewhat better retrieval might be observed, but I guess it depends on the size of the index (I assume the smaller the index, the smaller the effect).
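One way to picture why changing the prefixes should produce a different index: if the identifier used to tag stored embeddings includes the prefixes, any change to them yields a new identifier. The sketch below assumes such a scheme; `EmbedderIdentity` and `indexId` are hypothetical names, and this is not necessarily how the PR derives its index identity.

```typescript
// Hypothetical sketch: EmbedderIdentity and indexId are illustrative names,
// not the identifiers used in the PR.
interface EmbedderIdentity {
  model: string;
  prefixes?: { chunk?: string; query?: string };
}

// Build a stable identifier from the model name and the configured prefixes,
// so that changing either one results in a different index tag.
function indexId(identity: EmbedderIdentity): string {
  const chunk = identity.prefixes?.chunk ?? "";
  const query = identity.prefixes?.query ?? "";
  return [identity.model, chunk, query].join("::");
}

const withoutPrefixes = indexId({ model: "nomic-embed-text" });
const withPrefixes = indexId({
  model: "nomic-embed-text",
  prefixes: { chunk: "search_document: ", query: "search_query: " },
});
```

With this scheme, `withoutPrefixes` and `withPrefixes` differ, so embeddings computed without prefixes would not be reused once prefixes are configured.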

netlify · 2025-03-07T10:17:08Z

✅ Deploy Preview for continuedev canceled.

Name	Link
🔨 Latest commit	`51090ef`
🔍 Latest deploy log	https://app.netlify.com/sites/continuedev/deploys/67de531fdd868e0008e9dee1

sestinj

@eliransin thanks for the great work here! Everything written so far looks good to me. There's one thing we should add, which is support for the config.yaml format. This would be added in the config-yaml/src/schemas/models.ts embedOptionsSchema, and then should be passed through here: https://github.com/continuedev/continue/blob/c1699c975fc252d2ed14b8c4a9770d802852cd3b/core/config/yaml/models.ts
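As a rough picture of the pass-through requested above, the sketch below reads prefixes from a YAML-style model entry and copies them into the options handed to the embedder. Only `embedOptionsSchema` and `embeddingPrefixes` are names that appear in this thread; the interfaces, `parsePrefixes`, and `toEmbedderOptions` are assumptions. The real schema is defined with zod; a hand-written check stands in for it here to keep the example dependency-free.

```typescript
// Hypothetical shapes: only the names embedOptionsSchema and
// "embeddingPrefixes" appear in this thread; everything else is assumed.
type EmbeddingPrefixes = { chunk?: string; query?: string };

interface YamlModelConfig {
  name: string;
  provider: string;
  model: string;
  embedOptions?: { embeddingPrefixes?: unknown };
}

interface EmbedderOptions {
  model: string;
  embeddingPrefixes?: EmbeddingPrefixes;
}

// Hand-written stand-in for the validation a zod schema would perform.
function parsePrefixes(value: unknown): EmbeddingPrefixes | undefined {
  if (value === undefined) {
    return undefined;
  }
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    throw new Error("embeddingPrefixes must be an object");
  }
  const record = value as Record<string, unknown>;
  const result: EmbeddingPrefixes = {};
  for (const task of ["chunk", "query"] as const) {
    const prefix = record[task];
    if (prefix === undefined) {
      continue; // Each task prefix is optional.
    }
    if (typeof prefix !== "string") {
      throw new Error("prefix for " + task + " must be a string");
    }
    result[task] = prefix;
  }
  return result;
}

// Copy the validated prefixes from the YAML entry into the embedder options.
function toEmbedderOptions(config: YamlModelConfig): EmbedderOptions {
  return {
    model: config.model,
    embeddingPrefixes: parsePrefixes(config.embedOptions?.embeddingPrefixes),
  };
}
```

Validating at the boundary means the rest of the code can rely on each prefix, when present, being a string.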

eliransin · 2025-03-18T04:51:54Z

Changes in this version:

rebased on main and resolved conflicts
added zod definitions as instructed in cr comments
🙂

eliransin · 2025-03-20T06:37:34Z

@sestinj I have a bit of a noob question: I noticed that config-yaml/src/schemas/models.ts embedOptionsSchema sits in its own package.
Should I bump the version of this package (and probably also bump the corresponding entry in the core package.json)? If I do, is the dependency installed from local source, or should I first open a PR to update this package, have the new version published to the package registry, and only then open a second PR that uses it in core?
Thanks

eliransin · 2025-03-20T07:34:33Z

Summary:
Figured out I should do the following, but unsure if I need to do it in two separate steps:

1. Make changes to packages/config-yaml, including bumping the version
2. Update the dependencies of core to the new @continuedev/config-yaml version
3. Add the extra assignment from the newly added option in core/config/yaml/models.ts
4. All of the changes already made in this PR

My question is:
Should 1 be in a separate PR?

I am going to go ahead and submit one unified PR and will assume that if this is the wrong thing to do, checks will fail 🙂

Probably will only do it in a day or so, so maybe I'll get an answer by then...


eliransin · 2025-03-22T06:18:36Z

@sestinj I dug a bit through the previous commits. It seems I first need to have the package in the registry, so this should probably happen in two separate stages. I will break this PR into two.

eliransin · 2025-03-22T06:59:19Z

@sestinj Converted this to draft until the config-yaml package is published and the proper adjustments are made as a result.

sestinj

Sorry again for being a bit slow to review. All looks good to me at this point except for the one docs change. I would go ahead and make the update myself but since we're in Draft state I want to leave it for you

sestinj · 2025-04-23T21:21:32Z

docs/docs/customize/model-providers/top-level/ollama.mdx

@@ -84,7 +84,11 @@ We recommend configuring **Nomic Embed Text** as your embeddings model.
    {
      "embeddingsProvider": {
        "provider": "ollama",


Let's remove the "embeddingPrefixes" from the JSON in the docs here. I know it's correct, but it ends up looking very complicated for a beginner setup. The way that we will solve this for all users going forward is by having the blocks on hub.continue.dev include the necessary embedding prefixes, and then suggesting in the docs that users use `uses: ollama/nomic-embed-text`, for example.

sestinj requested changes Mar 16, 2025


eliransin force-pushed the embeding_prefixes branch from 611c16b to cf83f62 Compare March 18, 2025 04:49

eliransin requested a review from a team as a code owner March 18, 2025 04:49

eliransin requested review from RomneyDa and removed request for a team March 18, 2025 04:49

eliransin requested a review from sestinj March 19, 2025 19:22

eliransin added 2 commits March 21, 2025 09:27

config-yaml: add embedding prefixes support

21baeb8

After adding embedding prefix model support, we also need to add support for it in config-yaml. This in turn requires us to bump the version of config-yaml. Signed-off-by: Eliran Sinvani <[email protected]>

eliransin force-pushed the embeding_prefixes branch from cf83f62 to 10e72a6 Compare March 21, 2025 07:43

embedding prefixes: integrate config-yaml embedding prefixes

51090ef

After adding zod schema definitions for the newly added embeddings prefix in the config-yaml package, let's integrate it. Signed-off-by: Eliran Sinvani <[email protected]>

eliransin force-pushed the embeding_prefixes branch from 10e72a6 to 51090ef Compare March 22, 2025 06:05

eliransin mentioned this pull request Mar 22, 2025

config-yaml: add embedding prefixes support #4765

Merged

2 tasks

eliransin marked this pull request as draft March 22, 2025 06:58

sestinj requested changes Apr 23, 2025


RomneyDa removed their request for review May 3, 2025 05:24
