Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.Net: Introduce new record model #11264

Merged
merged 1 commit into from
Apr 3, 2025

Conversation

roji
Copy link
Member

@roji roji commented Mar 28, 2025

See #11140 for the overall approach here. In a nutshell:

  • Evolve VectorStoreRecordPropertyReader into a proper VectorStoreRecordModel, which gets built from the user's CLR type and/or VectorStoreRecordDefinition. This is now the centralized place for all information about properties and information about them, as well as validation etc. The model and model building logic is exposed publicly, so that external connectors can be built; these types are in a separate ConnectorSupport namespace (they're connector-facing and not user-facing), and are also marked as experimental so we can evolve them.
  • Properties in the model are also represented by their own types, and are capable of reading/writing themselves to/from the user's record type. This abstracts away the specific method of accessing properties (reflection, dictionary lookup, source generated delegate...) away from connector code.
  • As a result, connectors which handle serialization themselves no longer need separate mappers for POCOs and for VectorStoreGenericDataModel. Connectors which use an external serializer (JSON) still need different mappers, but once we transition from VectorStoreGenericDataModel to a plain dictionary, that will likely go away too.
  • Note that we now apply the user's VectorStoreRecordDefinition in addition to (and overriding) the attributes (we were previously doing one or the other).
    • One motivation was that an embedding generator cannot be configured via an attribute, since it typically requires endpoint/auth configuration.
    • We see EF users doing this, i.e. expressing what they can via attributes, and falling back to the Fluent API for the rest.
    • This also allows the same CLR type - with some general attributes - to be used with varying VectorStoreRecordDefinitions.

Closes #11140

@markwallace-microsoft markwallace-microsoft added .NET Issue or Pull requests regarding .NET code kernel Issues or pull requests impacting the core kernel documentation memory labels Mar 28, 2025
@github-actions github-actions bot changed the title Introduce new record model .Net: Introduce new record model Mar 28, 2025
@roji roji marked this pull request as ready for review March 28, 2025 22:20
@roji roji requested a review from a team as a code owner March 28, 2025 22:20
Copy link
Member

@adamsitnik adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed 102 out of 168 files, so far LGTM. PTAL at my comments, I'll try to finish the review later today.

Copy link
Member

@adamsitnik adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for all the hard work you have put into this refactoring @roji !

As soon as you rename those few public properties that me and @westey-m has commented on, I am going to hit the approve button.

Thanks!

this.Customize();
this.Finalize(clrType);

return new(clrType, this.KeyProperties, this.DataProperties, this.VectorProperties, this.PropertyMap);
Copy link
Contributor

@westey-m westey-m Apr 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this class be reused to process multiple types and definitions?
If not, I think we should have some very explicit warnings in the xml comments around this, so that consumers do not accidentally use this incorrectly. Or preferably, just make the class thread safe and reusable, so that warnings are unnecessary.
Or make the constructor internal and force invocations via a static method on the class.

Copy link
Member Author

@roji roji Apr 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it isn't reusable or thread-safe. We could easily make it reusable, but making it thread safety would mean we can't use any members (e.g. the various lists and dictionaries) and must constantly pass everything as parameters.

The way the model builder (currently) works is that a collection creates one in its constructor, uses it to create the model, and then discards the builder - it's a one-time, startup operation. Performance doesn't seem very important here - the couple extra allocations on instantiating a new collection are generally negligible, unless we really want to start optimizing for super-efficient creation of collections (which would likely imply much more important/heavy things first).

So I basically don't currently see a need for making one-time thing reusable/thread-safe... For now I've added a doc remark that the type is single-use and not thread-safe. Let me know if you disagree and think we should do something else.

Note that all the new connector-facing types under ConnectorSupport should be considered a first version and not yet completely fleshed out (that's why they're [Experimental]). We're very likely still going to have to iterate and mature all this (NativeAOT, hierarchical types, and various other thing). So I'm not (yet) applying the same level of design rigor as we need to on the user-facing public surface.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do this after merging, but I'd definitely recommend switching this design to something that matches it's intended usage pattern.
This is after all intended to be used by other implementers too, and considering that the constructor param values are essentially static to a connector, if I saw the design of this class I would think I can define a static instance in my connector and then just invoke it multiple times.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noting that putting something as static implies not just invoking multiple times (reusability), but also thread-safety, since collections might get created concurrently... At that point I'm really hoping connector implementors don't assume a type is thread-safe without carefully checking the docs etc. (and am also assuming other connector implementors basically copy-paste our code).

FWIW there's a similar question for the fillter translators (though those are internal to each connector rather than a general MEVD component). Expression visitors may need to save some state in members to communicate things between layers, which would be incompatible with concurrent use; and since IVectorStoreRecordCollection is thread-safe, a new translator needs to be instantiated for each query (though we can pool them if we think that's really necessary for perf).

In the interest of moving fast, let's defer this for now - maybe open an issue with possible design evolution around this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we can do this in a follow up. I think it would be a fairly trivial change. I did note that the filter translators are using a similar pattern, but they are not something we are shipping publicly. Also, the developer who wrote the filter will also be the one who uses the filter, so not the same level of risk or scrutiny required.

Comment on lines +129 to +132
IsFilterable = dataAttribute.IsFilterable,
IsFullTextSearchable = dataAttribute.IsFullTextSearchable,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making it possible to just pass the attribute or definition as needed to the VectorStoreRecordDataPropertyModel so that it can do cloning of settings. That way, if we add new properties, cloning is just happening there instead of each point of consumption.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean that you'd like e.g. VectorStoreRecordDataPropertyModel to have direct references to VectorStoreRecordDataAttribute and to VectorStoreRecordDataProperty? Wouldn't that just move the logic of having to access properties on these two types from the model builder to the PropertyModels?

(I think I may be misunderstanding what you have in mind)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's the goal. When adding a new property to e.g. VectorStoreRecordDataPropertyModel the copying from VectorStoreRecordDataAttribute and VectorStoreRecordDataProperty can be updated once within the VectorStoreRecordDataPropertyModel class, instead of having to find every place in the code where we currently instantiate new instances of VectorStoreRecordDataPropertyModel from VectorStoreRecordDataAttribute VectorStoreRecordDataProperty and update each of those locations to copy all required properties across.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Instantiating and populating a PropertyModel is something that should only ever happen in the ModelBuilder (or possibly a subclass), never elsewhere - so the scope here is a bit limited; but I get what you're saying. Plus it's worth mentioning that we don't only instantiate a PropertyModel from an attribute/definition, we also populate it (e.g. we create a PropertyModel from an attribute, and then there's also a definition that needs to be added (an possible overwrite) stuff on that PropertyModel.

I'd rather avoid having PropertyModel directly reference the attribute/definition (e.g. in a constructor), which are just sources for it, so how about this: given this all happens within the model builder only, we can just have a Populate() method on VectorStoreRecordModelBuilder which does the copying, this way this all stays inside that scope?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, sounds good. Ultimately this is a nice to have thing, so I'm fine whatever direction you take.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at implementing this, and we currently have 3 places inside the model builder where we transfer information from an attribute/definition to a property model:

  1. When traversing a CLR type's properties, we discover an attribute. Here we create a property model, copying over the information from the attribute.
  2. When processing the record definition, we find a definition for a property we don't already have (i.e. we didn't encounter it in step 1 above, e.g. because of dynamic mapping). Here we just create the property model, but don't populate it from the definition (only the mandatory constructor stuff - name+Type).
  3. After the above, regardless of whether the property was instantiated in step 1 or 2, we copy over the information from the definition to the property model. This allows the definition to clobber the attribute (which is the design I was going for).

So there's one piece of code that creates and populates from an attribute, one that just creates, and one that just populates from a record definition - these look like they're going to have to stay distinct (e.g. there need to be two distinct code paths for attribute vs. definition as these are two different types)... I can move that code out to live elsewhere (in special methods inside the builder, or in the PropertyModel itself, which I dislike since it's only part of model building); but that just seems like moving code around and making it more complex to understand the program flow...

Maybe I'm still missing what you have in mind though... Unless it's important to you to do this now, maybe simply submit a PR later proposing what you'd like to have?

@roji roji temporarily deployed to integration April 3, 2025 19:19 — with GitHub Actions Inactive
@roji roji merged commit a312b59 into microsoft:feature-vector-data-preb2 Apr 3, 2025
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation kernel Issues or pull requests impacting the core kernel memory .NET Issue or Pull requests regarding .NET code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants