-
Notifications
You must be signed in to change notification settings - Fork 96
Add support for chunking_settings in semantic_text fields #4261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this! Before we review in more depth, I'm wondering if this should be consolidated with InferenceChunkingSettings
or not.
export class ChunkingSettings { | ||
/** | ||
* Indicates the type of strategy to use. | ||
*/ | ||
type: ChunkingStrategy | ||
|
||
/** | ||
* The maximum number of words in a chunk. | ||
*/ | ||
max_chunk_size: integer | ||
|
||
/** | ||
* The number of overlapping words allowed in chunks. This cannot be | ||
* defined as more than half of the max_chunk_size. Required for `word` | ||
* type chunking settings. | ||
*/ | ||
overlap?: integer | ||
|
||
/** | ||
* The number of overlapping sentences allowed in chunks. Required for `sentence` type chunking settings. | ||
*/ | ||
sentence_overlap?: 0 | 1 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a similar class for the Inference API; should they be the same, or are they also distinct in Elasticsearch?
/** | |
* Chunking configuration object | |
*/ | |
export class InferenceChunkingSettings { | |
/** | |
* The maximum size of a chunk in words. | |
* This value cannot be higher than `300` or lower than `20` (for `sentence` strategy) or `10` (for `word` strategy). | |
* @server_default 250 | |
*/ | |
max_chunk_size?: integer | |
/** | |
* The number of overlapping words for chunks. | |
* It is applicable only to a `word` chunking strategy. | |
* This value cannot be higher than half the `max_chunk_size` value. | |
* @server_default 100 | |
*/ | |
overlap?: integer | |
/** | |
* The number of overlapping sentences for chunks. | |
* It is applicable only for a `sentence` chunking strategy. | |
* It can be either `1` or `0`. | |
* @server_default 1 | |
*/ | |
sentence_overlap?: integer | |
/** | |
* The chunking strategy: `sentence` or `word`. | |
* @server_default sentence | |
*/ | |
strategy?: string | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh nice catch!
The only concern that I have with combining them, is that we currently require all required settings to be sent in, with no defaults applied. So if you select word
based chunking and don't send in both max_chunk_size
and overlap
you will get an error. This made sense when we wrote it because you're already putting in custom overrides, though I suppose we could revisit this in the future.
Based on this specification, though, max_chunk_size
and overlap
would both be optional due to default values, and using this would break.
WDYT the best course of action is? Maybe put in a comment why they're separate and keep them separate for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, let's keep them separate. To ensure they keep using the same property names, we can use OverloadOf
: https://github.com/elastic/elasticsearch-specification/blob/main/docs/behaviors.md#overloadof. With that, we don't need a comment!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the changes. However, were the parameters different? You used to have type
and it's now strategy
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would have been a bug 🤦♀️ Here's the PR with the example, strategy is correct: elastic/elasticsearch#121041
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! LGTM.
Co-authored-by: Quentin Pradet <[email protected]> (cherry picked from commit 39cc516)
) Co-authored-by: Quentin Pradet <[email protected]> (cherry picked from commit 39cc516) Co-authored-by: Kathleen DeRusso <[email protected]>
Updates the specification to include
chunking_settings
forsemantic_text
fields which were added in elastic/elasticsearch#121041