Skip to content

Add support for chunking_settings in semantic_text fields #4261

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Apr 15, 2025

Conversation

kderusso
Copy link
Member

Updates the specification to include chunking_settings for semantic_text fields which were added in elastic/elasticsearch#121041

Copy link
Member

@pquentin pquentin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this! Before we review in more depth, I'm wondering if this should be consolidated with InferenceChunkingSettings or not.

Comment on lines 22 to 44
export class ChunkingSettings {
/**
* Indicates the type of strategy to use.
*/
type: ChunkingStrategy

/**
* The maximum number of words in a chunk.
*/
max_chunk_size: integer

/**
* The number of overlapping words allowed in chunks. This cannot be
* defined as more than half of the max_chunk_size. Required for `word`
* type chunking settings.
*/
overlap?: integer

/**
* The number of overlapping sentences allowed in chunks. Required for `sentence` type chunking settings.
*/
sentence_overlap?: 0 | 1
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a similar class for the Inference API; should they be the same, or are they also distinct in Elasticsearch?

/**
* Chunking configuration object
*/
export class InferenceChunkingSettings {
/**
* The maximum size of a chunk in words.
* This value cannot be higher than `300` or lower than `20` (for `sentence` strategy) or `10` (for `word` strategy).
* @server_default 250
*/
max_chunk_size?: integer
/**
* The number of overlapping words for chunks.
* It is applicable only to a `word` chunking strategy.
* This value cannot be higher than half the `max_chunk_size` value.
* @server_default 100
*/
overlap?: integer
/**
* The number of overlapping sentences for chunks.
* It is applicable only for a `sentence` chunking strategy.
* It can be either `1` or `0`.
* @server_default 1
*/
sentence_overlap?: integer
/**
* The chunking strategy: `sentence` or `word`.
* @server_default sentence
*/
strategy?: string
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh nice catch!

The only concern that I have with combining them, is that we currently require all required settings to be sent in, with no defaults applied. So if you select word based chunking and don't send in both max_chunk_size and overlap you will get an error. This made sense when we wrote it because you're already putting in custom overrides, though I suppose we could revisit this in the future.

Based on this specification, though, max_chunk_size and overlap would both be optional due to default values, and using this would break.

WDYT the best course of action is? Maybe put in a comment why they're separate and keep them separate for now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, let's keep them separate. To ensure they keep using the same property names, we can use OverloadOf: https://github.com/elastic/elasticsearch-specification/blob/main/docs/behaviors.md#overloadof. With that, we don't need a comment!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL, thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the changes. However, were the parameters different? You used to have type and it's now strategy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would have been a bug 🤦‍♀️ Here's the PR with the example, strategy is correct: elastic/elasticsearch#121041

@kderusso kderusso requested a review from pquentin April 14, 2025 12:59
Copy link
Member

@pquentin pquentin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM.

@pquentin pquentin merged commit 39cc516 into main Apr 15, 2025
8 checks passed
@pquentin pquentin deleted the kderusso/semantic-text-chunking-settings branch April 15, 2025 07:43
github-actions bot pushed a commit that referenced this pull request Apr 15, 2025
Co-authored-by: Quentin Pradet <[email protected]>
(cherry picked from commit 39cc516)
pquentin pushed a commit that referenced this pull request Apr 16, 2025
)

Co-authored-by: Quentin Pradet <[email protected]>
(cherry picked from commit 39cc516)

Co-authored-by: Kathleen DeRusso <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants