updating standard analyzer docs #9747


Open · wants to merge 4 commits into main

Conversation

AntonEliatra (Contributor)

Description

updating standard analyzer docs

Version

all

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.


Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.

@kolchfa-aws (Collaborator)

@udabhas Could you please review this PR? Thanks!

The `standard` analyzer supports the following parameters:

| Parameter | Required/Optional | Data type | Description |
|:----------|:------------------|:----------|:------------|
| `max_token_length` | Optional | Integer | Sets the maximum length of the produced token. If this length is exceeded, the token is split into multiple tokens at the length configured in `max_token_length`. Default is `255`. |
| `stopwords` | Optional | String or list of strings | A string specifying a predefined list of stopwords (such as `_english_`) or an array specifying a custom list of stopwords. Default is `_none_`. |
| `stopwords_path` | Optional | String | The path (absolute or relative to the config directory) to the file containing a list of stopwords. |
Member

Since all the parameters are optional, can we reword it as "supports the following optional parameters"?
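As an aside, `stopwords_path` has no example in this excerpt. A minimal settings sketch using the parameters from the table above (the index name, analyzer name, token length, and file path are all hypothetical, not from the PR):

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_standard_analyzer": {
          "type": "standard",
          "max_token_length": 100,
          "stopwords_path": "analysis/stopwords.txt"
        }
      }
    }
  }
}
```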

| Parameter | Type | Default | Description |
|:----------|:-----|:--------|:------------|
| `max_token_length` | Integer | `255` | Sets the maximum length of a token before it is split. |
| `stopwords` | List or String | None | A list of stopwords or a predefined stopword set like `_english_` to remove during analysis. |
Member

"String or list of strings" was the better description.

| Parameter | Type | Default | Description |
|:----------|:-----|:--------|:------------|
| `max_token_length` | Integer | `255` | Sets the maximum length of a token before it is split. |
| `stopwords` | List or String | None | A list of stopwords or a predefined stopword set like `_english_` to remove during analysis. |
| `stopwords_path` | String | None | Path to a file containing stopwords to be used during analysis. |
Member

Can you verify whether both `stopwords` and `stopwords_path` can be used in a single request? If not, let's add a small note or clarification.
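One way to check this (a sketch only; the index name, analyzer name, and stopword values are hypothetical, and the outcome is exactly what needs verifying) is to define both parameters on one analyzer and see whether index creation succeeds or returns an error:

```json
PUT /test-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "combined_stopwords_analyzer": {
          "type": "standard",
          "stopwords": ["and", "or"],
          "stopwords_path": "analysis/stopwords.txt"
        }
      }
    }
  }
}
```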


Use the following command to configure an index with a custom analyzer that is equivalent to the `standard` analyzer:
The following example crated index `products` and configures `max_token_length` and `stopwords`:
Member

typo: creates/crated?

}
```
{% include copy-curl.html %}

The response contains the generated tokens:
The returned token are separated based on spacing, lowercased and stopwords are removed:
Member

rephrasing:
The returned tokens are 1/ separated based on spacing, 2/ lowercased, and 3/ stopwords removed:

I want to emphasize the actions that happened.
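For reference, the request body commented on above is elided in this excerpt. A sketch consistent with the surrounding description (only the index name `products` and the two parameter names come from the text; the analyzer name and parameter values are hypothetical):

```json
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_manual_stopwords_analyzer": {
          "type": "standard",
          "max_token_length": 10,
          "stopwords": ["and", "is", "the"]
        }
      }
    }
  }
}
```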


- `standard` tokenizer: Removes most punctuation and splits text on spaces and other common delimiters.
- `lowercase` token filter: Converts all tokens to lowercase, ensuring case-insensitive matching.
- `stop` token filter: Removes common stopwords, such as "the", "is", and "and", from the tokenized output.
Member

Did we miss adding detail about stopwords, or is that part of tokenization or lowercasing itself?

Contributor Author

@sandeshkr419 But the stop words are not removed in the standard analyzer. Is that what you mean?
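For context on this thread: per the parameter table above, `stopwords` defaults to `_none_`, so the default `standard` analyzer keeps stopwords. A quick check (a sketch; the sample text is arbitrary):

```json
GET /_analyze
{
  "analyzer": "standard",
  "text": "The quick brown fox"
}
```

With default settings this returns the tokens `the`, `quick`, `brown`, and `fox`; the stopword "the" is only removed once `stopwords` or `stopwords_path` is configured.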

sandeshkr419 self-assigned this Jun 25, 2025

| Parameter | Type | Default | Description |
Member

Most of the documentation pages use "Data type" instead of "Type". E.g.: https://github.com/opensearch-project/documentation-website/pull/9479/files

Let's stick to a single nomenclature across the documentation: either "Type" or "Data type" is fine, as long as it is consistent.

Contributor Author

@sandeshkr419 That's updated now across the repo. All "Data Type" changed to "Data type".
