updating standard analyzer docs #9747


Open · wants to merge 4 commits into main

Conversation

AntonEliatra (Contributor)

Description

updating standard analyzer docs

Version

all

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.


Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.

@kolchfa-aws (Collaborator)

@udabhas Could you please review this PR? Thanks!

The `standard` analyzer supports the following parameters:

| Parameter | Required/Optional | Data type | Description |
|:----------|:------------------|:----------|:------------|
| `max_token_length` | Optional | Integer | Sets the maximum length of the produced token. If this length is exceeded, the token is split into multiple tokens at the length configured in `max_token_length`. Default is `255`. |
| `stopwords` | Optional | String or list of strings | A string specifying a predefined list of stopwords (such as `_english_`) or an array specifying a custom list of stopwords. Default is `_none_`. |
| `stopwords_path` | Optional | String | The path (absolute or relative to the config directory) to the file containing a list of stopwords. |
Member

Since all the parameters are optional, can we reword it as "supports the following optional parameters"?
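As an aside, `stopwords_path` has no example in this excerpt. A minimal settings sketch using the parameters from the table above (the index name, analyzer name, token length, and file path are all hypothetical, not from the PR):

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_standard_analyzer": {
          "type": "standard",
          "max_token_length": 100,
          "stopwords_path": "analysis/stopwords.txt"
        }
      }
    }
  }
}
```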

| Parameter | Type | Default | Description |
|:----------|:-----|:--------|:------------|
| `max_token_length` | Integer | `255` | Sets the maximum length of a token before it is split. |
| `stopwords` | List or String | None | A list of stopwords or a predefined stopword set like `_english_` to remove during analysis. |
Member

"String or list of strings" was the better description.

| Parameter | Type | Default | Description |
|:----------|:-----|:--------|:------------|
| `max_token_length` | Integer | `255` | Sets the maximum length of a token before it is split. |
| `stopwords` | List or String | None | A list of stopwords or a predefined stopword set like `_english_` to remove during analysis. |
| `stopwords_path` | String | None | Path to a file containing stopwords to be used during analysis. |
Member

Can you verify whether both `stopwords` and `stopwords_path` can be used in a single request? If not, let's add a small note or clarification.
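One way to check this (a sketch only; the index name, analyzer name, and stopword values are hypothetical, and the outcome is exactly what needs verifying) is to define both parameters on one analyzer and see whether index creation succeeds or returns an error:

```json
PUT /test-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "combined_stopwords_analyzer": {
          "type": "standard",
          "stopwords": ["and", "or"],
          "stopwords_path": "analysis/stopwords.txt"
        }
      }
    }
  }
}
```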


Use the following command to configure an index with a custom analyzer that is equivalent to the `standard` analyzer:
The following example crated index `products` and configures `max_token_length` and `stopwords`:
Member

typo: creates/crated?

}
```
{% include copy-curl.html %}

The response contains the generated tokens:
The returned token are separated based on spacing, lowercased and stopwords are removed:
Member

rephrasing:
The returned tokens are 1/ separated based on spacing, 2/ lowercased, and 3/ stopwords removed:

I want to emphasize the actions that happened.
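For reference, the request body commented on above is elided in this excerpt. A sketch consistent with the surrounding description (only the index name `products` and the two parameter names come from the text; the analyzer name and parameter values are hypothetical):

```json
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_manual_stopwords_analyzer": {
          "type": "standard",
          "max_token_length": 10,
          "stopwords": ["and", "is", "the"]
        }
      }
    }
  }
}
```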


- `standard` tokenizer: Removes most punctuation and splits text on spaces and other common delimiters.
- `lowercase` token filter: Converts all tokens to lowercase, ensuring case-insensitive matching.
- `stop` token filter: Removes common stopwords, such as "the", "is", and "and", from the tokenized output.
Member

Did we miss adding detail about stopwords, or is that part of tokenization or lowercasing itself?

Contributor Author

@sandeshkr419 But the stop words are not removed in the standard analyzer. Is that what you mean?
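For context on this thread: per the parameter table above, `stopwords` defaults to `_none_`, so the default `standard` analyzer keeps stopwords. A quick check (a sketch; the sample text is arbitrary):

```json
GET /_analyze
{
  "analyzer": "standard",
  "text": "The quick brown fox"
}
```

With default settings this returns the tokens `the`, `quick`, `brown`, and `fox`; the stopword "the" is only removed once `stopwords` or `stopwords_path` is configured.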

sandeshkr419 self-assigned this Jun 25, 2025

| Parameter | Type | Default | Description |
Member

Most of the documentation pages use "Data type" instead of "Type". E.g.: https://github.com/opensearch-project/documentation-website/pull/9479/files

Let's stick to a single nomenclature across the documentation: either "Type" or "Data type" is fine, as long as it is consistent.

Contributor Author

@sandeshkr419 That's updated now across the repo. All "Data Type" changed to "Data type".
