Skip to content

Support production-ready data provider for segmenters #1652

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Mar 9, 2022

Conversation

aethanyc
Copy link
Contributor

@aethanyc aethanyc commented Mar 2, 2022

Fixed #751.
Fixed #1372.
Fixed #1373.
Fixed #1376.
Fixed #1638.

This PR makes the segmenter capable of loading rule break data from production-ready providers such as FsDataProvider or StaticDataProvider.

Some highlights of the patch stack:

  1. Add a new provider crate icu_provider_segmenter to deserialize the segmenter rule-break TOML files into SegmenterRuleTable, and transform it into RuleBreakDataV1. The transformation logic is ported from build.rs, and we leverage serde to serializeRuleBreakDataV1.
  2. After Bring LineBreakDataV1 data structure closer to RuleBreakDataV1 #1634, LineBreakDataV1 and RuleBreakDataV1 are the same, so I unified the line segmenter to use RuleBreakDataV1.
  3. Hook the provider into icu4x-datagen to generate break data for providers.
  4. Remove build.rs.

aethanyc added 9 commits March 1, 2022 16:14
The purpose of the crate is to deserialize the segmenter rule break TOML files
into `SegmenterRuleTable`, and transform it into `RuleBreakDataV1`. The main
function where the transformation takes place is in
SegmenterRuleProvider::generate_break_data(), which is ported from
`generate_rule_segmenter_table` along with many other helpers in
`build.rs`.

Flatten `RuleBreakPropertyTable` into a linear structure so that it can be
serialize/dezerialize via ZeroVec.

In the next commit, we'll convert line segmenter to use RuleBreakDataV1. This
patch removes "provider_serde" cfg for LineBreakDataV1 just to build
successfully.
This commit is generated via `cargo make testdata`.
This commit is generated via `cargo make diplomat-gen`.
This commit is generated via `cargo make generate-readmes`.
@aethanyc aethanyc requested review from makotokato and sffc March 2, 2022 00:22
@aethanyc aethanyc requested a review from a team as a code owner March 2, 2022 00:22
@aethanyc aethanyc removed the request for review from a team March 2, 2022 00:23
@aethanyc
Copy link
Contributor Author

aethanyc commented Mar 3, 2022

I'll resolve the conflict on top of #1644. We haven't support loading dictionary data from FsDataProvider yet, so there's data panic. Convert this PR to draft for now.

@aethanyc aethanyc marked this pull request as draft March 3, 2022 19:57
@aethanyc aethanyc marked this pull request as ready for review March 3, 2022 20:45
@makotokato
Copy link
Member

I'll resolve the conflict on top of #1644. We haven't support loading dictionary data from FsDataProvider yet, so there's data panic. Convert this PR to draft for now.

Thanks a lot!

@aethanyc
Copy link
Contributor Author

aethanyc commented Mar 4, 2022

I make the line segmenter to load the dictionary data from InvariantDataProvider in this commit. This PR is ready for review.


/// Returns the absolute path to the directory containing rule break data.
pub fn break_data_root() -> PathBuf {
PathBuf::from(std::env!("CARGO_MANIFEST_DIR")).join("data")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When adding LSTM (JSON) and dictionary (currently this is binary data from ICU4C), do we store it into same directory?

If you think that others are stored into provider/testdata/data/json/segmenter via JSON format directory, please ignore my comment.
I imagine that the following.

  • data/json/segmenter/lstm/th.json or data/json/segmenter_lstm/th.json
  • data/json/segmenter/char16trie/th.json or data/json/segmenter_char16trie/th.json (convert from binary to JSON)
  • etc

Copy link
Contributor Author

@aethanyc aethanyc Mar 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I take a closer look at how Unicode properties testdata are structured.

For ICU4C generated raw toml data, they are placed in provider/testdata/data/uprops, and icu4x-datagen generates testdata under provider/testdata/data/json/props and a binary format under provider/testdata/data/testdata.postcard.

Also, note that the Resource key has a restrict format, and it determines the generated data path and file name.

If LSTM/dictionary data's raw format and runtime format are similar or even identical, the provider transforming the raw data to the runtime data can be trivial, but I guess we still need to write such a provider for icu4x-datagen?

I image we can put the segmenter raw data (rule break TOML, raw LSTM json, and ICU4C generated dictionary data) under provider/testdata/data/segmenter, and the the resource key -> generate data path will be like:

  • LineBreakDataV1Marker = "segmenter/line@1" -> provider/testdata/data/json/segmenter/[email protected]
  • ThaiDictionaryBreakDataV1Marker = segmenter/thai_dict@1 -> provider/testdata/data/json/segmenter/[email protected]
  • ThaiLstmBreakDataV1Marker = segmenter/thai_lstm@1 -> provider/testdata/data/json/segmenter/[email protected]

(Yes, I think we need different key per language for dictionary and lstm)

I know there is a plan to make icu4x-datagen more friendly to the user, so the data path might be subjected to change. I'll wait @sffc's review and follow his opinion on how we should structure the segmenter data so that they integrate better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What @aethanyc wrote above looks correct to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I make sense.

makotokato
makotokato previously approved these changes Mar 5, 2022
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2.6 MB is more than I was expecting. But the PR looks good.

aethanyc added 2 commits March 5, 2022 13:50
To generate only the line break data, run command such as

```
cargo run --bin icu4x-datagen --\
          --input-from-testdata\
          --all-locales\
          --syntax=json\
          --keys "segmenter/line@1"\
          --out="/tmp/segmenter_data"\
          --overwrite
```
…ty values

Also, regenerate testdata via `cargo make testdata`.
@aethanyc aethanyc requested a review from Manishearth as a code owner March 5, 2022 22:05
@aethanyc aethanyc requested review from makotokato and sffc and removed request for Manishearth March 8, 2022 01:39
@aethanyc
Copy link
Contributor Author

aethanyc commented Mar 8, 2022

Addressed @sffc's comment with three new commits. Re-request the review.

aethanyc added 2 commits March 7, 2022 22:26
This also supports the customized keys in `--keys` for segmenter keys.
@aethanyc aethanyc requested a review from sffc March 8, 2022 07:23
@aethanyc aethanyc closed this Mar 9, 2022
@aethanyc aethanyc reopened this Mar 9, 2022
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge this and continue the discussion about data size in #1653

@aethanyc aethanyc removed the request for review from makotokato March 9, 2022 01:49
@aethanyc
Copy link
Contributor Author

aethanyc commented Mar 9, 2022

I've addressed all @makotokato's comments, and have @sffc's review on the latest version. I feel it's sufficient to merge this PR.

@aethanyc aethanyc merged commit 97da558 into unicode-org:main Mar 9, 2022
@aethanyc aethanyc deleted the segmenter-provider branch March 9, 2022 01:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants