Support production-ready data provider for segmenters #1652

aethanyc · 2022-03-02T00:22:27Z

Fixed #751.
Fixed #1372.
Fixed #1373.
Fixed #1376.
Fixed #1638.

This PR makes the segmenter capable of loading rule break data from production-ready providers such as FsDataProvider or StaticDataProvider.

Some highlights of the patch stack:

- Add a new provider crate icu_provider_segmenter to deserialize the segmenter rule-break TOML files into SegmenterRuleTable and transform them into RuleBreakDataV1. The transformation logic is ported from build.rs, and we leverage serde to serialize RuleBreakDataV1.
- After #1634 (Bring LineBreakDataV1 data structure closer to RuleBreakDataV1), LineBreakDataV1 and RuleBreakDataV1 are the same, so I unified the line segmenter to use RuleBreakDataV1.
- Hook the provider into icu4x-datagen to generate break data for providers.
- Remove build.rs.

The purpose of the crate is to deserialize the segmenter rule break TOML files into `SegmenterRuleTable` and transform them into `RuleBreakDataV1`. The transformation takes place mainly in `SegmenterRuleProvider::generate_break_data()`, which is ported from `generate_rule_segmenter_table` along with many other helpers in `build.rs`. `RuleBreakPropertyTable` is flattened into a linear structure so that it can be serialized and deserialized via ZeroVec. In the next commit, we'll convert the line segmenter to use `RuleBreakDataV1`. This patch removes the "provider_serde" cfg for `LineBreakDataV1` only so that the build succeeds.
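The flattening step mentioned above can be illustrated with a minimal sketch. It does not reproduce the actual `RuleBreakPropertyTable` layout: `BLOCK_SIZE`, `flatten`, and `lookup` are hypothetical names chosen for illustration, and a plain `Vec<u8>` stands in for the ZeroVec-backed storage.

```rust
/// Hypothetical block size; the real table dimensions are not stated here.
const BLOCK_SIZE: usize = 4;

/// Flattens a two-dimensional table (one fixed-size row per block) into a
/// single linear buffer, which is the shape a byte-oriented container needs.
fn flatten(rows: &[[u8; BLOCK_SIZE]]) -> Vec<u8> {
    rows.iter().flat_map(|row| row.iter().copied()).collect()
}

/// Looks up a value in the flattened buffer using the same (block, offset)
/// coordinates as the original two-dimensional table.
/// Returns `None` when the coordinates fall outside the table.
fn lookup(flat: &[u8], block: usize, offset: usize) -> Option<u8> {
    if offset >= BLOCK_SIZE {
        return None;
    }
    flat.get(block * BLOCK_SIZE + offset).copied()
}
```

With two rows `[1, 2, 3, 4]` and `[5, 6, 7, 8]`, `lookup(&flat, 1, 2)` reads index `1 * 4 + 2` of the flattened buffer, so the two-dimensional addressing survives the change of layout.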

aethanyc · 2022-03-03T19:56:43Z

I'll resolve the conflict on top of #1644. We don't support loading dictionary data from FsDataProvider yet, so there's a data panic. Converting this PR to draft for now.

makotokato · 2022-03-04T01:46:12Z

> I'll resolve the conflict on top of #1644. We haven't support loading dictionary data from FsDataProvider yet, so there's data panic. Convert this PR to draft for now.

Thanks a lot!

aethanyc · 2022-03-04T15:34:21Z

In this commit, I made the line segmenter load the dictionary data from InvariantDataProvider. This PR is ready for review.

provider/segmenter/src/lib.rs

makotokato · 2022-03-04T15:27:16Z

+
+/// Returns the absolute path to the directory containing rule break data.
+pub fn break_data_root() -> PathBuf {
+    PathBuf::from(std::env!("CARGO_MANIFEST_DIR")).join("data")


When adding LSTM (JSON) and dictionary data (currently binary data from ICU4C), do we store them in the same directory?

If you expect the others to be stored under provider/testdata/data/json/segmenter in JSON format, please ignore my comment.
I imagine something like the following.

data/json/segmenter/lstm/th.json or data/json/segmenter_lstm/th.json

data/json/segmenter/char16trie/th.json or data/json/segmenter_char16trie/th.json (convert from binary to JSON)

etc

aethanyc · 2022-03-04

Hmm, I took a closer look at how the Unicode properties testdata is structured.

The raw TOML data generated by ICU4C is placed in provider/testdata/data/uprops, and icu4x-datagen generates JSON testdata under provider/testdata/data/json/props and a binary format at provider/testdata/data/testdata.postcard.

Also, note that the resource key has a restricted format, and it determines the generated data path and file name.

If LSTM/dictionary data's raw format and runtime format are similar or even identical, the provider transforming the raw data to the runtime data can be trivial, but I guess we still need to write such a provider for icu4x-datagen?

I imagine we can put the segmenter raw data (rule break TOML, raw LSTM JSON, and ICU4C-generated dictionary data) under provider/testdata/data/segmenter, and the mapping from resource key to generated data path will look like:

LineBreakDataV1Marker = "segmenter/line@1" -> provider/testdata/data/json/segmenter/line@1.json

ThaiDictionaryBreakDataV1Marker = "segmenter/thai_dict@1" -> provider/testdata/data/json/segmenter/thai_dict@1.json

ThaiLstmBreakDataV1Marker = "segmenter/thai_lstm@1" -> provider/testdata/data/json/segmenter/thai_lstm@1.json

(Yes, I think we need different key per language for dictionary and lstm)

I know there is a plan to make icu4x-datagen more user-friendly, so the data path might be subject to change. I'll wait for @sffc's review and follow his opinion on how we should structure the segmenter data so that it integrates better.
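The key-to-path mapping proposed above can be sketched as a small helper. This is only an illustration under stated assumptions: `json_path_for_key` is a hypothetical function, not part of icu4x-datagen, and it assumes the generated file is named after the key with a `.json` suffix under a `json` directory.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: maps a resource key such as "segmenter/line@1" to a
/// JSON file path under `root`. The "json" directory and ".json" suffix are
/// assumptions made for this sketch.
fn json_path_for_key(root: &Path, key: &str) -> PathBuf {
    let mut path = root.join("json");
    // Each '/'-separated part of the key becomes one path component.
    for component in key.split('/') {
        path.push(component);
    }
    // Append the suffix to the final component without treating the
    // "@1" version marker as part of a file extension.
    let mut raw = path.into_os_string();
    raw.push(".json");
    PathBuf::from(raw)
}
```

Because the path is derived purely from the key, giving each language its own key for dictionary and LSTM data yields a distinct output file per language.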

sffc · 2022-03-05

What @aethanyc wrote above looks correct to me.

makotokato · 2022-03-05

OK, that makes sense.

sffc

2.6 MB is more than I was expecting. But the PR looks good.

aethanyc · 2022-03-08T01:42:40Z

Addressed @sffc's comment with three new commits. Re-request the review.

sffc

Let's merge this and continue the discussion about data size in #1653

aethanyc · 2022-03-09T01:52:00Z

I've addressed all @makotokato's comments, and have @sffc's review on the latest version. I feel it's sufficient to merge this PR.

aethanyc added 9 commits March 1, 2022 16:14

Add owners for provider/segmenter

866edd2

Switch line segmenter to use RuleBreakDataV1

8e443ac

Hook icu_segmenter_provider into icu4x-datagen

d6828b9

Regenerate testdata for segmenter

5b64348

This commit is generated via `cargo make testdata`.

Remove unused build.rs

d3e0143

Fix line segmenter FFI

bfda20b

Regenerate line segmenter FFI

4c413cf

This commit is generated via `cargo make diplomat-gen`.

Regenerate README

8bd3d29

This commit is generated via `cargo make generate-readmes`.

aethanyc requested review from makotokato and sffc March 2, 2022 00:22

aethanyc requested a review from a team as a code owner March 2, 2022 00:22

aethanyc removed the request for review from a team March 2, 2022 00:23

This was referenced Mar 2, 2022

Reduce data size of rule-based segmentation #1653

Closed

Make LineBreakPropertyTable and RuleBreakPropertyTable serializable #1638

Closed

aethanyc added 3 commits March 1, 2022 16:51

Guard serde(borrow) in provider_serde

81c6dc1

Fix segmenter FFI example

46c33b5

Merge remote-tracking branch 'origin/main' into segmenter-provider

4e53a65

aethanyc marked this pull request as draft March 3, 2022 19:57

Integrate with DictionarySegmenter and regenerate testdata

f822e85

aethanyc marked this pull request as ready for review March 3, 2022 20:45

makotokato reviewed Mar 4, 2022

Rename break_data_root -> segmenter_data_root

ba519d2

makotokato previously approved these changes Mar 5, 2022

sffc reviewed Mar 5, 2022

tools/datagen/src/bin/datagen.rs (outdated, resolved)

experimental/segmenter/src/line.rs (outdated, resolved)

sffc reviewed Mar 5, 2022

experimental/segmenter/src/provider.rs (resolved)

Use UNKNOWN as the fallback value for codepoints not in property_table

51d4463

aethanyc added 2 commits March 5, 2022 13:50

Support customized keys for segmenter keys in icu4x-datagen

9bfca00

To generate only the line break data, run a command such as:

```
cargo run --bin icu4x-datagen -- \
    --input-from-testdata \
    --all-locales \
    --syntax=json \
    --keys "segmenter/line@1" \
    --out="/tmp/segmenter_data" \
    --overwrite
```

Make property table size larger to hold grapheme cluster break property values

154b49b

Also, regenerate testdata via `cargo make testdata`.

aethanyc dismissed makotokato’s stale review via 154b49b March 5, 2022 22:05

aethanyc requested a review from Manishearth as a code owner March 5, 2022 22:05

aethanyc requested review from makotokato and sffc and removed request for Manishearth March 8, 2022 01:39

sffc reviewed Mar 8, 2022

tools/datagen/src/bin/datagen.rs (outdated, resolved)

provider/segmenter/src/transform.rs (outdated, resolved)

aethanyc added 2 commits March 7, 2022 22:26

Revert "Support customized keys for segmenter keys in icu4x-datagen"

2910929

This reverts commit 9bfca00.

Load segmenter rule data dependencies from icu_provider_uprops

cb9cdd1

This also supports the customized keys in `--keys` for segmenter keys.

aethanyc requested a review from sffc March 8, 2022 07:23

aethanyc added 2 commits March 8, 2022 16:34

Merge remote-tracking branch 'origin/main' into segmenter-provider

4f7cfc1

Regenerate postcard testdata to resolve merge conflict with main

fdadfdc

aethanyc closed this Mar 9, 2022

aethanyc reopened this Mar 9, 2022

sffc reviewed Mar 9, 2022

sffc approved these changes Mar 9, 2022

aethanyc removed the request for review from makotokato March 9, 2022 01:49

aethanyc merged commit 97da558 into unicode-org:main Mar 9, 2022

aethanyc deleted the segmenter-provider branch March 9, 2022 01:54

sffc mentioned this pull request Mar 9, 2022

Add payload conversion and heap measurement tool #1670

Merged

2 tasks

aethanyc mentioned this pull request Mar 9, 2022

API and Features for Segmenter #1379

Closed
