-
Notifications
You must be signed in to change notification settings - Fork 214
Support production-ready data provider for segmenters #1652
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The purpose of the crate is to deserialize the segmenter rule break TOML files into `SegmenterRuleTable`, and transform it into `RuleBreakDataV1`. The main function where the transformation takes place is in SegmenterRuleProvider::generate_break_data(), which is ported from `generate_rule_segmenter_table` along with many other helpers in `build.rs`. Flatten `RuleBreakPropertyTable` into a linear structure so that it can be serialize/dezerialize via ZeroVec. In the next commit, we'll convert line segmenter to use RuleBreakDataV1. This patch removes "provider_serde" cfg for LineBreakDataV1 just to build successfully.
This commit is generated via `cargo make testdata`.
This commit is generated via `cargo make diplomat-gen`.
This commit is generated via `cargo make generate-readmes`.
I'll resolve the conflict on top of #1644. We haven't support loading dictionary data from |
Thanks a lot! |
I make the line segmenter to load the dictionary data from |
|
||
/// Returns the absolute path to the directory containing rule break data. | ||
pub fn break_data_root() -> PathBuf { | ||
PathBuf::from(std::env!("CARGO_MANIFEST_DIR")).join("data") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When adding LSTM (JSON) and dictionary (currently this is binary data from ICU4C), do we store it into same directory?
If you think that others are stored into provider/testdata/data/json/segmenter
via JSON format directory, please ignore my comment.
I imagine that the following.
data/json/segmenter/lstm/th.json
ordata/json/segmenter_lstm/th.json
data/json/segmenter/char16trie/th.json
ordata/json/segmenter_char16trie/th.json
(convert from binary to JSON)- etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I take a closer look at how Unicode properties testdata are structured.
For ICU4C generated raw toml data, they are placed in provider/testdata/data/uprops
, and icu4x-datagen generates testdata under provider/testdata/data/json/props
and a binary format under provider/testdata/data/testdata.postcard
.
Also, note that the Resource key has a restrict format, and it determines the generated data path and file name.
If LSTM/dictionary data's raw format and runtime format are similar or even identical, the provider transforming the raw data to the runtime data can be trivial, but I guess we still need to write such a provider for icu4x-datagen
?
I image we can put the segmenter raw data (rule break TOML, raw LSTM json, and ICU4C generated dictionary data) under provider/testdata/data/segmenter
, and the the resource key -> generate data path will be like:
LineBreakDataV1Marker = "segmenter/line@1"
->provider/testdata/data/json/segmenter/[email protected]
ThaiDictionaryBreakDataV1Marker = segmenter/thai_dict@1
->provider/testdata/data/json/segmenter/[email protected]
ThaiLstmBreakDataV1Marker = segmenter/thai_lstm@1
->provider/testdata/data/json/segmenter/[email protected]
(Yes, I think we need different key per language for dictionary and lstm)
I know there is a plan to make icu4x-datagen more friendly to the user, so the data path might be subjected to change. I'll wait @sffc's review and follow his opinion on how we should structure the segmenter data so that they integrate better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What @aethanyc wrote above looks correct to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2.6 MB is more than I was expecting. But the PR looks good.
To generate only the line break data, run command such as ``` cargo run --bin icu4x-datagen --\ --input-from-testdata\ --all-locales\ --syntax=json\ --keys "segmenter/line@1"\ --out="/tmp/segmenter_data"\ --overwrite ```
…ty values Also, regenerate testdata via `cargo make testdata`.
Addressed @sffc's comment with three new commits. Re-request the review. |
This reverts commit 9bfca00.
This also supports the customized keys in `--keys` for segmenter keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge this and continue the discussion about data size in #1653
I've addressed all @makotokato's comments, and have @sffc's review on the latest version. I feel it's sufficient to merge this PR. |
Fixed #751.
Fixed #1372.
Fixed #1373.
Fixed #1376.
Fixed #1638.
This PR makes the segmenter capable of loading rule break data from production-ready providers such as
FsDataProvider
orStaticDataProvider
.Some highlights of the patch stack:
icu_provider_segmenter
to deserialize the segmenter rule-break TOML files intoSegmenterRuleTable
, and transform it intoRuleBreakDataV1
. The transformation logic is ported frombuild.rs
, and we leverage serde to serializeRuleBreakDataV1
.LineBreakDataV1
andRuleBreakDataV1
are the same, so I unified the line segmenter to useRuleBreakDataV1
.build.rs
.