-
Notifications
You must be signed in to change notification settings - Fork 214
Datagen CLI improvements #2950
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Datagen CLI improvements #2950
Conversation
781b0d2
to
117b4ff
Compare
117b4ff
to
ad075f0
Compare
ad075f0
to
e3122d1
Compare
e3122d1
to
dc2d8f2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a lot of good things in here, and other things that need further discussion.
Note that I see #2945 as the path forward to better support clients who want easy, default settings, so I put less value on the ergonomics of icu4x-datagen. I put more value on the educational value of icu4x-datagen.
) | ||
.arg( | ||
Arg::with_name("CLDR_TAG") | ||
.short("t") | ||
.long("cldr-tag") | ||
.value_name("TAG") | ||
.default_value("latest") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussion on setting defaults for --cldr-tag
and --icuexport-tag
Pros and cons for making this change:
- Pro: This is the option most people should choose if they don't know any better.
- Con: It makes the default behavior of
icu4x-datagen
be non-deterministic and network-dependent, which seems undesirable. - Con: It's good for educational purposes for people to make an explicit decision about where to get their data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Con: It makes the default behavior of icu4x-datagen be non-deterministic and network-dependent, which seems undesirable.
I see icu4x-datagen as a simple tool to get started with ICU4X, so it should be allowed to use the network by default. Note that the API does not download by default, which is more appropriate for build scripts.
Con: It's good for educational purposes for people to make an explicit decision about where to get their data.
I don't think we should encourage non-standard or outdated data sources. If clients want to use the binary in some kind of build process, they will look at icu4-datagen --help
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If others agree that network access in the CLI version of icu4x-datagen
is OK as a default, I'm fine with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merging this because it blocks another PR, we can address this in a follow-up.
I'm almost sold on setting the default CLDR/ICU to |
39b541a
to
958ce4c
Compare
docs/tutorials/data_management.md
Outdated
@@ -23,18 +23,13 @@ Get a coffee, this might take a while ☕. | |||
Once installed, run: | |||
|
|||
```console | |||
$ icu4x-datagen --cldr-tag latest --icuexport-tag latest --out my_data_blob.postcard --format blob --all-keys --all-locales | |||
$ icu4x-datagen --keys all --locales all --format blob --out my_data_blob.postcard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possible Issue: all
is a valid locale, Allar (none
is fine because it is 4 letters)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bikeshed:
--locales all
and bet that there won't ever be Allar data in CLDR (this might not be a very safe bet, because CLDR is working on adding seed data in thousands of locales for things like "default region for language", etc)--locales-all
--locales full
--locales every
--locales all!
- or stick with
--all-locales
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I thought about this, but it's very unlikely that a client would want to generate just Allar, don't you think? If they're generating many single-locale blobs, they should use the API. I think it's fine to say generating only Allar is not supported by the CLI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe any
is actually a better name, because it is actually a filter. Also exists though.
The reason why I want to get rid of the separate --all-x
arguments is because they clutter the --help
output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't find that to be a huge concern, personally. Not really a fan of ambiguity in the format
Why not --locales *
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@macchiati pointed out that these are not actually "all" locales; they are only whatever locales happen to be covered in the current CLDR release. So, --locales modern
or --locales basic
could be more precise.
@srl295 suggested that keywords could be expected to be wrapped with some other characters, like --locales <none>
or --locales "*"
.
@pedberg-icu agrees that we "should not use anything that could be confused with a potential language code"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's --locales basic
? PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CLDR coverage levels are basic
, moderate
, and modern
. I think full
and basic
correspond to the same set of locales?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CLDR JSON sets are probably modern = {modern} and full = {modern, moderate, basic} then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CLDR JSON sets are probably modern = {modern} and full = {modern, moderate, basic} then?
I believe so; @srl295 can you confirm?
docs/tutorials/data_management.md
Outdated
@@ -23,18 +23,13 @@ Get a coffee, this might take a while ☕. | |||
Once installed, run: | |||
|
|||
```console | |||
$ icu4x-datagen --cldr-tag latest --icuexport-tag latest --out my_data_blob.postcard --format blob --all-keys --all-locales | |||
$ icu4x-datagen --keys all --locales all --format blob --out my_data_blob.postcard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bikeshed:
--locales all
and bet that there won't ever be Allar data in CLDR (this might not be a very safe bet, because CLDR is working on adding seed data in thousands of locales for things like "default region for language", etc)--locales-all
--locales full
--locales every
--locales all!
- or stick with
--all-locales
9162b25
to
0454e0c
Compare
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
I made more parameters have defaults so that it's easier to play around with:
--keys all
for all keys,--keys none
for none (already available through the API)--locales full
for all locales,--locales none
for none (now works correctly on the API level, before it would still generateund
s)--format
now prints a warning when omitted--cldr-tag
and--icuexport-tag
now default tolatest
. Downloading is lazy now though, so we'll only talk to the network if the data is needed (i.e. not if we do--locales none
or--keys none
)--out
now defaults toicu4x_data
for--format=dir
and--format=mod
--hello-world
. It's mainly used internally, and now that all keys work out of the box, clients shouldn't have a need for thisThis means the invocation
icu4x-datagen --format=mod --keys all --locales full
is now enough to generate data for all locales, all keys, with the latest CLDR and ICU data.
Similarly
icu4x-datagen --format=mod --keys all --locales none
now also runs without downloading any data, and can be used to generate initial baked data to make a binary (or, stay tuned, an FFI layer, #2743) compile.