Skip to content

Datagen CLI improvements #2950

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Jan 10, 2023
Merged

Conversation

robertbastian
Copy link
Member

@robertbastian robertbastian commented Jan 4, 2023

I made more parameters have defaults so that it's easier to play around with:

  • --keys all for all keys, --keys none for none (already available through the API)
  • --locales full for all locales, --locales none for none (now works correctly on the API level, before it would still generate unds)
  • --format now prints a warning when omitted
  • --cldr-tag and --icuexport-tag now default to latest. Downloading is lazy now though, so we'll only talk to the network if the data is needed (i.e. not if we do --locales none or --keys none)
  • databake now fails if the directory exists. We cannot add an overwrite flag because that'd be breaking
  • --out now defaults to icu4x_data for --format=dir and --format=mod
  • Removed --hello-world. It's mainly used internally, and now that all keys work out of the box, clients shouldn't have a need for this

This means the invocation

icu4x-datagen --format=mod --keys all --locales full

is now enough to generate data for all locales, all keys, with the latest CLDR and ICU data.

Similarly

icu4x-datagen --format=mod --keys all --locales none

now also runs without downloading any data, and can be used to generate initial baked data to make a binary (or, stay tuned, an FFI layer, #2743) compile.

@robertbastian robertbastian marked this pull request as ready for review January 4, 2023 23:31
@robertbastian robertbastian requested review from Manishearth, sffc and a team as code owners January 4, 2023 23:31
@robertbastian robertbastian mentioned this pull request Jan 5, 2023
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot of good things in here, and other things that need further discussion.

Note that I see #2945 as the path forward to better support clients who want easy, default settings, so I put less value on the ergonomics of icu4x-datagen. I put more value on the educational value of icu4x-datagen.

)
.arg(
Arg::with_name("CLDR_TAG")
.short("t")
.long("cldr-tag")
.value_name("TAG")
.default_value("latest")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion on setting defaults for --cldr-tag and --icuexport-tag

Pros and cons for making this change:

  1. Pro: This is the option most people should choose if they don't know any better.
  2. Con: It makes the default behavior of icu4x-datagen be non-deterministic and network-dependent, which seems undesirable.
  3. Con: It's good for educational purposes for people to make an explicit decision about where to get their data.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Con: It makes the default behavior of icu4x-datagen be non-deterministic and network-dependent, which seems undesirable.

I see icu4x-datagen as a simple tool to get started with ICU4X, so it should be allowed to use the network by default. Note that the API does not download by default, which is more appropriate for build scripts.

Con: It's good for educational purposes for people to make an explicit decision about where to get their data.

I don't think we should encourage non-standard or outdated data sources. If clients want to use the binary in some kind of build process, they will look at icu4-datagen --help.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If others agree that network access in the CLI version of icu4x-datagen is OK as a default, I'm fine with this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging this because it blocks another PR, we can address this in a follow-up.

@sffc
Copy link
Member

sffc commented Jan 5, 2023

I'm almost sold on setting the default CLDR/ICU to latest, but for now I still prefer users to write an explicit --all-keys --all-locales. It makes a nice and clear symmetry to show what's going on. I think it's fair to say that these are the 2 most important flags, and they can be required. --format used to be defaulted, and now it's also required, which seems fine.

@robertbastian robertbastian requested a review from sffc January 5, 2023 23:48
@@ -23,18 +23,13 @@ Get a coffee, this might take a while ☕.
Once installed, run:

```console
$ icu4x-datagen --cldr-tag latest --icuexport-tag latest --out my_data_blob.postcard --format blob --all-keys --all-locales
$ icu4x-datagen --keys all --locales all --format blob --out my_data_blob.postcard
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible Issue: all is a valid locale, Allar (none is fine because it is 4 letters)

Copy link
Member

@sffc sffc Jan 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bikeshed:

  • --locales all and bet that there won't ever be Allar data in CLDR (this might not be a very safe bet, because CLDR is working on adding seed data in thousands of locales for things like "default region for language", etc)
  • --locales-all
  • --locales full
  • --locales every
  • --locales all!
  • or stick with --all-locales

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I thought about this, but it's very unlikely that a client would want to generate just Allar, don't you think? If they're generating many single-locale blobs, they should use the API. I think it's fine to say generating only Allar is not supported by the CLI.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe any is actually a better name, because it is actually a filter. Also exists though.

The reason why I want to get rid of the separate --all-x arguments is because they clutter the --help output.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find that to be a huge concern, personally. Not really a fan of ambiguity in the format

Why not --locales *?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@macchiati pointed out that these are not actually "all" locales; they are only whatever locales happen to be covered in the current CLDR release. So, --locales modern or --locales basic could be more precise.

@srl295 suggested that keywords could be expected to be wrapped with some other characters, like --locales <none> or --locales "*".

@pedberg-icu agrees that we "should not use anything that could be confused with a potential language code"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's --locales basic? PTAL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CLDR coverage levels are basic, moderate, and modern. I think full and basic correspond to the same set of locales?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CLDR JSON sets are probably modern = {modern} and full = {modern, moderate, basic} then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CLDR JSON sets are probably modern = {modern} and full = {modern, moderate, basic} then?

I believe so; @srl295 can you confirm?

@@ -23,18 +23,13 @@ Get a coffee, this might take a while ☕.
Once installed, run:

```console
$ icu4x-datagen --cldr-tag latest --icuexport-tag latest --out my_data_blob.postcard --format blob --all-keys --all-locales
$ icu4x-datagen --keys all --locales all --format blob --out my_data_blob.postcard
Copy link
Member

@sffc sffc Jan 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bikeshed:

  • --locales all and bet that there won't ever be Allar data in CLDR (this might not be a very safe bet, because CLDR is working on adding seed data in thousands of locales for things like "default region for language", etc)
  • --locales-all
  • --locales full
  • --locales every
  • --locales all!
  • or stick with --all-locales

@robertbastian robertbastian requested a review from sffc January 6, 2023 14:43
@robertbastian robertbastian requested a review from sffc January 6, 2023 20:27
@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@robertbastian robertbastian merged commit 15d8e3a into unicode-org:main Jan 10, 2023
@robertbastian robertbastian deleted the lazysource branch January 10, 2023 10:31
@robertbastian robertbastian mentioned this pull request Jan 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants