Skip to content

Locale deserialize 'zh-hant_CN' #1948

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
takeseem opened this issue Feb 28, 2018 · 6 comments
Closed

Locale deserialize 'zh-hant_CN' #1948

takeseem opened this issue Feb 28, 2018 · 6 comments

Comments

@takeseem
Copy link

takeseem commented Feb 28, 2018

zh-hant_CN will convert to zh_HANT_cn
Locale include (language, country, variant), Locale.toString() use '_' to split then {language}_{country}_({variant}_# | #){script}-{extensions}

but Jackson deserialize Locale use '-' or '_' : (
code: FromStringDeserializer.Std._deserialize

iana language-tags

zh-Hant	Chinese, in traditional script	[Mark_Davis][Mark_Davis_2]	
zh-Hant-CN	PRC Mainland Chinese in traditional script	[Mark_Davis][Mark_Davis_2]	
zh-Hant-HK	Hong Kong Chinese in traditional script	[Mark_Davis][Mark_Davis_2]	
zh-Hant-MO	Macao Chinese in traditional script	[Mark_Davis][Mark_Davis_2]	
zh-Hant-SG	Singapore Chinese in traditional script	[Mark_Davis][Mark_Davis_2]	
zh-Hant-TW	Taiwan Chinese in traditional script	[Mark_Davis][Mark_Davis_2]

rfc4647 language lookup

  Example of a Lookup Fallback Pattern

   Range to match: zh-Hant-CN-x-private1-private2
   1. zh-Hant-CN-x-private1-private2
   2. zh-Hant-CN-x-private1
   3. zh-Hant-CN
   4. zh-Hant
   5. zh
   6. (default)
	public static void main(String[] args) {
		System.out.println(new Locale("zh-hant", "CN"));
	}

in jdk1.5 -> 1.8

jdk1.5	zh-hant_CN
jdk1.6	zh-hant_CN
jdk1.7	zh-hant_CN
jdk1.8	zh-hant_CN

others:
#1344
#1600

takeseem added a commit to takeseem/jackson-databind that referenced this issue Feb 28, 2018
@cowtowncoder
Copy link
Member

I am not sure I really understand the intent here: for one, changing observed behavior in a patch is almost guaranteed break some usage somewhere. I can not do that without good justification.

Locale instances have 3 pieces of information:

  • Language
  • Country
  • Variant

and the important part is that these parts get correctly split. Check should NOT be based on textual representation by Locale.toString() only (or even primarily).

But most of all... I am not sure I understand the problem being solved here. What exactly is wrong, and how is it being improved?

@takeseem
Copy link
Author

takeseem commented Mar 1, 2018

            case STD_LOCALE:
                {
                    int ix = value.indexOf('_');
                    if (ix < 0) { // single argument
                        return new Locale(value);
                    }
                    String first = value.substring(0, ix);
                    value = value.substring(ix+1);
                    ix = value.indexOf('_');
                    if (ix < 0) { // two pieces
                        return new Locale(first, value);
                    }
                    String second = value.substring(0, ix);
                    return new Locale(first, second, value.substring(ix+1));
                }
            case STD_LOCALE:
                {
                    int ix = _firstHyphenOrUnderscore(value);
                    if (ix < 0) { // single argument
                        return new Locale(value);
                    }
                    String first = value.substring(0, ix);
                    value = value.substring(ix+1);
                    ix = _firstHyphenOrUnderscore(value);
                    if (ix < 0) { // two pieces
                        return new Locale(first, value);
                    }
                    String second = value.substring(0, ix);
                    return new Locale(first, second, value.substring(ix+1));
                }

        protected int _firstHyphenOrUnderscore(String str)
        {
            for (int i = 0, end = str.length(); i < end; ++i) {
                char c = str.charAt(i);
                if (c == '_' || c == '-') {
                    return i;
                }
            }
            return -1;
        }

Jackson serialize Locale default use Locale.toString(), It will be broken.

a most of system use the _ to split language, country, variant, like

  • en_US
  • zh_CN
  • zh_TW
  • zh-hant_CN

takeseem added a commit to takeseem/jackson-databind that referenced this issue Mar 1, 2018
config: System.setProperty("jackson.std.locale.de.java", "not empty")
@takeseem
Copy link
Author

takeseem commented Mar 1, 2018

commit #1951

so If use Java system (include andorid), we need a complete Java style deserialize Locale's String.

Default is use - or \_ split,
If set -Djackson.std.locale.de.java="not empty" then force use java Locale style.

@cowtowncoder
Copy link
Member

I am not sure I understand, still. Older versions only accepted underscore (_) as separator.
Newer code allows hyphen as well. If I understand you correctly, you are saying that only underscore should be accepted? This seems odd since as per #1344 both should be accepted?

Removing acceptance of hyphen would probably break existing usage so that can not be simply disabled. If logic for when one or both may be used is wrong we can improve it. We just need to agree on how logic should work; perhaps it should check separates from left to right?

And for tests we need something better than simply writing out strings to compare: components decode should be tested for correctness. Not just read/write and verify round-trip (String value written is same as String value used as input), but that component, language deserialized matches Locale object.

@cowtowncoder
Copy link
Member

Another possibility: JDK 1.7 has Locale.forLanguageTag(String). I could possibly change code for next minor or major version to use that (although probably not safe for patch), if that is an improvement.

@cowtowncoder
Copy link
Member

At this point I am not sure what expected handling would be; and I do not think proposed solution is correct. It is NOT correct to say that underscore (_) is the canonical separator, for example: RFCs that specify this claim hyphen should be used. Yet JDK prefers underscore for legacy reasons.
Some more discussion can be found from answers on:

https://softwareengineering.stackexchange.com/questions/325458/parse-both-en-us-and-en-us-as-locale-in-java/367706#367706

Finally, I don't think Jackson 2.9 should change handling unless it can be guaranteed change does not break existing usage.
Because of this, I think I will change 3.0 to use toLanguageTag() and forLanguageTag(), but leave 2.9 as is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants