Character encoding inconsistency / reporting #5681

solardiz · 2025-03-04T08:00:06Z

Testing against the test vectors from openwall/john-samples#31 I am only able to directly crack the simple password 12345678. For cracking the complex password, I have to first process the wordlist through iconv -f utf8 -t iso-8859-1. I guess it got inadvertently converted the other way somewhere on the way to git commit? Should we replace it with the result of this iconv with a subsequent commit?

Oh, alternatively I am able to get it cracked by adding -target-enc=iso-8859-1.

@magnumripper @davidedg please suggest how to fix this encoding issue best, to minimize user confusion and users' wasted time on running with wrong encoding settings. Right now, by default we print Using default input encoding: UTF-8, but with the input wordlist actually in UTF-8 we fail to crack this password. So it feels like a bug.
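A minimal sketch of why the encoding matters here, for readers following along. The same password yields different bytes in UTF-8 and ISO-8859-1, so whatever the format hashes will differ too. The sample password, the helper name `to_latin1_candidates`, and the use of SHA-256 as a stand-in hash are all assumptions for illustration; none of them come from the Oubliette format or from john itself.

```python
import hashlib


def to_latin1_candidates(utf8_lines):
    """Re-encode UTF-8 wordlist lines as ISO-8859-1 (roughly what
    `iconv -f utf8 -t iso-8859-1` does), skipping lines that are not
    valid UTF-8 or not representable in ISO-8859-1."""
    for raw in utf8_lines:
        try:
            yield raw.decode("utf-8").encode("iso-8859-1")
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue


# Hypothetical password with two non-ASCII characters.
password = "pässwörd"
utf8_bytes = password.encode("utf-8")          # 10 bytes
latin1_bytes = password.encode("iso-8859-1")   # 8 bytes

# SHA-256 is only a stand-in; any byte-oriented hash shows the same effect.
digest_utf8 = hashlib.sha256(utf8_bytes).hexdigest()
digest_latin1 = hashlib.sha256(latin1_bytes).hexdigest()

converted = list(to_latin1_candidates([b"12345678", utf8_bytes]))
```

The pure-ASCII candidate `12345678` is byte-identical in both encodings, which is consistent with it being the only password that cracks without conversion.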


magnumripper · 2025-03-04T10:14:17Z

I generally recommend always using UTF-8 for wordlists, and -target-enc where needed. For samples however, maybe it's better to have it as the expected encoding already in the password hint file. If we do that, I suggest we use both encodings in the password hint file: Keep the UTF-8 and add one in ISO. Then also explain this with #!comment: lines!

If we do not change oubliette-passwords.txt (and perhaps even if we do), we should add some kind of README that explains the situation and the -target-enc option.

magnumripper · 2025-03-04T10:29:43Z

> Right now, by default we print Using default input encoding: UTF-8, but with the input wordlist actually in UTF-8 we fail to crack this password. So it feels like a bug.

We could amend the output when --target-encoding is not used, such as:

Using default input encoding: UTF-8 and expecting target encoding to the same

or

Using default input encoding: UTF-8
Expected target encoding: UTF-8

magnumripper · 2025-03-04T10:38:30Z

> Right now, by default we print Using default input encoding: UTF-8, but with the input wordlist actually in UTF-8 we fail to crack this password. So it feels like a bug.

> We could amend the output when --target-encoding is not used, such as:
> Using default input encoding: UTF-8 and expecting target encoding to the same
> or
> Using default input encoding: UTF-8
> Expected target encoding: UTF-8

Hmm no, that ends up even more confusing for the case when no encoding option is used, but the wordlist is already (in this case) in ISO-8859-1. So maybe we should change the Using default input encoding: UTF-8 to Expecting input encoding to match target encoding (for that case, but not if FMT_UNICODE)

solardiz · 2025-03-04T17:31:18Z

> Expecting input encoding to match target encoding

I like this one. Maybe even "Expecting input character encoding to match the target encoding", to be clearer about what kind of encoding we refer to.

I recall that there are cases where passing -enc=raw makes a difference, so perhaps the above isn't always the default?

magnumripper · 2025-03-05T08:43:53Z

> be clearer what kind of encoding we refer to.

What could it be other than character encoding?

> I recall that there are cases where passing -enc=raw makes a difference, so perhaps the above isn't always the default?

For Unicode formats like NT, -enc=raw affects the conversion to UTF-16 (will behave like old john, which in turn behaves exactly like -enc=iso-8859-1 - perhaps we should clearly say so).
For any format including Unicode ones, rules processing with -enc=raw will also behave like old john: Can only lower/upper case ASCII, all character classes are ASCII only, and so on.

We could add a line when rules are in use with RAW:

Rules will not fully support non-ASCII characters

or s/support/handle/
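As an illustration of the two -enc=raw effects described above, here is a rough sketch. It assumes, as stated in the comment, that raw input is widened to UTF-16 exactly as ISO-8859-1 would be, and that rules only change the case of ASCII letters. The function names are hypothetical and this is not john's code.

```python
def utf16_from_utf8(candidate: bytes) -> bytes:
    """Default behaviour for a Unicode format: decode UTF-8, emit UTF-16LE."""
    return candidate.decode("utf-8").encode("utf-16-le")


def utf16_from_raw(candidate: bytes) -> bytes:
    """Raw behaviour: every input byte becomes one 16-bit code unit,
    which is what decoding as ISO-8859-1 amounts to."""
    return candidate.decode("iso-8859-1").encode("utf-16-le")


def ascii_only_upper(candidate: bytes) -> bytes:
    """Upper-case only a-z; bytes outside ASCII pass through untouched."""
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in candidate)


e_acute_utf8 = "é".encode("utf-8")              # two bytes: C3 A9
as_utf8 = utf16_from_utf8(e_acute_utf8)         # one code unit: U+00E9
as_raw = utf16_from_raw(e_acute_utf8)           # two code units: U+00C3, U+00A9

upper_raw = ascii_only_upper("café".encode("iso-8859-1"))  # é stays lower case
```

With UTF-8 input and raw mode, the single character é turns into two UTF-16 code units, so an NT-style hash would be computed over a different string than intended.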

solardiz · 2025-03-05T15:06:12Z

> What could it be other than character encoding?

e.g. base64 ;-)

> Rules will not fully support non-ASCII characters
>
> or s/support/handle/

Yes, we could. Maybe prefix it with "Note: " like we do for some other things that are almost but not quite warnings.

magnumripper · 2025-03-14T13:42:36Z

Thinking out loud, and I may edit this post. There's an incredible number of possible situations, potentially needing different reporting:

UTF-16 format, default settings, without rules. UTF-8 input, UTF-16 output. This is never "lossy" as long as the input actually is UTF-8. Invalid UTF-8 in the input will likely result in truncated/trashed output sent to the format.
UTF-16 format, default settings (no internal codepage), with rules. The rules engine will only fully support ASCII (for case changes, character classes and so on). Invalid UTF-8 in the input (or as a result of rules mangling) will likely result in truncated/trashed output sent to the format.
UTF-16 format, using internal codepage for rules/masks (implies/requires UTF-8 input). The rules engine will fully support non-ASCII as long as the input characters are included in the chosen codepage. The codepage doesn't have to be "correct" in any other sense, so ISO-8859-1, CP1252, CP437 or CP850 will all work fine for most western non-ASCII characters even if, for example, the targeted NT hashes came from a system using CP1252 as its "ANSI" codepage. Invalid UTF-8 in the input will likely result in truncated/trashed output sent to the format.
UTF-16 format, using some legacy codepage for input (I think this implies it will also be used as internal codepage, or we'll have yet another case for that).
UTF-16 format, using -enc=raw. This is by necessity the same as using ISO-8859-1 input encoding, except (if rules are in use) it will disable any internal codepage in case there's a custom default for it in conf. The rules engine will only fully support ASCII.
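The point that the internal codepage only needs to contain the characters in use, not be "correct", can be sketched roughly as follows. The helper `upper_via_codepage` is hypothetical and only mimics the idea of converting UTF-8 input to a single-byte codepage for a rule and back again. It uses Python's codecs rather than john's tables.

```python
def upper_via_codepage(word_utf8: bytes, codepage: str) -> bytes:
    """Apply an upper-case 'rule' while the word is held in a legacy codepage.

    Raises UnicodeDecodeError for invalid UTF-8 input and UnicodeEncodeError
    when a character (before or after the rule) is not in the codepage.
    """
    text = word_utf8.decode("utf-8")
    internal = text.encode(codepage)              # UTF-8 -> internal codepage
    mangled = internal.decode(codepage).upper().encode(codepage)
    return mangled.decode(codepage).encode("utf-8")  # back to UTF-8


word = "café".encode("utf-8")
results = {
    cp: upper_via_codepage(word, cp)
    for cp in ("iso-8859-1", "cp1252", "cp437", "cp850")
}
```

All four codepages contain both é and É, so each gives the same UTF-8 result. A character missing from the chosen codepage, such as ł with CP1252, raises an error in this sketch instead of being handled.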

and even worse:

Normal format, using -enc=raw without rules. This will send the input as-is to the hash function. Here we could say Expecting input encoding to match target encoding, not mentioning anything else about the matter.
Normal format, using -enc=raw with rules (no internal codepage). Same as the previous one except we might want to mention that the rules engine will only fully support ASCII.
Normal format, using defaults without rules. This very case ought to be the same as using -enc=raw so we should again say Expecting input encoding to match target encoding and not mention anything else about the matter.
Normal format, using defaults with rules (no internal codepage). This should be the same as using -enc=raw with rules per above.
Normal format, UTF-8 input, non-UTF-8 target encoding, without rules. Invalid UTF-8 in the input will likely result in truncated/trashed output sent to the format.
Normal format, with rules and using an internal encoding or target encoding (both imply/require UTF-8 input, and unspecified target encoding means [I'm pretty sure] we'll encode back to UTF-8 after rules). The rules engine will fully support non-ASCII as long as the input characters are included in the chosen codepage. The codepage doesn't have to be "correct" in any other sense. Invalid UTF-8 in the input will likely result in truncated/trashed output sent to the format.

"UTF-16 format" per above, also applies to UTF-32 formats in case we have any? I can't recall we do.

I'm not entirely sure I even listed half of all possibilities, lol
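One way to read the cases above is as a small decision table. The sketch below is an assumption-heavy illustration of that reading, not a proposal for john's actual code. The `Run` fields and the raw-mode message wording are hypothetical; the other messages are the ones quoted earlier in this thread.

```python
from dataclasses import dataclass
from typing import List, Optional

MSG_UTF8 = "Using default input encoding: UTF-8"
MSG_MATCH = "Expecting input encoding to match target encoding"
MSG_RULES = "Note: Rules will not fully support non-ASCII characters"
MSG_RAW = "Input encoding: raw (handled as ISO-8859-1)"  # hypothetical wording


@dataclass(frozen=True)
class Run:
    unicode_format: bool                      # e.g. a UTF-16 format such as NT
    raw_encoding: bool = False                # -enc=raw given
    rules: bool = False                       # rules in use
    internal_codepage: Optional[str] = None   # internal codepage, if any
    target_encoding: Optional[str] = None     # target encoding, if any


def report_lines(run: Run) -> List[str]:
    # Assumption: raw disables any codepage, and a target encoding
    # implies a codepage for rules, as the list above suggests.
    codepage = None if run.raw_encoding else (
        run.internal_codepage or run.target_encoding)
    converts = run.unicode_format or codepage is not None

    lines = []
    if not converts:
        lines.append(MSG_MATCH)   # input goes to the hash function as-is
    elif run.raw_encoding:
        lines.append(MSG_RAW)
    else:
        lines.append(MSG_UTF8)
    if run.rules and codepage is None:
        lines.append(MSG_RULES)
    return lines
```

For example, `report_lines(Run(unicode_format=False, rules=True))` gives the "match target encoding" line followed by the rules note, which corresponds to the "normal format, using defaults with rules" case.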

solardiz added the bug label Mar 4, 2025

solardiz added this to the Potentially 2.0.0 milestone Mar 4, 2025

solardiz mentioned this issue Mar 4, 2025

Oubliette Password Manager support #5680

Merged

magnumripper self-assigned this Mar 5, 2025

magnumripper changed the title ~~Oubliette character encoding inconsistency~~ Character encoding inconsistency / reporting Mar 5, 2025
