Character encoding inconsistency / reporting #5681

Open
solardiz opened this issue Mar 4, 2025 · 7 comments

@solardiz
Member

solardiz commented Mar 4, 2025

Originally in #5680 (comment)

Testing against the test vectors from openwall/john-samples#31 I am only able to directly crack the simple password 12345678. For cracking the complex password, I have to first process the wordlist through iconv -f utf8 -t iso-8859-1. I guess it got inadvertently converted the other way somewhere on the way to git commit? Should we replace it with the result of this iconv with a subsequent commit?

Oh, alternatively I am able to get it cracked by adding -target-enc=iso-8859-1.
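To illustrate why the wordlist encoding matters here, a minimal Python sketch: the same non-ASCII word yields different bytes in UTF-8 and ISO-8859-1, so a byte-oriented hash of one will not match a hash of the other. The candidate word and the use of SHA-1 are assumptions for illustration only; they are not the actual sample password or the algorithm of the format under test.

```python
import hashlib

# Hypothetical non-ASCII candidate; NOT the real password from john-samples.
candidate = "pässwörd"
ascii_only = "12345678"

utf8_bytes = candidate.encode("utf-8")
latin1_bytes = candidate.encode("iso-8859-1")

# Same text, different byte strings (two bytes vs. one byte per "ä"/"ö").
print(utf8_bytes.hex())
print(latin1_bytes.hex())

# SHA-1 is only a stand-in for "some hash computed over the raw bytes".
utf8_digest = hashlib.sha1(utf8_bytes).hexdigest()
latin1_digest = hashlib.sha1(latin1_bytes).hexdigest()
print("digests match:", utf8_digest == latin1_digest)

# Rough per-word equivalent of `iconv -f utf8 -t iso-8859-1`.
converted = utf8_bytes.decode("utf-8").encode("iso-8859-1")

# Pure-ASCII words are byte-identical in both encodings, which would explain
# why the simple password cracks without any conversion.
ascii_same = ascii_only.encode("utf-8") == ascii_only.encode("iso-8859-1")
```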

@magnumripper @davidedg please suggest how to fix this encoding issue best, to minimize user confusion and users' wasted time on running with wrong encoding settings. Right now, by default we print Using default input encoding: UTF-8, but with the input wordlist actually in UTF-8 we fail to crack this password. So it feels like a bug.

@magnumripper
Member

I generally recommend always using UTF-8 for wordlists, and -target-enc where needed. For samples however, maybe it's better to have it as the expected encoding already in the password hint file. If we do that, I suggest we use both encodings in the password hint file: Keep the UTF-8 and add one in ISO. Then also explain this with #!comment: lines!

If we do not change oubliette-passwords.txt (and perhaps even if we do), we should add some kind of README that explains the situation and the -target-enc option.

@magnumripper
Member

magnumripper commented Mar 4, 2025

Right now, by default we print Using default input encoding: UTF-8, but with the input wordlist actually in UTF-8 we fail to crack this password. So it feels like a bug.

We could amend the output when --target-encoding is not used, such as:

Using default input encoding: UTF-8 and expecting target encoding to be the same

or

Using default input encoding: UTF-8
Expected target encoding: UTF-8

Hmm no, that ends up even more confusing for the case when no encoding option is used, but the wordlist is already (in this case) in ISO-8859-1. So maybe we should change the Using default input encoding: UTF-8 to Expecting input encoding to match target encoding (for that case, but not if FMT_UNICODE)

@solardiz
Member Author

solardiz commented Mar 4, 2025

Expecting input encoding to match target encoding

I like this one. Maybe even: Expecting input character encoding to match the target encoding to be clearer what kind of encoding we refer to.

I recall that there are cases where passing -enc=raw makes a difference, so perhaps the above isn't always the default?

@magnumripper
Member

be clearer what kind of encoding we refer to.

What could it be other than character encoding?

I recall that there are cases where passing -enc=raw makes a difference, so perhaps the above isn't always the default?

For Unicode formats like NT, -enc=raw affects the conversion to UTF-16 (will behave like old john, which in turn behaves exactly like -enc=iso-8859-1 - perhaps we should clearly say so).
For any format including Unicode ones, rules processing with -enc=raw will also behave like old john: Can only lower/upper case ASCII, all character classes are ASCII only, and so on.
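A small Python sketch of the two effects described above, under the assumption that "raw" conversion to UTF-16 simply widens each input byte to one 16-bit code unit (which is what decoding as ISO-8859-1 amounts to). The helper name `raw_to_utf16le` is hypothetical and is not taken from the john source.

```python
def raw_to_utf16le(data: bytes) -> bytes:
    """Widen every input byte to one little-endian 16-bit code unit."""
    return b"".join(bytes([b, 0]) for b in data)

word = "Émile"
latin1_input = word.encode("iso-8859-1")
utf8_input = word.encode("utf-8")
proper_utf16 = word.encode("utf-16-le")

# Byte widening is the same thing as treating the input as ISO-8859-1.
same_as_latin1 = (
    raw_to_utf16le(latin1_input)
    == latin1_input.decode("iso-8859-1").encode("utf-16-le")
)

# UTF-8 input pushed through the raw path gives different (wrong) UTF-16.
raw_on_utf8 = raw_to_utf16le(utf8_input)

# ASCII-only case conversion: Python's bytes.upper() only touches a-z,
# which mimics a rules engine that knows nothing about non-ASCII letters.
ascii_upper = "émile".encode("iso-8859-1").upper()
print(ascii_upper)  # the leading 0xE9 byte ("é") is left unchanged
```

With ISO-8859-1 input the raw path produces the correct UTF-16, while UTF-8 input does not, and the upper-casing step leaves the accented letter in lower case.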

We could add a line when rules are in use with RAW:

Rules will not fully support non-ASCII characters

or s/support/handle/

magnumripper self-assigned this Mar 5, 2025
magnumripper changed the title from "Oubliette character encoding inconsistency" to "Character encoding inconsistency / reporting" Mar 5, 2025
@solardiz
Member Author

solardiz commented Mar 5, 2025

What could it be other than character encoding?

e.g. base64 ;-)

Rules will not fully support non-ASCII characters

or s/support/handle/

Yes, we could. Maybe prefix it with "Note: " like we do for some other things that are almost but not quite warnings.

@magnumripper
Member

magnumripper commented Mar 14, 2025

Thinking out loud, and I may edit this post. There's an incredible number of possible situations, potentially needing different reporting:

  • UTF-16 format, default settings, without rules. UTF-8 input, UTF-16 output. This is never "lossy" as long as the input actually is UTF-8: Invalid UTF-8 in the input will likely result in truncated/thrashed output sent to the format.
  • UTF-16 format, default settings (no internal codepage), with rules. The rules engine will only fully support ASCII (for case changes, character classes and so on). Invalid UTF-8 in the input (or as a result of rules mangling) will likely result in truncated/thrashed output sent to the format.
  • UTF-16 format, using internal codepage for rules/masks (implies/requires UTF-8 input). The rules engine will fully support non-ASCII as long as the input characters are included in the chosen codepage. The codepage doesn't have to be "correct" in any other sense, so ISO-8859-1, CP1252, CP437 or CP850 will all work fine for most western non-ASCII characters even if, for example, the targeted NT hashes came from a system using CP1252 as its "ANSI" codepage. Invalid UTF-8 in the input will likely result in truncated/thrashed output sent to the format.
  • UTF-16 format, using some legacy codepage for input (I think this implies it will also be used as internal codepage, or we'll have yet another case for that).
  • UTF-16 format, using -enc=raw. This is by necessity the same as using ISO-8859-1 input encoding, except (if rules are in use) it will disable any internal codepage in case there's a custom default for it in conf. The rules engine will only fully support ASCII.
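The internal-codepage case in the list above can be sketched in Python as follows. This is only an illustration of the idea (UTF-8 in, single-byte codepage for rule processing, UTF-16 out), not john's actual implementation; `upper_table` and `upcase_via_codepage` are hypothetical helpers, and upper-casing stands in for an arbitrary rule.

```python
def upper_table(codepage: str) -> bytes:
    """Build a 256-entry byte-to-byte upper-casing table for a codepage."""
    table = bytearray(range(256))
    for b in range(256):
        try:
            upper = bytes([b]).decode(codepage).upper().encode(codepage)
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue  # byte undefined, or upper-case form not in this codepage
        if len(upper) == 1:  # skip one-to-many mappings such as "ß" -> "SS"
            table[b] = upper[0]
    return bytes(table)

def upcase_via_codepage(utf8_word: bytes, codepage: str) -> bytes:
    """UTF-8 in, apply the rule in the internal codepage, UTF-16LE out."""
    internal = utf8_word.decode("utf-8").encode(codepage)
    mangled = internal.translate(upper_table(codepage))
    return mangled.decode(codepage).encode("utf-16-le")

word = "café".encode("utf-8")
codepages = ("iso-8859-1", "cp1252", "cp437", "cp850")
results = {cp: upcase_via_codepage(word, cp) for cp in codepages}

# "é" has a different byte value in CP437/CP850 than in ISO-8859-1/CP1252,
# yet the UTF-16 result is the same for every codepage that contains it.
all_same = len(set(results.values())) == 1
```

In this sketch a character that is missing from the chosen codepage makes Python raise `UnicodeEncodeError`; how john handles that situation is not specified in this thread.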

and even worse:

  • Normal format, using -enc=raw without rules. This will send the input as-is to the hash function. Here we could say Expecting input encoding to match target encoding, not mentioning anything else about the matter.
  • Normal format, using -enc=raw with rules (no internal codepage). Same as the previous one except we might want to mention that the rules engine will only fully support ASCII.
  • Normal format, using defaults without rules. This very case ought to be the same as using -enc=raw so we should again say Expecting input encoding to match target encoding and not mention anything else about the matter.
  • Normal format, using defaults with rules (no internal codepage). This should be the same as using -enc=raw with rules per above.
  • Normal format, UTF-8 input, non-UTF-8 target encoding, without rules. Invalid UTF-8 in the input will likely result in truncated/thrashed output sent to the format.
  • Normal format, with rules and using an internal encoding or target encoding (both imply/require UTF-8 input, and unspecified target encoding means [I'm pretty sure] we'll encode back to UTF-8 after rules). The rules engine will fully support non-ASCII as long as the input characters are included in the chosen codepage. The codepage doesn't have to be "correct" in any other sense. Invalid UTF-8 in the input will likely result in truncated/thrashed output sent to the format.
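One way to picture the reporting logic for most of the cases listed above is a small decision function. The Python sketch below is purely hypothetical: the function name, its parameters and the wording for the raw-with-Unicode-format case are assumptions, and only the other message strings come from this thread. It does not cover the "legacy codepage for input" case.

```python
from typing import List, Optional

RULES_NOTE = "Note: Rules will not fully support non-ASCII characters"

def encoding_notes(
    unicode_format: bool,
    raw: bool = False,
    internal_cp: Optional[str] = None,
    target_enc: Optional[str] = None,
    rules: bool = False,
) -> List[str]:
    """Return the status lines a run might print (hypothetical sketch)."""
    notes: List[str] = []
    if raw:
        # Per the thread, -enc=raw disables any internal codepage.
        internal_cp = None
    converts = unicode_format or internal_cp is not None or target_enc is not None
    if unicode_format and raw:
        # Assumed wording; the thread only says this "should be said clearly".
        notes.append("Input encoding: raw (treated as ISO-8859-1)")
    elif converts:
        notes.append("Using default input encoding: UTF-8")
    else:
        notes.append("Expecting input encoding to match target encoding")
    # Rules are ASCII-only unless an internal or target encoding is in use.
    if rules and internal_cp is None and target_enc is None:
        notes.append(RULES_NOTE)
    return notes

for line in encoding_notes(unicode_format=False, rules=True):
    print(line)
```

Keeping the rules note separate from the input-encoding line means each of the listed situations maps to at most two short lines of output.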

"UTF-16 format" above also applies to UTF-32 formats, in case we have any. I can't recall that we do.

I'm not entirely sure I even listed half of all possibilities, lol
