Skip to content

Handle misencoded license text files graceful. #884

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Apr 24, 2025

Conversation

schlenk
Copy link
Contributor

@schlenk schlenk commented Apr 23, 2025

This handles #868

If a license file is not decodeable as UTF-8, it uses the fallback to io2str using chardet to determine the encoding will be tried.

It also adds some tests to exercise the code:

  • A UTF16 encoded file. chardet fails to detect it as UTF-16, so it gets skipped.
  • A RTF file. This is technically text/rtf (or application/rtf, but misdetected as application/msword...). It is technically UTF-8 and gets properly encoded.
  • A plain text UTF-8 file. This gets properly encoded (but maybe the XML encoding should have xml:space="preserve" set, to properly protect newlines).

@schlenk schlenk requested a review from a team as a code owner April 23, 2025 11:16
@jkowalleck jkowalleck added the enhancement New feature or request label Apr 23, 2025
@jkowalleck jkowalleck linked an issue Apr 23, 2025 that may be closed by this pull request
@schlenk schlenk changed the title feat: Handle misencoded license text files graceful. Handle misencoded license text files graceful. Apr 23, 2025
Copy link
Member

@jkowalleck jkowalleck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the current implementation seams utterly complex.
please refactor it.

Michael Schlenker added 4 commits April 24, 2025 10:17
Signed-off-by: Michael Schlenker <[email protected]>
Signed-off-by: Michael Schlenker <[email protected]>
Signed-off-by: Michael Schlenker <[email protected]>
Signed-off-by: Michael Schlenker <[email protected]>
@schlenk
Copy link
Contributor Author

schlenk commented Apr 24, 2025

the current implementation seems utterly complex. please refactor it.

A good chunk of the complexity stems from the original code in pep639.py, which i shifted around a little.
I could change some of those, e.g. kick out the base64 encoding path and the looking into multiple locations loop, as only 'licenses' is mentioned in the PEP, if that would help.

Signed-off-by: Jan Kowalleck <[email protected]>
@jkowalleck
Copy link
Member

@schlenk, I've rearranged the tests via d95dc73

Signed-off-by: Jan Kowalleck <[email protected]>
Signed-off-by: Jan Kowalleck <[email protected]>
@schlenk
Copy link
Contributor Author

schlenk commented Apr 24, 2025

@schlenk, I've rearranged the tests via d95dc73

Thanks. Looks much cleaner. I found it somewhat hard to figure out how to add good tests in the first place. Is there some documentation about the test concept/how to add good tests? As there is already quite a bit of scaffolding and infrastructure present, but not entirely obvious at first.

@jkowalleck
Copy link
Member

jkowalleck commented Apr 24, 2025

@schlenk , after revieweing, I propose the following simplification: schlenk#1

If you agree, just merge the pullrequest, and it will apply upstream

fix: try to detect licensetexts, or pass silently
@schlenk schlenk requested a review from jkowalleck April 24, 2025 12:10
@jkowalleck
Copy link
Member

Something is off on windows.
not gonna lie, i've expected such a thing, since windows often auto-converts line endings - which causes unexpected situations

@jkowalleck
Copy link
Member

Something is off on windows. not gonna lie, i've expected such a thing, since windows often auto-converts line endings - which causes unexpected situations

will try to fix this

Signed-off-by: Michael Schlenker <[email protected]>
@jkowalleck jkowalleck merged commit cc25975 into CycloneDX:main Apr 24, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

license file gathering for non-utf8 files
2 participants