Scanning issue with line endings #102

Open
deadmoose opened this issue Feb 27, 2025 · 0 comments

TLDR: We've discovered several files where we get inconsistent scan results due to platform-local line endings (e.g. a match is found from mac/linux but NOT windows).

Generated wfps are identical cross-platform with the exception of the initial md5sum + filesize. The server appears to find a match based ONLY on that checksum and not the actual content.

Sample file

We've got a simple 15-line example file, which is also available here:

// Copyright 1998-2018 Epic Games, Inc. All Rights Reserved.

#pragma once

#include "CoreMinimal.h"
#include "Modules/ModuleManager.h"

class FMirrorAnimationSystemModule : public IModuleInterface
{
public:

  /** IModuleInterface implementation */
  virtual void StartupModule() override;
  virtual void ShutdownModule() override;
};

Mostly-identical Fingerprints

Since winnowing normalizes away everything but [a-zA-Z0-9] and maintains its own line-numbers based purely on 0x0a, the meat of the file remains the same.
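
As a rough illustration of why the fingerprint lines survive a line-ending change, here's a minimal Python sketch. It is not the actual winnowing code, only the normalization step as described above; `normalize` and `line_count` are hypothetical helper names and the sample bytes are a made-up fragment.

```python
import re


def normalize(data: bytes) -> bytes:
    """Drop every byte that is not in [a-zA-Z0-9] (simplified stand-in)."""
    return re.sub(rb"[^a-zA-Z0-9]", b"", data)


def line_count(data: bytes) -> int:
    """Count lines purely by 0x0a bytes; 0x0d never starts a new line."""
    return data.count(b"\n")


# The same fragment with LF and with CRLF line endings.
lf = b'#pragma once\n\n#include "CoreMinimal.h"\n'
crlf = lf.replace(b"\n", b"\r\n")

same_content = normalize(lf) == normalize(crlf)
same_lines = line_count(lf) == line_count(crlf)
print(same_content, same_lines)
```

Since 0x0d is neither alphanumeric nor 0x0a, it drops out of both the normalized content and the line numbering.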

On mac/linux, it's a 333 byte file, and the generated wfp:

file=b2827a7597ac0be40a11181b65a7d53e,333,MirrorAnimationSystem.h
6=09d35107,9d9a1d1a
8=3b74cd81
13=501782c6
14=f30318b8

Depending on how exactly it gets to a windows machine, the line endings get altered (e.g. by doing a git checkout).
As expected, if those get changed to CRLF, the 15-line file gains an additional 15 bytes (333 + 15 = 348), and the generated wfp becomes:

file=7ebec0f57e4a25d3418460458a7780e8,348,MirrorAnimationSystem.h
6=09d35107,9d9a1d1a
8=3b74cd81
13=501782c6
14=f30318b8
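
The header change can be reproduced in a few lines of Python. This is only a sketch: `header_fields` is a hypothetical helper, and the sample bytes are an abbreviated stand-in, NOT the real 333-byte file, so the digests it prints won't be the ones shown above.

```python
import hashlib


def header_fields(data: bytes) -> tuple[str, int]:
    """Return the (md5 hex digest, size in bytes) pair that leads a file= line."""
    return hashlib.md5(data).hexdigest(), len(data)


# Abbreviated stand-in for the sample file, not its real content.
lf = b"#pragma once\n\nclass Example\n{\n};\n"
crlf = lf.replace(b"\n", b"\r\n")

lf_md5, lf_size = header_fields(lf)
crlf_md5, crlf_size = header_fields(crlf)

# Each LF gains exactly one CR, so the size grows by the number of lines.
print(crlf_size - lf_size == lf.count(b"\n"))
# The whole-file checksum no longer agrees.
print(lf_md5 != crlf_md5)
```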

Different results

Hitting the scan service with those two wfps gets different results:

dhoover@shoggoth:~/line_ending_woe$ curl -X POST https://api.osskb.org/api/scan/direct -H "Content-Type: multipart/form-data" -F "[email protected]"
{"MirrorAnimationSystem.h": [{"id": "none","server": {"version": "5.4.9","kb_version": {"monthly":"25.02", "daily":"25.02.26"}}}]}

dhoover@shoggoth:~/line_ending_woe$ curl -X POST https://api.osskb.org/api/scan/direct -H "Content-Type: multipart/form-data" -F "[email protected]"
{"MirrorAnimationSystem.h": [{"id": "file","lines": "all","oss_lines": "all","matched": "100%","file_hash": "b2827a7597ac0be40a11181b65a7d53e","source_hash": "b2827a7597ac0be40a11181b65a7d53e","file_url": "https://api.osskb.org/file_contents/b2827a7597ac0be40a11181b65a7d53e","purl": ["pkg:github/rexocrates/mirror_animation_system"],"vendor": "rexocrates","component": "mirror_animation_system","version": "f378ca9","latest": "f378ca9","url": "https://github.com/rexocrates/mirror_animation_system","status": "pending","release_date": "2021-08-09","file": "MirrorAnimationSystem/Source/MirrorAnimationSystem/Public/MirrorAnimationSystem.h","url_hash": "18e1778c5a4f2686ee7619a874c81944","licenses": [{"name": "MIT","patent_hints": "no", "copyleft": "no", "checklist_url": "https://www.osadl.org/fileadmin/checklists/unreflicenses/MIT.txt","osadl_updated": "2024-11-29T15:09:00+0000","source": "component_declared","url": "https://spdx.org/licenses/MIT.html"},{"name": "MIT","patent_hints": "no", "copyleft": "no", "checklist_url": "https://www.osadl.org/fileadmin/checklists/unreflicenses/MIT.txt","osadl_updated": "2024-11-29T15:09:00+0000","source": "license_file","url": "https://spdx.org/licenses/MIT.html"}],"server": {"version": "5.4.9","kb_version": {"monthly":"25.02", "daily":"25.02.26"}}}]}

As implied, it's a 100% match, but it's discovering that ONLY based on the md5sum, not the other winnowed bits. For instance, adding a single carriage return changes the md5sum of the file but NOT the winnowed bits:

dhoover@shoggoth:~/line_ending_woe$ cat linux-plus-cr.wfp
file=794e5778bfab77dbb2469a4389cde0b7,334,MirrorAnimationSystem.h
6=09d35107,9d9a1d1a
8=3b74cd81
13=501782c6
14=f30318b8

dhoover@shoggoth:~/line_ending_woe$ curl -X POST https://api.osskb.org/api/scan/direct -H "Content-Type: multipart/form-data" -F "file=@linux-plus-cr.wfp"
{"MirrorAnimationSystem.h": [{"id": "none","server": {"version": "5.4.9","kb_version": {"monthly":"25.02", "daily":"25.02.26"}}}]}
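
To double-check that the only difference between two wfps is the file= header, something like the following Python sketch works. `parse_wfp` is a hypothetical helper that assumes a single-file wfp laid out exactly like the ones pasted above; the two inputs are the mac/linux wfp and linux-plus-cr.wfp from this issue.

```python
def parse_wfp(text: str):
    """Split one wfp entry into its file= header and a line -> hashes map."""
    header = None
    fingerprints = {}
    for raw in text.strip().splitlines():
        key, _, value = raw.strip().partition("=")
        if key == "file":
            md5, size, name = value.split(",", 2)
            header = (md5, int(size), name)
        else:
            fingerprints[int(key)] = value.split(",")
    return header, fingerprints


LINUX = """
file=b2827a7597ac0be40a11181b65a7d53e,333,MirrorAnimationSystem.h
6=09d35107,9d9a1d1a
8=3b74cd81
13=501782c6
14=f30318b8
"""

PLUS_CR = """
file=794e5778bfab77dbb2469a4389cde0b7,334,MirrorAnimationSystem.h
6=09d35107,9d9a1d1a
8=3b74cd81
13=501782c6
14=f30318b8
"""

linux_header, linux_fps = parse_wfp(LINUX)
cr_header, cr_fps = parse_wfp(PLUS_CR)

print("fingerprints identical:", linux_fps == cr_fps)
print("md5 identical:", linux_header[0] == cr_header[0])
print("size delta:", cr_header[1] - linux_header[1])
```

For these two inputs the fingerprints compare equal while the md5 differs and the size is off by one byte, which is exactly the case the server answers with "none".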

DIFFERENT Matches Based On Line Endings

Closely related, it's possible for files to get different matches based on this.

To simulate this, I grabbed a file and tested it both verbatim from upstream and with a carriage return added, so the fingerprints are identical but the md5sum is NOT.

E.g. this file:

dhoover@shoggoth:~/line_ending_woe$ diff -u1 aksslider-*.wfp
--- aksslider-addedline.wfp     2025-02-27 10:21:44.540732673 -0800
+++ aksslider-intact.wfp        2025-02-27 10:21:10.540556654 -0800
@@ -1,2 +1,2 @@
-file=0641726468cbb89d10c66a93bd3f268c,8953,AkSSlider.cpp
+file=9912ad779e55aeac49f7f642e2c3a3c4,8951,AkSSlider.cpp
 4=da0762b5

dhoover@shoggoth:~/line_ending_woe$ curl -X POST https://api.osskb.org/api/scan/direct -H "Content-Type: multipart/form-data" -F "file=@aksslider-intact.wfp"
{"AkSSlider.cpp": [{"id": "file","lines": "all","oss_lines": "all","matched": "100%","file_hash": "9912ad779e55aeac49f7f642e2c3a3c4","source_hash": "9912ad779e55aeac49f7f642e2c3a3c4","file_url": "https://api.osskb.org/file_contents/9912ad779e55aeac49f7f642e2c3a3c4","purl": ["pkg:github/cime-art/projetbroom"],"vendor": "Cime-Art","component": "ProjetBroom","version": "0.1","latest": "0.1","url": "https://github.com/Cime-Art/ProjetBroom","status": "pending","release_date": "2021-04-01","file": "ProjetBroom-0.1/Plugins/Wwise/Source/AkAudio/Private/AkWaapiSlate/Widgets/Input/AkSSlider.cpp","url_hash": "33006fb61fe332c61c26c4d1e63824df","licenses": [],"server": {"version": "5.4.9","kb_version": {"monthly":"25.02", "daily":"25.02.26"}}}]}

dhoover@shoggoth:~/line_ending_woe$ curl -X POST https://api.osskb.org/api/scan/direct -H "Content-Type: multipart/form-data" -F "file=@aksslider-addedline.wfp"
{"AkSSlider.cpp": [{"id": "snippet","lines": "5-161,176-272,292-302","oss_lines": "18-174,189-285,305-315","matched": "86%","file_hash": "06ed3b5547504d5ea38a215e4ee293a5","source_hash": "0641726468cbb89d10c66a93bd3f268c","file_url": "https://api.osskb.org/file_contents/06ed3b5547504d5ea38a215e4ee293a5","purl": ["pkg:github/medallyon/backtrace-game-jam"],"vendor": "Medallyon","component": "backtrace-game-jam","version": "1","latest": "1","url": "https://github.com/Medallyon/backtrace-game-jam","status": "pending","release_date": "2021-12-04","file": "Plugins/Wwise/Source/AkAudio/Private/AkWaapiSlate/Widgets/Input/AkSSlider.cpp","url_hash": "b17016741edbc4594cc0e701ef62a416","licenses": [{"name": "LicenseRef-scancode-commercial-license","source": "scancode"},{"name": "LicenseRef-scancode-proprietary-license","source": "scancode"}],"server": {"version": "5.4.9","kb_version": {"monthly":"25.02", "daily":"25.02.26"}}}]}

What to do?

As a result, our Windows-based developers get very different results from everyone else and/or our CI system:

  • Some files are missing completely
  • Some files match against different packages

Missing files mean there's no UI to do anything about them with something like code compare. Different matches do show up in the UI, but anything a developer does based on them winds up being ignored elsewhere, since the purl is completely different from what other developers/the CI system see.

It would be possible to tell everything to not do md5-based whole-file checking, but we'd like to keep that enabled; these files are above the default minimum size threshold for scanning (256 bytes) and clearly are legitimate matches.

@ortizjeronimo ortizjeronimo self-assigned this Feb 28, 2025