Scanning issue with line endings #102

deadmoose · 2025-02-27T18:38:58Z

TLDR: We've discovered several files where we get inconsistent scan results due to platform-local line endings (e.g. a match is found from mac/linux but NOT windows).

Generated wfps are identical cross-platform with the exception of the initial md5sum + filesize. The server appears to find a match based ONLY on that checksum and not the actual content.

Sample file

We've got a simple 15-line example file, which is also available here:

// Copyright 1998-2018 Epic Games, Inc. All Rights Reserved.

#pragma once

#include "CoreMinimal.h"
#include "Modules/ModuleManager.h"

class FMirrorAnimationSystemModule : public IModuleInterface
{
public:

  /** IModuleInterface implementation */
  virtual void StartupModule() override;
  virtual void ShutdownModule() override;
};

Mostly-identical Fingerprints

Since winnowing normalizes away everything but [a-zA-Z0-9] and maintains its own line-numbers based purely on 0x0a, the meat of the file remains the same.

On mac/linux, it's a 333 byte file, and the generated wfp:

file=b2827a7597ac0be40a11181b65a7d53e,333,MirrorAnimationSystem.h
6=09d35107,9d9a1d1a
8=3b74cd81
13=501782c6
14=f30318b8

Depending on how exactly the file gets to a Windows machine, the line endings may get altered (e.g. by doing a git checkout).
As expected, if they get changed to CRLF, the 15-line file gains an additional 15 bytes, and the generated wfp becomes:

file=7ebec0f57e4a25d3418460458a7780e8,348,MirrorAnimationSystem.h
6=09d35107,9d9a1d1a
8=3b74cd81
13=501782c6
14=f30318b8
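
To illustrate why only the header changes, here is a minimal Python sketch. It uses a hypothetical three-line sample (not the file from this issue), and `normalize` is a simplified stand-in for the winnowing normalization described above, not the actual SCANOSS implementation.

```python
import hashlib
import re


def to_crlf(data: bytes) -> bytes:
    """Convert LF line endings to CRLF (input is assumed to contain no CR)."""
    return data.replace(b"\n", b"\r\n")


def normalize(data: bytes) -> bytes:
    """Simplified stand-in for winnowing normalization: keep only [a-zA-Z0-9]."""
    return re.sub(rb"[^a-zA-Z0-9]", b"", data)


# Hypothetical three-line sample, not the file from this issue.
lf = b'#pragma once\n#include "CoreMinimal.h"\nclass Foo {};\n'
crlf = to_crlf(lf)

# One extra byte per line ending.
size_delta = len(crlf) - len(lf)
# The whole-file checksum changes with the line endings...
md5_differs = hashlib.md5(lf).hexdigest() != hashlib.md5(crlf).hexdigest()
# ...but the normalized content and the 0x0a-based line count do not.
same_normalized = normalize(lf) == normalize(crlf)
same_line_count = lf.count(b"\n") == crlf.count(b"\n")

print(size_delta, md5_differs, same_normalized, same_line_count)
```

The same reasoning accounts for the 333 vs. 348 byte sizes above: 15 line endings, one extra byte each.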

Different results

Hitting the scan service with those two wfps gets different results:

dhoover@shoggoth:~/line_ending_woe$ curl -X POST https://api.osskb.org/api/scan/direct -H "Content-Type: multipart/form-data" -F "[email protected]"
{"MirrorAnimationSystem.h": [{"id": "none","server": {"version": "5.4.9","kb_version": {"monthly":"25.02", "daily":"25.02.26"}}}]}

dhoover@shoggoth:~/line_ending_woe$ curl -X POST https://api.osskb.org/api/scan/direct -H "Content-Type: multipart/form-data" -F "[email protected]"
{"MirrorAnimationSystem.h": [{"id": "file","lines": "all","oss_lines": "all","matched": "100%","file_hash": "b2827a7597ac0be40a11181b65a7d53e","source_hash": "b2827a7597ac0be40a11181b65a7d53e","file_url": "https://api.osskb.org/file_contents/b2827a7597ac0be40a11181b65a7d53e","purl": ["pkg:github/rexocrates/mirror_animation_system"],"vendor": "rexocrates","component": "mirror_animation_system","version": "f378ca9","latest": "f378ca9","url": "https://github.com/rexocrates/mirror_animation_system","status": "pending","release_date": "2021-08-09","file": "MirrorAnimationSystem/Source/MirrorAnimationSystem/Public/MirrorAnimationSystem.h","url_hash": "18e1778c5a4f2686ee7619a874c81944","licenses": [{"name": "MIT","patent_hints": "no", "copyleft": "no", "checklist_url": "https://www.osadl.org/fileadmin/checklists/unreflicenses/MIT.txt","osadl_updated": "2024-11-29T15:09:00+0000","source": "component_declared","url": "https://spdx.org/licenses/MIT.html"},{"name": "MIT","patent_hints": "no", "copyleft": "no", "checklist_url": "https://www.osadl.org/fileadmin/checklists/unreflicenses/MIT.txt","osadl_updated": "2024-11-29T15:09:00+0000","source": "license_file","url": "https://spdx.org/licenses/MIT.html"}],"server": {"version": "5.4.9","kb_version": {"monthly":"25.02", "daily":"25.02.26"}}}]}

As implied, it's a 100% match, but it's discovering that ONLY based on the md5sum, not the other winnowed bits. For instance, adding a single carriage return changes the md5sum of the file but NOT the winnowed bits:

dhoover@shoggoth:~/line_ending_woe$ cat linux-plus-cr.wfp
file=794e5778bfab77dbb2469a4389cde0b7,334,MirrorAnimationSystem.h
6=09d35107,9d9a1d1a
8=3b74cd81
13=501782c6
14=f30318b8

dhoover@shoggoth:~/line_ending_woe$ curl -X POST https://api.osskb.org/api/scan/direct -H "Content-Type: multipart/form-data" -F "file=@linux-plus-cr.wfp"
{"MirrorAnimationSystem.h": [{"id": "none","server": {"version": "5.4.9","kb_version": {"monthly":"25.02", "daily":"25.02.26"}}}]}
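
For anyone who wants to check this mechanically, here is a rough Python sketch that parses the three wfps shown above and compares them. The `parse_wfp` helper is hypothetical and only handles the simple single-file layout seen in this issue (a `file=` header followed by `line=hash,hash` entries); it is not part of any SCANOSS tooling.

```python
def parse_wfp(text: str):
    """Split a single-file wfp into its header fields and fingerprint lines."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0]
    if not header.startswith("file="):
        raise ValueError("expected a 'file=' header line")
    md5, size, name = header[len("file="):].split(",", 2)
    fingerprints = {}
    for ln in lines[1:]:
        line_no, hashes = ln.split("=", 1)
        fingerprints[int(line_no)] = hashes.split(",")
    return {"md5": md5, "size": int(size), "name": name}, fingerprints


# Fingerprint body shared by all three wfps in this issue.
BODY = """6=09d35107,9d9a1d1a
8=3b74cd81
13=501782c6
14=f30318b8
"""

LINUX = "file=b2827a7597ac0be40a11181b65a7d53e,333,MirrorAnimationSystem.h\n" + BODY
WINDOWS = "file=7ebec0f57e4a25d3418460458a7780e8,348,MirrorAnimationSystem.h\n" + BODY
PLUS_CR = "file=794e5778bfab77dbb2469a4389cde0b7,334,MirrorAnimationSystem.h\n" + BODY

header_lf, fp_lf = parse_wfp(LINUX)
header_win, fp_win = parse_wfp(WINDOWS)
header_cr, fp_cr = parse_wfp(PLUS_CR)

# Only the header (md5 + size) differs; the winnowed fingerprints are identical.
print(fp_lf == fp_win == fp_cr)
print({header_lf["md5"], header_win["md5"], header_cr["md5"]})
```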

DIFFERENT Matches Based On Line Endings

Closely related, it's possible for files to get different matches based on this.

To simulate this, I grabbed a file and tested it two ways: verbatim from upstream, and with a carriage return added, so the fingerprints are identical but the md5sum is NOT.

E.g. this file:

dhoover@shoggoth:~/line_ending_woe$ diff -u1 aksslider-*.wfp
--- aksslider-addedline.wfp     2025-02-27 10:21:44.540732673 -0800
+++ aksslider-intact.wfp        2025-02-27 10:21:10.540556654 -0800
@@ -1,2 +1,2 @@
-file=0641726468cbb89d10c66a93bd3f268c,8953,AkSSlider.cpp
+file=9912ad779e55aeac49f7f642e2c3a3c4,8951,AkSSlider.cpp
 4=da0762b5

dhoover@shoggoth:~/line_ending_woe$ curl -X POST https://api.osskb.org/api/scan/direct -H "Content-Type: multipart/form-data" -F "file=@aksslider-intact.wfp"
{"AkSSlider.cpp": [{"id": "file","lines": "all","oss_lines": "all","matched": "100%","file_hash": "9912ad779e55aeac49f7f642e2c3a3c4","source_hash": "9912ad779e55aeac49f7f642e2c3a3c4","file_url": "https://api.osskb.org/file_contents/9912ad779e55aeac49f7f642e2c3a3c4","purl": ["pkg:github/cime-art/projetbroom"],"vendor": "Cime-Art","component": "ProjetBroom","version": "0.1","latest": "0.1","url": "https://github.com/Cime-Art/ProjetBroom","status": "pending","release_date": "2021-04-01","file": "ProjetBroom-0.1/Plugins/Wwise/Source/AkAudio/Private/AkWaapiSlate/Widgets/Input/AkSSlider.cpp","url_hash": "33006fb61fe332c61c26c4d1e63824df","licenses": [],"server": {"version": "5.4.9","kb_version": {"monthly":"25.02", "daily":"25.02.26"}}}]}

dhoover@shoggoth:~/line_ending_woe$ curl -X POST https://api.osskb.org/api/scan/direct -H "Content-Type: multipart/form-data" -F "file=@aksslider-addedline.wfp"
{"AkSSlider.cpp": [{"id": "snippet","lines": "5-161,176-272,292-302","oss_lines": "18-174,189-285,305-315","matched": "86%","file_hash": "06ed3b5547504d5ea38a215e4ee293a5","source_hash": "0641726468cbb89d10c66a93bd3f268c","file_url": "https://api.osskb.org/file_contents/06ed3b5547504d5ea38a215e4ee293a5","purl": ["pkg:github/medallyon/backtrace-game-jam"],"vendor": "Medallyon","component": "backtrace-game-jam","version": "1","latest": "1","url": "https://github.com/Medallyon/backtrace-game-jam","status": "pending","release_date": "2021-12-04","file": "Plugins/Wwise/Source/AkAudio/Private/AkWaapiSlate/Widgets/Input/AkSSlider.cpp","url_hash": "b17016741edbc4594cc0e701ef62a416","licenses": [{"name": "LicenseRef-scancode-commercial-license","source": "scancode"},{"name": "LicenseRef-scancode-proprietary-license","source": "scancode"}],"server": {"version": "5.4.9","kb_version": {"monthly":"25.02", "daily":"25.02.26"}}}]}
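
The difference is easier to see with the responses boiled down to match type, percentage, and purl. Here is a small Python sketch that does that. The `summarize` helper is hypothetical, and the JSON strings are abridged from the responses above (only the `id`, `matched`, and `purl` fields are kept).

```python
import json


def summarize(response_text: str, filename: str):
    """Return (match id, matched percentage, purls) for one scanned file."""
    result = json.loads(response_text)[filename][0]
    return result["id"], result.get("matched"), tuple(result.get("purl", []))


# Abridged copies of the two AkSSlider.cpp responses above.
INTACT = (
    '{"AkSSlider.cpp": [{"id": "file", "matched": "100%", '
    '"purl": ["pkg:github/cime-art/projetbroom"]}]}'
)
ADDED_LINE = (
    '{"AkSSlider.cpp": [{"id": "snippet", "matched": "86%", '
    '"purl": ["pkg:github/medallyon/backtrace-game-jam"]}]}'
)
# Abridged copy of a "no match" response, as seen earlier for the CRLF wfp.
NO_MATCH = '{"MirrorAnimationSystem.h": [{"id": "none"}]}'

intact = summarize(INTACT, "AkSSlider.cpp")
added = summarize(ADDED_LINE, "AkSSlider.cpp")
missing = summarize(NO_MATCH, "MirrorAnimationSystem.h")

print(intact)
print(added)
print(missing)
```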

What to do?

As a result, our Windows-based developers get very different results than anyone else and/or our CI system:

- Some files are missing completely
- Some files match against different packages

Missing files mean there's no UI to do anything about them with something like code compare. Different matches do show up in the UI, but anything a developer does based on them winds up being ignored elsewhere, since the purl is completely different from what other developers and the CI system see.

It would be possible to tell everything to not do md5-based whole-file checking, but we'd like to keep that enabled; these files are above the default minimum size threshold for scanning (256 bytes) and clearly are legitimate matches.
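
One client-side idea, sketched below in Python, would be to compute the header md5 and size over LF-normalized content so every platform sends the same `file=` line. This is only an assumption about what might help: `wfp_header` and its `normalize_eol` flag are hypothetical, not an existing scanner option, and it would only match if the knowledge base indexed the LF variant of the file.

```python
import hashlib


def wfp_header(data: bytes, name: str, normalize_eol: bool = False) -> str:
    """Build a 'file=<md5>,<size>,<name>' line, optionally over LF-normalized bytes."""
    if normalize_eol:
        # Hypothetical normalization step: collapse CRLF to LF before hashing.
        data = data.replace(b"\r\n", b"\n")
    return f"file={hashlib.md5(data).hexdigest()},{len(data)},{name}"


# Hypothetical two-line sample in both line-ending flavours.
lf = b"#pragma once\nclass Foo {};\n"
crlf = lf.replace(b"\n", b"\r\n")

raw_lf = wfp_header(lf, "Foo.h")
raw_crlf = wfp_header(crlf, "Foo.h")
norm_crlf = wfp_header(crlf, "Foo.h", normalize_eol=True)

print(raw_lf == raw_crlf)   # headers differ without normalization
print(raw_lf == norm_crlf)  # headers agree once CRLF is collapsed to LF
```

The obvious downside is that a file published upstream with CRLF endings would then stop getting a whole-file match, so this trades one inconsistency for another unless the server side normalizes as well.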

ortizjeronimo self-assigned this Feb 28, 2025
