[SP-2587] Add directory simhash, modify concatenated names to remove extensions #119
base: main
Walkthrough
The updates modify type hints and filtering logic in file filters, and enhance the folder hashing process with additional simhash calculations for directories and file names. Method signatures are updated for clarity and stricter typing, and the hashing methods now track and return more granular directory and file name uniqueness information. Protobuf and gRPC generated files are updated with runtime version checks and schema enhancements, including new fields and improved compatibility enforcement.
Sequence Diagram(s)
sequenceDiagram
participant User
participant FolderHasher
participant DirectoryNode
User->>FolderHasher: Initialize with scan_dir and config
FolderHasher->>DirectoryNode: Traverse directory tree
FolderHasher->>FolderHasher: _hash_calc(node)
FolderHasher->>FolderHasher: Collect unique file and dir names
FolderHasher->>FolderHasher: Compute simhashes (content, name, dir)
FolderHasher->>User: Return hash dictionary with content_hash, name_hash, dir_hash
Actionable comments posted: 3
🔭 Outside diff range comments (1)
src/scanoss/file_filters.py (1)
514-521: 💡 Verification agent 🧩 Analysis chain
Update the doc-string & type handling for `scan_root`
The parameter was changed to `Optional[str]`, but the doc-string (lines 518-521) and the run-time handling (line 530) still assume a non-None `str`. Calling `relative_to()` with a `None` value will raise, so an early-return or guard clause is advisable, and the docs should be synchronised.
@@
- scan_root (str): Root directory to scan or fingerprint
+ scan_root (Optional[str]): Root directory to scan or fingerprint.
+ When None the paths are returned as absolute paths.
🏁 Script executed:
#!/bin/bash
# Locate the file_filters.py file
file=$(fd file_filters.py -t f)
echo "Inspecting: $file"
# Print lines around the doc-string and the runtime handling of scan_root
sed -n '500,580p' "$file"
Length of output: 406
🏁 Script executed:
#!/bin/bash
# Locate the src/scanoss/file_filters.py file specifically
file=$(fd file_filters.py -t f | grep 'src/scanoss/file_filters.py')
echo "Inspecting: $file"
# Print lines around the doc-string and the runtime handling of scan_root
sed -n '500,580p' "$file"
Length of output: 3643
Update doc-string for `scan_root` in `get_filtered_files_from_files`
The `scan_root` parameter is now `Optional[str]`, so the doc-string should reflect the new type and describe the None-case behavior.
• File: src/scanoss/file_filters.py
• Method: get_filtered_files_from_files
Suggested diff:
  Args:
      files (List[str]): List of files to scan or fingerprint
-     scan_root (str): Root directory to scan or fingerprint
+     scan_root (Optional[str]): Root directory to scan or fingerprint.
+         When None, returned paths will be absolute.
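The guard clause suggested above could look roughly like the following. This is a minimal sketch, not the project's code: `to_output_paths` is a hypothetical helper, and the "absolute paths when `scan_root` is None" behavior is taken from the suggested doc-string rather than from the actual implementation.

```python
from pathlib import Path
from typing import List, Optional


def to_output_paths(files: List[str], scan_root: Optional[str] = None) -> List[str]:
    """Hypothetical helper: relative paths when scan_root is set, absolute otherwise."""
    results = []
    for name in files:
        path = Path(name).resolve()
        if scan_root is None:
            # Guard clause: there is no root to be relative to, keep the absolute path
            results.append(str(path))
            continue
        # relative_to() raises ValueError if `path` is not under the root
        results.append(str(path.relative_to(Path(scan_root).resolve())))
    return results
```

Handling the `None` case before calling `relative_to()` keeps the optional parameter and the doc-string in agreement.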
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/scanoss/file_filters.py (3 hunks)
src/scanoss/scanners/folder_hasher.py (4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/scanoss/scanners/folder_hasher.py (5)
src/scanoss/utils/simhash.py (4)
- simhash (125-130)
- WordFeatureSet (163-169)
- fingerprint (100-109)
- vectorize_bytes (84-97)
src/scanoss/scanners/scanner_hfh.py (1)
- present (129-131)
src/scanoss/results.py (1)
- present (273-275)
src/scanoss/utils/abstract_presenter.py (1)
- present (28-55)
src/scanoss/scanners/container_scanner.py (1)
- present (381-383)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: build
🔇 Additional comments (2)
src/scanoss/scanners/folder_hasher.py (2)
80-82: Constructor signature change breaks backward compatibility
`config` is now mandatory (`FolderHasherConfig`, not `Optional`), yet callers in the tree (e.g. tests, CLI entry points) may still pass `None`. Consider providing a sensible default:
def __init__(self, scan_dir: str, config: Optional[FolderHasherConfig] = None, ...):
    config = config or FolderHasherConfig()
Without this, existing integrations will raise `TypeError`.
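A self-contained sketch of the default-config pattern suggested above. The `FolderHasherConfig` class here is only a stand-in with a hypothetical `debug` field, and the constructor omits whatever other parameters the real class takes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FolderHasherConfig:
    """Stand-in for the real config class; the `debug` field is hypothetical."""
    debug: bool = False


class FolderHasher:
    def __init__(self, scan_dir: str, config: Optional[FolderHasherConfig] = None):
        self.scan_dir = scan_dir
        # Fall back to defaults when the caller passes None or omits the argument
        self.config = config or FolderHasherConfig()
```

With this shape, `FolderHasher("path")`, `FolderHasher("path", None)` and `FolderHasher("path", FolderHasherConfig(debug=True))` all construct a usable object.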
256-270: Potential misuse of `vectorize_bytes` input
`file_hashes.append(file.key)` appends the entire CRC digest; `vectorize_bytes` will iterate through these 8-byte sequences as separate features, which is fine. If `key` accidentally becomes `List[bytes]` (see earlier issue) you will instead pass a list of lists, causing the FNV hash to fail.
Fixing the type annotation as suggested earlier avoids this failure path.
Actionable comments posted: 4
♻️ Duplicate comments (1)
src/scanoss/scanners/folder_hasher.py (1)
38-41: `key` parameter type should be `bytes`, not `List[bytes]`.
As noted in the previous review, `CRC64.get_hash_buff()` returns an 8-byte digest (bytes object), and the code expects a bytes object, not a list. The annotation is incorrect and can cause runtime errors if callers follow it and pass a list.
Apply this diff to fix the type annotation:
- def __init__(self, path: str, key: List[bytes], key_str: str):
+ def __init__(self, path: str, key: bytes, key_str: str):
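Because Python does not enforce annotations at run time, a wrong hint only surfaces later, when a list reaches code that expects bytes. The sketch below shows the corrected annotation together with an explicit check; `FileEntry` is a hypothetical stand-in for the class in the diff, and the digest is a placeholder value, not a real CRC64.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    """Hypothetical stand-in for the class whose __init__ appears in the diff."""
    path: str
    key: bytes    # raw 8-byte digest
    key_str: str  # hex form of the same digest

    def __post_init__(self):
        # Annotations are not checked at run time, so validate explicitly
        if not isinstance(self.key, (bytes, bytearray)):
            raise TypeError(f"key must be bytes, got {type(self.key).__name__}")


digest = (0x1122334455667788).to_bytes(8, "big")  # placeholder, not a real CRC64
entry = FileEntry("src/a.py", digest, digest.hex())
```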
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
src/scanoss/api/common/v2/scanoss_common_pb2.py (2 hunks)
src/scanoss/api/common/v2/scanoss_common_pb2_grpc.py (1 hunks)
src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py (2 hunks)
src/scanoss/api/scanning/v2/scanoss_scanning_pb2_grpc.py (5 hunks)
src/scanoss/scanners/folder_hasher.py (6 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
src/scanoss/api/common/v2/scanoss_common_pb2_grpc.py
4-4: `warnings` imported but unused. Remove unused import: `warnings` (F401)
20-20: f-string without any placeholders. Remove extraneous `f` prefix (F541)
src/scanoss/api/scanning/v2/scanoss_scanning_pb2_grpc.py
4-4: `warnings` imported but unused. Remove unused import: `warnings` (F401)
7-7: Line too long (122 > 120) (E501)
22-22: f-string without any placeholders. Remove extraneous `f` prefix (F541)
src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py
30-30: Line too long (2520 > 120) (E501)
37-37: Line too long (391 > 120) (E501)
49-49: Line too long (130 > 120) (E501)
51-51: Line too long (144 > 120) (E501)
🪛 Pylint (3.3.7)
src/scanoss/api/scanning/v2/scanoss_scanning_pb2_grpc.py
[error] 47-47: Module 'scanoss.api.scanning.v2.scanoss_scanning_pb2' has no 'HFHRequest' member
(E1101)
[error] 48-48: Module 'scanoss.api.scanning.v2.scanoss_scanning_pb2' has no 'HFHResponse' member
(E1101)
[error] 112-112: Module 'scanoss.api.common.v2.scanoss_common_pb2' has no 'EchoRequest' member
(E1101)
[error] 113-113: Module 'scanoss.api.common.v2.scanoss_common_pb2' has no 'EchoResponse' member
(E1101)
[error] 139-139: Module 'scanoss.api.scanning.v2.scanoss_scanning_pb2' has no 'HFHRequest' member
(E1101)
[error] 140-140: Module 'scanoss.api.scanning.v2.scanoss_scanning_pb2' has no 'HFHResponse' member
(E1101)
src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py
[error] 9-9: No name 'runtime_version' in module 'google.protobuf'
(E0611)
src/scanoss/api/common/v2/scanoss_common_pb2.py
[error] 9-9: No name 'runtime_version' in module 'google.protobuf'
(E0611)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: build
🔇 Additional comments (6)
src/scanoss/api/common/v2/scanoss_common_pb2.py (1)
1-50: Auto-generated protobuf code looks good.
The runtime version validation and code reorganization follow standard protobuf practices. The pylint warning about `runtime_version` can be safely ignored as it's available in protobuf 5.27.2.
src/scanoss/api/scanning/v2/scanoss_scanning_pb2_grpc.py (1)
44-49: Registered method flags are correctly added.
The addition of `_registered_method=True` to all gRPC method registrations and calls is correct and aligns with gRPC 1.67.0 requirements.
Also applies to: 88-88, 122-122, 149-149
src/scanoss/api/scanning/v2/scanoss_scanning_pb2.py (1)
1-67: Auto-generated protobuf schema updates look good.
The schema enhancements adding `sim_hash_dir_names` and `lang_extensions` fields align with the folder hasher implementation. Line length warnings for serialized descriptors can be safely ignored in auto-generated code.
src/scanoss/scanners/folder_hasher.py (3)
197-205: Good error handling for relative path calculation.
The implementation properly handles the case where `relative_to()` might fail with a ValueError, providing a sensible fallback.
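The pattern praised here can be sketched as follows. `safe_relative` is a hypothetical helper, and returning the original path unchanged is an assumed fallback; the review does not say which fallback the real code uses.

```python
from pathlib import Path


def safe_relative(path: str, root: str) -> str:
    """Hypothetical helper: path relative to root, or the path unchanged as a fallback."""
    try:
        return str(Path(path).relative_to(root))
    except ValueError:
        # pathlib raises ValueError when `path` is not located under `root`
        return path
```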
232-294: Enhanced hash calculation with directory and language tracking.
The implementation correctly:
- Tracks unique file names without extensions
- Collects unique directory names
- Counts file extensions by language
- Computes separate simhashes for files and directories
The logic is well-structured and aligns with the protobuf schema changes.
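The collection steps listed above could be sketched like this. It is an illustration only: `collect_names` is a hypothetical function, it counts raw extensions instead of mapping them to languages, and it stops before the simhash computation that the real code performs on these sets.

```python
import os
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def collect_names(paths: Iterable[str]) -> Tuple[Set[str], Set[str], Dict[str, int]]:
    """Gather unique file stems, unique parent directory names and extension counts."""
    file_names: Set[str] = set()
    dir_names: Set[str] = set()
    extensions: Counter = Counter()
    for path in paths:
        # File name without its extension, e.g. "src/a.py" -> "a"
        stem, ext = os.path.splitext(os.path.basename(path))
        file_names.add(stem)
        if ext:
            extensions[ext] += 1
        # Last component of the containing directory, skipping root-level files
        parent = os.path.basename(os.path.dirname(path))
        if parent and parent not in (".", ".."):
            dir_names.add(parent)
    return file_names, dir_names, dict(extensions)
```

Using sets means that a name appearing in several directories contributes only once to the name-based hashes, which matches the "unique" wording in the review.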
252-255: Consider validating directory extraction logic.
The logic extracts the last directory component but excludes cases where it equals the full path. Ensure this handles all edge cases correctly, especially for files in the root directory.
#!/bin/bash
# Description: Check how the directory extraction logic handles edge cases
# Create a test script to verify directory extraction behavior
cat > test_dir_extraction.py << 'EOF'
import os

test_cases = [
    "file.txt",                 # File in root
    "./file.txt",               # File in current dir
    "dir/file.txt",             # File in subdirectory
    "dir/subdir/file.txt",      # File in nested directory
    "/absolute/path/file.txt",  # Absolute path
]

for path in test_cases:
    current_directory = os.path.dirname(path)
    if current_directory and current_directory != '.':
        last_directory = os.path.basename(current_directory)
        valid = last_directory != current_directory and last_directory not in ['.', '..']
        print(f"Path: {path}")
        print(f"  Current dir: {current_directory}")
        print(f"  Last dir: {last_directory}")
        print(f"  Valid: {valid}")
    else:
        print(f"Path: {path} - Skipped (empty or current dir)")
    print()
EOF
python test_dir_extraction.py
rm test_dir_extraction.py
@@ -1,4 +1,24 @@
# Generated by the gRPC Python protocol compiler plugin. DO NOT EDIT!
"""Client and server classes corresponding to protobuf-defined services."""
import grpc
import warnings
Remove unused import.
The `warnings` module is imported but never used in the code.
Apply this diff to remove the unused import:
 import grpc
-import warnings
🧰 Tools
🪛 Ruff (0.11.9)
4-4: `warnings` imported but unused. Remove unused import: `warnings` (F401)
🤖 Prompt for AI Agents
In src/scanoss/api/common/v2/scanoss_common_pb2_grpc.py at line 4, the warnings
module is imported but not used anywhere in the file. Remove the line importing
warnings to clean up unused imports and improve code clarity.
if _version_not_supported:
    raise RuntimeError(
        f'The grpc package installed is at version {GRPC_VERSION},'
        + f' but the generated code in scanoss/api/common/v2/scanoss_common_pb2_grpc.py depends on'
Remove unnecessary f-string prefix.
Line 20 uses an f-string prefix without any placeholders.
Apply this diff to fix the string formatting:
- + f' but the generated code in scanoss/api/common/v2/scanoss_common_pb2_grpc.py depends on'
+ + ' but the generated code in scanoss/api/common/v2/scanoss_common_pb2_grpc.py depends on'
🧰 Tools
🪛 Ruff (0.11.9)
20-20: f-string without any placeholders. Remove extraneous `f` prefix (F541)
🤖 Prompt for AI Agents
In src/scanoss/api/common/v2/scanoss_common_pb2_grpc.py at line 20, remove the
unnecessary f-string prefix from the string since it contains no placeholders.
Change the line to a regular string literal without the leading 'f' to correct
the formatting.
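For readers unfamiliar with F541, the short sketch below shows why the prefix is redundant: an f-string with no `{}` placeholders evaluates to the same value as the plain literal. The version number and the shortened message text are placeholders for illustration, not the generated file's actual content.

```python
GRPC_VERSION = "1.60.0"  # placeholder value for illustration

message = (
    f'The grpc package installed is at version {GRPC_VERSION},'  # has a placeholder: f is needed
    + ' but the generated code depends on a newer release.'      # no placeholder: plain literal
)

# An f-string without placeholders is equal to the same plain string
same = f' but the generated code' == ' but the generated code'
```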
if _version_not_supported:
    raise RuntimeError(
        f'The grpc package installed is at version {GRPC_VERSION},'
        + f' but the generated code in scanoss/api/scanning/v2/scanoss_scanning_pb2_grpc.py depends on'
Remove unnecessary f-string prefix.
Line 22 uses an f-string prefix without any placeholders.
Apply this diff to fix the string formatting:
- + f' but the generated code in scanoss/api/scanning/v2/scanoss_scanning_pb2_grpc.py depends on'
+ + ' but the generated code in scanoss/api/scanning/v2/scanoss_scanning_pb2_grpc.py depends on'
🧰 Tools
🪛 Ruff (0.11.9)
22-22: f-string without any placeholders. Remove extraneous `f` prefix (F541)
🤖 Prompt for AI Agents
In src/scanoss/api/scanning/v2/scanoss_scanning_pb2_grpc.py at line 22, remove
the unnecessary f-string prefix from the string since it contains no
placeholders. Replace the f-string with a regular string literal to correct the
formatting.
@@ -1,10 +1,30 @@
# Generated by the gRPC Python protocol compiler plugin. DO NOT EDIT!
"""Client and server classes corresponding to protobuf-defined services."""
import grpc
import warnings
Remove unused import.
The `warnings` module is imported but never used in the code.
Apply this diff to remove the unused import:
 import grpc
-import warnings
🧰 Tools
🪛 Ruff (0.11.9)
4-4: `warnings` imported but unused. Remove unused import: `warnings` (F401)
🤖 Prompt for AI Agents
In src/scanoss/api/scanning/v2/scanoss_scanning_pb2_grpc.py at line 4, the
warnings module is imported but not used anywhere in the file. Remove the line
importing warnings to clean up unused imports.