Skip to content

[TIKA-XXXX] Refactor(core): Modularize Classes, Methods, and Associations for Clarity. #2171

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

syedamisbahh
Copy link

This pull request includes multiple code refactorings aimed at improving clarity, readability, and maintainability in the Apache Tika codebase. The changes preserve original functionality while making the code more expressive and modular.

Refactorings Applied

  1. Extract Method + Decompose Conditional
    Location: MediaTypeRegistry#getSupertype()
    Replaced deeply nested if-else blocks with helper methods like isXmlSubtype(), isTextType(), isEmptyType(), etc.
    Used early returns to simplify control flow and reduce cyclomatic complexity.

  2. Rename Variable
    Locations: MediaTypeRegistry.java, JsonPipesIterator.java
    Renamed variables to improve self-documentation:
    type → mediaType
    t → tuple
    r → reader

  3. Introduce Explaining Constants
    Location: TextStatistics#looksLikeUTF8()
    Replaced magic numbers (e.g. 0x20, 0x80, 0xc0) with named constants for better readability and understanding of UTF-8 byte range logic.

Note: I am awaiting access to the Apache Tika Jira issue tracker to file a formal issue.
Once granted access, I will:

  • Create the corresponding [TIKA-XXXX] issue.
  • Update this PR title and description to include the issue reference.

@syedamisbahh
Copy link
Author

I have also committed the below changes into the files:

  1. Extract Class: Inner class RankedService extracted into its own top-level class RankedService.java to adhere to the Single Responsibility Principle, reducing complexity in ServiceLoader.java.

  2. Move Method: Method collectServiceClassNames() moved from ServiceLoader.java to a new utility class ServiceResourceUtils.java, enhancing separation of concerns and enabling independent testing and reuse.

  3. Pull-up Method: Generic method getDetectors() moved from DefaultDetector to superclass CompositeDetector, centralizing detector aggregation logic and improving code reuse.

  4. Introduce Polymorphism: Replaced repetitive conditional checks in TikaConfig.java with a polymorphic strategy pattern utilizing a strategy map, simplifying the logic and facilitating easier future expansions.

@syedamisbahh syedamisbahh changed the title [TIKA-XXXX] Refactor core logic: extract methods, simplify conditionals, and clarify UTF-8 byte checks. [TIKA-XXXX] Refactor(core): Modularize Classes, Methods, and Associations for Clarity. Mar 30, 2025
@THausherr
Copy link
Contributor

Why did you delete some javadocs?
Is your PR related to a school / university assignment, or to training / testing an AI? Did you use any tool for this?

@THausherr
Copy link
Contributor

THausherr commented Mar 31, 2025

There are some minor refactorings that I will look at.
https://issues.apache.org/jira/browse/TIKA-4397

@syedamisbahh
Copy link
Author

i removed it for a while for my convinience and forgot to add them back, ill add it back and request a PR if that is the only issue, and yes it was related to my university assignment. I used designite to detect the code smells and then refactored some code accordingly.

I will surely go through the link you have provided and make necessary changes, thank you. If there is anything specific that you would want me to do, please do let me know!

@THausherr
Copy link
Contributor

THausherr commented Apr 2, 2025

What you could do:

  • Restore the comments in this PR (not in an additional PR)
  • Avoid pure formatting changes
  • Apply the changes from the last 2 days to this PR so that it gets easier to see what remains.
  • TIKA-4397 into title

Be aware that I might still not apply the remaining changes, which may be frustrating to you. Maybe somebody else will, maybe not.

@THausherr
Copy link
Contributor

I'm wondering whether the change with the MIME_TYPE_TAG is correct. The original code has a null check for type. I assume that the developer wanted to avoid nested tags of this type. While the test passes with the change, this doesn't mean the change can be done. We can't assume that the tests cover everything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants