[MODULE] - Paragraph contains regular/markdown table #346

Open
@jhoetter

Please describe the module you would like to add to bricks
In the context of RAG (Retrieval Augmented Generation):
If a paragraph contains a table, I want to be able to filter for it easily. A table generally means the paragraph is more complex.

Do you already have an implementation?
This is nowhere near perfect. It is just a first heuristic I used previously; it detects whether a text likely contains a table that has no markdown structure at all, e.g. a pricing table that consists of a few headers followed only by prices.

import re


def likely_contains_tabular_data(text):
    # Check for sequences of numbers and special symbols,
    # accounting for whitespace or tabs between elements
    number_pattern = re.compile(r"(\d+([.,]\d+)?)+")
    whitespace_pattern = re.compile(r"\s+")

    lines = text.split("\n")
    lines_with_patterns = [
        line for line in lines if len(number_pattern.findall(line)) >= 3
    ]

    # Heuristic 1: If we find many sequences of numbers separated by whitespace in the same line,
    # it might be tabular data. Let's assume we need at least 3 such sequences for a line to be considered.
    if any(len(number_pattern.findall(line)) >= 3 for line in lines):
        return True

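    # Heuristic 2: Check if there are multiple lines with a similar structure of elements.
    # If the majority of these lines have a similar number of whitespace-separated elements,
    # it might be a sign of tabular data.
    # Note: lines_with_patterns holds exactly the lines that already satisfy Heuristic 1,
    # so as written this check never runs on a non-empty list.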
    if lines_with_patterns:
        elements_counts = [
            len(whitespace_pattern.split(line)) for line in lines_with_patterns
        ]
        avg_elements = sum(elements_counts) / len(elements_counts)

        # If most lines have a number of elements close to the average, consider it tabular
        if (
            sum(1 for count in elements_counts if abs(count - avg_elements) <= 1)
            >= len(elements_counts) * 0.7
        ):
            return True

    # No heuristic matched
    return False
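
The heuristic above only targets tables without markdown structure, while the title also asks for markdown tables. A minimal sketch of that case could look like the following. It assumes GitHub-flavored table syntax (a header row directly followed by a delimiter row such as `| --- | :---: |`); the function name `contains_markdown_table` is hypothetical and not part of bricks.

```python
import re

# Delimiter row of a pipe table, e.g. "| --- | :---: |" or "---|---".
# Each cell is one or more hyphens with optional colons for alignment.
# Assumption: GitHub-flavored markdown table syntax.
_DELIMITER_ROW = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def contains_markdown_table(text: str) -> bool:
    """Return True if some line with a pipe is directly followed by a delimiter row."""
    lines = text.split("\n")
    for header, delimiter in zip(lines, lines[1:]):
        # Requiring a pipe in both rows keeps a plain "---" horizontal rule
        # from being mistaken for a table.
        if "|" in header and "|" in delimiter and _DELIMITER_ROW.match(delimiter):
            return True
    return False


# Usage: keep only the paragraphs that contain a markdown table.
paragraphs = [
    "Plain prose without any table.",
    "| Plan | Price |\n| --- | ---: |\n| Basic | 10 |",
]
with_tables = [p for p in paragraphs if contains_markdown_table(p)]
```

Checking for the delimiter row rather than for any line with pipes avoids flagging prose that merely uses `|` as a separator. It will still miss tables whose delimiter row was lost during extraction.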

Additional context
-
