Open
Description
Please describe the module you would like to add to bricks
In the context of RAG (Retrieval Augmented Generation):
If a paragraph contains a table, i want to easily filter for it; generally, it means the paragraph has a higher complexity.
Do you already have an implementation?
This is nowhere near perfect, it just is a first heuristic I used previously that detects if there likely is a table without any kind of markdown structure. E.g. in a pricing table, which contains some headers and then just is prices.
import re
def likely_contains_tabular_data(text):
# Check for sequences of numbers and special symbols,
# accounting for whitespace or tabs between elements
number_pattern = re.compile(r"(\d+([.,]\d+)?)+")
whitespace_pattern = re.compile(r"\s+")
lines = text.split("\n")
lines_with_patterns = [
line for line in lines if len(number_pattern.findall(line)) >= 3
]
# Heuristic 1: If we find many sequences of numbers separated by whitespace in the same line,
# it might be tabular data. Let's assume we need at least 3 such sequences for a line to be considered.
if any(len(number_pattern.findall(line)) >= 3 for line in lines):
return True
# Heuristic 2: Check if there are multiple lines with a similar structure of elements.
# If the majority of these lines have a similar number of numerical elements,
# it might be a sign of tabular data.
if lines_with_patterns:
elements_counts = [
len(whitespace_pattern.split(line)) for line in lines_with_patterns
]
avg_elements = sum(elements_counts) / len(elements_counts)
# If most lines have a number of elements close to the average, consider it tabular
if (
sum(1 for count in elements_counts if abs(count - avg_elements) <= 1)
>= len(elements_counts) * 0.7
):
return True
# No heuristic matched
return False
Additional context
-