Mistral OCR Processing Tool

A versatile command-line tool for processing PDFs and images with Mistral's OCR API. This script extracts text content as markdown and preserves any embedded images.

Features

- Multi-format Support: Process both PDF documents and image files (JPG, PNG, GIF, BMP, TIFF, WEBP)
- Text Extraction: Convert documents and images to clean markdown
- Image Preservation: Extract embedded images from documents when the API response includes them in its `images` array
- Batch Processing: Process all supported files in a directory automatically
- Structured Output: Create markdown files with the same names as the input files

Requirements

- Python 3.6+
- mistralai Python package
- Mistral API key

Installation

Clone this repository:

```
git clone https://github.com/yourusername/mistralocr.git
cd mistralocr
```

Install the required dependencies:
```
pip install mistralai
```
Set up your Mistral API key:
- Option 1: Edit the script and replace "your_api_key_here" with your actual API key
- Option 2: Set an environment variable:
```
export MISTRAL_API_KEY="your_api_key_here"
```
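
Reading the key from the environment (Option 2) keeps it out of the source file. A minimal sketch of how a script could pick it up, assuming a hypothetical helper named `load_api_key` that is not taken from `mistral_ocr.py`:

```python
import os


def load_api_key(env_var: str = "MISTRAL_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    `load_api_key` is a hypothetical helper used for illustration only.
    """
    key = os.environ.get(env_var, "").strip()
    # Treat the placeholder from the instructions above as "not configured".
    if not key or key == "your_api_key_here":
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Mistral API key"
        )
    return key
```

Failing early with a descriptive message is a design choice here; it avoids sending a request that the API would reject anyway.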

Usage

Place the script in a directory containing PDF and/or image files you want to process
Run the script:
```
python mistral_ocr.py
```
The script will:
- Find all supported files in the current directory
- Process each file with the Mistral OCR API
- Save extracted text as markdown files (.md)
- Extract embedded images to {filename}_images/ folders
- Save full JSON responses as {filename}_full.json files
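
The first step in the list above, finding supported files, could look roughly like the sketch below. The function name `find_supported_files` and the extra `.jpeg` and `.tif` extensions are assumptions for illustration; the actual script may match files differently.

```python
from pathlib import Path

# Extensions follow the formats listed under Features.
# ".jpeg" and ".tif" are added here as an assumption.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tif", ".tiff", ".webp",
}


def find_supported_files(directory: str = ".") -> list:
    """Return supported files in `directory` (non-recursive), sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Comparing the lowercased suffix means files such as `scan.PNG` are picked up as well.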

Output Files

For each input file example.pdf or example.jpg, the script creates:

- example.md: The extracted text content in markdown format
- example_full.json: The complete JSON response from the Mistral OCR API
- example_images/ (if images are found): A directory containing any extracted images
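
The naming scheme above can be expressed as a small mapping from an input path to its output locations. This is a sketch only; `output_paths` is a hypothetical helper, not a function from the script.

```python
from pathlib import Path


def output_paths(input_file: str) -> dict:
    """Map an input file to the three output locations described above."""
    src = Path(input_file)
    stem = src.stem  # "example" for "example.pdf"
    return {
        "markdown": src.with_name(f"{stem}.md"),
        "json": src.with_name(f"{stem}_full.json"),
        "images_dir": src.with_name(f"{stem}_images"),
    }
```

Under this scheme, `example.pdf` and `example.jpg` in the same directory map to the same output names, so it is probably safest to give input files distinct base names.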

How It Works

- File Detection: The script scans the current directory for PDF and image files
- Upload: Files are uploaded to Mistral's API
- OCR Processing: The API extracts text and identifies any embedded images
- Content Extraction: The script parses the API response to get markdown content
- Image Extraction: Any embedded images are saved to a separate directory
- Output Generation: Markdown files are created with references to the extracted images
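
The upload and OCR steps can be sketched with the `mistralai` Python client as follows. This is an illustration based on the v1 client interface (`files.upload`, `files.get_signed_url`, `ocr.process`), not the script's actual code. The helper names `build_document_payload` and `run_ocr` are hypothetical, and passing a signed URL as an `image_url` for image files is an assumption.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tif", ".tiff", ".webp"}


def build_document_payload(path: str, url: str) -> dict:
    """Choose the shape of the `document` argument based on the file type."""
    if Path(path).suffix.lower() in IMAGE_EXTENSIONS:
        return {"type": "image_url", "image_url": url}
    return {"type": "document_url", "document_url": url}


def run_ocr(path: str, api_key: str):
    """Upload one file and run OCR on it (needs network access and a valid key)."""
    # Imported lazily so the payload helper above works without the package.
    from mistralai import Mistral

    client = Mistral(api_key=api_key)
    with open(path, "rb") as fh:
        uploaded = client.files.upload(
            file={"file_name": Path(path).name, "content": fh},
            purpose="ocr",
        )
    # The OCR endpoint reads the uploaded file through a signed URL.
    signed = client.files.get_signed_url(file_id=uploaded.id)
    return client.ocr.process(
        model="mistral-ocr-latest",
        document=build_document_payload(path, signed.url),
        include_image_base64=True,  # needed to receive embedded image data
    )
```

If your installed client version exposes a different interface, check its documentation before relying on these calls.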

API Response Structure

The Mistral OCR API returns a JSON structure with the following key elements:

```
{
  "pages": [
    {
      "index": 0,
      "markdown": "# Document Title\n\nSample text content with ![img-0.jpeg](img-0.jpeg) image reference.",
      "images": [
        {
          "id": "img-0.jpeg",
          "top_left_x": 253,
          "top_left_y": 473,
          "bottom_right_x": 1630,
          "bottom_right_y": 1792,
          "image_base64": "[BASE64_DATA_REMOVED]",
          "format": "jpeg"
        }
      ],
      "dimensions": {"dpi": 200, "height": 5139, "width": 2387}
    }
  ],
  "model": "mistral-ocr-latest",
  "usage_info": {"pages_processed": 1, "doc_size_bytes": 199841}
}
```

In the saved JSON file, the base64 image data is replaced with a placeholder because the decoded images are already stored separately. This makes the JSON cleaner and smaller.
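
A response shaped like the one above could be turned into the output files roughly as follows. This is a sketch under several assumptions: the response is already a plain `dict`, `image_base64` may arrive as a data URI, and image links in the markdown are rewritten to point into the images folder. `save_outputs` is a hypothetical name, not a function from the script.

```python
import base64
import json
from pathlib import Path


def save_outputs(response: dict, stem: str, out_dir: str = ".") -> str:
    """Write {stem}.md, {stem}_full.json and {stem}_images/ from a response dict.

    Note: `response` is modified in place (base64 data is replaced).
    """
    out = Path(out_dir)
    images_dir = out / f"{stem}_images"
    pages_md = []
    for page in response.get("pages", []):
        md = page.get("markdown", "")
        for image in page.get("images", []):
            data = image.get("image_base64")
            if not data:
                continue
            # Assumption: the field may be a data URI ("data:image/jpeg;base64,...").
            if data.startswith("data:") and "," in data:
                data = data.split(",", 1)[1]
            images_dir.mkdir(parents=True, exist_ok=True)
            (images_dir / image["id"]).write_bytes(base64.b64decode(data))
            # Point the markdown link at the saved file inside the images folder.
            md = md.replace(f"]({image['id']})", f"]({images_dir.name}/{image['id']})")
            # Keep the stored JSON small, as described above.
            image["image_base64"] = "[BASE64_DATA_REMOVED]"
        pages_md.append(md)
    markdown = "\n\n".join(pages_md)
    (out / f"{stem}.md").write_text(markdown, encoding="utf-8")
    (out / f"{stem}_full.json").write_text(
        json.dumps(response, indent=2), encoding="utf-8"
    )
    return markdown
```

Pages are joined with a blank line between them; a different separator (for example a horizontal rule) would work just as well.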

Limitations

- The script requires an active internet connection to access the Mistral API
- Processing large files may take time depending on your network speed
- The quality of text extraction depends on the clarity of the original documents
- API usage may be subject to rate limits or costs, depending on your Mistral account

License

MIT License

Acknowledgements

This tool uses the Mistral OCR API

PetrAPConsulting/MistralOCR
