A versatile command-line tool for processing PDFs and images with Mistral's OCR API. This script extracts text content as markdown and preserves any embedded images.
- Multi-format Support: Process both PDF documents and image files (JPG, PNG, GIF, BMP, TIFF, WEBP)
- Text Extraction: Convert documents and images to clean markdown
- Image Preservation: Extract embedded images from documents when the API response lists them in a page's images array
- Batch Processing: Process all supported files in a directory automatically
- Structured Output: Creates markdown files with the same name as the input files
- Python 3.6+
- mistralai Python package
- Mistral API key
- Clone this repository:
    git clone https://github.com/yourusername/mistralocr.git
    cd mistralocr
- Install the required dependencies:
    pip install mistralai
- Set up your Mistral API key:
  - Option 1: Edit the script and replace "your_api_key_here" with your actual API key
  - Option 2: Set an environment variable:
      export MISTRAL_API_KEY="your_api_key_here"
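The two key options above can be combined in one lookup. The following is a minimal sketch, not the script's actual code: the function name `resolve_api_key`, the constant `PLACEHOLDER_KEY`, and the choice to prefer the environment variable over an edited-in value are all assumptions made for illustration.

```python
import os

# Placeholder string from the setup steps; the constant name is hypothetical.
PLACEHOLDER_KEY = "your_api_key_here"


def resolve_api_key(hardcoded=PLACEHOLDER_KEY, env=None):
    """Return an API key, preferring MISTRAL_API_KEY over a hardcoded value.

    The precedence order is an assumption, not confirmed behavior of the script.
    """
    env = os.environ if env is None else env
    key = env.get("MISTRAL_API_KEY") or hardcoded
    # Reject a missing key and the unedited placeholder alike.
    if not key or key == PLACEHOLDER_KEY:
        raise ValueError(
            "Set MISTRAL_API_KEY or replace the placeholder key in the script"
        )
    return key
```

Failing early with a clear message avoids sending a request that the API would reject anyway.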
- Place the script in a directory containing PDF and/or image files you want to process
- Run the script:
    python mistral_ocr.py
- The script will:
  - Find all supported files in the current directory
  - Process each file with the Mistral OCR API
  - Save extracted text as markdown files (.md)
  - Extract embedded images to {filename}_images/ folders
  - Save full JSON responses as {filename}_full.json files
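The first step, finding supported files, might look like the sketch below. It is an illustration only: `find_supported_files` and `SUPPORTED_EXTENSIONS` are hypothetical names, and treating `.jpeg` and `.tif` as variants of JPG and TIFF is an assumption beyond the formats listed above.

```python
from pathlib import Path

# Extensions for the formats named in the feature list.
# ".jpeg" and ".tif" are assumed aliases, not confirmed by the script.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".tif", ".webp",
}


def find_supported_files(directory="."):
    """Return supported files directly inside `directory`, sorted by path."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        # Compare case-insensitively so "SCAN.PDF" is also picked up.
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

The scan is deliberately non-recursive, so images already written to a `{filename}_images/` folder are not picked up on a later run.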
For each input file example.pdf or example.jpg, the script creates:
- example.md: The extracted text content in markdown format
- example_full.json: The complete JSON response from the Mistral OCR API
- example_images/ (if images are found): A directory containing any extracted images
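The naming scheme above can be expressed as a small helper. This is a sketch under the assumption that outputs are written next to the input file; `output_paths` is a hypothetical name, not a function from the script.

```python
from pathlib import Path


def output_paths(input_file):
    """Derive the three output locations from an input file name."""
    src = Path(input_file)
    return {
        "markdown": src.with_suffix(".md"),
        "json": src.with_name(f"{src.stem}_full.json"),
        "images_dir": src.with_name(f"{src.stem}_images"),
    }
```

Under this scheme, example.pdf and example.jpg map to the same output names, so it may be worth keeping same-named inputs in separate directories.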
- File Detection: The script scans the current directory for PDF and image files
- Upload: Files are uploaded to Mistral's API
- OCR Processing: The API extracts text and identifies any embedded images
- Content Extraction: The script parses the API response to get markdown content
- Image Extraction: Any embedded images are saved to a separate directory
- Output Generation: Markdown files are created with references to the extracted images
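The content extraction step can be sketched as follows, assuming the API response has already been converted to a plain dictionary with a `pages` list whose entries carry `index` and `markdown` fields. The function name `extract_markdown` and the blank-line page separator are assumptions for illustration.

```python
def extract_markdown(response):
    """Join the markdown of all pages, ordered by page index."""
    pages = sorted(
        response.get("pages", []),
        key=lambda page: page.get("index", 0),
    )
    # A blank line between pages keeps paragraphs from running together.
    return "\n\n".join(page.get("markdown", "") for page in pages)
```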
The Mistral OCR API returns a JSON structure with the following key elements:
    {
      "pages": [
        {
          "index": 0,
          "markdown": "# Document Title\n\nSample text content with ![img-0.jpeg](img-0.jpeg) image reference.",
          "images": [
            {
              "id": "img-0.jpeg",
              "top_left_x": 253,
              "top_left_y": 473,
              "bottom_right_x": 1630,
              "bottom_right_y": 1792,
              "image_base64": "[BASE64_DATA_REMOVED]",
              "format": "jpeg"
            }
          ],
          "dimensions": {"dpi": 200, "height": 5139, "width": 2387}
        }
      ],
      "model": "mistral-ocr-latest",
      "usage_info": {"pages_processed": 1, "doc_size_bytes": 199841}
    }
In the saved {filename}_full.json file, the Base64 image data is replaced with the placeholder "[BASE64_DATA_REMOVED]" because the decoded images are already stored separately. This makes the JSON cleaner and smaller.
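Saving the images and stripping their Base64 data could be done in one pass over the response. The sketch below works on a plain dictionary shaped like the structure above. The name `save_images_and_strip` is hypothetical, and handling a `data:` URI prefix is an assumption about how the payload may arrive.

```python
import base64
import copy
from pathlib import Path

PLACEHOLDER = "[BASE64_DATA_REMOVED]"


def save_images_and_strip(response, images_dir):
    """Write embedded images to `images_dir` and return a copy of the
    response with each image payload replaced by a placeholder."""
    cleaned = copy.deepcopy(response)  # leave the caller's data untouched
    images_dir = Path(images_dir)
    saved = []
    for page in cleaned.get("pages", []):
        for image in page.get("images", []):
            data = image.get("image_base64")
            if not data or data == PLACEHOLDER:
                continue
            # Assumption: the payload may be a data URI such as
            # "data:image/jpeg;base64,...."; keep only the encoded part.
            if data.startswith("data:") and "," in data:
                data = data.split(",", 1)[1]
            images_dir.mkdir(parents=True, exist_ok=True)
            # Use only the file name so an odd id cannot escape images_dir.
            target = images_dir / Path(image["id"]).name
            target.write_bytes(base64.b64decode(data))
            saved.append(target)
            image["image_base64"] = PLACEHOLDER
    return cleaned, saved
```

The directory is created only when the first image is found, which matches the "if images are found" behavior described for the images folder.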
- The script requires an active internet connection to access the Mistral API
- Processing large files may take time depending on your network speed
- The quality of text extraction depends on the clarity of the original documents
- API usage may be subject to rate limits or costs, depending on your Mistral account
- This tool uses the Mistral OCR API