Commit aacec56

Added Python MarkItDown docs (#1977)
* Added Python MarkItDown docs
* Simplified the task
* Update example to use fetch and improved sidebar titles
* Fixed link
1 parent 0df7af4 commit aacec56

6 files changed

+229 -3 lines changed

docs/docs.json

+1
@@ -312,6 +312,7 @@
   "group": "Python guides",
   "pages": [
     "guides/python/python-image-processing",
+    "guides/python/python-doc-to-markdown",
     "guides/python/python-crawl4ai",
     "guides/python/python-pdf-form-extractor"
   ]

docs/guides/introduction.mdx

+1
@@ -29,6 +29,7 @@ Get set up fast using our detailed walk-through guides.
 | [Cursor rules](/guides/cursor-rules) | Use Cursor rules to help write Trigger.dev tasks |
 | [Prisma](/guides/frameworks/prisma) | How to setup Prisma with Trigger.dev |
 | [Python image processing](/guides/python/python-image-processing) | Use Python and Pillow to process images |
+| [Python document to markdown](/guides/python/python-doc-to-markdown) | Use Python and MarkItDown to convert documents to markdown |
 | [Python PDF form extractor](/guides/python/python-pdf-form-extractor) | Use Python, PyMuPDF and Trigger.dev to extract data from a PDF form |
 | [Python web crawler](/guides/python/python-crawl4ai) | Use Python, Crawl4AI and Playwright to create a headless web crawler |
 | [Sequin database triggers](/guides/frameworks/sequin) | Trigger tasks from database changes using Sequin |

docs/guides/python/python-crawl4ai.mdx

+1 -1
@@ -1,6 +1,6 @@
 ---
 title: "Python headless browser web crawler example"
-sidebarTitle: "Python headless web crawler"
+sidebarTitle: "Headless web crawler"
 description: "Learn how to use Python, Crawl4AI and Playwright to create a headless browser web crawler with Trigger.dev."
 ---

docs/guides/python/python-doc-to-markdown.mdx

+224
@@ -0,0 +1,224 @@
---
title: "Convert documents to markdown using Python and MarkItDown"
sidebarTitle: "Convert docs to markdown"
description: "Learn how to use Trigger.dev with Python to convert documents to markdown using MarkItDown."
---

import PythonLearnMore from "/snippets/python-learn-more.mdx";

<Note>
  This project uses Trigger.dev v4 (which is currently in beta as of 28 April 2025). If you want to
  run this project you will need to [upgrade to v4](/upgrade-to-v4).
</Note>

## Overview

Convert documents to markdown using Microsoft's [MarkItDown](https://github.com/microsoft/markitdown) library. This can be especially useful for preparing documents in a structured format for AI applications.

## Prerequisites

- A project with [Trigger.dev initialized](/quick-start)
- [Python](https://www.python.org/) installed on your local machine. _This example requires Python 3.10 or higher._

## Features

- A Trigger.dev task which downloads a document from a URL and runs the Python script which converts it to markdown
- A Python script to convert documents to markdown using Microsoft's [MarkItDown](https://github.com/microsoft/markitdown) library
- Uses our [Python build extension](/config/extensions/pythonExtension) to install dependencies and run Python scripts

## GitHub repo

<Card
  title="View the project on GitHub"
  icon="GitHub"
  href="https://github.com/triggerdotdev/examples/tree/main/python-doc-to-markdown-converter"
>
  Click here to view the full code for this project in our examples repository on GitHub. You can
  fork it and use it as a starting point for your own project.
</Card>

## The code
41+
42+
### Build configuration
43+
44+
After you've initialized your project with Trigger.dev, add these build settings to your `trigger.config.ts` file:
45+
46+
```ts trigger.config.ts
47+
import { pythonExtension } from "@trigger.dev/python/extension";
48+
import { defineConfig } from "@trigger.dev/sdk/v3";
49+
50+
export default defineConfig({
51+
runtime: "node",
52+
project: "<your-project-ref>",
53+
// Your other config settings...
54+
build: {
55+
extensions: [
56+
pythonExtension({
57+
// The path to your requirements.txt file
58+
requirementsFile: "./requirements.txt",
59+
// The path to your Python binary
60+
devPythonBinaryPath: `venv/bin/python`,
61+
// The paths to your Python scripts to run
62+
scripts: ["src/python/**/*.py"],
63+
}),
64+
],
65+
},
66+
});
67+
```
68+
69+
<Info>
70+
Learn more about executing scripts in your Trigger.dev project using our Python build extension
71+
[here](/config/extensions/pythonExtension).
72+
</Info>
73+
### Task code

This task downloads the document to a temporary file, then uses the `python.runScript` method to run the `markdown-converter.py` script, passing the temporary file path as a JSON-encoded argument.

```ts src/trigger/convertToMarkdown.ts
import { task } from "@trigger.dev/sdk/v3";
import { python } from "@trigger.dev/python";
import * as fs from "fs";
import * as path from "path";
import * as os from "os";

export const convertToMarkdown = task({
  id: "convert-to-markdown",
  run: async (payload: { url: string }) => {
    const { url } = payload;

    // STEP 1: Create temporary file with unique name
    const tempDir = os.tmpdir();
    const fileName = `doc-${Date.now()}-${Math.random().toString(36).substring(2, 7)}`;
    const urlPath = new URL(url).pathname;
    // Fall back to .docx when the URL path has no file extension
    const extension = path.extname(urlPath) || ".docx";
    const tempFilePath = path.join(tempDir, `${fileName}${extension}`);

    // STEP 2: Download file from URL
    const response = await fetch(url);
    if (!response.ok) {
      // Fail early instead of converting an HTTP error page
      throw new Error(`Failed to download ${url}: ${response.status} ${response.statusText}`);
    }
    const buffer = await response.arrayBuffer();
    await fs.promises.writeFile(tempFilePath, Buffer.from(buffer));

    // STEP 3: Run Python script to convert document to markdown
    const pythonResult = await python.runScript("./src/python/markdown-converter.py", [
      JSON.stringify({ file_path: tempFilePath }),
    ]);

    // STEP 4: Clean up temporary file
    fs.unlink(tempFilePath, () => {});

    // STEP 5: Process result
    if (pythonResult.stdout) {
      const result = JSON.parse(pythonResult.stdout);
      return {
        url,
        markdown: result.status === "success" ? result.markdown : null,
        error: result.status === "error" ? result.error : null,
        success: result.status === "success",
      };
    }

    return {
      url,
      markdown: null,
      error: "No output from Python script",
      success: false,
    };
  },
});
```

### Add a requirements.txt file
132+
133+
Add the following to your `requirements.txt` file. This is required in Python projects to install the dependencies.
134+
135+
```txt requirements.txt
136+
markitdown[all]
137+
```
138+
139+
### The Python script
140+
141+
The Python script uses MarkItDown to convert documents to Markdown format.
142+
143+
```python src/python/markdown-converter.py
144+
import json
145+
import sys
146+
import os
147+
from markitdown import MarkItDown
148+
149+
def convert_to_markdown(file_path):
150+
"""Convert a file to markdown format using MarkItDown"""
151+
# Check if file exists
152+
if not os.path.exists(file_path):
153+
raise FileNotFoundError(f"File not found: {file_path}")
154+
155+
# Initialize MarkItDown
156+
md = MarkItDown()
157+
158+
# Convert the file
159+
try:
160+
result = md.convert(file_path)
161+
return result.text_content
162+
except Exception as e:
163+
raise Exception(f"Error converting file: {str(e)}")
164+
165+
def process_trigger_task(file_path):
166+
"""Process a file and convert to markdown"""
167+
try:
168+
markdown_result = convert_to_markdown(file_path)
169+
return {
170+
"status": "success",
171+
"markdown": markdown_result
172+
}
173+
except Exception as e:
174+
return {
175+
"status": "error",
176+
"error": str(e)
177+
}
178+
179+
if __name__ == "__main__":
180+
# Get the file path from command line arguments
181+
if len(sys.argv) < 2:
182+
print(json.dumps({"status": "error", "error": "No file path provided"}))
183+
sys.exit(1)
184+
185+
try:
186+
config = json.loads(sys.argv[1])
187+
file_path = config.get("file_path")
188+
189+
if not file_path:
190+
print(json.dumps({"status": "error", "error": "No file path specified in config"}))
191+
sys.exit(1)
192+
193+
result = process_trigger_task(file_path)
194+
print(json.dumps(result))
195+
except Exception as e:
196+
print(json.dumps({"status": "error", "error": str(e)}))
197+
sys.exit(1)
198+
```
199+
200+
## Testing your task
201+
202+
1. Create a virtual environment `python -m venv venv`
203+
2. Activate the virtual environment, depending on your OS: On Mac/Linux: `source venv/bin/activate`, on Windows: `venv\Scripts\activate`
204+
3. Install the Python dependencies `pip install -r requirements.txt`. _Make sure you have Python 3.10 or higher installed._
205+
4. Copy the project ref from your [Trigger.dev dashboard](https://cloud.trigger.dev) and add it to the `trigger.config.ts` file.
206+
5. Run the Trigger.dev CLI `dev` command (it may ask you to authorize the CLI if you haven't already).
207+
6. Test the task in the dashboard by providing a valid document URL.
208+
7. Deploy the task to production using the Trigger.dev CLI `deploy` command.
209+
## MarkItDown conversion capabilities

- Convert various file formats to Markdown:
  - Office formats (Word, PowerPoint, Excel)
  - PDFs
  - Images (with optional LLM-generated descriptions)
  - HTML, CSV, JSON, XML
  - Audio files (with optional transcription)
  - ZIP archives
  - And more
- Preserve document structure (headings, lists, tables, etc.)
- Handle multiple input methods (file paths, URLs, base64 data)
- Optional Azure Document Intelligence integration for better PDF and image conversion

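Because the converted output keeps headings, one way to prepare it for an AI application is to split it into sections by heading. The following is a minimal sketch of that idea using only the standard library. `split_by_headings` is a hypothetical helper, not a MarkItDown function, and it assumes ATX-style headings (`#`, `##`, ...). It does not account for `#` lines inside fenced code blocks.

```python
import re


def split_by_headings(markdown):
    """Split Markdown into (heading, body) pairs at ATX headings (#, ##, ...)."""
    sections = []
    heading, body = None, []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):
            # Close the previous section before starting a new one.
            if heading is not None or any(part.strip() for part in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    # Flush the final section, if there is one.
    if heading is not None or any(part.strip() for part in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections
```

Text that appears before the first heading is returned with `None` as its heading.
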
<PythonLearnMore />

docs/guides/python/python-image-processing.mdx

+1 -1
@@ -1,6 +1,6 @@
 ---
 title: "Python image processing example"
-sidebarTitle: "Python image processing"
+sidebarTitle: "Process images"
 description: "Learn how to use Trigger.dev with Python to process images from URLs and upload them to S3."
 ---

docs/guides/python/python-pdf-form-extractor.mdx

+1 -1
@@ -1,6 +1,6 @@
 ---
 title: "Python PDF form extractor example"
-sidebarTitle: "Python PDF form extractor"
+sidebarTitle: "Extract form data from PDFs"
 description: "Learn how to use Trigger.dev with Python to extract form data from PDF files."
 ---
