A Docker-based pipeline that transcribes audio recordings and generates refined summaries/notes using AI language models (LLMs). It leverages:
- NVIDIA GPU for accelerating Whisper (transcription) and Phi (summarization) within Docker containers
- FastAPI + Uvicorn for a RESTful backend
- Streamlit for a user-friendly frontend UI
- Ollama for hosting the LLM (Phi 3.5 mini-instruct) and performing the text summarization. Ollama was chosen primarily because the project uses GGUF models; the quantized Q4_K_M build offers a good balance of quality and performance.
## Table of Contents

- Overview
- Architecture
- Folder Structure
- Installation Requirements
- Environment Variables
- Usage
- Technical Details
- Logging & Monitoring
- Additional Notes
- Troubleshooting
- License
## Overview

This project aims to provide an end-to-end solution for:

- Transcribing long or short audio recordings via OpenAI Whisper (medium model)
- Summarizing those transcripts with a Phi model (`phi3.5` mini-instruct) running inside an Ollama container (a pull example follows this list)
- Simple Docker Compose stack with two services:
- app: Runs both FastAPI and Streamlit in one container
- ollama: Provides the Summarization Large Language Model
- Automatic GPU offloading if NVIDIA drivers and the NVIDIA Container Toolkit are available
- Streamlit frontend for easy user interaction: drag-and-drop audio, see transcription & summary
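For reference, fetching the GGUF model into the running `ollama` container might look like the following. This is a sketch; the `phi3.5` tag is an assumption, so use whatever model name `summarization.py` actually requests:

```bash
# Pull the Phi 3.5 mini-instruct model inside the ollama container (illustrative)
docker exec -it ollama ollama pull phi3.5
```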
## Architecture

```
+----------------------------+
| Docker Container (ollama)  |
|     LLM Summarization      |
|  (Phi 3.5 mini-instruct)   |
+--------------^-------------+
               |
+--------------+-------------------+
|      Docker Container (app)      |
|  +----------------------------+  |
|  |  FastAPI    |  Streamlit   |  |
|  | (Uvicorn)   |  (web UI)    |  |
|  |   :8000     |   :8501      |  |
|  +----------------------------+  |
|       |               |          |
|   (Whisper)    (User Uploads)    |
+--------------+-------------------+
               |
    +------------------------+
    | Whisper Transcription  |
    |   (GPU-accelerated)    |
    +------------------------+
```
## Folder Structure

```
LocalAudioTran-LLM-Summar/
├─ .dockerignore
├─ .env
├─ .gitignore
├─ README.md
├─ docker-compose.yml
├─ Dockerfile
├─ backend/
│  ├─ requirements.txt
│  └─ app/
│     ├─ main.py
│     ├─ services/
│     │  ├─ transcription.py
│     │  ├─ summarization.py
│     │  └─ __init__.py
│     ├─ utils/
│     │  └─ logger.py
│     ├─ models/
│     │  ├─ schemas.py
│     │  └─ __init__.py
│     └─ __init__.py
├─ frontend/
│  ├─ requirements.txt
│  └─ src/
│     └─ app.py
└─ logs/
```
- `backend/`: Houses the FastAPI application
  - `main.py` - Primary endpoints
  - `transcription.py` - Whisper-based audio transcription
  - `summarization.py` - Ollama integration and multi-step summary approach
  - `logger.py` - Rotating logs setup
- `frontend/`: Contains the Streamlit interface
- `docker-compose.yml`: Defines the `app` and `ollama` services
- `Dockerfile`: System setup and dependencies
## Installation Requirements

- Install Docker
- Install the Docker Compose plugin
- Verify the installation:

  ```bash
  docker --version
  docker-compose --version
  ```

- Install the NVIDIA Container Toolkit
- Verify GPU visibility:

  ```bash
  nvidia-smi
  ```

Resource requirements:

- Disk space:
  - Docker images: >1 GB
  - Full environment + models: several GB
- RAM: 32-64 GB recommended
- GPU memory: 12-16 GB recommended (if using GPU)
- Internet connection: required for downloading models
## Environment Variables

Create a `.env` file at the repository root:

```env
HF_TOKEN=hf_123yourhuggingfacetoken
PYTHONPATH=/app
NVIDIA_VISIBLE_DEVICES=all
```

**Important**: Never commit sensitive tokens to public repositories.
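Inside the containers these values are read from the process environment; for example (illustrative, not the repo's exact code):

```python
import os

# HF_TOKEN is used when downloading gated models from Hugging Face
hf_token = os.environ.get("HF_TOKEN")
```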
## Usage

```bash
# Build Docker images
docker-compose build

# Start services
docker-compose up
```

This creates two containers (a minimal compose sketch follows):

- `app`: FastAPI (`:8000`) + Streamlit (`:8501`)
- `ollama`: LLM server (`:11434`)
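The sketch below shows roughly how such a stack can be declared; the service details are assumptions, not a copy of the repo's `docker-compose.yml`:

```yaml
services:
  app:
    build: .
    env_file: .env
    ports:
      - "8000:8000"   # FastAPI
      - "8501:8501"   # Streamlit
    depends_on:
      - ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  ollama:
    image: ollama/ollama
    # No host port mapping needed: the app reaches it at ollama:11434
    # over the compose network.
```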
Access points:

- Frontend (Streamlit): http://localhost:8501
- Backend (FastAPI): http://localhost:8000
- Ollama: port `11434` (internal use only)
To process a recording:

1. Open the Streamlit interface
2. Upload an audio file (supported: mp3, wav, m4a)
3. Click "Process Audio"
4. View results in the "Transcription" and "Summary" tabs
5. Optionally, use the clipboard button to copy the summary as a text file

The backend can also be called directly, as sketched below.
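A hypothetical example of programmatic access; the endpoint path and response shape are assumptions, so check `backend/app/main.py` for the actual routes:

```python
import requests

# Upload an audio file to the (assumed) transcription endpoint
with open("meeting.mp3", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/transcribe",              # hypothetical route
        files={"file": ("meeting.mp3", f, "audio/mpeg")},
        timeout=600,  # long recordings can take several minutes
    )
resp.raise_for_status()
print(resp.json())
```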
## Technical Details

Transcription flow (a minimal sketch follows this list):

1. FastAPI receives an `UploadFile`
2. The file is saved to temporary storage
3. Whisper processes the audio (GPU-accelerated if available)
4. Results are returned to the client
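A minimal sketch of the Whisper step, assuming the standard `openai-whisper` package; see `backend/app/services/transcription.py` for the actual implementation:

```python
import whisper

# Load the medium model (the size this project uses); Whisper picks
# the GPU automatically when CUDA is available.
model = whisper.load_model("medium")

# Transcribe a file previously saved to temporary storage
result = model.transcribe("/tmp/upload.mp3")
print(result["text"])
```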
Summarization:

- Direct processing: the transcript is handled in a single pass by the Phi model. A large context window is chosen so the model can process the entire transcript without truncation or chunking into overlapping sections, since chunking degrades summary quality.
- Structured output: the summary is organized into clear sections (see the Ollama sketch after this list):
  - Overview
  - Main Points
  - Key Insights
  - Action Items / Decisions
  - Open Questions / Next Steps
  - Conclusions
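A sketch of the Ollama call, assuming its standard `/api/generate` endpoint; the prompt wording and model tag are illustrative, not copied from `summarization.py`:

```python
import requests

SECTIONS = [
    "Overview", "Main Points", "Key Insights",
    "Action Items / Decisions", "Open Questions / Next Steps", "Conclusions",
]

def summarize(transcript: str) -> str:
    prompt = (
        "Summarize the transcript below into these sections: "
        + ", ".join(SECTIONS) + ".\n\n" + transcript
    )
    resp = requests.post(
        "http://ollama:11434/api/generate",  # the ollama service on the compose network
        json={"model": "phi3.5", "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```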
## Logging & Monitoring

- Backend:
  - `logs/api.log`
  - `logs/transcription.log`
  - `logs/summarization.log`
- Frontend:
  - `logs/frontend.log`

View combined container logs:

```bash
docker-compose logs -f
```
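A minimal sketch of a rotating-log setup consistent with the description of `utils/logger.py`; the size limit and backup count are assumptions:

```python
import logging
from logging.handlers import RotatingFileHandler

def get_logger(name: str, logfile: str) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Rotate at ~5 MB, keeping three old files (assumed values)
    handler = RotatingFileHandler(logfile, maxBytes=5_000_000, backupCount=3)
    handler.setFormatter(
        logging.Formatter("%(asctime)s | %(name)s | %(levelname)s | %(message)s")
    )
    logger.addHandler(handler)
    return logger

log = get_logger("transcription", "logs/transcription.log")
log.info("Transcription service started")
```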
## Troubleshooting

GPU not detected:

```bash
# Verify Docker GPU access
docker run --rm --gpus all nvidia/cuda:12.1.0-runtime-ubuntu22.04 nvidia-smi

# Check the host environment
nvidia-smi
```

Slow transcription:

- Check whether processing has fallen back to CPU
- Try a smaller Whisper model in `transcription.py`

Out-of-memory errors (a sketch of both remedies follows):

- Reduce the model size
- Lower the context window
- Ensure sufficient GPU memory (16-24 GB)
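Hypothetical examples of both remedies; the values are illustrative and belong in `transcription.py` and `summarization.py` respectively:

```python
import whisper

# Remedy 1: load a smaller Whisper model than the default "medium"
model = whisper.load_model("small")

# Remedy 2: lower Ollama's context window via the num_ctx option
ollama_payload = {
    "model": "phi3.5",
    "prompt": "...",
    "stream": False,
    "options": {"num_ctx": 4096},  # smaller context -> less VRAM
}
```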
Port conflicts: the defaults are FastAPI on `:8000` and Streamlit on `:8501`. If either port is already in use, edit the port mappings in `docker-compose.yml`, as sketched below.
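For example, remapping only the host-side Streamlit port (a sketch):

```yaml
services:
  app:
    ports:
      - "8000:8000"
      - "8502:8501"   # host 8502 -> container 8501 (Streamlit)
```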
## License

MIT License

Copyright (c) 2025 AskAresh.com
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.