Commit 6f5eb2d

Thibaud Bourgeois authored and committed
Added README.md
1 parent ca5badc commit 6f5eb2d

File tree

2 files changed: +99 −2 lines changed

Diff for: .env.example (+2 −1)

```diff
@@ -1 +1,2 @@
-PYTORCH_ENABLE_MPS_FALLBACK=
+PYTORCH_ENABLE_MPS_FALLBACK=
+TORCHAUDIO_USE_BACKEND_DISPATCHER=1
```

Diff for: README.md (+97 −1)

@@ -1 +1,97 @@

Removed the old title line:

```diff
-<h1 align="center">Whisper Diarization Using Nemo</h1>
```

Added the new README content:

# Whisper + Diarization using Nemo

## Project Overview

This project combines speech recognition and speaker diarization to transcribe audio recordings and identify who is speaking. It uses OpenAI's Whisper model for transcription and NVIDIA's NeMo MSDD model for speaker diarization, and can process several types of audio, including telephonic, meeting, and general conversations, with high accuracy and efficiency.

## Installation

### Prerequisites

- Python 3.10
- A CUDA-enabled GPU (optional, but recommended for faster processing)

### Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/thibaudbrg/whisper-diarization.git
   cd whisper-diarization
   ```

2. Install dependencies using Poetry:

   ```bash
   poetry install
   ```

3. Create a `.env` file from the provided example and adjust it if necessary:

   ```bash
   cp .env.example .env
   ```
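
Once the file exists, its variables can be exported into the current shell before launching the scripts. This is a minimal sketch, assuming plain unquoted `KEY=VALUE` lines; the values written here are illustrative, not taken from `.env.example` (and if the project loads `.env` itself, e.g. via python-dotenv, this step may be unnecessary):

```bash
# Illustrative .env content (in practice the file comes from `cp .env.example .env`).
printf 'PYTORCH_ENABLE_MPS_FALLBACK=1\nTORCHAUDIO_USE_BACKEND_DISPATCHER=1\n' > .env

# Auto-export every variable assigned while sourcing the file.
set -a          # mark all variables defined from here on for export
. ./.env        # each KEY=VALUE line becomes an environment variable
set +a          # stop auto-exporting
```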

## Usage

### Command Line Interface

Run the main script `concurrent_diarize.py` with the required arguments:

```bash
python whisper_diarization/concurrent_diarize.py -a <audio_file.wav> --whisper-model <model_name>
```

### Command Line Arguments

- `-a, --audio`: Name of the target audio file (required).
- `--no-stem`: Disables source separation. Useful for long files that don't contain much music.
- `--suppress_numerals`: Converts all numerical digits into written text, which can improve diarization accuracy.
- `--whisper-model`: Name of the Whisper model to use (default: `medium.en`).
- `--batch-size`: Batch size for batched inference. Reduce if you run out of memory; set to 0 for non-batched inference (default: 8).
- `--language`: Language spoken in the audio. Specify `None` to perform language detection.
- `--device`: Device to run the model on. Use `cuda` if you have a GPU, otherwise `cpu`.
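
Putting several of these flags together, one plausible invocation looks like the sketch below. The audio file name and batch size are illustrative, and the GPU check via `nvidia-smi` is an assumption, not something the script does itself:

```bash
# Pick --device automatically: cuda when nvidia-smi is present, else cpu (sketch).
if command -v nvidia-smi >/dev/null 2>&1; then DEVICE=cuda; else DEVICE=cpu; fi

# Assemble the full command; meeting.wav and the batch size are placeholders.
CMD="python whisper_diarization/concurrent_diarize.py -a meeting.wav \
--whisper-model medium.en --batch-size 4 --device $DEVICE --no-stem"
echo "$CMD"
```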
### Script Overview

`concurrent_diarize.py` orchestrates the entire transcription and speaker-diarization process. At a high level, it performs the following steps:

1. **Parsing Command Line Arguments**: Accepts various arguments to customize the transcription and diarization process.
2. **Vocal Isolation**: Uses Demucs to separate vocals from background music, unless the `--no-stem` flag is set.
3. **Transcription**: Transcribes the audio with the Whisper model.
4. **Forced Alignment**: Aligns the transcribed text with the audio using Wav2Vec2.
5. **Mono Audio Conversion**: Converts the audio to mono for compatibility with NeMo MSDD.
6. **Speaker Diarization**: Performs speaker diarization using the NeMo MSDD model.
7. **Restoring Punctuation**: Restores punctuation in the transcribed text using a deep learning model.
8. **Writing Output Files**: Generates and saves the final speaker-aware transcript and SRT files.
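
The script runs these stages internally, but the two audio-preparation steps can also be approximated from the command line. This is a sketch assuming `demucs` and `ffmpeg` are installed; the file names and the 16 kHz sample rate are illustrative assumptions, not taken from the script:

```bash
# Step 2 equivalent: separate vocals from accompaniment with Demucs.
demucs --two-stems=vocals meeting.wav

# Step 5 equivalent: downmix to mono (16 kHz is a common rate for speech models).
ffmpeg -i meeting.wav -ac 1 -ar 16000 meeting_mono.wav
```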

## Configurations

Configuration files for different diarization scenarios (general, meeting, telephonic) are stored in the `config` directory. You can customize these YAML files to suit your needs.

## Output

Processed outputs, including transcribed text files and SRT subtitle files, are saved in the `outputs` directory.

## Troubleshooting

- Ensure your audio files are in a supported format (e.g., WAV).
- Verify that you have the correct versions of all dependencies installed.
- For CUDA-related issues, make sure your GPU drivers and CUDA toolkit are correctly installed.
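
For the first point, a quick way to confirm a file really is a WAV file is to inspect its header, which starts with `RIFF` and has `WAVE` at byte 9. `is_wav` below is a hypothetical helper built from standard shell tools, not part of the project:

```bash
# Return success if the file carries a RIFF/WAVE header (sketch).
is_wav() {
  [ "$(head -c 4 "$1")" = "RIFF" ] && [ "$(tail -c +9 "$1" | head -c 4)" = "WAVE" ]
}

# Example usage: is_wav recording.wav && echo "looks like a WAV file"
```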

## License

This project is licensed under the MIT License. For more information, see the `LICENSE` file.
