
vad : add initial Voice Activity Detection (VAD) support #3065


Open · wants to merge 4 commits into master from vad

Conversation

@danbev (Collaborator) commented Apr 22, 2025

This commit adds support for Voice Activity Detection (VAD). When enabled,
this feature processes the audio input and detects speech segments.
This information is then used to reduce the number of samples that need
to be processed by whisper_full.

This initial support is based on the Silero VAD model which needs to be converted to GGML format:

(venv) $ pip install silero-vad
(venv) $ python models/convert-silero-vad-to-ggml.py --output models/silero.bin
Saving GGML Silero-VAD model to models/silero-v5.1.2-ggml.bin

There is a test that tests the VAD support in isolation:

$ cmake --build build --target test-vad && \
    ctest -R ^test-vad$ --test-dir build -C Debug --output-on-failure -VV

And one that tests VAD in combination with whisper_full:

$ cmake --build build --target test-vad-full && \
    ctest -R test-vad-full --test-dir build -C Debug --output-on-failure -VV

Resolves: #3003
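The post-processing step from per-window speech probabilities to speech segments can be sketched as follows. This is a self-contained Python illustration of Silero-style thresholding, not the actual whisper.cpp implementation; the function name `probs_to_segments` and the parameter defaults are hypothetical, chosen only to mirror the CLI options introduced in this PR:

```python
def probs_to_segments(probs, threshold=0.5, window_s=512 / 16000,
                      min_speech_s=0.25, speech_pad_s=0.03):
    """Turn per-window speech probabilities into (start, end) segments
    in seconds. A sketch of Silero-style post-processing, not the
    whisper.cpp code."""
    segments = []
    start = None
    for i, p in enumerate(probs):
        t = i * window_s
        if p >= threshold and start is None:
            start = t                      # speech onset
        elif p < threshold and start is not None:
            segments.append((start, t))    # speech offset
            start = None
    if start is not None:                  # speech ran to the end
        segments.append((start, len(probs) * window_s))
    # drop segments shorter than the minimum speech duration
    segments = [(s, e) for s, e in segments if e - s >= min_speech_s]
    # pad the surviving segments slightly on both sides
    return [(max(0.0, s - speech_pad_s), e + speech_pad_s)
            for s, e in segments]

probs = [0.1] * 3 + [0.9] * 10 + [0.1] * 5 + [0.9] * 2 + [0.1] * 3
print(probs_to_segments(probs))  # the short second burst is filtered out
```

Only the ten-window run of high probabilities survives the minimum-duration filter; the two-window burst is dropped, which is the same effect the `--vad_min_speech_duration_ms` option controls.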


whisper-cli example output
./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav --vad --vad-model models/for-tests-silero-v5.1.2-ggml.bin
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (12th Gen Intel(R) Core(TM) i7-1260P)
whisper_init_with_params_no_state: devices    = 1
whisper_init_with_params_no_state: backends   = 1
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:          CPU total size =   147.37 MB
whisper_model_load: model size    =  147.37 MB
whisper_backend_init_gpu: no GPU found
whisper_init_state: kv self size  =    6.29 MB
whisper_init_state: kv cross size =   18.87 MB
whisper_init_state: kv pad  size  =    3.15 MB
whisper_init_state: compute buffer (conv)   =   16.26 MB
whisper_init_state: compute buffer (encode) =   85.86 MB
whisper_init_state: compute buffer (cross)  =    4.65 MB
whisper_init_state: compute buffer (decode) =   96.35 MB

system_info: n_threads = 4 / 16 | WHISPER : COREML = 0 | OPENVINO = 0 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

whisper_full_with_state: VAD is enabled, processing speach segments only
whisper_vad_init_from_file_with_params_no_state: loading VAD model from 'models/for-tests-silero-v5.1.2-ggml.bin'
whisper_vad_init_from_file_with_params_no_state: n_encoder_layers = 4
whisper_vad_init_from_file_with_params_no_state: encoder_in_channels[0] = 129
whisper_vad_init_from_file_with_params_no_state: encoder_in_channels[1] = 128
whisper_vad_init_from_file_with_params_no_state: encoder_in_channels[2] = 64
whisper_vad_init_from_file_with_params_no_state: encoder_in_channels[3] = 64
whisper_vad_init_from_file_with_params_no_state: encoder_out_channels[0] = 128
whisper_vad_init_from_file_with_params_no_state: encoder_out_channels[1] = 64
whisper_vad_init_from_file_with_params_no_state: encoder_out_channels[2] = 64
whisper_vad_init_from_file_with_params_no_state: encoder_out_channels[3] = 128
whisper_vad_init_from_file_with_params_no_state: kernel_sizes[0] = 3
whisper_vad_init_from_file_with_params_no_state: kernel_sizes[1] = 3
whisper_vad_init_from_file_with_params_no_state: kernel_sizes[2] = 3
whisper_vad_init_from_file_with_params_no_state: kernel_sizes[3] = 3
whisper_vad_init_from_file_with_params_no_state: lstm_input_size = 128
whisper_vad_init_from_file_with_params_no_state: lstm_hidden_size = 128
whisper_vad_init_from_file_with_params_no_state: final_conv_in = 128
whisper_vad_init_from_file_with_params_no_state: final_conv_out = 1
whisper_vad_init_from_file_with_params_no_state:          CPU total size =     0.88 MB
whisper_vad_init_from_file_with_params_no_state: model size    =    0.88 MB
whisper_backend_init_gpu: no GPU found
whisper_vad_build_graph: Building VAD graph
whisper_vad_build_graph: stft output shape = [4, 129, 1]
whisper_vad_build_encoder_layer: building encoder layer
whisper_vad_build_graph: endoder output shape = [1, 128, 1]
whisper_vad_build_lstm_layer: building LSTM layer
whisper_vad_build_lstm_layer: hidden dimension = 128
whisper_vad_build_graph: lstm output shape = [128, 1, 1]
whisper_vad_init_state: compute buffer (VAD)   =    1.59 MB
whisper_vad_detect_speech_timestamps: detecting speech timestamps in 176000 samples
whisper_vad_detect_speech: detecting speech in 176000 samples
whisper_vad_detect_speech: n_chunks: 344
whisper_vad_build_graph: Building VAD graph
whisper_vad_build_graph: stft output shape = [4, 129, 1]
whisper_vad_build_encoder_layer: building encoder layer
whisper_vad_build_graph: endoder output shape = [1, 128, 1]
whisper_vad_build_lstm_layer: building LSTM layer
whisper_vad_build_lstm_layer: hidden dimension = 128
whisper_vad_build_graph: lstm output shape = [128, 1, 1]
whisper_vad_detect_speech: props size: 344
whisper_vad_detect_speech: chunk_len: 384 < n_window: 512
whisper_vad_detect_speech: finished processing 176000 samples
whisper_vad_timestamps_from_probs: detecting speech timestamps using 344 probabilities
whisper_vad_timestamps_from_probs: Merged 0 adjacent segments, now have 5 segments
whisper_vad_timestamps_from_probs: Final speech segments after filtering: 5
whisper_vad_timestamps_from_probs: VAD segment 0: start = 0.29, end = 2.21 (duration: 1.92)
whisper_vad_timestamps_from_probs: VAD segment 1: start = 3.30, end = 3.77 (duration: 0.48)
whisper_vad_timestamps_from_probs: VAD segment 2: start = 4.00, end = 4.35 (duration: 0.35)
whisper_vad_timestamps_from_probs: VAD segment 3: start = 5.38, end = 7.65 (duration: 2.27)
whisper_vad_timestamps_from_probs: VAD segment 4: start = 8.16, end = 10.59 (duration: 2.43)
whisper_full_with_state: detected 5 speech segments
whisper_full_with_state: Including segment 0: 0.29 - 2.31 (duration: 2.02)
whisper_full_with_state: Including segment 1: 3.30 - 3.87 (duration: 0.58)
whisper_full_with_state: Including segment 2: 4.00 - 4.45 (duration: 0.45)
whisper_full_with_state: Including segment 3: 5.38 - 7.75 (duration: 2.37)
whisper_full_with_state: Including segment 4: 8.16 - 10.59 (duration: 2.43)
whisper_full_with_state: total duration of speech segments: 7.84 seconds
whisper_full_with_state: Reduced audio from 176000 to 131778 samples (25.1% reduction)
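The chunk count and reduction percentage in the log above follow from simple arithmetic. A quick sanity check, assuming the Silero window of 512 samples at 16 kHz:

```python
import math

n_samples = 176_000        # jfk.wav, from the log above
n_window  = 512            # Silero VAD window size at 16 kHz
print(math.ceil(n_samples / n_window))   # 344, matching "n_chunks: 344"

kept = 131_778             # samples remaining after keeping only speech
reduction = (n_samples - kept) / n_samples
print(f"{reduction:.1%}")  # 25.1%, matching the reported reduction
```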

[00:00:00.000 --> 00:00:08.140]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   115.30 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    33.69 ms
whisper_print_timings:   sample time =   225.10 ms /   140 runs (     1.61 ms per run)
whisper_print_timings:   encode time =  9677.55 ms /     1 runs (  9677.55 ms per run)
whisper_print_timings:   decode time =    56.80 ms /     4 runs (    14.20 ms per run)
whisper_print_timings:   batchd time =  1573.50 ms /   132 runs (    11.92 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (     0.00 ms per run)
whisper_print_timings:    total time = 11899.06 ms

@danbev danbev force-pushed the vad branch 3 times, most recently from 5758650 to 9f0ed3d Compare April 25, 2025 05:27
@tannisroot commented:
Are there plans to add VAD support to the server, or is this a goal for after the PR is merged?

@danbev (Collaborator, Author) commented Apr 26, 2025

Are there plans to add VAD support to the server, or is this a goal for after the PR is merged?

I think it would be nice to get an initial version merged first as this PR is quite large as it is. I can then start looking at adding support to the server, and hopefully during that time people can start trying this out and see what works and does not work.

I'm adding the remaining options to whisper-cli now and after that this should be ready for review.

@danbev danbev force-pushed the vad branch 2 times, most recently from b59768b to 798695f Compare April 28, 2025 14:21
@danbev danbev marked this pull request as ready for review April 28, 2025 14:21
@ggerganov (Member) commented:

I am doing some initial testing using long audio and large-v3-turbo and it looks like the quality improves significantly when pre-processing the audio with a VAD.

I am wondering if we can somehow align the output timestamps with the original audio? Right now, I think that the audio that is cut out is not taken into account, so the final timestamps are not aligned with the input audio and it is a bit difficult to evaluate the results.

@danbev (Collaborator, Author) commented Apr 30, 2025

I am wondering if we can somehow align the output timestamps with the original audio? Right now, I think that the audio that is cut out is not taken into account, so the final timestamps are not aligned with the input audio and it is a bit difficult to evaluate the results.

Ah yes, currently only the samples that are detected to contain speech are passed to whisper_pcm_to_mel_with_state, so the reported timestamps in the output are relative to those samples. I'll take a closer look at how this can be handled.

With the latest commit, the output is now more in line with the original audio input.
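The idea of mapping timestamps back to the original audio can be sketched as follows: keep the original (start, end) of each detected speech segment, and shift any timestamp measured on the concatenated speech-only audio by the silence that was cut out before it. A minimal Python illustration (the function name and exact mapping are hypothetical, not the PR's implementation; the segment values are taken from the jfk.wav log above):

```python
def map_vad_to_orig(t_vad, segments):
    """Map a timestamp on the concatenated speech-only timeline back to
    the original audio timeline. `segments` are (start, end) speech
    segments in original-audio seconds, in order."""
    offset = 0.0                    # speech-only time consumed so far
    for start, end in segments:
        dur = end - start
        if t_vad <= offset + dur:   # falls inside this segment
            return start + (t_vad - offset)
        offset += dur
    # past the last segment: clamp to the end of the final segment
    return segments[-1][1]

segments = [(0.29, 2.31), (3.30, 3.87)]
# 0.5 s into the concatenated audio is 0.5 s into the first segment
print(map_vad_to_orig(0.5, segments))   # ~0.79
# 2.2 s is 0.18 s into the second segment (the first is 2.02 s long)
print(map_vad_to_orig(2.2, segments))   # ~3.48
```

As the commit message notes, the mapped values will not be identical to timestamps produced without VAD, since the padding and window boundaries shift things slightly, but they should land close to the original timeline.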

gb0 without VAD
./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/gb0.wav
...
[00:00:00.000 --> 00:00:03.240]   Good morning, this Tuesday is Election Day.   
[00:00:03.240 --> 00:00:06.000]   After months of spirited debate and vigorous campaigning,
[00:00:06.000 --> 00:00:08.640]   the time has come for Americans to make important decisions
[00:00:08.640 --> 00:00:10.140]   about our nation's future.                    
[00:00:10.140 --> 00:00:13.740]   I encourage all Americans to go to the polls and vote.
[00:00:13.740 --> 00:00:16.140]   Election season brings out the spirit of competition
[00:00:16.140 --> 00:00:18.080]   between our political parties.                    
[00:00:18.080 --> 00:00:20.280]   And that competition is an essential part     
[00:00:20.280 --> 00:00:21.780]   of a healthy democracy.                       
[00:00:21.780 --> 00:00:23.520]   But as the campaigns come to a close,         
[00:00:23.520 --> 00:00:25.980]   Republicans, Democrats, and independents      
[00:00:25.980 --> 00:00:29.120]   can find common ground on at least one point. 
[00:00:29.120 --> 00:00:31.560]   Our system of representative democracy        
[00:00:31.560 --> 00:00:34.440]   is one of America's greatest strengths.       
[00:00:34.440 --> 00:00:36.240]   The United States was founded on the belief   
[00:00:36.240 --> 00:00:38.240]   that all men are created equal.               
[00:00:38.240 --> 00:00:41.440]   Every election day, millions of Americans of all races,
[00:00:41.440 --> 00:00:43.440]   religions, and backgrounds step into voting       
[00:00:43.440 --> 00:00:45.280]   booths throughout the nation.                     
[00:00:45.280 --> 00:00:47.780]   Whether they are rich or poor, old or young,  
[00:00:47.780 --> 00:00:50.680]   each of them has an equal share in choosing the path
[00:00:50.680 --> 00:00:52.440]   that our country will take.                   
[00:00:52.440 --> 00:00:54.920]   And every ballot they cast is a reminder      
[00:00:54.920 --> 00:00:58.280]   that our founding principles are alive and well.
[00:00:58.280 --> 00:00:59.760]   Voting is one of the great privileges         
[00:00:59.760 --> 00:01:01.760]   of American citizenship.                      
[00:01:01.760 --> 00:01:04.520]   And it has always required brave defenders.   
[00:01:04.520 --> 00:01:06.000]   As you head to the polls next week,           
[00:01:06.000 --> 00:01:08.400]   remember the sacrifices that have been made   
[00:01:08.400 --> 00:01:11.040]   by generations of Americans in uniform        
[00:01:11.040 --> 00:01:13.000]   to preserve our way of life.                  
[00:01:13.000 --> 00:01:14.840]   From Bunker Hill to Baghdad,                  
[00:01:14.840 --> 00:01:16.740]   the men and women of American armed forces    
[00:01:16.740 --> 00:01:19.940]   have been devoted guardians of our democracy. 
[00:01:19.940 --> 00:01:21.840]   All of us owe them and their families         
[00:01:21.840 --> 00:01:25.240]   a special debt of gratitude on Election Day.  
[00:01:25.240 --> 00:01:27.520]   Americans should also remember the important example
[00:01:27.520 --> 00:01:30.080]   that our election set throughout the world.   
[00:01:30.080 --> 00:01:32.080]   Young democracies from Georgia and Ukraine    
[00:01:32.080 --> 00:01:34.560]   to Afghanistan and Iraq can look to the United States
[00:01:34.560 --> 00:01:37.520]   for proof that self-government can endure.    
[00:01:37.520 --> 00:01:40.400]   And nations that still live under tyranny and oppression
[00:01:40.400 --> 00:01:44.080]   can find hope and inspiration in our commitment to liberty.
[00:01:44.080 --> 00:01:45.200]   For more than two centuries,                  
[00:01:45.200 --> 00:01:47.120]   Americans have demonstrated the ability       
[00:01:47.120 --> 00:01:49.600]   of free people to choose their own leaders.   
[00:01:49.600 --> 00:01:51.880]   Our nation has flourished because of its commitment
[00:01:51.880 --> 00:01:54.640]   to trusting the wisdom of our citizenry.      
[00:01:54.640 --> 00:01:57.200]   In this year's election, we will see this tradition
[00:01:57.200 --> 00:02:00.280]   continue, and we will be reminded once again  
[00:02:00.280 --> 00:02:02.640]   that we are blessed to live in a free nation  
[00:02:02.640 --> 00:02:05.520]   guided by the will of the people.             
[00:02:05.520 --> 00:02:06.720]   Thank you for listening.
gb0 with VAD
./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/gb0.wav --vad --vad-threshold 0.5 --vad-model models/for-tests-silero-v5.1.2-ggml.bin
...
[00:00:00.000 --> 00:00:03.280]   Good morning, this Tuesday is Election Day.
[00:00:03.280 --> 00:00:06.000]   After months of spirited debate and vigorous campaigning,
[00:00:06.000 --> 00:00:08.600]   the time has come for Americans to make important decisions
[00:00:08.600 --> 00:00:10.200]   about our nation's future.
[00:00:10.200 --> 00:00:13.790]   Encourage all Americans to go to the polls and vote.
[00:00:13.790 --> 00:00:16.120]   Election season brings out the spirit of competition
[00:00:16.120 --> 00:00:18.060]   between our political parties.
[00:00:18.060 --> 00:00:20.230]   And that competition is an essential part
[00:00:20.230 --> 00:00:21.820]   of a healthy democracy.
[00:00:21.820 --> 00:00:23.550]   But as the campaigns come to a close,
[00:00:23.550 --> 00:00:25.960]   Republicans, Democrats, and independents
[00:00:25.960 --> 00:00:29.180]   can find common ground on at least one point.
[00:00:29.180 --> 00:00:31.530]   Our system of representative democracy
[00:00:31.530 --> 00:00:34.470]   is one of America's greatest strengths.
[00:00:34.470 --> 00:00:36.250]   The United States was founded on the belief
[00:00:36.250 --> 00:00:38.310]   that all men are created equal.
[00:00:38.310 --> 00:00:40.740]   Every election day, millions of Americans
[00:00:40.740 --> 00:00:42.630]   of all races, religions, and backgrounds
[00:00:42.630 --> 00:00:45.340]   step into voting booths throughout the nation.
[00:00:45.340 --> 00:00:48.530]   Whether they are rich or poor, old or young, each of them
[00:00:48.530 --> 00:00:50.660]   has an equal share in choosing the path
[00:00:50.660 --> 00:00:52.480]   that our country will take.
[00:00:52.480 --> 00:00:54.910]   And every ballot they cast is a reminder
[00:00:54.910 --> 00:00:58.330]   that our founding principles are alive and well.
[00:00:58.330 --> 00:00:59.760]   Voting is one of the great privileges
[00:00:59.760 --> 00:01:01.810]   of American citizenship.
[00:01:01.810 --> 00:01:04.550]   And it is always required brave defenders.
[00:01:04.550 --> 00:01:06.050]   As you head to the polls next week,
[00:01:06.050 --> 00:01:08.380]   remember the sacrifices that have been made
[00:01:08.380 --> 00:01:11.580]   by generations of Americans in uniform to preserve
[00:01:11.580 --> 00:01:13.010]   our way of life.
[00:01:13.010 --> 00:01:15.450]   From Bunker Hill to Baghdad, the men and women
[00:01:15.450 --> 00:01:17.030]   of American armed forces have been
[00:01:17.030 --> 00:01:19.990]   devoted guardians of our democracy.
[00:01:19.990 --> 00:01:21.790]   All of us owe them and their families
[00:01:21.790 --> 00:01:25.260]   a special debt of gratitude on election day.
[00:01:25.260 --> 00:01:27.520]   Americans should also remember the important example
[00:01:27.520 --> 00:01:30.090]   that our elections set throughout the world.
[00:01:30.090 --> 00:01:32.070]   Young democracies from Georgia and Ukraine
[00:01:32.070 --> 00:01:34.520]   to Afghanistan and Iraq can look to the United States
[00:01:34.520 --> 00:01:37.450]   for proof that self-government can endure.
[00:01:37.450 --> 00:01:40.400]   And nations that still live under tyranny and oppression
[00:01:40.400 --> 00:01:44.080]   can find hope and inspiration in our commitment to liberty.
[00:01:44.080 --> 00:01:45.690]   For more than two centuries, Americans
[00:01:45.690 --> 00:01:47.730]   have demonstrated the ability of free people
[00:01:47.730 --> 00:01:49.600]   to choose their own leaders.
[00:01:49.600 --> 00:01:51.830]   Our nation has flourished because of its commitment
[00:01:51.830 --> 00:01:54.630]   to trusting the wisdom of our citizenry.
[00:01:54.630 --> 00:01:58.460]   In this year's election, we will see this tradition continue.
[00:01:58.460 --> 00:02:00.220]   And we will be reminded once again
[00:02:00.220 --> 00:02:02.590]   that we are blessed to live in a free nation
[00:02:02.590 --> 00:02:05.490]   guided by the will of the people.
[00:02:05.490 --> 00:02:06.650]   Thank you for listening.

danbev added 4 commits May 2, 2025 15:47
This commit adds support for Voice Activity Detection (VAD). When enabled,
this feature processes the audio input and detects speech segments.
This information is then used to reduce the number of samples that need
to be processed by whisper_full.

This initial support is based on the Silero VAD model which needs to
be converted to GGML format:
```console
(venv) $ pip install silero-vad
(venv) $ python models/convert-silero-vad-to-ggml.py --output models/silero.bin
Saving GGML Silero-VAD model to models/silero-v5.1.2-ggml.bin
```

There is a test that tests the VAD support in isolation:
```console
$ cmake --build build --target test-vad && \
    ctest -R ^test-vad$ --test-dir build -C Debug --output-on-failure -VV
```

And one that tests VAD in combination with whisper_full:
```console
$ cmake --build build --target test-vad-full && \
    ctest -R test-vad-full --test-dir build -C Debug --output-on-failure -VV
```

Resolves: ggml-org#3003
Example of the format:
```console

$ ./build/bin/whisper-cli --help

usage: ./build/bin/whisper-cli [options] file0 file1 ...
supported audio formats: flac, mp3, ogg, wav

options:
  -h,        --help              [default] show this help message and exit
  ...

Voice Activity Detection (VAD) options:
  -v,        --vad                           [false  ] enable Voice Activity Detection (VAD)
  -vm FNAME, --vad-model FNAME               [       ] VAD model path
  -vt N,     --vad-threshold N               [0.50   ] VAD threshold for speech recognition
  -vs N,     --vad_window_size_samples     N [512    ] VAD window size
  -vspd N,   --vad_min_speech_duration_ms  N [250    ] VAD min speech duration
  -vsd N,    --vad_min_silence_duration_ms N [100    ] VAD min silence duration
  -vmsd N,   --vad_max_speech_duration_s   N [FLT_MAX] VAD max speech duration
  -vp N,     --vad_speech_pad_ms           N [30     ] VAD speech padding
  -vo N,     --vad_samples_overlap         N [0.10   ] VAD samples overlap size
```
The main reason for the separate VAD options section is that the VAD
options are longer and made the rest look a little ugly.
This commit adds a job to the CI pipeline to test the VAD model.
This will only test the VAD model in isolation; that is, it does not
test whisper_full.
This commit adds a mapping of the original audio timestamps to the
timestamps of the segments in the VAD (Voice Activity Detection)
process.

The motivation for this change is that when we process the original audio
signal and only pass the speech segments to whisper_full, the
timestamps that whisper returns when calling functions like
whisper_full_get_segment_t0 are the timestamps for the "VAD"
segments and not the original audio.

The values are not identical to the timestamps produced without VAD
enabled, but they are close, and hopefully close enough.
Development

Successfully merging this pull request may close these issues.

whisper : add Silero VAD built-in support
3 participants