VAD vs. event-triggered recording for AI speech-to-speech applications

There are two common approaches to capturing user speech in AI speech-to-speech applications: Voice Activity Detection (VAD) and event-triggered recording. In this article, we'll explore both methods in detail - their strengths, their weaknesses, and how to choose between them based on your use case and performance requirements.
Voice Activity Detection (VAD)
Voice Activity Detection (VAD) is a technique used to identify the presence or absence of human speech in an audio signal. It continuously monitors incoming audio and classifies segments as either speech or non-speech (e.g., silence, noise). When speech is detected, VAD can trigger downstream actions, such as starting or stopping recording, transcribing, or streaming to a speech-to-text (STT) engine.
Voice Activity Detection starts by cleaning and simplifying the audio signal - this reduces background noise and focuses on the frequency range where human speech usually occurs. Then the system looks at characteristics of the sound, such as its energy (how loud it is), its zero-crossing rate (how often the waveform crosses the zero line), or more advanced spectral features that capture the shape of speech. These features are used to decide whether the sound is actually someone talking or just noise or silence. Finally, to avoid mistakes like reacting to random sounds, the system applies smoothing rules - for example, ignoring very short bursts or waiting briefly before deciding speech has ended.
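To make this concrete, here is a minimal, illustrative sketch of an energy-based detector in Python. It is not production code: the frame size, thresholds, and hangover values are arbitrary placeholders you would tune for your own audio.

```python
import numpy as np

# Minimal energy-based VAD sketch. The frame size, thresholds, and hangover
# values are illustrative placeholders, not tuned recommendations.
FRAME_MS = 30              # analyze audio in 30 ms chunks
ENERGY_THRESHOLD = 0.01    # RMS energy above this counts as "loud enough"
ZCR_THRESHOLD = 0.25       # very high zero-crossing rate suggests noise, not voiced speech
HANGOVER_FRAMES = 20       # keep "speech" active ~600 ms after the last speech-like frame

def frame_is_speech(frame: np.ndarray) -> bool:
    """Classify one frame of float32 samples in [-1, 1] as speech or not."""
    rms = np.sqrt(np.mean(frame ** 2))                    # loudness
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0  # rough zero-crossing rate
    return rms > ENERGY_THRESHOLD and zcr < ZCR_THRESHOLD

def vad_segments(samples: np.ndarray, sample_rate: int = 16000):
    """Yield (start_sec, end_sec) speech segments with simple hangover smoothing."""
    frame_len = int(sample_rate * FRAME_MS / 1000)
    in_speech, start, hang = False, 0, 0
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        if frame_is_speech(samples[i:i + frame_len]):
            if not in_speech:
                in_speech, start = True, i
            hang = HANGOVER_FRAMES
        elif in_speech:
            hang -= 1
            if hang <= 0:  # silence lasted long enough: close the segment
                in_speech = False
                yield start / sample_rate, i / sample_rate
    if in_speech:
        yield start / sample_rate, len(samples) / sample_rate
```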
Pros:
- Natural and seamless user experience
- Enables hands-free, continuous interaction
- Ideal for real-time, open-mic use cases (e.g., avatars, meetings)
Cons:
- Adds delay, since the system waits out a silence threshold before stopping (1-2 seconds)
- Requires noise suppression and threshold tuning
- Higher infrastructure complexity (requires on-the-fly input processing and analysis)
Main VAD parameters
To make Voice Activity Detection (VAD) work effectively, especially in real-time applications like voice assistants or avatars, it’s important to configure a few key parameters. These settings control how quickly and accurately the system detects when someone starts or stops speaking, and how well it handles background noise or silence. The right balance between responsiveness and stability depends on the specific use case and environment (e.g., quiet room vs noisy street).
Below are the main parameters commonly used to tune VAD systems; a short code sketch showing how they fit together follows the list:
Frame size / Window length
This is the size of each chunk of audio the system analyzes at a time, usually in milliseconds (e.g., 10ms, 20ms, 30ms).
- Smaller frames allow faster reaction but use more processing power.
- Larger frames are slower but more stable and efficient.
Silence threshold
This defines how quiet the audio must be to count as silence. The system uses this to ignore background noise.
- A lower threshold picks up more sound (including noise), while a higher one may miss quiet speech.
Speech start trigger (hang-in time)
This sets how long someone needs to speak before the system starts responding.
- Prevents false starts from short or accidental noises.
- Typically a short duration, such as 200-500 ms.
Speech end trigger (hang-out time)
This controls how long the system waits after speech stops before marking it as the end.
- Prevents cutting off speakers mid-sentence during natural pauses.
- Commonly 500-1500 ms, depending on the use case.
Aggressiveness level
This adjusts how strictly the system filters out background noise.
- The lowest level (least aggressive) allows more background sounds through.
- The highest level (most aggressive) is better in noisy environments, but may clip quiet voices.
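For a classic, non-ML baseline, these knobs map fairly directly onto code. Below is a small sketch using the open-source webrtcvad package (assuming 16 kHz, 16-bit mono PCM input); the hang-in and hang-out counters on top of it are our own illustrative smoothing layer, and the frame counts are example values, not recommendations.

```python
import webrtcvad  # pip install webrtcvad

# webrtcvad handles the per-frame speech/no-speech decision; the hang-in and
# hang-out counters below are a hand-rolled smoothing layer with example values.
SAMPLE_RATE = 16000          # webrtcvad accepts 8/16/32/48 kHz, 16-bit mono PCM
FRAME_MS = 30                # frame size must be 10, 20, or 30 ms
FRAME_BYTES = int(SAMPLE_RATE * FRAME_MS / 1000) * 2  # 2 bytes per 16-bit sample
HANG_IN_FRAMES = 10          # ~300 ms of speech required before "speech started"
HANG_OUT_FRAMES = 30         # ~900 ms of silence required before "speech stopped"

vad = webrtcvad.Vad(2)       # aggressiveness level: 0 (least) to 3 (most)

def detect_turns(pcm_frames):
    """pcm_frames: iterable of raw 16-bit PCM chunks, FRAME_BYTES bytes each."""
    speaking, speech_run, silence_run = False, 0, 0
    for frame in pcm_frames:
        if vad.is_speech(frame, SAMPLE_RATE):
            speech_run, silence_run = speech_run + 1, 0
            if not speaking and speech_run >= HANG_IN_FRAMES:   # hang-in time
                speaking = True
                yield "speech_started"
        else:
            silence_run, speech_run = silence_run + 1, 0
            if speaking and silence_run >= HANG_OUT_FRAMES:     # hang-out time
                speaking = False
                yield "speech_stopped"
```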
OpenAI voice activity detection overview
OpenAI’s Realtime API supports voice activity detection (VAD) to automatically detect when a user starts or stops speaking. VAD is enabled by default in speech-to-speech and transcription sessions, but can also be turned off or customized depending on your needs.
When enabled, the API emits specific events:
- input_audio_buffer.speech_started when the user begins speaking
- input_audio_buffer.speech_stopped when the system detects the end of speech
These events help structure the conversation into clear "turns" of speech, so your app knows when to process a transcript or start a response.
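As a rough sketch of what that looks like in an application, the snippet below dispatches on these event types as they arrive over the Realtime API WebSocket. The handler functions are hypothetical placeholders for your own logic.

```python
import json

# Hypothetical handlers - replace with your app's actual logic.
def on_user_started_speaking():
    print("pause TTS playback / show a 'listening' indicator")

def on_user_stopped_speaking():
    print("finalize the user's turn and start generating a response")

def handle_realtime_event(raw_message: str) -> None:
    """Dispatch one JSON event received from the Realtime API WebSocket."""
    event = json.loads(raw_message)
    if event.get("type") == "input_audio_buffer.speech_started":
        on_user_started_speaking()
    elif event.get("type") == "input_audio_buffer.speech_stopped":
        on_user_stopped_speaking()
```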
There are two modes of VAD available, depending on how you want speech turns to be detected:
1. Server VAD (default)
This mode uses traditional audio-based detection. It listens for periods of silence in the audio to decide when the speaker has paused or stopped.
You can customize how sensitive this detection is using these parameters:
- threshold - Controls how loud the speech must be to trigger detection. Higher values are better for noisy environments.
- prefix_padding_ms - Adds a short buffer of audio before the speech starts, so you don’t miss the first few milliseconds.
- silence_duration_ms - Defines how long the system waits in silence before declaring the user has finished speaking.
This mode is predictable and works well in general cases, but it may interrupt or cut off users who pause briefly while thinking.
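As a sketch of what tuning these parameters looks like, the Realtime API accepts a session.update event with a turn_detection block. The numbers below are illustrative starting points, not recommendations.

```python
import json

# Illustrative server VAD configuration for an OpenAI Realtime session.
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.6,            # higher = speech must be louder to trigger
            "prefix_padding_ms": 300,    # audio kept from just before speech was detected
            "silence_duration_ms": 800,  # silence required before the turn is considered over
        }
    },
}

# Send it over an already-open Realtime WebSocket connection, e.g.:
# await ws.send(json.dumps(session_update))
```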
2. Semantic VAD
Instead of relying on silence, semantic VAD uses a language-aware model to understand when a speaker has actually finished their thought based on the content of their words. It can distinguish between a long pause that’s part of the sentence and an actual end of a statement.
This approach is more natural for conversation-like interactions and is less likely to interrupt someone speaking slowly or pausing mid-sentence.
Key config for semantic VAD:
- eagerness - Controls how quickly the system decides the speaker is done.
- low lets the user take their time
- high responds as soon as possible
- auto (default) balances both
Semantic VAD is ideal when smooth, human-like turn-taking is important, especially in speech-to-speech use cases or when real-time transcriptions need to capture full thoughts without being broken mid-phrase.
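Switching to semantic VAD is a matter of changing the turn_detection block; a hedged sketch (again with placeholder values) looks like this:

```python
# Illustrative semantic VAD configuration: the silence-based timing parameters
# are replaced by a single eagerness setting ("low", "auto", or "high").
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "semantic_vad",
            "eagerness": "low",  # let slow speakers finish their thought
        }
    },
}
```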
OpenAI's VAD lets you choose between silence-based (server VAD) and meaning-based (semantic VAD) speech detection. You can adjust things like thresholds, silence timeouts, or eagerness to match your app's responsiveness and user experience. Semantic VAD is more intelligent and conversational, while server VAD is simpler and more mechanical. Both support customization and work well in real-time pipelines.
Event-triggered recording
In event-triggered recording, the system begins capturing audio only when something specific happens, like a user pressing a button, saying a wake word, or receiving a signal from the frontend. Instead of streaming audio in real time, the entire audio clip is recorded on the client side and only sent for transcription once the user finishes speaking and the recording stops. This method is known as micro-batching and is often simpler and more efficient than real-time streaming, especially in browser-based or mobile environments.
Several popular speech-to-text tools support this kind of usage. Services like Google Cloud Speech-to-Text, Amazon Transcribe, Microsoft Azure Speech, Deepgram, and AssemblyAI all allow sending complete audio files after recording. You can record the audio in formats like WAV or FLAC, and then send it to these APIs in one request to get the transcription back.
There are also open-source options like Whisper and Vosk that can run locally without needing to stream anything. These tools are useful when you want full control over the data or need offline processing. Since the audio is sent only after the recording ends, this setup is ideal for command-based interactions, short speech inputs, or voice notes. It’s also easier to manage since you don’t need to handle real-time processing or partial results.
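As an illustration of the micro-batching flow, the sketch below transcribes a clip that was recorded after an explicit trigger (a button press or wake word) and saved to disk, using the open-source Whisper package. The file name and model size are placeholders.

```python
import whisper  # pip install openai-whisper

# Transcribe a complete clip after recording has stopped - no streaming involved.
# "recording.wav" is a placeholder path; "base" is one of the smaller Whisper models.
model = whisper.load_model("base")
result = model.transcribe("recording.wav")
print(result["text"])
```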
Pros:
- More robust in noisy environments
- Predictable start and stop behavior
- No need to process audio continuously in real time
Cons:
- Less natural - requires explicit user action
- Not suitable for passive or always-on scenarios
- Can add latency depending on the UI interaction
You can also use event-triggered recording entirely on the client side with tools like React Speech Recognition, which rely on the browser's built-in speech-to-text capabilities (typically powered by the Web Speech API in Chrome). This setup allows you to start recording after a user action, such as pressing a button, and handle the entire transcription flow in the browser, without routing audio through your own backend (note that Chrome's Web Speech API implementation may still send audio to the browser vendor's speech service).
This approach works quite well for general English speech, especially in common conversation or command-based scenarios. However, it often struggles with personal names, unusual terms, and domain-specific vocabulary, since the browser’s speech model is not customizable or trainable.
To improve accuracy and user experience, it's common to apply AI-powered post-processing after the initial transcription. A language model can automatically fix misrecognized words, replace incorrect phrases, or enrich the text with missing context. This makes the approach suitable for applications where the spoken content is relatively predictable, or where some level of intelligent correction after transcription is acceptable.
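One possible shape of such a post-processing step, sketched with the OpenAI Python SDK (the model name and prompt wording are placeholders to adapt to your domain):

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def clean_transcript(raw_transcript: str, domain_terms: list[str]) -> str:
    """Ask a language model to fix likely misrecognitions in a raw transcript."""
    prompt = (
        "Fix speech-recognition errors in the transcript below. "
        "Prefer these domain terms when a word sounds similar: "
        + ", ".join(domain_terms)
        + "\n\nTranscript:\n"
        + raw_transcript
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```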
When to use VAD vs event-triggered recording in AI speech systems
Use VAD when you need natural, real-time interaction like avatars or live translation, where the system should react to speech automatically. Choose event-triggered when you want more control, especially in structured interfaces like mobile apps or web tools with a button or wake word. In general, prefer VAD for hands-free, fluid UX, and event-triggered for explicit, user-driven workflows.
Conclusion
In conclusion, choosing the right speech recording method - VAD or event-triggered - depends on the specific needs of your AI speech application. As a company offering AI consultation services, we can assist in selecting, implementing, and optimizing the best solution for your unique use case to ensure seamless and efficient user experiences.