VAD vs event-triggered for AI speech-to-speech applications

16 Jul 2025

This article explains both approaches, continuous voice activity detection (VAD) and event-triggered recording, covering where each shines and how to choose the right one for your application. We go beyond definitions to cover implementation details, evaluation methodology, business trade-offs, and references to official documentation. The most relevant decision criteria for B2B teams typically include latency, accuracy and robustness in noise, control over user experience, operating cost, privacy and compliance constraints, and ease of scaling.

Voice Activity Detection (VAD)

Voice activity detection is a technique that classifies short audio frames as speech or non-speech (silence, background noise). A VAD gate typically runs continuously and triggers downstream actions when speech starts and ends, such as starting or stopping capture, invoking ASR (automatic speech recognition), or streaming to TTS for full speech-to-speech loops.

How it works at a glance:

  • Preprocessing. Normalize and denoise the signal to stabilize levels and suppress stationary noise.
  • Feature extraction. Compute features like energy, spectral characteristics, or embeddings learned by a small model.
  • Decision logic. Classify frames and apply hysteresis to smooth decisions so you do not trigger on clicks or cut off at every short pause.

For a concise primer, see the classic VAD overview in the ITU-T G.729 Annex B standard and modern open-source implementations such as WebRTC Audio Processing, RNNoise, and Silero VAD.
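
To make the frame-classify-and-smooth loop concrete, here is a minimal sketch of a VAD gate built on the open-source WebRTC VAD via the py-webrtcvad package. It is illustrative rather than production code: the frame size, aggressiveness level, and hangover length are assumptions you would tune per environment.

import webrtcvad  # pip install webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30                                        # WebRTC VAD accepts 10, 20, or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2     # 16-bit mono PCM, 2 bytes per sample

vad = webrtcvad.Vad(2)                               # aggressiveness 0 (permissive) to 3 (strict)

def speech_segments(pcm: bytes, hangover_frames: int = 20):
    """Yield (start_ms, end_ms) speech segments, smoothing decisions with a silence hangover."""
    in_speech, start, silent = False, 0, 0
    n_frames = len(pcm) // FRAME_BYTES
    for i in range(n_frames):
        frame = pcm[i * FRAME_BYTES:(i + 1) * FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            if not in_speech:
                in_speech, start = True, i * FRAME_MS
            silent = 0
        elif in_speech:
            silent += 1
            if silent >= hangover_frames:            # ~600 ms of silence closes the segment
                yield start, i * FRAME_MS
                in_speech = False
    if in_speech:                                    # audio ended mid-speech
        yield start, n_frames * FRAME_MS

Each yielded segment can then be cut from the audio buffer and handed to ASR, rather than streaming everything downstream.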

Pros:

  • Natural, hands-free interaction with fluid turn-taking
  • Supports always-on, open-mic use cases including assistants, meetings, and translation
  • Enables incremental streaming for lower perceived latency

Cons:

  • End-of-turn delay due to hang-out time adds 300 to 1500 ms in typical settings
  • Requires noise suppression and careful threshold tuning, especially in challenging acoustics
  • Higher runtime and infrastructure complexity for continuous, on-the-fly processing

Main VAD parameters

To make Voice Activity Detection (VAD) work effectively, especially in real-time applications like voice assistants or avatars, it’s important to configure a few key parameters. These settings control how quickly and accurately the system detects when someone starts or stops speaking, and how well it handles background noise or silence. The right balance between responsiveness and stability depends on the specific use case and environment (e.g., quiet room vs noisy street).

Below are the main parameters commonly used to tune VAD systems; a short code sketch at the end of this section shows how they fit together:

Frame size / Window length

The size of each audio chunk the system analyzes at a time — typically 10, 20, or 30 ms.

  • Smaller frames react faster but increase CPU load and can be jittery
  • Larger frames are more stable and efficient but slower to react

Silence threshold

The amplitude or SNR level below which audio is considered non-speech. This is how the system ignores background noise.

  • Lower thresholds pick up more sound, including quiet speech, but admit more noise
  • Higher thresholds suppress noise but may miss soft talkers

Speech start trigger (hang-in time)

The minimum continuous speech duration before triggering a start. Prevents false starts from short or accidental noises.

  • Typical range: 200 to 500 ms
  • Shorter hang-in improves responsiveness but increases false starts from transient sounds

Speech end trigger (hang-out time)

How long the system waits after speech stops before declaring the end of a turn. Prevents cutting off speakers mid-sentence during natural pauses.

  • Typical range: 500 to 1,500 ms depending on use case
  • Shorter hang-out reduces end-of-turn latency but risks clipping pauses mid-thought

Aggressiveness level

Combined tuning of thresholds and filters that biases the detector toward rejecting or accepting marginal speech.

  • Lower aggressiveness allows more background sounds through — useful in quiet environments
  • Higher aggressiveness is better in noisy environments but may clip quiet or far-field voices
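
Taken together, these parameters form a small state machine around whatever per-frame classifier you use. The following sketch is illustrative Python: frame_is_speech stands in for the output of an energy threshold, WebRTC VAD, or Silero VAD, and the hang-in and hang-out values are picked from the typical ranges above.

# Illustrative turn-taking state machine driven by per-frame speech/non-speech decisions.
FRAME_MS = 20          # frame size / window length
HANG_IN_MS = 300       # minimum continuous speech before a turn starts
HANG_OUT_MS = 800      # silence required before a turn ends

class TurnDetector:
    def __init__(self):
        self.speaking = False
        self.speech_ms = 0      # consecutive speech observed while idle
        self.silence_ms = 0     # consecutive silence observed while speaking

    def update(self, frame_is_speech: bool):
        """Feed one frame decision; returns "start", "end", or None."""
        if not self.speaking:
            self.speech_ms = self.speech_ms + FRAME_MS if frame_is_speech else 0
            if self.speech_ms >= HANG_IN_MS:         # enough continuous speech: open the turn
                self.speaking, self.silence_ms = True, 0
                return "start"
        else:
            self.silence_ms = 0 if frame_is_speech else self.silence_ms + FRAME_MS
            if self.silence_ms >= HANG_OUT_MS:       # enough silence: close the turn
                self.speaking, self.speech_ms = False, 0
                return "end"
        return None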

OpenAI voice activity detection: server VAD and semantic turn detection

OpenAI’s Realtime API provides built-in turn detection for speech sessions and lets developers adjust sensitivity. VAD is enabled by default in speech-to-speech and transcription sessions, but can be turned off or customized per session. When enabled, the API emits specific events:

  • input_audio_buffer.speech_started — when the user begins speaking
  • input_audio_buffer.speech_stopped — when the system detects the end of speech

These events structure the conversation into clear turns of speech, so your app knows when to process a transcript or start a response. There are two available modes:

1. Server VAD (default)

Server VAD uses traditional audio-based silence detection to decide when the speaker has paused or stopped. It is predictable and works well in general cases, but may interrupt users who pause briefly while thinking.

Configurable parameters:

  • threshold — how loud speech must be to trigger detection; higher values suit noisy environments
  • prefix_padding_ms — a short buffer of audio before speech starts, so the first few milliseconds aren’t missed
  • silence_duration_ms — how long the system waits in silence before declaring the user has finished speaking

Example configuration:

{
  "turn_detection": {
    "type": "server_vad",
    "threshold": 0.5,
    "prefix_padding_ms": 200,
    "silence_duration_ms": 700
  }
}
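
To wire this configuration into a live session, the sketch below opens a Realtime WebSocket connection, sends a session.update event with the turn_detection settings above, and reacts to the speech start and stop events. It assumes the Python websockets package; the endpoint, model name, headers, and event names follow the public documentation but should be verified against the current Realtime API reference.

import asyncio, json, os
import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"  # example model name; check the docs

async def main():
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Newer websockets releases use additional_headers; older ones call it extra_headers.
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # Configure server VAD for this session.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "prefix_padding_ms": 200,
                    "silence_duration_ms": 700,
                },
            },
        }))
        # React to the turn events the server emits.
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "input_audio_buffer.speech_started":
                print("user started speaking")
            elif event["type"] == "input_audio_buffer.speech_stopped":
                print("user stopped speaking, start processing the turn")

asyncio.run(main())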

2. Semantic VAD

Instead of relying on silence, semantic VAD uses a language-aware model to understand when a speaker has actually finished their thought, based on the content of their words. It distinguishes between a long pause that’s part of a sentence and an actual end of a statement — making it less likely to interrupt someone speaking slowly or pausing mid-thought.

This mode is ideal when smooth, human-like turn-taking is important, especially in speech-to-speech applications or when real-time transcriptions need to capture full thoughts without being broken mid-phrase.

Key configuration — eagerness controls how quickly the system decides the speaker is done:

  • low — lets the user take their time
  • high — responds as soon as possible
  • auto — (default) balances both

Example configuration:

{
  "turn_detection": {
    "type": "semantic_vad",
    "eagerness": "auto"
  }
}

Note: Parameter names and defaults evolve. Always confirm the latest options in the official OpenAI Realtime API docs at platform.openai.com/docs/guides/realtime.

When to use which:

  • Server VAD: predictable, silence-based turns and straightforward tuning by environment. Best for general cases where users speak in clear, complete sentences.
  • Semantic VAD: natural conversation where uninterrupted thoughts matter most — coaching, tutoring, live translation, agent handovers, and speech-to-speech use cases.

Event-triggered recording

In event-triggered recording, the system begins capturing audio only when something specific happens — a button press, a wake word, or a UI event. Instead of streaming audio in real time, the entire audio clip is recorded on the client side and only sent for transcription after the user finishes speaking. This micro-batching pattern is common in browser-based or mobile environments and is often simpler and more efficient than full-duplex, real-time streaming.

Supported tools and services:

  • Cloud ASR via file upload or chunked requests: Google Cloud Speech-to-Text, Amazon Transcribe, Microsoft Azure Speech, Deepgram, AssemblyAI — all support sending complete audio files after recording (WAV, FLAC, etc.) in a single request.
  • On-device / edge ASR: OpenAI Whisper and Vosk for offline or private processing, useful when you need full control over data or offline capability; a record-then-transcribe sketch follows this list.
  • Browser-based capture: The Web Speech API (via React Speech Recognition or similar) enables event-driven recognition directly in the browser with no ASR infrastructure of your own, though most browser implementations delegate recognition to the browser vendor's cloud service. Works well for general English and common conversational speech.
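
A minimal record-then-transcribe sketch for the on-device option: it assumes the sounddevice, soundfile, and openai-whisper packages, and the fixed five-second clip and "base" model size are illustrative choices, not recommendations.

# Event-triggered capture: record only after an explicit trigger, then transcribe the whole clip locally.
import sounddevice as sd     # pip install sounddevice soundfile openai-whisper
import soundfile as sf
import whisper

SAMPLE_RATE = 16000

def record_clip(seconds: float, path: str = "clip.wav") -> str:
    """Capture a clip after the user triggers recording (button press, wake word, UI event)."""
    audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1, dtype="int16")
    sd.wait()                                 # block until the recording finishes
    sf.write(path, audio, SAMPLE_RATE)
    return path

model = whisper.load_model("base")            # small local model, runs fully on-device

input("Press Enter to talk...")               # the explicit trigger
clip = record_clip(seconds=5)
print(model.transcribe(clip)["text"])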

Pros:

  • More robust in noisy environments — explicit start/stop boundaries eliminate open-mic ambiguity
  • Lower continuous compute cost; no need to analyze audio that isn’t speech
  • Simpler integration for short commands, voice notes, and form inputs
  • Easier to manage since no real-time partial results are needed

Cons:

  • Less natural — requires explicit user action (push-to-talk, click, or wake word)
  • Not ideal for passive listening, barge-in, or overlapping speech
  • Can add latency depending on the UI interaction plus upload time

Many teams pair event-triggered capture with AI post-processing to clean transcripts. The browser’s built-in speech model is not customizable, so it often struggles with personal names, unusual terms, and domain-specific vocabulary. A small prompt-engineered LLM pass can automatically fix misrecognized words, replace incorrect phrases, or enrich the text with missing context — making it suitable for applications where spoken content is relatively predictable.
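
Such a pass can be a single chat completion call. The sketch below assumes the official openai Python SDK; the model name, glossary, and prompt wording are illustrative placeholders to adapt to your domain.

# Post-processing a raw browser transcript with an LLM to fix names and domain terms.
from openai import OpenAI

client = OpenAI()

GLOSSARY = "Globaldev, WebRTC, Silero VAD, hang-out time"   # terms a generic recognizer often misses

def clean_transcript(raw: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",                  # any small, fast model works for this pass
        messages=[
            {"role": "system",
             "content": "Fix speech-recognition errors in the user's transcript. "
                        f"Prefer these domain terms when they fit: {GLOSSARY}. "
                        "Do not add or remove information."},
            {"role": "user", "content": raw},
        ],
    )
    return response.choices[0].message.content

print(clean_transcript("please schedule a call with global dev about the selero vad demo"))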

VAD vs event-triggered: decision framework

Use this comparison to map your constraints to a sensible default, then validate with measurement.

  • Latency: VAD supports incremental streaming and lower perceived latency, but hang-out time adds end-of-turn delay; event-triggered capture adds UI interaction plus upload time.
  • Noise robustness: explicit start/stop boundaries make event-triggered capture more robust; VAD needs noise suppression and careful threshold tuning.
  • User experience: VAD enables natural, hands-free turn-taking; event-triggered requires an explicit action but is clear and predictable.
  • Operating cost: continuous VAD streaming consumes more compute and bandwidth; event-triggered micro-batching keeps steady-state cost low and scales predictably.
  • Privacy and compliance: event-triggered recording limits open-mic exposure and can stay entirely on-device; open-mic VAD usually means streaming audio to servers.

Implementation considerations

Architecture patterns

  • Client only. Event-triggered recording on device: transcribe locally with Whisper or Vosk, then post results to the backend. Great for privacy-sensitive workflows and offline use.
  • Client plus edge. Run VAD and denoising on device using WebRTC Audio Processing or Silero VAD, stream detected segments to the server for ASR and LLM inference. Reduces bandwidth and end-to-end latency.
  • Server-centric streaming. Keep the mic open, push audio frames over WebRTC or WebSocket, use server VAD or semantic turn detection in the ASR/LLM pipeline, and stream partial results to the UI for responsiveness.
  • Hybrid micro-batching. Buffer short chunks client-side and send batches when you detect likely turn boundaries to balance responsiveness and simplicity. A small sketch of this pattern follows.
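
In the sketch below, the boundary flag and the send_batch callback stand in for your own heuristic (a lightweight client-side VAD, a pause timer) and transport (HTTP upload, WebSocket message).

# Hybrid micro-batching: buffer short chunks client-side and flush on likely turn boundaries.
from typing import Callable, List

class MicroBatcher:
    def __init__(self, send_batch: Callable[[bytes], None],
                 chunk_ms: int = 200, max_buffer_ms: int = 4000):
        self.send_batch = send_batch
        self.max_chunks = max_buffer_ms // chunk_ms
        self.buffer: List[bytes] = []

    def add_chunk(self, chunk: bytes, likely_turn_boundary: bool) -> None:
        self.buffer.append(chunk)
        # Flush early on a likely boundary, or when the buffer grows too long to stay responsive.
        if likely_turn_boundary or len(self.buffer) >= self.max_chunks:
            self.send_batch(b"".join(self.buffer))
            self.buffer.clear()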

Noise suppression and gain control

  • Libraries: WebRTC Audio Processing and RNNoise are proven choices for denoising in production.
  • Calibration: Set gain levels carefully to avoid clipping and improve ASR stability across devices and mic types.

LLM orchestration and speech-to-speech loop

  • Coordinate ASR, LLM, and TTS with clear turn boundaries. For real-time agents, stream partial ASR to the LLM to begin planning before the user finishes speaking.
  • Keep prompts lean and cache reusable system context to stay within the model’s context window and control inference cost.
  • For deployment at scale, consider GPU scheduling, model serving concurrency, and backpressure strategies to protect latency SLOs.

Fallbacks and resilience

  • Provide a user-visible push-to-talk fallback when VAD confidence drops in noisy conditions.
  • Implement timeouts and “are you still there?” prompts to recover from ambiguous turns.
  • Log raw metrics and audio hashes, not full audio, if privacy or compliance limits retention.

Measuring quality: metrics and test methodology

Define objective and subjective metrics before tuning — not after.

Key metrics

  • Latency: time to first token (from speech start to first partial ASR output); end-of-turn latency (from final speech to system response start).
  • Accuracy: word error rate (WER) for ASR; entity accuracy for domain-specific terms; false start rate (non-speech events misclassified as starts); false stop rate (premature end-of-turn detections).
  • Robustness: performance across SNR levels, accents, speaking rates, and mic types.
  • UX quality: interruption rate in conversation, perceived responsiveness from user studies.

Test methodology

  • Build a representative audio corpus: quiet rooms, noisy cafes, cars, headsets, and laptop mics
  • Test multiple parameter sets using A/B or multi-armed bandit experiments in staging
  • Sweep hang-in and hang-out times and aggressiveness to map the Pareto frontier of latency vs stability
  • Record telemetry per session: SNR estimates, VAD confidence, end-of-turn delay, ASR WER proxies, and downstream task success (a minimal logging sketch follows this list)
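
The telemetry record can stay small. Below is a minimal sketch that derives end-of-turn latency and time to first token from event timestamps and computes WER with the jiwer package; the field names and example values are illustrative.

# Minimal per-session metrics from pipeline timestamps, plus a WER check against a reference transcript.
import jiwer     # pip install jiwer

def session_metrics(session: dict) -> dict:
    return {
        # How long the user waited between finishing speaking and the start of the system response.
        "end_of_turn_latency_s": session["response_start_ts"] - session["speech_end_ts"],
        # Time from detected speech start to the first partial ASR output.
        "time_to_first_token_s": session["first_partial_ts"] - session["speech_start_ts"],
        # Word error rate against a reference (or human-corrected proxy) transcript.
        "wer": jiwer.wer(session["reference_text"], session["asr_text"]),
    }

example = {
    "speech_start_ts": 10.0, "first_partial_ts": 10.4,
    "speech_end_ts": 14.2, "response_start_ts": 15.1,
    "reference_text": "book a meeting with the globaldev team",
    "asr_text": "book a meeting with the global dev team",
}
print(session_metrics(example))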

Cost, privacy, and compliance

Cost

  • VAD with continuous streaming consumes more CPU/GPU and bandwidth, but improves engagement with lower perceived latency.
  • Event-triggered micro-batching reduces steady-state cost and is easier to scale predictably.

Privacy and compliance

  • Event-triggered recording limits open-mic exposure and can keep data entirely on-device — helpful for GDPR or sectoral data residency rules.
  • If using cloud services, review data retention, encryption, and PII redaction options in vendor documentation. Most major ASR providers offer configurable retention policies.

On-device and edge options

  • On-device VAD and ASR via Whisper or Vosk reduce cloud dependency and support offline scenarios, at the expense of device compute and battery life.

Use cases that map cleanly

  • Conversational assistants and real-time agents. Choose VAD or semantic turn detection to enable natural, barge-in conversations. For avatars and voice assistants, see our guide to building AI avatar software.
  • Meeting capture and live translation. Prefer VAD with stable hang-out and speaker diarization. Prioritize reliability over ultra-low latency.
  • Push-to-talk commands and voice search. Event-triggered micro-batching keeps UX clear and predictable, with easy client integration.
  • Regulated or privacy-first workflows. Event-triggered with on-device ASR limits data exposure and simplifies compliance with strict data policies.

Implementation quick-start with OpenAI Realtime

  • Start with server VAD. Use sensible defaults from the OpenAI Realtime API. Tune silence_duration_ms and threshold for your mic and acoustic environment.
  • Test semantic turn detection if your users pause mid-thought or speak at a variable pace. Adjust eagerness and validate that end-of-turns feel natural in user tests.
  • Stream partial ASR to the LLM for fast planning. Constrain prompts and tools to keep inference within budget while maintaining response quality.
  • Build a corpus and sweep parameters to find your Pareto-optimal point on the latency vs stability frontier before going to production.

Conclusion

The VAD vs event-triggered decision is less about technology preference and more about UX, environment, and operating constraints. Use VAD or semantic turn detection when fluid, hands-free interaction is the product. Use event-triggered when explicit control, predictable cost, and privacy are paramount. In practice, many production systems implement both and switch dynamically based on context or user settings.

Looking to turn AI ideas into scalable production systems? Globaldev helps companies design and build practical AI solutions tailored to real product and business needs, from VAD tuning and noise suppression to real-time orchestration, GPU infrastructure, monitoring, and ongoing evaluation.