segment/v1 — Speech Segmentation + Diarization¶
Status: Draft (design-locked, ready for first implementation) · Stability: v1 will be frozen with the first reference backend (energy-vad) · Implementations: in-tree only until v1 is frozen.
The segment/v1 surface sits between capture/v1 (audio frames) and
asr/v1 (transcription). A segment backend consumes a continuous
stream of audio frames and emits discrete speech segments — bounded
audio chunks containing detected speech, with optional speaker
attribution and multi-speaker flags. Each segment becomes the input to
exactly one ASR transcription call.
This is the most-constrained surface in the pipeline: both endpoints
(capture/v1 and asr/v1) are locked. Segment/v1's design space lives
entirely between those contracts.
Scope¶
segment/v1 covers:
- The backend interface (ProcessFrame, OpenStream, CloseStream)
- Voice Activity Detection (VAD) with a pluggable chain
- Segment boundary rules (silence duration, max duration, padding)
- Diarization layers (capture-hint passthrough → multi-speaker detection → speaker change splits → optional full speaker identification)
- Multi-stream coordination (independent vs cross-stream suppression)
- Optional preprocessing (denoise, AGC, high-pass, echo cancellation)
- Per-source VAD overrides
- Backpressure to ASR (drop-oldest queue policy)
- Error model + audit hook
- Versioning and stability rules
segment/v1 does not cover:
- Audio capture (capture/v1)
- Transcription (asr/v1)
- Intent classification or sink routing (router/v1, sink/v1)
- Speaker identification across sessions (identity/v1, Enterprise)
- Real-time UI surfaces (orchestrator subscriber concern, not a segment/v1 responsibility)
Input — what the backend consumes¶
The backend receives:
- A continuous stream of capture/v1 Frames via ProcessFrame(handle, frame) calls, where handle is the StreamHandle returned by OpenStream
- SessionInfo at OpenStream time:
SessionInfo {
  SessionID   string
  StreamID    string
  SourceKind  SourceKind   # "self" | "in-person" | "online" | "file"
  CaptureHint string?      # speaker label from capture layer (e.g., "self")
  SampleRate  uint32
  Channels    uint8
}
Backend state is per-stream — OpenStream returns a StreamHandle
that owns the VAD state, current in-progress segment, drop counters, etc.
Multi-stream sessions are independent at the backend level unless
cross-stream coordination is configured (see Multi-stream below).
Output — what flows to ASR¶
Each closed segment becomes a Segment:
Segment {
  SegmentID           string            # unique within stream
  SessionID           string
  StreamID            string
  StartedAt           Timestamp         # wall-clock of first sample (after applying pad_start_ms)
  EndedAt             Timestamp         # wall-clock of last sample (after applying pad_end_ms)
  Frames              []Frame           # the audio
  SpeakerHint         string?           # capture-hint or diarization label
  HasMultipleSpeakers bool              # multi-speaker boolean
  SpeakerEmbedding    []float?          # optional; when full identification enabled
  Confidence          float             # segmentation confidence (0..1)
  Reason              SegmentEndReason  # why this segment closed
  Custom              map<string, any>  # provenance flags from preprocessing, diarization, etc.
}
SegmentEndReason {
  EndedSilenceTimeout  # min_silence_for_split reached
  EndedMaxDuration     # max_segment_duration hit
  EndedSpeakerChange   # diarization detected speaker change
  EndedStreamClose     # stream closing
  EndedSuppressed      # cross-stream suppression withheld this segment from ASR
}
Segment is the unit of input to asr/v1's Transcribe (offline) or the
audio fed into StreamFeed (streaming).
Voice Activity Detection — Pluggable Chain¶
VAD determines speech start/end. Three VADs ship in v1, run in a configurable chain.
The three VADs¶
| VAD | What it is | Latency | Accuracy | Cost |
|---|---|---|---|---|
| energy | RMS / spectral-flux threshold + smoothing | < 1 ms | Decent in clean audio; bad in noise | Free, no model |
| webrtc | Google's WebRTC VAD (battle-tested, fast) | < 5 ms | Good on clean speech; struggles with music / overlapping speakers | Free, no model |
| silero | Small neural VAD (ONNX, ~1 MB) | 10–50 ms | Best, especially in noise / mixed audio | Free, bundled model |
Chain semantics¶
All VADs in the chain run in parallel on every audio frame (not sequentially). Each emits a speech probability. The voting rule decides the final speech / no-speech decision.
segment:
  vad:
    chain:
      - type: webrtc
        aggressiveness: 2      # WebRTC VAD has 0-3 aggressiveness levels
        frame_ms: 20           # match capture's default frame size
      - type: silero
        model_path: ~/.vox/models/silero/silero_vad.onnx
        threshold: 0.5
      - type: energy
        threshold_db: -40
        smoothing_ms: 50
    voting: priority           # priority | majority | weighted
    # priority = first VAD with high confidence decides
    # majority = at least N-of-M must agree
    # weighted = sum(probability × weight) ≥ threshold
    # Per-source overrides
    by_source:
      online:
        chain: [silero, webrtc]   # noisier; ML first
      self:
        chain: [webrtc, energy]   # clean; cheap first
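The three voting rules can be sketched roughly as follows. This is an illustration only: the VADResult type, the Vote function, and the highConf, minAgree, and threshold parameters are hypothetical names, and "high confidence" is assumed here to mean a probability that is decisively high or decisively low.

```go
package main

// VADResult is a hypothetical per-frame output of one VAD in the chain.
type VADResult struct {
	Name        string
	Probability float64 // speech probability in [0, 1]
	Weight      float64 // used only by weighted voting
	Healthy     bool    // unhealthy VADs are skipped, so the vote adapts
}

// Vote combines the chain's results under one voting rule.
// highConf, minAgree and threshold are illustrative parameters, not spec fields.
func Vote(rule string, results []VADResult, highConf float64, minAgree int, threshold float64) bool {
	switch rule {
	case "priority":
		// Assumption: the first healthy VAD that is confident either way decides.
		for _, r := range results {
			if !r.Healthy {
				continue
			}
			if r.Probability >= highConf {
				return true
			}
			if r.Probability <= 1-highConf {
				return false
			}
		}
		return false
	case "majority":
		// At least minAgree healthy VADs must report speech.
		agree := 0
		for _, r := range results {
			if r.Healthy && r.Probability >= 0.5 {
				agree++
			}
		}
		return agree >= minAgree
	case "weighted":
		// sum(probability × weight) must reach the threshold.
		sum := 0.0
		for _, r := range results {
			if r.Healthy {
				sum += r.Probability * r.Weight
			}
		}
		return sum >= threshold
	}
	return false // unknown rule: treat as no speech
}
```

Because unhealthy VADs are skipped in every branch, a failed VAD drops out of the vote without a separate code path.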
Failure handling within the VAD chain¶
| Failure | Action |
|---|---|
| One VAD in chain fails (model load error, runtime crash) | Mark unhealthy; chain continues with remaining VADs; voting rule adapts |
| All VADs fail | Fall back to fixed-length chunking (10 s chunks); emit segment.degraded audit event; pipeline keeps moving |
The pipeline never blocks on VAD failure.
Segment Boundaries¶
Speech start / end is detected by VAD; segment start / end is VAD plus boundary rules. The rules:
segment:
  boundaries:
    min_speech_duration: 250ms     # ignore micro-utterances ("uh", coughs)
    min_silence_for_split: 600ms   # silence this long → segment break
    max_segment_duration: 30s      # force-cut at this point (prevents runaway)
    pad_start_ms: 200              # include 200ms before detected speech start
    pad_end_ms: 300                # include 300ms after detected speech end
    speaker_change_splits: true    # split at diarization-detected speaker changes
Why these defaults¶
- min_speech_duration: 250ms. Skips filler sounds. Saves ASR cost on cloud backends and reduces noise in the ASR output.
- min_silence_for_split: 600ms. Natural speech pauses are 150–400 ms; conversational turn boundaries are typically > 500 ms. 600 ms threads the needle.
- max_segment_duration: 30s. Hard cap. Prevents a "stuck VAD" scenario from producing a 6-hour segment. Aligns with ASR backend preferences (Whisper-class models are tuned on ~30 s windows).
- pad_start_ms: 200 / pad_end_ms: 300. Critical for ASR accuracy. Whisper-class models lose the first/last phoneme if you cut too tight. Asymmetric (more trailing padding) because word offsets typically trail more than they lead.
- speaker_change_splits: true. When diarization detects a speaker change mid-segment, split. Downstream sees one segment per speaker turn, which is the right granularity for IntentEnvelope per speaker.
All values configurable.
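As an illustration of how these rules interact, here is a minimal sketch of a boundary state machine driven by one speech / no-speech decision per frame. The Rules, Span, and Segmenter names are hypothetical. The sketch assumes min_speech_duration is measured as accumulated speech within the segment, tracks only timeline positions rather than audio, and leaves out speaker-change splits.

```go
package main

// Rules mirrors the boundary settings, expressed in milliseconds.
type Rules struct {
	MinSpeechMs, MinSilenceMs, MaxSegmentMs, PadStartMs, PadEndMs int
}

// DefaultRules returns the defaults listed in this section.
func DefaultRules() Rules {
	return Rules{MinSpeechMs: 250, MinSilenceMs: 600, MaxSegmentMs: 30000, PadStartMs: 200, PadEndMs: 300}
}

// Span is a closed segment on the stream timeline (ms since stream open).
type Span struct {
	StartMs, EndMs int
	Reason         string
}

// Segmenter holds the per-stream boundary state.
type Segmenter struct {
	rules         Rules
	frameMs       int
	nowMs         int  // end time of the last processed frame
	open          bool // a segment is in progress
	speechStart   int  // start time of the first speech frame
	lastSpeechEnd int  // end time of the most recent speech frame
	speechMs      int  // speech accumulated in the current segment
}

func NewSegmenter(r Rules, frameMs int) *Segmenter {
	return &Segmenter{rules: r, frameMs: frameMs}
}

// Process consumes one VAD decision and returns any segment it closes.
func (s *Segmenter) Process(isSpeech bool) []Span {
	frameStart := s.nowMs
	s.nowMs += s.frameMs
	if isSpeech {
		if !s.open {
			s.open, s.speechStart, s.speechMs = true, frameStart, 0
		}
		s.speechMs += s.frameMs
		s.lastSpeechEnd = s.nowMs
	}
	if !s.open {
		return nil
	}
	switch {
	case s.nowMs-s.lastSpeechEnd >= s.rules.MinSilenceMs:
		return s.close("EndedSilenceTimeout")
	case s.nowMs-s.speechStart >= s.rules.MaxSegmentMs:
		return s.close("EndedMaxDuration")
	}
	return nil
}

// close applies min_speech_duration and padding, then resets the state.
func (s *Segmenter) close(reason string) []Span {
	s.open = false
	if s.speechMs < s.rules.MinSpeechMs {
		return nil // micro-utterance: discard
	}
	start := s.speechStart - s.rules.PadStartMs
	if start < 0 {
		start = 0 // cannot pad before the stream began
	}
	end := s.lastSpeechEnd + s.rules.PadEndMs
	if end > s.nowMs {
		end = s.nowMs // cannot pad with audio not yet received
	}
	return []Span{{StartMs: start, EndMs: end, Reason: reason}}
}
```

With 20 ms frames, 500 ms of silence followed by 1 s of speech and then silence closes one span covering 300 ms to 1800 ms, once the 600 ms silence threshold is reached.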
Diarization Layers¶
Per the precedence rule locked in router/v1 (segment > ASR backend >
capture hint), segment/v1 must produce something for Speaker.Label.
Four layers, three default-on:
| Layer | Status | Behavior |
|---|---|---|
| Capture-hint passthrough | Always on | The self adapter stamps Speaker.Label = "self" at capture; segment passes it through |
| Multi-speaker detection | Default on | Boolean signal: "this segment contains > 1 voice". Cheap; uses any VAD's secondary output |
| Speaker change detection within segment | Default on for online + in-person | Mid-segment speaker changes split the segment |
| Full speaker identification | Opt-in | Stable labels (speaker-0, speaker-1, …) across segments; uses a pluggable model backend |
segment:
  diarization:
    capture_hint: passthrough                # always passes through
    multi_speaker_detection: enabled         # boolean signal
    speaker_change_within_segment: enabled   # splits per boundary rules
    speaker_identification:
      enabled: false                         # opt-in
      backend: pyannote-onnx                 # pyannote-onnx | nemo-ecapa | sherpa
      model_path: ~/.vox/models/pyannote/segmentation.onnx
      embedding_path: ~/.vox/models/pyannote/ecapa-tdnn.onnx
      similarity_threshold: 0.7              # cosine similarity for label reuse
      window_size: 30s                       # how far back to compare speakers
With speaker_identification enabled, each segment carries:
- SpeakerHint: "speaker-0" / "speaker-1" / … (stable within session)
- Optional SpeakerEmbedding: []float (passed to ASR for backend-side diarization refinement)
Without speaker_identification, segments still carry the capture hint
and the multi-speaker boolean — sufficient for most use cases.
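One way label reuse under similarity_threshold could work is sketched below. The Labeler type is hypothetical, it keeps a single reference embedding per speaker, and it ignores window_size (how far back to compare); a real backend would likely do something more refined.

```go
package main

import (
	"fmt"
	"math"
)

// Labeler assigns stable labels by comparing embeddings to known speakers.
type Labeler struct {
	Threshold float64     // e.g. 0.7, as in similarity_threshold
	refs      [][]float64 // one reference embedding per known speaker
}

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// Assign reuses the most similar known label when the similarity reaches
// the threshold; otherwise it registers a new speaker.
func (l *Labeler) Assign(emb []float64) string {
	best, bestSim := -1, l.Threshold
	for i, ref := range l.refs {
		if sim := cosine(emb, ref); sim >= bestSim {
			best, bestSim = i, sim
		}
	}
	if best < 0 {
		l.refs = append(l.refs, emb)
		best = len(l.refs) - 1
	}
	return fmt.Sprintf("speaker-%d", best)
}
```

A lower threshold merges more voices under one label; a higher one splits a single speaker into several labels when their voice varies.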
Multi-stream Coordination¶
When two streams run in parallel (your mic + system audio from an online call), how do their VAD pipelines interact?
| Mode | Behavior | Use case |
|---|---|---|
| independent (default) | Each stream segments independently. Downstream correlates via SessionID | Default; covers the majority case where call clients already do echo cancellation |
| cross-stream-suppression (opt-in) | segment/v1 sees both streams; suppresses segments likely to be echo of another stream | Aggressive de-duplication for captures without native AEC |
segment:
  multi_stream:
    mode: independent
    cross_stream:
      enabled: false
      echo_detection: cross-correlation   # cross-correlation | spectral
      suppress_threshold_ms: 50           # correlation peak within 50ms → echo
      suppress_stream: self               # self loses to system-audio "ground truth"
Suppressed segments are still emitted with Reason: EndedSuppressed and
flow to the audit log; they just don't go to ASR. This preserves the
audit trail without re-paying transcription costs.
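The cross-correlation check might look something like the sketch below. PeakLag, IsEcho, and the minPeak cutoff are assumptions made for illustration; the spec only states that a correlation peak within suppress_threshold_ms indicates echo. Real audio would also need windowing and resampling, which are not shown.

```go
package main

import "math"

// PeakLag searches lags in [-maxLag, maxLag] samples and returns the lag with
// the highest normalized cross-correlation between a and b, plus that value.
func PeakLag(a, b []float64, maxLag int) (int, float64) {
	var ea, eb float64
	for _, v := range a {
		ea += v * v
	}
	for _, v := range b {
		eb += v * v
	}
	if ea == 0 || eb == 0 {
		return 0, 0 // a silent signal cannot correlate
	}
	norm := math.Sqrt(ea * eb)
	bestLag, best := 0, math.Inf(-1)
	for lag := -maxLag; lag <= maxLag; lag++ {
		var sum float64
		for i := range a {
			if j := i + lag; j >= 0 && j < len(b) {
				sum += a[i] * b[j]
			}
		}
		if c := sum / norm; c > best {
			bestLag, best = lag, c
		}
	}
	return bestLag, best
}

// IsEcho reports whether b looks like a delayed copy of a within thresholdMs.
// minPeak is a hypothetical cutoff on correlation strength.
func IsEcho(a, b []float64, sampleRate, thresholdMs int, minPeak float64) bool {
	maxLag := sampleRate * thresholdMs / 1000
	_, peak := PeakLag(a, b, maxLag)
	return peak >= minPeak
}
```

Normalizing by signal energy makes the peak insensitive to the echo being quieter than the original, which is the common case for speaker-to-microphone bleed.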
Preprocessing — Opt-in, Off by Default¶
Audio preprocessing (denoise / AGC / high-pass / echo cancellation) lives between capture and ASR. All stages are opt-in, all off by default.
Reasoning:
- Modern ASR backends (especially Whisper) are trained on diverse raw audio; preprocessing can hurt accuracy
- Cloud ASR backends do their own preprocessing internally; doubling it is wasteful
- Online sources (call clients) usually already do echo cancellation
segment:
  preprocessing:
    denoise:
      enabled: false
      method: rnnoise      # rnnoise | spectral-subtraction
    agc:                   # automatic gain control
      enabled: false
      target_db: -16
    high_pass:
      enabled: false
      cutoff_hz: 80        # remove HVAC rumble + low-frequency noise
    echo_cancellation:
      enabled: false
Each preprocessing stage emits a provenance flag into the segment's
Custom map when applied:
Custom {
  "segment.preprocessing.rnnoise_applied": "true",
  "segment.preprocessing.agc_applied": "true",
  ...
}
ASR sees these flags via the envelope's Provenance chain and can adapt
if it knows certain stages help or hurt its specific backend.
Backend Interface¶
SegmentBackend {
  # Identity
  Name() -> string
  Capabilities() -> Capabilities
  # Lifecycle
  Open(config) -> Error
  Close() -> Error
  # Stream lifecycle
  OpenStream(sessionInfo) -> StreamHandle | Error
  CloseStream(handle) -> []Segment | Error
    # Final flush: emits any in-progress segment
  # Hot path
  ProcessFrame(handle, frame) -> []Segment | Error
    # Returns 0 or more complete segments. Most frames return [] (no
    # segment boundary reached). A frame that closes one or more segments
    # returns them.
  # Diagnostics
  Stats() -> Stats
  Health() -> Health
}
Capabilities {
  SupportedVADs                  []string  # "energy" | "webrtc" | "silero"
  SupportsDiarization            bool
  SupportsMultiSpeaker           bool
  SupportsCrossStreamSuppression bool
  SupportedPreprocessing         []string  # "rnnoise" | "agc" | "high-pass" | "echo-cancel"
}
Orchestrator loop¶
- Capture adapter emits frames on its channel
- Orchestrator forwards each frame to segment.ProcessFrame(handle, frame)
- ProcessFrame returns []Segment: usually empty, occasionally one or more
- For each returned segment, orchestrator hands it to asr/v1 per the source-kind routing
- On stream close, orchestrator calls segment.CloseStream(handle) to flush any in-progress segment
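The loop above could be rendered in Go roughly as follows. This is a sketch under simplifying assumptions: Frame, Segment, and StreamHandle are stand-ins for the spec types, OpenStream takes a bare stream ID instead of SessionInfo, and ForwardStream and chunkBackend are hypothetical names. Crash recovery and backend restart are left out.

```go
package main

// Frame, Segment and StreamHandle are simplified stand-ins for the spec types.
type Frame struct{ Samples []int16 }

type Segment struct {
	ID     int
	Frames []Frame
}

type StreamHandle int

// SegmentBackend is a hypothetical Go rendering of the stream-level methods.
type SegmentBackend interface {
	OpenStream(streamID string) (StreamHandle, error)
	ProcessFrame(h StreamHandle, f Frame) ([]Segment, error)
	CloseStream(h StreamHandle) ([]Segment, error)
}

// ForwardStream is the per-stream forwarder: it feeds every frame to the
// backend, hands closed segments to toASR, and flushes on channel close.
func ForwardStream(b SegmentBackend, streamID string, frames <-chan Frame, toASR func(Segment)) error {
	h, err := b.OpenStream(streamID)
	if err != nil {
		return err
	}
	for f := range frames {
		segs, err := b.ProcessFrame(h, f)
		if err != nil {
			return err
		}
		for _, s := range segs { // usually empty
			toASR(s)
		}
	}
	segs, err := b.CloseStream(h) // final flush
	if err != nil {
		return err
	}
	for _, s := range segs {
		toASR(s)
	}
	return nil
}

// chunkBackend is a toy backend that closes a segment every n frames.
type chunkBackend struct {
	n    int
	cur  []Frame
	next int
}

func (c *chunkBackend) OpenStream(string) (StreamHandle, error) { return 1, nil }

func (c *chunkBackend) ProcessFrame(_ StreamHandle, f Frame) ([]Segment, error) {
	c.cur = append(c.cur, f)
	if len(c.cur) < c.n {
		return nil, nil
	}
	return c.flush(), nil
}

func (c *chunkBackend) CloseStream(StreamHandle) ([]Segment, error) {
	if len(c.cur) == 0 {
		return nil, nil
	}
	return c.flush(), nil
}

// flush emits the in-progress frames as one segment and resets the buffer.
func (c *chunkBackend) flush() []Segment {
	seg := Segment{ID: c.next, Frames: c.cur}
	c.next++
	c.cur = nil
	return []Segment{seg}
}
```

Running one ForwardStream call per stream, each in its own goroutine, matches the single-caller assumption for ProcessFrame.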
Concurrency¶
- One StreamHandle per stream; segments from different streams flow independently
- ProcessFrame is called from a single goroutine per stream (the orchestrator's stream-forwarder)
- Cross-stream suppression (when enabled) uses a shared lock-protected recent-segments structure; locking is internal to the backend
Backpressure (segment → ASR)¶
The orchestrator buffers segments destined for ASR. When ASR can't keep up:
segment:
  output_queue:
    buffer_segments: 32             # ~10 min of typical segments
    drop_policy: drop-oldest
    drop_alert_threshold_pct: 5.0
    drop_alert_window_sec: 60
Default policy: drop-oldest — opposite of capture's drop-newest.
Reasoning:
- In capture, frames are sequential audio samples — losing the newest preserves continuity of the recent past, which is what live processing needs
- In segment, each segment is an independent meaning-unit — when the queue is backed up, the oldest segment is the most stale and least valuable to the user
Drop telemetry:
- segment.dropped counter — running total
- Structured WARN log per drop event (not per dropped segment)
- audit/v1 event when audit is loaded
- Loud-escalation if drop rate > 5% over 60 s (configurable)
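A drop-oldest queue with basic drop accounting can be sketched as below. SegmentQueue is a hypothetical type that stores segment IDs only, is not safe for concurrent use, and computes the drop rate over the queue's whole lifetime rather than over the configured 60 s window.

```go
package main

// SegmentQueue is a bounded FIFO that evicts the oldest entry when full.
type SegmentQueue struct {
	buf      []string // segment IDs, oldest first
	capacity int
	Offered  int // total segments pushed
	Dropped  int // total segments evicted
}

func NewSegmentQueue(capacity int) *SegmentQueue {
	return &SegmentQueue{capacity: capacity}
}

// Push enqueues id. If the queue is full, the oldest segment is evicted
// and returned so the caller can log or audit the drop.
func (q *SegmentQueue) Push(id string) (dropped string, ok bool) {
	q.Offered++
	if len(q.buf) == q.capacity {
		dropped, ok = q.buf[0], true
		q.buf = q.buf[1:]
		q.Dropped++
	}
	q.buf = append(q.buf, id)
	return dropped, ok
}

// Pop removes and returns the oldest queued segment, if any.
func (q *SegmentQueue) Pop() (string, bool) {
	if len(q.buf) == 0 {
		return "", false
	}
	id := q.buf[0]
	q.buf = q.buf[1:]
	return id, true
}

// DropRatePct returns the percentage of offered segments that were dropped.
func (q *SegmentQueue) DropRatePct() float64 {
	if q.Offered == 0 {
		return 0
	}
	return 100 * float64(q.Dropped) / float64(q.Offered)
}
```

An alerting layer would compare DropRatePct, computed over a sliding window, against drop_alert_threshold_pct.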
Error Model¶
Typed errors, mirroring other Vox surfaces:
SegmentError {
  Kind    SegmentErrorKind
  Stage   string   # "vad:webrtc" | "vad:silero" | "diarization" | "preprocessing:rnnoise" | etc.
  Message string
  Cause   Error?
}
SegmentErrorKind {
  ErrInvalidConfig
  ErrVADUnavailable   # one VAD in chain failed; chain continues
  ErrAllVADsFailed    # entire chain failed; falls back to fixed-length chunks
  ErrModelNotFound    # diarization / preprocessing model missing
  ErrModelCorrupt     # checksum mismatch
  ErrUnsupported      # capability requested not declared
  ErrInternal
}
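In Go, the typed error might be expressed as in the sketch below, so callers can branch on Kind with errors.As. The KindOf helper and the message format are illustrative choices, not part of the contract.

```go
package main

import (
	"errors"
	"fmt"
)

// SegmentErrorKind enumerates the error kinds listed above.
type SegmentErrorKind int

const (
	ErrInvalidConfig SegmentErrorKind = iota
	ErrVADUnavailable
	ErrAllVADsFailed
	ErrModelNotFound
	ErrModelCorrupt
	ErrUnsupported
	ErrInternal
)

// SegmentError carries the kind, the failing stage, and an optional cause.
type SegmentError struct {
	Kind    SegmentErrorKind
	Stage   string // e.g. "vad:silero"
	Message string
	Cause   error
}

// Error formats the stage and message, appending the cause when present.
func (e *SegmentError) Error() string {
	msg := fmt.Sprintf("segment: %s: %s", e.Stage, e.Message)
	if e.Cause != nil {
		msg += ": " + e.Cause.Error()
	}
	return msg
}

// Unwrap exposes the cause to errors.Is and errors.As.
func (e *SegmentError) Unwrap() error { return e.Cause }

// KindOf extracts the kind from err or anything it wraps.
func KindOf(err error) (SegmentErrorKind, bool) {
	var se *SegmentError
	if errors.As(err, &se) {
		return se.Kind, true
	}
	return 0, false
}
```

An orchestrator could, for example, treat ErrVADUnavailable as a health signal and continue, while surfacing ErrInvalidConfig at startup.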
Failure handling¶
| Failure | Action |
|---|---|
| One VAD in chain fails | Mark unhealthy; chain continues; voting adapts |
| All VADs fail | Fall back to fixed-length chunking (10 s); emit segment.degraded audit event |
| Diarization model missing | Disable diarization features for the session; segments still emit with capture-hint only |
| Preprocessing failure | Skip the failing stage; pass audio through unchanged |
| ProcessFrame crashes | Orchestrator-side recover; current segment lost; backend restarted; segment.crashed audit event |
| Segment exceeds max_segment_duration | Force-close at the cap with EndedMaxDuration reason; not an error |
Key principle: segment/v1 NEVER blocks the pipeline. Worst case is
fixed-length chunking, which trades accuracy for continuity. The
pipeline keeps moving.
Audit Hook¶
When audit/v1 is loaded, the segment backend MUST emit a
SegmentDecisionEvent per segment:
SegmentDecisionEvent {
  Timestamp            Timestamp
  SessionID            string
  StreamID             string
  SegmentID            string
  StartedAt            Timestamp
  EndedAt              Timestamp
  Duration             Duration
  EndReason            SegmentEndReason
  VADChainTrace        []VADStep   # which VADs fired, with speech probabilities
  Diarization {
    SpeakerHint         string?
    HasMultipleSpeakers bool
    Source              string     # "capture-hint" | "identification:pyannote-onnx" | "splitting"
  }
  PreprocessingApplied []string    # ["rnnoise", "agc"] etc., empty if none
  ConfidenceScore      float
  FrameCount           uint32
  AudioBytes           uint64      # approximate
}
VADStep {
  Name              string   # "webrtc" | "silero" | "energy"
  SpeechProbability float
  Triggered         bool
  LatencyMS         uint32
}
This lets an auditor reconstruct: "this segment was closed because Silero
detected silence for 700 ms, the capture-hint said self, no preprocessing
was applied, segmentation confidence was 0.92, and it covered 47 frames."
Combined with RouterDecisionEvent (from router/v1) and downstream sink
audit events, the complete fate of every utterance is traceable —
from microphone vibration to LLM response to email summary.
When audit/v1 is NOT loaded, the same data flows as structured logs at
DEBUG (configurable to INFO via segment.log_decisions_at: info).
Bundling¶
| Component | Bundling decision |
|---|---|
| energy VAD | Compiled into binary. Pure algorithm, no model |
| webrtc VAD | Compiled into binary (CGo binding to libwebrtcvad). No model file |
| silero VAD | Bundled ONNX model (~1 MB). Critical to "noise-robust out of the box" |
| Diarization models (pyannote-onnx, NeMo ECAPA, sherpa) | Downloaded via vox model download |
| Preprocessing models (RNNoise) | Downloaded for opt-in stages |
Bundling Silero specifically: ~1 MB binary overhead is cheap compared to the "works in a noisy office on first install" UX win.
Diarization and preprocessing models are NOT bundled because:
- They're opt-in features (most use cases don't need them)
- Sizes are larger (6-30 MB each)
- Cross-platform installer size matters
Model lifecycle (download, storage, verification, versioning) follows the
same pattern as asr/v1: ~/.vox/models/<backend>/<model>, SHA-256
checksums on every load, explicit version names, no silent upgrades.
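The "checksum on every load" rule can be illustrated with Go's standard library. VerifyModel and checksumOK are hypothetical helpers, and where the expected digest comes from (for example a manifest next to the model) is not specified here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
)

// checksumOK reports whether the SHA-256 digest of data matches wantHex
// (lowercase hexadecimal).
func checksumOK(data []byte, wantHex string) bool {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]) == wantHex
}

// VerifyModel reads the model file at path and checks its digest.
// Models here are small (about 1 to 30 MB), so reading the whole file
// into memory is acceptable for a sketch.
func VerifyModel(path, wantHex string) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return fmt.Errorf("read model %s: %w", path, err)
	}
	if !checksumOK(data, wantHex) {
		return fmt.Errorf("checksum mismatch for %s", path)
	}
	return nil
}
```

A mismatch would map naturally onto the ErrModelCorrupt kind, and a missing file onto ErrModelNotFound.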
Reference Build Order¶
| Order | Component | Why |
|---|---|---|
| 1 | energy-vad | Simplest; no dependencies. Builds the segment pipeline scaffolding (ProcessFrame, OpenStream/CloseStream, queue, drop policy). End-to-end pipeline tests pass with energy-VAD + file-wav capture + whisper-cpp ASR |
| 2 | webrtc-vad | Stable, fast, well-tested. CGo binding to libwebrtcvad |
| 3 | silero-vad | ONNX runtime integration; bundling the model. After this lands, "works in noise out of the box" is true |
| 4 | diarization scaffolding | Plumbing for SpeakerHint and SpeakerEmbedding; fake/synthetic speakers in test audio prove the path before real models land |
| 5 | pyannote-onnx diarization | First real speaker-identification backend |
| 6 | rnnoise preprocessing | First preprocessing stage; validates the optional layer |
| 7+ | additional backends | Same contract, different models |
The pipeline is usable after step 1 (energy VAD alone) and good after step 3 (Silero VAD). Diarization + preprocessing layer onto a proven core.
Versioning and Stability¶
segment/v1 is the contract above. Once frozen:
- Non-breaking changes (allowed in v1.x): adding optional fields to Capabilities, Segment, SegmentDecisionEvent; adding new VAD types; adding new diarization backends; adding new preprocessing stages; adding new SegmentEndReason values; adding new SegmentErrorKind values.
- Breaking changes (require v2): changing the SegmentBackend interface signature; changing boundary-rule semantics; removing or repurposing any existing field; changing what frames a segment includes.
The core supports one vN of segment/ at a time, with overlap during
migrations.
Project Principle: Opinionated Defaults, Every Default Configurable¶
This contract continues the principle from the rest of v1. Every
behavior with a defensible default (min_silence_for_split: 600ms,
pad_end_ms: 300, max_segment_duration: 30s, voting: priority,
drop_policy: drop-oldest, auto_download: prompt, etc.) is exposed as
a config knob. Defaults reflect a considered recommendation for the
typical voice-to-LLM use case; the knobs exist so specialized workflows
can tune them.