capture/v1 — Audio Source Adapter Contract

Status: Draft (design-locked, ready for first implementation) · Stability: v1 will be frozen with the first reference adapter (file-wav) · Implementations: in-tree only until v1 is frozen.

The capture/v1 surface is the entry point of every Vox session. A capture adapter produces a stream of audio frames from a specific source (your microphone, a room mic, system audio from a call client, an audio file, etc.) and hands them to the rest of the pipeline. Everything downstream — segmentation, transcription, routing, sinks — is source-agnostic; the only place that knows what kind of input is coming in is the adapter itself.

This document is the contract. An implementation that conforms to it can be loaded by any version of the Vox core that supports capture/v1.


Scope

capture/v1 covers:

  • The lifecycle of an audio source (open, start, stop, close)
  • The wire format of audio frames
  • Configuration negotiation (sample rate, channel count, frame size)
  • Backpressure / drop policy at the adapter→consumer boundary
  • Disconnect / hot-swap behavior
  • Error reporting and permission errors
  • Source-type metadata so downstream stages can adjust behavior

capture/v1 does not cover:

  • Voice activity detection (segment/v1)
  • Speaker diarization (segment/v1)
  • Transcription (asr/v1)
  • Sample-rate conversion (handled in core, between adapter channel and consumer — see below)
  • Multi-source orchestration / session grouping (a higher layer; capture/v1 only exposes the SessionID correlation hook)
  • Mixing multiple sources into one stream

Adapters MUST do one thing: pull audio out of a source and emit frames. Anything more belongs in another surface.


Source kinds

Four source kinds are first-class. Adapters MUST declare which kind they implement, and the core uses the kind to set sensible downstream defaults (diarization on/off, default ASR backend, disconnect policy, etc.).

| Kind | Examples | Typical channel count | Typical sample rate |
|---|---|---|---|
| self | Default mic, USB headset, AirPods | 1 | 16 kHz or 48 kHz |
| in-person | Lapel, conference array, room mic, paired phone | 1–8 | 48 kHz |
| online | System audio (loopback) for remote-participant capture | 2 | 48 kHz |
| file | Pre-recorded audio (testing, batch ingest, replay) | any | any |

Adapters are single-source. An online adapter captures system-audio loopback only; the user's own mic comes from a parallel self adapter loaded alongside. The two streams are correlated by SessionID (see Frame format below). Adapters that internally mix multiple sources are out of contract for v1.

Custom kinds MAY be added in v2. For v1, adapters that don't fit one of the four MUST pick the closest match and document the discrepancy.


Frame format

A capture stream emits a sequence of frames through a bounded channel (see Transport). Each frame is a fixed-size buffer of audio samples with metadata.

Sample encoding

  • Samples are 32-bit IEEE-754 floats in the range [-1.0, +1.0] (f32, little-endian on the wire when serialized).
  • Adapters that read native int16 or int24 MUST convert to f32 before emitting. The cost of conversion belongs at the source edge so downstream code can assume a single format.
  • Adapters MAY support int16 mode for memory-constrained scenarios. That is opt-in via configuration and not the default.
  • The Encoding enum is reserved for additive future growth. Values beyond f32 and i16 (e.g., opus, aac, flac) are out of scope for v1 but explicitly anticipated; introducing them later requires no v1 breaking change.
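
As an illustration of the int16-to-f32 rule above, here is a minimal Go sketch. The function name Int16ToF32 is hypothetical, and the 1/32768 scale factor is an assumption: it is a common convention, but this contract does not fix the scale.

```go
package main

import "fmt"

// Int16ToF32 converts int16 PCM samples to float32.
// Assumption: scale by 1/32768 so the output lies in [-1.0, +1.0).
func Int16ToF32(in []int16) []float32 {
	out := make([]float32, len(in))
	for i, s := range in {
		out[i] = float32(s) / 32768.0
	}
	return out
}

func main() {
	fmt.Println(Int16ToF32([]int16{0, 16384, -32768}))
	// prints [0 0.5 -1]
}
```

An adapter would run a conversion like this once per device buffer, before slicing the result into frames.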

Channel layout

  • Channels are emitted interleaved (frame[0]=ch0, frame[1]=ch1, frame[2]=ch0, ...).
  • Channel ordering follows WAV / FFmpeg conventions: front-left, front-right, center, low-frequency, back-left, back-right, side-left, side-right.
  • For self sources, mono (1 channel) is the default and recommended.
  • For online sources, stereo (2 channels) is the default — loopback capture is typically stereo system audio. Both channels represent remote-participant audio; the user's voice is NOT mixed in.
  • For in-person sources with multiple mics, channel count equals mic count. Spatial / array math (if any) is NOT done in capture/v1; emit the raw channels.
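
The interleaved layout means sample i of channel c sits at index i*Channels + c. A consumer that needs per-channel buffers could split a frame roughly as follows; Deinterleave is a hypothetical helper, not part of the contract.

```go
package main

import "fmt"

// Deinterleave splits an interleaved buffer into one slice per channel.
// Sample i of channel c lives at index i*channels + c.
// It returns nil if the buffer length is not a multiple of channels.
func Deinterleave(samples []float32, channels int) [][]float32 {
	if channels <= 0 || len(samples)%channels != 0 {
		return nil
	}
	perChannel := len(samples) / channels
	out := make([][]float32, channels)
	for c := range out {
		out[c] = make([]float32, perChannel)
		for i := 0; i < perChannel; i++ {
			out[c][i] = samples[i*channels+c]
		}
	}
	return out
}

func main() {
	// Stereo input ordered L0 R0 L1 R1.
	fmt.Println(Deinterleave([]float32{0.1, -0.1, 0.2, -0.2}, 2))
}
```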

Frame structure

Frame {
  StreamID      string        # uuid4, stable for the lifetime of the stream;
                              # re-issued on device change (see Hot-swap)
  SessionID     string        # optional; correlation hook for multi-adapter
                              # session grouping (see Multi-stream). Empty if
                              # the stream isn't part of a session group.
  Sequence      uint64        # monotonically increasing from 0
  CapturedAt    Timestamp     # wall-clock at the start of this frame
  Duration      Duration      # frame duration (derived from samples + rate, included for convenience)

  SampleRate    uint32        # Hz; constant for the lifetime of the stream
  Channels      uint8         # constant for the lifetime of the stream
  Encoding      Encoding      # "f32" (default) or "i16"

  Samples       []f32 or []i16  # interleaved; length == FrameSize * Channels

  # Special markers (mutually exclusive with Samples)
  IsDropMarker    bool        # true when this frame represents a gap from dropped frames
  DroppedCount    uint32      # number of frames dropped before this one (if IsDropMarker)
  IsDeviceMarker  bool        # true when a transient device disconnect was recovered
  DeviceMarkerReason string   # e.g., "transient_reconnect"
}

StreamID is assigned by the adapter when the stream is opened (it is returned in OpenResult) and stays constant until either a clean stream end or a fallback-to-different-device event (see Hot-swap). SessionID is set by the orchestrator when the adapter is opened as part of a multi-adapter session group; it is empty otherwise. Sequence starts at 0 for the first frame and increments by one per captured frame, including frames that are subsequently dropped, so consumers can detect drops by sequence gaps. CapturedAt is the adapter's best estimate of the wall-clock time of the first sample in the frame.
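
A consumer-side gap check can be sketched as below. GapDetector is a hypothetical type, the Frame here carries only the two fields the example needs, and the sketch assumes that dropped frames still consume sequence numbers.

```go
package main

import "fmt"

// Frame carries only the fields needed for this illustration.
type Frame struct {
	StreamID string
	Sequence uint64
}

// GapDetector tracks the next expected Sequence for one stream.
type GapDetector struct {
	streamID string
	next     uint64
	started  bool
}

// Observe returns how many frames are missing before f.
// A new StreamID resets tracking, since Sequence restarts at 0.
func (g *GapDetector) Observe(f Frame) uint64 {
	if !g.started || f.StreamID != g.streamID {
		g.streamID, g.next, g.started = f.StreamID, 0, true
	}
	var missing uint64
	if f.Sequence > g.next {
		missing = f.Sequence - g.next
	}
	g.next = f.Sequence + 1
	return missing
}

func main() {
	var g GapDetector
	for _, f := range []Frame{{"a", 0}, {"a", 1}, {"a", 4}, {"b", 0}} {
		fmt.Printf("stream=%s seq=%d missing=%d\n", f.StreamID, f.Sequence, g.Observe(f))
	}
}
```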

Frame size

The frame size (samples per channel per frame) is negotiated at Open(). The consumer requests a preferred frame size; the adapter MAY honor it exactly, or MAY round to the nearest device-natural value and return the actual size to the consumer.

  • Default request: 20 ms worth of samples (320 samples at 16 kHz, 960 at 48 kHz). 20 ms is one of the frame lengths WebRTC VAD accepts (10, 20, or 30 ms) and suits most low-latency ASR backends.
  • Adapters MUST emit frames at a steady cadence matching the negotiated frame size, modulo unavoidable device jitter.
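
The default frame sizes quoted above follow from simple arithmetic: samples per channel = sample rate × frame duration. A small sketch, with SamplesPerFrame as a hypothetical helper name:

```go
package main

import "fmt"

// SamplesPerFrame returns samples per channel for a frame of frameMs
// milliseconds at sampleRate Hz (integer division truncates).
func SamplesPerFrame(sampleRate, frameMs uint32) uint32 {
	return sampleRate * frameMs / 1000
}

func main() {
	fmt.Println(SamplesPerFrame(16000, 20)) // prints 320
	fmt.Println(SamplesPerFrame(48000, 20)) // prints 960
}
```

A consumer could use a helper like this to fill FrameSizeHint; the adapter may still round to a device-natural value and report the actual size.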

Time base

CapturedAt is wall-clock from the host system. Adapters MUST NOT back-correct timestamps for buffering or processing delay observed inside the adapter — the consumer assumes the timestamp is the source-side capture moment, not the moment the frame arrived in user space.

If the adapter cannot get a precise capture time from the OS (some platforms don't expose it cleanly), it MAY use the wall-clock at the moment the frame was assembled in user space and document that limitation.


Transport — bounded channel, adapter writes, consumer reads

Frames flow from adapter to consumer through a bounded channel (a Go channel in the reference binding; an equivalent bounded queue in other languages). The adapter owns the writer side; the consumer owns the reader side.

Adapter.Start(ctx, frames chan<- Frame, errs chan<- Error) -> Error

The adapter MUST:

  • Write each captured frame to frames using a non-blocking try-write.
  • Close frames on graceful Stop().
  • Apply the configured drop policy (below) when the channel is full.

The consumer MUST:

  • Read from frames continuously.
  • Treat closed channel as graceful end-of-stream.
  • Surface drop markers and device markers to its own downstream stages.
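
In Go, the non-blocking try-write and the consumer's end-of-stream handling could look roughly like this. tryWrite is a hypothetical helper, and the tiny channel capacity is chosen only to force the full-channel path.

```go
package main

import "fmt"

// Frame carries only the field needed for this illustration.
type Frame struct{ Sequence uint64 }

// tryWrite attempts a non-blocking send.
// It reports false when the channel is full.
func tryWrite(frames chan<- Frame, f Frame) bool {
	select {
	case frames <- f:
		return true
	default:
		return false
	}
}

func main() {
	frames := make(chan Frame, 2) // tiny capacity to force a full channel
	for seq := uint64(0); seq < 3; seq++ {
		if !tryWrite(frames, Frame{Sequence: seq}) {
			fmt.Println("channel full, drop policy applies to frame", seq)
		}
	}
	close(frames) // what a graceful Stop() would do

	// The range loop exits once the channel is closed and drained.
	for f := range frames {
		fmt.Println("consumed frame", f.Sequence)
	}
}
```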

Why a channel, not a callback

  • Decouples capture pace from consumer pace. Capture is real-time; the rest of the pipeline (VAD, ASR, routing, sinks) is variable-latency. Forcing the adapter thread to wait on consumer work is the wrong coupling.
  • Cross-language FFI hygiene. Function-pointer callbacks across CGo (CoreAudio / WASAPI / PipeWire bindings) are a known pain point — context loss, GC pinning, deadlock risk. A channel boundary keeps FFI complexity contained in one short writer function.
  • Backpressure is policy, not code. Channel-full behavior is selectable at config time (see below).

Channel capacity (default: 64 frames)

  • Default: 64 frames (1.28 seconds of audio at 20 ms per frame, regardless of sample rate or channel count).
  • Rationale: longer than any reasonable GC pause or OS scheduler hiccup, but short enough that a real problem surfaces within ~2 seconds rather than being silently absorbed.
  • Override: capture.buffer_frames: <int>.
  • file source kind defaults higher (256) because there's no real-time constraint and faster drain is desirable.

Backpressure / drop policy

The contract: never block the adapter; drop a frame instead, and surface the drop loudly so the engineering effort can go into preventing it.

Drop policy when channel is full (default: drop-newest)

| Policy | Behavior | Use case |
|---|---|---|
| drop-newest (default) | Don't enqueue the new frame; record the gap in Stats; emit a DropMarker frame after the gap | Recommended for ASR: the queued frames are about to be processed; losing the next word is visible and recoverable |
| drop-oldest | Pop one frame off the channel, enqueue the new one | Real-time monitoring where freshness > completeness |
| drop-window | Drop a contiguous batch of frames, emit a single marker | Aggressive recovery; experimental |
| block | Block the adapter until space frees up | Diagnostic only; violates the "never block" contract; do not use in production |

Configurable via capture.drop_policy.

Drop markers

When a drop occurs, the adapter MUST enqueue a synthetic Frame with IsDropMarker: true and DroppedCount: k after the dropped range. This makes the discontinuity explicit to downstream consumers without forcing them to poll Stats().
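
One way the drop-newest policy and its marker could fit together is sketched below. Writer is a hypothetical type, and deferring the marker until the channel has room again is an assumption of this sketch. The sketch also leaves the marker's Sequence at zero, because the contract does not say how markers are numbered.

```go
package main

import "fmt"

// Frame carries only the fields needed for this illustration.
type Frame struct {
	Sequence     uint64
	IsDropMarker bool
	DroppedCount uint32
}

// Writer applies drop-newest: a frame that does not fit is discarded and
// counted, and a marker is enqueued as soon as there is room again.
type Writer struct {
	frames  chan<- Frame
	pending uint32 // frames dropped since the last delivered marker
}

// Write never blocks. It delivers any pending marker before f.
func (w *Writer) Write(f Frame) {
	if w.pending > 0 {
		select {
		case w.frames <- Frame{IsDropMarker: true, DroppedCount: w.pending}:
			w.pending = 0
		default:
			w.pending++ // still full, so f joins the dropped range
			return
		}
	}
	select {
	case w.frames <- f:
	default:
		w.pending++
	}
}

func main() {
	ch := make(chan Frame, 2)
	w := &Writer{frames: ch}
	for seq := uint64(0); seq < 4; seq++ {
		w.Write(Frame{Sequence: seq}) // frames 2 and 3 are dropped
	}
	<-ch
	<-ch                         // consumer catches up
	w.Write(Frame{Sequence: 4}) // marker first, then frame 4
	fmt.Printf("%+v\n%+v\n", <-ch, <-ch)
}
```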

Drop telemetry — four levels

| Level | Mechanism | Default behavior |
|---|---|---|
| 1. Counters | Adapter.Stats(): running counters (frames_emitted, frames_dropped, drop_events, last_drop_at, longest_drop_burst) | Always on; lock-free; pollable at any time |
| 2. Logs | Structured log line per drop event (not per dropped frame): stream_id, sequence range, duration | WARN level by default |
| 3. Audit | audit/v1 event on drop bursts > N frames | Emitted when audit/v1 is loaded |
| 4. Loud escalation | Drop rate exceeds threshold over a window → callback, ERROR log, optional UI indicator | Default trigger: > 1% of frames over 60 s (both configurable) |

Engineering effort goes into preventing drops; telemetry exists to ensure drops can't hide.
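
The level 1 counters and the level 4 threshold check might be sketched as follows. Counters, Snapshot, and DropRatePct are hypothetical names. The sketch computes the rate over the lifetime totals rather than a sliding 60 s window, which is a deliberate simplification.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Stats is a point-in-time copy of the counters.
type Stats struct {
	FramesEmitted, FramesDropped, DropEvents uint64
}

// Counters holds lock-free running totals (atomic.Uint64 needs Go 1.19+).
type Counters struct {
	emitted, dropped, events atomic.Uint64
}

// RecordEmit counts one successfully enqueued frame.
func (c *Counters) RecordEmit() { c.emitted.Add(1) }

// RecordDrop records one drop event covering n frames.
func (c *Counters) RecordDrop(n uint64) {
	c.dropped.Add(n)
	c.events.Add(1)
}

// Snapshot is safe to call from any goroutine.
func (c *Counters) Snapshot() Stats {
	return Stats{c.emitted.Load(), c.dropped.Load(), c.events.Load()}
}

// DropRatePct is dropped / (emitted + dropped), as a percentage.
func DropRatePct(s Stats) float64 {
	total := s.FramesEmitted + s.FramesDropped
	if total == 0 {
		return 0
	}
	return 100 * float64(s.FramesDropped) / float64(total)
}

func main() {
	var c Counters
	for i := 0; i < 95; i++ {
		c.RecordEmit()
	}
	c.RecordDrop(5) // one burst of five frames
	s := c.Snapshot()
	fmt.Printf("%+v rate=%.1f%% escalate=%v\n", s, DropRatePct(s), DropRatePct(s) > 1.0)
}
```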


Sample-rate conversion — in core, not in adapters

Resampling lives between the adapter channel and the consumer. Adapters do NOT resample.

Rules

  • Adapter declares its native sample rates in Capabilities.SampleRates.
  • Core opens the adapter at the closest native rate to what was requested (not necessarily the requested rate itself). E.g., if the consumer wants 16 kHz and the device natively supports {48, 44.1, 16} kHz, the adapter opens at 16 kHz and no conversion is needed.
  • If consumer's requested rate is not natively supported, the core's resampler converts the adapter's native-rate output to the requested rate.
  • No-op fast path: when requested == native, the resampler is a zero-copy pass-through. No allocation, no math.
  • Per-consumer subscription. Multiple consumers may subscribe to the same adapter stream at different rates; the core resampler handles each.
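
The rate-selection rule can be sketched as a small function. SelectNativeRate is a hypothetical name, and breaking ties in favor of the higher rate is an assumption: the rules above only say "closest".

```go
package main

import "fmt"

func absDiff(a, b uint32) uint32 {
	if a > b {
		return a - b
	}
	return b - a
}

// SelectNativeRate picks the native rate closest to requested and reports
// whether the core resampler is needed. Ties go to the higher rate
// (an assumption of this sketch). It returns 0 if native is empty.
func SelectNativeRate(native []uint32, requested uint32) (rate uint32, resample bool) {
	if len(native) == 0 {
		return 0, false
	}
	best := native[0]
	for _, r := range native[1:] {
		d, bd := absDiff(r, requested), absDiff(best, requested)
		if d < bd || (d == bd && r > best) {
			best = r
		}
	}
	return best, best != requested
}

func main() {
	fmt.Println(SelectNativeRate([]uint32{48000, 44100, 16000}, 16000)) // prints 16000 false
	fmt.Println(SelectNativeRate([]uint32{48000, 44100}, 16000))        // prints 44100 true
}
```

When resample is false, the core can take the zero-copy pass-through path described above.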

Library choice (v1)

  • Pure-Go resampler (no CGo) to preserve the single-static-binary story across macOS / Windows / Linux on amd64 (Intel) and arm64 (including Apple Silicon).
  • Quality target: equivalent to libsamplerate's SRC_SINC_FASTEST converter for speech.
  • If quality benchmarks show real degradation on speech vs. libsamplerate, ship a CGo fallback via build tag (not the default).

Multi-stream — one adapter per source, SessionID for correlation

Each adapter produces exactly one stream. Multi-source scenarios (online call = system audio + your mic; hybrid meeting = remote dial-in + in-room lapels) are handled by loading multiple adapters, correlated via SessionID on each Frame.

Why two adapters instead of one multi-channel stream

  • Uniform pipeline: every stream is a stream. No source-kind-dependent channel-layout conventions for consumers to encode.
  • Cross-platform parity: Windows (WASAPI loopback) and Linux (PipeWire monitor) already require two captures regardless of approach. Two adapters matches reality on every platform.
  • Per-stream routing is one config away — your voice → fast local ASR; remote voices → accurate cloud ASR. This is the product (intent routing); multi-channel single-stream would force a split downstream anyway.
  • Generalizes to N sources (hybrid meetings, multi-mic rooms).
  • Independent drop accounting per source.
  • Per-stream ASR backend selection falls out for free.

Correlation

The orchestrator (a higher-level surface, out of scope for capture/v1) opens both adapters together with the same SessionID. Each frame carries that SessionID. Downstream stages correlate by it. Time-base alignment between streams uses CapturedAt wall-clock; sub-millisecond drift between two OS audio clocks is acceptable for ASR / intent routing.


Hot-swap and disconnect — pause-then-fallback by default

When a device disappears mid-stream (USB unplug, Bluetooth drop, OS device removal), behavior is pause-then-fallback by default, with per-source-kind overrides.

Default policy per source kind

| Source kind | Default policy | Reasoning |
|---|---|---|
| self | pause-fallback (10 s wait → default device → hard fail) | Dictation should survive a headset reconnect; if the cable's truly gone, falling back to the laptop mic keeps the user productive |
| in-person | pause-fallback (same) | Default device is usually an acceptable fallback |
| online | pause-only (10 s wait → hard fail, no fallback) | "Fallback to default" doesn't make sense for system-audio loopback: silently capturing the wrong source is worse than failing |
| file | hard-fail | File disconnect = something broken (disk error); don't recover |

All configurable: capture.disconnect_policy: pause-fallback | pause-only | hard-fail | auto-restart and capture.disconnect_timeout: <duration> (default 10 s).
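
Resolving the effective policy from the per-kind defaults might look like the sketch below. EffectivePolicy and the constant names are hypothetical, and treating an unknown kind as hard-fail is an assumption. The auto-restart value is omitted because its behavior is not described here.

```go
package main

import "fmt"

type SourceKind string
type DisconnectPolicy string

const (
	KindSelf     SourceKind = "self"
	KindInPerson SourceKind = "in-person"
	KindOnline   SourceKind = "online"
	KindFile     SourceKind = "file"

	PauseFallback DisconnectPolicy = "pause-fallback"
	PauseOnly     DisconnectPolicy = "pause-only"
	HardFail      DisconnectPolicy = "hard-fail"
)

// EffectivePolicy returns the configured policy, or the per-kind default
// when the configured value is empty.
func EffectivePolicy(kind SourceKind, configured DisconnectPolicy) DisconnectPolicy {
	if configured != "" {
		return configured
	}
	switch kind {
	case KindSelf, KindInPerson:
		return PauseFallback
	case KindOnline:
		return PauseOnly
	default:
		return HardFail // file, and (as an assumption) any unknown kind
	}
}

func main() {
	fmt.Println(EffectivePolicy(KindOnline, ""))     // prints pause-only
	fmt.Println(EffectivePolicy(KindSelf, HardFail)) // prints hard-fail
}
```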

Stream identity on disconnect

| Event | StreamID | Sequence | Action |
|---|---|---|---|
| Transient disconnect → same device returns within timeout | Same | Continues with gap | Emit Frame{IsDeviceMarker: true, DeviceMarkerReason: "transient_reconnect"} |
| Fallback to different device | New | Restarts at 0 | Close old channel; new channel with new StreamID; orchestrator stitches via SessionID |
| Timeout expired, no fallback available | StreamID closes | | Close channel; return ErrDeviceNotFound via the async error path |

Device change == new StreamID because downstream caches (ASR acoustic baseline, speaker embeddings, VAD threshold) are stream-scoped and become stale on device change. A new StreamID forces a clean reset.

Telemetry — four event types

Every disconnect emits a structured event (same telemetry path as drop events):

  • capture.device_disconnected — adapter detected loss
  • capture.device_reconnected — same device returned within timeout
  • capture.device_changed — fell back to a different device (new StreamID)
  • capture.device_failed — timeout expired, no fallback, stream closed

Default log level for device_changed: WARN (silent fallback would be the wrong UX — the user must know their mic switched).


Adapter interface

The interface is presented in language-neutral pseudocode. The reference binding will be Go (the open core is Go); bindings for other languages are derivative.

Adapter {
  // Identity ----------------------------------------------------------------
  Kind()          -> SourceKind         # "self" | "in-person" | "online" | "file"
  Name()          -> string             # adapter identifier (e.g., "coreaudio", "wasapi")
  Capabilities()  -> Capabilities       # what this adapter supports

  // Lifecycle ---------------------------------------------------------------
  Open(req: OpenRequest)              -> OpenResult | Error
  Start(ctx: Context, frames: chan<- Frame, errs: chan<- Error) -> Error
  Pause()                             -> Error     # optional; capabilities.Pausable
  Resume()                            -> Error     # optional; capabilities.Pausable
  Stop()                              -> Error     # closes `frames` channel
  Close()                             -> Error

  // Diagnostics -------------------------------------------------------------
  Stats()         -> Stats              # frames_emitted, frames_dropped, drop_events,
                                        # last_drop_at, longest_drop_burst,
                                        # device_disconnect_events, etc.
}

Capabilities {
  SampleRates   []uint32   # rates the adapter can produce natively
  ChannelCounts []uint8    # channel counts supported
  Encodings     []Encoding # always includes "f32"; MAY include "i16"
  Pausable      bool       # supports Pause/Resume mid-stream
  PTT           bool       # supports push-to-talk (silence while the key is released, audio while held)
  DeviceList    []DeviceInfo  # available devices (for kinds where this is meaningful)
  HotSwap       bool       # supports pause-then-resume on device reconnect
}

OpenRequest {
  DeviceID       string         # optional; specific device from Capabilities.DeviceList
  SampleRate     uint32         # requested rate; core selects nearest native at Open(),
                                # then resamples in core if needed
  Channels       uint8          # requested channel count
  Encoding       Encoding       # requested encoding; "f32" default
  FrameSizeHint  uint32         # preferred samples per channel per frame (e.g., 320 = 20ms@16k)
  Mode           CaptureMode    # "always-on" | "push-to-talk"
  PTTKey         string         # for "push-to-talk" mode; platform-specific
  SessionID      string         # optional; correlation hook stamped onto every emitted Frame

  # Backpressure + disconnect policy (defaults from source kind)
  BufferFrames     uint32          # channel capacity (default 64)
  DropPolicy       DropPolicy      # "drop-newest" (default) | "drop-oldest" | "drop-window" | "block"
  DisconnectPolicy DisconnectPolicy # "pause-fallback" | "pause-only" | "hard-fail" | "auto-restart"
  DisconnectTimeout Duration       # default 10s
}

OpenResult {
  StreamID       string         # uuid4 for this stream
  SampleRate     uint32         # actual native rate the adapter opened at
  Channels       uint8          # actual
  Encoding       Encoding       # actual
  FrameSize      uint32         # actual samples per channel per frame
  DeviceID       string         # actual device chosen
  DeviceName     string         # human-readable device name
}

Lifecycle state machine

                  Open()              Start(ctx, frames, errs)
   [Closed]  --------------->  [Opened]  ----------------> [Running]
       ^                          |                              |
       |                          |                              | Pause()
       |                       Close()                           v
       |                          |                          [Paused]
       |                          v                              |
       +--- Close() -----------[Closing] <--- Stop() ----+       |
                                                         |   Resume()
                                                         |       |
                                                       [Running] <+

Calling a method out of order is a programming error; the adapter MUST return an explicit ErrInvalidState rather than crash. Close() is idempotent.
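
A table-driven guard is one way to enforce these transitions. The sketch below encodes one reading of the diagram, and that reading is an assumption: Close() from Opened is collapsed straight to Closed, and Stop() from Paused is rejected only because the diagram does not draw it. State, Next, and the string operation names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

type State int

const (
	Closed State = iota
	Opened
	Running
	Paused
	Closing
)

var ErrInvalidState = errors.New("capture: invalid state")

type transition struct {
	from State
	op   string
}

// transitions lists every legal (state, operation) pair in this sketch.
var transitions = map[transition]State{
	{Closed, "Open"}:   Opened,
	{Opened, "Start"}:  Running,
	{Running, "Pause"}: Paused,
	{Paused, "Resume"}: Running,
	{Running, "Stop"}:  Closing,
	{Opened, "Close"}:  Closed,
	{Closing, "Close"}: Closed,
	{Closed, "Close"}:  Closed, // Close() is idempotent
}

// Next returns the state after op, or ErrInvalidState for an illegal call.
func Next(from State, op string) (State, error) {
	if to, ok := transitions[transition{from, op}]; ok {
		return to, nil
	}
	return from, ErrInvalidState
}

func main() {
	s := Closed
	for _, op := range []string{"Open", "Start", "Pause", "Resume", "Stop", "Close", "Close"} {
		next, err := Next(s, op)
		fmt.Println(op, next, err)
		s = next
	}
	_, err := Next(Closed, "Start")
	fmt.Println(errors.Is(err, ErrInvalidState)) // prints true
}
```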

Concurrency

  • Open, Start, Stop, Close, Pause, Resume are called from a single control thread. Adapters MAY assume serialization.
  • Frames are written to the frames channel from a single adapter-owned thread. Adapters MUST NOT write from multiple threads concurrently.
  • Async errors (e.g., device disconnect after Start()) are written to the errs channel from the same adapter-owned thread.
  • Stats() is safe to call from any thread at any time.

Configuration schema

Configuration is per-adapter; the core just routes the values through. Each adapter MUST publish its schema. A common skeleton:

capture:
  adapter: coreaudio          # or "wasapi", "pipewire", "loopback-zoom", etc.
  source_kind: self           # "self" | "in-person" | "online" | "file"
  device_id: ""               # empty = default device
  sample_rate: 16000          # 0 = adapter default
  channels: 1                 # 0 = adapter default
  encoding: f32               # "f32" | "i16"
  frame_size_hint: 0          # 0 = let adapter pick (~20ms)
  mode: always-on             # or "push-to-talk"
  ptt_key: ""                 # required if mode=push-to-talk; platform-specific
  session_id: ""              # set by orchestrator for multi-adapter sessions

  # Backpressure
  buffer_frames: 64           # channel capacity
  drop_policy: drop-newest    # drop-newest | drop-oldest | drop-window | block
  drop_alert_threshold_pct: 1.0   # loud-escalation trigger
  drop_alert_window_sec: 60

  # Disconnect
  disconnect_policy: pause-fallback  # pause-fallback | pause-only | hard-fail | auto-restart
  disconnect_timeout: 10s

  # adapter-specific knobs go here:
  options:
    foo: bar

The core validates the generic fields. Anything under options is passed through to the adapter verbatim.


Error model

All errors are typed:

Error {
  Kind      ErrorKind     # see below
  Adapter   string        # adapter name
  Op        string        # which method was being called
  Message   string        # human-readable, no PII
  Cause     Error?        # optional wrapped cause
}

ErrorKind {
  ErrInvalidState        # called out of order
  ErrInvalidConfig       # config rejected at Open()
  ErrUnsupported         # capability requested that this adapter doesn't have
  ErrDeviceNotFound      # named device not available
  ErrDeviceBusy          # device exists but is in use by another process
  ErrDeviceDisconnected  # device disappeared after Start() (async error channel)
  ErrPermissionDenied    # OS-level permission missing (see Permissions below)
  ErrPlatformUnsupported # adapter cannot run on this OS / OS version
  ErrIO                  # transient device error
  ErrInternal            # bug in the adapter; the consumer SHOULD restart it
}

ErrPermissionDenied MUST include a Cause or message that names the specific OS permission missing (e.g., "missing macOS microphone privacy permission", "missing Windows microphone privacy setting"). The core surfaces this directly to the user so they can fix it.

ErrDeviceDisconnected is the async error emitted on the errs channel when a device disappears after Start(). The disconnect-policy machinery then decides whether to pause-and-wait, fall back, or hard-fail.
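
In Go, the typed error could be sketched as a struct that implements the error interface and participates in wrapping. The field names follow the pseudocode above; KindOf is a hypothetical helper, and representing ErrorKind as a string is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

type ErrorKind string

const (
	ErrPermissionDenied   ErrorKind = "ErrPermissionDenied"
	ErrDeviceDisconnected ErrorKind = "ErrDeviceDisconnected"
)

// Error mirrors the contract's Error record.
type Error struct {
	Kind    ErrorKind
	Adapter string
	Op      string
	Message string
	Cause   error // optional wrapped cause
}

func (e *Error) Error() string {
	return fmt.Sprintf("%s: %s: %s: %s", e.Adapter, e.Op, e.Kind, e.Message)
}

// Unwrap exposes Cause to errors.Is and errors.As.
func (e *Error) Unwrap() error { return e.Cause }

// KindOf extracts the ErrorKind from anywhere in a wrapped error chain.
func KindOf(err error) (ErrorKind, bool) {
	var ce *Error
	if errors.As(err, &ce) {
		return ce.Kind, true
	}
	return "", false
}

func main() {
	inner := &Error{
		Kind:    ErrPermissionDenied,
		Adapter: "coreaudio",
		Op:      "Open",
		Message: "missing macOS microphone privacy permission",
	}
	err := fmt.Errorf("session start failed: %w", inner)
	kind, ok := KindOf(err)
	fmt.Println(kind, ok) // prints ErrPermissionDenied true
}
```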


Permissions

Capture is a permission-sensitive surface. Adapters MUST declare which OS permissions they require and check them eagerly at Open().

| OS | Permission | Required by |
|---|---|---|
| macOS | Microphone (Privacy & Security) | Any local mic adapter |
| macOS | Screen Recording | System-audio (online) adapters using ScreenCaptureKit |
| Windows | Microphone privacy setting | Any local mic adapter |
| Windows | (none specific) | WASAPI loopback |
| Linux | PipeWire / PulseAudio access | Any local mic adapter |
| Linux | PipeWire monitor source | System-audio (online) |

If a permission is missing, Open() MUST return ErrPermissionDenied. The adapter MUST NOT block waiting for the user to grant permission — the host application is responsible for prompting.


Per-source guidance

self — your own voice

Recommended adapters:

| OS | Adapter | Notes |
|---|---|---|
| macOS | coreaudio | CoreAudio HAL; works on Intel + Apple Silicon |
| Windows | wasapi | WASAPI shared mode for low latency |
| Linux | pipewire | PipeWire is the modern default; PulseAudio fallback |

Defaults: mono, 16 kHz, f32, 20 ms frame size, always-on, disconnect_policy: pause-fallback. Push-to-talk adapters MAY ship later; the contract supports it via Mode = push-to-talk and PTTKey.

in-person — meetings, 1:1s, room audio

Same OS adapters as self, but typically with a different device chosen (lapel mic, conference array, paired phone). Channel count > 1 is common. Defaults: 1–4 channels (device-determined), 48 kHz, f32, disconnect_policy: pause-fallback. Downstream diarization (in segment/v1) does the speaker math.

online — Zoom, Meet, Teams, Discord (remote-participant audio only)

The online source captures system audio loopback — what's coming out of your speakers / headphones, where the remote participants' voices are. The user's own voice is captured by a parallel self adapter; the two are correlated via SessionID.

| OS | Adapter | Mechanism |
|---|---|---|
| macOS | screencapturekit-audio | macOS 13+ ScreenCaptureKit audio-only capture |
| macOS | blackhole-loopback | Fallback for older macOS or when SCK is unavailable; requires user-installed BlackHole driver |
| Windows | wasapi-loopback | WASAPI loopback (built into Windows) |
| Linux | pipewire-monitor | PipeWire monitor source on the default sink |

Defaults: 2 channels (system audio is typically stereo), 48 kHz, f32, disconnect_policy: pause-only (no fallback).

file — recorded audio

A file adapter reads WAV / FLAC / Ogg / MP3 / Opus / WebM (subset of formats per adapter implementation) and emits frames. Decoding happens inside the adapter at the file boundary; the pipeline still sees PCM frames.

pace_mode is a first-class config on file adapters:

| pace_mode | Behavior | Use case |
|---|---|---|
| realtime (default) | Sleep between frames to maintain the file's natural duration cadence | Integration tests; mimics live capture timing |
| asap | Stream frames as fast as the consumer can drain | Batch processing; bulk re-transcription; CI |
| accelerated | Configurable speedup factor (2x, 4x, 10x) | Long-meeting replay during development |

Defaults: match the file's native format, pace_mode: realtime, buffer_frames: 256, disconnect_policy: hard-fail.
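
The three pace modes differ only in how long the adapter waits between frames. A minimal sketch of that calculation follows; FrameInterval and its speedup parameter are hypothetical names, and falling back to the natural cadence when the factor is invalid is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// FrameInterval returns how long a file adapter would wait between frames.
// speedup is only consulted for "accelerated".
func FrameInterval(paceMode string, frameSize, sampleRate uint32, speedup float64) time.Duration {
	if sampleRate == 0 {
		return 0
	}
	// Natural cadence: frameSize samples per channel at sampleRate Hz.
	natural := time.Duration(uint64(frameSize) * uint64(time.Second) / uint64(sampleRate))
	switch paceMode {
	case "asap":
		return 0
	case "accelerated":
		if speedup > 0 {
			return time.Duration(float64(natural) / speedup)
		}
	}
	return natural // "realtime", or "accelerated" without a valid factor
}

func main() {
	fmt.Println(FrameInterval("realtime", 320, 16000, 0))    // prints 20ms
	fmt.Println(FrameInterval("accelerated", 320, 16000, 4)) // prints 5ms
	fmt.Println(FrameInterval("asap", 320, 16000, 0))        // prints 0s
}
```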


Loader registration

Adapters are loaded at startup based on the capture.adapter config value. Each adapter ships with a registration function:

RegisterCaptureAdapter(name: string, factory: () -> Adapter)

In Go, registration happens in package init() functions; other languages use their equivalent. The core maintains a single registry; duplicate names panic at startup (intentional).

Enterprise plugins register the same way against the same registry — the core does not distinguish open vs. enterprise adapters at the loader level.
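
A minimal Go sketch of such a registry is shown below. The Adapter interface is cut down to Name() for brevity, and Lookup, the mutex, and the fileWav stub are hypothetical; only RegisterCaptureAdapter and the panic-on-duplicate behavior come from the text above.

```go
package main

import (
	"fmt"
	"sync"
)

// Adapter is reduced to one method for this illustration.
type Adapter interface{ Name() string }

var (
	mu       sync.Mutex
	registry = map[string]func() Adapter{}
)

// RegisterCaptureAdapter adds a factory and panics on a duplicate name.
func RegisterCaptureAdapter(name string, factory func() Adapter) {
	mu.Lock()
	defer mu.Unlock()
	if _, dup := registry[name]; dup {
		panic(fmt.Sprintf("capture: adapter %q registered twice", name))
	}
	registry[name] = factory
}

// Lookup builds a fresh adapter for the given capture.adapter value.
func Lookup(name string) (Adapter, bool) {
	mu.Lock()
	defer mu.Unlock()
	factory, ok := registry[name]
	if !ok {
		return nil, false
	}
	return factory(), true
}

// fileWav is a stub standing in for a real adapter.
type fileWav struct{}

func (fileWav) Name() string { return "file-wav" }

// Each adapter package would register itself from its own init().
func init() {
	RegisterCaptureAdapter("file-wav", func() Adapter { return fileWav{} })
}

func main() {
	if a, ok := Lookup("file-wav"); ok {
		fmt.Println("loaded", a.Name())
	}
}
```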


Versioning and stability

capture/v1 is the contract above. Once frozen:

  • Non-breaking changes (allowed in v1.x): adding optional fields with sensible defaults to OpenRequest, OpenResult, Capabilities, Stats, or Frame; adding new ErrorKind values; adding new Encoding values (e.g., opus, flac); adding new source kinds via the "custom kind" escape hatch.
  • Breaking changes (require v2): removing or renaming any existing field or method; changing the meaning of an existing field; changing Frame.Encoding semantics for existing values; changing channel layout convention.

The core supports one vN of capture/ at a time, with overlap during migrations. Adapters declare which version they target via their Name() return value or a parallel SupportedVersions() method (TBD before freeze).


Reference implementations (planned)

| Order | Adapter | Source kind | OS | Why |
|---|---|---|---|---|
| 1 | file-wav | file | all | First. Unblocks tests for every downstream surface with deterministic, reproducible input. Required for CI without audio hardware. Supports pace_mode to exercise real-time pipeline timing without a live mic. |
| 2 | coreaudio | self | macOS | First live-capture adapter; primary dev platform. |
| 3 | wasapi | self | Windows | Second live-capture; verifies cross-platform contract. |
| 4 | pipewire | self | Linux | Third live-capture; closes Linux tier-1 support. |
| 5 | screencapturekit-audio | online | macOS 13+ | First online adapter; pairs with coreaudio via SessionID. |
| 6 | wasapi-loopback | online | Windows | Cross-platform parity. |
| 7 | pipewire-monitor | online | Linux | Cross-platform parity. |

The build order matches the contract drafting order: get a file substrate working end-to-end with the full pipeline first, then add live capture once the downstream stages are stable.


Project principle: opinionated defaults, every default configurable

Throughout this contract, every behavior with a defensible default (buffer_frames: 64, drop_policy: drop-newest, disconnect_timeout: 10s, etc.) is exposed as a config knob. The defaults reflect a considered recommendation for the typical voice-to-LLM use case; the knobs exist so specialized workflows can tune them.

This principle applies to all future Vox extension surfaces.