tts/v1 — Text-to-Speech Backends

Status: Draft (design-locked, ready for first implementation) · Stability: v1 will be frozen with the first reference backend (tts-say) · Implementations: in-tree only until v1 is frozen.

The tts/v1 surface is the output side of the voice loop. A TTS backend takes text and produces audio bytes. Vox uses this to speak LLM responses, meeting summaries, action confirmations — closing the conversational loop from voice-in to voice-out.

tts/v1 is the sixth pipeline contract, alongside capture/v1, segment/v1, asr/v1, router/v1, sink/v1. It's an output-side peer of the input-side asr/v1.

This document is the contract. Backends conforming to it can be loaded by any version of the Vox core that supports tts/v1.


Scope

tts/v1 covers:

  • The backend interface (offline + streaming synthesis)
  • Output destination model — backends emit bytes; callers choose destination
  • Voice discovery and selection (backend-native IDs, no contract-level normalization)
  • Streaming chunk shape and cadence
  • BYOK authentication + shared credentials
  • Cost controls (budgets, rate limits, pre-flight estimation)
  • Fallback chain for backend failure
  • Per-envelope cost transparency
  • Capabilities advertisement
  • Error model + lifecycle
  • Versioning and stability rules

tts/v1 does not cover:

  • Audio playback to speakers (separate audio-playback subsystem)
  • Half-duplex coordination with capture (lives in the playback subsystem)
  • SSML (deferred to v1.x as additive — see Sub-decision (c))
  • Voice cloning workflow UX (per-backend; not contract surface)

Backend Interface

TTSBackend {
  # Identity
  Name()          -> string
  Capabilities()  -> Capabilities

  # Lifecycle
  Open(config)    -> Error
  Close()         -> Error

  # Voice discovery
  Voices()        -> []VoiceInfo | Error

  # Synthesis — offline (single buffer)
  Synthesize(ctx, text, opts) -> *AudioBuffer | Error

  # Synthesis — streaming (chunks arrive as the backend produces them)
  StreamSynthesize(ctx, text, opts) -> <-chan AudioChunk | Error

  # Diagnostics
  Stats()         -> Stats
  Health()        -> Health
}

Lifecycle state machine

                  Open()
   [Closed]  --------------->  [Open]
       ^                         |
       |                         | (any number of Synthesize / StreamSynthesize calls)
       |                         |
       +--------- Close() -------+
  • Open() resolves auth, downloads/validates models if needed, prepares the backend. Idempotent re-open after Close() is allowed.
  • Synthesize / StreamSynthesize are stateless w.r.t. backend lifecycle — no per-call open. Both are safe to call concurrently from multiple goroutines on the same backend instance.
  • Close() releases resources (HTTP connections, model file handles, miniaudio contexts if any). Idempotent.
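The lifecycle rules above can be sketched in Go with a small in-memory backend. This is an illustration only: fakeBackend, ErrNotOpen, and the one-silent-sample-per-byte output are hypothetical, ctx and opts are omitted for brevity, and treating a second Open() as a no-op is an assumption rather than something the contract states.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// AudioBuffer mirrors the offline wire type in this contract.
type AudioBuffer struct {
	Samples      []byte
	SampleFormat string
	SampleRate   uint32
	Channels     uint8
}

// ErrNotOpen is a hypothetical sentinel; the contract does not name one.
var ErrNotOpen = errors.New("tts: backend is not open")

// fakeBackend is an illustrative in-memory backend, not a real Vox type.
type fakeBackend struct {
	mu   sync.RWMutex
	open bool
}

// Open prepares the backend. Calling it while already open is a no-op here.
func (b *fakeBackend) Open() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.open = true
	return nil
}

// Close releases resources. Calling it twice is safe.
func (b *fakeBackend) Close() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.open = false
	return nil
}

// Synthesize takes only a read lock, so concurrent calls do not serialize.
func (b *fakeBackend) Synthesize(text string) (*AudioBuffer, error) {
	b.mu.RLock()
	defer b.mu.RUnlock()
	if !b.open {
		return nil, ErrNotOpen
	}
	// One silent i16 sample per input byte: deterministic output for tests.
	return &AudioBuffer{
		Samples:      make([]byte, 2*len(text)),
		SampleFormat: "i16",
		SampleRate:   16000,
		Channels:     1,
	}, nil
}

func main() {
	b := &fakeBackend{}
	if err := b.Open(); err != nil {
		fmt.Println("open failed:", err)
		return
	}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			buf, err := b.Synthesize("hello")
			if err != nil {
				fmt.Println("synthesize failed:", err)
				return
			}
			fmt.Println(len(buf.Samples), "bytes of", buf.SampleFormat)
		}()
	}
	wg.Wait()
	_ = b.Close()
	_ = b.Close() // second Close is a no-op
}
```

Because the fake never touches audio hardware, a test can assert directly on the returned bytes.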

Output Destination Model

Backends produce bytes. The caller decides what to do with them.

Three concerns, three components:

| Component | Responsibility |
|---|---|
| TTS backend (this contract) | Text → audio bytes |
| Audio-playback subsystem | Audio bytes → speakers (or file) |
| Capture adapter | Microphone → frames (and pauses during playback for half-duplex) |

The TTS sink (internal/sink/tts/, separate bead) is the orchestrator between these three components. It calls Synthesize on the backend, hands the bytes to the playback subsystem, and signals capture to pause.

This keeps the contract slim and testable: a TTS backend never depends on audio hardware. Tests can capture the returned bytes and assert against them without any speakers in the loop.


Wire Types

AudioBuffer — offline output

AudioBuffer {
  Samples      []byte          // raw bytes, format per SampleFormat
  SampleFormat string          // "f32" | "i16" | "mp3" | "opus" | "aiff" | "raw"
  SampleRate   uint32          // 0 for compressed formats
  Channels     uint8           // 0 for compressed formats
}

The backend returns whatever format it natively produces. No transcoding inside the backend — a separate audio/transcode helper package handles cross-format conversion when callers need a specific output format.

AudioChunk — streaming output

AudioChunk {
  Bytes       []byte
  Encoding    string           // "f32" | "i16" | "mp3" | "opus" | "aiff" — what's in Bytes
  SampleRate  uint32           // for raw formats; 0 for compressed
  Channels    uint8            // for raw formats; 0 for compressed
  IsFinal     bool             // true on the last chunk; the backend closes the channel afterward
  Sequence    uint64           // monotonically increasing from 0
  Custom      map<string, any> // backend-specific (e.g., word-timing markers)
}
  • IsFinal: true MUST appear exactly once on the last chunk. After IsFinal, the backend closes the channel.
  • Channel close without IsFinal = abnormal termination. The caller treats the assembled audio as truncated.
  • No enforced chunk cadence. Backends emit at their natural rate. Chunks aligned to codec frame boundaries (MP3 ~20-60ms, Opus ~20ms, PCM at backend's choice) are normal.
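A caller-side reading of the IsFinal rules might look like the following sketch. The collect helper and the ErrTruncated sentinel are hypothetical names, not part of the contract, and the struct carries only the fields the example needs.

```go
package main

import (
	"errors"
	"fmt"
)

// AudioChunk mirrors the streaming wire type (subset of fields).
type AudioChunk struct {
	Bytes    []byte
	Encoding string
	IsFinal  bool
	Sequence uint64
}

// ErrTruncated is a hypothetical sentinel for "channel closed without IsFinal".
var ErrTruncated = errors.New("tts: stream closed without IsFinal")

// collect drains a stream and concatenates the chunk payloads.
// It returns ErrTruncated, together with the partial audio, when the
// channel closes before a chunk with IsFinal set has been seen.
func collect(stream <-chan AudioChunk) ([]byte, error) {
	var audio []byte
	for chunk := range stream {
		audio = append(audio, chunk.Bytes...)
		if chunk.IsFinal {
			return audio, nil
		}
	}
	return audio, ErrTruncated
}

func main() {
	stream := make(chan AudioChunk, 2)
	stream <- AudioChunk{Bytes: []byte{0x01, 0x02}, Encoding: "i16", Sequence: 0}
	stream <- AudioChunk{Bytes: []byte{0x03, 0x04}, Encoding: "i16", Sequence: 1, IsFinal: true}
	close(stream)

	audio, err := collect(stream)
	fmt.Println(len(audio), err)
}
```

Returning the partial audio alongside the error leaves the decision (play what arrived, or discard it) to the caller.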

VoiceInfo — voice discovery

VoiceInfo {
  ID          string           // backend-native; what you pass to SynthesizeOptions.Voice
  DisplayName string           // human-readable: "Samantha (Premium)", "Rachel (ElevenLabs)"
  Tags        []string         // backend-attached, free-form: ["en-US", "female", "warm", "premium"]
  Languages   []string         // BCP-47 codes the voice supports
  Preview     string?          // optional: URL or path to a sample audio clip
  Custom      map<string, any>  // backend-specific extras (style transfer, emotions, etc.)
}

SynthesizeOptions

SynthesizeOptions {
  Voice    string              // backend-native ID; empty = backend default
  Speed    float64              // 0.5–2.0 typical; 1.0 = natural; backend clamps if out of range
  Pitch    float64              // -1.0 to +1.0; 0 = natural; advisory; backend interprets
  Format   string               // requested output format; backend uses native if unsupported
  Custom   map<string, any>     // backend-specific knobs (stability, style, etc.)
}

Speed and Pitch are advisory: backends honor them when possible and ignore them when not. Format is a request with fallback: the backend uses its native format if the requested one isn't supported, and stamps the actual format in AudioBuffer.SampleFormat / AudioChunk.Encoding.
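As a rough sketch of how a backend might apply these rules, the helpers below clamp Speed and resolve Format against a supported list. The names clampSpeed and resolveFormat are hypothetical, the 0.5 to 2.0 bounds are the "typical" range rather than a contract requirement, and treating a zero Speed as "unset, use 1.0" is an assumption.

```go
package main

import "fmt"

// SynthesizeOptions mirrors the contract type (Custom omitted).
type SynthesizeOptions struct {
	Voice  string
	Speed  float64
	Pitch  float64
	Format string
}

// clampSpeed keeps speed inside 0.5 to 2.0.
// Assumption: 0 (the Go zero value) means "unset" and maps to 1.0.
func clampSpeed(speed float64) float64 {
	switch {
	case speed == 0:
		return 1.0
	case speed < 0.5:
		return 0.5
	case speed > 2.0:
		return 2.0
	default:
		return speed
	}
}

// resolveFormat returns the requested format when the backend supports it
// and the backend's native format otherwise. The caller stamps the result
// into AudioBuffer.SampleFormat or AudioChunk.Encoding.
func resolveFormat(requested, native string, supported []string) string {
	for _, f := range supported {
		if f == requested {
			return requested
		}
	}
	return native
}

func main() {
	opts := SynthesizeOptions{Speed: 3.0, Format: "opus"}
	supported := []string{"mp3", "i16"}
	fmt.Println(clampSpeed(opts.Speed), resolveFormat(opts.Format, "mp3", supported))
}
```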


Capabilities

Capabilities {
  # Modes
  SupportsStreaming      bool
  StreamingLatencyMS     uint32      // typical time-to-first-chunk; informational

  # Voices
  VoiceCount             uint32      // total voices available
  DefaultVoice           string      // backend's default voice ID

  # Format
  SupportedFormats       []string    // "f32" | "i16" | "mp3" | "opus" | "aiff" | "raw"
  NativeFormat           string      // what this backend prefers to emit

  # Languages
  SupportedLanguages     []string    // BCP-47 tags

  # Features
  SupportsSpeed          bool        // honors opts.Speed
  SupportsPitch          bool        // honors opts.Pitch
  SupportsSSML           bool        // false in v1; placeholder for v1.x
  SupportsVoiceCloning   bool        // pluggable; some backends support it
  SupportsWordTimings    bool        // emits word-level timing in chunks

  # Cost
  CostPerMillionCharsUSD float64     // 0.0 for local
  IsLocal                bool

  # Limits
  MaxTextChars           uint32      // hard cap; longer text rejected per-call
}

Voice Selection

Backend-native IDs only. No contract-level normalization across backends. opts.Voice = "Samantha" works for tts-say; opts.Voice = "21m00Tcm4TlvDq8ikWAM" works for tts-elevenlabs. Each backend speaks its own language.

Discovery via Voices()

vox tts voices --backend elevenlabs
# → table with ID, DisplayName, Tags, Languages, Preview

vox tts voices --backend say --tag female
# → filtered list

Default per backend

Each backend declares a DefaultVoice in its capabilities. When opts.Voice = "", the backend uses its default.

Reasonable defaults:

| Backend | Default voice |
|---|---|
| tts-say | Samantha |
| tts-piper | First model alphabetically (depends on installed) |
| tts-elevenlabs | Rachel |
| tts-openai | alloy |

Cross-backend aliases — user config, not contract

Users who want "same voice across backends" can define aliases above the contract:

tts:
  voice_aliases:
    house:   { tts-say: Samantha, tts-elevenlabs: Rachel, tts-piper: en_US-libritts-high }
    formal:  { tts-say: Daniel,   tts-elevenlabs: Adam,   tts-piper: en_US-ryan-high }

sinks:
  - name: speak-out
    type: tts-elevenlabs
    voice: "@house"          # resolved via voice_aliases above the contract

Backends only see backend-native strings. Alias resolution is an orchestrator/CLI concern.
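One way the orchestrator could resolve an "@alias" before calling a backend is sketched below. The resolveVoice function and the voiceAliases type are hypothetical; only the "@" prefix and the voice_aliases layout come from the config above.

```go
package main

import (
	"fmt"
	"strings"
)

// voiceAliases maps alias name -> backend type -> backend-native voice ID,
// matching the shape of tts.voice_aliases in the config.
type voiceAliases map[string]map[string]string

// resolveVoice turns "@alias" into a backend-native ID. Anything without
// the "@" prefix is passed through untouched, including the empty string
// that selects the backend default.
func resolveVoice(voice, backend string, aliases voiceAliases) (string, error) {
	if !strings.HasPrefix(voice, "@") {
		return voice, nil
	}
	name := strings.TrimPrefix(voice, "@")
	perBackend, ok := aliases[name]
	if !ok {
		return "", fmt.Errorf("unknown voice alias %q", name)
	}
	id, ok := perBackend[backend]
	if !ok {
		return "", fmt.Errorf("alias %q has no voice for backend %q", name, backend)
	}
	return id, nil
}

func main() {
	aliases := voiceAliases{
		"house": {"tts-say": "Samantha", "tts-elevenlabs": "Rachel"},
	}
	id, err := resolveVoice("@house", "tts-elevenlabs", aliases)
	fmt.Println(id, err)
}
```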


Streaming

For low-latency conversational mode, StreamSynthesize returns a channel of AudioChunks. The first chunk should arrive within Capabilities.StreamingLatencyMS.

Chunk encoding per chunk

Each chunk declares its own encoding via Encoding. Backends emit their natural format:

| Backend | Streaming format |
|---|---|
| tts-piper | Raw f32 PCM |
| tts-elevenlabs | MP3 (default) or PCM (opt-in) |
| tts-openai | MP3, AAC, Opus per response_format |

The audio-playback subsystem handles decoding before pushing to the audio device.

Backpressure

The backend writes to a bounded-buffer channel (default capacity 32 chunks ≈ 5-10s of audio at typical chunk rates). When the channel is full, the backend blocks on send; it does not drop chunks. Streaming TTS is meant for sub-second start-of-playback, so if the consumer is 32 chunks behind, something is broken and blocking is the right signal.

Buffer size configurable via stream_buffer_chunks.
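The blocking-send behavior can be sketched with a plain buffered channel. The produce function below is hypothetical; it takes a done channel (ctx.Done() in practice) so a cancelled call stops sending and closes the channel without IsFinal.

```go
package main

import (
	"context"
	"fmt"
)

// AudioChunk mirrors the streaming wire type (subset of fields).
type AudioChunk struct {
	Bytes    []byte
	IsFinal  bool
	Sequence uint64
}

// produce sends chunks into a channel with the given capacity
// (stream_buffer_chunks). A full buffer blocks the sender; nothing is dropped.
func produce(done <-chan struct{}, chunks []AudioChunk, capacity int) <-chan AudioChunk {
	out := make(chan AudioChunk, capacity)
	go func() {
		defer close(out)
		for _, c := range chunks {
			// Check cancellation first so a cancelled call never sends.
			select {
			case <-done:
				return
			default:
			}
			select {
			case out <- c: // blocks here while the buffer is full
			case <-done:
				return // closed without IsFinal: the caller sees truncation
			}
		}
	}()
	return out
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	chunks := []AudioChunk{{Sequence: 0}, {Sequence: 1}, {Sequence: 2, IsFinal: true}}
	for c := range produce(ctx.Done(), chunks, 32) {
		fmt.Println(c.Sequence, c.IsFinal)
	}
}
```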

Slow-backend marker

If a backend can't keep up with playback (one chunk takes longer to generate than its audible duration), it emits a marker chunk with Custom["backend.slow"] = true so the playback layer can react — insert brief silence, swap backends, surface a warning.
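For raw PCM, the "slower than real time" condition reduces to comparing generation time with the chunk's audible duration. The sketch below assumes the marker is set on the affected chunk itself (the contract only says a marker chunk is emitted); audibleMillis and markIfSlow are hypothetical helpers, and compressed encodings are skipped because their duration cannot be derived from byte count alone.

```go
package main

import "fmt"

// AudioChunk mirrors the streaming wire type (subset of fields).
type AudioChunk struct {
	Bytes      []byte
	Encoding   string
	SampleRate uint32
	Channels   uint8
	Custom     map[string]any
}

// audibleMillis returns the playback duration of a raw PCM chunk in
// milliseconds. It reports false for compressed encodings or when the
// sample rate or channel count is zero.
func audibleMillis(c AudioChunk) (float64, bool) {
	var bytesPerSample int
	switch c.Encoding {
	case "f32":
		bytesPerSample = 4
	case "i16":
		bytesPerSample = 2
	default:
		return 0, false
	}
	if c.SampleRate == 0 || c.Channels == 0 {
		return 0, false
	}
	frames := len(c.Bytes) / (bytesPerSample * int(c.Channels))
	return float64(frames) * 1000 / float64(c.SampleRate), true
}

// markIfSlow sets Custom["backend.slow"] when generating the chunk took
// longer than the chunk takes to play.
func markIfSlow(c *AudioChunk, generationMillis float64) bool {
	audible, ok := audibleMillis(*c)
	if !ok || generationMillis <= audible {
		return false
	}
	if c.Custom == nil {
		c.Custom = map[string]any{}
	}
	c.Custom["backend.slow"] = true
	return true
}

func main() {
	// One second of 16 kHz mono i16 audio that took 1.5 s to generate.
	chunk := AudioChunk{Bytes: make([]byte, 32000), Encoding: "i16", SampleRate: 16000, Channels: 1}
	fmt.Println(markIfSlow(&chunk, 1500), chunk.Custom)
}
```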

Word-timing in Custom map

Some backends (ElevenLabs with_timestamps, Google Cloud TTS) emit word-level timing alongside audio. For v1 this lives in the chunk's Custom map under tts.word_timings. First-class field promotion is a v1.x decision once usage warrants.

Latency telemetry

When a backend's time-to-first-chunk exceeds 2× its declared StreamingLatencyMS, it emits a tts.latency_exceeded event. This is the same pattern as asr/v1's partial-latency telemetry.


BYOK Authentication

Same precedence chain as asr/v1 and sink/v1 LLM:

  1. Explicit env var (ELEVENLABS_API_KEY, OPENAI_API_KEY, etc.)
  2. OS keychain (default; populated via vox auth set <backend>)
  3. Config file api_key field (deprecated; warns once)
  4. External secrets manager (future secrets/v1)
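The precedence chain amounts to "first source that yields a value wins". A minimal sketch follows; credentialSource, resolveCredential, and the stub sources are hypothetical, and the keychain, config-file, and secrets-manager lookups are stubbed because their APIs are not defined in this document.

```go
package main

import (
	"fmt"
	"os"
)

// credentialSource is one step in the precedence chain.
type credentialSource struct {
	name   string
	lookup func() (string, bool)
}

// resolveCredential walks the sources in order and returns the first
// non-empty value together with the name of the source that supplied it.
func resolveCredential(sources []credentialSource) (value, from string, ok bool) {
	for _, s := range sources {
		if v, found := s.lookup(); found && v != "" {
			return v, s.name, true
		}
	}
	return "", "", false
}

// envSource reads an explicit environment variable.
func envSource(key string) credentialSource {
	return credentialSource{
		name:   "env:" + key,
		lookup: func() (string, bool) { return os.LookupEnv(key) },
	}
}

// stubSource stands in for lookups this sketch does not implement.
func stubSource(name, value string) credentialSource {
	return credentialSource{
		name:   name,
		lookup: func() (string, bool) { return value, value != "" },
	}
}

func main() {
	sources := []credentialSource{
		envSource("ELEVENLABS_API_KEY"),
		stubSource("keychain", ""),
		stubSource("config-file", ""),
		stubSource("secrets-manager", ""),
	}
	if _, from, ok := resolveCredential(sources); ok {
		fmt.Println("credential resolved from", from)
	} else {
		fmt.Println("no credential found")
	}
}
```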

Shared credentials

Backends that piggyback on existing Vox-side credentials declare so:

sinks:
  - name: speak-out
    type: tts-openai
    auth:
      shares_credential_with: llm-openai     # reuse OPENAI_API_KEY from the llm-openai sink

When shares_credential_with is set, the TTS backend skips its own credential lookup and reads the resolved credential from the named sink. One credential, two surfaces.


Cost Controls (Cloud Backends)

Cloud TTS is metered by character, and ElevenLabs at ~$0.30/1k chars adds up fast. Each backend therefore gets first-class guardrails:

sinks:
  - name: speak-out
    type: tts-elevenlabs
    cost_controls:
      budget_daily_usd: 2.00
      budget_monthly_usd: 50.00
      on_budget_warn_pct: 80                 # WARN log + event at 80% of budget
      on_budget_exceed: halt                 # halt | warn | fallback
      fallback_backend: tts-piper            # required if on_budget_exceed: fallback
      rate_limit_per_minute: 30
      max_text_chars: 5000                   # reject implausibly long single-utterance synthesis

Behavior

| Setting | What it does |
|---|---|
| budget_daily_usd / budget_monthly_usd | Hard caps. Vox tracks consumed chars × CostPerMillionCharsUSD / 1,000,000 |
| on_budget_warn_pct | Emit tts.budget_warning event + WARN log + audit event at this threshold |
| on_budget_exceed | halt = stop synthesizing; warn = continue + log; fallback = route to fallback_backend |
| rate_limit_per_minute | Soft throttle. Synthesize calls queue if exceeded |
| max_text_chars | Reject calls with text longer than this; prevents accidental novel-length synthesis |

Spend tracking shares the ~/.vox/state/cost-tracker.db SQLite store with ASR + LLM tracking, in a tts_costs table.
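The warn and exceed thresholds can be read as a small decision function over tracked spend. The sketch below covers only the daily budget; costControls and budgetDecision are hypothetical names, and treating "spend equal to the budget" as exceeded is an assumption.

```go
package main

import "fmt"

// costControls holds the subset of cost_controls used in this sketch.
type costControls struct {
	BudgetDailyUSD float64
	WarnPct        float64
	OnBudgetExceed string // "halt" | "warn" | "fallback"
}

// budgetDecision returns "ok", "budget_warning", or the configured
// on_budget_exceed action for the given spend so far today.
func budgetDecision(spentUSD float64, cc costControls) string {
	switch {
	case spentUSD >= cc.BudgetDailyUSD:
		return cc.OnBudgetExceed
	case spentUSD >= cc.BudgetDailyUSD*cc.WarnPct/100:
		return "budget_warning"
	default:
		return "ok"
	}
}

func main() {
	cc := costControls{BudgetDailyUSD: 2.00, WarnPct: 80, OnBudgetExceed: "halt"}
	for _, spent := range []float64{1.00, 1.70, 2.50} {
		fmt.Printf("$%.2f spent -> %s\n", spent, budgetDecision(spent, cc))
	}
}
```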

Pre-flight cost estimation

tts:
  preflight_cost_estimate: true              # default true for paid backends
  preflight_cost_log_threshold_usd: 0.05

The estimate is cheap to compute: chars × CostPerMillionCharsUSD / 1,000,000. It is logged at INFO when it exceeds the threshold, which surfaces runaway-long synthesis before it bills.
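As a worked example of the estimate, note that a price quoted per million characters has to be divided by 1,000,000 before multiplying by the character count. The function names below are hypothetical, and whether "chars" counts bytes or Unicode code points is left to the caller because this document does not say.

```go
package main

import "fmt"

// estimateCostUSD converts a character count into dollars using a
// per-million-characters price.
func estimateCostUSD(chars int, costPerMillionCharsUSD float64) float64 {
	return float64(chars) * costPerMillionCharsUSD / 1_000_000
}

// shouldLogEstimate applies preflight_cost_log_threshold_usd.
func shouldLogEstimate(estimateUSD, thresholdUSD float64) bool {
	return estimateUSD > thresholdUSD
}

func main() {
	// $0.30 per 1k chars is $300 per million chars.
	est := estimateCostUSD(127, 300)
	fmt.Printf("estimate: $%.4f, log: %v\n", est, shouldLogEstimate(est, 0.05))
}
```

At $300 per million characters, 127 characters come to $0.0381, which stays under the $0.05 logging threshold.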

Cost transparency on the envelope

When TTS runs on an llm_response envelope, Vox attaches per-envelope cost metadata:

envelope.Provenance.Custom {
  "tts.backend":           "elevenlabs",
  "tts.voice":             "Rachel",
  "tts.audio_seconds":     "8.3",
  "tts.chars_synthesized": "127",
  "tts.cost_usd":          "0.0381",
  "tts.model":             "eleven_turbo_v2",
  "tts.elapsed_ms":        "412"
}

Controlled by tts.emit_cost_metadata: true|false (default true for cloud, false for local — no cost to report).


Fallback Chain

When a backend goes unhealthy (network, auth, quota, budget exceeded with on_budget_exceed: fallback, timeout), route the next synthesis through the next healthy backend.

tts:
  fallback_chains:
    default: [tts-elevenlabs, tts-openai, tts-piper, tts-say]

Semantics

  • Failed backend out of rotation for health_recovery_interval (default 5 min)
  • Subsequent calls route to the next healthy backend
  • After recovery interval, Vox probes the primary; resumes if healthy
  • Fallback events emit tts.backend_fallback telemetry + audit event
  • User-visible: structured WARN log the first time fallback fires per session

Emergency local fallback

If the entire chain fails, Vox uses an always-available emergency local fallback (tts-piper with the smallest bundled voice, or tts-say on macOS). If even that fails: the orchestrator emits a structured warning and SKIPS speech output for that envelope — pipeline never blocks.


Tier-1 Backends (ship with v1)

| Backend | Local/Cloud | Quality | Cost | Why tier-1 |
|---|---|---|---|---|
| tts-say | Local (macOS native) | OK / Good | $0 | Zero install on macOS; Siri voices are excellent; out-of-the-box path |
| tts-piper | Local | Good | $0 | Cross-platform; MIT-licensed; real-time on commodity hardware; no-API-key baseline |
| tts-elevenlabs | Cloud BYOK | Best-in-class | $0.30/1k chars (~$0.05/min) | Highest perceived voice quality; what users will ask for |
| tts-openai | Cloud BYOK | Very good | $0.015/1k chars (~$0.003/min) | Cheapest cloud; reuses OPENAI_API_KEY from llm-openai |

Tier-2 backends (community-contributable; same contract)

  • tts-google (Wavenet)
  • tts-azure (Neural voices)
  • tts-aws-polly
  • tts-coqui (open source; voice cloning)
  • tts-bark (Suno's emotive voices)
  • tts-deepgram-aura
  • tts-espeak-ng (Linux native; robotic but always-available fallback)
  • Windows SAPI shell-out

Error Model

Typed errors, mirroring other Vox surfaces:

TTSError {
  Kind     TTSErrorKind
  Backend  string
  Op       string                           // "open" | "synthesize" | "stream-synthesize" | etc.
  Message  string
  Cause    Error?
}

TTSErrorKind {
  ErrInvalidConfig
  ErrAuthFailed
  ErrQuotaExceeded                          // rate-limit / quota / budget
  ErrUnsupported                            // voice / format / feature not supported
  ErrModelNotFound                          // local model file missing (Piper)
  ErrModelCorrupt                           // checksum mismatch
  ErrTimeout
  ErrBackendUnavailable                     // provider down, network unreachable
  ErrTextTooLong                            // input exceeds max_text_chars
  ErrInvalidText                            // empty, control characters, etc.
  ErrTransient                              // retry may help
  ErrPersistent                             // retry won't help
  ErrInternal
}

The orchestrator handles failures via the fallback chain. TTS backends NEVER block the pipeline — worst case is "speech output skipped for this envelope, structured warning logged."


SSML — Deferred to v1.x

v1 ships plain-text input only. SSML support varies wildly across backends (tts-say and tts-piper have none; tts-elevenlabs partial; tts-google full) and isn't core to the LLM-response use case.

When v1.x adds SSML:

  • Capabilities.SupportsSSML stops being a placeholder: backends that accept SSML advertise true
  • SynthesizeOptions.IsSSML flag added
  • Plain-text input remains the default and always supported

Versioning and Stability

tts/v1 is the contract above. Once frozen:

  • Non-breaking changes (allowed in v1.x): adding optional fields to Capabilities, SynthesizeOptions, AudioBuffer, AudioChunk, VoiceInfo, Stats; adding new TTSErrorKind values; adding new built-in backends; adding SSML support (per Sub-decision c); adding word-timing markers as first-class fields (currently in Custom); adding new shared-credential targets.
  • Breaking changes (require v2): changing the TTSBackend interface signature; changing the meaning of any existing field; changing chunk-encoding semantics; changing the streaming channel contract.

The core supports a single major version of tts/ at a time, except during migrations, when the old and new versions overlap.


Audio Playback — Out of Scope

To be explicit: the audio-playback subsystem is NOT part of tts/v1. Backends produce bytes. What happens to those bytes (file write, speaker playback, network stream) is the caller's concern.

The actual playback subsystem (internal/audio/playback/, separate bead blackrim-vox-40r) provides:

  • Cross-platform audio output (CoreAudio / WASAPI / PipeWire via miniaudio — same dependency as capture/v1)
  • Format decoding (MP3 / Opus / AIFF → PCM)
  • Sample-rate conversion when device-native rate ≠ source rate
  • playback.started / playback.ended events for half-duplex coordination with capture

tts/v1 interoperates with the playback subsystem but doesn't define it. Clean separation: synthesize, play, capture.


Reference Build Order

| Order | Backend | Status | Why this order |
|---|---|---|---|
| 1 | tts-say | Shipped (internal/tts/say/) | First. Zero install on macOS; native voices; no model files; no network; no CGo. Unblocks the entire downstream wiring (TTS sink, audio playback, half-duplex) with the simplest possible backend. |
| 2 | tts-piper | Shipped (internal/tts/piper/) | First cross-platform backend; local neural TTS via Piper (MIT); recommended open-source backend for non-macOS and cross-platform deployments. Validates "local + no API key" baseline. |
| 3 | tts-elevenlabs | Shipped (internal/tts/elevenlabs/) | First cloud backend; validates BYOK + cost controls + streaming mode |
| 4 | tts-openai | Planned | Validates shares_credential_with reuse (with llm-openai); validates non-streaming-first cloud path |
| 5+ | tier-2 backends | Planned | Same contract, different wire protocols |

Build the audio-playback subsystem (blackrim-vox-40r) and TTS sink (blackrim-vox-38t) with tts-say alone first. Add backends after the base interface + playback path are proven.


Project Principle: Opinionated Defaults, Every Default Configurable

This contract continues the principle established in the other v1 contracts. Every behavior with a defensible default (stream_buffer_chunks: 32, health_recovery_interval: 5m, emit_cost_metadata: true for cloud backends, preflight_cost_estimate: true for paid backends, the tier-1 backend choices, etc.) is exposed as a config knob. Defaults reflect a considered recommendation for the typical voice-to-voice conversational use case; the knobs exist so specialized workflows can tune them.


TTS Sink wiring layer

internal/sink/tts (bead blackrim-vox-38t) provides the wiring layer between tts.Backend and audio/playback. It implements sink.Sink and consumes llm_response envelopes from the orchestrator, calling backend.Synthesize() and routing the resulting AudioBuffer to player.Play(). See docs/internal/tts-sink.md for the full contract.

Enable in vox listen with --tts (default backend: say).