`tts/v1` — Text-to-Speech Backends¶

Status: Draft (design-locked, ready for first implementation) · Stability: v1 will be frozen with the first reference backend (tts-say) · Implementations: in-tree only until v1 is frozen.

The tts/v1 surface is the output side of the voice loop. A TTS backend takes text and produces audio bytes. Vox uses this to speak LLM responses, meeting summaries, action confirmations — closing the conversational loop from voice-in to voice-out.

tts/v1 is the sixth pipeline contract, alongside capture/v1, segment/v1, asr/v1, router/v1, sink/v1. It's an output-side peer of the input-side asr/v1.

This document is the contract. Backends conforming to it can be loaded by any version of the Vox core that supports tts/v1.

Scope¶

tts/v1 covers:

The backend interface (offline + streaming synthesis)
Output destination model — backends emit bytes; callers choose destination
Voice discovery and selection (backend-native IDs, no contract-level normalization)
Streaming chunk shape and cadence
BYOK authentication + shared credentials
Cost controls (budgets, rate limits, pre-flight estimation)
Fallback chain for backend failure
Per-envelope cost transparency
Capabilities advertisement
Error model + lifecycle
Versioning and stability rules

tts/v1 does not cover:

Audio playback to speakers (separate audio-playback subsystem)
Half-duplex coordination with capture (lives in the playback subsystem)
SSML (deferred to v1.x as additive — see Sub-decision (c))
Voice cloning workflow UX (per-backend; not contract surface)

Backend Interface¶

TTSBackend {
  # Identity
  Name()          -> string
  Capabilities()  -> Capabilities

  # Lifecycle
  Open(config)    -> Error
  Close()         -> Error

  # Voice discovery
  Voices()        -> []VoiceInfo | Error

  # Synthesis — offline (single buffer)
  Synthesize(ctx, text, opts) -> *AudioBuffer | Error

  # Synthesis — streaming (chunks arrive as the backend produces them)
  StreamSynthesize(ctx, text, opts) -> <-chan AudioChunk | Error

  # Diagnostics
  Stats()         -> Stats
  Health()        -> Health
}

Lifecycle state machine¶

                  Open()
   [Closed]  --------------->  [Open]
       ^                         |
       |                         | (any number of Synthesize / StreamSynthesize calls)
       |                         |
       +--------- Close() -------+

Open() resolves auth, downloads/validates models if needed, and prepares the backend. Re-opening a backend after Close() is allowed.
Synthesize / StreamSynthesize are stateless w.r.t. backend lifecycle — no per-call open. Both are safe to call concurrently from multiple goroutines on the same backend instance.
Close() releases resources (HTTP connections, model file handles, miniaudio contexts if any). Idempotent.
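
The lifecycle rules above can be sketched in Go roughly as follows. This is a minimal illustration, not the reference implementation: the `lifecycle` helper, `ensureOpen`, and `errClosed` are hypothetical names, and a real backend would return a TTSError rather than a bare sentinel.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errClosed is a hypothetical sentinel; a real backend would return a TTSError.
var errClosed = errors.New("tts: backend is closed")

// lifecycle tracks the [Closed] <-> [Open] state machine.
// The zero value starts in the Closed state.
type lifecycle struct {
	mu   sync.RWMutex
	open bool
}

// Open moves the backend to Open. Auth resolution and model validation
// would happen here; this sketch only flips the state.
func (l *lifecycle) Open() error {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.open = true
	return nil
}

// Close moves the backend to Closed. Calling it twice is harmless.
func (l *lifecycle) Close() error {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.open = false
	return nil
}

// ensureOpen is the guard Synthesize / StreamSynthesize would run first.
// It takes only a read lock, so concurrent synthesis calls do not serialize.
func (l *lifecycle) ensureOpen() error {
	l.mu.RLock()
	defer l.mu.RUnlock()
	if !l.open {
		return errClosed
	}
	return nil
}

func main() {
	var l lifecycle
	fmt.Println("before Open:", l.ensureOpen())
	_ = l.Open()
	fmt.Println("after Open:", l.ensureOpen())
	_ = l.Close()
	_ = l.Close() // idempotent
	_ = l.Open()  // re-open after Close is allowed
	fmt.Println("after re-open:", l.ensureOpen())
}
```

The read/write lock is one possible way to keep synthesis calls concurrent while Open and Close stay exclusive; the contract itself does not prescribe a locking strategy.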

Output Destination Model¶

Backends produce bytes. The caller decides what to do with them.

Three concerns, three components:

Component	Responsibility
TTS backend (this contract)	Text → audio bytes
Audio-playback subsystem	Audio bytes → speakers (or file)
Capture adapter	Microphone → frames (and pauses during playback for half-duplex)

The TTS sink (internal/sink/tts/, separate bead) is the orchestrator between these three components. It calls Synthesize on the backend, hands the bytes to the playback subsystem, and signals capture to pause.

This keeps the contract slim and testable: a TTS backend never depends on audio hardware. Tests can capture the returned bytes and assert against them without any speakers in the loop.

Wire Types¶

`AudioBuffer` — offline output¶

AudioBuffer {
  Samples      []byte          // raw bytes, format per SampleFormat
  SampleFormat string          // "f32" | "i16" | "mp3" | "opus" | "aiff" | "raw"
  SampleRate   uint32          // 0 for compressed formats
  Channels     uint8           // 0 for compressed formats
}

The backend returns whatever format it natively produces. No transcoding inside the backend — a separate audio/transcode helper package handles cross-format conversion when callers need a specific output format.

`AudioChunk` — streaming output¶

AudioChunk {
  Bytes       []byte
  Encoding    string           // "f32" | "i16" | "mp3" | "opus" | "aiff" — what's in Bytes
  SampleRate  uint32           // for raw formats; 0 for compressed
  Channels    uint8            // for raw formats; 0 for compressed
  IsFinal     bool             // true on the last chunk; the backend closes the channel afterwards
  Sequence    uint64           // monotonically increasing from 0
  Custom      map<string, any> // backend-specific (e.g., word-timing markers)
}

IsFinal: true MUST appear exactly once on the last chunk. After IsFinal, the backend closes the channel.
Channel close without IsFinal = abnormal termination. The caller treats the assembled audio as truncated.
No enforced chunk cadence. Backends emit at their natural rate. Chunks aligned to codec frame boundaries (MP3 ~20-60ms, Opus ~20ms, PCM at backend's choice) are normal.
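
A caller-side sketch of the IsFinal rules above, in Go. The trimmed `AudioChunk` struct, the `collect` helper, and `errTruncated` are illustrative assumptions; only the "channel close without IsFinal means truncated" rule comes from the contract.

```go
package main

import (
	"errors"
	"fmt"
)

// AudioChunk is trimmed to the fields this sketch needs.
type AudioChunk struct {
	Bytes    []byte
	IsFinal  bool
	Sequence uint64
}

// errTruncated is a hypothetical error for abnormal termination.
var errTruncated = errors.New("tts: stream closed without IsFinal")

// collect drains the channel until the backend closes it. If no chunk
// carried IsFinal, the assembled audio is returned together with
// errTruncated so the caller can treat it as partial.
func collect(ch <-chan AudioChunk) ([]byte, error) {
	var out []byte
	sawFinal := false
	for c := range ch {
		out = append(out, c.Bytes...)
		if c.IsFinal {
			sawFinal = true
		}
	}
	if !sawFinal {
		return out, errTruncated
	}
	return out, nil
}

// stream is a test helper that plays back fixed chunks and closes the channel.
func stream(chunks ...AudioChunk) <-chan AudioChunk {
	ch := make(chan AudioChunk, len(chunks))
	for _, c := range chunks {
		ch <- c
	}
	close(ch)
	return ch
}

func main() {
	full, err := collect(stream(
		AudioChunk{Bytes: []byte("ab"), Sequence: 0},
		AudioChunk{Bytes: []byte("c"), Sequence: 1, IsFinal: true},
	))
	fmt.Println(string(full), err)

	partial, err := collect(stream(AudioChunk{Bytes: []byte("ab"), Sequence: 0}))
	fmt.Println(string(partial), err)
}
```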

`VoiceInfo` — voice discovery¶

VoiceInfo {
  ID          string           // backend-native; what you pass to SynthesizeOptions.Voice
  DisplayName string           // human-readable: "Samantha (Premium)", "Rachel (ElevenLabs)"
  Tags        []string         // backend-attached, free-form: ["en-US", "female", "warm", "premium"]
  Languages   []string         // BCP-47 codes the voice supports
  Preview     string?          // optional: URL or path to a sample audio clip
  Custom      map<string, any>  // backend-specific extras (style transfer, emotions, etc.)
}

`SynthesizeOptions`¶

SynthesizeOptions {
  Voice    string              // backend-native ID; empty = backend default
  Speed    float64              // 0.5–2.0 typical; 1.0 = natural; backend clamps if out of range
  Pitch    float64              // -1.0 to +1.0; 0 = natural; advisory; backend interprets
  Format   string               // requested output format; backend uses native if unsupported
  Custom   map<string, any>     // backend-specific knobs (stability, style, etc.)
}

Speed and Pitch are advisory: backends honor them when possible and ignore them otherwise. Format is a request with fallback: if the requested format isn't supported, the backend emits its native format and stamps the actual format in AudioBuffer.SampleFormat / AudioChunk.Encoding.
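
The request-with-fallback and clamping behavior might look like this inside a backend. This is a sketch under assumptions: `resolveFormat` and `clampSpeed` are hypothetical helpers, the 0.5 to 2.0 bounds are the "typical" range from the options comment rather than a contract requirement, and treating a zero Speed as "unset" is a choice of this sketch.

```go
package main

import "fmt"

// resolveFormat returns the requested format when the backend supports it,
// and the backend's native format otherwise. The caller stamps the result
// into AudioBuffer.SampleFormat / AudioChunk.Encoding.
func resolveFormat(requested, native string, supported []string) string {
	for _, f := range supported {
		if f == requested {
			return requested
		}
	}
	return native
}

// clampSpeed keeps Speed inside an assumed 0.5 to 2.0 range.
// A zero value is treated as "not set" and mapped to natural speed (1.0).
func clampSpeed(s float64) float64 {
	switch {
	case s == 0:
		return 1.0
	case s < 0.5:
		return 0.5
	case s > 2.0:
		return 2.0
	default:
		return s
	}
}

func main() {
	supported := []string{"mp3", "i16"}
	fmt.Println(resolveFormat("opus", "mp3", supported)) // unsupported: falls back to native
	fmt.Println(resolveFormat("i16", "mp3", supported))  // supported: honored
	fmt.Println(clampSpeed(3.0), clampSpeed(0.1), clampSpeed(0))
}
```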

Capabilities¶

Capabilities {
  # Modes
  SupportsStreaming      bool
  StreamingLatencyMS     uint32      // typical time-to-first-chunk; informational

  # Voices
  VoiceCount             uint32      // total voices available
  DefaultVoice           string      // backend's default voice ID

  # Format
  SupportedFormats       []string    // "f32" | "i16" | "mp3" | "opus" | "aiff" | "raw"
  NativeFormat           string      // what this backend prefers to emit

  # Languages
  SupportedLanguages     []string    // BCP-47 tags

  # Features
  SupportsSpeed          bool        // honors opts.Speed
  SupportsPitch          bool        // honors opts.Pitch
  SupportsSSML           bool        // false in v1; placeholder for v1.x
  SupportsVoiceCloning   bool        // pluggable; some backends support it
  SupportsWordTimings    bool        // emits word-level timing in chunks

  # Cost
  CostPerMillionCharsUSD float64     // 0.0 for local
  IsLocal                bool

  # Limits
  MaxTextChars           uint32      // hard cap; longer text rejected per-call
}

Voice Selection¶

Backend-native IDs only. No contract-level normalization across backends. opts.Voice = "Samantha" works for tts-say; opts.Voice = "21m00Tcm4TlvDq8ikWAM" works for tts-elevenlabs. Each backend speaks its own language.

Discovery via `Voices()`¶

vox tts voices --backend elevenlabs
# → table with ID, DisplayName, Tags, Languages, Preview

vox tts voices --backend say --tag female
# → filtered list

Default per backend¶

Each backend declares a DefaultVoice in its capabilities. When opts.Voice = "", the backend uses its default.

Reasonable defaults:

Backend	Default voice
`tts-say`	`Samantha`
`tts-piper`	First model alphabetically (depends on installed)
`tts-elevenlabs`	`Rachel`
`tts-openai`	`alloy`

Cross-backend aliases — user config, not contract¶

Users who want "same voice across backends" can define aliases above the contract:

tts:
  voice_aliases:
    house:   { tts-say: Samantha, tts-elevenlabs: Rachel, tts-piper: en_US-libritts-high }
    formal:  { tts-say: Daniel,   tts-elevenlabs: Adam,   tts-piper: en_US-ryan-high }

sinks:
  - name: speak-out
    type: tts-elevenlabs
    voice: "@house"          # resolved via voice_aliases above the contract

Backends only see backend-native strings. Alias resolution is an orchestrator/CLI concern.
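
Alias resolution above the contract could be as small as the following Go sketch. The `aliases` type and `resolveVoice` function are hypothetical; the "@" prefix and the alias table mirror the YAML example, and the error on a missing entry is an assumption about how the orchestrator would react.

```go
package main

import (
	"fmt"
	"strings"
)

// aliases mirrors tts.voice_aliases: alias name -> backend -> native voice ID.
type aliases map[string]map[string]string

// resolveVoice turns "@alias" into the backend-native ID for the given
// backend. Anything without the "@" prefix is passed through untouched,
// so backends only ever see native strings.
func resolveVoice(a aliases, backend, voice string) (string, error) {
	if !strings.HasPrefix(voice, "@") {
		return voice, nil
	}
	name := strings.TrimPrefix(voice, "@")
	id, ok := a[name][backend] // reading a missing inner map is safe in Go
	if !ok {
		return "", fmt.Errorf("voice alias %q has no entry for backend %q", name, backend)
	}
	return id, nil
}

func main() {
	a := aliases{
		"house":  {"tts-say": "Samantha", "tts-elevenlabs": "Rachel", "tts-piper": "en_US-libritts-high"},
		"formal": {"tts-say": "Daniel", "tts-elevenlabs": "Adam", "tts-piper": "en_US-ryan-high"},
	}
	fmt.Println(resolveVoice(a, "tts-elevenlabs", "@house"))
	fmt.Println(resolveVoice(a, "tts-say", "Samantha"))
	fmt.Println(resolveVoice(a, "tts-openai", "@house"))
}
```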

Streaming¶

For low-latency conversational mode, StreamSynthesize returns a channel of AudioChunks. The first chunk should arrive within Capabilities.StreamingLatencyMS.

Chunk encoding per chunk¶

Each chunk declares its own encoding via Encoding. Backends emit their natural format:

Backend	Streaming format
`tts-piper`	Raw f32 PCM
`tts-elevenlabs`	MP3 (default) or PCM (opt-in)
`tts-openai`	MP3, AAC, Opus per `response_format`

The audio-playback subsystem handles decoding before pushing to the audio device.

Backpressure¶

The backend writes to a bounded-buffer channel (default capacity 32 chunks ≈ 5-10s of audio at typical chunk rates). On full channel: backend blocks on send. Not drop — streaming TTS is meant for sub-second start-of-playback; if the consumer is 32 chunks behind, something is broken and blocking is the right signal.

Buffer size configurable via stream_buffer_chunks.

Slow-backend marker¶

If a backend can't keep up with playback (one chunk takes longer to generate than its audible duration), it emits a marker chunk with Custom["backend.slow"] = true so the playback layer can react — insert brief silence, swap backends, surface a warning.

Word-timing in `Custom` map¶

Some backends (ElevenLabs with_timestamps, Google Cloud TTS) emit word-level timing alongside audio. For v1 this lives in the chunk's Custom map under tts.word_timings. First-class field promotion is a v1.x decision once usage warrants.

Latency telemetry¶

When a backend's time-to-first-chunk exceeds twice its declared StreamingLatencyMS, it emits a tts.latency_exceeded event. This follows the same pattern as asr/v1's partial-latency telemetry.

BYOK Authentication¶

Same precedence chain as asr/v1 and sink/v1 LLM:

Explicit env var (ELEVENLABS_API_KEY, OPENAI_API_KEY, etc.)
OS keychain (default; populated via vox auth set <backend>)
Config file api_key field (deprecated; warns once)
External secrets manager (future secrets/v1)

Shared credentials¶

Backends that piggyback on existing Vox-side credentials declare so:

sinks:
  - name: speak-out
    type: tts-openai
    auth:
      shares_credential_with: llm-openai     # reuse OPENAI_API_KEY from the llm-openai sink

When shares_credential_with is set, the TTS backend skips its own credential lookup and reads the resolved credential from the named sink. One credential, two surfaces.

Cost Controls (Cloud Backends)¶

Cloud TTS is metered by character. ElevenLabs at ~$0.30/1k chars adds up fast. First-class guardrails per backend:

sinks:
  - name: speak-out
    type: tts-elevenlabs
    cost_controls:
      budget_daily_usd: 2.00
      budget_monthly_usd: 50.00
      on_budget_warn_pct: 80                 # WARN log + event at 80% of budget
      on_budget_exceed: halt                 # halt | warn | fallback
      fallback_backend: tts-piper            # required if on_budget_exceed: fallback
      rate_limit_per_minute: 30
      max_text_chars: 5000                   # reject implausibly long single-utterance synthesis

Behavior¶

Setting	What it does
`budget_daily_usd` / `budget_monthly_usd`	Hard caps. Vox tracks spend as consumed chars × `CostPerMillionCharsUSD` / 1,000,000
`on_budget_warn_pct`	Emit `tts.budget_warning` event + WARN log + audit event at this threshold
`on_budget_exceed`	`halt` = stop synthesizing; `warn` = continue + log; `fallback` = route to `fallback_backend`
`rate_limit_per_minute`	Soft throttle. Synthesize calls queue if exceeded
`max_text_chars`	Reject calls with text longer than this — prevents accidental novel-length synthesis

Spend tracking shares the ~/.vox/state/cost-tracker.db SQLite store with ASR + LLM tracking, in a tts_costs table.
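
One way to read the settings table as a decision function is sketched below. This is an interpretation, not the specified algorithm: the `costControls` struct, `evaluate`, and the returned labels are hypothetical, only the daily budget is modeled, and checking the projected spend (already spent plus the next call) rather than the spend so far is an assumption.

```go
package main

import "fmt"

// costControls carries the subset of cost_controls this sketch uses.
type costControls struct {
	BudgetDailyUSD  float64
	OnBudgetWarnPct float64 // e.g. 80 for 80%
	OnBudgetExceed  string  // "halt" | "warn" | "fallback"
}

// evaluate decides what to do with the next synthesis call given today's
// spend and the estimated cost of the call.
func evaluate(c costControls, spentUSD, nextUSD float64) string {
	projected := spentUSD + nextUSD
	if projected > c.BudgetDailyUSD {
		switch c.OnBudgetExceed {
		case "warn":
			return "continue_and_log"
		case "fallback":
			return "route_to_fallback"
		default:
			return "halt"
		}
	}
	if projected >= c.BudgetDailyUSD*c.OnBudgetWarnPct/100 {
		return "budget_warning" // would emit tts.budget_warning + WARN log
	}
	return "ok"
}

func main() {
	c := costControls{BudgetDailyUSD: 2.00, OnBudgetWarnPct: 80, OnBudgetExceed: "fallback"}
	fmt.Println(evaluate(c, 1.00, 0.10))
	fmt.Println(evaluate(c, 1.55, 0.10))
	fmt.Println(evaluate(c, 1.95, 0.10))
}
```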

Pre-flight cost estimation¶

tts:
  preflight_cost_estimate: true              # default true for paid backends
  preflight_cost_log_threshold_usd: 0.05

Cheap: chars × CostPerMillionCharsUSD / 1,000,000. Logged at INFO when the estimate exceeds the threshold, which surfaces runaway-long synthesis before it bills.

Cost transparency on the envelope¶

Per-envelope cost metadata when TTS runs on an llm_response envelope:

envelope.Provenance.Custom {
  "tts.backend":           "elevenlabs",
  "tts.voice":             "Rachel",
  "tts.audio_seconds":     "8.3",
  "tts.chars_synthesized": "127",
  "tts.cost_usd":          "0.0381",
  "tts.model":             "eleven_turbo_v2",
  "tts.elapsed_ms":        "412"
}

Controlled by tts.emit_cost_metadata: true|false (default true for cloud, false for local — no cost to report).

Fallback Chain¶

When a backend goes unhealthy (network, auth, quota, budget exceeded with on_budget_exceed: fallback, timeout), route the next synthesis through the next healthy backend.

tts:
  fallback_chains:
    default: [tts-elevenlabs, tts-openai, tts-piper, tts-say]

Semantics¶

Failed backend out of rotation for health_recovery_interval (default 5 min)
Subsequent calls route to the next healthy backend
After recovery interval, Vox probes the primary; resumes if healthy
Fallback events emit tts.backend_fallback telemetry + audit event
User-visible: structured WARN log the first time fallback fires per session

Emergency local fallback¶

If the entire chain fails, Vox uses an always-available emergency local fallback (tts-piper with the smallest bundled voice, or tts-say on macOS). If even that fails: the orchestrator emits a structured warning and SKIPS speech output for that envelope — pipeline never blocks.

Tier-1 Backends (ship with v1)¶

Backend	Local/Cloud	Quality	Cost	Why tier-1
`tts-say`	Local (macOS native)	OK / Good	$0	Zero install on macOS; Siri voices are excellent; out-of-the-box path
`tts-piper`	Local	Good	$0	Cross-platform; MIT-licensed; real-time on commodity hardware; no-API-key baseline
`tts-elevenlabs`	Cloud BYOK	Best-in-class	$0.30/1k chars (~$0.05/min)	Highest perceived voice quality; what users will ask for
`tts-openai`	Cloud BYOK	Very good	$0.015/1k chars (~$0.003/min)	Cheapest cloud; reuses `OPENAI_API_KEY` from `llm-openai`

Tier-2 backends (community-contributable; same contract)¶

tts-google (Wavenet)
tts-azure (Neural voices)
tts-aws-polly
tts-coqui (open source; voice cloning)
tts-bark (Suno's emotive voices)
tts-deepgram-aura
tts-espeak-ng (Linux native; robotic but always-available fallback)
Windows SAPI shell-out

Error Model¶

Typed errors, mirroring other Vox surfaces:

TTSError {
  Kind     TTSErrorKind
  Backend  string
  Op       string                           // "open" | "synthesize" | "stream-synthesize" | etc.
  Message  string
  Cause    Error?
}

TTSErrorKind {
  ErrInvalidConfig
  ErrAuthFailed
  ErrQuotaExceeded                          // rate-limit / quota / budget
  ErrUnsupported                            // voice / format / feature not supported
  ErrModelNotFound                          // local model file missing (Piper)
  ErrModelCorrupt                           // checksum mismatch
  ErrTimeout
  ErrBackendUnavailable                     // provider down, network unreachable
  ErrTextTooLong                            // input exceeds max_text_chars
  ErrInvalidText                            // empty, control characters, etc.
  ErrTransient                              // retry may help
  ErrPersistent                             // retry won't help
  ErrInternal
}

The orchestrator handles failures via the fallback chain. TTS backends NEVER block the pipeline — worst case is "speech output skipped for this envelope, structured warning logged."

SSML — Deferred to v1.x¶

v1 ships plain-text input only. SSML support varies wildly across backends (tts-say and tts-piper have none; tts-elevenlabs partial; tts-google full) and isn't core to the LLM-response use case.

When v1.x adds SSML:

Capabilities.SupportsSSML changes from a placeholder to an advisory flag that backends may set
SynthesizeOptions.IsSSML flag added
Plain-text input remains the default and always supported

Versioning and Stability¶

tts/v1 is the contract above. Once frozen:

Non-breaking changes (allowed in v1.x): adding optional fields to Capabilities, SynthesizeOptions, AudioBuffer, AudioChunk, VoiceInfo, Stats; adding new TTSErrorKind values; adding new built-in backends; adding SSML support (per Sub-decision c); adding word-timing markers as first-class fields (currently in Custom); adding new shared-credential targets.
Breaking changes (require v2): changing the TTSBackend interface signature; changing the meaning of any existing field; changing chunk-encoding semantics; changing the streaming channel contract.

The core supports one major version (vN) of tts/ at a time, except for a window of overlap during migrations.

Audio Playback — Out of Scope¶

To be explicit: the audio-playback subsystem is NOT part of tts/v1. Backends produce bytes. What happens to those bytes (file write, speaker playback, network stream) is the caller's concern.

The actual playback subsystem (internal/audio/playback/, separate bead blackrim-vox-40r) provides:

Cross-platform audio output (CoreAudio / WASAPI / PipeWire via miniaudio — same dependency as capture/v1)
Format decoding (MP3 / Opus / AIFF → PCM)
Sample-rate conversion when device-native rate ≠ source rate
playback.started / playback.ended events for half-duplex coordination with capture

tts/v1 interoperates with the playback subsystem but doesn't define it. Clean separation: synthesize, play, capture.

Reference Build Order¶

Order	Backend	Status	Why this order
1	`tts-say`	Shipped (`internal/tts/say/`)	First. Zero install on macOS; native voices; no model files; no network; no CGo. Unblocks the entire downstream wiring (TTS sink, audio playback, half-duplex) with the simplest possible backend.
2	`tts-piper`	Shipped (`internal/tts/piper/`)	First cross-platform backend; local neural TTS via Piper (MIT); recommended open-source backend for non-macOS and cross-platform deployments. Validates "local + no API key" baseline.
3	`tts-elevenlabs`	Shipped (`internal/tts/elevenlabs/`)	First cloud backend; validates BYOK + cost controls + streaming mode
4	`tts-openai`	Planned	Validates `shares_credential_with` reuse (with `llm-openai`); validates non-streaming-first cloud path
5+	tier-2 backends	Planned	Same contract, different wire protocols

Build the audio-playback subsystem (blackrim-vox-40r) and TTS sink (blackrim-vox-38t) with tts-say alone first. Add backends after the base interface + playback path are proven.

Project Principle: Opinionated Defaults, Every Default Configurable¶

This contract continues the principle established in the other v1 contracts. Every behavior with a defensible default (stream_buffer_chunks: 32, health_recovery_interval: 5m, emit_cost_metadata: true, preflight_cost_estimate: true, the tier-1 backend choices, etc.) is exposed as a config knob. Defaults reflect a considered recommendation for the typical voice-to-voice conversational use case; the knobs exist so specialized workflows can tune them.

TTS Sink wiring layer¶

internal/sink/tts (bead blackrim-vox-38t) provides the wiring layer between tts.Backend and audio/playback. It implements sink.Sink and consumes llm_response envelopes from the orchestrator, calling backend.Synthesize() and routing the resulting AudioBuffer to player.Play(). See docs/internal/tts-sink.md for the full contract.

Enable in vox listen with --tts (default backend: say).
