tts/v1 — Text-to-Speech Backends¶
Status: Draft (design-locked, ready for first implementation) · Stability: v1 will be frozen with the first reference backend (tts-say) · Implementations: in-tree only until v1 is frozen.
The tts/v1 surface is the output side of the voice loop. A TTS
backend takes text and produces audio bytes. Vox uses this to speak
LLM responses, meeting summaries, action confirmations — closing the
conversational loop from voice-in to voice-out.
tts/v1 is the sixth pipeline contract, alongside capture/v1,
segment/v1, asr/v1, router/v1, sink/v1. It's an output-side
peer of the input-side asr/v1.
This document is the contract. Backends conforming to it can be loaded
by any version of the Vox core that supports tts/v1.
Scope¶
tts/v1 covers:
- The backend interface (offline + streaming synthesis)
- Output destination model — backends emit bytes; callers choose destination
- Voice discovery and selection (backend-native IDs, no contract-level normalization)
- Streaming chunk shape and cadence
- BYOK authentication + shared credentials
- Cost controls (budgets, rate limits, pre-flight estimation)
- Fallback chain for backend failure
- Per-envelope cost transparency
- Capabilities advertisement
- Error model + lifecycle
- Versioning and stability rules
tts/v1 does not cover:
- Audio playback to speakers (separate audio-playback subsystem)
- Half-duplex coordination with capture (lives in the playback subsystem)
- SSML (deferred to v1.x as additive — see Sub-decision (c))
- Voice cloning workflow UX (per-backend; not contract surface)
Backend Interface¶
TTSBackend {
    # Identity
    Name() -> string
    Capabilities() -> Capabilities
    # Lifecycle
    Open(config) -> Error
    Close() -> Error
    # Voice discovery
    Voices() -> []VoiceInfo | Error
    # Synthesis: offline (single buffer)
    Synthesize(ctx, text, opts) -> *AudioBuffer | Error
    # Synthesis: streaming (chunks arrive as the backend produces them)
    StreamSynthesize(ctx, text, opts) -> <-chan AudioChunk | Error
    # Diagnostics
    Stats() -> Stats
    Health() -> Health
}
Lifecycle state machine¶
              Open()
[Closed] ---------------> [Open]
   ^                         |
   |                         | (any number of Synthesize / StreamSynthesize calls)
   |                         |
   +--------- Close() -------+
- Open() resolves auth, downloads/validates models if needed, and prepares the backend. Idempotent re-open after Close() is allowed.
- Synthesize / StreamSynthesize are stateless with respect to the backend lifecycle: there is no per-call open. Both are safe to call concurrently from multiple goroutines on the same backend instance.
- Close() releases resources (HTTP connections, model file handles, miniaudio contexts if any). Idempotent.
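The lifecycle rules above can be sketched in Go as follows. This is a minimal illustration, not the reference implementation: `stubBackend` and `errClosed` are hypothetical names, and rejecting synthesis while the backend is closed is an assumption of this sketch rather than something the contract states.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errClosed is a hypothetical error for calls made while the backend is closed.
var errClosed = errors.New("backend is closed")

// stubBackend is a hypothetical backend used only to illustrate the
// lifecycle rules; it is not part of tts/v1.
type stubBackend struct {
	mu   sync.RWMutex
	open bool
}

// Open prepares the backend. Calling it again after Close is allowed.
func (b *stubBackend) Open() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.open = true
	return nil
}

// Close releases resources. Calling it twice is a no-op.
func (b *stubBackend) Close() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.open = false
	return nil
}

// Synthesize takes only a read lock, so concurrent calls do not serialize.
// Rejecting calls while closed is an assumption of this sketch.
func (b *stubBackend) Synthesize(text string) ([]byte, error) {
	b.mu.RLock()
	defer b.mu.RUnlock()
	if !b.open {
		return nil, errClosed
	}
	return []byte(text), nil // placeholder standing in for audio bytes
}

func main() {
	b := &stubBackend{}
	if _, err := b.Synthesize("hi"); err != nil {
		fmt.Println("before Open:", err)
	}
	_ = b.Open()
	audio, _ := b.Synthesize("hi")
	fmt.Println("after Open:", len(audio), "bytes")
	_ = b.Close()
	_ = b.Close() // second Close is a no-op
	_ = b.Open()  // re-open after Close is allowed
}
```

A read/write mutex is one simple way to keep concurrent Synthesize calls cheap while still making Open and Close safe; a real backend may need something finer-grained.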
Output Destination Model¶
Backends produce bytes. The caller decides what to do with them.
Three concerns, three components:
| Component | Responsibility |
|---|---|
| TTS backend (this contract) | Text → audio bytes |
| Audio-playback subsystem | Audio bytes → speakers (or file) |
| Capture adapter | Microphone → frames (and pauses during playback for half-duplex) |
The TTS sink (internal/sink/tts/, separate bead) is the orchestrator
between these three components. It calls Synthesize on the backend,
hands the bytes to the playback subsystem, and signals capture to
pause.
This keeps the contract slim and testable: a TTS backend never depends on audio hardware. Tests can capture the returned bytes and assert against them without any speakers in the loop.
Wire Types¶
AudioBuffer — offline output¶
AudioBuffer {
    Samples      []byte // raw bytes, format per SampleFormat
    SampleFormat string // "f32" | "i16" | "mp3" | "opus" | "aiff" | "raw"
    SampleRate   uint32 // 0 for compressed formats
    Channels     uint8  // 0 for compressed formats
}
The backend returns whatever format it natively produces. No
transcoding inside the backend — a separate audio/transcode helper
package handles cross-format conversion when callers need a specific
output format.
AudioChunk — streaming output¶
AudioChunk {
    Bytes      []byte
    Encoding   string           // "f32" | "i16" | "mp3" | "opus" | "aiff": what's in Bytes
    SampleRate uint32           // for raw formats; 0 for compressed
    Channels   uint8            // for raw formats; 0 for compressed
    IsFinal    bool             // true on the last chunk; the backend closes the channel after it
    Sequence   uint64           // monotonically increasing from 0
    Custom     map<string, any> // backend-specific (e.g., word-timing markers)
}
- IsFinal: true MUST appear exactly once, on the last chunk. After IsFinal, the backend closes the channel.
- Channel close without IsFinal = abnormal termination. The caller treats the assembled audio as truncated.
- No enforced chunk cadence. Backends emit at their natural rate. Chunks aligned to codec frame boundaries (MP3 ~20-60ms, Opus ~20ms, PCM at backend's choice) are normal.
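On the consumer side, the IsFinal rules could be handled roughly like this. The sketch mirrors only the AudioChunk fields it needs; `collect`, `feed`, and `ErrTruncated` are hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// AudioChunk mirrors only the contract fields this sketch uses.
type AudioChunk struct {
	Bytes    []byte
	IsFinal  bool
	Sequence uint64
}

// ErrTruncated is a hypothetical error for a stream that closed without IsFinal.
var ErrTruncated = errors.New("stream closed without IsFinal: audio truncated")

// collect drains a chunk stream and returns the assembled bytes.
// It returns ErrTruncated if the channel closes before an IsFinal chunk.
func collect(ch <-chan AudioChunk) ([]byte, error) {
	var out []byte
	for c := range ch {
		out = append(out, c.Bytes...)
		if c.IsFinal {
			return out, nil
		}
	}
	return out, ErrTruncated
}

// feed is a helper that emits the given chunks and then closes the channel,
// standing in for a backend.
func feed(chunks ...AudioChunk) <-chan AudioChunk {
	ch := make(chan AudioChunk, len(chunks))
	for _, c := range chunks {
		ch <- c
	}
	close(ch)
	return ch
}

func main() {
	full, err := collect(feed(
		AudioChunk{Bytes: []byte("ab"), Sequence: 0},
		AudioChunk{Bytes: []byte("cd"), Sequence: 1, IsFinal: true},
	))
	fmt.Println(len(full), err)

	cut, err := collect(feed(AudioChunk{Bytes: []byte("ab"), Sequence: 0}))
	fmt.Println(len(cut), err)
}
```

Returning the partial bytes alongside the error lets the caller decide whether to play truncated audio or discard it.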
VoiceInfo — voice discovery¶
VoiceInfo {
    ID          string           // backend-native; what you pass to SynthesizeOptions.Voice
    DisplayName string           // human-readable: "Samantha (Premium)", "Rachel (ElevenLabs)"
    Tags        []string         // backend-attached, free-form: ["en-US", "female", "warm", "premium"]
    Languages   []string         // BCP-47 codes the voice supports
    Preview     string?          // optional: URL or path to a sample audio clip
    Custom      map<string, any> // backend-specific extras (style transfer, emotions, etc.)
}
SynthesizeOptions¶
SynthesizeOptions {
    Voice  string           // backend-native ID; empty = backend default
    Speed  float64          // 0.5–2.0 typical; 1.0 = natural; backend clamps if out of range
    Pitch  float64          // -1.0 to +1.0; 0 = natural; advisory; backend interprets
    Format string           // requested output format; backend uses native if unsupported
    Custom map<string, any> // backend-specific knobs (stability, style, etc.)
}
Speed and Pitch are advisory — backends honor when possible,
ignore when not. Format is a request-with-fallback — backend uses
its native if requested format isn't supported and stamps the actual
format in AudioBuffer.SampleFormat / AudioChunk.Encoding.
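A backend might normalize these options along the following lines. The helper names `resolveFormat` and `clampSpeed` are hypothetical, and treating a zero Speed as "unset, use 1.0" is an assumption of this sketch, not a contract rule.

```go
package main

import "fmt"

// resolveFormat returns the format a backend would actually emit: the
// requested one if supported, otherwise the backend's native format.
// The result is what gets stamped into SampleFormat / Encoding.
func resolveFormat(requested, native string, supported []string) string {
	for _, f := range supported {
		if f == requested {
			return requested
		}
	}
	return native
}

// clampSpeed bounds speed to [lo, hi]. Treating 0 (the Go zero value) as
// "unset, use 1.0" is an assumption of this sketch.
func clampSpeed(speed, lo, hi float64) float64 {
	switch {
	case speed == 0:
		return 1.0
	case speed < lo:
		return lo
	case speed > hi:
		return hi
	}
	return speed
}

func main() {
	supported := []string{"mp3", "i16"}
	fmt.Println(resolveFormat("opus", "mp3", supported)) // unsupported request
	fmt.Println(resolveFormat("i16", "mp3", supported))  // supported request
	fmt.Println(clampSpeed(3.0, 0.5, 2.0))               // out-of-range speed
}
```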
Capabilities¶
Capabilities {
    # Modes
    SupportsStreaming      bool
    StreamingLatencyMS     uint32   // typical time-to-first-chunk; informational
    # Voices
    VoiceCount             uint32   // total voices available
    DefaultVoice           string   // backend's default voice ID
    # Format
    SupportedFormats       []string // "f32" | "i16" | "mp3" | "opus" | "aiff" | "raw"
    NativeFormat           string   // what this backend prefers to emit
    # Languages
    SupportedLanguages     []string // BCP-47 tags
    # Features
    SupportsSpeed          bool     // honors opts.Speed
    SupportsPitch          bool     // honors opts.Pitch
    SupportsSSML           bool     // false in v1; placeholder for v1.x
    SupportsVoiceCloning   bool     // pluggable; some backends support it
    SupportsWordTimings    bool     // emits word-level timing in chunks
    # Cost
    CostPerMillionCharsUSD float64  // 0.0 for local
    IsLocal                bool
    # Limits
    MaxTextChars           uint32   // hard cap; longer text rejected per-call
}
Voice Selection¶
Backend-native IDs only. No contract-level normalization across
backends. opts.Voice = "Samantha" works for tts-say; opts.Voice =
"21m00Tcm4TlvDq8ikWAM" works for tts-elevenlabs. Each backend speaks
its own language.
Discovery via Voices()¶
vox tts voices --backend elevenlabs
# → table with ID, DisplayName, Tags, Languages, Preview
vox tts voices --backend say --tag female
# → filtered list
Default per backend¶
Each backend declares a DefaultVoice in its capabilities. When
opts.Voice = "", the backend uses its default.
Reasonable defaults:
| Backend | Default voice |
|---|---|
| tts-say | Samantha |
| tts-piper | First model alphabetically (depends on installed) |
| tts-elevenlabs | Rachel |
| tts-openai | alloy |
Cross-backend aliases — user config, not contract¶
Users who want "same voice across backends" can define aliases above the contract:
tts:
  voice_aliases:
    house: { tts-say: Samantha, tts-elevenlabs: Rachel, tts-piper: en_US-libritts-high }
    formal: { tts-say: Daniel, tts-elevenlabs: Adam, tts-piper: en_US-ryan-high }
sinks:
  - name: speak-out
    type: tts-elevenlabs
    voice: "@house" # resolved via voice_aliases above the contract
Backends only see backend-native strings. Alias resolution is an orchestrator/CLI concern.
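Alias resolution above the contract could look roughly like this. The `aliasTable` type and `resolveVoice` function are hypothetical, and returning an error for an unknown alias (rather than falling back to the backend default) is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// aliasTable maps alias name -> backend name -> backend-native voice ID,
// mirroring the voice_aliases config shape.
type aliasTable map[string]map[string]string

// resolveVoice turns "@alias" into a backend-native ID. Anything without
// the "@" prefix is passed through untouched.
func resolveVoice(voice, backend string, table aliasTable) (string, error) {
	if !strings.HasPrefix(voice, "@") {
		return voice, nil
	}
	alias := strings.TrimPrefix(voice, "@")
	id, ok := table[alias][backend]
	if !ok {
		return "", fmt.Errorf("voice alias %q has no entry for backend %q", alias, backend)
	}
	return id, nil
}

func main() {
	table := aliasTable{
		"house": {"tts-say": "Samantha", "tts-elevenlabs": "Rachel"},
	}
	fmt.Println(resolveVoice("@house", "tts-elevenlabs", table))
	fmt.Println(resolveVoice("Daniel", "tts-say", table))
	fmt.Println(resolveVoice("@house", "tts-openai", table))
}
```

Because resolution happens before the call, the backend receives only the resolved string and never needs to know aliases exist.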
Streaming¶
For low-latency conversational mode, StreamSynthesize returns a
channel of AudioChunks. The first chunk should arrive within
Capabilities.StreamingLatencyMS.
Chunk encoding per chunk¶
Each chunk declares its own encoding via Encoding. Backends emit
their natural format:
| Backend | Streaming format |
|---|---|
| tts-piper | Raw f32 PCM |
| tts-elevenlabs | MP3 (default) or PCM (opt-in) |
| tts-openai | MP3, AAC, Opus per response_format |
The audio-playback subsystem handles decoding before pushing to the audio device.
Backpressure¶
The backend writes to a bounded-buffer channel (default capacity 32 chunks ≈ 5-10s of audio at typical chunk rates). On full channel: backend blocks on send. Not drop — streaming TTS is meant for sub-second start-of-playback; if the consumer is 32 chunks behind, something is broken and blocking is the right signal.
Buffer size configurable via stream_buffer_chunks.
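The block-on-full behavior can be illustrated with a bounded Go channel. In this sketch `emit` and `fillUntilBlocked` are hypothetical helpers, and unblocking a stalled send through context cancellation is an assumption; the contract only says the backend blocks rather than drops.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// AudioChunk mirrors only the contract fields this sketch uses.
type AudioChunk struct {
	Bytes    []byte
	Sequence uint64
}

// emit sends one chunk, blocking while the channel is full. It never drops.
// Giving up when ctx is cancelled is an assumption of this sketch.
func emit(ctx context.Context, out chan<- AudioChunk, c AudioChunk) error {
	select {
	case out <- c:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// fillUntilBlocked sends n chunks into a channel of the given capacity with
// no consumer attached. It reports how many chunks were accepted before the
// send blocked and the timeout fired.
func fillUntilBlocked(capacity, n int, timeout time.Duration) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	out := make(chan AudioChunk, capacity)
	for i := 0; i < n; i++ {
		if err := emit(ctx, out, AudioChunk{Sequence: uint64(i)}); err != nil {
			return i, err
		}
	}
	return n, nil
}

func main() {
	// Capacity 32 matches the default stream_buffer_chunks.
	sent, err := fillUntilBlocked(32, 40, 100*time.Millisecond)
	fmt.Println(sent, err)
}
```

With no consumer, exactly `capacity` chunks are accepted and the next send stalls until the context gives up, which is the "something is broken" signal the text describes.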
Slow-backend marker¶
If a backend can't keep up with playback (one chunk takes longer to
generate than its audible duration), it emits a marker chunk with
Custom["backend.slow"] = true so the playback layer can react —
insert brief silence, swap backends, surface a warning.
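For raw PCM, the "slower than playback" check reduces to comparing generation time against the chunk's audible duration. The following is a minimal sketch with hypothetical helper names; compressed formats report zero sample rate and channels, so their duration would have to come from the codec instead.

```go
package main

import (
	"fmt"
	"time"
)

// audibleDuration returns how long a raw PCM chunk plays for.
// It returns 0 when the rate or channel count is unknown (compressed formats).
func audibleDuration(nBytes, bytesPerSample int, sampleRate uint32, channels uint8) time.Duration {
	frameBytes := bytesPerSample * int(channels)
	if frameBytes == 0 || sampleRate == 0 {
		return 0
	}
	frames := nBytes / frameBytes
	return time.Duration(frames) * time.Second / time.Duration(sampleRate)
}

// isSlow reports whether generating a chunk took longer than it plays for.
func isSlow(generation, audible time.Duration) bool {
	return audible > 0 && generation > audible
}

func main() {
	// 32000 bytes of mono i16 (2 bytes per sample) at 16 kHz.
	audible := audibleDuration(32000, 2, 16000, 1)
	fmt.Println(audible)
	fmt.Println(isSlow(1500*time.Millisecond, audible))
	fmt.Println(isSlow(200*time.Millisecond, audible))
}
```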
Word-timing in Custom map¶
Some backends (ElevenLabs with_timestamps, Google Cloud TTS) emit
word-level timing alongside audio. For v1 this lives in the chunk's
Custom map under tts.word_timings. First-class field promotion is
a v1.x decision once usage warrants.
Latency telemetry¶
When a backend exceeds its declared StreamingLatencyMS by 2×, it
emits a tts.latency_exceeded event. Same pattern as asr/v1's
partial-latency telemetry.
BYOK Authentication¶
Same precedence chain as asr/v1 and sink/v1 LLM:
- Explicit env var (ELEVENLABS_API_KEY, OPENAI_API_KEY, etc.)
- OS keychain (default; populated via vox auth set <backend>)
- Config file api_key field (deprecated; warns once)
- External secrets manager (future secrets/v1)
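The precedence chain amounts to "first source that yields a credential wins". A minimal sketch, assuming hypothetical `source`, `resolve`, and `envSource` names; the keychain and config lookups are stubbed out because their real APIs are not described here.

```go
package main

import (
	"fmt"
	"os"
)

// source is one step of the precedence chain.
type source struct {
	name   string
	lookup func() (string, bool)
}

// resolve walks the chain in order and returns the first non-empty credential,
// along with the name of the source that supplied it.
func resolve(chain []source) (value, from string, ok bool) {
	for _, s := range chain {
		if v, found := s.lookup(); found && v != "" {
			return v, s.name, true
		}
	}
	return "", "", false
}

// envSource reads a credential from an environment variable.
func envSource(key string) source {
	return source{
		name:   "env:" + key,
		lookup: func() (string, bool) { return os.LookupEnv(key) },
	}
}

func main() {
	chain := []source{
		envSource("ELEVENLABS_API_KEY"),
		// Hypothetical stand-ins for the keychain and config-file lookups.
		{name: "keychain", lookup: func() (string, bool) { return "", false }},
		{name: "config", lookup: func() (string, bool) { return "", false }},
	}
	_, from, ok := resolve(chain)
	fmt.Println(from, ok)
}
```

Returning the source name alongside the value makes it straightforward to emit the one-time deprecation warning when the config file is the source that matched.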
Shared credentials¶
Backends that piggyback on existing Vox-side credentials declare so:
sinks:
  - name: speak-out
    type: tts-openai
    auth:
      shares_credential_with: llm-openai # reuse OPENAI_API_KEY from the llm-openai sink
When shares_credential_with is set, the TTS backend skips its own
credential lookup and reads the resolved credential from the named
sink. One credential, two surfaces.
Cost Controls (Cloud Backends)¶
Cloud TTS is metered by character. ElevenLabs at ~$0.30/1k chars adds up fast. First-class guardrails per backend:
sinks:
  - name: speak-out
    type: tts-elevenlabs
    cost_controls:
      budget_daily_usd: 2.00
      budget_monthly_usd: 50.00
      on_budget_warn_pct: 80 # WARN log + event at 80% of budget
      on_budget_exceed: halt # halt | warn | fallback
      fallback_backend: tts-piper # required if on_budget_exceed: fallback
      rate_limit_per_minute: 30
      max_text_chars: 5000 # reject implausibly long single-utterance synthesis
Behavior¶
| Setting | What it does |
|---|---|
| budget_daily_usd / budget_monthly_usd | Hard caps. Vox tracks consumed chars × CostPerMillionCharsUSD |
| on_budget_warn_pct | Emit tts.budget_warning event + WARN log + audit event at this threshold |
| on_budget_exceed | halt = stop synthesizing; warn = continue + log; fallback = route to fallback_backend |
| rate_limit_per_minute | Soft throttle. Synthesize calls queue if exceeded |
| max_text_chars | Reject calls with text longer than this; prevents accidental novel-length synthesis |
Spend tracking shares the ~/.vox/state/cost-tracker.db SQLite store
with ASR + LLM tracking, in a tts_costs table.
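The budget settings combine into a small decision made before each synthesis call. The sketch below uses hypothetical names (`action`, `checkBudget`); treating "spent >= budget" as exceeded and defaulting unknown policies to halt are assumptions, not contract rules.

```go
package main

import "fmt"

// action is what the caller should do before the next synthesis call.
type action string

const (
	proceed     action = "proceed"
	proceedWarn action = "proceed+warn"
	halt        action = "halt"
	fallback    action = "fallback"
)

// checkBudget applies on_budget_warn_pct and on_budget_exceed to the spend
// so far. Treating "spent >= budget" as exceeded, and defaulting an unknown
// policy to halt, are assumptions of this sketch.
func checkBudget(spentUSD, budgetUSD float64, warnPct int, onExceed string) action {
	if spentUSD >= budgetUSD {
		switch onExceed {
		case "warn":
			return proceedWarn
		case "fallback":
			return fallback
		default:
			return halt
		}
	}
	if spentUSD >= budgetUSD*float64(warnPct)/100 {
		return proceedWarn
	}
	return proceed
}

func main() {
	// budget_daily_usd: 2.00, on_budget_warn_pct: 80
	fmt.Println(checkBudget(1.00, 2.00, 80, "halt"))
	fmt.Println(checkBudget(1.70, 2.00, 80, "halt"))
	fmt.Println(checkBudget(2.50, 2.00, 80, "halt"))
	fmt.Println(checkBudget(2.50, 2.00, 80, "fallback"))
}
```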
Pre-flight cost estimation¶
tts:
  preflight_cost_estimate: true # default true for paid backends
  preflight_cost_log_threshold_usd: 0.05
Cheap: chars × CostPerMillionCharsUSD. Logged at INFO when over
threshold — surfaces runaway-long synthesis before it bills.
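As a worked example of the estimate: since the rate is expressed per million characters, the product is divided by 1,000,000. The sketch below assumes ElevenLabs' quoted ~$0.30 per 1k chars, which is 300.0 per million; `estimateCostUSD` and `shouldLog` are hypothetical names.

```go
package main

import "fmt"

// estimateCostUSD computes chars × CostPerMillionCharsUSD, scaled from
// "per million characters" down to a dollar amount.
func estimateCostUSD(chars int, costPerMillionCharsUSD float64) float64 {
	return float64(chars) * costPerMillionCharsUSD / 1_000_000
}

// shouldLog reports whether the estimate is over the configured
// preflight_cost_log_threshold_usd.
func shouldLog(estimateUSD, thresholdUSD float64) bool {
	return estimateUSD > thresholdUSD
}

func main() {
	// Assumption: ~$0.30 per 1k chars, i.e. 300.0 per million chars.
	const ratePerMillion = 300.0
	const threshold = 0.05

	short := estimateCostUSD(127, ratePerMillion)
	long := estimateCostUSD(5000, ratePerMillion)
	fmt.Printf("127 chars: $%.4f, log=%v\n", short, shouldLog(short, threshold))
	fmt.Printf("5000 chars: $%.4f, log=%v\n", long, shouldLog(long, threshold))
}
```

At that rate, 127 characters come to $0.0381 (below the 0.05 threshold) and a 5000-character utterance comes to $1.50 (logged).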
Cost transparency on the envelope¶
Per-envelope cost metadata when TTS runs on an llm_response envelope:
envelope.Provenance.Custom {
    "tts.backend": "elevenlabs",
    "tts.voice": "Rachel",
    "tts.audio_seconds": "8.3",
    "tts.chars_synthesized": "127",
    "tts.cost_usd": "0.0381",
    "tts.model": "eleven_turbo_v2",
    "tts.elapsed_ms": "412"
}
Controlled by tts.emit_cost_metadata: true|false (default true for
cloud, false for local — no cost to report).
Fallback Chain¶
When a backend goes unhealthy (network, auth, quota, budget exceeded
with on_budget_exceed: fallback, timeout), route the next synthesis
through the next healthy backend.
tts:
  fallback_chains:
    default: [tts-elevenlabs, tts-openai, tts-piper, tts-say]
Semantics¶
- Failed backend out of rotation for health_recovery_interval (default 5 min)
- Subsequent calls route to the next healthy backend
- After recovery interval, Vox probes the primary; resumes if healthy
- Fallback events emit tts.backend_fallback telemetry + audit event
- User-visible: structured WARN log the first time fallback fires per session
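The rotation semantics above could be sketched as follows. The `chain` type and its methods are hypothetical, and the sketch simplifies the probe step: once the recovery interval has elapsed a backend simply becomes eligible again, whereas a real orchestrator would probe it and mark it failed again if it is still unhealthy.

```go
package main

import (
	"fmt"
	"time"
)

// epoch is a fixed reference time so the example is deterministic.
var epoch = time.Unix(0, 0)

// chain tracks when each backend last failed.
type chain struct {
	order    []string
	recovery time.Duration
	failedAt map[string]time.Time
}

func newChain(recovery time.Duration, order ...string) *chain {
	return &chain{order: order, recovery: recovery, failedAt: map[string]time.Time{}}
}

// markFailed takes a backend out of rotation starting at the given time.
func (c *chain) markFailed(name string, at time.Time) {
	c.failedAt[name] = at
}

// pick returns the first backend in chain order that has either never failed
// or whose recovery interval has elapsed. It returns false if none qualifies.
func (c *chain) pick(at time.Time) (string, bool) {
	for _, name := range c.order {
		failed, ok := c.failedAt[name]
		if !ok || at.Sub(failed) >= c.recovery {
			return name, true
		}
	}
	return "", false
}

func main() {
	c := newChain(5*time.Minute, "tts-elevenlabs", "tts-openai", "tts-piper", "tts-say")
	c.markFailed("tts-elevenlabs", epoch)
	fmt.Println(c.pick(epoch.Add(time.Minute)))     // primary still out of rotation
	fmt.Println(c.pick(epoch.Add(5 * time.Minute))) // primary eligible again
}
```

When `pick` returns false, the whole chain is exhausted and the caller would move on to the emergency local fallback described next.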
Emergency local fallback¶
If the entire chain fails, Vox uses an always-available emergency
local fallback (tts-piper with the smallest bundled voice, or
tts-say on macOS). If even that fails: the orchestrator emits a
structured warning and SKIPS speech output for that envelope —
pipeline never blocks.
Tier-1 Backends (ship with v1)¶
| Backend | Local/Cloud | Quality | Cost | Why tier-1 |
|---|---|---|---|---|
| tts-say | Local (macOS native) | OK / Good | $0 | Zero install on macOS; Siri voices are excellent; out-of-the-box path |
| tts-piper | Local | Good | $0 | Cross-platform; MIT-licensed; real-time on commodity hardware; no-API-key baseline |
| tts-elevenlabs | Cloud BYOK | Best-in-class | $0.30/1k chars (~$0.05/min) | Highest perceived voice quality; what users will ask for |
| tts-openai | Cloud BYOK | Very good | $0.015/1k chars (~$0.003/min) | Cheapest cloud; reuses OPENAI_API_KEY from llm-openai |
Tier-2 backends (community-contributable; same contract)¶
- tts-google (Wavenet)
- tts-azure (Neural voices)
- tts-aws-polly
- tts-coqui (open source; voice cloning)
- tts-bark (Suno's emotive voices)
- tts-deepgram-aura
- tts-espeak-ng (Linux native; robotic but always-available fallback)
- Windows SAPI shell-out
Error Model¶
Typed errors, mirroring other Vox surfaces:
TTSError {
    Kind    TTSErrorKind
    Backend string
    Op      string // "open" | "synthesize" | "stream-synthesize" | etc.
    Message string
    Cause   Error?
}
TTSErrorKind {
    ErrInvalidConfig
    ErrAuthFailed
    ErrQuotaExceeded      // rate-limit / quota / budget
    ErrUnsupported        // voice / format / feature not supported
    ErrModelNotFound      // local model file missing (Piper)
    ErrModelCorrupt       // checksum mismatch
    ErrTimeout
    ErrBackendUnavailable // provider down, network unreachable
    ErrTextTooLong        // input exceeds max_text_chars
    ErrInvalidText        // empty, control characters, etc.
    ErrTransient          // retry may help
    ErrPersistent         // retry won't help
    ErrInternal
}
The orchestrator handles failures via the fallback chain. TTS backends NEVER block the pipeline — worst case is "speech output skipped for this envelope, structured warning logged."
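One possible Go rendering of the error model is shown below. It is a sketch, not the in-tree definition: the `Kind` type name, the `kindOf` and `retryable` helpers, and the error message layout are assumptions. Only ErrTransient is treated as retryable because it is the only kind documented as "retry may help".

```go
package main

import (
	"errors"
	"fmt"
)

// Kind enumerates the TTSErrorKind values from the contract.
type Kind int

const (
	ErrInvalidConfig Kind = iota
	ErrAuthFailed
	ErrQuotaExceeded
	ErrUnsupported
	ErrModelNotFound
	ErrModelCorrupt
	ErrTimeout
	ErrBackendUnavailable
	ErrTextTooLong
	ErrInvalidText
	ErrTransient
	ErrPersistent
	ErrInternal
)

// TTSError mirrors the contract's error shape.
type TTSError struct {
	Kind    Kind
	Backend string
	Op      string
	Message string
	Cause   error
}

func (e *TTSError) Error() string {
	return fmt.Sprintf("tts %s: %s: %s", e.Backend, e.Op, e.Message)
}

// Unwrap exposes Cause so errors.Is / errors.As can walk the chain.
func (e *TTSError) Unwrap() error { return e.Cause }

// kindOf extracts the Kind of the first TTSError found in an error chain.
func kindOf(err error) (Kind, bool) {
	var te *TTSError
	if errors.As(err, &te) {
		return te.Kind, true
	}
	return 0, false
}

// retryable reports whether a retry may help.
func retryable(err error) bool {
	k, ok := kindOf(err)
	return ok && k == ErrTransient
}

func main() {
	base := &TTSError{Kind: ErrTransient, Backend: "tts-elevenlabs", Op: "synthesize", Message: "connection reset"}
	wrapped := fmt.Errorf("speak-out sink: %w", base)
	fmt.Println(wrapped)
	fmt.Println(retryable(wrapped))
}
```

Because `kindOf` uses `errors.As`, callers can still classify an error after intermediate layers have wrapped it with `%w`.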
SSML — Deferred to v1.x¶
v1 ships plain-text input only. SSML support varies wildly across
backends (tts-say and tts-piper have none; tts-elevenlabs partial;
tts-google full) and isn't core to the LLM-response use case.
When v1.x adds SSML:
- Capabilities.SupportsSSML flips to advisory
- SynthesizeOptions.IsSSML flag added
- Plain-text input remains the default and always supported
Versioning and Stability¶
tts/v1 is the contract above. Once frozen:
- Non-breaking changes (allowed in v1.x): adding optional fields to Capabilities, SynthesizeOptions, AudioBuffer, AudioChunk, VoiceInfo, Stats; adding new TTSErrorKind values; adding new built-in backends; adding SSML support (per Sub-decision c); adding word-timing markers as first-class fields (currently in Custom); adding new shared-credential targets.
- Breaking changes (require v2): changing the TTSBackend interface signature; changing the meaning of any existing field; changing chunk-encoding semantics; changing the streaming channel contract.
The core supports one vN of tts/ at a time, with overlap during
migrations.
Audio Playback — Out of Scope¶
To be explicit: the audio-playback subsystem is NOT part of
tts/v1. Backends produce bytes. What happens to those bytes (file
write, speaker playback, network stream) is the caller's concern.
The actual playback subsystem (internal/audio/playback/, separate
bead blackrim-vox-40r) provides:
- Cross-platform audio output (CoreAudio / WASAPI / PipeWire via miniaudio, the same dependency as capture/v1)
- Format decoding (MP3 / Opus / AIFF → PCM)
- Sample-rate conversion when device-native rate ≠ source rate
- playback.started / playback.ended events for half-duplex coordination with capture
tts/v1 interoperates with the playback subsystem but doesn't define
it. Clean separation: synthesize, play, capture.
Reference Build Order¶
| Order | Backend | Status | Why this order |
|---|---|---|---|
| 1 | tts-say | Shipped (internal/tts/say/) | First. Zero install on macOS; native voices; no model files; no network; no CGo. Unblocks the entire downstream wiring (TTS sink, audio playback, half-duplex) with the simplest possible backend. |
| 2 | tts-piper | Shipped (internal/tts/piper/) | First cross-platform backend; local neural TTS via Piper (MIT); recommended open-source backend for non-macOS and cross-platform deployments. Validates "local + no API key" baseline. |
| 3 | tts-elevenlabs | Shipped (internal/tts/elevenlabs/) | First cloud backend; validates BYOK + cost controls + streaming mode |
| 4 | tts-openai | Planned | Validates shares_credential_with reuse (with llm-openai); validates non-streaming-first cloud path |
| 5+ | tier-2 backends | Planned | Same contract, different wire protocols |
Build the audio-playback subsystem (blackrim-vox-40r) and TTS sink
(blackrim-vox-38t) with tts-say alone first. Add backends after
the base interface + playback path are proven.
Project Principle: Opinionated Defaults, Every Default Configurable¶
This contract continues the principle established in the other v1
contracts. Every behavior with a defensible default (stream_buffer_chunks:
32, health_recovery_interval: 5m, emit_cost_metadata: true,
preflight_cost_estimate: true, the tier-1 backend choices, etc.) is
exposed as a config knob. Defaults reflect a considered recommendation
for the typical voice-to-voice conversational use case; the knobs
exist so specialized workflows can tune them.
TTS Sink wiring layer¶
internal/sink/tts (bead blackrim-vox-38t) provides the wiring layer
between tts.Backend and audio/playback. It implements sink.Sink and
consumes llm_response envelopes from the orchestrator, calling
backend.Synthesize() and routing the resulting AudioBuffer to
player.Play(). See docs/internal/tts-sink.md for the full contract.
Enable in vox listen with --tts (default backend: say).