Skip to content

ADR: Mobile Client Architecture

Status: Locked, v1.0 Decision date: 2026-05-17 Parent epic: blackrim-vox-boq Decision owner: Jay German


1. Context

Blackrim Vox today targets macOS (Intel + Apple Silicon), Windows, and Linux. The pipeline — capture → segment → ASR → route → sink — is implemented in Go, sized for low CPU and thermal footprint, and runs as a long-lived process on a developer workstation. The boq epic extends that reach to iOS + Android.

Why mobile matters

Vox's value proposition is capturing spoken intent wherever the user's attention is. Workstations are not always in reach. Mobile devices are nearly always present — on a walk, in a meeting room without a laptop, or when the laptop lid is closed. A first-class mobile client closes the gap between having a thought and routing it to the right sink, regardless of which device is in hand.

Subscription framing

Mobile is commercial-only from the first release. It sits in the Enterprise / Individual-subscription tier, not in the Apache-2.0 open core. That framing matters architecturally: the mobile client is not a public, open-source surface the community builds against. It is a product SKU the team ships and controls. Design choices that optimize for community hackability over product reliability are the wrong axis for this decision.

Platform constraints that shape the option space

iOS background audio capture is the hardest constraint in the design space. Apple imposes three controls that compound:

  • Apps may only use the microphone in the background if they have the audio Background Mode entitlement AND hold an AVAudioSession with mixWithOthers or allowBluetooth active. The entitlement triggers additional App Store review scrutiny.
  • Background execution is suspended by the OS after roughly 30 seconds without audible output unless the app is actively playing audio through the speaker. A capture-only session (no TTS response) must sustain a silent audio output track to keep the session alive — a known trick that Apple has flagged as abuse in App Store reviews.
  • App Store review guidelines (4.2, 2.5.4) explicitly prohibit apps that consume background permissions beyond their stated purpose. A voice-capture app that transcribes continuously will be scrutinized on each submission.

Android background restrictions are real but more tractable. A foreground service with FOREGROUND_SERVICE_TYPE_MICROPHONE (API 30+) and a persistent notification is the documented path for continuous capture. It works reliably; the regulatory and UX surface is manageable.

Battery and thermal constraints. Running a full VAD + ASR pipeline on a mobile SoC in real time is viable (Whisper small.en fits in ~500 MB RAM, and Apple Neural Engine / Android NNAPI make inference practical), but it is a significant engineering investment in power profiling, thermal management, and background-execution lifecycle. That investment needs to be justified against the architecture chosen.

Open-core / commercial boundary

The Go pipeline is Apache-2.0. Any architecture that embeds Go code into an iOS or Android app must grapple with how that code is licensed, distributed, and updated. gomobile specifically produces compiled framework bundles (.framework / .aar), not source-level Go — so end-users do not receive Go source under Apache-2.0. But the architecture must not inadvertently couple the open-core pipeline's release cadence to the App Store submission cycle.


2. Decision

Mobile is a thin client. Audio capture and pipeline processing run on a paired desktop (LAN) or hosted Vox server (cloud). The mobile app handles capture initiation, streaming audio to the server, and presenting responses.

The mobile apps (iOS and Android) are native shells — Swift on iOS, Kotlin on Android — that implement the platform-specific audio capture stack and a lightweight streaming transport (WebSocket or gRPC-over-HTTP/2). The Go pipeline on the server side is unchanged; it ingests the audio stream through a capture/v1 network adapter and processes it normally. The mobile app receives structured responses via the same envelope wire format the desktop clients already use.

Why this one

The competing paths (gomobile, full native reimplementation) both require the pipeline to run on the device. Running the pipeline on the device on iOS collides directly with Apple's background execution model: the Go runtime has no native integration with AVAudioSession's background-mode lifecycle, and any gomobile-based solution requires non-trivial native bridging to keep the process alive. Full native reimplementation avoids the gomobile runtime problem but introduces two new pipeline implementations that must track the Go implementation indefinitely — the highest ongoing engineering cost of the three options.

The server model sidesteps both problems. The iOS constraint on background audio becomes a well-understood foreground-capture pattern: the app captures while foregrounded, streams to the server, and the server processes. For the paired-desktop case (the P0 deployment), the mobile device needs only a LAN connection — no cloud infrastructure required to ship the first version. For the hosted case (the P1 deployment), the server is behind the subscription paywall, which maps cleanly onto the existing commercial boundary: mobile + hosted processing = paid tier.

The tradeoff is a real dependency: the mobile client is useless without a reachable server. That is an acceptable product constraint for the target user (a knowledge worker who already runs Vox on a desktop and wants to extend reach to mobile). It is not an acceptable constraint for a standalone recorder, which is a different product. Vox mobile is not a standalone recorder.


3. Considered options

Option A — gomobile (golang.org/x/mobile)

Approach

Compile the Go capture/segment/ASR/router/sink chain into platform-native framework bundles using gomobile bind. iOS receives a .framework via XCFramework packaging; Android receives an .aar. Thin native UI shells call into the compiled Go framework for all business logic. Audio is captured via platform APIs and handed to the Go runtime via JNI (Android) or Swift/ObjC bridging (iOS).

Pros

  • Reuses the Go pipeline directly. Bugs fixed in Go are fixed on mobile without a reimplementation lag.
  • Wire format, router logic, and sink adapters are byte-identical with the desktop — no drift surface.
  • One codebase for all pipeline logic across all platforms.
  • The open-core model is preserved cleanly: the Go code is the same Apache-2.0 source regardless of platform.

Cons

  • Background execution on iOS is the blocker. The Go runtime does not integrate with AVAudioSession's background lifecycle. Keeping a Go-based capture loop alive in the background requires a native wrapper that sustains an audio session, which is exactly what the native Swift path would do anyway. The gomobile layer adds complexity without solving the constraint.
  • gomobile does not support the full Go standard library. net, os/exec, and several syscall-level packages are restricted or absent on iOS. Vox's pipeline is not pathologically dependent on these today, but any future extension that uses them requires a mobile-specific fork of the path.
  • No Cgo on iOS. The Go pipeline uses malgo (a Cgo wrapper around miniaudio) for audio device access. That path does not exist on iOS under gomobile. The mobile capture layer must be written in Swift/ObjC regardless, which reintroduces native code and defeats a portion of the reuse argument.
  • Build pipeline complexity. XCFramework packaging, gomobile bind build tags, and the bridging layer between Go and Swift add substantial CI complexity. Build times lengthen; debugging across the Go/Swift boundary is harder than debugging pure Swift.
  • App Store binary size. A gomobile-compiled framework for Whisper + the full pipeline will be large. App Store thinning helps but does not eliminate the concern.
  • Apple Silicon caveats. gomobile's iOS simulator support on Apple Silicon has historically had rough edges (arm64 simulator vs arm64 device ABI collision). These are solvable but add ramp time.

Cost estimate

  • MVP: 10–14 engineering weeks. Includes: native audio capture bridge (Swift to Go), background execution wrapper, gomobile CI, TestFlight submission.
  • Android MVP: 6–8 additional weeks. JNI bridge is more straightforward than the iOS path; fewer background-execution constraints.

Ongoing maintenance burden

Medium-high. Every Go pipeline change must be validated on mobile. The gomobile build pipeline must be kept current with Go toolchain releases. iOS SDK changes that affect AVAudioSession require native-layer changes that interact with the Go boundary.

Distribution implications

  • iOS: App Store submission with audio Background Mode entitlement. Review scrutiny is higher for continuously-capturing apps. Binary size may trigger App Store cellular download limits (~200 MB OTA cap).
  • Android: Play Store standard submission. AAR bundles are well-understood.

Option B — Native (Swift + Kotlin)

Approach

Reimplement the pipeline — or the relevant subset of it — in Swift (iOS) and Kotlin (Android) independently. The Go codebase remains the canonical implementation; mobile implementations are parallel. Wire format contracts (the envelope schema, sink interface definitions) are shared at the spec level, not at the code level.

Pros

  • Best possible platform integration. AVAudioSession, SFSpeechRecognizer, AVSpeechSynthesizer, AVAudioEngine on iOS; AudioRecord, SpeechRecognizer, MediaPlayer on Android. Platform teams at companies like Otter.ai and Notion ship with these stacks; the ecosystem is mature.
  • No background execution compromise. Native apps can hold audio sessions in the background using documented, App-Store-approved patterns that the OS does not penalize.
  • Smallest binary. No Go runtime embedded. iOS apps in the 20–40 MB range are achievable.
  • Full access to on-device ML hardware. CoreML (iOS) and NNAPI (Android) are accessible natively; local ASR (Whisper via CoreML) is well-documented.
  • No cross-language debugging overhead. Crash traces, Instruments profiles, and Android Studio profilers work without bridging.

Cons

  • Two new pipeline implementations to maintain forever. Every feature added to the Go pipeline must be tracked and replicated in Swift and Kotlin. Every bug fix must be applied three times. Drift is not a future risk; it is a certainty if the team does not dedicate resources to it.
  • Highest upfront engineering cost. Three full ASR integrations (Go, Swift, Kotlin) plus three router implementations is substantial duplication.
  • Wire-format discipline is the only enforcement mechanism. Without shared code, only the envelope schema and integration tests enforce consistency between platforms. Schema discipline must be excellent from day one.
  • Staffing requirement. Vox today is a Go project. Adding Swift and Kotlin as maintained languages requires either hiring or significant ramp-up.
  • Open-core boundary becomes complicated. Do the Swift and Kotlin implementations live in the open-core repo? In the enterprise repo? In new repos? Each answer has tradeoffs for licensing, contribution model, and build complexity.

Cost estimate

  • iOS MVP: 16–22 engineering weeks. ASR integration (CoreML Whisper or SFSpeechRecognizer), VAD, router subset, sink adapters, background execution, TestFlight submission.
  • Android MVP: 12–16 additional weeks. AudioRecord + VAD + Whisper.cpp JNI or Sherpa-ONNX, foreground service, Play Store submission.
  • Total first-release cost: 28–38 engineering weeks across two platforms.

Ongoing maintenance burden

High. Three codebases track the same pipeline logic. Feature parity is a perpetual coordination cost. Regression in one platform does not automatically surface in CI for the others unless cross-platform integration tests are maintained.

Distribution implications

  • iOS: App Store, clean entitlement profile, smallest binary, no unusual review triggers. Best-case path from a distribution standpoint.
  • Android: Play Store standard. Foreground service with notification is well-understood. No unusual review concerns.

Option C — Server (thin mobile client) [CHOSEN]

Approach

The mobile app captures audio from the device microphone using platform-native APIs and streams it — in near-real-time — to a server running the Vox Go pipeline. Two server deployment modes:

  1. Paired desktop (P0): The user's existing Vox install on their macOS / Windows / Linux machine acts as the server. Mobile and desktop are on the same LAN. Service discovery uses mDNS (_vox._tcp.local) so no manual configuration is required. This is zero additional infrastructure: users who already have Vox on their laptop get mobile access immediately.

  2. Hosted server (P1): Vox cloud (behind the subscription paywall) runs the pipeline. Mobile streams audio over HTTPS/WebSocket to the hosted endpoint. No desktop required; works on cellular. This is the Enterprise / Individual subscription value proposition for mobile.

The mobile app implements: - Platform audio capture (AVAudioSession / AVAudioEngine on iOS; AudioRecord + foreground service on Android) - Opus encoding (reduces bandwidth; Opus is well-supported on both platforms) - WebSocket or gRPC-over-HTTP/2 streaming to the server - Response rendering (transcript display, LLM response display, TTS playback if the server sends synthesized audio back) - Server discovery (mDNS for paired desktop; subscription credentials for hosted)

The server implements a new capture/v1 network adapter that accepts inbound audio streams and presents them to the existing pipeline as a network source kind.

Pros

  • No pipeline reimplementation. The Go pipeline is unmodified. New features in the desktop pipeline are immediately available on mobile — zero replication cost.
  • iOS background constraint is a non-issue for the primary use case. The app does not need to process audio in the background; it only needs to stream. For always-on use cases, the user leaves the app foregrounded (or uses the desktop client). The foreground-capture scenario (mobile as a handheld mic during a meeting) is the primary mobile use case and requires no background entitlement.
  • Paired-desktop deployment is zero new infrastructure. First-time mobile users pair with their laptop. The laptop already runs Vox. No cloud spend, no server deployment, no subscription required for the basic case.
  • Subscription boundary is clean. Hosted server access is a natural subscription gate. The mobile app is open-core (or enterprise-only, TBD); the hosted server is subscription. The commercial model is explicit and observable.
  • Mobile binary is small. No ASR models, no pipeline code, no Go runtime. Codec + transport + UI. iOS app in the 10–20 MB range.
  • Fastest path to TestFlight. Implementing audio capture + Opus encoding + WebSocket streaming is 4–6 weeks of native iOS work. That is the smallest useful mobile increment: point phone at server, speak, see transcript on phone. Android parity adds 3–4 weeks.

Cons

  • Hard dependency on a reachable server. The mobile app is inoperable without a paired desktop or hosted server. Users who want truly offline, standalone mobile capture are not the target user — but this constraint must be communicated clearly in the product.
  • LAN dependency for the paired-desktop mode. Pairing via mDNS requires the phone and laptop to be on the same network. Coffee shop WiFi, guest networks, and corporate networks that isolate clients all break LAN pairing. The hosted server is the fallback, but it requires a subscription.
  • New server-side component: network capture adapter. The capture/v1 contract must be extended with a network source kind. This is new engineering, but it is Go and it is scoped: one new adapter behind the existing extension point.
  • Audio latency. Encoding, transmission, processing, and response transmission adds latency the desktop path does not have. Opus at 20 ms frames + WAN RTT means 200–500 ms of additional latency in the hosted case. For transcription use cases this is acceptable. For real-time TTS response (voice-in, voice-out), it is noticeable. VAD on the mobile client can gate transmission to reduce unnecessary round-trips.
  • Transport security is load-bearing. Audio data transits the network. TLS is mandatory; the server must validate client identity (subscription token or paired-device secret). A weak implementation here is a privacy liability.

Cost estimate

  • iOS MVP (paired-desktop): 6–8 engineering weeks. Audio capture, Opus encoding, WebSocket transport, mDNS discovery, basic transcript display.
  • Server-side network adapter: 2–3 engineering weeks. network source kind capture adapter, WebSocket accept loop, TLS, paired-device handshake.
  • Android MVP (paired-desktop): 4–5 engineering weeks. Foreground service, AudioRecord, Opus, WebSocket.
  • Total first-release cost (P0, both platforms + server): 12–16 engineering weeks.
  • Hosted server (P1): additional 4–6 weeks for cloud deployment, subscription token validation, and multi-tenant session isolation.

Ongoing maintenance burden

Low on the pipeline side (zero replication cost). Moderate on the mobile side: two native apps (Swift, Kotlin) that each implement transport + capture + UI, but no domain logic. Platform OS updates that affect audio APIs require mobile app updates, which is normal. Server-side changes to the network source adapter are isolated to one Go file.

Distribution implications

  • iOS: App Store. No audio Background Mode entitlement needed for the primary use case (foreground capture while streaming). Standard submission profile. No unusual binary size. Lowest App Store risk of the three options.
  • Android: Play Store. Foreground service with FOREGROUND_SERVICE_TYPE_MICROPHONE is required for capture; this is the documented, approved pattern. Standard submission profile.

4. Consequences

Unblocked by this ADR

The following sub-beads in the boq epic are now unblocked:

Bead Description Now unblocked
blackrim-vox-3cm iOS MVP Yes — thin client + WebSocket spec is the build target
blackrim-vox-3n9 Android MVP Yes — same architecture, Kotlin client
blackrim-vox-638 iOS polish Depends on 3cm landing
blackrim-vox-tqk Android polish Depends on 3n9 landing
blackrim-vox-8di iOS App Store submission Depends on 638
blackrim-vox-l31 RevenueCat billing Depends on hosted server (P1)
blackrim-vox-jyt Cross-device sync Architecture-compatible; sync is at the envelope layer

Blocked or deferred

  • Offline mobile operation (capture + process with no server) is not supported by this architecture. If that becomes a requirement, it would require revisiting this ADR — most likely toward a gomobile path for on-device ASR only (a hybrid not considered here).
  • A capture/v1 network source adapter must be designed and landed in the open core before 3cm / 3n9 can integrate end-to-end. This adapter is a dependency of both mobile MVP beads.

New infrastructure required

  1. network source capture adapter — new capture/v1 implementation in Go. Accepts inbound audio streams (WebSocket), presents them as a CaptureSource to the pipeline. Includes: TLS termination, paired-device HMAC handshake, Opus decode, jitter buffer.
  2. mDNS service advertisement — Vox desktop advertises _vox._tcp.local when server mode is enabled. Mobile discovers via standard mDNS browse.
  3. Hosted server deployment (P1) — containerized Vox binary behind a load balancer, with subscription token validation middleware. Multi-tenant session isolation (one audio stream per authenticated session, no cross-leakage).
  4. Opus codec dependency — both mobile apps encode in Opus; the server decodes. The Go pipeline gains an Opus decode step at the network source adapter layer. gopus or a CGo Opus binding is the likely choice; evaluate against the no-Cgo policy for mobile builds (no impact since Opus runs server-side).

Contract surfaces needing versioning

  • capture/v1 gains a new source kind: network. The source-kind enum must be versioned at the envelope level so downstream stages can inspect it.
  • The mobile-to-server streaming protocol (handshake, frame format, session termination) must be documented as a versioned contract before 3cm lands. It is not a public extension surface, but it must be stable enough that iOS and Android clients do not need to be updated every time the server changes.
  • The paired-device secret exchange (the mechanism by which a phone is authorized to stream to a specific desktop instance) requires a documented protocol. It must be resistant to passive eavesdropping on LAN (use a short-lived TOTP-style token displayed on the desktop, entered or scanned on the mobile).

Open-core / commercial boundary

The decision clarifies the boundary:

  • The network source capture adapter lands in the open core (Apache-2.0). Any Vox desktop instance can act as a server for any audio source that can stream to it — a DIY home assistant, a web browser, or a mobile app not built by this team.
  • The mobile apps (iOS, Android) are Enterprise / Individual tier. The pairing protocol and transport are documented (open), but the apps themselves are not open-sourced in v1. This may change.
  • The hosted server is Enterprise tier. Subscription token issuance and multi-tenant session management are enterprise-only surfaces.

5. Open questions

  1. Foreground-only vs background capture on iOS. The primary use case is foreground capture (phone screen on while streaming). For users who want to pocket the phone and continue capturing, a background audio session is required. The App Store path for this is navigable but requires the audio Background Mode entitlement and a clear user-facing justification. This decision is deferred to the iOS-specific ADR (a follow-up per platform).

  2. Streaming protocol: WebSocket vs gRPC. Both are viable. WebSocket is simpler and has first-class support in iOS URLSession and Android OkHttp. gRPC-over-HTTP/2 provides better flow control and a typed IDL. The choice should be made before 3cm starts — file a dedicated bead for the protocol decision.

  3. On-device VAD before transmission. Running a lightweight VAD on the mobile client to gate transmission (only send frames when speech is detected) reduces bandwidth and server load significantly. This is not required for MVP but is a near-term optimization. Apple's AVAudioEngine tap with a simple energy threshold is a low-cost starting point; a CoreML VAD (Silero port) is the upgrade path.

  4. Envelope wire format version for network sources. The existing envelope schema was designed for local pipeline use. Transmitting envelopes from server back to mobile client (transcript lines, LLM responses) must be validated against the mobile rendering requirements. A schema review bead should be filed before 3cm starts.

  5. Cross-device sync scope (jyt). This bead is architecture-compatible but its scope — what gets synced, in what format, with what conflict model — is undefined. The hosted server is a natural sync hub; the paired-desktop mode has no natural sync coordinator when the laptop is offline. Scope jyt carefully before starting it.

  6. RevenueCat integration scope (l31). RevenueCat manages subscription entitlements on both iOS and Android. The hosted server must validate entitlements (the mobile app passes a RevenueCat entitlement token; the server validates against the RevenueCat API or a cached grant). The exact validation flow needs a dedicated design doc.


6. References

  • gomobile official documentation: https://github.com/golang/go/wiki/Mobile
  • golang.org/x/mobile package: https://pkg.go.dev/golang.org/x/mobile
  • Apple AVAudioSession background modes: https://developer.apple.com/documentation/avfaudio/avaudiosession/1616503-setcategory
  • Apple App Store Review Guideline 2.5.4 (background processes): https://developer.apple.com/app-store/review/guidelines/#software-requirements
  • Android foreground service (microphone type): https://developer.android.com/guide/components/foreground-services#types
  • Opus codec: https://opus-codec.org/
  • Whisper.cpp on iOS (CoreML): https://github.com/ggml-org/whisper.cpp/tree/master/examples/whisper.objc
  • Vox extension points: docs/extension-points.md
  • Vox capture contract: docs/extensions/capture-v1.md
  • Vox backlog and mobile epic: docs/development/backlog.md