# Pax — Research Report: Production Voice AI Stack ("Jarvis-grade")

**To:** Larry / Jimmie
**From:** Pax
**Date:** 2026-05-22
**Re:** Skill profile + tech-stack inventory for a senior voice/multimodal engineer in 2026

**Scope note:** This is the greenfield gold-standard view of what a senior voice AI engineer would specify if asked to build a Jarvis-grade interface from scratch. Atlas is producing the parallel bolt-on view against your existing infrastructure. Neither report is a build plan — this is the skill + stack inventory plus the tradeoffs that matter.

---

## Section 1 — Skill profile of a senior voice/multimodal engineer

### Hard skills (must-have)

- **Real-time audio pipelines** — chunked streaming, VAD (voice activity detection), AEC (acoustic echo cancellation), AGC (auto gain control), noise suppression (RNNoise, krisp.ai)
- **STT integration at the streaming API level** — not file-upload Whisper. Partial transcripts, end-of-utterance detection, confidence scoring, language hints
- **TTS at the streaming API level** — first-audio-chunk latency, SSML, prosody control, voice cloning (XTTS, ElevenLabs Pro Voice), audio output buffering
- **Wake-word / keyword spotting** — Porcupine, OpenWakeWord, Snowboy fork; training custom wake words with synthetic data; false-positive vs. false-negative tuning
- **Low-latency systems engineering** — WebSocket vs. WebRTC, UDP audio transport, jitter buffers, network adaptation
- **LLM orchestration** — function calling, intent classification, multi-agent routing, streaming-aware prompting (don't wait for full LLM response before starting TTS)
- **Native desktop frameworks** — Tauri (Rust + WebView), Electron (Node + Chromium), or Avalonia/WinUI 3 for true native; system tray, global hotkeys, always-on-top overlays
- **Audio device handling** — Windows WASAPI / macOS Core Audio / Linux PulseAudio or PipeWire device enumeration, default device tracking, headphone vs. speaker routing
- **Conversational state machines** — turn-taking, barge-in (user interrupts AI mid-speech), backchannel ("mhm", "uh-huh"), repair (clarification requests)
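
As a rough illustration of the turn-taking and barge-in item above, here is a minimal state-machine sketch. The state names, event names, and the `TurnManager` class are all hypothetical. A real implementation would stop TTS playback and flush audio buffers where this sketch only increments a counter.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()


# (state, event) -> next state. Event names are illustrative, not from any library.
TRANSITIONS = {
    (State.IDLE, "wake"): State.LISTENING,
    (State.LISTENING, "end_of_utterance"): State.THINKING,
    (State.THINKING, "first_audio_ready"): State.SPEAKING,
    (State.SPEAKING, "playback_done"): State.IDLE,
    # Barge-in: the user starts talking while the assistant is speaking
    # or still thinking, so the response is dropped and we listen again.
    (State.SPEAKING, "user_speech_start"): State.LISTENING,
    (State.THINKING, "user_speech_start"): State.LISTENING,
}


class TurnManager:
    def __init__(self) -> None:
        self.state = State.IDLE
        self.cancelled_responses = 0

    def handle(self, event: str) -> State:
        next_state = TRANSITIONS.get((self.state, event))
        if next_state is None:
            return self.state  # ignore events that make no sense in this state
        if event == "user_speech_start":
            self.cancelled_responses += 1  # stand-in for stopping TTS playback
        self.state = next_state
        return self.state
```

Keeping the transitions in one table makes it easy to review which events are allowed to interrupt which states, which is usually where "it cuts me off" bugs hide.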

### Domain knowledge

- **Latency budgets** — knows that "feels instant" is sub-200ms for backchannels, sub-500ms for turn responses, sub-1500ms for substantive answers
- **Voice UX patterns** — "earcons" (notification sounds), confirmation tones, ambient state indicators, when NOT to use voice (sensitive info, public spaces)
- **The Uncanny Valley of voice** — when high-quality TTS sounds *worse* than obviously-synthetic, and how to dodge it
- **Assistant memory architectures** — episodic vs. semantic vs. procedural; how OS-level personal assistants (Alexa, Google Assistant, Siri) layered context
- **Privacy primitives** — on-device vs. cloud tradeoff, what data leaves the machine, opt-in recording, transcript retention, "delete my voice data" flows
- **Multimodality** — text + audio + vision pipelines, when to fall back across modes

### Soft skills

- **UX taste** — voice is unforgiving of bad UX. They iterate on their own assistant daily and notice "off" feelings most engineers miss
- **Latency obsession** — will profile every link in the chain and shave ms even when "good enough" works, because cumulative ms = uncanny
- **Failure-mode storytelling** — can describe how it will break (cocktail party, baby crying, music in background, accent drift) and how they'll degrade gracefully
- **Cross-functional comfort** — works equally well with ML researchers, frontend devs, audio DSP engineers, and product
- **A demo-able mindset** — ships something talkable in week 1; perfectionists never finish voice projects

### Common day-to-day tasks

- Read latency traces from the audio capture → STT → LLM → TTS chain, identify the longest pole
- Tune wake-word false-positive rate against false-negative rate on their actual environment
- A/B test two TTS voices for naturalness, latency, and cost
- Debug "the assistant cuts me off when I pause" or "the assistant doesn't realize I'm done talking"
- Write the prompt that routes intents to sub-agents (and the eval set that proves the routing is right)
- Build the "what happens when the wake word fires but the user said nothing useful" fallback
- Monitor cost per conversation (STT minutes + LLM tokens + TTS characters) and optimize hot paths
- Triage user feedback like "it sounds robotic" or "it talks too fast" into specific knobs (prosody, speaking rate, voice ID)
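
The cost-per-conversation task above is simple arithmetic, sketched below. The STT and TTS rates are the figures quoted in this report's stack tables. The LLM rate is a placeholder assumption, and the `Rates` and `conversation_cost` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    stt_per_min: float = 0.0043       # Deepgram Nova-3 figure quoted in the STT table
    tts_per_char: float = 0.0002      # midpoint of the ElevenLabs range in the TTS table
    llm_per_1k_tokens: float = 0.003  # ASSUMPTION: placeholder, not a quoted price


def conversation_cost(stt_seconds: float, llm_tokens: int, tts_chars: int,
                      rates: Rates = Rates()) -> dict:
    """Return the per-layer cost in dollars plus the total."""
    parts = {
        "stt": stt_seconds / 60 * rates.stt_per_min,
        "llm": llm_tokens / 1000 * rates.llm_per_1k_tokens,
        "tts": tts_chars * rates.tts_per_char,
    }
    parts["total"] = sum(parts.values())
    return parts


def hot_path(parts: dict) -> str:
    """Name the layer that dominates the bill for this conversation."""
    return max(("stt", "llm", "tts"), key=parts.get)
```

With these rates, one minute of speech, 1,000 LLM tokens, and 400 spoken characters is dominated by TTS, which is why TTS character count is often the first thing to trim.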

### Decision-making patterns

- **Latency > accuracy** within reason. A 92%-accurate STT at 300ms beats a 96%-accurate one at 1200ms because users repeat themselves when the assistant feels slow, which erases the 4-point accuracy gain
- **Streaming everywhere** — never let any link in the chain wait for the previous one to fully finish
- **Local for hot path, cloud for cold path** — wake word and partial transcripts locally; full LLM routing and TTS often in the cloud (until local catches up, which it is steadily doing)
- **Voice-first UX, but always provide a text fallback** — accessibility, noisy environments, sensitive content
- **Build the worst-case demo first** — noisy office, accented speech, partial sentences. If it works there, the easy cases take care of themselves

### Quality standards

- End-to-end p50 latency from speech-end to first TTS audio: **<800ms** for "feels instant," **<1500ms** for "feels Jarvis-y," **>2500ms** kills the illusion
- Wake-word false-accept rate: **<1 per 24 hours of ambient room audio** while maintaining >95% true-accept on the user's voice
- STT word error rate (WER): **<8% on conversational English**, **<15% on accented or technical vocab**
- TTS naturalness: in a blind A/B test, users can't reliably distinguish it from a human voice in clips **shorter than 3 seconds**
- Barge-in: AI stops speaking within **150ms** of user starting to talk
- Privacy: zero audio retained on disk unless user explicitly opted in; opt-out is one click
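
The latency thresholds above can be turned into a simple check over measured samples. This is a sketch: the function names are hypothetical, and the label for the 1500-2500ms gap, which the thresholds leave unnamed, is an assumption.

```python
from statistics import median


def latency_band(p50_ms: float) -> str:
    """Map an end-to-end p50 latency onto the bands listed above."""
    if p50_ms < 800:
        return "feels instant"
    if p50_ms < 1500:
        return "feels Jarvis-y"
    if p50_ms <= 2500:
        return "noticeably slow"  # ASSUMPTION: the report does not name this band
    return "illusion broken"


def summarize(samples_ms: list) -> tuple:
    """Return (p50, band) for speech-end to first-TTS-audio samples."""
    p50 = median(samples_ms)
    return p50, latency_band(p50)
```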

---

## Section 2 — Stack inventory by layer

### Wake word

| Option | License | Local/Cloud | Custom wake words | Latency | Tradeoff |
|---|---|---|---|---|---|
| **Picovoice Porcupine** | Commercial (free tier) | Local | Yes — trained on Picovoice console | <50ms | Industry standard. Cheap. Excellent accuracy. Custom wake words trained via web console. The defensible default for production. |
| **OpenWakeWord** | Apache 2.0 | Local | Yes — train your own with synthetic data | <100ms | Fully open. Used by Home Assistant. More work to train but no vendor lock. Accuracy ~85-90% of Porcupine. |
| **Snowboy (community forks)** | Various | Local | Yes | <80ms | Original is dead (KITT.AI shut down 2020). Community forks exist but unmaintained risk. |
| **DIY KWS** (Keras + TFLite) | Open | Local | Fully custom | Variable | Only if you have an ML engineer dedicated to it. Reinvents the wheel. |
| **No wake word** (push-to-talk) | — | — | — | 0ms | Skips the hardest engineering. Trades convenience for reliability. Many real-world deployments do this. |

**Senior engineer's pick:** Porcupine for production, OpenWakeWord if open-source ethos matters more than the last 10-15% of accuracy. **Skip wake word entirely for v1** unless you have a strong reason; wake word is where most hobby projects die.
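
If a wake word is kept, the quality targets from Section 1 (fewer than 1 false accept per 24 hours of ambient audio, more than 95% true accepts) can be checked against logged test runs. The sketch below assumes you have already counted the events. The function name and the shape of the result are hypothetical.

```python
def wake_word_metrics(false_accepts: int, ambient_hours: float,
                      true_accepts: int, attempts: int) -> dict:
    """Normalize raw counts into the two numbers the targets are stated in."""
    if ambient_hours <= 0 or attempts <= 0:
        raise ValueError("need a positive amount of ambient audio and attempts")
    fa_per_24h = false_accepts / ambient_hours * 24
    ta_rate = true_accepts / attempts
    return {
        "false_accepts_per_24h": fa_per_24h,
        "true_accept_rate": ta_rate,
        # Targets from the quality standards: <1 per 24h and >95%.
        "meets_target": fa_per_24h < 1 and ta_rate > 0.95,
    }
```

For example, 1 false fire across 48 hours of room audio and 97 detections out of 100 attempts meets both targets, while 3 false fires in 24 hours fails regardless of the true-accept rate.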

### STT (Speech-to-Text)

| Option | License | Local/Cloud | Streaming | Latency (p50) | WER | Tradeoff |
|---|---|---|---|---|---|---|
| **Deepgram Nova-3** | Commercial | Cloud | Yes | 250-400ms | ~7% | Best-in-class streaming latency in 2026. Polished API. ~$0.0043/min. Production default for latency-critical apps. |
| **AssemblyAI Universal-2** | Commercial | Cloud | Yes | 350-550ms | ~6.5% | Slightly higher accuracy than Deepgram, slightly slower. Strong on technical vocab. ~$0.005/min. |
| **OpenAI gpt-4o-transcribe / Whisper API** | Commercial | Cloud | Limited streaming | 500-1200ms | ~6% | Excellent accuracy, slower. gpt-4o-transcribe (2025) added streaming. ~$0.006/min. |
| **Whisper.cpp (large-v3-turbo)** | MIT | Local | Yes (chunked) | 200-500ms on GPU, 800-1500ms CPU | ~7% | Free. Local. Decent latency with a GPU. Needs careful chunking + VAD to feel "streaming." |
| **NVIDIA Parakeet / Canary** | NVIDIA OSS | Local | Yes | <200ms on GPU | ~5-6% | 2025 release. Best local accuracy + latency if you have NVIDIA hardware. Setup overhead. |
| **Distil-Whisper** | MIT | Local | Yes | ~2x faster than Whisper-base | ~9% | Lightweight local option. Good for low-spec machines. |

**Senior engineer's pick:** Deepgram Nova-3 for production cloud, NVIDIA Parakeet or Whisper.cpp large-v3-turbo for local. The local options have closed the gap dramatically in 2025-2026 — local is now a defensible choice for privacy-conscious deployments.
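
The WER column above is the standard word-level edit distance divided by the reference length. A minimal sketch for comparing vendors on your own recordings follows. It lowercases and splits on whitespace only, which is a simplifying assumption, since production scoring usually also normalizes punctuation and numerals.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the previous ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```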

### TTS (Text-to-Speech)

| Option | License | Local/Cloud | Streaming first-audio | Voice cloning | Naturalness | Tradeoff |
|---|---|---|---|---|---|---|
| **ElevenLabs Flash v2.5** | Commercial | Cloud | <500ms (often <300ms) | Pro Voice (yes) | Industry-leading | The premium option. ~$0.0001-0.0003 per char depending on tier. Worth the money if voice is the product. |
| **OpenAI gpt-4o-mini-tts** | Commercial | Cloud | ~600ms | No | High, somewhat robotic | Cheaper than ElevenLabs. "Instructable" voice (you can tell it tone in the prompt). ~$0.6/1M chars. |
| **Sesame CSM-1B (2025)** | Open-source | Local (GPU) | <300ms with streaming | Yes | Excellent — closest open model to ElevenLabs | Released March 2025. Game-changer for local TTS. Needs a beefy GPU. |
| **Kokoro-82M** | Apache 2.0 | Local | <200ms on CPU | Limited | Surprisingly good for size | Tiny model (82M params). Real-time on CPU. The cheap-and-cheerful default for local. |
| **Piper** | MIT | Local | <100ms | No (fixed voices) | Good but obviously TTS | The workhorse for self-hosted. Home Assistant default. Fast, reliable, sounds like TTS. |
| **XTTS-v2 (Coqui)** | CPML (non-commercial) | Local | ~400ms | Yes (6-second sample) | Excellent | Best local voice cloning quality. License blocks commercial use without negotiation. |
| **Hume EVI 3 (2025)** | Commercial | Cloud | ~400ms | Yes | Emotionally expressive | Specializes in emotion-aware TTS. Premium positioning. |

**Senior engineer's pick:** ElevenLabs Flash v2.5 for cloud (production default). Sesame CSM-1B if you want near-cloud quality locally and have a GPU. Kokoro-82M if CPU-only is a constraint. Avoid Piper for an assistant you'll use daily; it always sounds like TTS.

### Orchestration layer (utterance → multi-agent routing)

This is the part most "how to build Jarvis" tutorials underestimate. Routing a free-form utterance to the right specialist (out of 17, in your case) is a real ML/prompt engineering problem.

| Pattern | How it works | Latency | Tradeoff |
|---|---|---|---|
| **Single classifier call** | One LLM call with the roster + descriptions → returns `{agentId, refined_prompt}` | +400-800ms | Simple. Works well for 5-20 agents. Add streaming to start TTS preamble while classifier runs. |
| **Embedding-based router** | Embed utterance, k-NN against pre-embedded agent descriptions | +50-100ms | Fast. No LLM call. Less smart at edge cases ("I want to schedule something AND log expenses"). |
| **Hybrid: embedding pre-filter + LLM tiebreaker** | k-NN to top-3, then LLM picks | +200-400ms | Best practical combination. Fast common path, smart fallback. |
| **Tool-calling with all 17 agents as tools** | LLM with function calling, each agent is a tool | +500-1000ms | Native to modern LLMs. Lets LLM call multiple agents. Risk of over-fanning. |
| **Hierarchical routing** | Top-level routes to "finance / personal / external" buckets, then to specialist | +600-1200ms | Scales better at 50+ agents. Overkill at 17. |
| **Conversation-aware routing** | Routing considers last N turns ("we were just talking about X") | +500-900ms | Necessary for multi-turn. Adds context-window cost. |

**Senior engineer's pick at 17 agents:** Hybrid embedding pre-filter + LLM tiebreaker. Pre-compute embeddings for every agent's `description.md` and a curated list of "things this agent handles." Embed user utterance, top-3 candidates by cosine, LLM picks (or asks the user if confidence is low).
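
A minimal sketch of the hybrid pattern, under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the four-agent roster and its descriptions are invented for illustration, and `llm_pick` is a caller-supplied function standing in for the LLM tiebreaker. The `margin` value is a tunable guess, not a recommendation.

```python
import math
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy bag-of-words vector. ASSUMPTION: stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical roster: names and descriptions are illustrative only.
AGENTS = {
    "finance": "log expenses invoices budget spending money",
    "calendar": "schedule meeting reminder appointment calendar",
    "research": "research report compare find sources summarize",
    "email": "email inbox reply draft send message",
}
# Pre-computed once at startup, as the paragraph above suggests.
AGENT_VECS = {name: embed(desc) for name, desc in AGENTS.items()}


def route(utterance: str, llm_pick: Callable[[str, list], str],
          k: int = 3, margin: float = 0.2) -> str:
    query = embed(utterance)
    ranked = sorted(((cosine(query, vec), name)
                     for name, vec in AGENT_VECS.items()), reverse=True)
    best_score, best_name = ranked[0]
    # Fast path: a clear winner skips the LLM call entirely.
    if best_score > 0 and best_score - ranked[1][0] >= margin:
        return best_name
    # Slow path: hand the top-k candidates to the LLM tiebreaker.
    return llm_pick(utterance, [name for _, name in ranked[:k]])
```

A single-intent utterance such as "schedule a meeting for tuesday" resolves on the fast path, while a mixed one such as "schedule something and log expenses" falls through to the tiebreaker, which is also where an "ask the user" branch would live.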

**The non-obvious bit:** start the TTS preamble ("Sure, let me ask Tally about that…") *while* the orchestration runs. Hiding latency by talking first is the single biggest perceptual-latency win in voice assistant design.
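
One way to sketch that overlap is with `asyncio`: start the slow routing work as a task, then speak the preamble while it runs. Both `speak` and `route_and_answer` are hypothetical stand-ins that only sleep, and the durations are arbitrary illustration values.

```python
import asyncio
import time


async def speak(text: str, duration_s: float, log: list) -> None:
    """Stand-in for streaming TTS playback (hypothetical helper)."""
    log.append(("speak_start", text))
    await asyncio.sleep(duration_s)


async def route_and_answer(utterance: str, delay_s: float) -> str:
    """Stand-in for routing plus the agent's LLM call (hypothetical helper)."""
    await asyncio.sleep(delay_s)
    return f"Answer to: {utterance}"


async def respond(utterance: str, log: list) -> float:
    start = time.monotonic()
    # Kick off the slow work first, then talk over it.
    answer_task = asyncio.create_task(route_and_answer(utterance, 0.3))
    await speak("Sure, let me check.", 0.25, log)
    answer = await answer_task  # mostly finished by the time the preamble ends
    await speak(answer, 0.0, log)
    return time.monotonic() - start
```

Run serially, the two stand-ins would take about 0.55s. Overlapped, the total is close to the slower of the two, and the user hears audio almost immediately.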

### HUD (visual layer)

| Option | Stack | Pros | Cons | Native feel |
|---|---|---|---|---|
| **Tauri** (Rust + WebView) | Rust backend, WebView2 on Windows | Small footprint (~3-10MB), fast startup, secure-by-default | Steeper dev curve, smaller ecosystem | Excellent — uses OS native WebView |
| **Electron** | Node + bundled Chromium | Largest ecosystem, easiest to hire for, fastest dev iteration | 100-200MB binary, RAM hungry, slower startup | Good — universal, slightly clunky |
| **Web-based overlay** (browser tab) | Just a Next.js page | Zero install, deployable to localhost or anywhere | Not always-on-top, browser chrome visible, no global hotkey | Poor — feels like a web app, not an assistant |
| **WinUI 3 / WPF** (Windows native) | C# / XAML | True native Windows look, best perf | Windows-only, smaller talent pool, harder to iterate | Best on Windows specifically |
| **Avalonia** | C# / XAML cross-platform | Native cross-platform, good perf | Less common, smaller community | Very good |
| **Flutter desktop** | Dart | Single codebase mobile + desktop, great animation | Heavier, less native feel | Good |
| **Native menu bar / system tray app + popup HUD** | Any of above with tray integration | Always available, doesn't take up screen | Smaller surface for the HUD itself | Excellent (least intrusive) |

**Senior engineer's pick for Windows-first Jarvis HUD:** Tauri for new builds (best size/perf tradeoff in 2026), or Electron if the team needs maximum velocity and doesn't care about RAM. **Always-on-top, frameless, system-tray-anchored, global-hotkey-summoned overlay** — not a full window.

For an existing Next.js app like Atlas: you can ship a `/hud` route as web-based v1 (no install) and wrap it in Tauri later for the native overlay. Tauri can literally point at your existing Next.js dev server.

### End-to-end latency budget

The table below breaks the "feels like Jarvis" perceptual threshold into per-stage milliseconds, summed across the pipeline:

| Stage | Best-case (2026 cloud) | Best-case (2026 local) | Working budget for "feels Jarvis-y" (<1500ms) |
|---|---|---|---|
| Wake word → audio captured | 50ms | 50ms | 50ms |
| Voice activity detection (end-of-utterance) | 200ms | 200ms | 200ms |
| STT (final transcript) | 300ms (Deepgram) | 200ms (Parakeet) | 250ms |
| Orchestration / routing | 200ms (hybrid embed+LLM) | 200ms | 200ms |
| LLM response (first token) | 400ms (Sonnet/Haiku) | 300ms (local 70B on H100) | 350ms |
| TTS first audio chunk | 400ms (ElevenLabs Flash) | 200ms (Sesame/Kokoro) | 300ms |
| Audio playback start | 50ms | 50ms | 50ms |
| **Total p50** | **~1600ms** | **~1200ms** | **~1400ms target** |

**Notes:**
- **Sub-800ms is achievable** but only with: local STT + local TTS + smaller LLM (Haiku-class) + speculative TTS preamble. Realistically you give up some response quality.
- **The biggest cheat:** speculative TTS. Start saying "Sure, let me check…" while the LLM is still generating. Perceived latency drops by 400-700ms with zero real-latency change.
- **Above 2500ms:** the illusion breaks. Users start over-explaining or repeating themselves.
- **The single longest pole is usually the LLM,** especially with reasoning or tool use. Long-running agents (Pax doing research, Atlas building a feature) cannot be voice-first — they need an "I'll get back to you" pattern.
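
The totals in the table are straight sums of the per-stage figures, which a few lines can verify and extend as stages are re-measured. The numbers below are copied from the table. The stage labels and helper names are illustrative.

```python
# Per-stage milliseconds from the latency budget table:
# (best-case cloud, best-case local, working budget).
BUDGET_MS = {
    "wake word to audio captured": (50, 50, 50),
    "VAD end-of-utterance": (200, 200, 200),
    "STT final transcript": (300, 200, 250),
    "orchestration / routing": (200, 200, 200),
    "LLM first token": (400, 300, 350),
    "TTS first audio chunk": (400, 200, 300),
    "audio playback start": (50, 50, 50),
}
COLUMNS = ("cloud", "local", "budget")


def totals() -> dict:
    """Sum each column across all stages."""
    return {name: sum(row[i] for row in BUDGET_MS.values())
            for i, name in enumerate(COLUMNS)}


def longest_pole(column: str) -> str:
    """Return the slowest stage in a column (first one wins on ties)."""
    i = COLUMNS.index(column)
    return max(BUDGET_MS, key=lambda stage: BUDGET_MS[stage][i])
```

Note that these are serial sums. Streaming and the speculative preamble reduce perceived latency without changing any of the per-stage numbers.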

---

## Section 3 — Tradeoffs that matter

### Cloud vs. local

| Dimension | Cloud-heavy stack | Local-heavy stack |
|---|---|---|
| Latency | Predictable network-bound (300-800ms) | Lower if you have GPU, higher on CPU |
| Cost | $20-100/month at moderate use | Free after hardware sunk cost |
| Privacy | Voice/transcripts leave the machine | Stays on the machine |
| Reliability | Vendor outages happen | You own the failure mode |
| Maintenance | Near-zero | Significant — model updates, GPU drivers, audio device quirks |
| Quality ceiling | State-of-the-art instantly | ~6-12 months behind cloud frontier |

**Senior engineer's recommendation for a single-user personal assistant:** Hybrid. Local wake word + VAD (always-on, privacy-sensitive). Cloud STT + TTS + LLM (latency + quality). Reconsider quarterly as local models improve.

### Wake word vs. push-to-talk

- **Push-to-talk** is engineering-trivial and 95% as useful in practice for solo desk work.
- **Wake word** matters when (a) you're across the room, (b) hands are occupied, (c) it's a phone/wearable. For desk-bound use, the cost-benefit of wake word is much worse than people assume.
- Most "I built a Jarvis" projects that died in beta died because of wake word false positives, not anything else.

### Streaming everything vs. batch-then-process

If you can only optimize one thing for perceptual latency, **make every link in the chain streaming**. STT should emit partial transcripts. LLM should stream tokens. TTS should accept tokens and emit audio as it arrives. The architecture where each stage waits for the previous to finish will feel slow even with the world's fastest individual stages.
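
A small sketch of the LLM-to-TTS handoff: buffer streamed tokens and release each sentence to TTS as soon as it closes, instead of waiting for the full response. The sentence-boundary rule here is deliberately naive and is an assumption. It will split wrongly on abbreviations and decimals, so a real pipeline needs a better segmenter.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()  # hand this sentence to TTS immediately
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends
```

Because this is a generator, the first sentence can reach TTS while the LLM is still producing the second.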

### Single voice vs. per-agent voices

Open question for your use case: do all 17 agents share one voice, or does each have its own? Pros and cons:

- **One voice:** simpler, lower ElevenLabs voice slot cost, consistent feel, the "Jarvis" archetype
- **Per-agent voices:** more distinctive, harder to confuse who's talking, more memorable, more setup ($5-15/voice/month)
- **Hybrid:** one "narrator" voice for Larry (orchestrator) announcing handoffs, then a small palette of 3-5 voices for the specialists, mapped semantically (analytical voice for Ledger, warm voice for Kade, etc.)

Senior engineer's instinct: hybrid is the most fun and the most legible for users.

### The "always listening" tradeoff

True always-on (wake-word always live) means a process is constantly recording 1-2 seconds of audio in a rolling buffer. Implications:
- Privacy disclosure to anyone in the room (you're recording, even if it's discarded)
- Background CPU/battery usage (small but not zero)
- Wake-word false-fire rate (will happen during meetings, podcasts, TV)
- Required mute hotkey (and obvious mute indicator)

Many serious voice assistants ship a "tap to talk" mode + a "wake word mode" toggle. Don't assume always-on is the right default.
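
The rolling buffer described above can be sketched with a bounded `deque`: old frames fall off automatically, nothing touches disk, and muting clears what is held. The class name, the 16 kHz 16-bit mono format, and the 20ms frame size are assumptions for illustration.

```python
from collections import deque


class RollingAudioBuffer:
    """Keep only the most recent few seconds of audio, in memory."""

    def __init__(self, seconds: float = 2.0, sample_rate: int = 16000,
                 frame_ms: int = 20) -> None:
        # Bytes per frame for 16-bit mono PCM (ASSUMPTION about the format).
        self.frame_bytes = sample_rate * frame_ms // 1000 * 2
        self._frames = deque(maxlen=int(seconds * 1000 / frame_ms))
        self.muted = False

    def push(self, frame: bytes) -> None:
        if self.muted:
            self._frames.clear()  # mute drops held audio, not just new audio
            return
        self._frames.append(frame)  # the oldest frame is evicted when full

    def snapshot(self) -> bytes:
        """Audio to hand to STT once the wake word fires."""
        return b"".join(self._frames)
```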

---

## Section 4 — What a senior engineer would NOT recommend

- ❌ **Building wake word from scratch** unless you have a dedicated ML engineer and 6+ months
- ❌ **Pure local stack on a CPU-only machine** in 2026 — the latency is just bad enough to break the illusion
- ❌ **Whisper file-upload API as your "streaming" STT** — it's not streaming, regardless of how the docs describe it
- ❌ **Picking the cheapest TTS to save $20/month** — bad TTS is the single biggest "this feels fake" lever
- ❌ **Always-on listening without an obvious mute** — privacy and trust killer
- ❌ **Skipping VAD** ("just record until silence") — silence detection without VAD is brittle; users will say "is it listening?" constantly
- ❌ **No fallback to text** — when STT fails, you need a graceful "I didn't catch that, type instead?" path
- ❌ **Building this without a real eval harness** — voice assistants degrade silently. You need a regression suite of recorded utterances + expected behaviors.
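
The last point is cheap to start on. Below is a minimal sketch of a routing regression suite that works on transcripts of recorded utterances rather than raw audio, which is a simplification. The case list, agent names, and the deliberately naive `keyword_router` are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class RoutingCase:
    transcript: str      # text of a recorded utterance
    expected_agent: str  # agent the utterance should reach


def run_suite(cases: Iterable[RoutingCase],
              router: Callable[[str], str]) -> dict:
    """Run every case through the router and collect mismatches."""
    cases = list(cases)
    failures = []
    for case in cases:
        got = router(case.transcript)
        if got != case.expected_agent:
            failures.append((case.transcript, case.expected_agent, got))
    accuracy = 1 - len(failures) / len(cases) if cases else 0.0
    return {"accuracy": accuracy, "failures": failures}


# Hypothetical cases; a real suite would grow from logged misroutes.
CASES = [
    RoutingCase("log forty dollars for lunch", "finance"),
    RoutingCase("move my dentist appointment to friday", "calendar"),
    RoutingCase("what did I spend on software last month", "finance"),
    RoutingCase("find sources on local speech models", "research"),
]


def keyword_router(transcript: str) -> str:
    """Naive stand-in router so the suite has something to catch."""
    text = transcript.lower()
    if "appointment" in text:
        return "calendar"
    return "finance"
```

Running `run_suite(CASES, keyword_router)` reports 75% accuracy and names the misrouted research request, which is the kind of silent degradation the suite exists to surface.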

---

## Section 5 — What "good" looks like — the demo checklist

A senior voice engineer would want any v1 to pass these tests before calling it done:

1. **The first utterance lands** — wake word fires, STT captures, routing picks right agent, response plays, all in <2.5s p95
2. **Barge-in works** — say "stop" mid-response, AI stops within 150ms, listens for new utterance
3. **Background noise tolerance** — works with a fan running, music at conversational volume, light office chatter
4. **Mid-sentence pause handling** — if you say "remind me to..." then pause for 2s, it waits, doesn't cut you off
5. **Wrong-agent recovery** — if it routes to Ledger when you meant Kade, "no, ask Kade" works without re-explaining
6. **Mute works** — system-tray mute button or hotkey instantly stops listening
7. **Privacy is visible** — a clear indicator when audio is being captured (not just LED — UI state)
8. **Text fallback is one click away** — typing into the HUD works exactly like speaking
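
Test 4 (mid-sentence pauses) is often handled by making the end-of-utterance silence timeout depend on the partial transcript. The heuristic below is only a sketch: the trailing-word list and both timeout values are assumptions to tune, and production systems often use a learned end-of-turn model instead.

```python
# Words that suggest the speaker is not finished (ASSUMPTION: illustrative list).
TRAILING_WORDS = {"to", "and", "but", "the", "a", "for", "with", "about"}


def silence_timeout_ms(partial_transcript: str, base_ms: int = 500,
                       extended_ms: int = 2500) -> int:
    """Wait longer when the partial transcript ends on a dangling word."""
    words = partial_transcript.lower().rstrip(" .…").split()
    if words and words[-1] in TRAILING_WORDS:
        return extended_ms
    return base_ms


def utterance_finished(partial_transcript: str, silence_ms: int) -> bool:
    """True once the observed silence exceeds the context-dependent timeout."""
    return silence_ms >= silence_timeout_ms(partial_transcript)
```

With these values, "remind me to..." followed by a 2-second pause keeps the assistant listening, while "what time is it" followed by 800ms of silence ends the turn.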

---

## Section 6 — Summary

A real senior voice/multimodal engineer in 2026 would specify roughly this stack as a defensible starting point:

| Layer | Choice | Why |
|---|---|---|
| Wake word | Porcupine (or skip — push-to-talk v1) | Best accuracy/effort ratio. Wake word is where projects die — skip it until you've proved the rest works. |
| VAD | WebRTC VAD or Silero-VAD | Standard. Free. Good enough. |
| STT | Deepgram Nova-3 (cloud) or NVIDIA Parakeet (local-GPU) | Best 2026 streaming options |
| LLM | Sonnet/Haiku via Anthropic API for routing + generation | Good latency, smart routing |
| Orchestration | Embedding pre-filter + LLM tiebreaker | Sweet spot for 17 agents |
| TTS | ElevenLabs Flash v2.5 (cloud) or Sesame CSM-1B (local-GPU) | Best naturalness at acceptable latency |
| HUD | Tauri (frameless, always-on-top, system-tray-anchored) | Native feel, small footprint, web tech inside |
| Latency target | <1500ms end-to-end p50 | Achievable with the above stack if stages stream and overlap (the serial cloud best case sums to ~1600ms) |
| Privacy controls | Visible mute, no retention by default, opt-in transcript logging | Trust requirement |

**The full Jarvis-grade stack delta vs. "voice frontend on existing infra":** the existing-infra path (push-to-talk + cloud STT + cloud TTS + existing routing) gets you to ~70% of the user experience for ~10% of the engineering effort. The remaining 30% takes 5-8 weeks of dedicated work, and most of it is in wake word, native overlay, barge-in, and local-latency optimization.

For Jimmie's specific situation — solo desk user, 17 specialist agents already orchestrated by Larry, Atlas DB already exists — the gold-standard stack is overkill for v1. But knowing what gold-standard looks like makes the cuts deliberate, not accidental.

---

**Cross-reference:** Atlas's bolt-on report (`Team Inbox/Atlas/kade-status-and-jarvis-bolt-on.md`) covers what to actually build first. This report is the reference for what we're choosing not to build yet, and what we'd reach for if/when the v1 proves the use case.