PersonaPlex 7B Runs Full-Duplex Speech on a Mac

A developer ported NVIDIA's PersonaPlex 7B speech-to-speech model to native Swift using MLX, running full-duplex conversation on Apple Silicon with no cloud, no Python, and faster-than-real-time inference.

A Swift developer named Ivan built a native port of NVIDIA's PersonaPlex 7B that runs full-duplex speech-to-speech on Apple Silicon using MLX. No Python runtime, no cloud API, no CoreML conversion - just a 4-bit quantized model running faster than real-time on your Mac. The implementation hit #1 on Hacker News with 218 points and 68 comments.

The code ships as part of qwen3-asr-swift, an open-source Swift speech toolkit that bundles ASR, TTS, speech-to-speech, voice activity detection, speaker diarization, and noise suppression into a single package for Apple Silicon.

TL;DR

  • NVIDIA PersonaPlex 7B ported to native Swift via MLX - full-duplex speech-to-speech on Apple Silicon
  • 4-bit quantized model weighs ~5.3 GB, runs at ~68ms per step (RTF 0.87 on M2 Max)
  • Part of qwen3-asr-swift, an open-source toolkit covering the full speech pipeline
  • Supports 18 voice presets and role-based system prompts
  • No Python, no server, no cloud - install via Homebrew or Swift Package Manager
  • English only for now; the ASR component supports 52 languages and TTS supports 10

What PersonaPlex Actually Is

NVIDIA's Full-Duplex Architecture

PersonaPlex is NVIDIA's extension of Kyutai's Moshi architecture. Traditional voice assistants use a cascaded pipeline - speech-to-text, then an LLM, then text-to-speech. PersonaPlex skips the text step entirely. Audio goes in, audio comes out. The model listens and speaks simultaneously, which means it can handle interruptions, backchanneling, and natural turn-taking without the robotic pauses that plague cascaded systems.
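The latency argument against cascaded pipelines can be made concrete with a back-of-the-envelope budget. The stage numbers below are illustrative assumptions, not measurements; only the ~68ms step time comes from the port's published figures.

```swift
import Foundation

// Illustrative latency budget: why a cascaded pipeline pauses before
// replying while a full-duplex model doesn't. Stage times are guesses.
let cascadedStages: [String: Double] = [
    "ASR (finalize utterance)": 0.40,  // must wait for end-of-turn
    "LLM (first token)":        0.60,
    "TTS (first audio)":        0.25,
]
// Cascaded: stages run in sequence, so first audio arrives after the sum.
let cascadedFirstAudio = cascadedStages.values.reduce(0, +)

// Full-duplex: the model emits audio on every step while still listening,
// so response latency is on the order of one step time.
let duplexStepSeconds = 0.068  // ~68 ms/step reported for the Swift port
print(String(format: "cascaded: %.2f s, full-duplex: %.3f s",
             cascadedFirstAudio, duplexStepSeconds))
```

The point is structural: the cascaded total is a sum of serialized stages, while full-duplex latency stays roughly constant per step regardless of how long either party has been talking.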

The 7B model was released in January 2026 under NVIDIA's Open Model License (MIT-compatible). It ships with 18 controllable voice presets and supports role-based text prompts that define the AI's persona, background, and conversation style. NVIDIA trained it on about 2,500 hours of data blending real conversations from the Fisher English corpus with synthetic dialogues produced by Qwen3-32B.
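A role-based persona prompt pairs one of the shipped voice presets with a text description of who the assistant is. The struct and field names below are hypothetical, sketched only to show the shape of such a configuration; they are not PersonaPlex's actual API.

```swift
// Hypothetical shape of a role-based persona configuration. The model
// takes a text prompt defining role, background, and style plus one of
// the 18 voice presets; these names are illustrative, not NVIDIA's API.
struct PersonaConfig {
    var voicePreset: String   // one of the 18 shipped presets
    var role: String          // who the assistant is
    var background: String    // context it should assume
    var style: String         // conversational register

    var prompt: String {
        "You are \(role). \(background) Speak in a \(style) manner."
    }
}

let agent = PersonaConfig(voicePreset: "preset_01",
                          role: "a hotel concierge",
                          background: "You work the night desk at a small hotel.",
                          style: "warm, brief")
print(agent.prompt)
```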

On NVIDIA hardware (A100/H100), PersonaPlex already beats Moshi and Freeze Omni on the FullDuplexBench for conversation dynamics. The question was always whether this could run locally on consumer hardware - and specifically on Apple Silicon, where MLX has become the default framework for on-device inference.

The Swift Port

Ivan's implementation converts the model to 4-bit quantization via MLX, bringing the footprint down to about 5.3 GB. On an M2 Max, inference runs at roughly 68ms per step with a real-time factor of 0.87 - meaning the model produces audio faster than you can listen to it.
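The two published numbers pin down a third: real-time factor is compute time divided by the duration of audio produced, so ~68ms of compute at RTF 0.87 implies each step yields roughly 78ms of audio. A quick sanity check:

```swift
import Foundation

// Real-time factor (RTF) = compute time / duration of audio produced.
// RTF < 1.0 means the model generates audio faster than playback.
let stepCompute = 0.068               // ~68 ms of compute per step (M2 Max)
let rtf = 0.87                        // reported real-time factor
let audioPerStep = stepCompute / rtf  // implied audio duration per step
print(String(format: "each step yields ~%.0f ms of audio", audioPerStep * 1000))
```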

Install via Homebrew:

  brew tap ivan-digital/speech https://github.com/ivan-digital/qwen3-asr-swift
  brew install speech

Or add the package to your Package.swift dependencies:

  dependencies: [
    .package(url: "https://github.com/ivan-digital/qwen3-asr-swift", branch: "main")
  ]

The port uses MLX Swift directly - no Python bridge, no CoreML conversion, no server process. Models auto-download from Hugging Face on first run. The repository includes a PersonaPlexDemo app with microphone input, VAD-triggered recording, and multi-turn conversation support.

The Full Speech Stack

PersonaPlex is the headline feature, but qwen3-asr-swift is a broader toolkit. Here is what ships in the package:

Component          Model                     Task               Size      Streaming
ASR                Qwen3-ASR-0.6B (4-bit)    Speech-to-text     ~400 MB   No
ASR                Qwen3-ASR-1.7B (8-bit)    Speech-to-text     ~2.5 GB   No
ASR                Parakeet-TDT (CoreML)     Speech-to-text     ~315 MB   No
TTS                Qwen3-TTS-0.6B (4-bit)    Text-to-speech     ~1.7 GB   Yes (~120ms)
TTS                CosyVoice3-0.5B (4-bit)   Text-to-speech     ~1.9 GB   Yes (~150ms)
Speech-to-Speech   PersonaPlex-7B (4-bit)    Full-duplex        ~5.3 GB   Yes (~2s chunks)
VAD                Silero-VAD-v5             Voice activity     ~1.2 MB   Yes (32ms)
Enhancement        DeepFilterNet3            Noise suppression  ~4.2 MB   Yes (10ms)
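The control flow around VAD-gated recording is simple even though the detector itself isn't. Silero-VAD is a neural model; the short-time energy gate below is a deliberately simplified stand-in that only shows the gating pattern, with the frame size matching the 32ms streaming granularity listed above.

```swift
// Simplified stand-in for VAD gating. Silero-VAD is a neural model; this
// short-time energy gate just illustrates the surrounding control flow.
func isSpeech(_ frame: [Float], threshold: Float = 1e-3) -> Bool {
    guard !frame.isEmpty else { return false }
    // Mean squared amplitude over the frame.
    let energy = frame.reduce(0) { $0 + $1 * $1 } / Float(frame.count)
    return energy > threshold
}

// One frame = 512 samples = 32 ms at 16 kHz.
let silence = [Float](repeating: 0.0, count: 512)  // gate stays closed
let tone = [Float](repeating: 0.5, count: 512)     // gate opens
print(isSpeech(silence), isSpeech(tone))           // prints "false true"
```

A real pipeline would feed microphone frames through the gate and start (or stop) recording when the decision flips, usually with a short hangover so brief pauses don't cut the utterance.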

The ASR handles 52 languages. TTS covers 10 languages with streaming latency under 150ms. The entire diarization stack is about 32 MB. Everything runs on Apple Silicon M1 or newer, macOS 14+, with no external dependencies beyond the Swift toolchain.

For anyone building local voice applications, this is a meaningfully different proposition than running Whisper plus a separate TTS engine plus a separate LLM. One package, one framework, one install.

What Hacker News Thought

The post hit 218 points, but the comments were a mixed bag. The architecture impressed people - full-duplex speech-to-speech on consumer hardware is genuinely novel. But several developers flagged practical issues.

User d4rkp4ttern reported 10-second response times and off-topic outputs on an M1 Max. Another commenter noted the demo currently accepts WAV files rather than enabling true interactive streaming conversation. The consensus was that the traditional STT-to-LLM-to-TTS pipeline still wins for practical agent use cases, since it gives you access to tool calling, function execution, and frontier-class reasoning models that a 7B speech model can't match.

The counterargument: full-duplex is architecturally interesting exactly because it eliminates the cascaded latency. Even if the 7B model's reasoning isn't frontier-grade, the natural conversation dynamics - interruptions, backchanneling, overlapping speech - are things cascaded systems struggle with fundamentally.

Where It Falls Short

English Only for Full-Duplex

PersonaPlex supports only English. The ASR and TTS components handle dozens of languages, but the flagship speech-to-speech capability is English-only. NVIDIA's training data was almost entirely English (the Fisher corpus plus English synthetic dialogues).

Proof of Concept, Not Production

The current demo takes WAV files as input. The PersonaPlexDemo app adds microphone input, VAD triggering, and continuous full-duplex interaction, but that path is still early. Multiple HN commenters described it as a proof of concept rather than a production-ready voice assistant.

Hardware Floor

While Apple's M-series chips are increasingly capable for on-device inference, the M1 base model may struggle. The published RTF of 0.87 is on an M2 Max. Lower-tier chips will have higher latency, and the 5.3 GB model footprint plus runtime memory could be tight on 8 GB unified memory machines.
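How tight 8 GB actually gets is easy to estimate. The weight size is the article's published figure; the runtime overhead and OS reservation below are assumptions for illustration, not measurements.

```swift
import Foundation

// Back-of-the-envelope memory check for an 8 GB unified-memory machine.
let unifiedMemoryGB = 8.0
let weightsGB = 5.3                 // published 4-bit PersonaPlex footprint
let assumedRuntimeOverheadGB = 1.5  // KV cache, audio buffers, activations (guess)
let assumedOSAndAppsGB = 2.0        // macOS plus everything else (guess)

let headroom = unifiedMemoryGB - weightsGB - assumedRuntimeOverheadGB - assumedOSAndAppsGB
print(String(format: "headroom: %.1f GB", headroom))  // negative means swapping
```

Under these assumptions the budget goes negative, which is consistent with the article's caution about 8 GB machines; 16 GB or more gives comfortable margin.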

7B Reasoning Ceiling

A 7B model isn't going to match GPT-5 or Claude on complex reasoning tasks. For use cases that need both natural conversation dynamics and deep reasoning, the cascaded pipeline with a frontier LLM in the middle is still the better architecture. PersonaPlex's sweet spot is scenarios where conversational naturalness matters more than intellectual depth - customer service, language practice, companion apps.


The port matters less for what it does today and more for what it proves: a 7B full-duplex speech model can run faster than real-time on a Mac, in native Swift, with no cloud dependency. That is the kind of capability floor that gets interesting when models get better and Apple ships more memory.

About the author: AI Infrastructure & Open Source Reporter

Sophie is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.