## 🎙️ Three levels of voice
| Level | What it does | Requires | Cost |
|---|---|---|---|
| 1. Voice notes | Send voice → get text reply | Just Whisper STT (via OpenAI key) | ~$0.006/min |
| 2. TTS responses | Agent replies as audio | ElevenLabs API key | ~$5-15/mo |
| 3. Talk Mode | Hands-free, real-time conversation | macOS/iOS/Android node + mic | ~$10-20/mo |
✅ Start at Level 1 – voice notes on Telegram/WhatsApp work out of the box with just an OpenAI API key. Add ElevenLabs TTS in week 2 if you want spoken responses. Talk Mode is week 3.
## 🎤 Level 1: Voice notes (easiest)
Send a voice note on Telegram or WhatsApp. OpenClaw automatically transcribes it using Whisper and processes the transcript like a normal text message. No extra config needed if you have an OpenAI API key.
This works because:
- Telegram/WhatsApp voice notes arrive as audio files
- OpenClaw runs them through Whisper (OpenAI API or local)
- The transcript is treated as the message body
- Slash commands in speech work too ("slash reset" → /reset)
STT config (usually auto-detected from your OpenAI key):

```json
{
  "messages": {
    "stt": {
      "provider": "openai",
      "model": "whisper-1",
      "language": "en"
    }
  }
}
```
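The spoken slash-command mapping can be sketched as a small normalizer. This is a hypothetical helper, not OpenClaw source; the real parser may handle more phrasings.

```python
import re

def normalize_spoken_command(transcript: str) -> str:
    """Map a spoken command prefix ("slash reset") onto its typed form ("/reset")."""
    match = re.match(r"^\s*slash\s+(\w+)\s*(.*)$", transcript, re.IGNORECASE)
    if not match:
        # No spoken slash prefix: pass the transcript through as a normal message.
        return transcript.strip()
    command, rest = match.groups()
    return f"/{command.lower()} {rest}".strip()
```

For example, `normalize_spoken_command("slash reset")` yields `/reset`, while ordinary speech is passed through untouched.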
## 🗣️ Level 2: TTS responses
Make your agent talk back with ElevenLabs:
- Sign up at elevenlabs.io (free tier available for testing)
- Copy your API key from Profile → API Keys
- Browse the voice library and pick a voice ID
```bash
# Set the env var
export ELEVENLABS_API_KEY="your_key_here"
```

Or configure in `openclaw.json`:

```json
{
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "elevenlabs",
      "voiceId": "EXAVITQu4vr4xnSDxMaL",
      "maxTextLength": 4000,
      "timeoutMs": 30000
    }
  }
}
```
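A sketch of the HTTP request this config implies. The endpoint and `xi-api-key` header follow ElevenLabs' public text-to-speech API, but the helper name and returned dict shape are illustrative, and the truncation mirrors the `maxTextLength` setting above.

```python
def build_tts_request(api_key: str, voice_id: str, text: str,
                      max_text_length: int = 4000) -> dict:
    """Build the pieces of an ElevenLabs TTS call without sending it."""
    return {
        # Public ElevenLabs text-to-speech endpoint, keyed by voice ID.
        "url": f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        "headers": {"xi-api-key": api_key, "Content-Type": "application/json"},
        # Truncate to maxTextLength so a long reply doesn't burn credits.
        "json": {"text": text[:max_text_length], "model_id": "eleven_v3"},
    }
```

Passing the result to any HTTP client (`requests.post(req["url"], headers=req["headers"], json=req["json"])`) returns the audio bytes.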
### TTS auto modes

| Mode | Behavior |
|---|---|
| `"inbound"` | Only speaks when the user sent voice first (recommended) |
| `"always"` | Always responds with audio |
| `"tagged"` | Only speaks when the agent's response includes a voice tag |
| `false` | TTS disabled |
✅ Use "inbound" – the agent only talks back when you spoke first. Prevents surprise audio and unnecessary cost.
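The auto-modes table reduces to a small decision function. A sketch of the implied logic, not OpenClaw source:

```python
def should_speak(auto_mode, user_sent_voice: bool, reply_has_voice_tag: bool) -> bool:
    """Decide whether a reply gets a TTS rendering under a given auto mode."""
    if auto_mode == "always":
        return True
    if auto_mode == "inbound":
        # Only answer with audio when the user's message arrived as voice.
        return user_sent_voice
    if auto_mode == "tagged":
        return reply_has_voice_tag
    return False  # auto_mode is false: TTS disabled
```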
## 🎧 Level 3: Talk Mode
Full hands-free: you speak, the agent listens, processes, and speaks back. Requires a paired node device with microphone and speaker.
### Architecture

```
┌───────────────┐    WebSocket    ┌────────────────────────┐
│  Node device  │◄───────────────►│  Gateway               │
│ (mic + spkr)  │                 │  (model + tools)       │
│  macOS/iOS/   │                 │  ws://127.0.0.1:18789  │
│   Android     │                 └────────────────────────┘
└───────────────┘
  ↕ Whisper STT / ElevenLabs TTS
```
The Gateway handles AI processing; the node handles audio I/O. This keeps the Gateway headless and stable.
### Activate Talk Mode

- Web client: click the microphone icon at `localhost:18789`
- macOS app: push-to-talk overlay in the menu bar
- iOS/Android: via the companion node app
- Wake word: say "hey claw" (see below)
- Chat command: send "Start talk mode" or "Let's talk"
### Talk Mode config

```json
{
  "talk": {
    "voiceId": "EXAVITQu4vr4xnSDxMaL",
    "modelId": "eleven_v3",
    "outputFormat": "mp3_44100_128",
    "interruptOnSpeech": true,
    "stability": 0.5,
    "similarityBoost": 0.75
  }
}
```
Interrupt on speech: If you start talking while the agent is speaking, it stops and listens. Feels natural.
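The barge-in behavior amounts to a two-state machine. An illustrative sketch; the real logic lives inside the node app:

```python
class TalkSession:
    """Minimal model of interruptOnSpeech: listening vs. speaking."""

    def __init__(self, interrupt_on_speech: bool = True):
        self.interrupt_on_speech = interrupt_on_speech
        self.state = "listening"  # "listening" or "speaking"

    def agent_starts_reply(self) -> None:
        self.state = "speaking"

    def user_speech_detected(self) -> None:
        # If the user barges in while the agent is talking, stop playback
        # and go back to listening; otherwise the agent keeps speaking.
        if self.state == "speaking" and self.interrupt_on_speech:
            self.state = "listening"
```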
## 👂 Wake word detection
Always-on listening that activates on a keyword (like "Hey Siri" or "OK Google"):
```json
{
  "voice": {
    "wake_word": {
      "enabled": true,
      "engine": "porcupine",
      "keyword": "hey claw",
      "sensitivity": 0.5
    }
  }
}
```
| Engine | Notes |
|---|---|
| Porcupine | Most popular. Requires Picovoice API key. Custom keywords supported. |
| Snowboy | Open-source alternative. Less accurate but free. |
⚠️ Tune sensitivity carefully. Too high = false triggers on random words. Too low = won't hear you. Start at 0.5 and adjust. Use headphones to prevent echo.
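One way to picture what `sensitivity` does, as an illustrative sketch only (Porcupine's real scoring is internal to the engine): higher sensitivity lowers the confidence bar a detection must clear.

```python
def wake_word_triggered(confidence: float, sensitivity: float) -> bool:
    """Treat sensitivity as the inverse of a detection threshold.

    sensitivity 0.5 -> a candidate needs confidence >= 0.5;
    sensitivity 0.8 -> confidence >= 0.2 suffices (more false triggers).
    """
    threshold = 1.0 - sensitivity
    return confidence >= threshold
```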
## 📊 Provider comparison
| Provider | Type | Quality | Cost | Latency |
|---|---|---|---|---|
| ElevenLabs | TTS | ⭐⭐⭐ | $5-22/mo | ~1-2s |
| OpenAI TTS | TTS | ⭐⭐ | $15/1M chars | ~1s |
| Edge TTS | TTS | ⭐ | Free | ~2s |
| Whisper (API) | STT | ⭐⭐⭐ | $0.006/min | ~1s |
| Whisper (local) | STT | ⭐⭐⭐ | Free | ~2-5s |
| Deepgram | STT | ⭐⭐ | $0.0059/min | ~0.5s |
💡 Recommended stack: ElevenLabs for TTS (best natural voice) + Whisper API for STT (best accuracy). Run local Whisper as fallback. Total: ~$10-20/mo for moderate use.
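A back-of-envelope check of that estimate, using the table's prices (assumptions: Whisper API billed at $0.006/min, ElevenLabs on a flat-rate plan; real bills vary with usage):

```python
def monthly_voice_cost(stt_minutes: float, elevenlabs_plan_usd: float) -> float:
    """Estimate one month of voice spend: metered STT plus a flat TTS plan."""
    whisper_cost = stt_minutes * 0.006  # Whisper API price per audio minute
    return round(whisper_cost + elevenlabs_plan_usd, 2)
```

For example, 1,000 minutes of STT plus the $5 ElevenLabs tier comes to about $11/mo, consistent with the ~$10-20 range quoted above.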
## ⚙️ Full config reference
```json
{
  "messages": {
    "stt": {
      "provider": "openai",
      "model": "whisper-1",
      "language": "en"
    },
    "tts": {
      "auto": "inbound",
      "provider": "elevenlabs",
      "voiceId": "EXAVITQu4vr4xnSDxMaL",
      "maxTextLength": 4000,
      "timeoutMs": 30000
    }
  },
  "talk": {
    "voiceId": "EXAVITQu4vr4xnSDxMaL",
    "modelId": "eleven_v3",
    "interruptOnSpeech": true,
    "stability": 0.5,
    "similarityBoost": 0.75
  }
}
```
## 💡 Tips & best practices
- Start with voice notes – Level 1 costs almost nothing and works today
- Use `"inbound"` TTS mode – the agent only speaks when you spoke first
- Set `maxTextLength` – prevents TTS from reading 300-line stack traces
- Use headphones for Talk Mode – prevents echo and improves wake word detection
- Choose your voice wisely – ElevenLabs has dozens of pre-made voices. Browse the library to match your agent's personality.
- Monitor ElevenLabs credits – chatty setups can hit $50+/mo. The free tier is fine for testing.
- Local Whisper as fallback – if cloud STT is down, the local model keeps working
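The fallback tip can be sketched as a thin wrapper; both transcriber arguments are stand-ins for real cloud and local clients:

```python
def transcribe_with_fallback(audio_path: str, cloud_stt, local_stt) -> str:
    """Try cloud STT first; fall back to a local Whisper model on failure."""
    try:
        return cloud_stt(audio_path)
    except Exception:
        # Cloud unreachable or key invalid: the local model keeps voice working.
        return local_stt(audio_path)
```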
## 🔧 Troubleshooting
| Problem | Fix |
|---|---|
| Voice notes not transcribed | Check the STT config has a valid OpenAI API key. Try `openclaw logs --follow` for errors. |
| TTS audio not playing | Verify the ElevenLabs API key and voice ID. Check `tts.auto` is not `false`. |
| Talk Mode no audio | Ensure the node device is paired. Check mic permissions in OS settings. |
| High latency | Use the `eleven_v3` model (fastest). Try local Whisper for STT. Check network. |
| WebCrypto error in dashboard | Access the Control UI over HTTPS (Tailscale Serve) or localhost. Plain HTTP on remote hosts breaks WebCrypto. |
| Agent reads code/markdown aloud | Set `maxTextLength: 4000`. The TTS pipeline should strip markdown, but long responses still waste credits. |
| Wake word false positives | Lower `sensitivity` (closer to 0.0). Use a more unique wake phrase. A quiet environment helps. |