โ† All Guides

🔊 Voice & Talk Mode

Speak to your agent, hear it respond. Three levels: voice notes on messaging channels, full Talk Mode with wake word on macOS/iOS/Android, and phone call integration via Twilio/Plivo. Powered by Whisper STT + ElevenLabs TTS.

Medium · ~15 min · ElevenLabs + Whisper · macOS / iOS / Android

๐ŸŽš๏ธ Three levels of voice

| Level | What it does | Requires | Cost |
|---|---|---|---|
| 1. Voice notes | Send voice → get text reply | Just Whisper STT (via OpenAI key) | ~$0.006/min |
| 2. TTS responses | Agent replies as audio | ElevenLabs API key | ~$5–15/mo |
| 3. Talk Mode | Hands-free, real-time conversation | macOS/iOS/Android node + mic | ~$10–20/mo |

✅ Start at Level 1 – voice notes on Telegram/WhatsApp work out of the box with just an OpenAI API key. Add ElevenLabs TTS in week 2 if you want spoken responses. Talk Mode is week 3.

🎤 Level 1: Voice notes (easiest)

Send a voice note on Telegram or WhatsApp. OpenClaw automatically transcribes it using Whisper and processes the transcript like a normal text message. No extra config needed if you have an OpenAI API key.

This works because:

  • Telegram/WhatsApp voice notes arrive as audio files
  • OpenClaw runs them through Whisper (OpenAI API or local)
  • The transcript is treated as the message body
  • Slash commands in speech work too ("slash reset" → /reset)

# STT config (usually auto-detected from your OpenAI key)
{
  "messages": {
    "stt": {
      "provider": "openai",
      "model": "whisper-1",
      "language": "en"
    }
  }
}
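The spoken slash-command mapping can be sketched as a small transcript normalizer. This is a hedged illustration only: the `normalize_transcript` helper and the command set are hypothetical, not OpenClaw's actual implementation.

```python
import re

# Assumed command set for illustration; a real install would use its own list.
KNOWN_COMMANDS = {"reset", "status", "help"}

def normalize_transcript(text: str) -> str:
    """Turn a leading spoken 'slash <command>' into a real slash command."""
    match = re.match(r"^\s*slash\s+(\w+)\b(.*)$", text, flags=re.IGNORECASE)
    if match and match.group(1).lower() in KNOWN_COMMANDS:
        return "/" + match.group(1).lower() + match.group(2)
    return text  # unrecognized commands pass through as plain text
```

With this, a transcript like "slash reset" becomes "/reset", while ordinary speech is left untouched.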

๐Ÿ—ฃ๏ธ Level 2: TTS responses

Make your agent talk back with ElevenLabs:

  1. Sign up at elevenlabs.io (free tier available for testing)
  2. Copy your API key from Profile → API Keys
  3. Browse the voice library and pick a voice ID
# Set the env var
export ELEVENLABS_API_KEY="your_key_here"

# Or configure in openclaw.json
{
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "elevenlabs",
      "voiceId": "EXAVITQu4vr4xnSDxMaL",
      "maxTextLength": 4000,
      "timeoutMs": 30000
    }
  }
}
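For a sense of what the TTS step does with this config, here is a sketch of assembling an ElevenLabs text-to-speech request, honoring maxTextLength. The endpoint path follows ElevenLabs' public REST API; `build_tts_request` and the config-dict shape are illustrative assumptions, not OpenClaw's code.

```python
def build_tts_request(text: str, cfg: dict) -> tuple[str, dict]:
    """Assemble (url, json_payload) for an ElevenLabs text-to-speech call."""
    # Truncate before synthesis so long replies don't burn credits.
    trimmed = text[: cfg.get("maxTextLength", 4000)]
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{cfg['voiceId']}"
    payload = {
        "text": trimmed,
        "model_id": cfg.get("modelId", "eleven_multilingual_v2"),  # assumed default
    }
    return url, payload

url, payload = build_tts_request(
    "Hello!", {"voiceId": "EXAVITQu4vr4xnSDxMaL", "maxTextLength": 4000}
)
```

The actual HTTP POST (with the `xi-api-key` header) would then be made with any HTTP client; the point here is the truncation-before-synthesis step.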

TTS auto modes

| Mode | Behavior |
|---|---|
| "inbound" | Only speaks when the user sent voice first (recommended) |
| "always" | Always responds with audio |
| "tagged" | Only speaks when the agent's response includes a voice tag |
| false | TTS disabled |

✅ Use "inbound" – the agent only talks back when you spoke first. Prevents surprise audio and unnecessary cost.
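The four auto modes boil down to a simple decision. A hypothetical helper (not OpenClaw's actual code) mirroring the table above:

```python
def should_speak(auto_mode, user_sent_voice: bool, response_has_voice_tag: bool) -> bool:
    """Decide whether to synthesize audio for a reply, given tts.auto."""
    if auto_mode is False:
        return False                    # TTS disabled
    if auto_mode == "always":
        return True                     # every reply gets audio
    if auto_mode == "inbound":
        return user_sent_voice          # only answer voice with voice
    if auto_mode == "tagged":
        return response_has_voice_tag   # agent opted in explicitly
    return False                        # unknown mode: stay silent
```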

🎧 Level 3: Talk Mode

Full hands-free: you speak, the agent listens, processes, and speaks back. Requires a paired node device with microphone and speaker.

Architecture

┌──────────────┐  WebSocket   ┌───────────────────────┐
│ Node device  │◄────────────►│ Gateway               │
│ (mic + spkr) │              │ (model + tools)       │
│ macOS/iOS/   │              │ ws://127.0.0.1:18789  │
│ Android      │              └───────────────────────┘
└──────────────┘
  ↕ Whisper STT                 ↕ ElevenLabs TTS

The Gateway handles AI processing; the node handles audio I/O. This keeps the Gateway headless and stable.
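As an illustration of the node/gateway split, a node might frame mic audio into JSON messages before sending them over the WebSocket. The message shape below is invented for this sketch; OpenClaw's real wire format may differ.

```python
import base64
import json

def frame_audio_chunk(chunk: bytes, seq: int) -> str:
    """Wrap a raw mic-audio chunk in a JSON envelope for the gateway."""
    return json.dumps({
        "type": "audio.chunk",  # assumed message type, for illustration
        "seq": seq,             # sequence number so the gateway can reorder
        "data": base64.b64encode(chunk).decode("ascii"),
    })

def parse_frame(raw: str) -> bytes:
    """Recover the raw audio bytes from a framed message."""
    msg = json.loads(raw)
    return base64.b64decode(msg["data"])
```

The key design point is that the node ships opaque audio bytes and the Gateway decides what to do with them, so the Gateway never needs mic or speaker access itself.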

Activate Talk Mode

  • Web client: Click the microphone icon at localhost:18789
  • macOS app: Push-to-talk overlay in the menu bar
  • iOS/Android: Via the companion node app
  • Wake word: Say "hey claw" (see below)
  • Chat command: Send "Start talk mode" or "Let's talk"

Talk Mode config

{
  "talk": {
    "voiceId": "EXAVITQu4vr4xnSDxMaL",
    "modelId": "eleven_v3",
    "outputFormat": "mp3_44100_128",
    "interruptOnSpeech": true,
    "stability": 0.5,
    "similarityBoost": 0.75
  }
}

Interrupt on speech: If you start talking while the agent is speaking, it stops and listens. Feels natural.
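The barge-in behavior can be modeled as a tiny state machine: detecting user speech while the agent is speaking cancels playback and returns to listening. The `TalkSession` class and state names are invented for illustration.

```python
IDLE, LISTENING, SPEAKING = "idle", "listening", "speaking"

class TalkSession:
    """Toy model of Talk Mode's interruptOnSpeech behavior."""

    def __init__(self, interrupt_on_speech: bool = True):
        self.state = LISTENING
        self.interrupt_on_speech = interrupt_on_speech

    def start_reply(self):
        self.state = SPEAKING  # agent starts playing TTS audio

    def on_user_speech(self):
        if self.state == SPEAKING and self.interrupt_on_speech:
            self.state = LISTENING  # stop playback, go back to listening
```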

👂 Wake word detection

Always-on listening that activates on a keyword (like "Hey Siri" or "OK Google"):

{
  "voice": {
    "wake_word": {
      "enabled": true,
      "engine": "porcupine",
      "keyword": "hey claw",
      "sensitivity": 0.5
    }
  }
}

| Engine | Notes |
|---|---|
| Porcupine | Most popular. Requires a Picovoice API key. Custom keywords supported. |
| Snowboy | Open-source alternative. Less accurate but free. |

โš ๏ธ Tune sensitivity carefully. Too high = false triggers on random words. Too low = won't hear you. Start at 0.5 and adjust. Use headphones to prevent echo.

🔄 Provider comparison

| Provider | Type | Quality | Cost | Latency |
|---|---|---|---|---|
| ElevenLabs | TTS | ⭐⭐⭐ | $5–22/mo | ~1–2s |
| OpenAI TTS | TTS | ⭐⭐ | $15/1M chars | ~1s |
| Edge TTS | TTS | ⭐ | Free | ~2s |
| Whisper (API) | STT | ⭐⭐⭐ | $0.006/min | ~1s |
| Whisper (local) | STT | ⭐⭐⭐ | Free | ~2–5s |
| Deepgram | STT | ⭐⭐ | $0.0059/min | ~0.5s |

💡 Recommended stack: ElevenLabs for TTS (best natural voice) + Whisper API for STT (best accuracy). Run local Whisper as fallback. Total: ~$10–20/mo for moderate use.
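The cloud-first, local-fallback pattern is a few lines. The `transcribe` helper and the provider callables are stand-ins for real wrappers (e.g. around the OpenAI API and a local Whisper runtime):

```python
from typing import Callable

def transcribe(audio: bytes,
               cloud: Callable[[bytes], str],
               local: Callable[[bytes], str]) -> str:
    """Try the cloud STT provider first; fall back to local Whisper on failure."""
    try:
        return cloud(audio)
    except Exception:
        # Cloud outage, rate limit, bad key, etc. -> local model keeps working.
        return local(audio)
```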

โš™๏ธ Full config reference

{
  "messages": {
    "stt": {
      "provider": "openai",
      "model": "whisper-1",
      "language": "en"
    },
    "tts": {
      "auto": "inbound",
      "provider": "elevenlabs",
      "voiceId": "EXAVITQu4vr4xnSDxMaL",
      "maxTextLength": 4000,
      "timeoutMs": 30000
    }
  },
  "talk": {
    "voiceId": "EXAVITQu4vr4xnSDxMaL",
    "modelId": "eleven_v3",
    "interruptOnSpeech": true,
    "stability": 0.5,
    "similarityBoost": 0.75
  }
}

💡 Tips & best practices

  • Start with voice notes – Level 1 costs almost nothing and works today
  • Use "inbound" TTS mode – the agent only speaks when you spoke first
  • Set maxTextLength – prevents TTS from reading 300-line stack traces
  • Use headphones for Talk Mode – prevents echo and improves wake word detection
  • Choose your voice wisely – ElevenLabs has dozens of pre-made voices. Browse the library to match your agent's personality.
  • Monitor ElevenLabs credits – chatty setups can hit $50+/mo. The free tier is fine for testing.
  • Keep local Whisper as fallback – if cloud STT is down, the local model keeps working

🔧 Troubleshooting

| Problem | Fix |
|---|---|
| Voice notes not transcribed | Check the STT config has a valid OpenAI API key. Try openclaw logs --follow for errors. |
| TTS audio not playing | Verify the ElevenLabs API key and voice ID. Check tts.auto is not false. |
| Talk Mode has no audio | Ensure the node device is paired. Check mic permissions in OS settings. |
| High latency | Use the eleven_v3 model (fastest). Try local Whisper for STT. Check the network. |
| WebCrypto error in dashboard | Access the Control UI over HTTPS (Tailscale Serve) or localhost. Plain HTTP on remote hosts breaks WebCrypto. |
| Agent reads code/markdown aloud | Set maxTextLength: 4000. The TTS pipeline should strip markdown, but long responses still waste credits. |
| Wake word false positives | Lower sensitivity (closer to 0.0). Use a more unique wake phrase. A quiet environment helps. |