
Hermes Voice Gemini

GPT-4o-Realtime-style voice pipeline for Hermes using Gemini Live + Claude tool calls. Architecture, files, quirks, and known issues.

Voice-first Hermes frontend built on Gemini Live, with an ask_hermes tool that delegates to Claude + MCPs. Integrated into hermes-gateway: it runs in the same bot and process and does NOT require a second Discord bot.

Architecture

Discord voice (DAVE E2EE) → VoiceReceiver (hermes-gateway)
  → 48kHz stereo PCM
  → audioop downmix + resample to 16kHz mono
  → GeminiVoiceBridge.send_user_pcm
  → Gemini Live WebSocket (audio=Blob, mime=audio/pcm;rate=16000)
  → Gemini responds with 24kHz mono PCM
  → upsample + stereo → GeminiStreamSource → vc.play()
  → Discord voice channel
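
The two resampling hops in the diagram above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the gateway's code: the diagram names audioop, which was removed from the standard library in Python 3.13, so the version below swaps in a simple box-filter average built on the array module. The function names to_16k_mono and to_48k_stereo are hypothetical, and samples are assumed to be signed 16-bit PCM in native byte order.

```python
from array import array


def to_16k_mono(pcm_48k_stereo: bytes) -> bytes:
    """Downmix 48 kHz stereo s16 PCM to 16 kHz mono by averaging 3 stereo frames."""
    usable = len(pcm_48k_stereo) - len(pcm_48k_stereo) % 4  # drop a partial frame
    samples = array("h")
    samples.frombytes(pcm_48k_stereo[:usable])
    frames = len(samples) // 2
    out = array("h")
    for start in range(0, frames - frames % 3, 3):
        # Sum left and right across 3 consecutive frames (6 samples), then average.
        total = sum(samples[2 * start : 2 * start + 6])
        out.append(total // 6)
    return out.tobytes()


def to_48k_stereo(pcm_24k_mono: bytes) -> bytes:
    """Upsample 24 kHz mono s16 PCM to 48 kHz stereo by repeating each sample."""
    usable = len(pcm_24k_mono) - len(pcm_24k_mono) % 2  # drop a partial sample
    samples = array("h")
    samples.frombytes(pcm_24k_mono[:usable])
    out = array("h")
    for sample in samples:
        # Two output frames per input sample, each frame carrying L and R.
        out.extend((sample, sample, sample, sample))
    return out.tobytes()
```

Sample repetition and box averaging are the crudest possible filters; a production path would likely want a proper resampler, but the byte layout at each hop is the same.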

Tool call path:
  Gemini decides ask_hermes is needed → bridge calls runner._handle_gemini_ask(query, uid)
  → builds synthetic MessageEvent with chat_id="voice-ask-<uid>"
  → runner._handle_message(event) runs full Claude+MCP agent
  → returns text → bridge sends FunctionResponse back to Gemini
  → Gemini speaks result
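
The tool call path above can be sketched as a small dispatcher. Everything here is illustrative: ToolCall, voice_chat_id, and dispatch_tool_call are hypothetical names, the ask callable stands in for runner._handle_gemini_ask, and the returned dict only mirrors the id/name/response shape of a function response rather than using the SDK's own types.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class ToolCall:
    """Hypothetical stand-in for one function call emitted by Gemini."""
    call_id: str
    name: str
    args: dict


def voice_chat_id(uid: int) -> str:
    """Synthetic chat id used for voice-originated agent runs."""
    return f"voice-ask-{uid}"


async def dispatch_tool_call(
    call: ToolCall,
    uid: int,
    ask: Callable[[str, int], Awaitable[str]],
) -> dict:
    """Run ask_hermes and shape the result to send back to Gemini."""
    if call.name != "ask_hermes":
        return {"id": call.call_id, "name": call.name,
                "response": {"error": f"unknown tool {call.name}"}}
    text = await ask(call.args.get("query", ""), uid)
    return {"id": call.call_id, "name": call.name, "response": {"result": text}}


async def fake_ask(query: str, uid: int) -> str:
    # Placeholder for the full Claude + MCP agent run.
    return f"{voice_chat_id(uid)}: {query}"


if __name__ == "__main__":
    call = ToolCall("c1", "ask_hermes", {"query": "status?"})
    print(asyncio.run(dispatch_tool_call(call, 42, fake_ask)))
```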

Key files

  • /app/tools/gemini_voice.py — GeminiVoiceBridge class (~250 LOC)
  • /app/gateway/platforms/discord.py — VoiceReceiver forwards PCM to self._gemini_bridge when attached
  • /app/gateway/run.py — _handle_gemini_ask method + startup wiring that attaches adapter._gemini_bridge_factory when GOOGLE_API_KEY is set
  • /home/hermes/hermes-home/.env — stores GOOGLE_API_KEY and GEMINI_MODEL=gemini-2.5-flash-native-audio-latest
  • /app/gateway/run.py — also hosts aiohttp /ask endpoint on :7171 (not used by voice bridge since it's in-process, but handy for external clients)

Current model + voice

  • Model: gemini-2.5-flash-native-audio-latest (only v1beta bidi-capable model with audio input on this key)
  • Voice: Charon (PT-BR)
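  • Auto VAD (manual VAD was tried and reverted at this stage; the later current-state notes list it as an optional mode)
  • Session resumption requested via SessionResumptionConfig(handle=...): the bridge captures session_resumption_update.new_handle from each recv and passes it to the next reconnect. The later current-state notes record that this model never returns a handle.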
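
As a rough illustration, the settings listed above could be expressed as a plain-dict connect config. This is a sketch, not the bridge's actual config: build_live_config is a hypothetical helper, and the field names follow the google-genai LiveConnectConfig layout as understood at the time of writing, so they should be checked against the installed SDK version.

```python
from typing import Optional

MODEL = "gemini-2.5-flash-native-audio-latest"


def build_live_config(resumption_handle: Optional[str] = None) -> dict:
    """Assemble a Live connect config: audio out, Charon voice, resumption handle."""
    return {
        "response_modalities": ["AUDIO"],
        "speech_config": {
            "voice_config": {"prebuilt_voice_config": {"voice_name": "Charon"}}
        },
        # None on first connect; the last captured new_handle on reconnect.
        "session_resumption": {"handle": resumption_handle},
    }
```

Auto VAD is the default behavior, so the sketch adds no explicit VAD settings.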

Known quirks (Gemini Live)

  • Session lifetime of roughly 50 seconds: the server sends go_away and then closes the connection. The bridge catches this and reconnects with the resumption handle, which is meant to let the conversation continue.
  • Keepalive ping timeouts (1011) happen transiently; supervisor auto-reconnects after 2s backoff.
  • Tool response + speak-back may need user silence: if user keeps speaking while Gemini is generating, its response is interrupted.
  • Response audio may be in response.data OR nested in server_content.model_turn.parts[*].inline_data.data — bridge handles both.
  • _handle_gemini_ask triggers a real Hermes agent run, which tries to send its reply to chat_id="voice-ask-<uid>". The Discord adapter fails on this chat_id with "invalid literal for int()". The failure is cosmetic only: the response text still returns to the bridge and goes back to Gemini. It could be suppressed by marking event.source.chat_type differently or by intercepting the send.
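
The "audio in either location" quirk above can be handled with a small duck-typed helper. This is a sketch under assumptions: extract_audio is a hypothetical name, and the response objects are only assumed to expose the attribute paths named in the list (response.data and server_content.model_turn.parts[*].inline_data.data).

```python
from types import SimpleNamespace
from typing import Any


def extract_audio(response: Any) -> bytes:
    """Return PCM bytes from a Live response, checking both known locations."""
    data = getattr(response, "data", None)
    if data:
        return bytes(data)  # top-level payload wins, so nothing is counted twice
    server_content = getattr(response, "server_content", None)
    model_turn = getattr(server_content, "model_turn", None)
    parts = getattr(model_turn, "parts", None) or []
    chunks = []
    for part in parts:
        inline = getattr(part, "inline_data", None)
        payload = getattr(inline, "data", None)
        if payload:
            chunks.append(bytes(payload))
    return b"".join(chunks)


if __name__ == "__main__":
    nested = SimpleNamespace(
        data=None,
        server_content=SimpleNamespace(
            model_turn=SimpleNamespace(
                parts=[SimpleNamespace(inline_data=SimpleNamespace(data=b"\x01\x02"))]
            )
        ),
    )
    print(len(extract_audio(nested)))
```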

Confirmed working

  • Gemini Live session opens + stays connected (with resumption)
  • PCM reaches Gemini (send_user_pcm #1..#N logs)
  • Gemini transcribes + responds with audio (gemini audio out #N bytes=...)
  • ask_hermes tool call fires → Claude + MCPs run → returns text
  • response ready for chat=voice-ask-<uid> shows Claude executed properly

Current state (2026-04-23 after debug session)

Works:

  • Multi-turn conversation within a single ~60s session
  • ask_hermes tool call → Claude+MCPs → audio response back
  • Fast reconnect (FIRST_COMPLETED wait + _SessionClosed sentinel for go_away)
  • Manual VAD via ActivityStart/ActivityEnd with 1.2s silence threshold (optional; auto VAD also works)
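
The manual VAD item above reduces to tracking the time since the last voice packet. A minimal sketch, assuming Discord delivers packets only while the user is speaking: ActivityTracker and its method names are hypothetical, and the returned strings stand in for sending ActivityStart and ActivityEnd to Gemini.

```python
from typing import Optional

SILENCE_THRESHOLD_S = 1.2


class ActivityTracker:
    """Turn packet arrival times into activity_start / activity_end signals."""

    def __init__(self, silence_s: float = SILENCE_THRESHOLD_S) -> None:
        self.silence_s = silence_s
        self.last_voice_at: Optional[float] = None  # None means "not speaking"

    def on_packet(self, now: float) -> Optional[str]:
        """Call for every received voice packet."""
        started = self.last_voice_at is None
        self.last_voice_at = now
        return "activity_start" if started else None

    def on_tick(self, now: float) -> Optional[str]:
        """Call periodically; ends the activity after enough silence."""
        if self.last_voice_at is None:
            return None
        if now - self.last_voice_at >= self.silence_s:
            self.last_voice_at = None
            return "activity_end"
        return None
```

Timestamps are passed in rather than read from a clock so the logic can be exercised without waiting in real time.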

Does NOT work:

  • Session resumption is not supported by gemini-2.5-flash-native-audio-latest — server sends empty session_resumption_update: {} (no new_handle, no resumable). Confirmed via raw response dump. Model limitation, not code bug. transparent=True raises ValueError: transparent parameter is not supported in Gemini API.
  • Consequence: every reconnect (every ~60s) = fresh conversation. User context lost. For longer conversations, need to manually re-inject turn history.
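
One way to approach the manual re-injection mentioned above is a bounded turn buffer that is replayed into each fresh session. This is a sketch only: TurnHistory is a hypothetical class, it assumes text transcripts of both sides are available (which this note does not confirm), and the role/parts layout follows the usual Gemini content shape.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class TurnHistory:
    """Keep the most recent text turns for replay after a reconnect."""

    def __init__(self, max_turns: int = 20) -> None:
        self._turns: Deque[Tuple[str, str]] = deque(maxlen=max_turns)

    def add(self, role: str, text: str) -> None:
        if role not in ("user", "model"):
            raise ValueError(f"unexpected role: {role}")
        if text.strip():
            self._turns.append((role, text.strip()))

    def as_turns(self) -> List[Dict]:
        """Oldest-first list of turns, ready to send as client content."""
        return [{"role": role, "parts": [{"text": text}]} for role, text in self._turns]
```

On reconnect, the list from as_turns() could be sent before audio resumes, for example through the SDK's send_client_content(turns=..., turn_complete=False); whether the native-audio model honors replayed text context is untested here.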

Known issues (older)

  • Session lifetime ~60s: Gemini sends go_away after ~50s then keepalive ping timeout closes it. Supervisor reconnect path has a bug where reconnection sometimes doesn't fire after graceful go_away return from recv_loop. Observed: go_away logged, no Gemini Live session connected after.
  • Session resumption doesn't work: session_resumption_update arrives with resumable=None. Adding transparent=True to SessionResumptionConfig raises ValueError('transparent parameter is not supported in Gemini API'), so the field appears in the docs but is rejected on this API. Without transparent/resumable, each reconnect starts a fresh conversation and loses context.
  • Response latency is unpredictable: responses that did work took anywhere from 3s (good) to 90s (useless). Gemini's auto VAD behaves erratically with Discord's intermittent packet stream.
  • After the ask_hermes tool result was returned to Gemini, Gemini was never observed vocalizing it in this session.

Current state (2026-04-23 end-of-day)

  • GOOGLE_API_KEY commented out in /home/hermes/hermes-home/.env → bridge factory doesn't attach → VoiceReceiver falls back to Whisper+Claude flow.
  • Gemini integration code stays in /app/tools/gemini_voice.py and gateway wiring intact — just dormant until key is restored.
  • Text Hermes fully working (Sonnet 4.6, MCPs Linear/Excalidraw).

Next attempt should consider

  • Switch to OpenAI Realtime (gpt-4o-realtime): 30x more expensive ($0.30/min vs ~$0.01/min) but docs are mature, latency consistently <800ms, no 50s session limit. Same architecture — swap WebSocket provider + tool format, keep ask_hermes path.
  • If staying with Gemini: fix supervisor reconnect after graceful go_away return. Study why transparent=True was rejected (maybe SDK vs API version mismatch). Consider manual session-keepalive ping.
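
For the supervisor fix suggested above, one possible shape is a loop that treats a normal return from the session coroutine as a reconnect signal rather than as completion. This is a sketch, not the existing supervisor: supervise and run_session are hypothetical names, and the 2s backoff applies only to error exits, mirroring the behavior described under the known quirks.

```python
import asyncio
from typing import Awaitable, Callable


async def supervise(
    run_session: Callable[[], Awaitable[None]],
    stop: asyncio.Event,
    backoff_s: float = 2.0,
) -> int:
    """Keep one session alive until stop is set; return how many were started."""
    started = 0
    while not stop.is_set():
        started += 1
        try:
            await run_session()  # assumed to return normally after go_away
            delay = 0.0  # graceful close: reconnect immediately
        except Exception:
            delay = backoff_s  # e.g. keepalive timeout (1011): back off first
        if stop.is_set():
            break
        if delay:
            try:
                # Sleep for the backoff, but wake early if stop is requested.
                await asyncio.wait_for(stop.wait(), timeout=delay)
            except asyncio.TimeoutError:
                pass
    return started
```

If run_session can return instantly (for example when the connect itself is refused without raising), a small minimum delay would be needed to avoid a tight loop.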

Debugging commands

  • Check Gemini recv: grep 'gemini recv\|audio out\|Gemini Live session' /home/hermes/hermes-home/logs/agent.log | tail -20
  • Check VAD flow: grep 'send_user_pcm\|Voice state\|ask_hermes' ... | tail -20
  • Check tool call end-to-end: grep 'ask_hermes\|response ready.*voice-ask' ...

Reverting to Whisper+Claude pipeline

If the Gemini path is broken, unset (or comment out) GOOGLE_API_KEY in /home/hermes/hermes-home/.env and restart. adapter._gemini_bridge_factory then does not attach, _gemini_bridge stays None, and VoiceReceiver falls back to the buffer/silence/Whisper flow, which is less natural but proven.
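
The key-gated fallback described above amounts to a conditional at startup. A minimal sketch of that decision, assuming the environment has already been loaded into a mapping: select_bridge_factory and make_bridge are hypothetical names and do not reflect the actual wiring in /app/gateway/run.py.

```python
from typing import Callable, Mapping, Optional


def select_bridge_factory(
    env: Mapping[str, str],
    make_bridge: Callable[[str, str], object],
) -> Optional[Callable[[], object]]:
    """Return a bridge factory when a key is configured, else None."""
    api_key = env.get("GOOGLE_API_KEY", "").strip()
    if not api_key:
        return None  # no factory: VoiceReceiver keeps the Whisper flow
    model = env.get("GEMINI_MODEL", "gemini-2.5-flash-native-audio-latest")
    return lambda: make_bridge(api_key, model)
```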
