Status: Mimic

Spike 4 COMPLETE: AI pipeline LIVE end-to-end (2026-05-14)

Production system is running. A user logs in, records their voice, triggers video generation, and receives a dubbed mp4 in under 1 minute.

S4a Voice transcribe (LIVE)

Whisper base on the scraper VPS (CPU, int8)
/api/voices POST -> auto-transcribe -> voice.status=ready with metadata.refText

S4b LoRA train (PROVEN)

RunPod endpoint iwau73ni3uu456 (A40 48GB Medium Supply)
Worker ghcr.io/kodama1/mimic-worker-lora:latest (diffusers + PEFT + SDXL base)
12 photos -> 500 steps -> 44MB safetensors. ~10min, ~$0.30/avatar.

S4c Video gen V1 (PROVEN)

RunPod endpoint rb6t98cug8ysil (A40 48GB)
Worker ghcr.io/kodama1/mimic-worker-video:latest (torch 2.5.1 + f5-tts 1.1.5 + faster-whisper + yt-dlp + ffmpeg)
Pipeline: yt-dlp -> ffmpeg audio -> Whisper transcribe -> F5-TTS clone -> ffmpeg mux -> upload mp4
30s execution when warm. ~$0.02/video.
Real test: PT-BR TikTok ("dois mil seguidores em sete dias no TikTok Shop") dubbed with a cloned EN voice -> 8.7MB mp4 delivered
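
The download, audio-extract, and mux stages of the pipeline above can be sketched as plain command builders. This is a minimal illustration, not the worker's actual code: the file names, the 16 kHz mono sample format, and the function names are assumptions. Only standard yt-dlp and ffmpeg flags are used.

```python
from pathlib import PurePosixPath


def download_cmd(url: str, out: PurePosixPath) -> list:
    # yt-dlp: -o sets the output file name.
    return ["yt-dlp", "-o", str(out), url]


def extract_audio_cmd(video: PurePosixPath, wav: PurePosixPath) -> list:
    # -vn drops the video stream; -ac 1 and -ar 16000 produce mono 16 kHz
    # audio (an assumed format, chosen here for speech transcription).
    return ["ffmpeg", "-y", "-i", str(video), "-vn",
            "-ac", "1", "-ar", "16000", str(wav)]


def mux_cmd(video: PurePosixPath, dubbed: PurePosixPath,
            out: PurePosixPath) -> list:
    # Keep the original video stream untouched (-c:v copy) and replace the
    # audio track with the dubbed one; -shortest trims to the shorter input.
    return ["ffmpeg", "-y", "-i", str(video), "-i", str(dubbed),
            "-map", "0:v:0", "-map", "1:a:0",
            "-c:v", "copy", "-c:a", "aac", "-shortest", str(out)]


def plan(url: str, workdir: str) -> list:
    # Returns the shell steps in order; transcription and voice cloning
    # happen in-process between extract and mux and are not shown here.
    base = PurePosixPath(workdir)
    source, audio = base / "source.mp4", base / "source.wav"
    dubbed, final = base / "dubbed.wav", base / "final.mp4"
    return [download_cmd(url, source),
            extract_audio_cmd(source, audio),
            mux_cmd(source, dubbed, final)]
```

Each list can be passed to `subprocess.run(cmd, check=True)` so that a failing stage aborts the job instead of producing a silent or truncated mp4.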

Live URLs

Bugs fixed along the way in S4

S3 SDK v3 checksum -> MinIO 403. Fix: requestChecksumCalculation: WHEN_REQUIRED
SigV4 host validation. Fix: dedicated mimic-cdn subdomain with no path rewrite
transformers 4.47 + torch 2.4 schema_infer. Fix: torch 2.5.1
F5-TTS API changed. Fix: F5TTS() no-arg constructor
RunPod 10min default timeout. Fix: bumped to 1800s on the LoRA endpoint
Whisper VAD dropped audio with no speech. Fix: no-VAD fallback + filler
RunPod caches the image on active workers. Fix: manually delete the worker after a release
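
The no-VAD fallback listed above could look roughly like the sketch below. The `transcribe` callable is a hypothetical wrapper (audio path and VAD flag in, transcript text out), and the filler string is a placeholder; the actual filler used by the worker is not recorded in these notes.

```python
from typing import Callable

# Hypothetical wrapper signature: (audio_path, vad_filter) -> transcript text.
Transcriber = Callable[[str, bool], str]

FILLER_TEXT = "..."  # placeholder value, assumed for illustration


def transcribe_with_fallback(transcribe: Transcriber, audio_path: str,
                             filler: str = FILLER_TEXT) -> str:
    # First pass: VAD on, which skips non-speech regions and is the
    # cheaper, cleaner path when the clip contains clear speech.
    text = transcribe(audio_path, True).strip()
    if text:
        return text
    # Second pass: VAD off, in case the filter discarded everything.
    text = transcribe(audio_path, False).strip()
    # Last resort: return filler so downstream TTS always has some input.
    return text or filler
```

With faster-whisper, the wrapper would typically pass the flag through as `vad_filter=` on `WhisperModel.transcribe` and join the returned segment texts.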

Next steps (S5)

Real face swap (Roop/ReActor + LoRA inference): output with the user's face swapped in
Stripe billing + credits
Consent onboarding (selfie + code)
Invisible anti-misuse watermark

Spike 4a: Voice transcribe (2026-05-13)

Delivered

mimic-scraper gained faster-whisper (CPU, int8) and a POST /transcribe endpoint
WHISPER_MODEL=base configured on the VPS (~145MB, good quality in PT-BR and EN)
/api/voices POST now transcribes the audio during upload and:
- Sets voice.status='ready' if the transcription succeeds
- Saves metadata.refText, metadata.language, metadata.transcribedAt
Voice page shows a transcription preview + ready/pending badge
mimic_whisper_cache volume avoids re-downloading the model when containers are recreated
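
A rough sketch of the transcribe-then-mark-ready flow described above. The faster-whisper calls (`WhisperModel`, `transcribe`, `info.language`) are the library's public API; the function names, the returned dict shape, and the choice to leave the status as "pending" on an empty transcript are assumptions for illustration.

```python
from datetime import datetime, timezone


def transcribe_file(path: str, model_name: str = "base"):
    # Imported lazily so this module loads even where faster-whisper
    # is not installed (for example in the web app).
    from faster_whisper import WhisperModel

    model = WhisperModel(model_name, device="cpu", compute_type="int8")
    segments, info = model.transcribe(path)
    # segments is a generator; joining it runs the actual decoding.
    text = " ".join(seg.text.strip() for seg in segments)
    return text, info.language


def build_voice_update(text: str, language: str, now: datetime) -> dict:
    # Assumption: an empty transcript keeps the voice in "pending".
    cleaned = text.strip()
    if not cleaned:
        return {"status": "pending", "metadata": {}}
    return {
        "status": "ready",
        "metadata": {
            "refText": cleaned,
            "language": language,
            "transcribedAt": now.isoformat(),
        },
    }
```

Keeping `build_voice_update` free of I/O makes the ready/pending decision easy to unit test without loading the model.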

Tested live

Audio: espeak-ng "hello world this is a test of mimic voice transcription"
POST /api/voices -> {
  voice: { status: "ready", durationSeconds: 3, refAudioPath: "voices/.../wN805ZX9lmp7.wav" },
  transcription: { text: "Hello world, this is a test of mimic voice transcription.",
                   language: "en", duration: 3.49 }
}

Spike 4b: LoRA training (in progress, 2026-05-13)

Delivered (code)

apps/worker-lora/: RunPod serverless handler in Python
- SDXL base + PEFT LoRA (rank 16, ~1000 steps, ~10-15min on A100 80GB)
- Downloads photos from MinIO (public URLs), trains, uploads the .safetensors via signed PUT
- Callback POST /api/webhooks/runpod with a validated Bearer token
/api/avatars/[id]/train:
- Generates a per-job callback_token (nanoid 32)
- Generates a signed PUT URL via presignPut (rewrites the internal MinIO host -> public proxy via STORAGE_PUBLIC_ENDPOINT)
- Dispatches via dispatch(), which calls the real RunPod /run if RUNPOD_API_KEY is set, otherwise a stub
- Stores metadata.callbackToken/triggerWord/loraKey/runpodJobId
/api/webhooks/runpod:
- Validates Bearer == metadata.callbackToken
- Updates avatar.status (ready/failed), loraPath, readyAt
Nginx /storage/ accepts PUT up to 500MB (LoRA upload)
GitHub Actions: new workflow job worker-lora -> ghcr.io/kodama1/mimic-worker-lora:latest
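
The per-job callback token check used by /api/webhooks/runpod can be illustrated as follows. This is a sketch in Python rather than the app's own code: `secrets.token_urlsafe(24)` stands in for nanoid(32) (both yield 32 URL-safe characters), and the function names are hypothetical.

```python
import hmac
import secrets
from typing import Optional


def new_callback_token() -> str:
    # 24 random bytes encode to exactly 32 URL-safe base64 characters,
    # used here as a stand-in for nanoid(32).
    return secrets.token_urlsafe(24)


def bearer_matches(authorization_header: Optional[str],
                   expected_token: str) -> bool:
    # Accepts only "Bearer <token>" and compares in constant time so the
    # check does not leak how many leading characters matched.
    if not authorization_header or not expected_token:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return False
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

Because the expected token is read from the avatar's own metadata, a leaked token can only forge the callback for that single job.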

Pending

CI is building the GPU image now (gh run 25769742978). The image should end up at ~5-8GB
USER: create a RunPod Serverless endpoint pointing to ghcr.io/kodama1/mimic-worker-lora:latest
- GPU: A100 80GB or H100 80GB
- Min workers: 0, max: 1-2
- Get the endpoint ID and save it in the VPS .env as RUNPOD_LORA_ENDPOINT
Afterwards: test with a real avatar

VPS container status

mimic-postgres   healthy (9 tables)
mimic-redis      up
mimic-minio      up (bucket "mimic", anonymous download)
mimic-scraper    healthy (whisper base loaded)
mimic-api        up
mimic-web        up (Better-Auth + bearer + RunPod dispatch real)

Pending on the user

GoDaddy DNS:
- mimic.kodama.solutions -> 187.127.24.217
- mimic-api.kodama.solutions -> 187.127.24.217

After DNS propagates:

ssh root@187.127.24.217 'bash /home/mimic/src/scripts/issue-cert.sh mimic-api.kodama.solutions'
ssh root@187.127.24.217 'bash /home/mimic/src/scripts/issue-cert.sh mimic.kodama.solutions'

Create the RunPod Serverless endpoint:
- https://www.runpod.io/console/serverless
- Template: Custom -> Container image ghcr.io/kodama1/mimic-worker-lora:latest
- GPU: A100 80GB or H100 80GB
- Container disk: 20GB+, Network volume optional (20GB for HF cache)
- Min: 0, Max: 1-2, Idle timeout: 5s
- Get the endpoint ID and add it to .env:
```
RUNPOD_LORA_ENDPOINT=<id>
```
- Restart: cd /home/mimic/src && docker compose -f infra/compose/docker-compose.vps.yml --env-file .env up -d mimic-web
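
Once RUNPOD_LORA_ENDPOINT and RUNPOD_API_KEY are set, a dispatch with stub fallback might look like the sketch below. The `https://api.runpod.ai/v2/<endpoint>/run` URL and the `{"input": ...}` body follow RunPod's serverless API; the stub return shape and the function names are assumptions, and the real dispatch() lives in the web app rather than in Python.

```python
import json
import os
import urllib.request


def build_run_request(endpoint_id: str, api_key: str,
                      job_input: dict) -> urllib.request.Request:
    # RunPod serverless expects the job payload under the "input" key.
    url = f"https://api.runpod.ai/v2/{endpoint_id}/run"
    body = json.dumps({"input": job_input}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"})


def dispatch(job_input: dict, env=os.environ) -> dict:
    api_key = env.get("RUNPOD_API_KEY")
    endpoint_id = env.get("RUNPOD_LORA_ENDPOINT")
    if not api_key or not endpoint_id:
        # Hypothetical stub shape so local development works without a GPU.
        return {"id": "stub", "status": "STUBBED"}
    request = build_run_request(endpoint_id, api_key, job_input)
    with urllib.request.urlopen(request, timeout=30) as response:
        # On success RunPod returns the job id and its queue status.
        return json.load(response)
```

Passing `env` explicitly keeps the stub path testable without touching the process environment.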

Next steps (S4c)

Video gen worker:

Consider 2 paths:
- Premium: MimicMotion / AnimateAnyone (pose-driven, high quality, GPU-intensive)
- Shortcut: simple face swap + lip-sync (Roop + MuseTalk, low cost)
Pipeline: TikTok URL -> Demucs (separates voice/music) -> Whisper (transcription) -> F5-TTS (TTS with the cloned voice) -> motion extract -> avatar swap -> lip sync -> final compose
POST /api/jobs creates a record and dispatches the worker
UI: /jobs/[id] page with step-by-step progress

New operational decisions (S4)

Whisper runs on CPU on the scraper VPS (faster-whisper, int8, ~5x realtime). No GPU needed to transcribe reference audio.
LoRA training uses SDXL base (open license, commercial use OK). Flux would be better but has restrictive clauses.
Hyperparams: rank=16, steps=1000, resolution=1024, lr=1e-4, batch=1. Conservative, avoids overfitting on 10-50 photos.
Trigger word mimic<id4>: avoids collisions with existing CLIP tokens.
Callback validated by a per-job token (nanoid 32 in metadata.callbackToken), not a global secret. Reduces the blast radius if it leaks.
Signed PUT URL with a 4h TTL (training can take a while). MinIO V4 signature works with the public endpoint via host rewrite.
worker-lora image goes to GHCR via Actions (kodama1 org). Workflow triggers on path apps/worker-lora/** or manually.
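
The hyperparameters and trigger word above could be captured in a small config object. This is an illustrative sketch: the class and field names are hypothetical, and deriving `<id4>` from the first four alphanumeric characters of the avatar ID is an assumption, since the notes do not say how the suffix is chosen.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LoraTrainConfig:
    # Defaults mirror the values recorded in the decisions above.
    rank: int = 16
    steps: int = 1000
    resolution: int = 1024
    learning_rate: float = 1e-4
    batch_size: int = 1


def trigger_word(avatar_id: str) -> str:
    # Assumption: <id4> is the first 4 alphanumeric chars, lowercased.
    suffix = "".join(ch for ch in avatar_id.lower() if ch.isalnum())[:4]
    if len(suffix) < 4:
        raise ValueError("avatar id needs at least 4 alphanumeric characters")
    return f"mimic{suffix}"


def job_input(avatar_id: str,
              config: LoraTrainConfig = LoraTrainConfig()) -> dict:
    # Flattens the config into a JSON-serializable payload for the worker.
    payload = asdict(config)
    payload["trigger_word"] = trigger_word(avatar_id)
    return payload
```

A frozen dataclass keeps the training recipe immutable per job, so the values stored alongside the avatar always match what the worker actually ran.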

Expected costs (once RunPod is active)

Operation	GPU	Time	Cost
Voice transcribe	CPU VPS	~3-10s	$0 (self-hosted)
LoRA train (avatar)	A100 80GB	10-15min	$0.40-0.60
F5-TTS inference (audio gen)	L4/T4	5-10s/100chars	$0.005
MimicMotion video (S4c)	A100 80GB	3-5min	$0.20-0.30

Avatar onboarding total: ~$0.50
Video gen end-to-end: ~$0.30-0.50
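
As a sanity check on the table, per-job GPU cost is the hourly rate times minutes divided by 60, and the same relation can be inverted to see which hourly rate each row implies. The helper names below are hypothetical and the rates are derived only from the figures above, not from RunPod's price list.

```python
def job_cost(rate_per_hour: float, minutes: float) -> float:
    # Cost of one job billed per unit of GPU time.
    return rate_per_hour * minutes / 60.0


def implied_hourly_rate(cost: float, minutes: float) -> float:
    # Inverse of job_cost: the hourly rate a (cost, duration) pair implies.
    return cost * 60.0 / minutes


# (label, minutes, cost in USD) taken from the table's range endpoints.
ROWS = [
    ("LoRA train, low", 10, 0.40),
    ("LoRA train, high", 15, 0.60),
    ("MimicMotion, low", 3, 0.20),
    ("MimicMotion, high", 5, 0.30),
]

if __name__ == "__main__":
    for label, minutes, cost in ROWS:
        rate = implied_hourly_rate(cost, minutes)
        print(f"{label}: ${rate:.2f}/h")
```

The LoRA row works out to about $2.40/h at both ends, while the MimicMotion row implies roughly $3.60-4.00/h on the same GPU class, so the two estimates appear to assume different rates and are worth re-checking against real invoices.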