## Prerequisites
| Requirement | Minimum | Recommended |
|---|---|---|
| Python | 3.11+ | 3.12 |
| RAM | 8 GB (small models) | 16 GB (medium models) |
| Disk | 10 GB free | 25 GB free |
| OS | macOS 13+, Ubuntu 22.04+, Windows 11 | Apple Silicon or CUDA GPU |
## Tested Combinations

These combos are regularly tested by the team. Mix and match to fit your hardware budget.

| Component | Provider | Model | RAM | Speed |
|---|---|---|---|---|
| LLM | Ollama | llama3:8b | 6 GB | Good |
| LLM | Ollama | qwen2:7b | 5 GB | Good |
| Vision | Ollama | llava:7b | 5 GB | OK |
| STT | faster-whisper | base | 1 GB | Fast |
| STT | faster-whisper | small | 2 GB | Good |
| TTS | Piper | en_US-lessac-medium | 50 MB | Fast |
RAM figures are approximate peak usage. When running LLM + STT + TTS
simultaneously, add the individual figures together: for example, `llama3:8b`
(6 GB) + `faster-whisper small` (2 GB) + Piper (~50 MB) comes to roughly 8 GB peak.
## Setup
First, install Ollama — download it from ollama.com or use your package manager. Then configure FERAL's providers via environment variables:

```shell
export FERAL_LLM_PROVIDER=ollama
export FERAL_LLM_MODEL=llama3:8b
export FERAL_STT_PROVIDER=local
export FERAL_TTS_PROVIDER=local

# Optional — vision
export FERAL_VISION_PROVIDER=ollama
export FERAL_VISION_MODEL=llava:7b

# Optional — point to a non-default Ollama host
# export OLLAMA_HOST=http://localhost:11434
```
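With Ollama installed, the models from the tested-combinations table above can be pulled ahead of time so the first run doesn't block on a download. A sketch, using the standard `ollama pull` command:

```shell
# One-time downloads (several GB each); model names from the table above
ollama pull llama3:8b
ollama pull llava:7b   # only needed if you enabled the vision provider
```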
FERAL will connect to the local Ollama instance and use on-device STT/TTS.
You can verify the active providers in the startup banner.
## Troubleshooting
### "Connection refused" from Ollama

Make sure `ollama serve` is running. By default it listens on
`http://localhost:11434`. Check with:
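One quick check (assuming the default host and port) is to hit Ollama's `/api/tags` endpoint, which lists the installed models:

```shell
# Returns a JSON model list if the server is up;
# "connection refused" here means ollama serve is not running.
curl http://localhost:11434/api/tags
```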
### Slow first response

The first inference after pulling a model is slow because Ollama loads the weights into memory; subsequent calls are fast. You can pre-warm the model with a trivial prompt, for example `ollama run llama3:8b "hi"`.

### Out of memory
If your machine runs out of RAM:

- Switch to a smaller model (`qwen2:7b` uses ~1 GB less than `llama3:8b`).
- Close other heavy applications.
- On Linux, increase swap:
  `sudo fallocate -l 8G /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile`.
### STT not detecting speech

Ensure your microphone is accessible and that the correct input device is selected.

## Performance Tips
Choose models based on your hardware:

- 8 GB RAM (no GPU): `qwen2:7b` + `faster-whisper base` + Piper — expect 2–4 s response times.
- 16 GB RAM (no GPU): `llama3:8b` + `faster-whisper small` + Piper — expect 1–3 s response times.
- Apple Silicon (M1+): Ollama uses Metal acceleration automatically. `llama3:8b` runs at ~30 tokens/s on M2.
- NVIDIA GPU (8 GB+ VRAM): Ollama detects CUDA automatically. Expect 40–80 tokens/s depending on model and GPU.
- Keep Ollama running between sessions to avoid cold-start latency.
- Use `faster-whisper base` unless you need higher accuracy — `small` is 2× slower for a modest accuracy gain.
- Piper TTS is CPU-only and extremely fast; it won't bottleneck your setup.
- If you run FERAL on a headless server, disable TTS with `FERAL_TTS_PROVIDER=none` to save resources.
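Putting the low-RAM and headless tips together, a minimal 8 GB, no-GPU, headless configuration might look like this sketch (variable names as in the Setup section; values from the tested-combinations table):

```shell
# Minimal 8 GB / headless profile
export FERAL_LLM_PROVIDER=ollama
export FERAL_LLM_MODEL=qwen2:7b   # ~1 GB lighter than llama3:8b
export FERAL_STT_PROVIDER=local   # faster-whisper base
export FERAL_TTS_PROVIDER=none    # headless: no speech output
```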
