Inference — local-first, engine-agnostic

The guarantee: Umi runs for anyone, fully local. We can’t promise frontier-level intelligence on every device — that’s hardware. We can promise it works, offline, on whatever you have. That promise drives every decision here.

The through-line: every engine — our managed llama.cpp, a user’s Ollama, a future WebGPU sidecar — reduces to an OpenAI-compatible endpoint behind the InferenceEngine trait. The only variable is who owns its lifecycle. That is what lets us commit hard to llama.cpp today without betting the architecture on it.
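The through-line can be sketched in a few lines of Rust. This is a minimal illustration, not the actual trait: the names InferenceEngine and EngineKind come from this document, but the method set, the two structs, and their fields are assumptions made for the example.

```rust
/// Who owns the engine process. Variant names follow this document.
#[derive(Debug, Clone, PartialEq)]
pub enum EngineKind {
    /// Umi spawned the process and is responsible for restarting it.
    ManagedSidecar,
    /// The user runs the server; Umi only connects to it.
    ExternalServer,
}

/// Hypothetical shape of the trait: every engine reduces to an
/// OpenAI-compatible base URL plus an optional bearer token.
pub trait InferenceEngine {
    fn kind(&self) -> EngineKind;
    fn base_url(&self) -> String;
    fn bearer_token(&self) -> Option<String>;

    /// Shared by all engines: the chat route is built the same way everywhere.
    fn chat_completions_url(&self) -> String {
        format!("{}/v1/chat/completions", self.base_url().trim_end_matches('/'))
    }

    /// The only variable: does the app own the process lifecycle?
    fn app_owns_lifecycle(&self) -> bool {
        self.kind() == EngineKind::ManagedSidecar
    }
}

/// Assumed struct for the managed llama.cpp sidecar.
pub struct ManagedLlamaCpp {
    pub port: u16,
    pub session_token: String,
}

/// Assumed struct for a user-run server such as Ollama.
pub struct ExternalOpenAiServer {
    pub base_url: String,
    pub api_key: Option<String>,
}

impl InferenceEngine for ManagedLlamaCpp {
    fn kind(&self) -> EngineKind { EngineKind::ManagedSidecar }
    fn base_url(&self) -> String { format!("http://127.0.0.1:{}", self.port) }
    fn bearer_token(&self) -> Option<String> { Some(self.session_token.clone()) }
}

impl InferenceEngine for ExternalOpenAiServer {
    fn kind(&self) -> EngineKind { EngineKind::ExternalServer }
    fn base_url(&self) -> String { self.base_url.clone() }
    fn bearer_token(&self) -> Option<String> { self.api_key.clone() }
}
```

Code that holds a `&dyn InferenceEngine` never needs to know which variant it is talking to; only the supervisor asks `app_owns_lifecycle()`.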

Decisions

1. Default engine: llama.cpp. The CPU build runs on anything and sets the floor. GPU backends accelerate when present: Vulkan covers AMD, Intel, and NVIDIA cross-vendor today, with Metal and CUDA alongside it. GGUF is the format new open-weight models are typically converted to at or near release, so models are usually available on day one. It is the only engine that delivers “works for anyone” without caveats.
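One way to express "CPU is the floor, GPU accelerates when present" is an ordered fallback chain that always ends in the CPU build. The sketch below is illustrative only: the Backend and DeviceCaps types and the preference order (CUDA, then Metal, then Vulkan) are assumptions, and real capability detection is out of scope here.

```rust
/// Hypothetical set of llama.cpp builds we might ship.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Backend {
    Cuda,
    Metal,
    Vulkan,
    Cpu,
}

/// Hypothetical result of probing the device at startup.
#[derive(Debug, Clone, Copy, Default)]
pub struct DeviceCaps {
    pub has_cuda: bool,
    pub has_metal: bool,
    pub has_vulkan: bool,
}

/// Returns the backends to try, most preferred first.
/// The CPU build is always last, so the chain is never empty.
pub fn backend_chain(caps: DeviceCaps) -> Vec<Backend> {
    let mut chain = Vec::new();
    if caps.has_cuda {
        chain.push(Backend::Cuda);
    }
    if caps.has_metal {
        chain.push(Backend::Metal);
    }
    if caps.has_vulkan {
        chain.push(Backend::Vulkan);
    }
    chain.push(Backend::Cpu); // the floor: runs on anything
    chain
}
```

If a GPU build fails to start or crashes repeatedly, the caller can move to the next entry; reaching `Backend::Cpu` is the guaranteed last resort.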

2. It runs as a managed sidecar, not in-process. Native inference on diverse consumer hardware will crash and hang. In-process, that kills the whole app; as a supervised sidecar, the app survives and restarts the engine. Keeping the engine as a separate binary also lets us ship and upgrade the CUDA, Vulkan, and CPU builds independently and pick the right one per device. This is the same “swap a leaf, don’t refactor the tree” modularity as silos, and it is exactly the existing EngineKind::ManagedSidecar seam. Start by spawning llama-server (OpenAI-compatible) and reuse the same client as for external servers, so that ManagedSidecar and ExternalServer differ only in who spawned the process. Bind to loopback with a session token (or use a Unix socket); nothing leaves the device.
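A minimal sketch of the spawn side, using only the Rust standard library. The function names, the backoff constants, and the choice to pass the token on the command line are assumptions for illustration. The flags used (--model, --host, --port, --api-key) are llama-server options as we understand them; verify them against the llama.cpp version being shipped.

```rust
use std::net::Ipv4Addr;
use std::path::Path;
use std::process::Command;
use std::time::Duration;

/// Builds (but does not start) a llama-server command bound to loopback
/// and protected by a per-session token.
pub fn llama_server_command(
    binary: &Path,
    model: &Path,
    port: u16,
    session_token: &str,
) -> Command {
    let mut cmd = Command::new(binary);
    cmd.arg("--model").arg(model)
        .arg("--host").arg(Ipv4Addr::LOCALHOST.to_string()) // 127.0.0.1 only
        .arg("--port").arg(port.to_string())
        .arg("--api-key").arg(session_token);
    cmd
}

/// Runs the sidecar until it exits and reports whether it exited cleanly.
/// A crash in the engine surfaces here as `Ok(false)` or `Err(..)`
/// instead of taking the app down with it.
pub fn run_until_exit(mut cmd: Command) -> std::io::Result<bool> {
    let mut child = cmd.spawn()?;
    let status = child.wait()?;
    Ok(status.success())
}

/// Hypothetical restart policy: exponential backoff from 250 ms,
/// capped at 8 s after five consecutive failures.
pub fn restart_delay(consecutive_failures: u32) -> Duration {
    let exponent = consecutive_failures.min(5);
    Duration::from_millis(250 * (1u64 << exponent))
}
```

A supervisor loop would call `run_until_exit`, sleep for `restart_delay(n)` on failure, and reset `n` after a healthy period. Command-line arguments can be visible to other local processes, so a production version would likely hand over the token through a file or the environment instead.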

3. Interop is free: external OpenAI-compatible servers. Power users’ Ollama / LM Studio / vLLM are just an ExternalServer engine — the same HTTP client. This is the “you might get more intelligence if your setup allows” path, at no architectural cost.
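To illustrate why this costs nothing architecturally, the sketch below maps a few well-known servers to their usual local addresses and builds the same chat route for each. The ExternalPreset type and function names are hypothetical. The ports are the defaults these tools commonly use (Ollama 11434, LM Studio 1234, vLLM 8000); users can change them, so treat the values as starting points for a settings field rather than facts about any given machine.

```rust
/// Hypothetical presets offered in a settings screen.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ExternalPreset {
    Ollama,
    LmStudio,
    Vllm,
}

impl ExternalPreset {
    /// Usual default address of each server on the local machine.
    pub fn default_base_url(self) -> &'static str {
        match self {
            ExternalPreset::Ollama => "http://localhost:11434",
            ExternalPreset::LmStudio => "http://localhost:1234",
            ExternalPreset::Vllm => "http://localhost:8000",
        }
    }
}

/// The same route construction serves every preset and the managed
/// sidecar alike; nothing here is specific to one server.
pub fn chat_completions_url(base_url: &str) -> String {
    format!("{}/v1/chat/completions", base_url.trim_end_matches('/'))
}
```

Because only the base URL differs, adding another compatible server is a configuration entry, not new client code.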

4. Models download on first run and cache forever; an offline installer is a variant. A bundled model can’t tailor to hardware and bloats every installer; local-first means runtime independence (data + compute stay local, no cloud to operate), not “ships in the binary.” We detect the device, fetch the right-sized model once, cache it permanently, and never touch the network for inference again. First run without internet is graceful — vault and UI work, inference shows a “fetching your model” state, never a brick. A separate offline installer bundles a tiny floor model for airgapped cases.
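The first-run behaviour described above amounts to a small decision: pick a model tier for the device, then check the cache before anything else. The following is a rough sketch under stated assumptions. The tier names, the RAM thresholds, and the InferenceState type are all hypothetical; the real sizing rules are not specified in this document.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical model sizes, smallest first.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ModelTier {
    Floor,
    Mid,
    Large,
}

/// Picks a tier from coarse device facts. Thresholds are placeholders.
pub fn pick_tier(total_ram_gib: u32, has_gpu: bool) -> ModelTier {
    match (total_ram_gib, has_gpu) {
        (0..=7, _) => ModelTier::Floor,
        (8..=15, _) | (_, false) => ModelTier::Mid,
        _ => ModelTier::Large,
    }
}

/// What the inference panel should show. The vault and the rest of the
/// UI stay usable in either state.
#[derive(Debug, PartialEq)]
pub enum InferenceState {
    /// Model is cached; inference needs no network from here on.
    Ready(PathBuf),
    /// Model is not cached yet; show "fetching your model".
    FetchingModel,
}

/// Checks the cache first. The network is only ever considered on the
/// `FetchingModel` path, never once a cached file exists.
pub fn resolve_state(cache_dir: &Path, file_name: &str) -> InferenceState {
    let path = cache_dir.join(file_name);
    if path.is_file() {
        InferenceState::Ready(path)
    } else {
        InferenceState::FetchingModel
    }
}
```

An offline installer fits the same flow: it simply places the floor model in the cache directory ahead of time, so `resolve_state` returns `Ready` on first launch.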

5. The embedder rides the same sidecar; its dimension is config. One engine for chat and embeddings (/v1/embeddings), fewer moving parts. Default to a ~384-dim floor model that runs anywhere; the SurrealDB vector index is created to match, so a stronger embedder later is a reindex, not a redesign.
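Treating the dimension as configuration could look like the sketch below: one config value drives the index definition, the reindex check, and validation of incoming vectors. The EmbedderConfig type and the table, field, and index names are hypothetical. The generated statement follows the SurrealDB DEFINE INDEX ... HNSW DIMENSION form as we understand it; confirm the exact syntax against the SurrealDB version in use.

```rust
/// Hypothetical embedder settings; `dimension` is the single source of truth.
#[derive(Debug, Clone, PartialEq)]
pub struct EmbedderConfig {
    pub model: String,
    pub dimension: u32,
}

/// Builds the vector index definition from the configured dimension.
/// Table, field, and index names are placeholders.
pub fn define_index_statement(cfg: &EmbedderConfig) -> String {
    format!(
        "DEFINE INDEX chunk_embedding_idx ON TABLE chunk FIELDS embedding HNSW DIMENSION {} DIST COSINE;",
        cfg.dimension
    )
}

/// Swapping in a stronger embedder is a reindex exactly when the
/// dimension recorded for the existing index no longer matches.
pub fn needs_reindex(cfg: &EmbedderConfig, existing_index_dimension: u32) -> bool {
    cfg.dimension != existing_index_dimension
}

/// Rejects vectors whose length disagrees with the config before they
/// reach the database.
pub fn check_vector(cfg: &EmbedderConfig, vector: &[f32]) -> Result<(), String> {
    if vector.len() == cfg.dimension as usize {
        Ok(())
    } else {
        Err(format!("expected {} dims, got {}", cfg.dimension, vector.len()))
    }
}
```

With a 384-dimension floor model, moving to a larger embedder changes one config value; `needs_reindex` then signals that stored vectors must be recomputed and the index rebuilt.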

Deferred, with triggers

These are new InferenceEngine impls — a new file, not a migration — which is the whole payoff of the seam.

WebGPU engine (MLC-LLM or a Rust wgpu engine): build it when Umi targets a browser/PWA, or when per-platform GPU builds become the bottleneck.
NPU / ONNX engine (DirectML, Qualcomm QNN, CoreML, LiteRT / Gemini Nano): build it when a meaningful share of target devices ship NPUs worth using and a runnable model exists.
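As a rough illustration of "a new file, not a migration", the sketch below adds a hypothetical WebGpuEngine without touching the function that consumes engines. A reduced form of the trait is restated so the block compiles on its own; its methods, the struct, and the port are assumptions, and no such engine exists yet.

```rust
/// Reduced, assumed form of the trait for this example only.
pub trait InferenceEngine {
    fn name(&self) -> &str;
    fn base_url(&self) -> String;
}

/// Existing call site: written once against the trait and left unchanged
/// when new engines are added.
pub fn embeddings_url(engine: &dyn InferenceEngine) -> String {
    format!("{}/v1/embeddings", engine.base_url().trim_end_matches('/'))
}

// ---- everything below would live in a new file ----

/// Hypothetical future engine that exposes an OpenAI-compatible endpoint
/// from a WebGPU-backed sidecar.
pub struct WebGpuEngine {
    pub port: u16,
}

impl InferenceEngine for WebGpuEngine {
    fn name(&self) -> &str {
        "webgpu"
    }

    fn base_url(&self) -> String {
        format!("http://127.0.0.1:{}", self.port)
    }
}
```

The same pattern would apply to an NPU or ONNX engine: one struct, one `impl InferenceEngine`, and no edits to callers such as `embeddings_url`.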

Until then, llama.cpp’s Vulkan, Metal, and CUDA backends cover acceleration, and nothing above constrains the future.