Genie Fleet

The orchestration layer that runs every Genie workload on the GPUs you own. Add a machine and it gets work, with zero cloud-side config.

What it is

Fleet is how Genie turns a set of GPU machines into one pool of capacity. Each machine runs an inference engine and reports its status; the cloud is a stateless brain that routes each request to the best machine for the job. To scale, you add a worker.

Concepts

Worker: a GPU machine that serves inference and runs jobs. It heartbeats its status every 30 seconds and loads models on command. It never decides routing or knows about other workers.
Cloud: the central brain, made up of a fleet registry, an inference router, a model planner, and a job queue. Stateless and serverless; it holds no model weights.
Queue: a logical channel a worker subscribes to. The genie queue serves shared workloads; an org can have dedicated workers on its own queue.
Persona routing: work is routed by persona (review, coder, chat…), so the smallest model that solves the task handles it.
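
The concepts above can be sketched as a small routing function. This is an illustration only, not Genie's actual router: the persona-to-model table, the model names, the WorkerStatus fields, and the "least busy worker wins" rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical persona-to-model table; the model names are invented for this sketch.
PERSONA_MODELS = {
    "review": "small-review-model",
    "coder": "medium-coder-model",
    "chat": "small-chat-model",
}


@dataclass
class WorkerStatus:
    """Assumed shape of the status a worker reports in its heartbeat."""
    worker_id: str
    queue: str
    loaded_models: set = field(default_factory=set)
    active_requests: int = 0


def route(persona, queue, workers):
    """Pick a worker on `queue` that already has the persona's model loaded.

    "Best machine" is assumed here to mean the one with the fewest active
    requests. Returns the worker id, or None if no worker qualifies.
    """
    model = PERSONA_MODELS[persona]
    candidates = [
        w for w in workers
        if w.queue == queue and model in w.loaded_models
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda w: w.active_requests).worker_id
```

Note that the worker side stays passive in this sketch: it only reports status, and all routing decisions live in the function the cloud would run.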

Local-first, by contract

Prompts and code run on your machines. Cloud overflow exists only as a capacity valve (for when local queue depth exceeds what your SLA allows), and it is opt-in, never the default. When it does run, overflow uses the same engines and the same open-weights models at the same quantization, never a third-party API. This is the local-first axiom, and it is a contract, not a preference.
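
One way to picture the overflow rule is as a gate with two conditions that must both hold. The sketch below is hypothetical: OverflowPolicy, its field names, and the idea of expressing the SLA as a maximum queue depth are assumptions for illustration, not Genie's configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverflowPolicy:
    """Hypothetical per-org overflow settings."""
    opt_in: bool = False             # overflow is off unless explicitly enabled
    max_local_queue_depth: int = 0   # assumed SLA threshold, in queued requests


def should_overflow(policy, local_queue_depth):
    """Send work to cloud overflow only if the org opted in AND the local
    queue is deeper than the SLA threshold allows."""
    return policy.opt_in and local_queue_depth > policy.max_local_queue_depth
```

With the default policy the function returns False at any queue depth, which mirrors the "never the default" part of the contract.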

Because marginal cost on local hardware is electricity, not per-token markup, steady-state inference for Review and Rollup is effectively free. Cloud is for load spikes and customer-mandated data policy only.

Running your own workers

Install the worker, point it at your org, and start it; it registers, subscribes to its queue, and begins serving. The cloud issues load/unload directives in heartbeat responses; you never hand-place models. See Genie Inference for the API that runs on top of the fleet, and Models for the catalog a fleet can serve.
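
As a rough sketch of the worker side, the function below applies load/unload directives of the kind a heartbeat response might carry. The directive format (a list of dicts with "action" and "model" keys) is an assumption for this example; the actual wire format is not described here.

```python
# Heartbeat interval as stated under Concepts.
HEARTBEAT_INTERVAL_SECONDS = 30


def apply_directives(loaded_models, directives):
    """Return the set of models that should be loaded after applying the
    directives in order.

    `directives` is assumed to be a list of dicts such as
    {"action": "load", "model": "<name>"}; this shape is hypothetical.
    """
    result = set(loaded_models)
    for directive in directives:
        action = directive["action"]
        model = directive["model"]
        if action == "load":
            result.add(model)
        elif action == "unload":
            result.discard(model)  # unloading a model that is absent is a no-op
        else:
            raise ValueError(f"unknown directive action: {action!r}")
    return result
```

A worker loop would call something like this once per heartbeat response, then report the resulting set in its next heartbeat so the cloud's planner sees the new state.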