Genie Fleet
The orchestration layer that runs every Genie workload on the GPUs you own. Add a machine and it gets work, with zero cloud-side config.
What it is
Fleet is how Genie turns a set of GPU machines into one pool of capacity. Each machine runs an inference engine and reports its status; the cloud is a stateless brain that routes each request to the best machine for the job. Scaling is just adding a worker.
Concepts
- Worker: a GPU machine that serves inference and runs jobs. It heartbeats its status every 30 seconds and loads models on command. It never decides routing or knows about other workers.
- Cloud: the central brain, made up of a fleet registry, an inference router, a model planner, and a job queue. It is stateless and serverless, and it holds no model weights.
- Queue: a logical channel a worker subscribes to. genie serves shared workloads; an org can have dedicated workers on its own queue.
- Persona routing: work is routed by persona (review, coder, chat…), so the smallest model that solves the task handles it.
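To make the concepts above concrete, here is a minimal sketch of persona routing over reported worker status. It is an illustration under stated assumptions, not Genie's actual router: the `PERSONA_MODELS` table, the model names, the `WorkerStatus` fields, and the "fewest active requests" tie-break are all hypothetical, and "smallest model that solves the task" is approximated as "smallest listed model that some worker on the queue already has loaded".

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

# Hypothetical persona -> candidate models, ordered smallest first.
# The model names are placeholders, not entries from the Genie catalog.
PERSONA_MODELS = {
    "chat": ["small-chat", "large-chat"],
    "review": ["small-coder", "large-coder"],
}


@dataclass
class WorkerStatus:
    """What a worker might report in a heartbeat (assumed fields)."""
    worker_id: str
    queue: str
    loaded_models: set[str] = field(default_factory=set)
    active_requests: int = 0


def route(persona: str, queue: str,
          fleet: list[WorkerStatus]) -> Optional[tuple[str, str]]:
    """Return (worker_id, model) for the smallest loaded model, or None."""
    for model in PERSONA_MODELS.get(persona, []):
        # Only workers subscribed to this queue with the model loaded qualify.
        candidates = [w for w in fleet
                      if w.queue == queue and model in w.loaded_models]
        if candidates:
            # Assumption: "best machine" means the least busy candidate.
            best = min(candidates, key=lambda w: w.active_requests)
            return best.worker_id, model
    return None


fleet = [
    WorkerStatus("a", "genie", {"large-chat"}, 0),
    WorkerStatus("b", "genie", {"small-chat"}, 3),
    WorkerStatus("c", "genie", {"small-chat"}, 1),
    WorkerStatus("d", "acme", {"small-chat"}, 0),
]
print(route("chat", "genie", fleet))
```

Note that the router only reads worker status; workers never see each other, which matches the division of responsibilities described above.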
Local-first, by contract
Prompts and code run on your machines. Cloud overflow exists only as a capacity valve, used when local queue depth exceeds your SLA, and it is opt-in, never the default. When it does run, overflow uses the same engines and the same open-weights models at the same quantization; it never calls a third-party API. This is the local-first axiom, and it is a contract, not a preference.
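The overflow rule can be read as a two-condition gate: the org has opted in, and local queue depth is past its limit. The sketch below shows that logic only. `OverflowPolicy`, its field names, and the use of a single integer depth threshold to stand in for the SLA are assumptions made for illustration, not Genie configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverflowPolicy:
    """Hypothetical per-org policy; both fields are illustrative."""
    opt_in: bool = False      # overflow is never the default
    max_queue_depth: int = 0  # assumed stand-in for the SLA limit


def placement(local_queue_depth: int, policy: OverflowPolicy) -> str:
    """Decide where a request runs: 'local' unless both conditions hold."""
    if policy.opt_in and local_queue_depth > policy.max_queue_depth:
        return "cloud-overflow"
    return "local"


# Without opt-in, even a very deep queue stays local.
print(placement(100, OverflowPolicy()))
# With opt-in, overflow starts only once depth exceeds the limit.
print(placement(11, OverflowPolicy(opt_in=True, max_queue_depth=10)))
```

Defaulting `opt_in` to `False` keeps the local-first behavior as the path you get when nothing is configured.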
Running your own workers
Install the worker, point it at your org, and start it; it registers, subscribes to its queue, and begins serving. The cloud issues load/unload directives in heartbeat responses; you never hand-place models. See Genie Inference for the API that runs on top of the fleet, and Models for the catalog a fleet can serve.
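The heartbeat exchange described above can be sketched as two pure functions: one that builds the status a worker reports, and one that applies the load/unload directives returned in the response. The message shapes here (`worker_id`, `loaded_models`, `directives`, `action`, `model`) are hypothetical and chosen for illustration; the real wire format is not specified in this page. Only the 30-second interval comes from the worker description.

```python
HEARTBEAT_INTERVAL_S = 30  # workers report status every 30 seconds


def build_heartbeat(worker_id: str, queue: str, loaded: set[str]) -> dict:
    """Build the status payload a worker reports (assumed shape)."""
    return {
        "worker_id": worker_id,
        "queue": queue,
        "loaded_models": sorted(loaded),
    }


def apply_directives(loaded: set[str], response: dict) -> set[str]:
    """Return the new loaded-model set after applying cloud directives."""
    result = set(loaded)  # copy, so the caller's set is not mutated
    for directive in response.get("directives", []):
        if directive["action"] == "load":
            result.add(directive["model"])
        elif directive["action"] == "unload":
            result.discard(directive["model"])
    return result


# Hypothetical response: the cloud swaps one model for another.
response = {
    "directives": [
        {"action": "load", "model": "model-b"},
        {"action": "unload", "model": "model-a"},
    ]
}
loaded = apply_directives({"model-a"}, response)
print(build_heartbeat("worker-1", "genie", loaded))
```

In a real worker these would sit inside a loop that sleeps `HEARTBEAT_INTERVAL_S` between reports. The worker only follows directives, so model placement stays entirely with the cloud planner.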