Serving Models Across 4 GPUs Without Losing Your Mind
I was working on a VM with four GPUs. The kind of setup where you'd think things just work, but they really don't.
The problem
When you load a model onto a GPU, it takes time. If you're loading sequentially across four GPUs, you're sitting there waiting for each one to finish before the next one starts. That's fine if you're deploying something and walking away. But if you're actively developing and restarting things constantly, it gets old fast.
So the obvious thought is: load them in parallel. Just kick off all four at once. Except that causes its own problems. I kept running into corrupted model states on the GPUs. The loading process would step on itself and you'd end up with a model that looked like it loaded fine but would give you garbage outputs. Really fun to debug.
The solution I landed on
What actually worked was treating each GPU as its own isolated server. I wrapped each one in a uvicorn process running a small FastAPI app. So GPU 0 has its own server on port 8001, GPU 1 on 8002, and so on. Each one loads its own model at startup and just sits there waiting for requests.
Then I wrote a central server that sits in front of all of them. This is the only thing your client code talks to. It knows about all four GPU workers and keeps track of which ones are busy, which ones are free, and how long each one has been running its current job.
When a request comes in, the central server checks if any GPU is free. If one is, it forwards the job there. If all of them are busy, the job goes into a queue. A background asyncio task watches the queue and dispatches jobs as GPUs become available.
Why this works well
The key thing is isolation. Each GPU gets its own process, its own memory space, its own model. No weird race conditions during loading. No corrupted state. If one GPU worker crashes, the other three keep going and the central server just stops sending work to the dead one.
The queue is simple. It's just a deque with some bookkeeping. The background task pops jobs off and sends them to the next free GPU. Nothing fancy. I log when each job starts, which GPU it went to, and how long it took. That's enough to spot problems.
# rough idea of the central server
import asyncio
import time
from collections import deque

import httpx
from fastapi import FastAPI

app = FastAPI()

# one worker process per GPU, each on its own port
gpu_workers = {
    0: {"url": "http://localhost:8001", "busy": False, "job_start": None},
    1: {"url": "http://localhost:8002", "busy": False, "job_start": None},
    2: {"url": "http://localhost:8003", "busy": False, "job_start": None},
    3: {"url": "http://localhost:8004", "busy": False, "job_start": None},
}

job_queue = deque()


def get_free_gpu():
    """Return the id of an idle GPU worker, or None if all are busy."""
    for gpu_id, info in gpu_workers.items():
        if not info["busy"]:
            return gpu_id
    return None


async def dispatch_job(gpu_id, job):
    """Forward one job to a GPU worker and resolve the caller's future."""
    gpu_workers[gpu_id]["busy"] = True
    gpu_workers[gpu_id]["job_start"] = time.time()
    try:
        async with httpx.AsyncClient(timeout=300) as client:
            resp = await client.post(
                gpu_workers[gpu_id]["url"] + "/predict",
                json=job["payload"],
            )
        resp.raise_for_status()
        job["future"].set_result(resp.json())
    except Exception as exc:
        # propagate the failure to the waiting client instead of hanging it
        job["future"].set_exception(exc)
    finally:
        elapsed = time.time() - gpu_workers[gpu_id]["job_start"]
        print(f"GPU {gpu_id} finished job in {elapsed:.1f}s")
        gpu_workers[gpu_id]["busy"] = False
        gpu_workers[gpu_id]["job_start"] = None


async def queue_worker():
    """Background task: drain the queue as GPUs become free."""
    while True:
        if job_queue:
            gpu_id = get_free_gpu()
            if gpu_id is not None:
                job = job_queue.popleft()
                asyncio.create_task(dispatch_job(gpu_id, job))
        await asyncio.sleep(0.05)
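The piece the snippet above leaves out is the endpoint clients actually hit: it wraps each request in an asyncio future, enqueues it, and awaits the result, which `dispatch_job` resolves later. Here is a self-contained sketch of just that mechanic, with a stubbed dispatch standing in for the real HTTP call (`submit`, `fake_dispatch`, and `fake_queue_worker` are my names, not from the original server):

```python
import asyncio
from collections import deque

job_queue = deque()


async def submit(payload):
    # wrap the request in a future the queue worker can resolve later
    loop = asyncio.get_running_loop()
    job = {"payload": payload, "future": loop.create_future()}
    job_queue.append(job)
    return await job["future"]


async def fake_dispatch(job):
    # stand-in for the real HTTP POST to a GPU worker
    await asyncio.sleep(0.01)
    job["future"].set_result({"echo": job["payload"]})


async def fake_queue_worker():
    # drain the queue, dispatching each job as it arrives
    while True:
        if job_queue:
            asyncio.create_task(fake_dispatch(job_queue.popleft()))
        await asyncio.sleep(0.005)


async def main():
    worker = asyncio.create_task(fake_queue_worker())
    results = await asyncio.gather(*(submit(i) for i in range(3)))
    worker.cancel()
    return results


print(asyncio.run(main()))  # prints [{'echo': 0}, {'echo': 1}, {'echo': 2}]
```

In the real server, `submit` would be the body of a FastAPI POST handler, so the client's HTTP request simply blocks until its future resolves.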
What I would change
If I were doing this for a production system with more GPUs, I'd probably use something like Ray Serve or NVIDIA's Triton Inference Server. But for a dev setup with four GPUs where you just want to iterate quickly, this is about 100 lines of code and it just works. No dependencies beyond FastAPI, uvicorn, and httpx.
The one thing I'd add is health checks. Right now if a worker dies, the central server figures it out when a request to it fails. It would be better to ping them periodically and remove dead workers from the pool proactively. But for my use case, it was fine.
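Sketched out, the health-check loop would look something like this. The `alive` flag and the `ping` helper are my additions, not part of the original server; in a real version `ping` would be an HTTP GET against each worker's `/health` endpoint.

```python
import asyncio

# same shape as the central server's worker table, plus an "alive" flag
gpu_workers = {
    0: {"url": "http://localhost:8001", "busy": False, "alive": True},
    1: {"url": "http://localhost:8002", "busy": False, "alive": True},
}


async def ping(url):
    # stand-in for a real check, e.g. GET url + "/health" with a short
    # timeout; here we pretend the worker on port 8002 has died
    return not url.endswith("8002")


async def health_check_pass():
    # mark dead workers so the dispatcher skips them
    for info in gpu_workers.values():
        info["alive"] = await ping(info["url"])


async def health_check_loop(interval=10):
    while True:
        await health_check_pass()
        await asyncio.sleep(interval)


asyncio.run(health_check_pass())
print({g: i["alive"] for g, i in gpu_workers.items()})  # prints {0: True, 1: False}
```

With that in place, `get_free_gpu` just needs one extra condition: a worker is eligible only if it is both not busy and alive.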