Improve GPU Handling
Problem
We have several different GPU models available in our cluster, but today each user is assigned a GPU at random based purely on availability.
Users are not able to pick a model based on their needs.
Solution proposal 1
Rename the generic resources to reflect the GPU model and rely on Docker Swarm's internal resource matching to assign users to the different "pools" of GPUs.
On each GPU node, change:
/etc/nvidia-container-runtime/config.toml
- swarm-resource = "DOCKER_RESOURCE_GPU"
+ swarm-resource = "DOCKER_RESOURCE_GPU1070"
/etc/docker/daemon.json
"node-generic-resources": [
- "GPU=GPU-3b3da545-ff37-c8cb-f9c9-ca9342d32ef0"
+ "GPU1070=GPU-3b3da545-ff37-c8cb-f9c9-ca9342d32ef0"
]
On the HUB, change:
- spawner.extra_resources_spec["generic_resources"] = {"GPU": 1}
+ spawner.extra_resources_spec["generic_resources"] = {"GPU1070": 1}
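A minimal sketch of how the HUB side of proposal 1 could let users pick a pool, assuming dockerspawner's SwarmSpawner is in use; the GPU_MODELS mapping, the "gpu_model" form field and the hook names are illustrative assumptions, not part of the current setup:

```python
# jupyterhub_config.py (sketch) -- map a user's GPU choice to a renamed
# swarm generic resource. GPU_MODELS and the "gpu_model" form field are
# assumptions for illustration.

GPU_MODELS = {"GPU1070": "GeForce GTX 1070", "GPU2080": "GeForce RTX 2080"}

def gpu_options_form(spawner):
    """Render a dropdown so the user can pick a GPU pool at spawn time."""
    options = "\n".join(
        f'<option value="{name}">{label}</option>'
        for name, label in GPU_MODELS.items()
    )
    return f'<select name="gpu_model">{options}</select>'

def assign_gpu_resource(spawner):
    """Request one generic resource of the chosen model from swarm."""
    # With the default options_from_form, form values arrive as lists.
    model = spawner.user_options.get("gpu_model", ["GPU1070"])[0]
    if model not in GPU_MODELS:
        raise ValueError(f"unknown GPU model {model!r}")
    # Swarm will only place the service on nodes advertising this resource.
    spawner.extra_resources_spec = {"generic_resources": {model: 1}}

c.JupyterHub.spawner_class = "dockerspawner.SwarmSpawner"
c.Spawner.options_form = gpu_options_form
c.Spawner.pre_spawn_hook = assign_gpu_resource
```

The chosen name only has to match the renamed resource in daemon.json; swarm's scheduler then does the actual placement.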
Pros:
- Easy
- Uses Docker Swarm's built-in scheduler to figure out where services should be placed.
Cons:
- Not possible to mix different GPU models on the same host.
- Not possible to share a GPU between users.
- Not possible to pick a specific host machine.
Solution proposal 2
Build a new service that keeps track of GPU assignments (a minimal sketch follows the sequence diagram below).
sequenceDiagram
    participant user
    participant HUB
    participant Inventory Service
    participant HOST1
    Note over Inventory Service,HOST1: Get INFO (GPU_ID1 available)
    user->>HUB: LOGIN REQ GPU1070
    HUB->>Inventory Service: REQ GPU1070
    Inventory Service->>HUB: RESP GPU_ID1
    HUB->>HOST1: Start user service
    HOST1->>user: logged in with GPU_ID1 on HOST1
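To make the Inventory Service more concrete, here is a hedged in-memory sketch of the bookkeeping it would do; the class and method names (GpuInventory, register, acquire, release) are assumptions, and a real service would sit behind an API and persist its state:

```python
# In-memory sketch of the proposed Inventory Service's bookkeeping.
from dataclasses import dataclass, field

@dataclass
class Gpu:
    gpu_id: str                      # NVIDIA UUID, e.g. "GPU-3b3da545-..."
    model: str                       # e.g. "GPU1070"
    host: str                        # swarm node hostname, e.g. "HOST1"
    users: set = field(default_factory=set)

class GpuInventory:
    """Tracks which GPUs exist, where they live, and who is using them."""

    def __init__(self, max_users_per_gpu=1):
        self._gpus = {}
        self.max_users_per_gpu = max_users_per_gpu

    def register(self, gpu_id, model, host):
        """Called when a GPU node reports its inventory (Get INFO above)."""
        self._gpus[gpu_id] = Gpu(gpu_id, model, host)

    def acquire(self, user, model):
        """Return (gpu_id, host) for a GPU of the requested model with a free slot."""
        for gpu in self._gpus.values():
            if gpu.model == model and len(gpu.users) < self.max_users_per_gpu:
                gpu.users.add(user)
                return gpu.gpu_id, gpu.host
        raise RuntimeError(f"no free {model} available")

    def release(self, user):
        """Free every GPU slot held by this user."""
        for gpu in self._gpus.values():
            gpu.users.discard(user)

# Example: register a GPU and hand it to a user.
inventory = GpuInventory(max_users_per_gpu=2)
inventory.register("GPU-3b3da545-ff37-c8cb-f9c9-ca9342d32ef0", "GPU1070", "HOST1")
gpu_id, host = inventory.acquire("alice", "GPU1070")
```

Allowing max_users_per_gpu > 1 is what makes the GPU sharing listed in the pros below possible.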
On the HUB:
- spawner.environment["NVIDIA_VISIBLE_DEVICES"] = None
+ spawner.environment["NVIDIA_VISIBLE_DEVICES"] = GPU_ID # could be multiple or "all"
spawner.extra_placement_spec = {
- "constraints": ["node.role == worker"]
+ "constraints": ["node.role == worker", f"node.hostname == {GPU_HOST}"]
}
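Putting it together, a hedged sketch of HUB-side hooks that ask the inventory service for a GPU and then apply the environment and placement changes above; the service URL, endpoints and response fields are made-up assumptions:

```python
# jupyterhub_config.py (sketch) for proposal 2. INVENTORY_URL and the
# /acquire and /release endpoints are hypothetical.
import requests

INVENTORY_URL = "http://inventory:8080"

def assign_gpu(spawner):
    model = spawner.user_options.get("gpu_model", ["GPU1070"])[0]
    resp = requests.post(
        f"{INVENTORY_URL}/acquire",
        json={"user": spawner.user.name, "model": model},
        timeout=10,
    )
    resp.raise_for_status()
    assignment = resp.json()  # assumed shape: {"gpu_id": "...", "host": "HOST1"}

    # Expose only the assigned GPU(s) inside the container ...
    spawner.environment["NVIDIA_VISIBLE_DEVICES"] = assignment["gpu_id"]
    # ... and pin the service to the host that physically has that GPU.
    spawner.extra_placement_spec = {
        "constraints": [
            "node.role == worker",
            f"node.hostname == {assignment['host']}",
        ]
    }

def release_gpu(spawner):
    """Hand the GPU back when the user's server stops."""
    requests.post(
        f"{INVENTORY_URL}/release",
        json={"user": spawner.user.name},
        timeout=10,
    )

c.Spawner.pre_spawn_hook = assign_gpu
c.Spawner.post_stop_hook = release_gpu
```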
Pros:
- Total control over GPU assignments: a single GPU, multiple GPUs, or all GPUs on one host.
- Can assign the same GPU to more than one user (GPU sharing).
- Can provide an overall health check of the GPU fleet.
- Can be used for collecting usage statistics.
Cons:
- More complex to implement.
- Another moving part to run and maintain.