Improve GPU Handling
Problem
We have several different GPU models available in our cluster, but today each user is assigned a GPU at random based purely on availability.
Users are not able to pick a model based on their needs.
Solution proposal 1
Rename the generic resources to reflect the GPU model and rely on Docker Swarm's internal resource matching to assign users to the different "pools" of GPUs.
On each GPU node, change:
/etc/nvidia-container-runtime/config.toml
- swarm-resource = "DOCKER_RESOURCE_GPU"
+ swarm-resource = "DOCKER_RESOURCE_GPU1070"
/etc/docker/daemon.json
"node-generic-resources": [
- "GPU=GPU-3b3da545-ff37-c8cb-f9c9-ca9342d32ef0"
+ "GPU1070=GPU-3b3da545-ff37-c8cb-f9c9-ca9342d32ef0"
]
On the HUB, change:
- spawner.extra_resources_spec["generic_resources"] = {"GPU": 1}
+ spawner.extra_resources_spec["generic_resources"] = {"GPU1070": 1}
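A minimal sketch of how the HUB side of proposal 1 could let users pick a pool, assuming dockerspawner's SwarmSpawner is in use; the GPU_MODELS mapping, the "gpu_model" form field and the hook names are illustrative assumptions, not part of the current setup:

```python
# jupyterhub_config.py (sketch) -- map a user's GPU choice to a renamed
# swarm generic resource. GPU_MODELS and the "gpu_model" form field are
# assumptions for illustration.

GPU_MODELS = {"GPU1070": "GeForce GTX 1070", "GPU2080": "GeForce RTX 2080"}

def gpu_options_form(spawner):
    """Render a dropdown so the user can pick a GPU pool at spawn time."""
    options = "\n".join(
        f'<option value="{name}">{label}</option>'
        for name, label in GPU_MODELS.items()
    )
    return f'<select name="gpu_model">{options}</select>'

def assign_gpu_resource(spawner):
    """Request one generic resource of the chosen model from swarm."""
    # With the default options_from_form, form values arrive as lists.
    model = spawner.user_options.get("gpu_model", ["GPU1070"])[0]
    if model not in GPU_MODELS:
        raise ValueError(f"unknown GPU model {model!r}")
    # Swarm will only place the service on nodes advertising this resource.
    spawner.extra_resources_spec = {"generic_resources": {model: 1}}

c.JupyterHub.spawner_class = "dockerspawner.SwarmSpawner"
c.Spawner.options_form = gpu_options_form
c.Spawner.pre_spawn_hook = assign_gpu_resource
```

The chosen name only has to match the renamed resource in daemon.json; swarm's scheduler then does the actual placement.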
Pros:
- Easy
- Uses Docker Swarm's built-in scheduler to figure out where services should be placed.
Cons:
- Not possible to mix different GPU models on the same host.
- Not possible to share a GPU between users.
- Not possible to pick a specific host machine.
Solution proposal 2
Build a new service that keeps track of GPU assignments (a minimal sketch follows the sequence diagram below).
sequenceDiagram
    participant user
    participant HUB
    participant Inventory Service
    participant HOST1
    Note over Inventory Service,HOST1: Get INFO (GPU_ID1 available)
    user->>HUB: LOGIN REQ GPU1070
    HUB->>Inventory Service: REQ GPU1070
    Inventory Service->>HUB: RESP GPU_ID1
    HUB->>HOST1: Start user service
    HOST1->>user: logged in with GPU_ID1 on HOST1
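To make the Inventory Service more concrete, here is a hedged in-memory sketch of the bookkeeping it would do; the class and method names (GpuInventory, register, acquire, release) are assumptions, and a real service would sit behind an API and persist its state:

```python
# In-memory sketch of the proposed Inventory Service's bookkeeping.
from dataclasses import dataclass, field

@dataclass
class Gpu:
    gpu_id: str                      # NVIDIA UUID, e.g. "GPU-3b3da545-..."
    model: str                       # e.g. "GPU1070"
    host: str                        # swarm node hostname, e.g. "HOST1"
    users: set = field(default_factory=set)

class GpuInventory:
    """Tracks which GPUs exist, where they live, and who is using them."""

    def __init__(self, max_users_per_gpu=1):
        self._gpus = {}
        self.max_users_per_gpu = max_users_per_gpu

    def register(self, gpu_id, model, host):
        """Called when a GPU node reports its inventory (Get INFO above)."""
        self._gpus[gpu_id] = Gpu(gpu_id, model, host)

    def acquire(self, user, model):
        """Return (gpu_id, host) for a GPU of the requested model with a free slot."""
        for gpu in self._gpus.values():
            if gpu.model == model and len(gpu.users) < self.max_users_per_gpu:
                gpu.users.add(user)
                return gpu.gpu_id, gpu.host
        raise RuntimeError(f"no free {model} available")

    def release(self, user):
        """Free every GPU slot held by this user."""
        for gpu in self._gpus.values():
            gpu.users.discard(user)

# Example: register a GPU and hand it to a user.
inventory = GpuInventory(max_users_per_gpu=2)
inventory.register("GPU-3b3da545-ff37-c8cb-f9c9-ca9342d32ef0", "GPU1070", "HOST1")
gpu_id, host = inventory.acquire("alice", "GPU1070")
```

Allowing max_users_per_gpu > 1 is what makes the GPU sharing listed in the pros below possible.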
On the HUB:
- spawner.environment["NVIDIA_VISIBLE_DEVICES"] = None
+ spawner.environment["NVIDIA_VISIBLE_DEVICES"] = GPU_ID # could be multiple or "all"
spawner.extra_placement_spec = {
- "constraints": ["node.role == worker"]
+ "constraints": ["node.role == worker", f"node.hostname == {GPU_HOST}"]
}
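Putting it together, a hedged sketch of HUB-side hooks that ask the inventory service for a GPU and then apply the environment and placement changes above; the service URL, endpoints and response fields are made-up assumptions:

```python
# jupyterhub_config.py (sketch) for proposal 2. INVENTORY_URL and the
# /acquire and /release endpoints are hypothetical.
import requests

INVENTORY_URL = "http://inventory:8080"

def assign_gpu(spawner):
    model = spawner.user_options.get("gpu_model", ["GPU1070"])[0]
    resp = requests.post(
        f"{INVENTORY_URL}/acquire",
        json={"user": spawner.user.name, "model": model},
        timeout=10,
    )
    resp.raise_for_status()
    assignment = resp.json()  # assumed shape: {"gpu_id": "...", "host": "HOST1"}

    # Expose only the assigned GPU(s) inside the container ...
    spawner.environment["NVIDIA_VISIBLE_DEVICES"] = assignment["gpu_id"]
    # ... and pin the service to the host that physically has that GPU.
    spawner.extra_placement_spec = {
        "constraints": [
            "node.role == worker",
            f"node.hostname == {assignment['host']}",
        ]
    }

def release_gpu(spawner):
    """Hand the GPU back when the user's server stops."""
    requests.post(
        f"{INVENTORY_URL}/release",
        json={"user": spawner.user.name},
        timeout=10,
    )

c.Spawner.pre_spawn_hook = assign_gpu
c.Spawner.post_stop_hook = release_gpu
```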
Pros:
- Total control over GPU assignments: a single GPU, multiple GPUs, or all GPUs on one host.
- Can assign the same GPU to more than one user (GPU sharing).
- Can provide an overall health check of the GPU fleet.
- Can be used for collecting usage statistics.
Cons:
- More complex to implement.
- Another moving part to run and maintain.