|
|
|
|
|
While we have some background in creating and running containers (Docker), we had zero experience with administering clouds, container orchestration, microservice architectures, GPU-assisted AI/ML, and JupyterHub/Jupyter notebooks. Of course we had all heard the buzzwords and (thought we) understood the concepts, but as always reality is a lot uglier (and at the same time more straightforward) than the PowerPoint slides explaining it.
|
|
|
|
|
|
# Enough bull, we have a task!
|
|
|
|
|
|
Implement/install a system that lets students and researchers code and share GPU resources in an efficient way, for both AI/machine learning research and "normal" coding.
|
|
|
|
|
|
As previously said, we had been using Jupyter notebooks in the past, so JupyterHub seemed like a natural fit for this task.
|
|
|
|
|
|
We had nothing configured for workload scaling, but we did (and still do) have a VMware cluster up and running (approx. 100 VMs, with room for more) and are fairly competent at managing said cluster (i.e. it works mostly fine). As a start we thought of utilizing this cluster and creating several virtual machines (or a mix of physical and virtual machines) for the different tasks.
|
|
The idea was to run JupyterHub with nginx on one VM and have several "worker" VMs, i.e. one virtual machine per notebook. The GPU we also thought to share, by either buying a really expensive NVIDIA card that allows vGPUs (such as the Pascal-based Tesla P100/P40 or the Volta-based V100) or going for several consumer-grade cards and doing PCI passthrough into the virtual machines (i.e. 1-2 virtual machines on each ESXi host). Yeah, it was maybe a *bit* naive, but you have to start somewhere.
|
|
|
|
|
|
In the end the budget constraints made us go for 4x RTX 2080 Ti, so we could not use the vGPU features.
|
|
|
|
|
|
### Off we went
|
|
Happy as two penguins, we started to buy hardware and install the system (and google extensively).
|
|
Tasks:
|
|
1. Install ESXi with the RTX 2080 Ti and forward it into a virtual machine as described [here](how-to-install-rtx-2080-in-esxi).
|
|
- To make a long story short, in the end we did not go this route, as we did not find a way to assign more than 1 vCPU to the VM that should have GPU access. Instead we opted for a bare metal installation of Ubuntu.
|
|
2. Install JupyterHub and make notebooks scale to multiple machines.
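Step 2 boils down to a Docker Swarm setup: one manager node and a set of workers that the notebook containers get scheduled onto. A minimal sketch of the commands involved (the address and token are placeholders, not our actual values):

```shell
# On the manager node (placeholder address 10.0.0.1): initialize the swarm.
docker swarm init --advertise-addr 10.0.0.1

# The init command prints a join token; on each worker node, run
# something like (token is a placeholder):
docker swarm join --token SWMTKN-1-<token> 10.0.0.1:2377

# Back on the manager: verify that all nodes have joined.
docker node ls
```

Once the swarm is up, services started via the Swarm API can land on any worker, which is exactly the property the notebook spawning relies on.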
|
|
|
|
|
|
Some of the code that you specify in the configuration file runs at the top, i.e. inside the JupyterHub container; some (e.g. which command to run) runs on the worker node; and some runs inside the notebook container. Then there is the whole interaction between docker-py and the Docker Swarm API (which differs from the docker command line client), as well as the difference in capabilities between running a container locally and using the Swarm API. Phew.
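To make the layering concrete, here is a minimal sketch of the relevant part of `jupyterhub_config.py` (the image and network names are assumptions for illustration, not our exact production config):

```python
# jupyterhub_config.py -- minimal sketch; `c` is provided by JupyterHub
# when it loads this file.
c = get_config()  # noqa

# Spawn each notebook as a Docker Swarm service instead of a local process.
c.JupyterHub.spawner_class = 'dockerspawner.SwarmSpawner'

# The single-user notebook image that the workers will run (placeholder).
c.SwarmSpawner.image = 'jupyter/scipy-notebook'

# Hub and notebooks must share an overlay network so they can reach
# each other across nodes (placeholder name).
c.SwarmSpawner.network_name = 'jupyterhub-net'

# The hub must listen on all interfaces, since notebook services
# connect back to it from other nodes in the swarm.
c.JupyterHub.hub_ip = '0.0.0.0'
```

Everything above executes inside the JupyterHub container, but its effect (which image, which command, which network) plays out on whatever worker node the swarm picks.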
|
|
|
|
|
|
Well, anyway, after we had deciphered most of that we got notebooks up and running, but all content that a user saved got lost when they shut off their notebooks!
|
|
Not good, but let's get back to that later and first talk about user authorization.
|
|
|
|
|
|
## Centralized login and GPU access restrictions
|
|
We chose between multiple options, but since we have a GitLab server (which authorizes the students via the central IT system, i.e. LDAP) we reused that as an OAuth provider.
|
|
|
|
|
|
|
|
Luckily (well, we would not have chosen this path if it had not been the case) an OAuthenticator is implemented for JupyterHub, and there is even a GitLab-specific version.
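Wiring the GitLab OAuthenticator into JupyterHub looks roughly like this (the callback URL and the environment variable names holding the secrets are placeholders; for a self-hosted GitLab instance the authenticator reads the instance URL from the `GITLAB_URL` environment variable):

```python
# jupyterhub_config.py -- OAuth against a self-hosted GitLab, sketch only.
import os

c = get_config()  # noqa -- provided by JupyterHub at load time

c.JupyterHub.authenticator_class = 'oauthenticator.gitlab.GitLabOAuthenticator'

# Application id/secret created on the GitLab side when registering
# JupyterHub as an OAuth application (placeholder variable names).
c.GitLabOAuthenticator.client_id = os.environ['GITLAB_CLIENT_ID']
c.GitLabOAuthenticator.client_secret = os.environ['GITLAB_CLIENT_SECRET']

# Must match the redirect URI registered in GitLab (placeholder host).
c.GitLabOAuthenticator.oauth_callback_url = (
    'https://jupyter.example.org/hub/oauth_callback'
)
```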
|
|
|
|
Now, for the OAuth dance to work we need internet access and we need to accept incoming connections, but we do not want everyone to access our JupyterHub directly, so we came up with some [firewall rules](https://git.cs.kau.se/jonakarl/jupyterhub/-/blob/master/host_files/etc/rc.local) to block access.
|
|
|
|
Other than that, it was mostly a matter of configuring GitLab to act as an OAuth provider and generating the needed secrets and keys for that to work.
|
|
|
|
The only caveat is that it will use the GitLab username as the JupyterHub username, and this might differ from what is used in "upstream" authorization sources like LDAP.
|
|
|
|
|
|
|
|
To differentiate between persons that should be able to use the GPU or not, we "abused" GitLab's group feature, so that if a user is part of a specific GitLab group he/she/it will be given different options.
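The group check itself reduces to a small helper that can be called from a spawner hook. A hypothetical sketch (the group name and image names are made up; the real mapping lives in our JupyterHub configuration):

```python
def image_for_user(gitlab_groups, gpu_group="gpu-users"):
    """Pick a notebook image based on the user's GitLab group membership.

    `gitlab_groups` is the list of group names the authenticated user
    belongs to; the group and image names are placeholders.
    """
    if gpu_group in gitlab_groups:
        # Members of the GPU group get a CUDA-enabled image.
        return "our-registry/notebook-gpu:latest"
    # Everyone else gets the plain CPU image.
    return "our-registry/notebook-cpu:latest"
```

A spawner pre-spawn hook can then set the container image (and GPU-related options) from this function's result.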
|
|
|
|
|
|
|
|
## Persistent storage
|
|
|
|
Our users would be very annoyed if their files were lost on every restart of their notebook, so we needed some place to permanently store files and folders.
|
|
|
|
|
|
|
|
Furthermore, since the notebooks ran as containers on arbitrary nodes, we could not map a local folder directly.
|
|
|
|
Once again we were lucky, as Docker volumes can mount NFS shares.
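With the local volume driver this is a one-liner per user; a sketch with placeholder address, export path, and volume name (the spawner configuration does the equivalent programmatically):

```shell
# Create a volume backed by an NFS export. The local driver mounts the
# share on whichever node the container ends up running on.
docker volume create \
  --driver local \
  --opt type=nfs \
  --opt o=addr=10.0.0.1,rw \
  --opt device=:/export/home/alice \
  jupyterhub-alice
```

Note that the mount only actually happens when a container first uses the volume, which is exactly when the failure modes described below bite.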
|
|
|
|
|
|
|
|
So we set up the manager node (where we run JupyterHub) to act as an NFS server. That way we could create the needed home folders for the users on demand when they log in for the first time.
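Creating a home folder on first login is essentially one idempotent `makedirs` call from a hook. A hypothetical sketch (the base path is a placeholder, and a real pre-spawn hook would also have to chown the folder to the uid/gid used inside the notebook container):

```python
import os

def ensure_home(base, username):
    """Create the user's home folder under the NFS export if missing.

    Safe to call on every login: exist_ok makes it a no-op when the
    folder is already there. Ownership handling is omitted here.
    """
    path = os.path.join(base, username)
    os.makedirs(path, exist_ok=True)
    return path
```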
|
|
|
|
|
|
|
|
**All great?**
|
|
|
|
|
|
|
|
Well, it adds an extra dependency on each client (the nfs-client package needs to be installed), and Docker volumes are persistent even if something went wrong when they were created.
|
|
|
|
This means that if anything is wrong when the volume is created (a missing NFS client, a network error, basically anything), the creation of the notebook will fail, and continue to fail for all eternity (until the volume is deleted).
|
|
|
|
It also gives no easy-to-understand error message, so our first line of defense when something is not working is to delete all volumes and try again (usually that works fine).
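The "delete all volumes" reset can be done with a single command, assuming (as in our setup) that nothing else on the node relies on currently-unused volumes:

```shell
# Remove all volumes not attached to any container; they will be
# recreated on the next notebook spawn. -f skips the confirmation prompt.
docker volume prune -f
```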
|
|
|
|
But once everything is configured correctly, the solution works fine.
|
|
|
|
|
|
|
|
## GPU
|
|
|
|
Ooh, how much info there is on the Internet about setting up Docker with NVIDIA and getting GPU support in containers.
|
|
|
|
While some of that information is old and some is incorrect, basically all of it comes without an explanation of why you should set
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|