|
|
# Background
|
|
|
## Who are we
|
|
|
We are three coders/engineers/researchers working at the CS department at Karlstad University. Our main tasks are to help and support the researchers, teachers and students with the practical/technical things required for their day-to-day experiments, classes etc. to run smoothly.
|
|
|
For this task the main drivers have been @tobivehk and @jonakarl, while @moharaji has acted as a reality check/moderator of our wild ideas.
|
|
|
|
|
|
|
|
|
## We got a task!
|
|
|
Implement/install a system that lets students and researchers code and share/use GPU resources in an efficient way, both for AI/machine learning research and for "normal" coding.
|
|
|
|
|
|
|
|
|
## Why this blogpost
|
|
|
To document the ideas and decisions (good and bad) we made on the way to the system we currently use.
|
|
|
|
|
|
|
|
|
## What we have done in the past
|
|
|
We have previously used jupyter notebooks on a single semi-large server for data mining and python coding.
|
|
|
While it worked for a class of 30 students in one or two classes each year, the administrative burden of manually creating users, killing misbehaving programs etc. was not really scalable.
|
|
|
Furthermore there was no separation between users, and since we ran on one server there was little possibility to scale out to multiple GPU machines.
|
|
|
|
|
|
|
|
|
|
|
|
## What we have NOT done in the past (ie what we knew nothing about at the start of this)
|
|
|
While we have some background in creating/running containers (docker), we had zero experience with administering clouds, container orchestration, microservice architectures, GPU-assisted AI/ML, and jupyterhub/jupyter notebooks. Of course we had all heard the buzzwords and (thought we) understood the concepts, but as always reality is a lot uglier (and at the same time more straightforward) than the powerpoint slides explaining it.
|
|
|
|
|
|
# The vision
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Create a load-balanced working environment for students and staff where they can run labs/experiments, with their data saved between sessions.
|
|
|
Selected users should be able to share a restricted set of GPUs for running their experiments (e.g. machine learning/AI classes).
|
|
|
|
|
|
# The goal
|
|
|
|
|
|
|
|
|
Set up a jupyterhub environment with central authorization and the possibility to assign GPU capabilities to selected users.
|
|
|
- Users should be able to save the assignment/file they are working on, and it should persist between sessions
|
|
|
- Users should be able to use a centralized user account/password (ie LDAP/SAML/OAuth)
|
|
|
# How we did it
|
|
|
We broke the issue down into multiple steps, as we describe below.
|
|
|
|
|
|
## Multi user working environment
|
|
|
|
|
|
|
|
|
As previously mentioned, we have used jupyter notebooks in the past, so jupyterhub seemed like a natural fit for this task.
|
|
|
|
|
|
We had nothing configured for workload scaling, but we did (and still do) have a vmware cluster up and running (approx 100 vms, with room for more) and are fairly competent at managing said cluster (ie it works mostly fine). As a start we thought of utilizing this cluster and creating several virtual machines (or a mix of physical and virtual machines) for the different tasks.
|
|
|
The idea was to run jupyterhub with nginx on one vm and have several "worker" vms, ie one virtual machine per notebook. The GPUs we thought to share by either buying a really expensive nvidia card that allows vgpus (such as a Tesla V100, P100 or P40) or going for several consumer grade cards and doing pci passthrough into the virtual machines (ie 1-2 virtual machines on each esxi host). Yeah, it was a *bit* naive, but you have to start somewhere.
|
|
|
|
|
|
In the end the budget constraints made us go for 4x RTX 2080 Ti, so we could not use the vgpu features.
|
|
|
|
|
|
### Off we went
|
|
|
Happy as penguins, we started to buy HW and install the system (and google extensively).
|
|
|
Tasks:
|
|
|
|
|
|
1. Install esxi with an RTX 2080 Ti and forward it into a virtual machine, described [here](how-to-install-rtx-2080-in-esxi).
|
|
|
   - To make a long story short, in the end we did not go this route, as we did not find a way to assign more than 1 vCPU to a VM with GPU access. Instead we opted for a bare metal installation of Ubuntu.
|
|
|
2. Install jupyterhub and make notebooks scale to multiple machines.
|
|
|
|
|
|
#### How to install a multi-user jupyterhub that scales
|
|
|
|
|
|
##### 1. [The Littlest JupyterHub](https://tljh.jupyter.org/en/latest/)
|
|
|
The simplest installation of jupyterhub is the superb script/package [The Littlest JupyterHub](https://tljh.jupyter.org/en/latest/). While it supports multiple users, it runs on a single server and does not scale out, so it did not fit our use case well.
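For completeness (this snippet is from the TLJH docs, not our own setup; the admin user name is a placeholder), part of TLJH's appeal is that the whole installation is a single bootstrap command on a fresh Ubuntu server:

```shell
# One-line TLJH install from the official docs, run on a fresh Ubuntu server.
# "adminuser" is a placeholder for the first admin account.
curl -L https://tljh.jupyter.org/bootstrap.py \
  | sudo -E python3 - --admin adminuser
```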
|
|
|
|
|
|
##### 2. [Jupyter Enterprise Gateway](https://jupyter-enterprise-gateway.readthedocs.io/en/latest/)
|
|
|
We next tried a manually installed jupyterhub together with [Jupyter Enterprise Gateway](https://jupyter-enterprise-gateway.readthedocs.io/en/latest/).
|
|
|
|
|
|
[Jupyter Enterprise Gateway](https://jupyter-enterprise-gateway.readthedocs.io/en/latest/) is a very nice piece of code that separates the notebook UI from the running kernels, and it seemed to fit our task very well. However, probably due to our complete lack of understanding of how jupyterhub works, we never got this setup to fly. It did put kubernetes and docker swarm on our radar, though, which brought container orchestration into our view.
|
|
|
|
|
|
##### 3. [Zero to JupyterHub with Kubernetes](https://zero-to-jupyterhub.readthedocs.io/en/latest/)
|
|
|
We worked with this perfectly fine tutorial for setting up jupyterhub on kubernetes for a long time. I even contacted the author, and he promised to help us out, so all kudos to him.
|
|
|
However, this setup did not fly for us in the end either. We think we failed mostly due to two things:
|
|
|
1. We had a mindset that the jupyterhub should be a standalone vm (something permanent) and that the workers should be ephemeral and spawned off into the cluster.
|
|
|
2. The tutorial is geared mostly towards public kubernetes offerings (aka someone else supports the HW and the underlying cluster).
|
|
|
   - In our use case we need to handle everything from installing/supporting the HW and kubernetes to installing and supporting jupyterhub itself.
|
|
|
   - We actually started to build our own kubernetes cluster, but in the end we decided the complexity was too high (learning from scratch to administer kubernetes, helm and jupyterhub).
|
|
|
|
|
|
In hindsight this (jupyterhub on kubernetes) is probably the most flexible solution and would solve some of the "bugs"/problems we discovered, but at the time we thought kubernetes was overkill.
|
|
|
|
|
|
##### 4. [SwarmSpawner](https://jupyterhub-dockerspawner.readthedocs.io/en/latest/) (the solution we ended up using)
|
|
|
While [Jupyter Enterprise Gateway](https://jupyter-enterprise-gateway.readthedocs.io/en/latest/) pointed us to containerization, [Zero to JupyterHub with Kubernetes](https://zero-to-jupyterhub.readthedocs.io/en/latest/) made us more cluster savvy and led us to a simpler cluster solution that is actually built into docker itself: [docker swarm](https://docs.docker.com/engine/swarm/) together with [DockerSpawner](https://github.com/jupyterhub/dockerspawner).
|
|
|
|
|
|
In comparison to kubernetes, docker swarm is a breeze to install and there is basically no maintenance.
|
|
|
Installation:

1. Install docker.

2. On a manager, run `docker swarm init --advertise-addr IP-OF-SWARM-FACING-INTERFACE`.

3. On a worker, run `docker swarm join --token TOKEN-FROM-INIT-COMMAND IP-OF-SWARM-FACING-INTERFACE:2377`.

4. Repeat steps 1 and 3 on all workers.
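As a sanity check (an aside from us editing this post, not a required step), standard docker commands show whether the swarm actually formed:

```shell
# Run on the manager: lists every node that has joined the swarm;
# the manager appears with MANAGER STATUS "Leader".
docker node ls

# Run on any node: reports the swarm state of the local docker engine
# ("active" once the node has joined a swarm).
docker info --format '{{ .Swarm.LocalNodeState }}'
```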
|
|
|
|
|
|
Voilà, your cluster is up and working; to upgrade it, just upgrade your docker installation via a normal `apt update && apt upgrade`.
|
|
|
|
|
|
Similarly, [SwarmSpawner (code)](https://github.com/jupyterhub/dockerspawner) is an extension of [DockerSpawner](https://github.com/jupyterhub/dockerspawner) (same repo). We could therefore start by getting DockerSpawner to spawn our custom notebooks (yay) and later move on to [SwarmSpawner (doc)](https://jupyterhub-dockerspawner.readthedocs.io/en/latest/spawner-types.html#swarmspawner).
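To illustrate (a minimal sketch, not our exact production config; the image name, network name and volume paths are placeholder assumptions), pointing jupyterhub at SwarmSpawner boils down to a few lines in `jupyterhub_config.py`:

```python
# jupyterhub_config.py -- minimal SwarmSpawner sketch (placeholder values).

# Spawn each user's notebook server as a docker swarm service:
c.JupyterHub.spawner_class = 'dockerspawner.SwarmSpawner'

# Notebook image the swarm services run (placeholder name):
c.SwarmSpawner.image = 'registry.example.org/custom-notebook:latest'

# Overlay network shared by the hub container and the notebook services:
c.SwarmSpawner.network_name = 'jupyterhub-net'

# Persist each user's work between sessions in a named volume;
# {username} is expanded per user by dockerspawner:
c.SwarmSpawner.notebook_dir = '/home/jovyan/work'
c.SwarmSpawner.volumes = {'jupyterhub-user-{username}': '/home/jovyan/work'}
```

Swapping `SwarmSpawner` for `DockerSpawner` in the spawner class gives the single-host variant, which is how the notebook images can be tested before moving to the swarm.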
|
|
|
|
|
|
|
|
|
|
|
|
## Centralized login and GPU access restrictions
|
|
|
We chose between multiple options, but since we have a gitlab server (which authorizes the students via the central IT system, ie ldap) we reused it as an oauth provider.
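A sketch of that wiring, assuming the `oauthenticator` package and placeholder host names (our real URLs and secrets differ):

```python
# jupyterhub_config.py -- GitLab OAuth sketch with placeholder values.
import os

# Point oauthenticator at a self-hosted gitlab instead of gitlab.com:
os.environ.setdefault('GITLAB_URL', 'https://gitlab.example.org')

c.JupyterHub.authenticator_class = 'oauthenticator.gitlab.GitLabOAuthenticator'

# Client id/secret come from an application registered in gitlab;
# injected via environment variables rather than hard-coded:
c.GitLabOAuthenticator.client_id = os.environ.get('GITLAB_CLIENT_ID', '')
c.GitLabOAuthenticator.client_secret = os.environ.get('GITLAB_CLIENT_SECRET', '')
c.GitLabOAuthenticator.oauth_callback_url = 'https://jupyter.example.org/hub/oauth_callback'
```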
|
... | ... | |