|
|
# Background
|
|
|
We have in the past used Jupyter notebooks, and as we focus more and more on AI/machine learning we needed a good system for supporting this in both teaching and experimental tasks.
|
|
|
The COVID-19 situation also increased the number of students working from home, and with it the need for remotely accessible compute resources.
|
|
|
## Who are we
|
|
|
We are three coders/engineers/researchers working at the CS department at Karlstad University. Our main task is to help and support researchers, teachers and students with the practical/technical work required for their day-to-day experiments, classes etc. to run smoothly.
|
|
|
For this task the main drivers have been @tobivehk and @jonakarl, while @moharaji has served as a reality check/moderator of our wild ideas.
|
|
|
|
|
|
## We got a task!
|
|
|
Implement/install a system that lets students and researchers code and share/use GPU resources in an efficient way, for AI/machine learning research as well as "normal" coding.
|
|
|
|
|
|
## Why this blogpost
|
|
|
To document the ideas and decisions we made (good and bad) on the way to the system we currently use.
|
|
|
|
|
|
## What we have done in the past
|
|
|
We have previously used Jupyter notebooks on a single semi-large server for data mining and Python coding.
|
|
|
While it worked for a class of 30 students in one or two courses each year, the administrative burden of manually creating users, killing misbehaving programs etc. did not really scale.
|
|
|
Furthermore, there was no separation between users, and since we ran on a single server there was no possibility to scale out to multiple GPU machines.
|
|
|
|
|
|
## What we have NOT done in the past (ie what we knew nothing about at the start of this)
|
|
|
While we have some background in creating/running containers (Docker), we had zero experience with administering clouds, container orchestration, microservice architectures, GPU-assisted AI/ML and JupyterHub/Jupyter notebooks. Of course we had all heard the buzzwords and (thought we) understood the concepts, but as always reality is a lot uglier (and at the same time more straightforward) than the PowerPoint slides explaining it.
|
|
|
|
|
|
# The vision
|
|
|
Create a load-balanced working environment for students and staff where they can run labs/experiments, with data saved between sessions.
|
Setup a jupyterhub environment with central authorization and with possibilities ...
|
|
We broke the task down into multiple steps, as described below.
|
|
|
|
|
|
## Multi user working environment
|
|
|
We have been using Jupyter notebooks in the past, so JupyterHub seemed like a natural fit for this task.
|
|
|
We then needed something that could scale, and since none of us had ever worked with swarm/container management, and we felt that Kubernetes was just plain overkill for what we wanted, we opted for Docker Swarm.
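A minimal sketch of how JupyterHub can be pointed at a swarm, assuming the `dockerspawner` package is installed and the hub runs on a swarm manager node (the image, network and volume names below are placeholders, not our exact configuration):

```python
# jupyterhub_config.py -- sketch, assuming the dockerspawner package is
# installed and this hub runs on a Docker Swarm manager node.
c = get_config()  # noqa: F821 -- provided by JupyterHub at startup

# SwarmSpawner launches each user's notebook server as a swarm service,
# so notebooks can land on any worker node in the cluster.
c.JupyterHub.spawner_class = "dockerspawner.SwarmSpawner"

# Placeholder image/network names -- replace with your own.
c.SwarmSpawner.image = "jupyter/scipy-notebook:latest"
c.SwarmSpawner.network_name = "jupyterhub-net"

# The hub's API must be reachable from containers on other nodes,
# so bind it to an address the overlay network can route to.
c.JupyterHub.hub_ip = "0.0.0.0"

# Persist each user's work between sessions in a per-user named volume.
c.SwarmSpawner.notebook_dir = "/home/jovyan/work"
c.SwarmSpawner.volumes = {"jupyterhub-user-{username}": "/home/jovyan/work"}
```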
|
|
|
|
|
|
|
|
|
We had nothing configured for workload scaling, but we did (and still do) have a VMware cluster up and running (approx. 100 VMs, with room for more) and are fairly competent at managing it (i.e. it mostly works fine). As a start we thought of utilizing this cluster and creating several virtual machines (or a mix of physical and virtual machines) for the different tasks.
|
|
|
The idea was to run JupyterHub with nginx on one VM and have several "worker" VMs, i.e. one virtual machine per notebook. For the GPUs we thought of sharing them by either buying a really expensive NVIDIA card that supports vGPUs (such as a Tesla P100, P40 or V100) or going for a consumer-grade card and doing PCI passthrough into the virtual machine (i.e. 1-2 virtual machines on each ESXi host). Yeah, it was a bit naive, but you have to start somewhere.
|
|
|
|
|
|
In the end, budget constraints led us to 4x RTX 2080 Ti cards, so we could not use the vGPU features.
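Without vGPU support, one common workaround (a sketch of the general pattern, not necessarily our exact setup) is to hand each notebook container a single physical card via `CUDA_VISIBLE_DEVICES`, for example round-robin over the four GPUs:

```python
# Sketch: round-robin assignment of 4 physical GPUs to user containers
# via CUDA_VISIBLE_DEVICES. The hook shown in the comment below is
# illustrative, not the exact configuration we ended up with.
NUM_GPUS = 4
_next_gpu = 0

def pick_gpu():
    """Return the index of the next GPU in round-robin order."""
    global _next_gpu
    gpu = _next_gpu % NUM_GPUS
    _next_gpu += 1
    return gpu

# In a jupyterhub_config.py this could be wired into the spawner, e.g.:
#
#   def pre_spawn_hook(spawner):
#       spawner.environment["CUDA_VISIBLE_DEVICES"] = str(pick_gpu())
#   c.Spawner.pre_spawn_hook = pre_spawn_hook

if __name__ == "__main__":
    # Five users on four cards: the fifth wraps around to GPU 0.
    print([pick_gpu() for _ in range(5)])  # -> [0, 1, 2, 3, 0]
```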
|
|
|
|
|
|
### Off we went
|
|
|
Happy as two penguins, we started buying hardware and installing the system (and googling extensively).
|
|
|
Tasks:
|
|
|
1. Install ESXi with an RTX 2080 Ti and forward the card into a virtual machine. Long story short, that idea was abandoned in the end (why and how we did it is described [here](how-to-install-rtx-2080-in-esxi)).
|
|
|
2. Install jupyterhub and make it scale to multiple machines.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Centralized login and GPU access restrictions
|
|
|
We chose between multiple options, but since we already have a GitLab server (which authorizes students via the central IT system, i.e. LDAP) we reused it as an OAuth provider.
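With the `oauthenticator` package, pointing JupyterHub at a self-hosted GitLab looks roughly like this (a sketch; the URLs are placeholders, and the client id/secret come from an OAuth application registered in GitLab):

```python
# jupyterhub_config.py fragment -- sketch, assuming the oauthenticator
# package is installed; URLs and credentials below are placeholders.
import os

c = get_config()  # noqa: F821 -- provided by JupyterHub at startup

# Delegate login to GitLab, which in turn authorizes users via LDAP.
c.JupyterHub.authenticator_class = "oauthenticator.gitlab.GitLabOAuthenticator"

# Self-hosted GitLab instance (oauthenticator reads GITLAB_URL from the env).
os.environ["GITLAB_URL"] = "https://gitlab.example.org"

# OAuth application credentials registered on the GitLab server.
c.GitLabOAuthenticator.oauth_callback_url = "https://hub.example.org/hub/oauth_callback"
c.GitLabOAuthenticator.client_id = os.environ["GITLAB_CLIENT_ID"]
c.GitLabOAuthenticator.client_secret = os.environ["GITLAB_CLIENT_SECRET"]
```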
|