Distributing PyTorch model training on minikube with Kubeflow

Sarwesh Suman
5 min read · Aug 24, 2019

We know that building a deep learning model for real-world problems requires a lot of training data, and as problems grow in complexity, the model's complexity and the amount of training data usually grow with them. This means training time on a single machine increases and becomes unacceptable when we want to quickly build and evaluate multiple models with different hyperparameters.

Adding more GPUs to a machine is always an option; however, there is a limit to scaling a machine vertically. There comes a point where scaling out horizontally makes more sense and gives more throughput.

In this article I will set up and run a demo of distributed PyTorch training on a minikube cluster.

Let’s start by setting up minikube on your local machine.

The easiest way to install minikube is via Homebrew:

brew cask install minikube

Once minikube is installed, we can start the cluster with the following command:

minikube start

To verify, we can get the cluster details with the command below:

kubectl cluster-info
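
We can also check that the single minikube node has registered and is Ready:

kubectl get nodes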

Next, we download the ksonnet package.

In short, ksonnet is a configuration management tool for Kubernetes. More info can be found here.

Follow the commands below to set up ks (this is the Linux binary; on macOS, substitute darwin_amd64 for linux_amd64 in the package name):

KS_VER=0.13.1
export KS_PKG=ks_${KS_VER}_linux_amd64
wget https://github.com/ksonnet/ksonnet/releases/download/v${KS_VER}/${KS_PKG}.tar.gz
mkdir ./bin
tar -xvf ./$KS_PKG.tar.gz -C ./bin

The downloaded file is a binary, so no installation is needed; we can start using it directly. Let's add the binary's location to the PATH environment variable.

cd ./bin
export PATH=$PATH:`pwd`
cd ..

KS setup is done. Test the ks command, for example by printing its version:
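
ks version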

Next, I clone the Kubeflow source locally. We do this to avoid the GitHub API rate limit.

git clone --single-branch --branch v0.6.0 https://github.com/kubeflow/kubeflow.git
cd kubeflow
export KUBEFLOW_REPO=`pwd`
cd ..

I am using the v0.6.0 branch of Kubeflow. Let's initialise the ks app like so:

ks init pytorch_operator_installation --insecure-skip-tls-verify

We can edit app.yaml and set the correct namespace if we want the PyTorch operator pod to run in a specific namespace in the cluster.
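
For reference, the namespace is set under the environment's destination in app.yaml. Here is a trimmed sketch of the relevant section (the server URL is a placeholder, and exact fields may vary slightly across ksonnet versions):

environments:
  default:
    destination:
      namespace: kubeflow        # namespace the components will be deployed to
      server: https://192.168.99.100:8443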

We add the local Kubeflow source as a registry to our ks app:

ks registry add kubeflow "${KUBEFLOW_REPO}/kubeflow"
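
We can confirm the registry was added:

ks registry list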

We then install the pytorch-job package like so:

ks pkg install kubeflow/pytorch-job

We generate the pytorch-operator component and apply it to the minikube cluster:

ks generate pytorch-operator pytorch-operator
ks apply default -c pytorch-operator

This installs the PyTorchJob CRD and a pytorch-operator deployment, which starts an operator pod.
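
To sanity-check the installation, we can confirm that the CRD is registered and the operator pod is running (adjust the namespace if you changed it in app.yaml):

kubectl get crd | grep pytorchjobs
kubectl get pods --all-namespaces | grep pytorch-operator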

That's it. The infrastructure setup is done. All that is left is to write distributed PyTorch training code and submit it to the cluster via a PyTorchJob YAML.

Before moving forward, we point our Docker client at the Docker daemon inside the minikube VM we just started, so that images we build are directly available to the cluster without pushing them to a registry:

eval $(minikube docker-env)
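
Docker commands in this shell now talk to the daemon inside the minikube VM, so a quick listing should already show the cluster's own images:

docker images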

We use the demo code from the distributed PyTorch example here.

We start by cloning the repo like so,

git clone --single-branch --branch v0.6.0 https://github.com/kubeflow/pytorch-operator.git
cd pytorch-operator/examples/mnist/

We build the Docker image as described in the repo. I am building the one without MPI, as the MPI installation takes a lot of time.

To use GPUs we can simply use the NCCL backend, which comes by default with the PyTorch installation. For more information on which backend to use, refer here.

docker build -f Dockerfile -t kubeflow/pytorch-dist-mnist-test:1.0 ./

This builds the image with the mnist.py file that we want to run.

Note that when we run this code, it first downloads the MNIST dataset and then starts training. This is a very basic example, but the same pattern can be used to build real-world training jobs.
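
For orientation, the heart of an example like this boils down to a few calls. The sketch below is not the repo's mnist.py, just a minimal illustration of the pattern; it assumes that the operator injects MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE into each pod, which is what PyTorch's env:// initialization reads.

import torch.distributed as dist
import torch.nn as nn

def main():
    # MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are injected into every
    # pod by the PyTorch operator; env:// initialization picks them up.
    backend = "gloo"  # use "nccl" when GPUs are available
    dist.init_process_group(backend=backend, init_method="env://")

    model = nn.Linear(784, 10)  # stand-in for the real MNIST network
    # DistributedDataParallel all-reduces gradients across master and workers.
    model = nn.parallel.DistributedDataParallel(model)

    # ...build a DataLoader (ideally with a DistributedSampler), a loss and an
    # optimizer, then run the usual training loop; gradient synchronization
    # happens automatically inside backward().

if __name__ == "__main__":
    main()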

Let's look at the PyTorchJob YAML file:

apiVersion: "kubeflow.org/v1beta2"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-nccl"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-dist-mnist-test:1.0
              args: ["--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-dist-mnist-test:1.0
              args: ["--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1

A distributed PyTorch job like this only supports all-reduce style training, so we only need Master and Worker replicas. The Master always has exactly one replica, and we can scale the Worker replicas as per our requirements.

We can augment this YAML with additional settings such as node selectors or anti-affinity rules. Since this is running on my local machine, I will use it as it is. Note that the nvidia.com/gpu limits assume the cluster has GPU support; on a plain minikube without GPUs, the repo also provides a gloo-based CPU variant of the same job.

To deploy it in the minikube cluster I simply have to create the PyTorchJob like so:

cd v1beta2
>kubectl create -f pytorch_job_mnist_nccl.yaml
pytorchjob.kubeflow.org/pytorch-dist-mnist-nccl created

This starts the master and worker pods in the default namespace. Once both pods are up and running, they form a process group and start distributed training.
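
We can watch the pods and follow the training logs. Pod names follow the <job-name>-master-0 / <job-name>-worker-0 pattern in this version of the operator, so for example:

kubectl get pods
kubectl logs -f pytorch-dist-mnist-nccl-master-0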

Once training is complete, the pods move to Completed status and stay there until removed. (There is a way to configure the PyTorch operator to clean them up automatically, but I haven't explored that option yet.)

We can remove the job like so,

>kubectl get pytorchjobs
NAME                      AGE
pytorch-dist-mnist-nccl   9m
>kubectl delete pytorchjob pytorch-dist-mnist-nccl

To conclude, in this article I have set up minikube, ksonnet and the Kubeflow PyTorch operator, and then run a distributed PyTorch example. This example can easily be extended to real-world problems. To set up the pytorch-operator in a full-blown Kubernetes cluster, the steps shown above after the minikube setup stay the same.

In my next article I will delve deeper into what the pytorch-operator is; we will try to understand it functionally and also look at the code. Till then, cheers!

You can look at additional resources at the link(s) below.
