Distributing PyTorch model training on minikube with Kubeflow

brew install minikube
minikube start
kubectl cluster-info
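Right after `minikube start`, the API server can take a few seconds to respond, so a single `kubectl cluster-info` call may fail. A minimal sketch of a polling helper follows; `retry` is a hypothetical name introduced here, not part of minikube or kubectl.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs out.
# Hypothetical helper for illustration; POSIX sh compatible.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt is still to come.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage (not executed here): poll the cluster for up to about a minute.
# retry 12 5 kubectl cluster-info
```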
export KS_VER=0.13.1
export KS_PKG=ks_${KS_VER}_linux_amd64
wget https://github.com/ksonnet/ksonnet/releases/download/v${KS_VER}/${KS_PKG}.tar.gz
mkdir -p ./bin
tar -xvf ./$KS_PKG.tar.gz -C ./bin
export PATH=$PATH:$(pwd)/bin/$KS_PKG
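The ksonnet steps above hard-code the Linux package, while minikube was installed with Homebrew, which suggests macOS. A hedged sketch of picking the package name from the host OS is below. `ks_pkg_name` is a hypothetical helper, and it assumes the ksonnet release page publishes `linux_amd64` and `darwin_amd64` archives for the chosen version.

```shell
# ks_pkg_name VERSION [OS]
# Prints the ksonnet archive base name for the given OS
# (defaults to the output of `uname -s`).
# Assumption: only linux_amd64 and darwin_amd64 builds are considered.
ks_pkg_name() {
  version=$1
  os=${2:-$(uname -s)}
  case "$os" in
    Linux)  os_tag=linux ;;
    Darwin) os_tag=darwin ;;
    *)
      echo "unsupported OS: $os" >&2
      return 1
      ;;
  esac
  echo "ks_${version}_${os_tag}_amd64"
}

# Usage (not executed here):
# export KS_PKG=$(ks_pkg_name "$KS_VER")
```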
git clone --single-branch --branch v0.6.0 https://github.com/kubeflow/kubeflow.git
cd kubeflow
export KUBEFLOW_REPO=`pwd`
cd ..
ks init pytorch_operator_installation --insecure-skip-tls-verify
cd pytorch_operator_installation
ks registry add kubeflow "${KUBEFLOW_REPO}/kubeflow"
ks pkg install kubeflow/pytorch-job
ks generate pytorch-operator pytorch-operator
ks apply default -c pytorch-operator
cd ..
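Before building the image, it can help to confirm that the operator registered its custom resource. The sketch below only checks for the `pytorchjobs.kubeflow.org` CRD; `pytorchjob_crd_present` is a hypothetical helper name, and the check says nothing about whether the operator pod itself is healthy.

```shell
# Succeeds if the PyTorchJob CRD is registered in the current cluster.
# Hypothetical helper; relies on the standard `kubectl get crd` command.
pytorchjob_crd_present() {
  kubectl get crd pytorchjobs.kubeflow.org >/dev/null 2>&1
}

# Usage (not executed here):
# if pytorchjob_crd_present; then
#   echo "PyTorchJob CRD found"
# else
#   echo "PyTorchJob CRD missing; re-check the ks apply step" >&2
# fi
```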
eval $(minikube docker-env)
git clone --single-branch --branch v0.6.0 https://github.com/kubeflow/pytorch-operator.git
cd pytorch-operator/examples/mnist/
docker build -f Dockerfile -t kubeflow/pytorch-dist-mnist-test:1.0 ./
apiVersion: "kubeflow.org/v1beta2"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-nccl"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-dist-mnist-test:1.0
              args: ["--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-dist-mnist-test:1.0
              args: ["--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1
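Both replicas in this manifest request `nvidia.com/gpu: 1`, so the pods stay Pending on a cluster whose nodes advertise no GPUs. A rough pre-flight check, assuming the NVIDIA device plugin exposes the resource under that name, might look like the sketch below; `gpu_total` is a hypothetical helper.

```shell
# Prints the total allocatable nvidia.com/gpu count across all nodes.
# Prints 0 when no node advertises the resource.
# Hypothetical helper; the dot in the resource name is escaped for jsonpath.
gpu_total() {
  kubectl get nodes \
    -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}' \
    | tr ' ' '\n' \
    | awk '{ s += $1 } END { print s + 0 }'
}

# Usage (not executed here): the job needs one GPU each for Master and Worker.
# if [ "$(gpu_total)" -lt 2 ]; then
#   echo "fewer than 2 GPUs allocatable; pods will stay Pending" >&2
# fi
```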
cd v1beta2
>kubectl create -f pytorch_job_mnist_nccl.yaml
pytorchjob.kubeflow.org/pytorch-dist-mnist-nccl created
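Once the job is created, training output is easiest to follow from the master pod. The sketch below assumes the operator names pods `<job>-<replica type>-<index>`, for example `pytorch-dist-mnist-nccl-master-0`; confirm the actual names with `kubectl get pods`. Both function names are hypothetical.

```shell
# replica_pod_name JOB TYPE INDEX
# Builds the pod name under the assumed <job>-<type>-<index> convention.
replica_pod_name() {
  echo "$1-$2-$3"
}

# follow_master_logs JOB
# Streams logs from the first master replica of the given job.
follow_master_logs() {
  kubectl logs -f "$(replica_pod_name "$1" master 0)"
}

# Usage (not executed here):
# follow_master_logs pytorch-dist-mnist-nccl
```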
>kubectl get pytorchjobs
NAME AGE
pytorch-dist-mnist-nccl 9m
>kubectl delete pytorchjob pytorch-dist-mnist-nccl

Senior SDE @Amazon

Sarwesh Suman
