Distributing PyTorch model training on minikube with Kubeflow

brew install minikube
minikube start
kubectl cluster-info
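# Install ksonnet (ks). On macOS, use ks_${KS_VER}_darwin_amd64 instead.
export KS_VER=0.13.1
export KS_PKG=ks_${KS_VER}_linux_amd64
wget https://github.com/ksonnet/ksonnet/releases/download/v${KS_VER}/${KS_PKG}.tar.gz
mkdir -p ./bin
tar -xvf ./${KS_PKG}.tar.gz -C ./bin
# The archive unpacks into a ${KS_PKG} directory that holds the ks binary.
export PATH="$PATH:$(pwd)/bin/${KS_PKG}"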
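The package name above is hard-coded for Linux, while ksonnet also published a macOS (darwin) build. A minimal sketch of choosing the archive name from the host OS follows. The helper name ks_pkg_name is hypothetical, and it assumes only the amd64 archives are of interest:

```shell
# Hypothetical helper: map an OS name (default: this host's `uname -s`)
# to the ksonnet release archive name for the given version.
ks_pkg_name() {
  ver="$1"
  os="${2:-$(uname -s)}"
  case "$os" in
    Linux)  echo "ks_${ver}_linux_amd64" ;;
    Darwin) echo "ks_${ver}_darwin_amd64" ;;
    *)      echo "unsupported OS: $os" >&2; return 1 ;;
  esac
}
```

It could be used as `export KS_PKG=$(ks_pkg_name "$KS_VER")` before the wget step.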
# Clone Kubeflow to use as a local ksonnet registry.
git clone --single-branch --branch v0.6.0 https://github.com/kubeflow/kubeflow.git
export KUBEFLOW_REPO="$(pwd)/kubeflow"
# Create a ksonnet app; the remaining ks commands must run inside it.
ks init pytorch_operator_installation --insecure-skip-tls-verify
cd pytorch_operator_installation
ks registry add kubeflow "${KUBEFLOW_REPO}/kubeflow"
ks pkg install kubeflow/pytorch-job
ks generate pytorch-operator pytorch-operator
ks apply default -c pytorch-operator
cd ..
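The apply step returns before the operator has necessarily registered its custom resource, so a job submitted straight away may be rejected. A rough sketch of polling for the CRD follows. The function name wait_for_crd and its retry defaults are assumptions, and the CRD name pytorchjobs.kubeflow.org is inferred from the pytorchjob.kubeflow.org resource this walkthrough creates:

```shell
# Hypothetical helper: poll until a CustomResourceDefinition exists.
# Args: CRD name, number of attempts (default 30), delay in seconds (default 2).
wait_for_crd() {
  crd="$1"
  tries="${2:-30}"
  delay="${3:-2}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # `kubectl get crd NAME` exits non-zero while the CRD is missing.
    if kubectl get crd "$crd" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for CRD $crd" >&2
  return 1
}
```

For example: `wait_for_crd pytorchjobs.kubeflow.org && echo "operator CRD is registered"`.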
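# Point the docker CLI at minikube's Docker daemon so the image built below
# is visible to the cluster without pushing it to a registry.
eval $(minikube docker-env)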
git clone --single-branch --branch v0.6.0 https://github.com/kubeflow/pytorch-operator.git
cd pytorch-operator/examples/mnist/
docker build -f Dockerfile -t kubeflow/pytorch-dist-mnist-test:1.0 ./
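Because the image is never pushed to a registry, the job can only start if the build really landed in minikube's Docker daemon. A small sketch of that check follows. The helper name image_in_minikube is hypothetical, and it assumes the shell has already run the minikube docker-env step:

```shell
# Hypothetical helper: succeed if the Docker daemon this shell points at
# has an image matching the given repository:tag.
image_in_minikube() {
  # `docker images -q REPO:TAG` prints matching image IDs, or nothing.
  [ -n "$(docker images -q "$1")" ]
}
```

For example: `image_in_minikube kubeflow/pytorch-dist-mnist-test:1.0 || echo "image missing; rebuild after eval \$(minikube docker-env)"`.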
apiVersion: "kubeflow.org/v1beta2"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-nccl"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-dist-mnist-test:1.0
              args: ["--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-dist-mnist-test:1.0
              args: ["--backend", "nccl"]
              resources:
                limits:
                  nvidia.com/gpu: 1
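Each replica in this manifest requests one nvidia.com/gpu, so the pods stay Pending if no node advertises that resource, which is common on a default minikube VM. A hedged sketch of counting allocatable GPUs across nodes follows. The helper name gpu_count is hypothetical, and the JSONPath expression assumes kubectl's backslash escaping for dots inside a key:

```shell
# Hypothetical helper: print the total number of allocatable NVIDIA GPUs
# across all nodes (0 if none advertise the nvidia.com/gpu resource).
gpu_count() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' |
    awk '{ total += $1 } END { print total + 0 }'
}
```

For example: `[ "$(gpu_count)" -ge 2 ] || echo "fewer than 2 GPUs available; the master and worker cannot both be scheduled"`.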
cd v1beta2
>kubectl create -f pytorch_job_mnist_nccl.yaml
pytorchjob.kubeflow.org/pytorch-dist-mnist-nccl created
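Once the job is created, its progress can be followed from the PyTorchJob status. The sketch below polls the most recent status condition until the job finishes. The function names and retry defaults are hypothetical, and it assumes the operator appends conditions in order, so the last entry is the current one, with Succeeded and Failed as the terminal types:

```shell
# Hypothetical helper: print the type of the latest status condition.
job_state() {
  kubectl get pytorchjob "$1" -o jsonpath='{.status.conditions[-1:].type}'
}

# Hypothetical helper: poll until the job succeeds (returns 0),
# fails (returns 1), or the attempts run out (returns 2).
wait_for_job() {
  job="$1"
  tries="${2:-60}"
  delay="${3:-10}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    case "$(job_state "$job")" in
      Succeeded) return 0 ;;
      Failed)    return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  return 2
}
```

For example: `wait_for_job pytorch-dist-mnist-nccl`. Training output can usually be followed with `kubectl logs -f pytorch-dist-mnist-nccl-master-0`, assuming the operator names the master pod after the job with a -master-0 suffix.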
>kubectl get pytorchjobs
NAME                      AGE
pytorch-dist-mnist-nccl   9m
>kubectl delete pytorchjob pytorch-dist-mnist-nccl

Senior SDE @Amazon

Sarwesh Suman
