Kubernetes
SF Compute offers Kubernetes via a single namespace or virtual clusters. Virtual clusters are currently in beta. If you need virtual clusters (e.g., for applying Helm charts), contact us.
Quickstart
Install the CLI.
curl -fsSL https://sfcompute.com/cli/install | bash
Login via the CLI.
sf login
Buying Kubernetes nodes
To get Kubernetes nodes for a fixed period of time, such as an hour, use the buy command.
sf buy -d '1h' -t h100i # The "-t h100i" flag opts you into Kubernetes nodes
SF Compute also offers virtual machines.
Connecting to Kubernetes
List the clusters you have access to.
sf clusters list
CONT_A9ICKALESUBTHEY
Name alamo
K8s API https://alamo.clusters.sfcompute.com:6443
Namespace sf-alamo
Add a user to the cluster (a Kubernetes service account).
sf clusters users add --cluster alamo --user myuser
Use kubectl to check if your connection works. You should see something like this.
kubectl get pods
No resources found in sf-alamo namespace.
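If kubectl can't find the cluster, it's worth confirming which context and namespace it is using. Here is a rough sanity check, assuming the sf CLI has written the service-account credentials into your kubeconfig (the sf-alamo namespace comes from the sf clusters list example above):
# Show which context kubectl is currently using
kubectl config current-context
# List pods explicitly in the namespace reported by sf clusters list
kubectl get pods -n sf-alamo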
Templates
Single-node training
Example config for training nanogpt.
apiVersion: batch/v1
kind: Job
metadata:
  name: nanogpt
spec:
  completions: 1
  parallelism: 1
  completionMode: Indexed
  template:
    metadata:
      labels:
        job-name: nanogpt
    spec:
      containers:
        - name: trainer
          image: alexsfcompute/nanogpt-k8s:latest
          command: ["torchrun", "--standalone", "--nproc_per_node", "8", "train.py"]
          ports:
            - containerPort: 29500
          resources:
            requests:
              nvidia.com/gpu: 8
              nvidia.com/hostdev: 8
              memory: "512Gi"
              cpu: "32"
            limits:
              nvidia.com/gpu: 8
              nvidia.com/hostdev: 8
              memory: "512Gi"
              cpu: "32"
          volumeMounts:
            - name: data-volume
              mountPath: /data
      volumes:
        - name: data-volume
          emptyDir: {}
      restartPolicy: Never
Apply the config with kubectl.
kubectl apply -f nanogpt.yaml
Watch the pods spin up with kubectl get pods and follow the logs with kubectl logs -f <pod-name>.
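You can also wait for the Job to finish and then pull its logs by name. A small sketch using standard kubectl commands, where nanogpt is the Job name from the manifest above:
# Block until the Job reports success (or give up after two hours)
kubectl wait --for=condition=complete job/nanogpt --timeout=2h
# Fetch the logs of the pod the Job created
kubectl logs job/nanogpt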
Multi-node training
Example config for multi-node training of nanogpt.
# nanogpt.yaml
apiVersion: v1
kind: Service
metadata:
  name: nanogpt-svc
spec:
  clusterIP: None # Headless service
  selector:
    job-name: nanogpt
  ports:
    - port: 29500
      name: dist-port
---
apiVersion: batch/v1
kind: Job
metadata:
  name: nanogpt
spec:
  completions: 2 # Total number of pods
  parallelism: 2 # Run all pods in parallel
  completionMode: Indexed
  template:
    metadata:
      labels:
        job-name: nanogpt # This matches service selector
    spec:
      containers:
        - name: trainer
          image: alexsfcompute/nanogpt-k8s:latest
          command: ["torchrun", "--nnodes", "2", "--nproc_per_node", "8", "--rdzv-backend", "c10d", "--rdzv-endpoint", "nanogpt-0.nanogpt-svc:29500", "train.py"]
          ports:
            - containerPort: 29500
          resources:
            requests:
              nvidia.com/gpu: 8
              nvidia.com/hostdev: 8
              memory: "512Gi"
              cpu: "32"
            limits:
              nvidia.com/gpu: 8
              nvidia.com/hostdev: 8
              memory: "512Gi"
              cpu: "32"
          volumeMounts:
            - name: data-volume
              mountPath: /data
      volumes:
        - name: data-volume
          emptyDir: {}
      restartPolicy: Never
      subdomain: nanogpt-svc # needed for networking between pods in the job
This config uses torchrun to spawn a distributed training job on 16 GPUs across two nodes, and uses a Kubernetes Service to expose port 29500 on each node so the different PyTorch processes can discover each other. Once the service has been created, each node is accessible as nanogpt-<i>.nanogpt-svc, where i is the rank of the node, starting at 0.
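If you want to confirm those DNS records resolve, you can look one up from inside one of the running pods. A quick check, assuming the image includes getent (most glibc-based images do); note that Indexed Job pods get a random suffix in their names, so we look the pod name up first:
# Grab one of the job's pods, then resolve the rank-0 hostname from inside it
POD=$(kubectl get pods -l job-name=nanogpt -o jsonpath='{.items[0].metadata.name}')
kubectl exec "$POD" -- getent hosts nanogpt-0.nanogpt-svc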
This example also mounts a local data volume at /data, which is larger and more performant than the default Docker filesystem. The nanogpt example doesn’t use it, but you could cache model checkpoints or batches of training data there.
Apply the config with kubectl.
kubectl apply -f nanogpt.yaml
Watch the pods spin up with kubectl get pods, and follow the logs with kubectl logs -f <pod-name>.
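Since this Job runs two pods, it can be handy to stream both sets of logs at once using a label selector:
# Follow logs from every pod in the job, prefixing each line with the pod name
kubectl logs -f -l job-name=nanogpt --prefix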
It’s possible that the first time you run it, different pods will start at very different times because they take different amounts of time to download the Docker image. This can cause the pods that start first to time out while waiting for the stragglers. It should work fine the second time, once the image has been cached, and the pods should even restart automatically if they time out.
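If the first attempt does time out before the image has finished downloading everywhere, the simplest recovery is usually to delete the Job and re-apply it once the image is cached:
# Remove the failed Job (and its pods), then start a fresh one
kubectl delete job nanogpt
kubectl apply -f nanogpt.yaml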
At the moment, you need to install the libibverbs-dev userspace InfiniBand library into your containers to get RDMA via InfiniBand working at full speed. You can add it to your Docker containers like this:
# Install required packages, add the CUDA keyring, then clean up
RUN apt-get update && apt-get install -y wget sudo && \
    wget -q https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb && \
    sudo dpkg -i cuda-keyring_1.1-1_all.deb && \
    rm cuda-keyring_1.1-1_all.deb && \
    apt-get update && \
    apt-get install -y libibverbs-dev && \
    rm -rf /var/lib/apt/lists/*
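Once a pod with the GPUs is running, you can optionally check that RDMA devices are visible inside the container. Treat this as a rough sanity check only; the exact device paths depend on how the nvidia.com/hostdev devices are exposed in your pod:
# List InfiniBand devices exposed to the container (path may vary by setup)
kubectl exec <pod-name> -- ls /dev/infiniband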
If you have trouble with performance, we recommend running your scripts with the NCCL_DEBUG=INFO environment variable set to see how NCCL communication is handled, like so:
NCCL_DEBUG=INFO torchrun --nnodes 2 --nproc_per_node 8 --node-rank 0 --rdzv-backend c10d --rdzv-endpoint nanogpt-0.nanogpt-svc:29500 train.py
Building Docker images
If you’re building Docker images on a Mac with an ARM processor, we recommend using docker buildx to build your Dockerfile targeting AMD64.
docker buildx build --platform linux/amd64 -t <your image tag> .
Once it’s built, you can tag it and push it to your container repository.
docker tag <local tag> <remote tag>
docker push <remote tag>
Here we are using Docker Hub, but you can use any container registry you like (AWS ECR, Google Container Registry, etc.).
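If you prefer, docker buildx can also push in the same step instead of tagging separately, for example:
# Build for linux/amd64 and push straight to your registry
docker buildx build --platform linux/amd64 -t <remote tag> --push .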
Adding local volumes
If you need more than about 200GB of storage in your pod, or faster reads and writes than the default Docker filesystem, you can add an Ephemeral Volume to your Kubernetes manifest.
Here is an example config for a pod that mounts an Ephemeral Volume at /data.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-pod
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.3.1-base-ubuntu22.04
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data-volume
          mountPath: /data
  volumes:
    - name: data-volume
      emptyDir: {}
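Once the pod is running, you can check how much space the volume actually gives you. A quick check using standard tools in the container:
# Show the size and free space of the mounted volume
kubectl exec cuda-pod -- df -h /data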
Creating an SSH pod
If you’d like to SSH into a pod, here is an example manifest you can apply.
apiVersion: v1
kind: Pod
metadata:
  name: ssh-pod
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.3.1-base-ubuntu22.04
      command:
        - /bin/sh
        - -c
        - |
          apt-get update && apt-get install -y openssh-server && \
          passwd -d root && \
          echo 'PermitRootLogin yes\nPasswordAuthentication yes\nPermitEmptyPasswords yes' > /etc/ssh/sshd_config && \
          mkdir -p /var/run/sshd && \
          /usr/sbin/sshd -D
      ports:
        - containerPort: 22
      resources:
        requests:
          nvidia.com/gpu: 8
          nvidia.com/hostdev: 8
          memory: "512Gi"
          cpu: "32"
        limits:
          nvidia.com/gpu: 8
          nvidia.com/hostdev: 8
          memory: "512Gi"
          cpu: "32"
      volumeMounts:
        - name: data-volume
          mountPath: /data
  volumes:
    - name: data-volume
      emptyDir: {}
Once it's running (give it a minute to install sshd, which you can monitor with kubectl logs -f ssh-pod), you can port-forward the SSH port to your local machine with kubectl port-forward pod/ssh-pod 2222:22, and then SSH into it with ssh -p 2222 root@localhost. If you're building a custom Docker image, you can also copy those install commands into the image and make sure sshd starts at the end.
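Putting those steps together, the whole sequence looks roughly like this (run the port-forward in a separate terminal or background it):
# Watch until sshd is listening, then Ctrl-C out of the logs
kubectl logs -f ssh-pod
# Forward the SSH port in the background and connect
kubectl port-forward pod/ssh-pod 2222:22 &
ssh -p 2222 root@localhost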
Connecting to a Jupyter Notebook
Example manifest that you can apply.
# jupyter.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyter
  labels:
    app: jupyter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyter
  template:
    metadata:
      labels:
        app: jupyter
    spec:
      containers:
        - name: jupyter
          image: quay.io/jupyter/pytorch-notebook:cuda12-python-3.11.8
          ports:
            - containerPort: 8888
          command: ["start-notebook.sh"]
          args: ["--NotebookApp.token=''", "--NotebookApp.password=''"]
          resources:
            requests:
              nvidia.com/gpu: 8
              nvidia.com/hostdev: 8
              memory: "512Gi"
              cpu: "32"
            limits:
              nvidia.com/gpu: 8
              nvidia.com/hostdev: 8
              memory: "512Gi"
              cpu: "32"
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: "64Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: jupyter-service
spec:
  selector:
    app: jupyter
  ports:
    - protocol: TCP
      port: 8888
      targetPort: 8888
  type: ClusterIP
Apply it with kubectl apply -f jupyter.yaml, and you can watch it spin up with kubectl get pods.
Once it’s running, you can forward the Jupyter notebook server’s port to your computer like this:
kubectl port-forward service/jupyter-service 8888:8888
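Before wiring up an editor, you can confirm the tunnel works by hitting the server's REST endpoint. A quick check, assuming the notebook image exposes the standard Jupyter Server API:
# Should return a small JSON status document from the notebook server
curl -s http://localhost:8888/api/status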
To connect your VS Code to the server:
- Open VS Code
- Create or open a Jupyter notebook (.ipynb file)
- Click on the kernel selector in the top right
- Select "Select Another Kernel"
- Choose "Existing Jupyter Server"
- Enter the URL: http://localhost:8888
You should now be able to run cells! As a test, you can run !nvidia-smi, or:
import torch
torch.cuda.is_available()
> True
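When you're done with the notebook, you can tear it down so the GPUs are freed up, for example:
# Delete the Deployment and Service created above
kubectl delete -f jupyter.yaml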