Kubernetes
SF Compute offers Kubernetes via a single namespace or virtual clusters. Virtual clusters are currently in beta. If you need virtual clusters (e.g., for applying Helm charts), contact us.
Quickstart
Install the CLI.
curl -fsSL https://sfcompute.com/cli/install | bash
Login via the CLI.
sf login
Buying Kubernetes nodes
To get Kubernetes nodes for a fixed period of time, such as an hour, use the buy command.
sf buy -d '1h' -t h100i # The "-t h100i" flag opts you into Kubernetes nodes
SF Compute also offers virtual machines.
Connecting to Kubernetes
List the clusters you have access to.
sf clusters list
CONT_A9ICKALESUBTHEY
Name alamo
K8s API https://alamo.clusters.sfcompute.com:6443
Namespace sf-alamo
Add a user to the cluster (a Kubernetes service account).
sf clusters users add --cluster alamo --user myuser
Use kubectl to check if your connection works. You should see something like this.
kubectl get pods
No resources found in sf-alamo namespace.
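If kubectl can't find the cluster, it's worth confirming which context and namespace it is using. Here is a rough sanity check, assuming the sf CLI has written the service-account credentials into your kubeconfig (the sf-alamo namespace comes from the sf clusters list example above):
# Show which context kubectl is currently using
kubectl config current-context
# List pods explicitly in the namespace reported by sf clusters list
kubectl get pods -n sf-alamo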
Templates
Single-node training
Example config for training nanogpt.
apiVersion: batch/v1
kind: Job
metadata:
  name: nanogpt
spec:
  completions: 1
  parallelism: 1
  completionMode: Indexed
  template:
    metadata:
      labels:
        job-name: nanogpt
    spec:
      containers:
        - name: trainer
          image: alexsfcompute/nanogpt-k8s:latest
          command: ["torchrun", "--standalone", "--nproc_per_node", "8", "train.py"]
          ports:
            - containerPort: 29500
          resources:
            requests:
              nvidia.com/gpu: 8
              nvidia.com/hostdev: 8
              memory: "512Gi"
              cpu: "32"
            limits:
              nvidia.com/gpu: 8
              nvidia.com/hostdev: 8
              memory: "512Gi"
              cpu: "32"
          volumeMounts:
            - name: data-volume
              mountPath: /data
      volumes:
        - name: data-volume
          emptyDir: {}
      restartPolicy: Never
Apply the config with kubectl.
kubectl apply -f nanogpt.yaml
Watch the pods spin up with kubectl get pods and follow the logs with kubectl logs -f <pod-name>.
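You can also wait for the Job to finish and then pull its logs by name. A small sketch using standard kubectl commands, where nanogpt is the Job name from the manifest above:
# Block until the Job reports success (or give up after two hours)
kubectl wait --for=condition=complete job/nanogpt --timeout=2h
# Fetch the logs of the pod the Job created
kubectl logs job/nanogpt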
Multi-node training
Example config for multi-node training of nanogpt.
# nanogpt.yaml
apiVersion: v1
kind: Service
metadata:
  name: nanogpt-svc
spec:
  clusterIP: None # Headless service
  selector:
    job-name: nanogpt
  ports:
    - port: 29500
      name: dist-port
---
apiVersion: batch/v1
kind: Job
metadata:
  name: nanogpt
spec:
  completions: 2 # Total number of pods
  parallelism: 2 # Run all pods in parallel
  completionMode: Indexed
  template:
    metadata:
      labels:
        job-name: nanogpt # This matches service selector
    spec:
      containers:
        - name: trainer
          image: alexsfcompute/nanogpt-k8s:latest
          command: ["torchrun", "--nnodes", "2", "--nproc_per_node", "8", "--rdzv-backend", "c10d", "--rdzv-endpoint", "nanogpt-0.nanogpt-svc:29500", "train.py"]
          ports:
            - containerPort: 29500
          resources:
            requests:
              nvidia.com/gpu: 8
              nvidia.com/hostdev: 8
              memory: "512Gi"
              cpu: "32"
            limits:
              nvidia.com/gpu: 8
              nvidia.com/hostdev: 8
              memory: "512Gi"
              cpu: "32"
          volumeMounts:
            - name: data-volume
              mountPath: /data
      volumes:
        - name: data-volume
          emptyDir: {}
      restartPolicy: Never
      subdomain: nanogpt-svc # needed for networking between pods in the job
This config uses torchrun to spawn a distributed training job on 16 GPUs across two nodes, and uses a Kubernetes Service to expose port 29500 on each node so the different PyTorch processes can discover each other. Once the service has been created, each node is accessible as nanogpt-<i>.nanogpt-svc, where i is the rank of the node, starting at 0.
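If you want to confirm those DNS records resolve, you can look one up from inside one of the running pods. A quick check, assuming the image includes getent (most glibc-based images do); note that Indexed Job pods get a random suffix in their names, so we look the pod name up first:
# Grab one of the job's pods, then resolve the rank-0 hostname from inside it
POD=$(kubectl get pods -l job-name=nanogpt -o jsonpath='{.items[0].metadata.name}')
kubectl exec "$POD" -- getent hosts nanogpt-0.nanogpt-svc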
This example also mounts a local data volume at /data, which is larger and more performant than the default Docker filesystem. The nanogpt example doesn’t use it, but you could cache model checkpoints or batches of training data there.
Apply the config with kubectl.
kubectl apply -f nanogpt.yaml
Watch the pods spin up with kubectl get pods, and follow the logs with kubectl logs -f <pod-name>.
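Since this Job runs two pods, it can be handy to stream both sets of logs at once using a label selector:
# Follow logs from every pod in the job, prefixing each line with the pod name
kubectl logs -f -l job-name=nanogpt --prefix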
It’s possible that the first time you run it, different pods will start at very different times because they take different amounts of time to download the Docker image. This can cause the pods that start first to time out while waiting for the stragglers. It should work fine the second time, once the image has been cached, and the pods should even restart automatically if they time out.
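If the first attempt does time out before the image has finished downloading everywhere, the simplest recovery is usually to delete the Job and re-apply it once the image is cached:
# Remove the failed Job (and its pods), then start a fresh one
kubectl delete job nanogpt
kubectl apply -f nanogpt.yaml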
At the moment, you need to install the libibverbs-dev userspace InfiniBand library into your containers to get RDMA via InfiniBand working at full speed. You can add it to your Docker containers like this:
# Install required packages, add the CUDA keyring, then clean up
RUN apt-get update && apt-get install -y wget sudo && \
    wget -q https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb && \
    sudo dpkg -i cuda-keyring_1.1-1_all.deb && \
    rm cuda-keyring_1.1-1_all.deb && \
    apt-get update && \
    apt-get install -y libibverbs-dev && \
    rm -rf /var/lib/apt/lists/*
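Once a pod with the GPUs is running, you can optionally check that RDMA devices are visible inside the container. Treat this as a rough sanity check only; the exact device paths depend on how the nvidia.com/hostdev devices are exposed in your pod:
# List InfiniBand devices exposed to the container (path may vary by setup)
kubectl exec <pod-name> -- ls /dev/infiniband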
If you have trouble with performance, we recommend running your scripts with the NCCL_DEBUG=INFO environment variable set to see how NCCL communication is handled, like so:
NCCL_DEBUG=INFO torchrun --nnodes 2 --nproc_per_node 8 --node-rank 0 --rdzv-backend c10d --rdzv-endpoint nanogpt-0.nanogpt-svc:29500 train.py
Building Docker images
If you’re building Docker images on a Mac with an ARM processor, we recommend using docker buildx to build your Dockerfile targeting AMD64.
docker buildx build --platform linux/amd64 -t <your image tag> .
Once it’s built, you can tag it and push it to your container repository.
docker tag <local tag> <remote tag>
docker push <remote tag>
Here we are using Docker Hub, but you can use any container registry you like (AWS ECR, Google Container Registry, etc.).
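If you prefer, docker buildx can also push in the same step instead of tagging separately, for example:
# Build for linux/amd64 and push straight to your registry
docker buildx build --platform linux/amd64 -t <remote tag> --push .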
Adding local volumes
If you need more than about 200GB of storage in your pod, or faster reads and writes than the default Docker filesystem, you can add an Ephemeral Volume to your Kubernetes manifest.
Here is an example config for a pod that mounts an Ephemeral Volume at /data.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-pod
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.3.1-base-ubuntu22.04
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data-volume
          mountPath: /data
  volumes:
    - name: data-volume
      emptyDir: {}
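Once the pod is running, you can check how much space the volume actually gives you. A quick check using standard tools in the container:
# Show the size and free space of the mounted volume
kubectl exec cuda-pod -- df -h /data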
Creating an SSH pod
If you’d like to SSH into a pod, here is an example manifest you can apply.
apiVersion: v1
kind: Pod
metadata:
  name: ssh-pod
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.3.1-base-ubuntu22.04
      command:
        - /bin/sh
        - -c
        - |
          apt-get update && apt-get install -y openssh-server && \
          passwd -d root && \
          echo 'PermitRootLogin yes\nPasswordAuthentication yes\nPermitEmptyPasswords yes' > /etc/ssh/sshd_config && \
          mkdir -p /var/run/sshd && \
          /usr/sbin/sshd -D
      ports:
        - containerPort: 22
      resources:
        requests:
          nvidia.com/gpu: 8
          nvidia.com/hostdev: 8
          memory: "512Gi"
          cpu: "32"
        limits:
          nvidia.com/gpu: 8
          nvidia.com/hostdev: 8
          memory: "512Gi"
          cpu: "32"
      volumeMounts:
        - name: data-volume
          mountPath: /data
  volumes:
    - name: data-volume
      emptyDir: {}
Once it's running (give it a minute to install sshd, which you can monitor with kubectl logs -f ssh-pod), you can port-forward the SSH port to your local machine with kubectl port-forward pod/ssh-pod 2222:22, and then SSH into it with ssh -p 2222 root@localhost. If you're building a custom Docker image, you can also copy those install commands into the image and make sure sshd starts at the end.
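Putting those steps together, the whole sequence looks roughly like this (run the port-forward in a separate terminal or background it):
# Watch until sshd is listening, then Ctrl-C out of the logs
kubectl logs -f ssh-pod
# Forward the SSH port in the background and connect
kubectl port-forward pod/ssh-pod 2222:22 &
ssh -p 2222 root@localhost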
Connecting to a Jupyter Notebook
Example manifest that you can apply.
# jupyter.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyter
  labels:
    app: jupyter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyter
  template:
    metadata:
      labels:
        app: jupyter
    spec:
      containers:
        - name: jupyter
          image: quay.io/jupyter/pytorch-notebook:cuda12-python-3.11.8
          ports:
            - containerPort: 8888
          command: ["start-notebook.sh"]
          args: ["--NotebookApp.token=''", "--NotebookApp.password=''"]
          resources:
            requests:
              nvidia.com/gpu: 8
              nvidia.com/hostdev: 8
              memory: "512Gi"
              cpu: "32"
            limits:
              nvidia.com/gpu: 8
              nvidia.com/hostdev: 8
              memory: "512Gi"
              cpu: "32"
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: "64Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: jupyter-service
spec:
  selector:
    app: jupyter
  ports:
    - protocol: TCP
      port: 8888
      targetPort: 8888
  type: ClusterIP
Apply it with kubectl apply -f jupyter.yaml, and you can watch it spin up with kubectl get pods.
Once it’s running, you can forward the Jupyter notebook server’s port to your computer like this:
kubectl port-forward service/jupyter-service 8888:8888
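Before wiring up an editor, you can confirm the tunnel works by hitting the server's REST endpoint. A quick check, assuming the notebook image exposes the standard Jupyter Server API:
# Should return a small JSON status document from the notebook server
curl -s http://localhost:8888/api/status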
To connect your VS Code to the server:
- Open VS Code
- Create or open a Jupyter notebook (.ipynb file)
- Click on the kernel selector in the top right
- Select "Select Another Kernel"
- Choose "Existing Jupyter Server"
- Enter the URL: http://localhost:8888
You should now be able to run cells! As a test, you can run !nvidia-smi, or:
import torch
torch.cuda.is_available()
> True
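When you're done with the notebook, you can tear it down so the GPUs are freed up, for example:
# Delete the Deployment and Service created above
kubectl delete -f jupyter.yaml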