Inference fleet

A deployment maintains N nodes on a capacity. A procurement buys and sells spot compute to keep them running. To scale, update the deployment’s target node count; the procurement adjusts automatically.

Prerequisites

SF Compute CLI installed and authenticated (sf login)
Credits on your account (sf billing balance)
A zone with availability (sf zones ls)

Create a capacity

sf capacities create --zone richmond --name inference

Create a node template

Define the image and startup script for your inference nodes.

startup.sh

#!/bin/bash

mkdir -p /root/.ssh
cat >>/root/.ssh/authorized_keys <<"EOF"
ssh-ed25519 AAAA... you@example.com
EOF

# Start your inference server here

sf node-templates create \
  --name inference-worker \
  --image ubuntu-22.04.5-cuda-12.7 \
  --cloud-init ./startup.sh

See Node Templates for details on cloud-init and image configuration.

Create a deployment

sf deployments create \
  --name inference \
  --capacity inference \
  --node-template inference-worker \
  --target-node-count 4

The deployment creates 4 nodes in the awaiting_allocation state. They start once the capacity has compute time.

Create a procurement

Setting --target to node_count tells the procurement to match however many nodes exist on the capacity.

sf procurements create \
  --name inference \
  --capacity inference \
  --target node_count \
  --max-buy-price 20.00 \
  --min-sell-price 10.00 \
  --window 2h

The procurement sees 4 waiting nodes and places buy orders. Within minutes, your nodes move to running.
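
If a script needs to block until the nodes are up, a polling loop along these lines may help. This is a minimal sketch, not part of the CLI: it assumes `sf nodes ls` prints one line per node with the status as a plain word such as `running`, which this page does not confirm. The function names and the `LIST_NODES_CMD` override are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only. ASSUMPTION: `sf nodes ls` prints one line per node and that
# line contains the node's status as a whole word (for example "running").

# Count lines on stdin that contain the whole word "running".
count_running() {
  grep -cw "running" || true   # grep exits 1 on zero matches but still prints 0
}

# Poll until at least $1 nodes are running.
# $2 = max attempts (default 30), $3 = seconds between attempts (default 10).
# LIST_NODES_CMD is a hypothetical override for swapping in another listing command.
wait_for_running() {
  local want="$1" tries="${2:-30}" delay="${3:-10}"
  local list_cmd="${LIST_NODES_CMD:-sf nodes ls}"
  local i have=0
  for ((i = 0; i < tries; i++)); do
    have="$($list_cmd | count_running)"
    if [ "$have" -ge "$want" ]; then
      echo "$have nodes running"
      return 0
    fi
    sleep "$delay"
  done
  echo "timed out: $have of $want nodes running" >&2
  return 1
}
```

Calling `wait_for_running 4` after creating the procurement would wait up to about five minutes with the defaults. If the listing includes nodes from other deployments, the count will be too high, so adjust the filter to the real output format.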

Scale up

sf deployments set inference --target-node-count 8

4 new nodes are created. The procurement buys compute to cover them.

Scale down

sf deployments set inference --target-node-count 2

Excess nodes are removed. The procurement sells unneeded compute.
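
Scaling in either direction is the same command with a different count, so it is easy to drive from a load signal. The sketch below derives a target from a request rate and a per-node throughput, clamps it, and prints the command it would run. The helper names, the `APPLY` switch, and the throughput figures are illustrative assumptions; only `sf deployments set ... --target-node-count` comes from this guide.

```bash
#!/usr/bin/env bash
# Sketch only: turn a load estimate into a target node count.

# target_nodes RPS PER_NODE_RPS [MIN] [MAX]
# Ceiling-divides the request rate by per-node throughput, then clamps to [MIN, MAX].
target_nodes() {
  local rps="$1" per_node="$2" min="${3:-1}" max="${4:-16}"
  local n=$(( (rps + per_node - 1) / per_node ))
  if (( n < min )); then n=$min; fi
  if (( n > max )); then n=$max; fi
  echo "$n"
}

# scale_deployment NAME COUNT
# Prints the command by default; set APPLY=1 (hypothetical switch) to run it.
scale_deployment() {
  local name="$1" n="$2"
  if [ "${APPLY:-0}" = "1" ]; then
    sf deployments set "$name" --target-node-count "$n"
  else
    echo "sf deployments set $name --target-node-count $n"
  fi
}
```

For example, `scale_deployment inference "$(target_nodes 950 120)"` resolves to a target of 8 nodes. The clamp keeps a noisy load signal from requesting an unbounded fleet.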

Handling interruptions

Spot compute is not guaranteed. Nodes may shut down if the market price rises above your --max-buy-price or if other buyers place reservations that consume the available compute. Design workloads to tolerate nodes being replaced.

Run stateless workers. Download model weights on boot. Local disk does not persist between nodes.
Configure health checks on your load balancer. Route traffic only to nodes that are ready.
Use a longer --window. A higher value (e.g., 6h) reduces gaps but commits more spend. See Tuning the managed window.
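
Because a replacement node boots from an empty disk, the weight download in the startup script is worth wrapping in a retry so a transient network error does not leave a node permanently unready. The following is a minimal sketch of a generic retry helper. The `retry` name, the `WEIGHTS_URL` variable, and the `/models` path are hypothetical placeholders, not part of SF Compute.

```bash
#!/usr/bin/env bash
# Sketch only: retry a command a fixed number of times with a fixed delay.

# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, or 1 after ATTEMPTS failures.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example use inside startup.sh (hypothetical URL and path, left commented out):
#   mkdir -p /models
#   retry 5 10 curl -fsSL "$WEIGHTS_URL" -o /models/model.bin
#   # Start the inference server only after the download succeeds, so the
#   # load balancer health check fails until the node can serve traffic.
```

Starting the server only after the download completes means the load balancer's health check naturally keeps traffic away from nodes that are still warming up.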

Monitoring

sf deployments get inference   # Deployment status and node count
sf procurements get inference  # Procurement status and pricing
sf nodes ls                    # Individual node status

Next steps

Adjust --max-buy-price and --min-sell-price to control spend

Buy a reserved block of compute into this capacity when you need to guarantee availability:

sf orders create --capacity inference --side buy --nodes 4 --start "in 5h" --duration 24h --max-rate 20.00
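
When choosing --max-buy-price and --window, a rough upper bound on committed spend can be useful. The sketch below multiplies nodes, hours, and price. It assumes prices are quoted per node per hour and that the procurement never pays more than --max-buy-price; this page states neither, so treat the result as an estimate. The `max_window_spend` helper is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only. ASSUMPTION: prices are per node per hour.

# max_window_spend NODES WINDOW_HOURS MAX_PRICE
# Prints nodes * hours * price with two decimal places.
max_window_spend() {
  local nodes="$1" hours="$2" price="$3"
  awk -v n="$nodes" -v h="$hours" -v p="$price" \
    'BEGIN { printf "%.2f\n", n * h * p }'
}

max_window_spend 4 2 20.00   # 4 nodes, 2h window  -> 160.00
max_window_spend 8 6 20.00   # 8 nodes, 6h window  -> 960.00
```

Under these assumptions, moving from the 4-node, 2h configuration in this guide to 8 nodes with a 6h window raises the bound sixfold, which is the trade-off behind a longer --window.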
