A deployment maintains N instances on a capacity. A procurement buys and sells spot compute to
keep them running. To scale, update the deployment’s target instance count; the procurement
adjusts automatically.
Prerequisites
- SF Compute CLI installed and authenticated (sf login)
- Credits on your account (sf billing balance)
- A sense of which hardware you want (sf instance-skus list)
Create a capacity
sf capacities create --name inference
Create an instance template
Define the image and startup script for your inference instances. Save the script as startup.sh, then pass it to the template with --cloud-init.
#!/bin/bash
mkdir -p /root/.ssh
cat >>/root/.ssh/authorized_keys <<"EOF"
ssh-ed25519 AAAA... you@example.com
EOF
# Start your inference server here
sf instance-templates create \
--name inference-worker \
--image ubuntu-22.04.5-cuda-12.7 \
--cloud-init ./startup.sh
See Instance templates for details on cloud-init and image
configuration.
Create a deployment
sf deployments create \
--name inference \
--capacity inference \
--instance-template inference-worker \
--target-instance-count 4
Four instances are created in the awaiting_allocation state. They start once the capacity has compute time.
Create a procurement
Setting the target to node_count tells the procurement to match however many instances exist on
the capacity. Pass --instance-sku <id> to pin the procurement to specific hardware; see
instance SKUs for the catalog.
sf procurements create \
--name inference \
--capacity inference \
--target node_count \
--max-buy-price 20.00 \
--min-sell-price 10.00 \
--window 2h \
--instance-sku isku_4UpxzQw7A8N
The procurement sees 4 waiting instances and places buy orders. Within minutes, your instances move
to running.
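As a rough budgeting aid, the values in the command above can be turned into an upper bound on what one window might cost. This is only a sketch: it assumes --max-buy-price is a price per instance per hour, which this guide does not state, so check the procurement reference before relying on the figure. The function name window_spend_ceiling is illustrative and not part of the CLI.

```shell
# Hypothetical helper, not part of the sf CLI.
# ASSUMPTION: --max-buy-price is a per-instance, per-hour price.
# Usage: window_spend_ceiling <instances> <max_buy_price> <window_hours>
window_spend_ceiling() {
  # awk handles the decimal arithmetic that plain shell arithmetic cannot.
  awk -v n="$1" -v p="$2" -v h="$3" 'BEGIN { printf "%.2f\n", n * p * h }'
}

# Values from the procurement example: 4 instances, 20.00 limit, 2h window.
window_spend_ceiling 4 20.00 2   # prints 160.00
```

Under the same assumption, doubling the instance count or the window doubles the bound.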
Scale up
sf deployments set inference --target-instance-count 8
Four new instances are created. The procurement buys compute to cover them.
Scale down
sf deployments set inference --target-instance-count 2
Excess instances are removed. The procurement sells unneeded compute.
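If scaling is triggered from a script or a scheduler, a thin wrapper can reject malformed counts before they reach the CLI. The sketch below is one possible shape: scale_deployment and the SF override variable are hypothetical names, and only the sf deployments set command itself comes from this guide. The wrapper checks only that the count is a non-negative integer; it does not know which values the service accepts.

```shell
# Hypothetical wrapper; only `sf deployments set ... --target-instance-count`
# comes from this guide. Set SF=echo to print the command instead of running it.
SF="${SF:-sf}"

# Usage: scale_deployment <deployment-name> <target-instance-count>
scale_deployment() {
  case "$2" in
    ''|*[!0-9]*)
      # Empty, negative, or non-numeric input is rejected here.
      echo "target instance count must be a non-negative integer: '$2'" >&2
      return 2
      ;;
  esac
  "$SF" deployments set "$1" --target-instance-count "$2"
}
```

For a dry run, set SF=echo and call scale_deployment inference 8; the function then prints the arguments it would pass to sf.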
Handling interruptions
Spot compute is not guaranteed. Instances may shut down if the market price rises above your
--max-buy-price or if other buyers place reservations that consume the capacity. Design workloads
to tolerate instances being replaced.
- Stateless workers. Download model weights on boot. Local disk does not persist between
instances.
- Health checks on your load balancer. Route traffic only to instances that are ready.
- Longer --window. A higher value (e.g., 6h) reduces gaps but commits more spend. See
Tuning the managed window.
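The readiness point in the list above can be sketched as a small retry loop that a startup script or a load balancer hook might call before an instance takes traffic. This is an illustration, not something the CLI provides: wait_until_ready is a hypothetical name, and the health endpoint in the commented example is assumed, since this guide does not describe the inference server.

```shell
# Hypothetical helper, not part of the sf CLI: retry a readiness probe.
# Usage: wait_until_ready <max_attempts> <delay_seconds> <probe command...>
wait_until_ready() {
  local max_attempts="$1"
  local delay="$2"
  local attempt=1
  shift 2
  while true; do
    # Run the probe; a zero exit status counts as ready.
    if "$@"; then
      return 0
    fi
    # Give up once the attempt budget is used (at least one attempt is made).
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example with an ASSUMED endpoint (adjust to your server):
# wait_until_ready 60 5 curl -fsS http://localhost:8000/healthz
```

Because model weights are downloaded on boot, the attempt budget should cover the expected download time.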
Monitoring
sf deployments get inference # Deployment status and instance count
sf procurements get inference # Procurement status and pricing
sf instances list # Individual instance status
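When watching instances move from awaiting_allocation to running, it can help to capture the three views together with a timestamp. The sketch below only wraps the commands listed above; status_snapshot and the SF override variable are hypothetical names, and it assumes the deployment and procurement share a name, as they do in this guide.

```shell
# Hypothetical helper; it only wraps the three monitoring commands from this guide.
# Set SF=echo to print the commands instead of running them.
SF="${SF:-sf}"

# Usage: status_snapshot <name>
# ASSUMPTION: the deployment and procurement use the same name.
status_snapshot() {
  echo "== snapshot at $(date -u +%Y-%m-%dT%H:%M:%SZ) =="
  "$SF" deployments get "$1"
  "$SF" procurements get "$1"
  "$SF" instances list
}
```

Calling status_snapshot inference from a loop or a cron job yields a simple log of how the deployment converges after a scale change.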
Next steps
- Adjust --max-buy-price and --min-sell-price to control spend
- Buy a reserved block of compute into this capacity when you need to guarantee availability:
sf orders create --capacity inference --side buy --nodes 4 --start "in 5h" --duration 24h --max-rate 20.00