On-demand and spot

The easiest way to get something equivalent to "on demand" on other platforms is to use sf scale. It sets up an automated purchasing strategy, called a procurement, that continuously maintains a desired number of nodes by buying and selling short-term "spot" reservations.

Create a new procurement.

# create a new procurement for 8 GPUs
sf scale create -n 8

This will return an ID you can use to refer back to the procurement.

You can see all of your procurements.

sf scale show

You can see the details of a specific procurement.

sf scale show <procurement-id>

You can scale up/down existing procurements.

# scale a procurement to 16
sf scale update <procurement-id> -n 16

# scale a procurement to 8
sf scale update <procurement-id> -n 8

To turn a procurement off, scale it down to 0.

sf scale update <procurement-id> -n 0

Limit price

You can set a "limit price" via the -p flag. This is the most the procurement will pay: it buys compute only when the price is at or below the limit price.

# maintain 8 GPUs, but only while the price is <= $1.50/GPU/hr
sf scale update <procurement-id> -n 8 -p 1.50

If you do not specify a limit price, it defaults to 1.5 times an estimate of the current market price for the first reservation the procurement would buy. If a good price estimate isn't available, the limit price is set to $2.65/GPU/hr.
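The limit price rules can be sketched as two small shell functions. This is only an illustration of the rules described here, not how sf computes anything internally. The function names default_limit_price and would_buy are hypothetical and are not part of the sf CLI.

```shell
# Hypothetical helpers that model the documented rules; not part of the sf CLI.

# default_limit_price [ESTIMATE]
# Prints 1.5 x ESTIMATE, or the 2.65 fallback when no estimate is given.
default_limit_price() {
  estimate="${1:-}"
  if [ -z "$estimate" ]; then
    echo "2.65"
  else
    awk -v e="$estimate" 'BEGIN { printf "%.2f\n", e * 1.5 }'
  fi
}

# would_buy PRICE LIMIT
# Succeeds (exit status 0) when PRICE is at most LIMIT.
would_buy() {
  awk -v p="$1" -v l="$2" 'BEGIN { exit !(p <= l) }'
}
```

For example, default_limit_price 1.00 prints 1.50, and default_limit_price with no argument prints 2.65. would_buy 1.50 1.50 succeeds because the rule is "at most", not "strictly below".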

Reserved time buffer

By default, a procurement will try to ensure you have at least 1 hour of time reserved at all times. More specifically, when you have less than 1 hour of compute time remaining, it will try to buy the next hour. This threshold is called the horizon.

You can set the horizon with --horizon.

# maintain 8 GPUs, start buying the next reservation when there's 30 minutes left
sf scale create -n 8 --horizon '30m'

# maintain 8 GPUs, start buying the next reservation when there's 2 hours left
sf scale create -n 8 --horizon '2h'
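To make the horizon rule concrete, here is a minimal sketch of the "buy when less than the horizon remains" check. The helper names are hypothetical and not part of the sf CLI. The parser only handles values shaped like '30m' and '2h', since those are the only formats shown here; whether sf accepts other formats is not covered.

```shell
# Hypothetical helpers that model the horizon rule; not part of the sf CLI.

# horizon_to_minutes VALUE
# Converts values shaped like '30m' or '2h' to minutes.
# Anything else is rejected (an assumption of this sketch).
horizon_to_minutes() {
  case "$1" in
    *[!0-9]*[mh] | [mh] | "")
      echo "unsupported horizon: $1" >&2
      return 1
      ;;
    *m) awk -v n="${1%m}" 'BEGIN { print n + 0 }' ;;
    *h) awk -v n="${1%h}" 'BEGIN { print n * 60 }' ;;
    *)
      echo "unsupported horizon: $1" >&2
      return 1
      ;;
  esac
}

# should_buy_next_hour REMAINING_MINUTES HORIZON
# Succeeds when less than HORIZON of reserved time remains.
should_buy_next_hour() {
  horizon_minutes=$(horizon_to_minutes "$2") || return 2
  [ "$1" -lt "$horizon_minutes" ]
}
```

With 45 minutes of reserved time left, should_buy_next_hour 45 1h succeeds (the default 1 hour horizon would trigger a purchase), while should_buy_next_hour 45 30m does not.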

Setting the horizon is a tradeoff.

With a short horizon, you're less likely to keep your GPUs, but you're also less likely to overpay.
With a long horizon, you're more likely to keep your GPUs, but you're also more likely to overpay.

To make the most of a short horizon, we recommend regularly snapshotting your training jobs.

Colocation behavior

A procurement's colocation behavior is controlled by its colocation strategy, which you set with the -cs or --colocation-strategy flag in the CLI.

There are four colocation strategies.

anywhere: The procurement will buy compute on any cluster, with no guarantees about colocation whatsoever. Generally speaking, it will attempt to choose the cheapest cluster.
colocate: The procurement will guarantee that all compute it buys is colocated in the same cluster, but not any specific cluster. If you scale the procurement down to 0 and then scale back up, the procurement is not guaranteed to land on the same cluster it was on before.
colocate-pinned: Same as colocate, but if you scale the procurement down to 0 and then scale back up, the procurement is guaranteed to land on the same cluster it was on before. Another way of saying this is once the first reservation begins, the procurement is "pinned" to whatever cluster that reservation landed on.
pinned: The procurement will guarantee that all compute it buys is colocated on a specific cluster, which is specified via the -c flag.

The default colocation strategy is colocate-pinned.

The pinned colocation strategy is implied by the -c flag, so if you set -c, you don't need to explicitly set -cs pinned.
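The way the default and the -c flag interact can be sketched as a small function. resolve_strategy is a hypothetical name used only for illustration. Two behaviors in it are assumptions, because they are not described here: it treats -c combined with a non-pinned strategy as an error, and it treats pinned without a cluster as an error.

```shell
# Hypothetical helper that models how -cs and -c combine; not part of the sf CLI.

# resolve_strategy [STRATEGY] [CLUSTER]
# Prints the effective colocation strategy.
resolve_strategy() {
  strategy="${1:-}"
  cluster="${2:-}"

  if [ -n "$cluster" ]; then
    # Setting a cluster implies the pinned strategy.
    if [ -n "$strategy" ] && [ "$strategy" != "pinned" ]; then
      # Assumption: the real CLI's behavior for this combination is not documented here.
      echo "conflict: cluster given with strategy '$strategy'" >&2
      return 1
    fi
    echo "pinned"
    return 0
  fi

  case "$strategy" in
    "") echo "colocate-pinned" ;;   # documented default
    anywhere | colocate | colocate-pinned) echo "$strategy" ;;
    pinned)
      # Assumption: pinned needs a cluster, since the cluster is given via -c.
      echo "pinned requires a cluster" >&2
      return 1
      ;;
    *)
      echo "unknown strategy: $strategy" >&2
      return 1
      ;;
  esac
}
```

For example, resolve_strategy with no arguments prints colocate-pinned, and resolve_strategy "" alamo prints pinned.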

Here are some examples.

# start a procurement for 32 GPUs with the 'colocate' colocation strategy
sf scale create -n 32 -cs colocate

# start a procurement for 32 GPUs, pinned to alamo
sf scale create -n 32 -cs 'pinned' -c 'alamo'
# OR
sf scale create -n 32 -c 'alamo'

# start a procurement for 32 GPUs, doesn't matter where they end up
sf scale create -n 32 -cs 'anywhere'

At the moment, the colocation strategy can only be set when the procurement is first created. To change it, scale your existing procurement down to 0 and create a new one with the strategy you want.

This is not reserved capacity

It's important to note that this is not an indefinite reservation. That is, once you get some GPUs, there's no guarantee that you'll get to keep them until you turn them off. If the procurement can't buy the next hour for any reason (for example, all of the short-term capacity is sold out or prices are above your limit price), your compute time will end and your GPUs will be turned off.
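To get a feel for what this means in practice, here is a toy simulation. It assumes a simplified model that is not how the market actually works: one purchase attempt per hour, a made-up list of hourly prices, and the literal token sold-out for an hour with no capacity. hours_kept is a hypothetical name, not part of the sf CLI.

```shell
# Toy model, not part of the sf CLI: counts how many consecutive hourly
# purchases succeed before the first failure.

# hours_kept LIMIT PRICE...
# Each PRICE is a made-up hourly price, or the token "sold-out".
hours_kept() {
  limit="$1"
  shift
  kept=0
  for price in "$@"; do
    # No capacity available: the next hour can't be bought.
    if [ "$price" = "sold-out" ]; then
      break
    fi
    # Price above the limit: the procurement stops buying.
    if awk -v p="$price" -v l="$limit" 'BEGIN { exit !(p <= l) }'; then
      kept=$((kept + 1))
    else
      break
    fi
  done
  echo "$kept"
}
```

With a limit price of 1.50, hours_kept 1.50 1.20 1.40 1.60 1.30 prints 2: the third hour is priced above the limit, so the GPUs are lost even though the fourth hour would have been affordable.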