Large Scale Inference
SFC Large Scale Inference (LSI) provides large-scale asynchronous batch processing for multi-modal AI models. The API is designed to process trillions of tokens efficiently at a low price while maintaining high accuracy.
Key Concepts
Batch Processing
- Batch: A collection of inference requests packaged in a .tar file. The .tar file must be less than 5GB in size.
- Completion Window: The maximum time allowed for batch processing. User-configurable up to the S3 pre-signed URL expiry. We recommend setting longer completion windows.
File Handling
- To protect your privacy, SFC never stores your files: all data is read from and written to your choice of S3-compatible storage using pre-signed URLs (see the sketch after this list).
- Expected File Format: see the Input File Format and Output File Format sections below.
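For example, if your storage is S3, you can generate both pre-signed URLs with boto3. This is a minimal sketch with hypothetical bucket and key names; the 7-day expiry matches the recommendation below:

```python
import boto3

s3 = boto3.client("s3")
SEVEN_DAYS = 7 * 24 * 60 * 60  # seconds; the SigV4 pre-signed URL maximum

# URL that LSI reads your batch input from (bucket and keys are placeholders).
input_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-bucket", "Key": "batches/batch-input.tar.gz"},
    ExpiresIn=SEVEN_DAYS,
)

# URL that LSI writes your results to.
output_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "my-bucket", "Key": "batches/batch-output.tar.gz"},
    ExpiresIn=SEVEN_DAYS,
)
```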
Completion Windows
When you submit a batch, you must provide a completion_window parameter, which determines the maximum time allowed for processing. We recommend a completion window of 7 days or longer, as batches that cannot complete within the window at the given price will expire.
Getting Started
To create and manage jobs with our REST API, you'll need to install the sf CLI and create an API token.
Install and Log In to the CLI
curl -fsSL https://sfcompute.com/cli/install | bash
Source your shell profile to add the sf command to your PATH
source ~/.bashrc # For Bash
source ~/.zshrc # For Zsh
Log in to the CLI
sf login
Create a Batch Job with the API
Create an API token from the command line:
sf tokens create
You can reuse this token for future requests.
Finally, create a new batch job by passing this token as a Bearer token in the Authorization header:
curl -X POST https://api.sfcompute.com/v1/inference/batches \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
-d '{
"input_file_uri": "https://input-file-uri.com",
"output_file_uri": "https://output-file-uri.com",
"endpoint": "/v1/chat/completions",
"model_id": "Qwen/Qwen2.5-VL-32B-Instruct",
"completion_window": "7d",
"store": "s3"
}'
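The same request in Python, for scripted workflows (the URIs are placeholders, and the token comes from sf tokens create):

```python
import requests

token = "<token>"  # from `sf tokens create`

resp = requests.post(
    "https://api.sfcompute.com/v1/inference/batches",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "input_file_uri": "https://input-file-uri.com",
        "output_file_uri": "https://output-file-uri.com",
        "endpoint": "/v1/chat/completions",
        "model_id": "Qwen/Qwen2.5-VL-32B-Instruct",
        "completion_window": "7d",
        "store": "s3",
    },
)
resp.raise_for_status()
print(resp.json())
```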
LSI expects and produces a specific file structure for batches. Make sure your pre-signed URLs are set up to follow this format when creating your batch job.
You can find comprehensive examples for each endpoint in our API Reference.
Requirements and Recommendations
Pre-signed URL duration: If your pre-signed URLs expire, your associated batch jobs will fail. We recommend setting your URL expiry to 7 days to help prevent this.
Test format compliance: Run a small test job with a sample of your jobs.jsonl to validate that it follows the OpenAI API format before submitting larger workloads; see the sketch after this list.
File compression: Use gzip compression for .tar files to minimize upload and download times while staying under the 5GB size limit.
Rate limits: Keep requests to the LSI API under 1 request per second. Concurrent batches are limited by compute availability; excess batches will expire.
Batch size: For optimal pricing and performance, we recommend submitting ~48 million tokens per batch.
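Before submitting, a lightweight pre-flight check can catch most format problems locally. This is a sketch, not an official validator; it checks the required request fields and the same-model rule described under Input File Format:

```python
import json

def check_jobs(path="jobs.jsonl"):
    """Sanity-check a jobs.jsonl file before submission."""
    models = set()
    with open(path) as f:
        for n, line in enumerate(f, 1):
            if not line.strip():
                continue
            job = json.loads(line)  # raises if the line is not valid JSON
            for field in ("custom_id", "method", "url", "body"):
                assert field in job, f"line {n}: missing {field!r}"
            assert job["method"] == "POST", f"line {n}: method must be POST"
            models.add(job["body"]["model"])
    assert len(models) == 1, f"batch mixes models: {models}"

check_jobs()
```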
Input File Format
Input files must be compressed tar files (.tar.gz or .tar) with the following directory structure:
batch-input/
├── jobs.jsonl # Required: JSONL file with requests
└── files/ # Optional: Media files directory
├── image1.png
├── image2.jpg
└── subfolder/
└── image3.png
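One way to package a batch is with Python's tarfile module. A minimal sketch, assuming jobs.jsonl and a files/ directory already exist in the current directory; both are added at the archive root so that file:files/... references resolve:

```python
import tarfile

# Write a gzip-compressed archive with the required layout:
# jobs.jsonl at the root and media under files/.
with tarfile.open("batch-input.tar.gz", "w:gz") as tar:
    tar.add("jobs.jsonl")
    tar.add("files")
```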
jobs.jsonl Format
Each line contains a complete OpenAI-compatible request:
{
"custom_id": "request-001",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": "Qwen/Qwen2.5-VL-32B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image"
},
{
"type": "image_url",
"image_url": {
"url": "file:files/image1.png"
}
}
]
}
],
"max_tokens": 1000
}
}
Key Requirements:
- File references: use paths relative to the .tar root, with no leading slash after file: (e.g., file:files/image1.png)
- All requests in a batch must use the same model
- Maximum 5GB tar file size
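For instance, here is a jobs.jsonl generator that emits one request per image under files/, using the relative file: paths described above (the glob pattern and prompt are placeholders):

```python
import json
from pathlib import Path

# One chat-completion request per image, with custom_id used later to
# match each response back to its input.
with open("jobs.jsonl", "w") as out:
    for i, image in enumerate(sorted(Path("files").rglob("*.png")), 1):
        job = {
            "custom_id": f"request-{i:03d}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "Qwen/Qwen2.5-VL-32B-Instruct",
                "messages": [{
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Describe this image"},
                        # Relative to the .tar root, no leading slash:
                        {"type": "image_url",
                         "image_url": {"url": f"file:{image.as_posix()}"}},
                    ],
                }],
                "max_tokens": 1000,
            },
        }
        out.write(json.dumps(job) + "\n")
```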
Output File Format
Output files are compressed tar files with:
batch-output/
├── output.jsonl # Successful responses
└── error.jsonl # Failed requests (if any)
output.jsonl Format
{
"id": "batch_req_696ec8427763459fa409788746bda3e3",
"custom_id": "request-001",
"response": {
"status_code": 200,
"request_id": "request-001",
"body": {
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "This image shows...",
"role": "assistant"
}
}
],
"created": 1751329764,
"id": "a0eeea75457242c1b7ab5e07138e470c",
"model": "Qwen/Qwen2.5-VL-32B-Instruct",
"object": "chat.completion",
"usage": {
"completion_tokens": 100,
"prompt_tokens": 296,
"total_tokens": 396
}
}
},
"error": null
}
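Once you download the output archive, a sketch like the following unpacks it and matches responses back to their custom_id values (the archive name is a placeholder; error.jsonl records are kept whole since their exact shape isn't shown here):

```python
import json
import tarfile

results, errors = {}, {}
with tarfile.open("batch-output.tar.gz", "r:gz") as tar:
    for member in tar.getmembers():
        name = member.name.rsplit("/", 1)[-1]
        if name not in ("output.jsonl", "error.jsonl"):
            continue
        for line in tar.extractfile(member):
            rec = json.loads(line)
            (results if name == "output.jsonl" else errors)[rec["custom_id"]] = rec

# Example: the assistant's reply for the first request.
msg = results["request-001"]["response"]["body"]["choices"][0]["message"]
print(msg["content"])
```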
Processing Video
File Structure for Video Batch
batch-input/
├── jobs.jsonl
└── files/
└── sample_video.mp4
LSI can accept videos as input. Simply replace your image_url field with video_url inside your jobs.jsonl. Each video should be placed in the files/ folder inside your .tar.gz input archive.
Sample jobs.jsonl Entry for Video
{
"custom_id": "request-video-001",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": "Qwen/Qwen2.5-VL-32B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is happening in this video?"
},
{
"type": "video_url",
"video_url": {
"url": "file:files/sample_video.mp4"
}
}
]
}
],
"max_tokens": 1000
}
}
Other Use Cases and Models
LSI is designed for large-scale, mostly enterprise, use cases. That lets us be more hands-on than traditional, self-serve providers. If you are interested in features not documented here, please contact us. Between Modular's world-class engineering and SFC's dramatic price optimization, we'll work with you to get the best possible price and performance.