Kubernetes
Get started with a multi-machine prover that can run with as many GPUs as you can provision.
The SP1 Cluster is the official multi-GPU prover service implementation for generating SP1 proofs on the Succinct Prover Network. It can coordinate proof generation across tens to hundreds of GPU nodes.
This page explains how to set up the SP1 Cluster using Kubernetes and generate some basic proofs using command-line tools. There are several ways to deploy the SP1 Cluster; Kubernetes is the best option for practical, production-ready deployments that need the best performance. For a simpler deployment workflow, please refer to the Docker Compose installation guide. For an overview of cluster components and how they interact, see the Architecture page.
Prerequisites
SP1 Cluster runs on Linux and has the following software requirements for each worker on the cluster:
- Kubernetes or RKE2 (Recommended)
- CUDA 12
- NVIDIA Container Toolkit
For the machine you're using to connect to the Kubernetes cluster, you'll need:
- kubectl
- Helm
Hardware Requirements
The hardware requirements for running SP1 Cluster depend on the node configuration and can change over time as the prover changes or new features are implemented.
| Component | Minimum Requirements | Recommended Requirements | Notes |
|---|---|---|---|
| 1x+ CPU Machines | • ≥40 GB RAM • >30 GB disk space • High single clock speed | • ≥64 GB DDR5 RAM • >30 GB disk space • High single clock speed • High core count | • High single clock speed is important for optimal VM emulation • High core count helps reduce Groth16/Plonk proving latency • DDR5 RAM is recommended for better proving performance |
| 1x+ GPU Machines | • ≥16 GB RAM • 1x NVIDIA GPU with: • ≥24 GB VRAM • CUDA Compute Capability ≥8.6 | • ≥32 GB DDR5 RAM per GPU • Multiple supported GPUs: • GeForce RTX 5090/4090 (Best performance) • NVIDIA L4 • NVIDIA A10G • NVIDIA A5000/A6000 | • Multiple GPUs are supported on a single machine • Each GPU requires a separate instance of the GPU node binary • DDR5 RAM is recommended for proving performance |
See the FAQ for detailed hardware recommendations.
Note that if you are using RTX 5090s, there is currently a driver bug that requires you to set the environment variable `MOONGATE_DISABLE_GRIND_DEVICE=true` on the GPU node to enable a workaround. Without this, proving performance is greatly reduced.
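If your values file exposes an environment-variable list for the GPU node service, the workaround might be set there rather than on the raw container. The key names below (`gpuNode`, `env`) are assumptions for illustration; check `infra/charts/sp1-cluster/values-example.yaml` for the chart's actual schema.

```yaml
# Hypothetical values-file snippet -- the gpuNode/env key names are
# assumptions; consult values-example.yaml for the real structure.
gpuNode:
  env:
    - name: MOONGATE_DISABLE_GRIND_DEVICE
      value: "true"
```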
Helm Chart
We use a Helm chart to manage the Kubernetes applications. An example values file for the chart is provided at infra/charts/sp1-cluster/values-example.yaml.
The chart orchestrates all the components described in the Architecture page, plus optional Fulfiller and Bidder services for Prover Network integration. It can be used to configure the hardware requirements and the number of replicas you want per service.
Setup
In this section, we'll walk through the steps needed to set up a basic proving service using Helm.
1. Clone the repo
git clone https://github.com/succinctlabs/sp1-cluster.git
cd sp1-cluster
2. Ensure kubectl is connected to your cluster
kubectl cluster-info
kubectl get nodes
3. Create a k8s namespace
kubectl create namespace sp1-cluster-test
4. Create k8s secrets
Create a secret called cluster-secrets that configures the postgres database, the redis database, and the private key used for the prover.
kubectl create secret generic cluster-secrets \
--from-literal=DATABASE_URL=postgresql://postgres:postgrespassword@postgresql:5432/postgres \
--from-literal=REDIS_NODES=redis://:redispassword@redis-master:6379/0 \
--from-literal=PRIVATE_KEY=<PROVER_SIGNER_KEY> \
-n sp1-cluster-test
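If you prefer declarative manifests over `kubectl create secret`, an equivalent Secret can be applied from YAML. This sketch uses the same example credentials as the command above; `stringData` lets you write the values unencoded, and Kubernetes base64-encodes them on admission.

```yaml
# Declarative equivalent of the cluster-secrets secret above.
# Apply with: kubectl apply -f cluster-secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: cluster-secrets
  namespace: sp1-cluster-test
type: Opaque
stringData:
  DATABASE_URL: postgresql://postgres:postgrespassword@postgresql:5432/postgres
  REDIS_NODES: redis://:redispassword@redis-master:6379/0
  PRIVATE_KEY: <PROVER_SIGNER_KEY>
```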
Create a secret called ghcr-secret that can allow you to pull private images from ghcr.io.
kubectl create secret docker-registry ghcr-secret \
--docker-server=ghcr.io \
--docker-username=$GITHUB_USERNAME \
--docker-password=$GITHUB_TOKEN \
-n sp1-cluster-test
5. Configure the Helm chart
The template values file for the cluster lives at infra/charts/sp1-cluster/values-example.yaml. Copy it and configure it to your liking.
cp infra/charts/sp1-cluster/values-example.yaml infra/charts/sp1-cluster/values-test.yaml
In particular, you may want to adjust the resources and node placement values based on your cluster hardware. We recommend the following configuration constraints:
- Redis: placed on a non-worker machine with >= 20 GB RAM.
- API, Postgres, Fulfiller: placed on non-worker machine.
- CPU Workers: allocated >= 32 GB RAM, 10 GB disk, and powerful CPU with as many cores as possible.
- GPU Workers: allocated 1 GPU, >= 24 GB RAM, and as many cores as possible.
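As an illustration of how the constraints above might translate into values, per-service resource and placement overrides in a Helm values file typically look like the sketch below. The service key name (`cpuNode`) and the node label used for placement are assumptions here; match them against the actual schema in `values-example.yaml` and your own node labels.

```yaml
# Hypothetical per-service override implementing the CPU worker constraint
# (>= 32 GB RAM, many cores). Key names are assumptions; see values-example.yaml.
cpuNode:
  resources:
    requests:
      cpu: "16"
      memory: 32Gi
    limits:
      memory: 48Gi
  nodeSelector:
    node-role/worker-cpu: "true"   # assumed label; use your cluster's own
```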
6. Setup Helm chart dependencies
helm dependency update infra/charts/redis-store
helm dependency update infra/charts/sp1-cluster
7. Create (or redeploy) the cluster
helm upgrade --install my-sp1-cluster infra/charts/sp1-cluster \
-f infra/charts/sp1-cluster/values-test.yaml \
-n sp1-cluster-test \
--debug
8. Verify pods are healthy
kubectl get pods -n sp1-cluster-test
Wait for all pods to show Running (may take 1-2 minutes): api, coordinator, cpu-node, gpu-node, postgresql, and redis-master.
The API pod may show CrashLoopBackOff briefly while PostgreSQL is still starting up. It resolves automatically once PostgreSQL is ready — just wait and re-check.
9. Send a test 5M cycle Fibonacci proof
Run a temporary CLI pod that executes the benchmark directly:
kubectl run cli -n sp1-cluster-test --rm -it \
--image=ghcr.io/succinctlabs/sp1-cluster:base-latest \
--env="RUST_LOG=info" \
--env="CLI_CLUSTER_RPC=http://api-grpc:50051" \
--env="CLI_REDIS_NODES=redis://:redispassword@redis-master:6379/0" \
-- /cli bench fibonacci 5
To debug interactively inside the pod, replace the `-- /cli bench ...` part with `-- /bin/bash`.
Expected output:
INFO crates/common/src/logger.rs:110: logging initialized
INFO bin/cli/src/commands/bench.rs:68: Running Fibonacci Compressed for 5 million cycles...
INFO crates/common/src/client.rs:22: connecting to http://api-grpc:50051
INFO bin/cli/src/commands/bench.rs:113: using redis artifact store
INFO crates/artifact/src/redis.rs:38: initializing redis pool
INFO serialize: crates/artifact/src/lib.rs:126: close time.busy=15.5µs time.idle=267µs
INFO upload: crates/artifact/src/redis.rs:196: close time.busy=1.56ms time.idle=3.03ms artifact_type=Program id="artifact_01jxzm994ke78shjk272egp5vt"
INFO upload: crates/artifact/src/redis.rs:196: close time.busy=355µs time.idle=492µs artifact_type=Stdin id="artifact_01jxzm994rf3hve7yrfgg43t0w"
INFO bin/cli/src/commands/bench.rs:146: proof_id: cli_1750186894489
INFO bin/cli/src/commands/bench.rs:185: Proof request completed after 3.016538313s
INFO bin/cli/src/commands/bench.rs:187: Aggregate MHz: 1.66
Prover Network Integration
This section walks through the steps necessary to integrate with the Succinct Prover Network, enabling you to fulfill proofs on behalf of requesters and earn fees.
0. Onchain Registration
You should have already registered your prover onchain. If you haven't, please refer to the Introduction for more information.
You should also have added a signer address to your prover so that its private key can sign transactions on the prover's behalf. Set this private key in the k8s secret entries used for FULFILLER_PRIVATE_KEY and BIDDER_PRIVATE_KEY.
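Assuming the fulfiller and bidder read these keys from the same cluster-secrets secret created earlier (verify this against your values file; it is an assumption here), one way to provide them is to extend that Secret with two extra entries:

```yaml
# Additional stringData entries for the cluster-secrets Secret.
# Whether the fulfiller/bidder read from this particular secret is an
# assumption -- check the chart's values file for the secret it references.
stringData:
  FULFILLER_PRIVATE_KEY: <PROVER_SIGNER_KEY>
  BIDDER_PRIVATE_KEY: <PROVER_SIGNER_KEY>
```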
1. Run the fulfiller
Inside the Helm chart, update the fulfiller service's enabled attribute to true.
fulfiller:
enabled: true
...
Then upgrade the deployment:
helm upgrade --install my-sp1-cluster infra/charts/sp1-cluster \
-f infra/charts/sp1-cluster/values-test.yaml \
-n sp1-cluster-test \
--debug
2. Run the bidder
Inside the Helm chart, update the bidder service's enabled attribute to true.
bidder:
enabled: true
...
Then upgrade the deployment:
helm upgrade --install my-sp1-cluster infra/charts/sp1-cluster \
-f infra/charts/sp1-cluster/values-test.yaml \
-n sp1-cluster-test \
--debug
Troubleshooting
Pod stuck in Pending
Inspect the pod events for scheduling failures:
kubectl describe pod <pod-name> -n sp1-cluster-test
Common causes:
- Insufficient resources — the node doesn't have enough CPU/memory/GPU for the pod's requests.
- GPU pods without NVIDIA plugin — verify with `kubectl get pods -n kube-system | grep nvidia`. If missing, install the NVIDIA device plugin.
- Taint/toleration mismatch — GPU pods need tolerations for the GPU node taint.
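If your GPU nodes carry a taint, the GPU worker pods need a matching toleration in the values file. The taint key below (`nvidia.com/gpu`) is a common convention but an assumption here; run `kubectl describe node <gpu-node>` to find the taint actually set on your nodes.

```yaml
# Hypothetical toleration for a GPU node taint; the key and effect must
# match the taint on your GPU nodes (see `kubectl describe node`).
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```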
Pod in CrashLoopBackOff
Check logs from the previous crash:
kubectl logs <pod-name> -n sp1-cluster-test --previous
Common causes:
- API can't reach PostgreSQL — the API pod often crashes until PostgreSQL is fully ready. Wait 1-2 minutes and it should stabilize.
- Wrong database URL — verify the `DATABASE_URL` in `cluster-secrets` matches your PostgreSQL password.
- Redis unreachable — verify `REDIS_NODES` in `cluster-secrets` and that the `redis-master` pod is running.
Pod in ImagePullBackOff
Check the pod events for image pull errors:
kubectl describe pod <pod-name> -n sp1-cluster-test
Common causes:
- `ghcr-secret` missing or invalid — verify your GitHub PAT has `read:packages` scope.
- Wrong image name — if `bitnami/postgresql` fails to pull, try changing to `bitnamilegacy/postgresql` in your values file.
Proof request stuck or CLI silent
Check the coordinator and worker logs for errors:
kubectl logs -l app=coordinator -n sp1-cluster-test
kubectl logs -l app=cpu-node -n sp1-cluster-test
kubectl logs -l app=gpu-node -n sp1-cluster-test
Common causes:
- Missing `RUST_LOG` — the CLI requires `RUST_LOG=info` (or `debug`) to produce any output. Without it, the CLI runs silently.
- Coordinator not assigning tasks — check the coordinator logs above for errors during task decomposition.
- Redis unreachable from workers — workers need Redis to exchange intermediate artifacts. Verify the `redis-master` pod is running.
Helm deploy fails
Check the Helm output for template errors:
helm upgrade --install my-sp1-cluster infra/charts/sp1-cluster \
-f infra/charts/sp1-cluster/values-test.yaml \
-n sp1-cluster-test \
--debug --dry-run
Common causes:
- "Original containers have been substituted for unrecognized ones" — the Bitnami Redis sub-chart uses legacy image signatures that Helm rejects by default. Add the following to your values file:

  global:
    security:
      allowInsecureImages: true

- Template rendering errors after editing charts — re-run `helm dependency update infra/charts/sp1-cluster` before deploying.