Version: v6 (Hypercube)

Kubernetes

Get started with a multi-machine prover that can run with as many GPUs as you can provision.

The SP1 Cluster is the official multi-GPU prover service implementation for generating SP1 proofs on the Succinct Prover Network. It can coordinate proof generation across tens to hundreds of GPU nodes.

This page explains how to set up the SP1 Cluster using Kubernetes and generate some basic proofs using command line tools. There are several ways to deploy SP1 Cluster; Kubernetes is the recommended option for practical, production-ready deployments that need the best performance. For a simpler deployment workflow, refer to the Docker Compose installation guide. For an overview of cluster components and how they interact, see the Architecture page.

Prerequisites

SP1 Cluster runs on Linux and has the following software requirements for each worker on the cluster:

For the machine you're using to connect to the Kubernetes cluster, you'll need:

Hardware Requirements

The hardware requirements for running SP1 Cluster depend on the node configuration and can change over time as the prover changes or new features are implemented.

The requirements below are listed per component type (minimum, recommended, and notes):

1x+ CPU Machines
  • Minimum: ≥40 GB RAM; >30 GB disk space; high single clock speed
  • Recommended: ≥64 GB DDR5 RAM; >30 GB disk space; high single clock speed; high core count
  • Notes: high single clock speed is important for optimal VM emulation; high core count helps reduce Groth16/Plonk proving latency; DDR5 RAM is recommended for better proving performance

1x+ GPU Machines
  • Minimum: ≥16 GB RAM; 1x NVIDIA GPU with ≥24 GB VRAM and CUDA Compute Capability ≥8.6
  • Recommended: ≥32 GB DDR5 RAM per GPU; one or more supported GPUs: GeForce RTX 5090/4090 (best performance), NVIDIA L4, NVIDIA A10G, NVIDIA A5000/A6000
  • Notes: multiple GPUs are supported on a single machine; each GPU requires a separate instance of the GPU node binary; DDR5 RAM is recommended for proving performance

See the FAQ for detailed hardware recommendations.

warning

If you are running RTX 5090s, a current driver bug requires setting the environment variable MOONGATE_DISABLE_GRIND_DEVICE=true on the GPU node to enable a workaround. Without it, proving performance is greatly reduced.
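As a sketch, assuming the GPU node's environment variables are set through an env map in the chart's values file (the key names here are illustrative; check values-example.yaml for the actual structure), the workaround might look like:

```yaml
# Illustrative only -- the actual key layout is defined in
# infra/charts/sp1-cluster/values-example.yaml.
gpuNode:
  env:
    MOONGATE_DISABLE_GRIND_DEVICE: "true"
```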

Helm Chart

We use a Helm Chart to manage the Kubernetes applications. The chart lives at infra/charts/sp1-cluster, with an example values file at infra/charts/sp1-cluster/values-example.yaml.

The chart orchestrates all the components described in the Architecture page, plus optional Fulfiller and Bidder services for Prover Network integration. It can be used to configure the hardware requirements and the number of replicas you want per service.
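To give a feel for that configuration surface, a hedged sketch of a values fragment scaling services up (key names are illustrative; the authoritative layout is in values-example.yaml):

```yaml
# Illustrative values fragment -- match the real keys in
# infra/charts/sp1-cluster/values-example.yaml.
cpuNode:
  replicas: 2
  resources:
    requests:
      memory: "32Gi"
gpuNode:
  replicas: 4
```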

Setup

In this section, we'll walk through the steps needed to set up a basic proving service using Helm.

1. Clone the repo

git clone https://github.com/succinctlabs/sp1-cluster.git
cd sp1-cluster

2. Ensure kubectl is connected to your cluster

kubectl cluster-info
kubectl get nodes

3. Create a k8s namespace

kubectl create namespace sp1-cluster-test

4. Create k8s secrets

Create a secret called cluster-secrets that configures the postgres database, the redis database, and the private key used for the prover.

kubectl create secret generic cluster-secrets \
--from-literal=DATABASE_URL=postgresql://postgres:postgrespassword@postgresql:5432/postgres \
--from-literal=REDIS_NODES=redis://:redispassword@redis-master:6379/0 \
--from-literal=PRIVATE_KEY=<PROVER_SIGNER_KEY> \
-n sp1-cluster-test
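If you change the default passwords, the connection strings must be updated to match. A minimal sketch of how the two URLs are composed (the hostnames postgresql and redis-master are the in-cluster service names used above):

```shell
# Compose the connection strings from their parts so a password
# change only needs to be made in one place.
PG_PASSWORD=postgrespassword
REDIS_PASSWORD=redispassword

DATABASE_URL="postgresql://postgres:${PG_PASSWORD}@postgresql:5432/postgres"
REDIS_NODES="redis://:${REDIS_PASSWORD}@redis-master:6379/0"

echo "$DATABASE_URL"
echo "$REDIS_NODES"
```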

Create a secret called ghcr-secret that allows you to pull private images from ghcr.io.

kubectl create secret docker-registry ghcr-secret \
--docker-server=ghcr.io \
--docker-username=$GITHUB_USERNAME \
--docker-password=$GITHUB_TOKEN \
-n sp1-cluster-test

5. Configure the Helm chart

The template values file for the cluster chart lives at infra/charts/sp1-cluster/values-example.yaml. Copy it and configure it to your liking.

cp infra/charts/sp1-cluster/values-example.yaml infra/charts/sp1-cluster/values-test.yaml

In particular, you may want to adjust the resources and node placement values based on your cluster hardware. We recommend the following configuration constraints:

  • Redis: placed on a non-worker machine with ≥20 GB RAM.
  • API, Postgres, Fulfiller: placed on a non-worker machine.
  • CPU Workers: allocated ≥32 GB RAM, 10 GB disk, and a powerful CPU with as many cores as possible.
  • GPU Workers: allocated 1 GPU, ≥24 GB RAM, and as many cores as possible.
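Those constraints translate into per-service resources and placement settings in the values file. A hedged sketch, with illustrative key names and a standard Kubernetes GPU request (check values-example.yaml for the real schema):

```yaml
# Illustrative only -- match keys to values-example.yaml.
cpuNode:
  resources:
    requests:
      memory: "32Gi"
  nodeSelector:
    node-role: cpu-worker
gpuNode:
  resources:
    requests:
      memory: "24Gi"
    limits:
      nvidia.com/gpu: 1   # one GPU per GPU-node instance
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```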

6. Setup Helm chart dependencies

helm dependency update infra/charts/redis-store
helm dependency update infra/charts/sp1-cluster

7. Create (or redeploy) the cluster

helm upgrade --install my-sp1-cluster infra/charts/sp1-cluster \
-f infra/charts/sp1-cluster/values-test.yaml \
-n sp1-cluster-test \
--debug

8. Verify pods are healthy

kubectl get pods -n sp1-cluster-test

Wait for all pods to show Running (may take 1-2 minutes): api, coordinator, cpu-node, gpu-node, postgresql, and redis-master.

note

The API pod may show CrashLoopBackOff briefly while PostgreSQL is still starting up. It resolves automatically once PostgreSQL is ready — just wait and re-check.
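Rather than re-running kubectl get pods by hand, you can block until every pod in the namespace is Ready (this requires a live cluster context, so it is shown as a sketch):

```shell
# Block until all pods report Ready, waiting up to 5 minutes.
kubectl wait --for=condition=Ready pod --all \
  -n sp1-cluster-test --timeout=300s
```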

9. Send a test 5M cycle Fibonacci proof

Run a temporary CLI pod that executes the benchmark directly:

kubectl run cli -n sp1-cluster-test --rm -it \
--image=ghcr.io/succinctlabs/sp1-cluster:base-latest \
--env="RUST_LOG=info" \
--env="CLI_CLUSTER_RPC=http://api-grpc:50051" \
--env="CLI_REDIS_NODES=redis://:redispassword@redis-master:6379/0" \
-- /cli bench fibonacci 5
tip

To debug interactively inside the pod, replace the -- /cli bench ... part with -- /bin/bash.

Expected output:

INFO crates/common/src/logger.rs:110: logging initialized
INFO bin/cli/src/commands/bench.rs:68: Running Fibonacci Compressed for 5 million cycles...
INFO crates/common/src/client.rs:22: connecting to http://api-grpc:50051
INFO bin/cli/src/commands/bench.rs:113: using redis artifact store
INFO crates/artifact/src/redis.rs:38: initializing redis pool
INFO serialize: crates/artifact/src/lib.rs:126: close time.busy=15.5µs time.idle=267µs
INFO upload: crates/artifact/src/redis.rs:196: close time.busy=1.56ms time.idle=3.03ms artifact_type=Program id="artifact_01jxzm994ke78shjk272egp5vt"
INFO upload: crates/artifact/src/redis.rs:196: close time.busy=355µs time.idle=492µs artifact_type=Stdin id="artifact_01jxzm994rf3hve7yrfgg43t0w"
INFO bin/cli/src/commands/bench.rs:146: proof_id: cli_1750186894489
INFO bin/cli/src/commands/bench.rs:185: Proof request completed after 3.016538313s
INFO bin/cli/src/commands/bench.rs:187: Aggregate MHz: 1.66

Prover Network Integration

This section walks through the steps necessary to integrate with the Succinct Prover Network, enabling you to fulfill proofs on behalf of requesters and earn fees.

0. Onchain Registration

You should have already registered your prover onchain. If you haven't, please refer to the Introduction for more information.

You should also have added a signer address to your prover so that its private key can sign transactions on the prover's behalf. Set this private key in the k8s secret used for FULFILLER_PRIVATE_KEY and BIDDER_PRIVATE_KEY.
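One way to add those keys is to recreate cluster-secrets with the extra entries (this sketch assumes the chart reads both keys from that same secret; adjust if your values file points elsewhere):

```shell
# Recreate cluster-secrets with the signer keys added.
# Assumption: FULFILLER_PRIVATE_KEY and BIDDER_PRIVATE_KEY are read
# from the same secret created during setup.
kubectl delete secret cluster-secrets -n sp1-cluster-test
kubectl create secret generic cluster-secrets \
  --from-literal=DATABASE_URL=postgresql://postgres:postgrespassword@postgresql:5432/postgres \
  --from-literal=REDIS_NODES=redis://:redispassword@redis-master:6379/0 \
  --from-literal=PRIVATE_KEY=<PROVER_SIGNER_KEY> \
  --from-literal=FULFILLER_PRIVATE_KEY=<PROVER_SIGNER_KEY> \
  --from-literal=BIDDER_PRIVATE_KEY=<PROVER_SIGNER_KEY> \
  -n sp1-cluster-test
```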

1. Run the fulfiller

Inside the Helm chart, update the fulfiller service's enabled attribute to true.

fulfiller:
  enabled: true
  ...

Then upgrade the deployment:

helm upgrade --install my-sp1-cluster infra/charts/sp1-cluster \
-f infra/charts/sp1-cluster/values-test.yaml \
-n sp1-cluster-test \
--debug

2. Run the bidder

Inside the Helm chart, update the bidder service's enabled attribute to true.

bidder:
  enabled: true
  ...

Then upgrade the deployment:

helm upgrade --install my-sp1-cluster infra/charts/sp1-cluster \
-f infra/charts/sp1-cluster/values-test.yaml \
-n sp1-cluster-test \
--debug

Troubleshooting

Pod stuck in Pending

Inspect the pod events for scheduling failures:

kubectl describe pod <pod-name> -n sp1-cluster-test

Common causes:

  • Insufficient resources — node doesn't have enough CPU/memory/GPU for the pod's requests.
  • GPU pods without NVIDIA plugin — verify with kubectl get pods -n kube-system | grep nvidia. If missing, install the NVIDIA device plugin.
  • Taint/toleration mismatch — GPU pods need tolerations for the GPU node taint.
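If the device plugin is missing, one common route is applying the static manifest from the NVIDIA/k8s-device-plugin repository. The version below is a placeholder; take the current release tag from that repository rather than copying it verbatim:

```shell
# Replace v0.x.y with a current release tag from the
# NVIDIA/k8s-device-plugin GitHub releases page.
kubectl create -f \
  https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.x.y/deployments/static/nvidia-device-plugin.yml
```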

Pod in CrashLoopBackOff

Check logs from the previous crash:

kubectl logs <pod-name> -n sp1-cluster-test --previous

Common causes:

  • API can't reach PostgreSQL — the API pod often crashes until PostgreSQL is fully ready. Wait 1-2 minutes and it should stabilize.
  • Wrong database URL — verify the DATABASE_URL in cluster-secrets matches your PostgreSQL password.
  • Redis unreachable — verify REDIS_NODES in cluster-secrets and that the redis-master pod is running.

Pod in ImagePullBackOff

Check the pod events for image pull errors:

kubectl describe pod <pod-name> -n sp1-cluster-test

Common causes:

  • ghcr-secret missing or invalid — verify your GitHub PAT has read:packages scope.
  • Wrong image name — if bitnami/postgresql fails to pull, try changing to bitnamilegacy/postgresql in your values file.

Proof request stuck or CLI silent

Check the coordinator and worker logs for errors:

kubectl logs -l app=coordinator -n sp1-cluster-test
kubectl logs -l app=cpu-node -n sp1-cluster-test
kubectl logs -l app=gpu-node -n sp1-cluster-test

Common causes:

  • Missing RUST_LOG — the CLI requires RUST_LOG=info (or debug) to produce any output. Without it, the CLI runs silently.
  • Coordinator not assigning tasks — check coordinator logs above for errors during task decomposition.
  • Redis unreachable from workers — workers need Redis to exchange intermediate artifacts. Verify redis-master pod is running.

Helm deploy fails

Check the Helm output for template errors:

helm upgrade --install my-sp1-cluster infra/charts/sp1-cluster \
-f infra/charts/sp1-cluster/values-test.yaml \
-n sp1-cluster-test \
--debug --dry-run

Common causes:

  • "Original containers have been substituted for unrecognized ones" — the Bitnami Redis sub-chart uses legacy image signatures that Helm rejects by default. Add the following to your values file:

    global:
      security:
        allowInsecureImages: true
  • Template rendering errors after editing charts — re-run helm dependency update infra/charts/sp1-cluster before deploying.