Retry rules configure how Ctrlplane handles failed jobs. You can control the number of retry attempts, which failure types trigger retries, and the backoff strategy between attempts.

Overview

Why Use Retry Rules?

Retry rules help you:
  • Handle transient failures - Automatically recover from temporary issues
  • Reduce manual intervention - Let the system retry before alerting
  • Configure per-environment - More retries in dev, fewer in production
  • Control retry behavior - Set backoff strategies to avoid a thundering herd of simultaneous retries

Configuration

curl -X POST https://api.ctrlplane.com/v1/workspaces/{workspaceId}/policies \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Retry on Failure",
    "selector": "environment.name == '\''production'\''",
    "rules": [
      {
        "retry": {
          "maxRetries": 3
        }
      }
    ]
  }'
Retry rules are not yet supported in the Terraform provider. Use the REST API to configure retry behavior.

Properties

retry.maxRetries (integer, required)
Maximum retry attempts. 0 means no retries (1 attempt total), 3 means up to 4 attempts (1 initial + 3 retries).
retry.retryOnStatuses (array)
Job statuses that trigger a retry. Defaults to ["failure", "invalidIntegration", "invalidJobAgent"] when maxRetries > 0. When maxRetries = 0, also includes "successful" to enforce deploy-once semantics.
retry.backoffSeconds (integer, default: 0)
Seconds to wait between retry attempts. If not set, retries are allowed immediately after job completion.
retry.backoffStrategy (string, default: "linear")
Backoff strategy: linear (constant delay) or exponential (doubling delay with each retry using backoffSeconds * 2^(attempt-1)).
retry.maxBackoffSeconds (integer)
Maximum backoff cap in seconds (for exponential backoff). If not set, no maximum is enforced.
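
The way the backoff properties above interact can be sketched in a few lines of Python. This is an illustration of the documented formula, not Ctrlplane's implementation: the function name backoff_delay and its parameter names are hypothetical, and it assumes that "attempt" in the formula counts retries starting at 1 and that the cap applies only to exponential backoff.

```python
def backoff_delay(retry_number, backoff_seconds=0, strategy="linear",
                  max_backoff_seconds=None):
    """Seconds to wait before a given retry (1 = first retry).

    Hypothetical helper mirroring the documented properties; the real
    evaluation happens inside Ctrlplane.
    """
    if retry_number < 1:
        raise ValueError("retry_number starts at 1")
    if strategy == "linear":
        # Linear: the same delay before every retry.
        return backoff_seconds
    if strategy != "exponential":
        raise ValueError(f"unknown backoffStrategy: {strategy!r}")
    # Exponential: backoffSeconds * 2^(attempt-1).
    delay = backoff_seconds * 2 ** (retry_number - 1)
    # Assumption: the cap only affects exponential backoff.
    if max_backoff_seconds is not None:
        delay = min(delay, max_backoff_seconds)
    return delay
```

For example, backoff_delay(3, 10, "exponential") evaluates 10 * 2^2, and a 300-second cap turns what would be a 320-second sixth delay into 300 seconds.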

Job Statuses

The following job statuses can be used in retryOnStatuses:
Status               Description
failure              Job failed during execution
successful           Job completed successfully
cancelled            Job was manually cancelled
skipped              Job was skipped
invalidIntegration   Integration configuration error
invalidJobAgent      Job agent configuration error
Cancelled and skipped jobs never count toward the retry limit by default, allowing redeployment after manual cancellation.
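
The default for retryOnStatuses depends on maxRetries, which is easy to misread. The following Python sketch restates the documented defaults; the function names are hypothetical and only model the rule as described here, not Ctrlplane's internals.

```python
def default_retry_on_statuses(max_retries):
    """Default retryOnStatuses for a given maxRetries, as documented."""
    statuses = ["failure", "invalidIntegration", "invalidJobAgent"]
    if max_retries == 0:
        # Deploy-once: a successful job also uses up the single attempt.
        statuses.append("successful")
    return statuses


def counts_toward_limit(job_status, max_retries, retry_on_statuses=None):
    """True if a job with this status counts toward the retry limit."""
    if retry_on_statuses is None:
        retry_on_statuses = default_retry_on_statuses(max_retries)
    return job_status in retry_on_statuses
```

Under the defaults, counts_toward_limit("cancelled", 3) is False, which matches the note that cancelled jobs can be redeployed without consuming a retry.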

Common Patterns

Basic Retry

Retry failed jobs up to 3 times:
{
  "retry": {
    "maxRetries": 3
  }
}

Retry with Backoff

Wait between retry attempts:
{
  "retry": {
    "maxRetries": 3,
    "backoffSeconds": 30
  }
}

Exponential Backoff

Increase wait time with each retry:
{
  "retry": {
    "maxRetries": 5,
    "backoffSeconds": 10,
    "backoffStrategy": "exponential",
    "maxBackoffSeconds": 300
  }
}
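With this configuration, the five retries wait 10s → 20s → 40s → 80s → 160s. The 300s cap is not reached here; it would only take effect from a sixth retry, where 320s would be capped to 300s.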

No Retries (Deploy-Once)

Disable retries for critical deployments. When maxRetries is 0, the default retryOnStatuses also includes "successful", enforcing deploy-once semantics:
{
  "name": "No Retry Production",
  "selector": "environment.name == 'production'",
  "rules": [
    {
      "retry": { "maxRetries": 0 }
    }
  ]
}

Retry Specific Statuses

Only retry on specific failure types:
{
  "retry": {
    "maxRetries": 3,
    "retryOnStatuses": ["failure", "invalidIntegration"],
    "backoffSeconds": 60
  }
}

Environment-Specific Retry

Different retry behavior per environment:
[
  {
    "name": "Dev Retry",
    "selector": "environment.name == 'development'",
    "rules": [
      { "retry": { "maxRetries": 5, "backoffSeconds": 5 } }
    ]
  },
  {
    "name": "Staging Retry",
    "selector": "environment.name == 'staging'",
    "rules": [
      { "retry": { "maxRetries": 3, "backoffSeconds": 30 } }
    ]
  },
  {
    "name": "Production Retry",
    "selector": "environment.name == 'production'",
    "rules": [
      {
        "retry": {
          "maxRetries": 2,
          "backoffSeconds": 60,
          "backoffStrategy": "exponential"
        }
      }
    ]
  }
]
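
Because the Terraform provider does not yet support retry rules, per-environment policies like these are typically created by script. The Python sketch below builds one POST request per policy against the endpoint used in the curl example. It assumes each policy is sent in its own request, as in that example; the workspace ID and token values are placeholders, and the helper names are hypothetical. Nothing is sent until each request is passed to urllib.request.urlopen.

```python
import json
import urllib.request

API_BASE = "https://api.ctrlplane.com/v1"  # base URL from the curl example

# (policy name, environment name, retry settings) from the JSON above.
ENVIRONMENT_RETRIES = [
    ("Dev Retry", "development", {"maxRetries": 5, "backoffSeconds": 5}),
    ("Staging Retry", "staging", {"maxRetries": 3, "backoffSeconds": 30}),
    ("Production Retry", "production",
     {"maxRetries": 2, "backoffSeconds": 60, "backoffStrategy": "exponential"}),
]


def build_policy(name, environment, retry):
    """Assemble one policy body in the shape shown in this document."""
    return {
        "name": name,
        "selector": f"environment.name == '{environment}'",
        "rules": [{"retry": retry}],
    }


def build_request(workspace_id, token, policy):
    """Prepare (but do not send) the POST request for one policy."""
    return urllib.request.Request(
        f"{API_BASE}/workspaces/{workspace_id}/policies",
        data=json.dumps(policy).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder credentials; replace with real values before sending.
requests_to_send = [
    build_request("my-workspace-id", "example-token", build_policy(*row))
    for row in ENVIRONMENT_RETRIES
]
```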

Backoff Strategies

Linear Backoff

Constant wait time between retries. For example, with backoffSeconds set to 30:
Attempt 1: immediate
Attempt 2: wait 30s
Attempt 3: wait 30s
Attempt 4: wait 30s

Exponential Backoff

Doubling wait time with each retry. For example, with backoffSeconds set to 10:
Attempt 1: immediate
Attempt 2: wait 10s  (10 * 2^0)
Attempt 3: wait 20s  (10 * 2^1)
Attempt 4: wait 40s  (10 * 2^2)
Attempt 5: wait 80s  (10 * 2^3)
Use maxBackoffSeconds to cap the maximum wait time.
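
To see what a cap does to a whole schedule, the Python sketch below lists the wait before each retry and the total time spent waiting. The function name wait_schedule is hypothetical, and the cap of 60 seconds is chosen only to make the capping visible; it is not a recommended value.

```python
def wait_schedule(max_retries, backoff_seconds, strategy="linear",
                  max_backoff_seconds=None):
    """Wait in seconds before each retry, in order (illustrative only)."""
    waits = []
    for retry in range(1, max_retries + 1):
        wait = backoff_seconds
        if strategy == "exponential":
            wait = backoff_seconds * 2 ** (retry - 1)
            if max_backoff_seconds is not None:
                wait = min(wait, max_backoff_seconds)
        waits.append(wait)
    return waits


schedule = wait_schedule(5, 10, "exponential", max_backoff_seconds=60)
# Attempt 1 is the initial run, so the first retry is attempt 2.
for attempt, wait in enumerate(schedule, start=2):
    print(f"Attempt {attempt}: wait {wait}s")
print(f"Total wait: {sum(schedule)}s")
```

With these inputs the uncapped waits of 80s and 160s both become 60s, so the worst-case total wait drops from 310s to 190s.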

Retry Lifecycle

1. Job Fails

A job completes with a status in retryOnStatuses.

2. Retry Check

Ctrlplane checks whether retries remain, that is, whether the number of attempts made so far is less than maxRetries + 1 (attempt < maxRetries + 1).

3. Backoff Wait

If backoffSeconds is configured, Ctrlplane waits before the next attempt. The nextEvaluationTime is set to indicate when the retry will be allowed.

4. Retry Attempt

A new job is created for the retry attempt.

5. Success or Exhausted

The process continues until success or all retries are exhausted.
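
The five steps above can be sketched as a small loop. This is a simplified Python model for reasoning about the lifecycle, not how Ctrlplane schedules jobs: run_with_retries, run_job, and sleep are hypothetical names, only a constant backoff is modelled, and the wait is a function call rather than a nextEvaluationTime.

```python
DEFAULT_STATUSES = ("failure", "invalidIntegration", "invalidJobAgent")


def run_with_retries(run_job, max_retries, retry_on_statuses=DEFAULT_STATUSES,
                     backoff_seconds=0, sleep=lambda seconds: None):
    """Run a job until it stops being retryable or retries are exhausted.

    run_job(attempt) returns a job status string. Returns the list of
    statuses, one per attempt.
    """
    history = []
    attempt = 0
    while True:
        attempt += 1
        status = run_job(attempt)          # initial run, or step 4
        history.append(status)
        if status not in retry_on_statuses:
            return history                 # step 5: finished
        if not attempt < max_retries + 1:  # step 2: retries remain?
            return history                 # step 5: exhausted
        sleep(backoff_seconds)             # step 3: backoff wait


# A job that fails twice and then succeeds uses three attempts.
outcomes = iter(["failure", "failure", "successful"])
print(run_with_retries(lambda attempt: next(outcomes), max_retries=3))
```

Passing time.sleep as the sleep argument would make the sketch wait for real; the default no-op keeps it instant for experimentation.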

Best Practices

Retry Guidelines

Scenario               Max Retries   Backoff   Strategy
Transient network      3-5           10-30s    exponential
Rate limiting          3             60s       exponential
Resource contention    2-3           30s       linear
Critical production    1-2           60s       linear
Flaky tests (dev/qa)   5             5s        linear

Recommendations

  • ✅ Use exponential backoff for external service failures
  • ✅ Set maxBackoffSeconds to avoid excessive wait times
  • ✅ Use fewer retries in production than in development
  • ✅ Monitor retry rates to identify systemic issues
  • ✅ Combine with alerting on final failure

Anti-Patterns

  • ❌ Infinite retries (always set maxRetries)
  • ❌ No backoff for rate-limited APIs
  • ❌ Same retry config across all environments
  • ❌ Retrying on non-transient failures
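
Some of these recommendations can be checked mechanically before a policy is submitted. The Python sketch below is a hypothetical pre-flight check, not part of Ctrlplane or its API: lint_retry and its messages are invented for illustration, and it only covers the points that can be read directly from a retry object.

```python
KNOWN_STATUSES = {
    "failure", "successful", "cancelled", "skipped",
    "invalidIntegration", "invalidJobAgent",
}


def lint_retry(retry):
    """Return a list of problems found in a retry rule dictionary."""
    problems = []
    if "maxRetries" not in retry:
        problems.append("maxRetries is required")
    unknown = sorted(set(retry.get("retryOnStatuses", [])) - KNOWN_STATUSES)
    if unknown:
        problems.append(f"unknown statuses: {', '.join(unknown)}")
    if (retry.get("backoffStrategy") == "exponential"
            and "maxBackoffSeconds" not in retry):
        problems.append("exponential backoff without maxBackoffSeconds")
    if retry.get("maxRetries", 0) > 0 and retry.get("backoffSeconds", 0) == 0:
        problems.append("retries with no backoff")
    return problems


print(lint_retry({"maxRetries": 3, "backoffStrategy": "exponential"}))
```

Whether a failure is transient cannot be judged from the configuration alone, so that anti-pattern still needs human review and monitoring of retry rates.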
