Overview
Why Use Retry Rules?
Retry rules help you:
- Handle transient failures - Automatically recover from temporary issues
- Reduce manual intervention - Let the system retry before alerting
- Configure per-environment - More retries in dev, fewer in production
- Control retry behavior - Set backoff strategies to avoid thundering herd
Configuration
Add a retry rule to your policy and configure it with the properties below.
Properties
| Property | Type | Required | Default | Description |
|---|---|---|---|---|
| maxRetries | integer | Yes | - | Maximum retry attempts (0 = no retries) |
| retryOnStatuses | array | No | failure, invalidIntegration, invalidJobAgent | Job statuses that trigger retry |
| backoffSeconds | integer | No | 0 | Seconds to wait between retries |
| backoffStrategy | string | No | linear | Backoff strategy: linear or exponential |
| maxBackoffSeconds | integer | No | - | Maximum backoff cap (for exponential) |
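As a rough illustration of how these properties fit together, the sketch below models a retry rule as a Python dataclass using the defaults from the table. The class name RetryRule, the snake_case field names, and the validation checks are assumptions made for this example; they are not part of Ctrlplane's API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Default statuses listed for retryOnStatuses in the properties table.
DEFAULT_RETRY_STATUSES = ("failure", "invalidIntegration", "invalidJobAgent")


@dataclass(frozen=True)
class RetryRule:
    """Hypothetical in-memory model of a retry rule (not Ctrlplane's API)."""

    max_retries: int                                   # maxRetries (required)
    retry_on_statuses: Tuple[str, ...] = DEFAULT_RETRY_STATUSES
    backoff_seconds: int = 0                           # backoffSeconds
    backoff_strategy: str = "linear"                   # backoffStrategy
    max_backoff_seconds: Optional[int] = None          # maxBackoffSeconds

    def __post_init__(self) -> None:
        # These checks are assumptions about sensible values, not documented rules.
        if self.max_retries < 0:
            raise ValueError("maxRetries must be 0 or greater")
        if self.backoff_seconds < 0:
            raise ValueError("backoffSeconds must be 0 or greater")
        if self.backoff_strategy not in ("linear", "exponential"):
            raise ValueError("backoffStrategy must be 'linear' or 'exponential'")


# Only maxRetries is required; everything else falls back to the defaults.
rule = RetryRule(max_retries=3)
```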
Job Statuses
The following job statuses can be used in retryOnStatuses:
| Status | Description |
|---|---|
| failure | Job failed during execution |
| successful | Job completed successfully |
| cancelled | Job was manually cancelled |
| skipped | Job was skipped |
| invalidIntegration | Integration configuration error |
| invalidJobAgent | Job agent configuration error |
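Because retryOnStatuses takes status names from a fixed set, a misspelled entry would never match a real job status. The sketch below checks a list against the statuses in the table above. The function name validate_retry_statuses is hypothetical, and whether Ctrlplane performs a similar check itself is not stated here.

```python
# Status names taken from the Job Statuses table.
KNOWN_JOB_STATUSES = frozenset({
    "failure",
    "successful",
    "cancelled",
    "skipped",
    "invalidIntegration",
    "invalidJobAgent",
})


def validate_retry_statuses(statuses):
    """Return the statuses as a list, or raise ValueError if any name is unknown."""
    unknown = sorted(set(statuses) - KNOWN_JOB_STATUSES)
    if unknown:
        raise ValueError("unknown job statuses: " + ", ".join(unknown))
    return list(statuses)


validated = validate_retry_statuses(["failure", "invalidIntegration"])
```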
Common Patterns
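The patterns in this section differ only in which properties they set. As a rough sketch, each one is written below as a plain Python dictionary keyed by the property names from the Configuration table. The specific numbers, the environment names, and the PATTERNS and PER_ENVIRONMENT variables are illustrative assumptions; how a rule is embedded in an actual policy is not shown here.

```python
# Illustrative values only; keys are the property names from the Configuration table.
PATTERNS = {
    "basic": {"maxRetries": 3},
    "with_backoff": {"maxRetries": 3, "backoffSeconds": 30},
    "exponential": {
        "maxRetries": 5,
        "backoffSeconds": 10,
        "backoffStrategy": "exponential",
        "maxBackoffSeconds": 300,
    },
    "strict": {"maxRetries": 0},  # 0 = no retries
    "specific_statuses": {"maxRetries": 2, "retryOnStatuses": ["failure"]},
}

# Environment-specific retry: more retries in dev, fewer in production.
PER_ENVIRONMENT = {
    "dev": {"maxRetries": 5, "backoffSeconds": 5},
    "production": {"maxRetries": 1, "backoffSeconds": 60},
}
```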
Basic Retry
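Retry failed jobs up to 3 times by setting maxRetries to 3.
Retry with Backoff
Wait between retry attempts by setting backoffSeconds.
Exponential Backoff
Increase the wait time with each retry by setting backoffStrategy to exponential.
No Retries (Strict)
Disable retries for critical deployments by setting maxRetries to 0.
Retry Specific Statuses
Retry only on specific failure types by listing them in retryOnStatuses.
Environment-Specific Retry
Use different retry behavior per environment, for example more retries in dev and fewer in production.
Backoff Strategies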
Linear Backoff
The wait time between retries stays constant.
Exponential Backoff
The wait time doubles with each retry. Use maxBackoffSeconds to cap the maximum wait time.
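The two strategies can be compared with a small calculation. This is a minimal sketch, assuming the first retry waits backoffSeconds and that exponential backoff doubles from there; the function name backoff_delay is hypothetical, and the exact formula Ctrlplane uses may differ.

```python
from typing import Optional


def backoff_delay(
    retry_number: int,
    backoff_seconds: int,
    strategy: str = "linear",
    max_backoff_seconds: Optional[int] = None,
) -> int:
    """Seconds to wait before the given retry (1 = first retry)."""
    if retry_number < 1:
        raise ValueError("retry_number starts at 1")
    if strategy == "linear":
        # Linear: the same wait before every retry.
        return backoff_seconds
    if strategy == "exponential":
        # Exponential: the wait doubles with each retry (assumed base: backoff_seconds).
        delay = backoff_seconds * 2 ** (retry_number - 1)
        # The cap is applied only to the exponential strategy.
        if max_backoff_seconds is not None:
            delay = min(delay, max_backoff_seconds)
        return delay
    raise ValueError("strategy must be 'linear' or 'exponential'")


linear = [backoff_delay(n, 10) for n in range(1, 5)]
exponential = [
    backoff_delay(n, 10, "exponential", max_backoff_seconds=60) for n in range(1, 5)
]
print(linear)       # [10, 10, 10, 10]
print(exponential)  # [10, 20, 40, 60]
```

Without the cap, the fourth exponential wait in this example would be 80 seconds rather than 60.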
Retry Lifecycle
1. Job Fails
A job completes with a status in retryOnStatuses.
2. Retry Check
Ctrlplane checks if retries remain (attempt < maxRetries + 1).
3. Backoff Wait
If backoffSeconds is configured, Ctrlplane waits before the next attempt.
4. Retry Attempt
A new job is created for the retry attempt.
5. Success or Exhausted
The process continues until success or all retries are exhausted.
Best Practices
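Before settling on values, it can help to walk through the retry lifecycle (fail, check, wait, retry) in code. The sketch below is a simplified simulation, not Ctrlplane's implementation: run_with_retries, the run_job callable, and the injected sleep function are all hypothetical, and it assumes attempts are numbered from 1 with a constant wait between them.

```python
import time
from typing import Callable, List, Sequence

DEFAULT_RETRY_STATUSES = ("failure", "invalidIntegration", "invalidJobAgent")


def run_with_retries(
    run_job: Callable[[int], str],
    max_retries: int,
    retry_on_statuses: Sequence[str] = DEFAULT_RETRY_STATUSES,
    backoff_seconds: int = 0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run a job and retry it per the lifecycle; return the status of every attempt."""
    statuses: List[str] = []
    attempt = 1
    while True:
        status = run_job(attempt)  # each attempt stands in for a newly created job
        statuses.append(status)
        if status not in retry_on_statuses:
            break  # finished with a status that does not trigger a retry
        if not attempt < max_retries + 1:
            break  # retry check failed: all retries are exhausted
        if backoff_seconds > 0:
            sleep(backoff_seconds)  # backoff wait before the next attempt
        attempt += 1
    return statuses


# A job that fails twice, then succeeds. sleep is stubbed so the example runs instantly.
outcomes = iter(["failure", "failure", "successful"])
waits: List[float] = []
history = run_with_retries(
    lambda attempt: next(outcomes),
    max_retries=3,
    backoff_seconds=30,
    sleep=waits.append,
)
print(history)  # ['failure', 'failure', 'successful']
print(waits)    # [30, 30]
```

With maxRetries set to 3, a job that never succeeds would run four times in this model: the original attempt plus three retries.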
Retry Guidelines
| Scenario | Max Retries | Backoff | Strategy |
|---|---|---|---|
| Transient network | 3-5 | 10-30s | exponential |
| Rate limiting | 3 | 60s | exponential |
| Resource contention | 2-3 | 30s | linear |
| Critical production | 1-2 | 60s | linear |
| Flaky tests (dev/qa) | 5 | 5s | linear |
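To get a feel for what these settings cost in time, the worst-case total wait (every attempt fails) can be added up. This is a sketch under stated assumptions: linear waits a constant backoffSeconds, exponential starts at backoffSeconds and doubles, and no maxBackoffSeconds cap is set. The function name total_backoff is hypothetical, and where the table gives a range the example picks a single value from it.

```python
def total_backoff(max_retries: int, backoff_seconds: int, strategy: str) -> int:
    """Total seconds spent waiting if every attempt fails (job run time excluded)."""
    if strategy == "exponential":
        # backoff, 2*backoff, 4*backoff, ... for each retry
        return sum(backoff_seconds * 2 ** n for n in range(max_retries))
    # Linear: the same wait before every retry.
    return backoff_seconds * max_retries


# (max retries, backoff seconds, strategy), one value chosen from each table row.
scenarios = {
    "Transient network": (3, 10, "exponential"),
    "Rate limiting": (3, 60, "exponential"),
    "Resource contention": (3, 30, "linear"),
    "Critical production": (2, 60, "linear"),
    "Flaky tests (dev/qa)": (5, 5, "linear"),
}

totals = {name: total_backoff(*args) for name, args in scenarios.items()}
for name, seconds in totals.items():
    print(f"{name}: {seconds}s")
# Transient network: 70s
# Rate limiting: 420s
# Resource contention: 90s
# Critical production: 120s
# Flaky tests (dev/qa): 25s
```

Under these assumptions the rate-limiting row spends up to seven minutes waiting, which is the kind of growth a maxBackoffSeconds cap is meant to bound.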
Recommendations
- ✅ Use exponential backoff for external service failures
- ✅ Set maxBackoffSeconds to avoid excessive wait times
- ✅ Use fewer retries in production than in development
- ✅ Monitor retry rates to identify systemic issues
- ✅ Combine with alerting on final failure
Anti-Patterns
- ❌ Infinite retries (always set maxRetries)
- ❌ No backoff for rate-limited APIs
- ❌ Same retry config across all environments
- ❌ Retrying on non-transient failures
Next Steps
- Policies Overview - Learn about policy structure
- Verification - Add health checks after deployment
- Environment Progression - Control promotion flow