Verification allows you to validate that a deployment is healthy after it completes. Ctrlplane can automatically run verification checks by querying metrics from external providers and evaluating success conditions.

Overview

Why Use Verification?

Verification helps you:
  • Catch Issues Early - Detect problems before they impact users
  • Automate Rollbacks - Trigger rollback policies when verification fails
  • Build Confidence - Ensure deployments meet quality standards
  • Gate Promotions - Block progression to production until QA verification passes
  • Environment-Specific Checks - Run different verifications per environment

Basic Configuration

Add a verification rule to your policy:
policies:
  - name: qa-smoke-tests
    description: Run E2E smoke tests in QA before promotion
    selectors:
      - environment: environment.name == "qa"
    rules:
      - verification:
          metrics:
            - name: e2e-smoke-tests
              interval: 30s
              count: 5
              provider:
                type: http
                url: "http://e2e-runner.qa/run?service={{.resource.name}}"
              successCondition: result.ok && result.json.passed == true
              failureLimit: 1
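
With this configuration, Ctrlplane calls the E2E runner five times at 30-second intervals and evaluates the success condition against each response. Because failureLimit is 1, the verification fails as soon as a single measurement fails.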

Environment-Specific Verifications

Different environments can have completely different verification requirements:
policies:
  # QA: Run E2E smoke tests
  - name: qa-verification
    selectors:
      - environment: environment.name == "qa"
    rules:
      - verification:
          metrics:
            - name: e2e-smoke-tests
              interval: 1m
              count: 3
              provider:
                type: http
                url: "http://e2e-runner/smoke?env=qa&service={{.resource.name}}"
              successCondition: result.json.all_passed == true

  # Staging: Check error rates and latency
  - name: staging-verification
    selectors:
      - environment: environment.name == "staging"
    rules:
      - verification:
          metrics:
            - name: error-rate
              interval: 30s
              count: 10
              provider:
                type: datadog
                apiKey: "{{.variables.dd_api_key}}"
                appKey: "{{.variables.dd_app_key}}"
                query: sum:errors{service:{{.resource.name}},env:staging}.as_rate()
              successCondition: result.value < 0.01
              failureLimit: 2

  # Production: Comprehensive health checks
  - name: production-verification
    selectors:
      - environment: environment.name == "production"
    rules:
      - verification:
          metrics:
            - name: error-rate
              interval: 1m
              count: 10
              provider:
                type: datadog
                apiKey: "{{.variables.dd_api_key}}"
                appKey: "{{.variables.dd_app_key}}"
                query: sum:errors{service:{{.resource.name}},env:prod}.as_rate()
              successCondition: result.value < 0.005
              failureLimit: 2
            - name: p99-latency
              interval: 1m
              count: 10
              provider:
                type: datadog
                apiKey: "{{.variables.dd_api_key}}"
                appKey: "{{.variables.dd_app_key}}"
                query: avg:latency.p99{service:{{.resource.name}},env:prod}
              successCondition: result.value < 200
              failureLimit: 2

Reusable Verification with Selectors

Use policy selectors to apply the same verification across multiple deployments or environments:
policies:
  # Apply to all backend services
  - name: backend-health-verification
    selectors:
      - deployment: deployment.metadata.serviceType == "backend"
    rules:
      - verification:
          metrics:
            - name: health-check
              interval: 30s
              count: 5
              provider:
                type: http
                url: "http://{{.resource.name}}/health"
              successCondition: result.ok

  # Apply to all services with canary deployments
  - name: canary-verification
    selectors:
      - deployment: deployment.metadata.deploymentStrategy == "canary"
    rules:
      - verification:
          metrics:
            - name: canary-error-comparison
              interval: 2m
              count: 5
              provider:
                type: datadog
                apiKey: "{{.variables.dd_api_key}}"
                appKey: "{{.variables.dd_app_key}}"
                query: |
                  sum:errors{service:{{.resource.name}},version:canary}.as_rate() /
                  sum:errors{service:{{.resource.name}},version:stable}.as_rate()
              successCondition: result.value < 1.1

Progressive Delivery Gates

Use verification to gate promotion through environments:
policies:
  # QA must pass smoke tests before staging
  - name: qa-gate
    selectors:
      - environment: environment.name == "qa"
    rules:
      - verification:
          metrics:
            - name: smoke-tests
              interval: 30s
              count: 3
              provider:
                type: http
                url: "http://smoke-test-runner/run"
                method: POST
                body: |
                  {
                    "service": "{{.resource.name}}",
                    "version": "{{.version.tag}}",
                    "environment": "qa"
                  }
              successCondition: result.json.status == "passed"
              failureLimit: 1  # strict gate: fail on the first failed measurement

  # Staging must pass before production
  - name: staging-gate
    selectors:
      - environment: environment.name == "staging"
    rules:
      - verification:
          metrics:
            - name: integration-tests
              interval: 1m
              count: 5
              provider:
                type: http
                url: "http://integration-runner/run?service={{.resource.name}}"
              successCondition:
                result.json.passed_count == result.json.total_count

Metric Configuration

Metric Properties

| Property | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Name of the verification metric |
| interval | string | Yes | Time between measurements (e.g., "30s", "5m") |
| count | integer | Yes | Number of measurements to take |
| provider | object | Yes | Metric provider configuration |
| successCondition | string | Yes | CEL expression to evaluate success |
| failureLimit | integer | No | Stop after this many failures (0 = no limit) |
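
For example, a minimal metric combining these properties (the service name and endpoint are illustrative):
- name: checkout-health              # identifies this verification metric
  interval: 30s                      # wait 30 seconds between measurements
  count: 5                           # take 5 measurements in total
  provider:                          # where the measurement comes from
    type: http
    url: "http://checkout/healthz"
  successCondition: result.ok        # CEL, evaluated per measurement
  failureLimit: 1                    # fail after one failed measurement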

Metric Providers

Ctrlplane supports multiple metric providers for collecting verification data.

HTTP Provider

Query any HTTP endpoint that returns JSON:
provider:
  type: http
  url: "http://{{.resource.name}}/health"
  method: GET
  headers:
    Authorization: "Bearer {{.variables.health_token}}"
  timeout: 30s
Response Data Available in CEL:
  • result.ok - true if status code is 2xx
  • result.statusCode - HTTP status code
  • result.body - Response body as string
  • result.json - Parsed JSON response
  • result.headers - Response headers
  • result.duration - Request duration in milliseconds
Example Success Conditions:
# Status code check
successCondition: result.ok

# JSON field check
successCondition: result.json.healthy == true

# Numeric threshold
successCondition: result.json.error_rate < 0.01

# Combined conditions
successCondition: result.ok && result.json.ready == true

Datadog Provider

Query metrics from Datadog’s Metrics API:
provider:
  type: datadog
  apiKey: "{{.variables.dd_api_key}}"
  appKey: "{{.variables.dd_app_key}}"
  site: datadoghq.com
  query: |
    sum:requests.error.rate{service:{{.resource.name}},env:{{.environment.name}}}
Configuration:
| Property | Required | Description |
|---|---|---|
| apiKey | Yes | Datadog API key (supports templates) |
| appKey | Yes | Datadog Application key (supports templates) |
| query | Yes | Datadog metrics query (supports templates) |
| site | No | Datadog site (default: datadoghq.com) |
Supported Sites:
  • datadoghq.com (US1 - default)
  • datadoghq.eu (EU)
  • us3.datadoghq.com (US3)
  • us5.datadoghq.com (US5)
  • ap1.datadoghq.com (AP1)
Response Data Available in CEL:
  • result.ok - true if API call succeeded
  • result.statusCode - HTTP status code
  • result.value - Last metric value from the query
  • result.json - Full Datadog API response
  • result.query - The resolved query string
  • result.duration - Request duration in milliseconds
Example Queries:
# Error rate
query: sum:requests.error.rate{service:api-service}

# Latency percentile
query: avg:trace.http.request.duration.by.service.99p{service:api-service}

# With environment tags
query: sum:requests{service:{{.resource.name}},env:{{.environment.name}}}

# Rate of 5xx errors
query: sum:http.requests{status_code:5*,service:{{.resource.name}}}.as_rate()

Template Variables

All provider configurations support Go templates with access to deployment context:
# Resource information
{{.resource.name}}
{{.resource.identifier}}
{{.resource.kind}}

# Environment information
{{.environment.name}}
{{.environment.id}}

# Deployment information
{{.deployment.name}}
{{.deployment.slug}}

# Version information
{{.version.tag}}
{{.version.id}}

# Custom variables (from deployment variables)
{{.variables.my_variable}}
{{.variables.dd_api_key}}
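
For example, a templated provider URL is rendered before each request. Assuming a resource named payments-api deploying to the staging environment (hypothetical values):
# As written in the policy
url: "http://e2e-runner/smoke?env={{.environment.name}}&service={{.resource.name}}"

# As resolved at verification time
url: "http://e2e-runner/smoke?env=staging&service=payments-api"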

Storing Secrets in Variables

For sensitive values like API keys, use deployment variables:
  1. Create a deployment variable:
curl -X POST https://app.ctrlplane.dev/api/v1/deployments/{deploymentId}/variables \
  -H "Authorization: Bearer $CTRLPLANE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"key": "dd_api_key", "description": "Datadog API key"}'
  2. Set the value:
curl -X POST https://app.ctrlplane.dev/api/v1/deployments/{deploymentId}/variables/{variableId}/values \
  -H "Authorization: Bearer $CTRLPLANE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"value": "your-actual-api-key"}'
  3. Reference in verification config:
provider:
  type: datadog
  apiKey: "{{.variables.dd_api_key}}"
  appKey: "{{.variables.dd_app_key}}"
  query: sum:errors{service:api}

Success Conditions (CEL)

Success conditions are written in CEL (Common Expression Language). The measurement data is available as the result variable.
# Boolean check
successCondition: result.ok

# Numeric comparison
successCondition: result.value < 0.01

# String comparison
successCondition: result.json.status == "healthy"

# Logical operators
successCondition: result.ok && result.json.ready
successCondition: result.json.status == "healthy" || result.json.status == "degraded"

# Arithmetic
successCondition: result.json.success_count / result.json.total_count > 0.99
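
Standard CEL operators and macros should also work, assuming the evaluator exposes the CEL standard library (the fields below are hypothetical):
# Membership test
successCondition: result.json.status in ["healthy", "degraded"]

# Guard against a missing field before reading it
successCondition: has(result.json.ready) && result.json.ready

# String prefix check
successCondition: result.json.version.startsWith("v2.")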

Verification Lifecycle

1. Policy Evaluation

When a job completes, Ctrlplane evaluates policies to determine which verifications apply based on the policy selectors.

2. Verification Starts

If a matching policy has verification rules, Ctrlplane creates a verification record and starts the measurement process.

3. Measurements Taken

For each configured metric, measurements are taken at the specified interval:
Metric: error-rate (interval: 30s, count: 10)

t+0s:   Measurement 1 → Passed (value: 0.005)
t+30s:  Measurement 2 → Passed (value: 0.007)
t+60s:  Measurement 3 → Failed (value: 0.015)
t+90s:  Measurement 4 → Passed (value: 0.008)
...
t+270s: Measurement 10 → Passed (value: 0.006)

4. Verification Result

  • Passed: All measurements passed, or the number of failures stayed below failureLimit
  • Failed: The number of failures reached failureLimit

In the timeline above, one of ten measurements failed, so a metric with failureLimit: 2 would still pass.

5. Policy Action

Based on the verification result, the policy can:
  • Allow promotion to the next environment
  • Trigger rollback to a previous version
  • Block release until manual intervention

Verification Status

| Status | Description |
|---|---|
| running | Verification in progress, taking measurements |
| passed | All checks passed within acceptable limits |
| failed | Too many measurements failed |
| cancelled | Verification was manually cancelled |

Best Practices

Timing Recommendations

| Scenario | Recommended Interval | Recommended Count |
|---|---|---|
| Quick smoke test | 10-30s | 3-5 |
| Standard verification | 30s-1m | 5-10 |
| Extended soak test | 5m | 12-24 |
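
For instance, an extended soak test following the last row might look like this (the metric name and query are illustrative):
- name: error-rate-soak
  interval: 5m
  count: 12        # roughly one hour of observation
  provider:
    type: datadog
    apiKey: "{{.variables.dd_api_key}}"
    appKey: "{{.variables.dd_app_key}}"
    query: sum:errors{service:{{.resource.name}},env:prod}.as_rate()
  successCondition: result.value < 0.005
  failureLimit: 2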

Failure Limits

| Risk Tolerance | Failure Limit | Notes |
|---|---|---|
| Strict | 1 | Fail on first failure |
| Normal | 2-3 | Allow transient issues |
| Lenient | 5+ | For noisy metrics |

Environment-Specific Recommendations

| Environment | Verification Focus | Timing |
|---|---|---|
| QA | Smoke tests, E2E tests | Quick (1-3 min) |
| Staging | Integration tests, error rates | Medium (5 min) |
| Production | Error rates, latency, business KPIs | Extended (10 min) |

Troubleshooting

Verification always fails

  • Check if the provider can reach the target (network, DNS)
  • Verify API credentials are correct
  • Test the query manually (see the Datadog example after this list)
  • Review measurement data for unexpected values
  • Check if success condition is too strict
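
To test a Datadog query by hand, you can call Datadog's v1 metrics query API directly (a sketch; substitute your own keys, time range, and query):
curl -G "https://api.datadoghq.com/api/v1/query" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  --data-urlencode "from=$(($(date +%s) - 300))" \
  --data-urlencode "to=$(date +%s)" \
  --data-urlencode "query=sum:errors{service:api-service}.as_rate()"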

Verification not running

  • Verify the policy selector matches the release target
  • Check that the policy is enabled
  • Review policy evaluation logs
  • Ensure verification is configured in the policy rules

Wrong verification applied

  • Review policy selectors
  • Check policy priority/ordering
  • Verify environment and metadata values
  • Review which policies matched the release

Next Steps