| Category | Status | Created | Author |
|---|---|---|---|
| Operations | Draft | 2026-03-13 | Justin Brooks |
Summary
Add a first-class deployment freeze primitive that instantly halts all deployments within a configurable scope (workspace, system, environment, or deployment). Freezes are imperative operations, created and lifted via API or UI, with an audit trail recording who activated the freeze, why, and when it was lifted. An optional TTL auto-thaws the freeze after a configurable duration to prevent forgotten freezes from blocking deployments indefinitely.
Motivation
No emergency halt exists
Ctrlplane’s deployment window evaluator (policy_rule_deployment_window)
provides scheduled allow/deny windows using rrule patterns. This covers planned
maintenance windows and business-hours-only deployment policies. But there is no
mechanism for an operator to say “stop everything now” during an incident.
When a production incident occurs, the response today requires one of:
- Disabling policies. Setting `enabled = false` on every relevant policy. This stops deployments but also disables approval requirements, version selectors, and every other policy rule. Re-enabling them requires remembering which policies were active. There is no audit trail of the freeze itself.
- Creating deny windows. Adding a `policy_rule_deployment_window` with `allow_window = false` that covers the incident duration. This requires knowing the duration in advance, does not surface clearly as an emergency action in the UI, and leaves orphan policy rules that must be cleaned up.
- Manual intervention. Telling the team on Slack to stop pushing versions and hoping no automation triggers. This provides no system-level enforcement.
What is needed instead is a freeze primitive that:
- Takes effect immediately across the target scope.
- Does not disable other policy rules (approval, verification, etc. remain configured for when the freeze lifts).
- Records who activated it, why, and links to an incident.
- Automatically lifts after a TTL if not manually thawed.
- Notifies relevant stakeholders when activated and when lifted.
Deployment windows are the wrong abstraction
Deployment windows are scheduled, recurring patterns such as “deploy only on weekdays 9am–5pm.” They are defined ahead of time and repeat on a cadence. An emergency freeze is an imperative, one-shot action: “stop deploying right now because production is on fire.” Overloading the window concept for emergency freezes creates several problems:
- Discoverability. An emergency freeze buried in policy rules is hard to find. Operators need a top-level indicator (a banner in the UI, a status endpoint) that shows whether a freeze is active.
- Audit semantics. A deployment window rule has no concept of “who activated this” or “why.” It’s a configuration, not an action.
- Scope mismatch. Deployment windows are scoped to policies, which are scoped by selector. An emergency freeze often needs to cover an entire workspace or environment regardless of which policies are configured.
- Lifecycle mismatch. Windows are permanent configuration. Freezes are transient — they are created and destroyed. TTL-based auto-expiry makes no sense for a recurring window rule.
RFC 0003 does not address this
RFC 0003 introduces resource concurrency limits that cap how many resources can be simultaneously undergoing deployment. This addresses capacity concerns (don’t overwhelm the cluster) but not the “stop everything” incident response case. Concurrency limits still allow deployments, just fewer at a time. A freeze allows zero.
Proposal
Deployment freeze as a standalone entity
A deployment freeze is not a policy rule. It is a workspace-level entity with its own lifecycle (create, extend, thaw), its own API surface, and its own audit trail. The workspace-engine checks for active freezes early in the evaluator pipeline and denies all matching deployments while a freeze is active. This separation means:
- Freezes can be managed by anyone with the appropriate permission without touching policy configuration.
- The policy configuration remains unchanged during a freeze — approval rules, gradual rollout settings, deployment windows, etc. are all preserved.
- When the freeze lifts, deployments resume exactly where they left off in the policy pipeline.
Schema
A freeze is active when `thawed_at IS NULL` (not manually thawed), AND `expires_at IS NULL OR expires_at > now()` (no TTL, or TTL not yet reached).
The `selector` field is optional and allows narrowing within a scope. A workspace-wide freeze with `selector = "deployment.metadata['tier'] == 'critical'"` freezes only critical-tier deployments across the workspace, leaving non-critical deployments unaffected.
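
The active-freeze condition can be sketched in a few lines. This is an illustrative Python model, not the proposed table definition: the field names follow the columns mentioned in this RFC, while `scope_id` and all field types are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class DeploymentFreeze:
    scope: str                              # "workspace" | "system" | "environment" | "deployment"
    scope_id: str                           # hypothetical: ID of the scoped entity
    reason: str
    created_by: str
    created_at: datetime
    expires_at: Optional[datetime] = None   # None means no TTL
    thawed_at: Optional[datetime] = None    # None means not manually thawed
    selector: Optional[str] = None          # optional CEL expression
    incident_url: Optional[str] = None


def is_active(freeze: DeploymentFreeze, now: datetime) -> bool:
    """Mirror of: thawed_at IS NULL AND (expires_at IS NULL OR expires_at > now())."""
    if freeze.thawed_at is not None:
        return False
    return freeze.expires_at is None or freeze.expires_at > now


# Usage: a one-hour freeze stays active until its TTL is reached.
created = datetime(2026, 3, 13, 12, 0, tzinfo=timezone.utc)
freeze = DeploymentFreeze(
    scope="workspace", scope_id="ws-1", reason="elevated error rates",
    created_by="user-1", created_at=created,
    expires_at=created + timedelta(hours=1),
)
```

Note that the comparison is strict: a freeze whose `expires_at` equals the current time is already inactive, which matches the `expires_at <= now()` condition used by auto-thaw.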
Freeze audit log
Every state change is recorded in a dedicated audit table. The `expired` action is written by the workspace-engine when it detects a freeze past its `expires_at`. The `actor_id` is `NULL` for system-initiated actions (auto-expiry).
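
As a rough illustration of the audit record, the sketch below models an event row in Python. The action names are taken from the lifecycle events described in this RFC (with `bypassed` assumed to be one of them); `FreezeEvent`, `record_event`, and the `occurred_at` field are hypothetical names for this example only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

# Assumed set of actions; "bypassed" is inferred from the notification events.
ACTIONS = ("activated", "extended", "thawed", "expired", "bypassed")


@dataclass(frozen=True)
class FreezeEvent:
    freeze_id: str
    action: str
    actor_id: Optional[str]     # None for system-initiated actions
    occurred_at: datetime
    reason: Optional[str] = None


def record_event(log: List[FreezeEvent], freeze_id: str, action: str,
                 occurred_at: datetime, actor_id: Optional[str] = None,
                 reason: Optional[str] = None) -> FreezeEvent:
    """Append one immutable event to the audit log."""
    if action not in ACTIONS:
        raise ValueError(f"unknown freeze action: {action!r}")
    event = FreezeEvent(freeze_id, action, actor_id, occurred_at, reason)
    log.append(event)
    return event


# Usage: a user activates a freeze; the engine later expires it with no actor.
log: List[FreezeEvent] = []
t0 = datetime(2026, 3, 13, 12, 0, tzinfo=timezone.utc)
record_event(log, "frz-1", "activated", t0, actor_id="user-1", reason="incident")
record_event(log, "frz-1", "expired", t0.replace(hour=14))
```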
API
REST
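`expiresIn` is an ISO 8601 duration. The server computes `expires_at = now() + duration`. If omitted, the freeze has no auto-thaw and must be manually lifted.
Thaw lifts the freeze by setting `thawed_at`. Extend resets the TTL: the new `expires_at` is computed from `now()`, not from the original `created_at`.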
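
The expiry arithmetic can be sketched as follows. This is a minimal Python example that parses only a subset of ISO 8601 durations (days, hours, minutes, seconds); `parse_duration` and `compute_expires_at` are hypothetical helper names, and a real server would likely use a full duration parser.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Subset of ISO 8601 durations: PnDTnHnMnS (no years, months, or weeks).
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(text: str) -> timedelta:
    """Convert a duration such as 'PT2H' or 'P1DT30M' to a timedelta."""
    match = _DURATION.match(text)
    parts = {k: int(v) for k, v in match.groupdict().items() if v is not None} if match else {}
    if not parts or text.endswith("T"):
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**parts)


def compute_expires_at(now: datetime, expires_in: Optional[str]) -> Optional[datetime]:
    """expires_at = now + duration; None means the freeze has no auto-thaw."""
    return None if expires_in is None else now + parse_duration(expires_in)


# Usage: create with a 2-hour TTL, then extend by 1 hour 90 minutes later.
created = datetime(2026, 3, 13, 12, 0, tzinfo=timezone.utc)
initial_expiry = compute_expires_at(created, "PT2H")
extended_at = created + timedelta(minutes=90)
new_expiry = compute_expires_at(extended_at, "PT1H")
```

Because the extension is computed from the time of the extend call, `new_expiry` here is 14:30, not 13:00 (which would result from adding one hour to `created_at`).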
tRPC
Evaluator integration
A `DeploymentFreezeEvaluator` is added to the evaluator pipeline with `Complexity() = 0` (cheapest possible) so it runs before all other evaluators. If a matching freeze is active, it short-circuits with a `Denied` result and the remaining evaluators are never called.
The `Evaluate` method checks for any active freeze matching the release target’s workspace, system, environment, and deployment.
The `GetActiveFreezes` query checks all scopes that could apply to the release target.
Freezes with a `selector` are post-filtered by evaluating the CEL expression against the release target context. This keeps the SQL query simple while supporting fine-grained filtering.
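
The two-step matching described above (scope lookup, then selector post-filter) might look roughly like this. The sketch is in Python for brevity and is not the workspace-engine implementation: `get_active_freezes` stands in for the SQL query, and the selector is modeled as a plain Python predicate because evaluating real CEL expressions is outside the scope of the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ReleaseTarget:
    workspace_id: str
    system_id: str
    environment_id: str
    deployment_id: str
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class ActiveFreeze:
    id: str
    scope: str          # "workspace" | "system" | "environment" | "deployment"
    scope_id: str       # hypothetical: ID of the scoped entity
    # Stand-in for a CEL selector; None means the freeze covers the whole scope.
    selector: Optional[Callable[[ReleaseTarget], bool]] = None


def get_active_freezes(freezes: List[ActiveFreeze], target: ReleaseTarget) -> List[ActiveFreeze]:
    """Stand-in for the scope query: keep freezes whose scope covers the target."""
    ids = {
        "workspace": target.workspace_id,
        "system": target.system_id,
        "environment": target.environment_id,
        "deployment": target.deployment_id,
    }
    return [f for f in freezes if ids.get(f.scope) == f.scope_id]


def evaluate(freezes: List[ActiveFreeze], target: ReleaseTarget) -> Tuple[str, Optional[str]]:
    """Return ("denied", freeze_id) for the first matching freeze, else ("allowed", None)."""
    for freeze in get_active_freezes(freezes, target):
        if freeze.selector is None or freeze.selector(target):
            return "denied", freeze.id
    return "allowed", None


# Usage: a workspace-wide freeze narrowed to critical-tier deployments.
critical_only = ActiveFreeze(
    "frz-1", "workspace", "ws-1",
    selector=lambda t: t.metadata.get("tier") == "critical",
)
api = ReleaseTarget("ws-1", "sys-1", "env-prod", "dep-api", {"tier": "critical"})
docs = ReleaseTarget("ws-1", "sys-1", "env-prod", "dep-docs", {"tier": "low"})
```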
Evaluator pipeline placement
The freeze evaluator does not originate from a policy rule; it is injected unconditionally for every evaluation. Because `Complexity() = 0` and the evaluators are sorted cheapest-first, the freeze check always runs first. If a freeze is active, the `Denied` result prevents job creation without evaluating any policy rules.
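
A small sketch of the ordering and short-circuit behavior, written in Python with a deliberately simplified evaluator type. `Evaluator`, `run_pipeline`, and the complexity values of the policy evaluators are illustrative assumptions; only the freeze evaluator's complexity of 0 comes from this RFC.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Evaluator:
    name: str
    complexity: int
    check: Callable[[str], bool]    # returns True when the target is allowed


def run_pipeline(policy_evaluators: List[Evaluator], freeze_evaluator: Evaluator,
                 target: str) -> Tuple[bool, List[str]]:
    """Return (allowed, names of the evaluators that actually ran)."""
    # The freeze evaluator is injected unconditionally, then everything is
    # sorted cheapest-first. sorted() is stable, so the freeze evaluator also
    # stays ahead of any policy evaluator that happens to have complexity 0.
    pipeline = sorted([freeze_evaluator, *policy_evaluators], key=lambda e: e.complexity)
    ran: List[str] = []
    for evaluator in pipeline:
        ran.append(evaluator.name)
        if not evaluator.check(target):
            return False, ran       # short-circuit: later evaluators never run
    return True, ran


# Usage: "env-prod" is frozen, "env-staging" is not.
frozen_targets = {"env-prod"}
freeze = Evaluator("freeze", 0, lambda t: t not in frozen_targets)
policies = [
    Evaluator("approval", 2, lambda t: True),
    Evaluator("deployment_window", 1, lambda t: True),
]
```

With these inputs, `run_pipeline(policies, freeze, "env-prod")` stops after the freeze check, while `"env-staging"` passes through all three evaluators in cost order.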
Override mechanism
Some deployments must proceed even during a freeze, such as a hotfix for the very incident that caused the freeze, or a rollback to a known-good version. The freeze supports an explicit override via a `bypass_freeze` field on the version or a policy skip.
When a version has `bypass_freeze = true`, the freeze evaluator returns `Allowed` with a detail noting the bypass.
Setting `bypass_freeze = true` requires a specific permission (`deployment_freeze.bypass`) and is recorded in the freeze event log.
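
The bypass path could be sketched like this in Python. The permission name and the `bypass_freeze` field come from this RFC; `create_version`, `evaluate_freeze`, the tuple-based event log, and the choice to log the bypass at evaluation time are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

BYPASS_PERMISSION = "deployment_freeze.bypass"


@dataclass
class Version:
    tag: str
    bypass_freeze: bool = False


def create_version(tag: str, bypass_freeze: bool, actor_permissions: Set[str]) -> Version:
    """Reject bypass_freeze=True unless the actor holds the bypass permission."""
    if bypass_freeze and BYPASS_PERMISSION not in actor_permissions:
        raise PermissionError(f"{BYPASS_PERMISSION} is required to set bypass_freeze")
    return Version(tag, bypass_freeze)


def evaluate_freeze(version: Version, matching_freeze_ids: List[str],
                    event_log: List[Tuple[str, str, str]]) -> Tuple[str, str]:
    """Return (result, detail); a bypassing version is allowed and the bypass is logged."""
    if not matching_freeze_ids:
        return "allowed", "no active freeze"
    if version.bypass_freeze:
        # Assumption: one "bypassed" event per matching freeze.
        for freeze_id in matching_freeze_ids:
            event_log.append((freeze_id, "bypassed", version.tag))
        return "allowed", f"freeze bypassed by version {version.tag}"
    return "denied", "deployment freeze active"


# Usage: a hotfix created by an actor who holds the bypass permission.
hotfix = create_version("v1.2.4", True, {BYPASS_PERMISSION})
```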
Auto-thaw
The workspace-engine runs a periodic check (every 60 seconds) for freezes past their `expires_at`.
The `ExpireActiveFreezes` query atomically sets `thawed_at = now()` on all freezes where `expires_at <= now() AND thawed_at IS NULL`.
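
One tick of the auto-thaw check can be modeled in memory as below. This is a Python sketch under stated assumptions: freezes and events are plain dictionaries, `expire_active_freezes` imitates the effect of the query on a list rather than a database, and `auto_thaw_tick` is a hypothetical name for the periodic job body.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def expire_active_freezes(freezes: List[Dict], now: datetime) -> List[Dict]:
    """Set thawed_at on freezes with expires_at <= now AND thawed_at IS NULL."""
    expired = []
    for freeze in freezes:
        if (freeze["thawed_at"] is None
                and freeze["expires_at"] is not None
                and freeze["expires_at"] <= now):
            freeze["thawed_at"] = now
            expired.append(freeze)
    return expired


def auto_thaw_tick(freezes: List[Dict], events: List[Dict], now: datetime) -> int:
    """Run one periodic check and write an 'expired' event per thawed freeze."""
    expired = expire_active_freezes(freezes, now)
    for freeze in expired:
        # actor_id is None because the action is system-initiated.
        events.append({"freeze_id": freeze["id"], "action": "expired",
                       "actor_id": None, "at": now})
    return len(expired)


# Usage: one freeze past its TTL, one still running, one without a TTL.
now = datetime(2026, 3, 13, 14, 0, tzinfo=timezone.utc)
freezes = [
    {"id": "frz-1", "expires_at": now - timedelta(minutes=1), "thawed_at": None},
    {"id": "frz-2", "expires_at": now + timedelta(hours=1), "thawed_at": None},
    {"id": "frz-3", "expires_at": None, "thawed_at": None},
]
events: List[Dict] = []
```

Because already-thawed rows no longer match the condition, running the tick twice at the same instant writes the `expired` event only once.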
UI
Freeze banner
When any freeze is active in the current workspace, a persistent banner appears at the top of the workspace layout.
Freeze management page
A new page at `/workspaces/{id}/freezes` shows:
- Active freezes — scope, reason, who activated, when, TTL remaining, thaw button.
- Recent freezes — last 30 days, with full event history (activated, extended, thawed/expired).
- Create freeze button — opens a form with scope picker, reason, incident URL, TTL selector, and optional CEL selector.
Environment and deployment views
The environment detail page and deployment detail page show a freeze indicator when a freeze is active that covers them. Frozen release targets display the freeze reason in their status column instead of the normal policy evaluation status.
Notifications
Freeze lifecycle events trigger notifications through the existing notification system:

| Event | Recipients | Content |
|---|---|---|
| Activated | Workspace members with deploy permission | Scope, reason, incident URL, TTL, who activated |
| Extended | Same as activated | New TTL, extension reason |
| Thawed | Same as activated | Who thawed, thaw reason, duration |
| Expired | Same as activated | Original TTL, total duration |
| Bypassed | Workspace admins | Which version bypassed, who created the version |
Examples
Workspace-wide emergency freeze
An SRE detects elevated error rates across all services and freezes the whole workspace.
Environment-scoped freeze during rollback
A bad deployment reaches production. The operator freezes production while rolling back.
Freeze with selector
A database migration is running and only deployments that write to the database should be frozen. Deployments without `depends_on_db: true` in their metadata continue unaffected.
Extending a freeze
The incident is taking longer than expected, so the operator extends the freeze. An `extended` event is recorded.
Manual thaw
The incident is resolved and the operator thaws the freeze.
Multiple overlapping freezes
A workspace freeze is active when an environment-specific freeze is also created. The `bypass_freeze` field on a version exempts it from all matching freezes. If the operator intends for the bypass to respect workspace-level freezes, they should scope the bypass to a specific freeze via metadata convention (see Open Questions).
Migration
- The `deployment_freeze` and `deployment_freeze_event` tables are new. No data migration required.
- The `bypass_freeze` column on `deployment_version` is additive with a default of `false`. Existing versions are unaffected.
- The `deployment_freeze_scope` and `deployment_freeze_action` enum types are new.
- The freeze evaluator is injected unconditionally and returns `Allowed` when no freezes are active. Existing behavior is preserved.
- No changes to existing policy rules or evaluators.
Open Questions
- Bypass granularity. The current proposal has a boolean `bypass_freeze` on the version that bypasses all matching freezes. Should bypass be scoped to a specific freeze ID instead? This would allow a version to bypass an environment freeze but still respect a workspace freeze. The trade-off is complexity: the deployer must know the freeze ID at version creation time.
- Freeze inheritance. If a workspace freeze is active, should an environment-level thaw override it for that environment? The current proposal says no: a workspace freeze blocks everything regardless of environment-level freeze state. This is the safe default but may be too rigid for organizations that want hierarchical freeze management.
- In-flight jobs. A freeze prevents new jobs from being created. Should it also cancel or pause jobs that are already running? Cancellation is destructive and may leave resources in an inconsistent state. The safe default is to let in-flight jobs complete but prevent new ones. However, for severe incidents, the operator may want to stop everything including in-flight work.
- Freeze permissions. Who can create and thaw freezes? The proposal assumes a `deployment_freeze.create` and `deployment_freeze.thaw` permission. Should thawing require higher privileges than freezing (to prevent accidental thaws)? Should there be a “break glass” thaw that requires admin approval?
- Notification timing. Should the system send warning notifications before a freeze auto-expires (e.g., “Freeze expires in 30 minutes: extend or thaw manually”)? This prevents surprise resumption of deployments if the operator intended to extend.
- Interaction with deployment windows. If a freeze is active during a deployment window’s allow period, the freeze takes precedence (denied). When the freeze lifts, should the system check if the deployment window is still open? The evaluator pipeline handles this naturally: after the freeze evaluator allows, the deployment window evaluator runs next and checks the current time. But this means a freeze that lifts 5 minutes before a window closes gives only 5 minutes of deployment time. Should the window be extended to compensate?
- Terraform / IaC representation. Should freezes be expressible as Terraform resources? Freezes are inherently imperative and transient, which maps poorly to Terraform’s declarative model. A `ctrlplane_deployment_freeze` resource would be created on `apply` and destroyed on `destroy`, which technically works but feels semantically odd for an incident response action.
- Cascading thaw. When a workspace freeze is thawed, should all narrower-scope freezes within that workspace also be thawed? Or should they remain active independently? The current proposal treats each freeze as independent: thawing the workspace freeze does not affect the environment freeze.
Future Considerations
PagerDuty integration
The most natural extension of deployment freezes is automatic activation from an incident management system. PagerDuty is the primary target, with the pattern generalizing to Opsgenie, Grafana OnCall, and similar tools.
Auto-freeze on incident creation. A PagerDuty webhook listener receives incident events and creates a deployment freeze when an incident is triggered. The mapping from incident to freeze scope could be configured per-service.
A `minSeverity` threshold prevents low-priority alerts from triggering freezes. P1 incidents could freeze the workspace, P2 freeze the affected environment, P3/P4 are informational only.
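
The severity-to-scope mapping might be expressed as a small lookup. The sketch below is Python and purely illustrative: the P1/P2 mapping follows the example in the text, while `freeze_scope_for_incident`, the table names, and the default threshold of `P2` are assumptions rather than part of any PagerDuty or ctrlplane API.

```python
from typing import Optional

# Lower rank means higher severity.
SEVERITY_RANK = {"P1": 1, "P2": 2, "P3": 3, "P4": 4}

# Example mapping from the text: P1 freezes the workspace, P2 the environment.
SCOPE_BY_SEVERITY = {"P1": "workspace", "P2": "environment"}


def freeze_scope_for_incident(severity: str, min_severity: str = "P2") -> Optional[str]:
    """Return the scope to freeze, or None when the incident is informational."""
    if SEVERITY_RANK[severity] > SEVERITY_RANK[min_severity]:
        return None     # below the configured threshold: no freeze
    return SCOPE_BY_SEVERITY.get(severity)


# Usage: with the default threshold, only P1 and P2 incidents create freezes.
scopes = {sev: freeze_scope_for_incident(sev) for sev in ("P1", "P2", "P3", "P4")}
```

Raising the threshold to `P1` (for example, `freeze_scope_for_incident("P2", min_severity="P1")`) makes P2 incidents informational as well.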
The freeze’s reason and incident_url are populated automatically from the
PagerDuty incident title and URL. The created_by is set to a service account
representing the PagerDuty integration, with the PagerDuty incident responder
recorded in event metadata.
Auto-thaw on incident resolution. When PagerDuty sends a resolved webhook,
the integration finds freezes linked to that incident (via incident_url or a
pagerduty_incident_id metadata field) and thaws them. The thaw reason is
populated from the PagerDuty resolution note.
A configurable thaw_delay (e.g., 15 minutes after resolution) provides a
buffer — the incident may be resolved in PagerDuty before the system is fully
stable. During the delay, the freeze remains active but a notification warns
that auto-thaw is imminent.
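
The delayed thaw can be sketched as a small state function. This is an illustrative Python example: the state names and `freeze_state_after_resolution` are hypothetical, and the 15-minute default simply mirrors the example value given for `thaw_delay`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freeze_state_after_resolution(resolved_at: Optional[datetime], now: datetime,
                                  thaw_delay: timedelta = timedelta(minutes=15)) -> str:
    """Classify a PagerDuty-linked freeze relative to incident resolution."""
    if resolved_at is None:
        return "frozen"             # incident still open
    if now < resolved_at + thaw_delay:
        return "thaw_pending"       # still frozen; warn that auto-thaw is imminent
    return "thawed"                 # delay elapsed; the integration thaws the freeze


# Usage: the incident is resolved at 13:00 UTC with the default 15-minute delay.
resolved = datetime(2026, 3, 13, 13, 0, tzinfo=timezone.utc)
state_at_1310 = freeze_state_after_resolution(resolved, resolved + timedelta(minutes=10))
state_at_1315 = freeze_state_after_resolution(resolved, resolved + timedelta(minutes=15))
```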
Bidirectional timeline. The PagerDuty integration posts timeline entries on the incident when freezes are activated, extended, or thawed. This gives incident responders visibility into deployment state directly from their incident management tool.
Slack integration
Beyond notifications (covered in the main proposal), Slack could provide interactive freeze management:
- Slash commands. `/ctrlplane freeze production "Payment service incident"` creates a freeze. `/ctrlplane thaw <freeze-id>` lifts one. Useful during incident response when switching to the ctrlplane UI is a context switch.
- Interactive messages. Freeze notifications include “Extend” and “Thaw” buttons that trigger API calls directly from Slack.
- Incident channel binding. When a freeze is created with an incident URL that maps to a Slack channel (e.g., `#inc-4521`), freeze lifecycle notifications are posted to that channel specifically, not just the default notification channel.
Statuspage integration
Active workspace-wide or environment-wide freezes could automatically update an external status page (Atlassian Statuspage, Instatus, etc.) to reflect that deployments are paused. This is relevant for platform teams that publish deployment status to internal consumers:
- Freeze activated → status component set to “Degraded Performance” or “Maintenance” with the freeze reason.
- Freeze thawed → status component restored to “Operational.”
CI/CD pipeline gating
Freezes could be exposed as a check endpoint that CI/CD systems query before proceeding with deployment steps.
Calendar-based planned freezes
While the current proposal focuses on emergency freezes, the same primitive could support planned change freezes (e.g., end-of-quarter code freezes, holiday freezes). These would be created ahead of time with a future `created_at` (or a separate `effective_at` field) and a known `expires_at`. The integration with Google Calendar or Outlook could auto-create freezes from calendar events tagged with a specific label.
Incident management post-mortem
On freeze thaw, the system could auto-create a post-mortem template in the configured project management tool (Jira, Linear, etc.) pre-populated with:
- Freeze duration and scope.
- Which deployments were blocked and for how long.
- Which versions bypassed the freeze.
- Timeline of freeze events.