| Category | Status | Created | Author |
|---|---|---|---|
| Job Agents | Draft | 2026-03-13 | Justin Brooks |
Summary
Add a manual-action job agent type that represents a human task within a
deployment pipeline. When dispatched, the agent transitions the job to an
“action required” state, notifies assignees through configured channels (Slack,
email, webhook), and waits indefinitely until a human explicitly marks the task
as completed. This enables teams to embed manual operational steps — hardware
swaps, vendor coordination, compliance sign-offs, manual DNS changes — directly
into ctrlplane’s promotion lifecycle, ensuring downstream deployments do not
proceed until the manual work is confirmed done.
Motivation
Automated agents assume automated execution
Ctrlplane’s job agent model is built around dispatching work to external systems that execute autonomously: ArgoCD syncs an Application, GitHub Actions runs a workflow, Terraform Cloud applies a plan, Argo Workflows orchestrates a DAG. In each case, ctrlplane sends a dispatch, the external system does the work, and the agent reports back when it finishes.

But not every step in a deployment pipeline can be automated. Real-world deployment procedures frequently include steps that require a human to physically do something:
- Hardware provisioning — rack and cable a new server before software deployment can target it.
- Manual DNS changes — update DNS records in a provider that lacks API access or is managed by a different team.
- Vendor coordination — contact a third-party provider to enable a feature flag, update a firewall rule, or rotate a certificate.
- Compliance checkpoints — obtain a sign-off from a security or compliance officer that a change has been reviewed and meets regulatory requirements.
- Customer communication — notify a customer before a maintenance window begins, and confirm they have acknowledged.
- Manual database operations — run a migration in a restricted production environment where automated access is prohibited by policy.
- Physical verification — inspect that a deployment to an edge device or kiosk is functioning correctly before proceeding to the next location.

When these steps are handled outside ctrlplane, several gaps follow:
- No orchestration signal. Ctrlplane does not know that a manual step exists. The pipeline appears stalled with no indication of what is being waited on or who is responsible.
- No notification routing. There is no mechanism to automatically notify the right person when a manual step is ready. The deployer must remember to ping someone.
- No audit trail. There is no record of who completed the manual step, when, or what evidence they provided. The job status update only records that the job transitioned to success.
Distinct from the approval policy
The existing approval policy (see policies/approval) gates whether a release
should proceed — it is a governance checkpoint. A user reviews the proposed
change and approves or rejects it. The release itself has not started executing;
the approval decides whether it will.
A manual action is different. It represents work that must be performed as
part of the deployment execution. The deployment has already been approved and
is in progress. The manual action is a step within that execution that happens
to require a human instead of a machine.
Why not use an external ticketing system?
Teams could model manual steps as GitHub Actions workflows that create a Jira ticket and poll for its resolution. But this approach has several costs:
- A CI runner must continuously poll an external system.
- Credentials for the ticketing system API must be managed.
- Custom logic is needed to map ticket state transitions to ctrlplane job status updates.
- There is no native integration with ctrlplane’s notification system, audit log, or UI.
Proposal
Agent type and config
Register a new agent type manual-action in the workspace engine’s job agent
registry. The job agent config describes what the human needs to do and who
should be notified:
| Field | Required | Description |
|---|---|---|
| name | Yes | Short name for the manual task, displayed in the UI and notifications. |
| description | Yes | Go template string describing what the human needs to do. Receives the dispatch context. |
| assignees | No | List of team slugs or user emails to notify. If omitted, the notification goes to the configured channel. |
| channels | No | Notification channels for this task. Falls back to workspace notification defaults if omitted. |
| timeout | No | ISO 8601 duration after which the job is marked as failed if not completed. Default: no timeout. |
| requireEvidence | No | If true, the completion request must include an evidence field (URL, description, or attachment reference). |

The description field is a Go template rendered with {[ ]} delimiters
(matching the convention from RFC 0005). This allows the task description to
include deployment-specific context.
Dispatch lifecycle
When the workspace engine dispatches a job to the manual-action agent, the
agent transitions the job to action_required, notifies assignees through the
configured channels, and returns. The job remains
in this state until an external signal (API call, Slack interaction, UI button)
advances it. There is no background goroutine watching an external system.
Job status: action_required
A new job status action_required is added to the JobStatus enum. This
status indicates that the job has been dispatched and is waiting for a human to
complete a task. It is semantically distinct from:
- pending: job has not been dispatched yet.
- in_progress: job has been dispatched and an external system is actively executing it.
- action_required: job has been dispatched but requires a human to do something before it can complete.

action_required is treated the same as in_progress for
promotion lifecycle purposes: downstream deployments wait for the job to reach
a terminal state (successful or failure).

The UI shows action_required jobs with a distinct visual treatment (an
amber indicator with a call-to-action button) to differentiate them from
automated jobs that are still running.
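
The status semantics above can be sketched in Go as follows. This is a minimal illustration, not ctrlplane's actual code: the type name, constant names, and the IsTerminal and BlocksDownstream helpers are assumptions. Only the status strings come from this RFC.

```go
package main

import "fmt"

// JobStatus mirrors the status values named in this RFC. The Go identifiers
// are illustrative assumptions.
type JobStatus string

const (
	JobStatusPending        JobStatus = "pending"
	JobStatusInProgress     JobStatus = "in_progress"
	JobStatusActionRequired JobStatus = "action_required"
	JobStatusSuccessful     JobStatus = "successful"
	JobStatusFailure        JobStatus = "failure"
)

// IsTerminal reports whether the job has finished, successfully or not.
func (s JobStatus) IsTerminal() bool {
	return s == JobStatusSuccessful || s == JobStatusFailure
}

// BlocksDownstream reports whether downstream deployments must keep waiting.
// action_required is handled exactly like in_progress: every non-terminal
// status blocks promotion.
func (s JobStatus) BlocksDownstream() bool {
	return !s.IsTerminal()
}

func main() {
	for _, s := range []JobStatus{JobStatusInProgress, JobStatusActionRequired, JobStatusSuccessful} {
		fmt.Printf("%s blocks downstream: %v\n", s, s.BlocksDownstream())
	}
}
```

Keeping the promotion check in terms of "terminal or not" means the new status needs no special case there; only the UI has to distinguish it.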
Implementation
Go types
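
A possible Go shape for the agent config is sketched below, assuming the config arrives as JSON. The struct name, the ParseConfig helper, and the choice of []string for channels are assumptions made for illustration. The field names and the required/optional split follow the config table in this RFC.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ManualActionConfig is a hypothetical Go shape for the agent config.
// Field names follow the RFC; the struct name and types are assumptions.
type ManualActionConfig struct {
	Name            string   `json:"name"`
	Description     string   `json:"description"` // Go template source
	Assignees       []string `json:"assignees,omitempty"`
	Channels        []string `json:"channels,omitempty"`
	Timeout         string   `json:"timeout,omitempty"` // ISO 8601 duration; empty means none
	RequireEvidence bool     `json:"requireEvidence,omitempty"`
}

// ParseConfig decodes a JSON config and enforces the two required fields.
func ParseConfig(raw []byte) (ManualActionConfig, error) {
	var cfg ManualActionConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return cfg, fmt.Errorf("decode manual-action config: %w", err)
	}
	if cfg.Name == "" {
		return cfg, errors.New("manual-action config: name is required")
	}
	if cfg.Description == "" {
		return cfg, errors.New("manual-action config: description is required")
	}
	return cfg, nil
}

func main() {
	raw := []byte(`{"name":"Verify rack","description":"Confirm cabling","requireEvidence":true}`)
	cfg, err := ParseConfig(raw)
	fmt.Println(cfg.Name, cfg.RequireEvidence, err)
}
```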
Dispatchable implementation
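
The dispatch behavior described in this RFC (set action_required, notify, return without blocking) might look roughly like the sketch below. Job, JobStore, ManualActionAgent, and the method signatures are hypothetical stand-ins for workspace-engine types. Only the method name SendManualActionNotification is taken from this RFC, and treating a failed notification as non-fatal is an assumption.

```go
package main

import (
	"context"
	"fmt"
)

// Job and JobStore are hypothetical stand-ins for workspace-engine types.
type Job struct {
	ID     string
	Status string
}

type JobStore interface {
	SetStatus(ctx context.Context, jobID, status string) error
}

// Notifier carries the method name from the RFC; the signature is assumed.
type Notifier interface {
	SendManualActionNotification(ctx context.Context, job Job, description string) error
}

// ManualActionAgent holds the collaborators Dispatch needs. Notifier may be nil.
type ManualActionAgent struct {
	Store    JobStore
	Notifier Notifier
}

// Dispatch moves the job to action_required, sends a best-effort
// notification, and returns. It never waits for the human.
func (a *ManualActionAgent) Dispatch(ctx context.Context, job *Job, description string) error {
	if err := a.Store.SetStatus(ctx, job.ID, "action_required"); err != nil {
		return fmt.Errorf("set action_required: %w", err)
	}
	job.Status = "action_required"
	if a.Notifier == nil {
		return nil // task stays visible in the UI; nothing is sent externally
	}
	if err := a.Notifier.SendManualActionNotification(ctx, *job, description); err != nil {
		// Assumption: a failed notification does not fail the dispatch.
		fmt.Printf("notify job %s: %v\n", job.ID, err)
	}
	return nil
}

// memStore and countingNotifier exist only for the demonstration below.
type memStore map[string]string

func (m memStore) SetStatus(_ context.Context, jobID, status string) error {
	m[jobID] = status
	return nil
}

type countingNotifier struct{ calls int }

func (n *countingNotifier) SendManualActionNotification(context.Context, Job, string) error {
	n.calls++
	return nil
}

// runDemo dispatches one job and reports the stored status and notify count.
func runDemo() (string, int, error) {
	store, notifier := memStore{}, &countingNotifier{}
	agent := &ManualActionAgent{Store: store, Notifier: notifier}
	job := &Job{ID: "job-1", Status: "pending"}
	err := agent.Dispatch(context.Background(), job, "Swap the failed disk")
	return store["job-1"], notifier.calls, err
}

func main() {
	fmt.Println(runDemo())
}
```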
Timeout enforcement
If a timeout is configured, a background goroutine waits for the duration and then checks whether the job is still in action_required state. If so, it
transitions the job to failure:
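
A minimal sketch of this timeout check, assuming an in-memory status store in place of the real job table. The names statusStore, failIfStillWaiting, and enforceTimeout are hypothetical. The point it illustrates is that the transition to failure is conditional on the job still being action_required, so a completion that arrives just before the timer fires is not overwritten.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// statusStore is a hypothetical in-memory stand-in for the job table.
type statusStore struct {
	mu     sync.Mutex
	status map[string]string
}

// failIfStillWaiting flips the job to failure only if it is still
// action_required, and reports whether it did so.
func (s *statusStore) failIfStillWaiting(jobID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.status[jobID] != "action_required" {
		return false
	}
	s.status[jobID] = "failure"
	return true
}

func (s *statusStore) get(jobID string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.status[jobID]
}

// enforceTimeout starts a goroutine that waits for the timeout (or for ctx
// cancellation) and then applies the conditional transition. The returned
// channel is closed when the goroutine exits.
func enforceTimeout(ctx context.Context, s *statusStore, jobID string, timeout time.Duration) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		timer := time.NewTimer(timeout)
		defer timer.Stop()
		select {
		case <-timer.C:
			s.failIfStillWaiting(jobID)
		case <-ctx.Done():
			// Engine shutting down or job cancelled: leave the status alone.
		}
	}()
	return done
}

// runDemo lets one waiting job time out while another job, already
// completed, is left untouched by its timer.
func runDemo() (timedOut, completed string) {
	s := &statusStore{status: map[string]string{"a": "action_required", "b": "successful"}}
	doneA := enforceTimeout(context.Background(), s, "a", 10*time.Millisecond)
	doneB := enforceTimeout(context.Background(), s, "b", 10*time.Millisecond)
	<-doneA
	<-doneB
	return s.get("a"), s.get("b")
}

func main() {
	fmt.Println(runDemo())
}
```

In a real store the same guard would typically be a conditional update (write failure only where the status is still action_required) rather than a mutex.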
Description rendering
The description template is rendered using the same templatefuncs pipeline
as other job agents, with {[ / ]} delimiters:
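
As a rough illustration, the standard library's text/template supports custom delimiters through Delims. The sketch below uses only the standard library and does not reproduce the templatefuncs pipeline. The renderDescription helper and the context keys (.resource.name, .version.tag) are hypothetical, and failing on missing keys is an assumed design choice.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderDescription parses tmpl with {[ and ]} as action delimiters and
// executes it against data. A plain map stands in for the dispatch context,
// whose real shape is not specified here.
func renderDescription(tmpl string, data any) (string, error) {
	t, err := template.New("description").
		Delims("{[", "]}").
		Option("missingkey=error"). // fail loudly instead of printing <no value>
		Parse(tmpl)
	if err != nil {
		return "", fmt.Errorf("parse description: %w", err)
	}
	var sb strings.Builder
	if err := t.Execute(&sb, data); err != nil {
		return "", fmt.Errorf("render description: %w", err)
	}
	return sb.String(), nil
}

func main() {
	ctx := map[string]any{
		"resource": map[string]any{"name": "edge-node-7"},
		"version":  map[string]any{"tag": "v2.4.1"},
	}
	out, err := renderDescription("Verify {[ .resource.name ]} before deploying {[ .version.tag ]}.", ctx)
	fmt.Println(out, err)
}
```

With these delimiters, any literal {{ }} in a description passes through unchanged, which avoids clashing with other templating layers.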
Completion API
A new endpoint allows humans (or integrations) to mark a manual action job as completed. The request accepts the following fields:
| Field | Required | Description |
|---|---|---|
| status | No | successful (default) or failure. Allows the human to report that the task failed. |
| message | No | Free-text message describing what was done or why it failed. |
| evidence | Conditional | Required if requireEvidence = true in the agent config. URL or description. |

The endpoint validates that:
- The job exists and is in action_required status.
- The caller has permission to complete jobs in this workspace.
- If requireEvidence is configured, the evidence field is present and non-empty.
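
The completion checks could be expressed as a single pure function, sketched below. CompletionRequest, validateCompletion, and the error values are hypothetical names. The permission check is reduced to a boolean because the authorization model is outside this section. Rejecting an unknown status value is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// CompletionRequest mirrors the completion request fields (status, message,
// evidence). The struct name is an assumption.
type CompletionRequest struct {
	Status   string `json:"status,omitempty"`
	Message  string `json:"message,omitempty"`
	Evidence string `json:"evidence,omitempty"`
}

var (
	ErrNotActionRequired = errors.New("job is not in action_required status")
	ErrForbidden         = errors.New("caller may not complete jobs in this workspace")
	ErrInvalidStatus     = errors.New("status must be successful or failure")
	ErrEvidenceRequired  = errors.New("evidence is required for this task")
)

// validateCompletion applies the checks in order and returns the terminal
// status to write. An empty status defaults to successful.
func validateCompletion(currentStatus string, callerAllowed, requireEvidence bool, req CompletionRequest) (string, error) {
	if currentStatus != "action_required" {
		// Also covers a second "Complete" click after the job is terminal.
		return "", ErrNotActionRequired
	}
	if !callerAllowed {
		return "", ErrForbidden
	}
	status := req.Status
	if status == "" {
		status = "successful"
	}
	if status != "successful" && status != "failure" {
		return "", ErrInvalidStatus
	}
	if requireEvidence && strings.TrimSpace(req.Evidence) == "" {
		return "", ErrEvidenceRequired
	}
	return status, nil
}

func main() {
	req := CompletionRequest{Evidence: "https://example.com/runbook"}
	fmt.Println(validateCompletion("action_required", true, true, req))
}
```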
Slack integration
The Slack integration is the primary notification channel for manual actions. When a manual action job is dispatched, a Slack message is sent to the configured channel with an interactive Block Kit layout.
Message format
Interaction handler
When a user clicks a button in Slack, the Slack API sends an interaction payload to ctrlplane’s Slack integration endpoint. The handler:
- Verifies the Slack request signature.
- Extracts the action_id and value (job ID).
- Resolves the Slack user to a ctrlplane user via the workspace’s Slack integration mapping.
- Calls the completion API internally.
- Updates the original Slack message to reflect the new status.
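
The first handler step, signature verification, follows Slack's documented v0 signing scheme: an HMAC-SHA256 over "v0:" + timestamp + ":" + raw body, keyed with the app's signing secret and compared against the X-Slack-Signature header, with the timestamp taken from X-Slack-Request-Timestamp. The sketch below shows that check in isolation. The function names are hypothetical, and passing the current time as a Unix value is a choice made here to keep the function deterministic.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// signSlack computes the v0 signature for a timestamp and raw request body.
func signSlack(secret, timestamp string, body []byte) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte("v0:" + timestamp + ":"))
	mac.Write(body)
	return "v0=" + hex.EncodeToString(mac.Sum(nil))
}

// verifySlack checks the signature header against the expected value and
// rejects requests whose timestamp is more than five minutes from nowUnix,
// which limits replay of captured requests.
func verifySlack(secret, timestamp, signature string, body []byte, nowUnix int64) bool {
	ts, err := strconv.ParseInt(timestamp, 10, 64)
	if err != nil {
		return false
	}
	skew := nowUnix - ts
	if skew < 0 {
		skew = -skew
	}
	if skew > 300 {
		return false
	}
	expected := signSlack(secret, timestamp, body)
	// hmac.Equal compares in constant time.
	return hmac.Equal([]byte(expected), []byte(signature))
}

func main() {
	body := []byte("payload=%7B%22type%22%3A%22block_actions%22%7D")
	sig := signSlack("example-signing-secret", "1700000000", body)
	fmt.Println(verifySlack("example-signing-secret", "1700000000", sig, body, 1700000010))
}
```

The signature must be computed over the raw request body, before any form or JSON decoding.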
Message update on completion
After the job is completed (via Slack or any other method), the original Slack message is updated to show the resolved state. The action buttons are removed and replaced with a status block.
Evidence collection via Slack modal
When requireEvidence = true, clicking “Mark as Completed” opens a Slack modal
instead of immediately completing the job. The modal prompts for:
- A text description of what was done.
- An optional URL to supporting evidence (runbook, screenshot, monitoring dashboard).
Notification channels
The manual action agent supports multiple notification channels through the existing notification system. Each channel type has a specific renderer:
| Channel | Behavior |
|---|---|
| slack | Sends a Block Kit message with interactive buttons. Supports completion via Slack. |
| email | Sends an email with task description and a deep link to the ctrlplane UI. |
| webhook | POSTs a JSON payload to a configured URL. Used for custom integrations. |

Reminders can be sent while the job remains in action_required:
| Field | Description |
|---|---|
| reminder.interval | ISO 8601 duration between reminders. |
| reminder.maxReminders | Maximum number of reminders to send before stopping. Default 0 (no reminders). |
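
One way to interpret the two reminder fields is sketched below: the number of reminders due is the elapsed time divided by the interval, capped at maxReminders. The function names are hypothetical, the interval is assumed to be already parsed from its ISO 8601 form into a time.Duration, and counting from dispatch time is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// remindersDue returns how many reminders should have been sent once
// `elapsed` has passed since dispatch. A zero maxReminders disables reminders.
func remindersDue(elapsed, interval time.Duration, maxReminders int) int {
	if interval <= 0 || maxReminders <= 0 || elapsed < interval {
		return 0
	}
	n := int(elapsed / interval)
	if n > maxReminders {
		n = maxReminders
	}
	return n
}

// shouldSendReminder reports whether another reminder is owed, given how
// many have already been sent.
func shouldSendReminder(elapsed, interval time.Duration, maxReminders, alreadySent int) bool {
	return alreadySent < remindersDue(elapsed, interval, maxReminders)
}

func main() {
	elapsed := 150 * time.Minute
	fmt.Println(remindersDue(elapsed, time.Hour, 5))
	fmt.Println(shouldSendReminder(elapsed, time.Hour, 5, 2))
}
```

Deriving the count from elapsed time, rather than from a running timer, means the schedule can be recomputed after a restart from the dispatch timestamp and the number of reminders already recorded.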
Registry registration
tRPC and UI integration
Job agent config type
Job completion tRPC route
UI: job detail view
When a job has status action_required, the job detail view displays:
- Task description — the rendered description from the agent config, formatted as markdown.
- Assignees — who is responsible for completing the task.
- Status timeline — when the job was dispatched, when notifications were sent, when reminders were sent.
- Action buttons — “Mark as Completed” and “Report Failure” buttons.
- Evidence field — if requireEvidence is true, a text input and URL field that must be filled before completion.
- Timeout indicator — if a timeout is configured, a countdown showing remaining time.

The UI also marks action_required jobs with an amber badge
and the task name, making it immediately visible which deployments are waiting
on human action.
Deployment configuration
Terraform
CLI YAML
Examples
Multi-step deployment with manual checkpoint
A system has three deployments in sequence: database migration (automated), hardware verification (manual), and application deploy (automated). The manual step ensures a human confirms the target node is ready before the application is deployed to it.
Customer notification gate
Before deploying a breaking API change to a customer’s dedicated environment, the account team must confirm the customer has been notified and has acknowledged the maintenance window.
Compliance sign-off
A regulated deployment requires a compliance officer to review and sign off before proceeding.
Migration
- The action_required value is added to the job_status enum. This is an additive change; existing jobs are unaffected.
- No schema changes to existing tables. The manual action metadata is stored in the job’s existing metadata JSONB column.
- The completion API endpoint is new. No changes to existing endpoints.
- The Slack interaction handler is new. It is registered alongside the existing Slack integration webhook handlers.
- The manual-action agent type is registered in the workspace engine’s controller. No changes to the reconciler or promotion lifecycle beyond recognizing the new action_required status.
- The notification system must support the SendManualActionNotification method. This extends the existing Notifier interface. If the notification system is not configured, the agent still transitions the job to action_required; the task is visible in the UI but no external notification is sent.
Open Questions
- Reassignment. The initial proposal assigns the task at dispatch time via the assignees field in the agent config. Should the UI and API support reassigning a manual action to a different user or team after dispatch? This is useful when the original assignee is unavailable, but adds complexity to the notification flow (the new assignee needs to be notified, the original assignee’s notification should be updated).
- Escalation. If a manual action is not completed within a configurable period (shorter than the timeout), should the system escalate to a different set of assignees? For example, after 2 hours notify the team lead, after 4 hours notify the on-call manager. This is a common pattern in incident management tools but adds significant complexity.
- Partial completion. Some manual tasks have multiple steps (a checklist). Should the agent support partial completion where each checklist item is tracked independently, or is a single “completed/failed” status sufficient? Partial completion provides better visibility but the checklist structure must be defined in the agent config and rendered in both the UI and Slack.
- Restorable semantics. After a workspace-engine restart, action_required jobs with configured timeouts need their timeout goroutines restarted. The agent should implement Restorable to query for action_required jobs on startup and re-establish timeout enforcement. Should the initial implementation include restore support, or is it acceptable to lose timeout enforcement on restart (the job remains in action_required indefinitely until manually completed or failed)?
- Slack app permissions. The interactive Slack integration requires the ctrlplane Slack app to have the chat:write and commands scopes, with interactivity enabled. If the workspace does not have a Slack integration configured, should the agent fall back to a non-interactive notification (plain message without buttons), or should it fail at dispatch time with a configuration error?
- Idempotent completion. If multiple people click “Complete” in Slack simultaneously, the second request should be a no-op (the job is already in a terminal state). The current proposal handles this via the status check in the completion endpoint. Should the UI also show who else attempted to complete the task, or is the first completion sufficient?
- Webhook completion. The webhook notification channel sends a JSON payload with the task details. Should the webhook payload include a callback URL and a signed token that allows the external system to call the completion API without separate authentication? This enables “complete via webhook callback” for systems that can process and respond programmatically (e.g., a ServiceNow integration that auto-completes the ctrlplane job when a change request is approved).
- Interaction with deployment freeze. If a deployment freeze (RFC 0008) is activated while a manual action job is in action_required state, should the freeze prevent the job from being completed? The freeze blocks new job creation, but an action_required job has already been dispatched. The safe default is to allow completion (the freeze prevents downstream jobs, not in-flight ones), but some organizations may want the freeze to also prevent manual action completion.