Runbook: Execute Your Ops Runbooks with an AI Agent

Every engineering team has runbooks. Markdown files sitting in an ops/ or runbooks/ folder describing how to deploy, how to roll back, how to run a migration. Written for humans. Followed by humans. Manually, step by step, every time.

The thing that bugged me is that the knowledge is already written down. The steps are documented, the order is clear, the "do not proceed if tests fail" is right there in step three. The only reason a person still has to sit and run each line is that nobody taught the agent to read it.

So I built Runbook, an Agent Skill that reads a Markdown runbook and runs it for you, one step at a time. It works in Cursor, Claude Code, Claude Web, and Claude Cowork, because it's packaged with skillship.

Why not just write a script

Most automation replaces the runbook with a script: a Makefile, a shell script, a CI pipeline. That works, but you lose the runbook: the plain English that says why a step exists, the conditional written in prose, the warning buried mid-list. Scripts capture what to do. Runbooks capture what to do and what to watch for while you do it.

I wanted to keep the runbook as written, no DSL and no rewrite, and still have it run.

Install it

Both paths need Node.js 18 or newer (that gives you npx).

Cursor or Claude Code

Install straight from the repo into your local agents:

npx skillship@latest install shivdeepak/runbook -a cursor -a claude-code
# or, via the underlying multi-agent installer:
npx skills add shivdeepak/runbook

Run your first runbook

Point the skill at a Markdown file and it walks the steps:

/runbook                          # list every runbook by number
/runbook 1                        # run runbooks/01-*.md by number
/runbook ops/deploy.md            # run one by path
/runbook --dry-run 1              # walk the steps, run nothing
/runbook 1 version=1.4.2          # pass a variable
/runbook create "deploy to prod"  # generate a new runbook

A first tip: run --dry-run before you trust a runbook with anything real. It walks every step and prints what it would run, how it classified each one, and where it would stop, without executing a thing.

How it decides what to run

Before it touches anything, the skill reads every step and sorts it into one of three buckets, then shows you that breakdown up front so there are no surprises mid-run.

Safe is read-only or fully reversible: read a file, run the tests, check a URL. Runs automatically.
Risky is destructive or hard to undo: deploy, delete, send a message. It shows you the exact command and waits for a y/n.
Blocked needs something the agent can't supply: a human judgment call, a missing credential. It stops and tells you exactly what's needed.
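
To make the three buckets concrete, here is a small Python sketch of the decision table. The names `Bucket` and `action_for` are made up for illustration. The skill itself is a set of instructions the agent follows, so this only mirrors the behavior described above, not how it is implemented.

```python
from enum import Enum


class Bucket(Enum):
    SAFE = "safe"        # read-only or fully reversible
    RISKY = "risky"      # destructive or hard to undo
    BLOCKED = "blocked"  # needs something the agent can't supply


def action_for(bucket: Bucket) -> str:
    """Return what happens to a step in the given bucket (hypothetical helper)."""
    if bucket is Bucket.SAFE:
        return "run automatically"
    if bucket is Bucket.RISKY:
        return "show the exact command and wait for y/n"
    return "stop and report what is needed"
```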

For each step it then reads the step, classifies it, runs it (or pauses, or stops), and watches the result. If a command exits non-zero, it halts and offers Retry, Skip, or Abort rather than barreling ahead.
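
The halt-on-failure part of that loop can be sketched in a few lines of Python. This is an illustration under assumptions, not the skill's code: `run_with_recovery` and the `choose` callback are hypothetical names, and `choose` stands in for the Retry / Skip / Abort prompt.

```python
import subprocess
from typing import Callable


def run_with_recovery(command: str, choose: Callable[[int], str]) -> str:
    """Run a shell command; on a non-zero exit, ask what to do next.

    `choose` receives the exit code and returns "retry", "skip", or "abort".
    The return value is the final outcome: "ok", "skipped", or "aborted".
    """
    while True:
        result = subprocess.run(command, shell=True)
        if result.returncode == 0:
            return "ok"
        decision = choose(result.returncode)
        if decision == "retry":
            continue  # run the same command again
        return "skipped" if decision == "skip" else "aborted"
```

The point of the shape is that nothing after a failed command runs until someone has made an explicit choice.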

The runbook format

Plain Markdown, no schema. You write steps the way you'd explain them to a new engineer:

# Deploy to Production
 
## Purpose
Deploy the latest main branch to production.
 
## Prerequisites
- AWS credentials configured
- Staging deploy passed smoke tests
 
## Steps
 
### 1. Pull latest main
Pull the latest code from the main branch.
 
### 2. Run tests
Run the full test suite. Do not proceed if any tests fail.
 
### 3. Deploy
Run `./scripts/deploy.sh production`
 
### 4. Smoke test
Fetch https://myapp.com/health and confirm it returns `{"status": "ok"}`.

A few reading conventions:

Inline code is a literal command. `./scripts/deploy.sh production` runs exactly that.
Plain English is intent. "Run the tests" gets resolved to the right command for your project (more on that below).
{variable} marks a value filled in at runtime.
Conditionals live in prose. "Do not proceed if any tests fail" is treated as real logic, checked against the exit code of the command it guards.
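
A rough Python sketch of how those conventions could be pulled out of a file mechanically. The regexes and the `parse_steps` name are my assumptions for illustration. In practice the agent reads the text with judgment: inline code can also be an expected value, as in the smoke-test step above, so a pattern match only yields candidates.

```python
import re

STEP_HEADING = re.compile(r"^### (\d+)\. (.+)$")  # e.g. "### 3. Deploy"
INLINE_CODE = re.compile(r"`([^`]+)`")            # text between backticks
PLACEHOLDER = re.compile(r"\{(\w+)\}")            # e.g. {version}


def parse_steps(markdown: str) -> list[dict]:
    """Split a runbook into numbered steps with their inline code and variables."""
    steps: list[dict] = []
    current = None
    for raw in markdown.splitlines():
        line = raw.strip()
        heading = STEP_HEADING.match(line)
        if heading:
            current = {"number": int(heading.group(1)),
                       "title": heading.group(2),
                       "text": ""}
            steps.append(current)
        elif line.startswith("#"):
            current = None  # any other heading ends the current step
        elif current is not None and line:
            current["text"] = (current["text"] + " " + line).strip()
    for step in steps:
        step["inline_code"] = INLINE_CODE.findall(step["text"])
        step["variables"] = PLACEHOLDER.findall(step["text"])
    return steps
```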

Marking the steps that need care

Two blockquote markers override the automatic classification, and they're the kind of thing you add when you review a generated runbook:

> ⚠️ forces a confirmation prompt on a step no matter how it was classified.
> 👤 marks a human step: the agent pauses, shows the instruction, and waits for you to confirm you did it before moving on.
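
As a sketch of the override, assuming a step's text is available as a string: the function below looks for blockquote lines that start with either marker and lets them win over the automatic classification. `effective_handling` is a hypothetical name, and letting 👤 take precedence when both markers appear is my assumption, not documented behavior.

```python
def effective_handling(step_text: str, classified: str) -> str:
    """Return how a step is handled once blockquote markers are applied."""
    markers = set()
    for line in step_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(">"):
            note = stripped[1:].strip()
            if note.startswith("👤"):
                markers.add("human")
            elif note.startswith("⚠"):  # matches with or without the emoji variation selector
                markers.add("confirm")
    if "human" in markers:
        return "human"    # pause and wait for the person to confirm
    if "confirm" in markers:
        return "confirm"  # always prompt, whatever the classification
    return classified
```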

Numbering

Runbook files carry a zero-padded two-digit prefix, like runbooks/01-deploy-production.md. The number gives each one a stable, memorable identity, so /runbook 1 beats typing the whole path for the ones you run often. With no arguments, /runbook prints the index, pulling each description from the file's ## Purpose line.
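
The lookup by number is simple enough to sketch in Python. Both helpers are hypothetical and only illustrate the convention: a number maps to the file with that zero-padded prefix, and the index description is the first line under ## Purpose.

```python
from pathlib import Path


def resolve_runbook(arg: str, root: Path = Path("runbooks")) -> Path:
    """Map a number such as "1" to <root>/01-*.md; treat anything else as a path."""
    if not arg.isdigit():
        return Path(arg)
    prefix = f"{int(arg):02d}"
    matches = sorted(root.glob(f"{prefix}-*.md"))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected one runbook numbered {prefix}, found {len(matches)}")
    return matches[0]


def purpose_of(path: Path) -> str:
    """Return the first non-empty line under the '## Purpose' heading, if any."""
    in_purpose = False
    for line in path.read_text(encoding="utf-8").splitlines():
        text = line.strip()
        if text == "## Purpose":
            in_purpose = True
        elif in_purpose and text.startswith("#"):
            break  # reached the next heading without finding a description
        elif in_purpose and text:
            return text
    return ""
```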

You don't have to write them yourself

The authoring flow is agent-generated and human-reviewed. You describe what you need:

/runbook create "deploy to production"

Before asking you a single question, it reads the project to draft a first version:

Stack from pyproject.toml, package.json, Makefile, Dockerfile.
Scripts from scripts/, Makefile, or a justfile.
CI from .github/workflows/ and friends, to match the pipeline you already use.
Git history for how deploys usually happen.
Existing runbooks for your conventions.

Then it asks only what it genuinely can't infer: which environment, which Slack channel, whether a step needs a human sign-off, what the rollback policy is. It writes the next-numbered file and offers a dry run so you can review before relying on it. The draft is a starting point, not the final word: you read it, fix anything it guessed wrong, and add the ⚠️ and 👤 markers where they belong.

What it handles under the hood

If you're still reading, here's the rest of what the skill does.

Step resolution

When a step is plain English instead of a literal command, it uses project context to figure out the real action:

"Run the tests" detects the framework from pyproject.toml, package.json, and similar, then runs the matching command.
"Check the health endpoint" fetches the URL (via Playwright MCP or curl) and verifies the response.
"Notify Slack" uses a Slack MCP or webhook if one is configured, otherwise it asks you to do it.
"Pull from main" and "create a branch" map to the obvious git commands.

When it isn't confident, it says so before running, not after: "I think this means X, is that right?"
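
For a feel of what "detects the framework" can mean, here is a deliberately small Python sketch. It is an illustration only: `guess_test_command` is a hypothetical name, it knows just two project files, and mapping pyproject.toml to pytest is an assumption. Returning None is the "I'm not sure, ask first" case.

```python
import json
from pathlib import Path
from typing import Optional


def guess_test_command(project: Path) -> Optional[str]:
    """Guess a test command from common project files; None means ask the user."""
    package_json = project / "package.json"
    if package_json.exists():
        data = json.loads(package_json.read_text(encoding="utf-8"))
        if "test" in data.get("scripts", {}):
            return "npm test"
    if (project / "pyproject.toml").exists():
        return "pytest"  # assumption: a Python project that uses pytest
    return None
```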

Variables

A {variable} reference resolves from the first of these sources that has a value: arguments you passed at invocation, the output of an earlier step (a version pulled from a build artifact, say), or a prompt to you if it's still unknown.
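
That lookup order fits in a few lines of Python. The function and parameter names are hypothetical, and `ask` stands in for the prompt.

```python
from typing import Callable, Mapping


def resolve_variable(name: str,
                     cli_args: Mapping[str, str],
                     step_outputs: Mapping[str, str],
                     ask: Callable[[str], str]) -> str:
    """Resolve a {variable}: invocation arguments, then step outputs, then a prompt."""
    if name in cli_args:
        return cli_args[name]
    if name in step_outputs:
        return step_outputs[name]
    return ask(name)  # last resort: ask the person running the runbook
```

With /runbook 1 version=1.4.2, `cli_args` would be {"version": "1.4.2"}, and that value wins even if an earlier step also produced a version.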

Conditionals

Written in prose and treated as first-class logic. "Do not proceed if any tests fail" checks the exit code and aborts. "If the health check fails, run the rollback" branches to that section. If a conditional is ambiguous, it asks before acting.

Credentials

It never stores, prompts for, or injects secrets. It checks whether the environment variable a step needs is already set, and if not, it pauses and names exactly what's missing: "This step needs AWS_ACCESS_KEY_ID set in the environment." You set it, it continues.
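
The check itself is the easy part. A minimal Python sketch, with a hypothetical function name, that reports which variables are missing without reading, printing, or storing their values:

```python
import os
from typing import Iterable, Mapping


def missing_credentials(required: Iterable[str],
                        env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]
```

An empty list means the step can proceed. Anything else is the list of names to show in the pause message.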

Audit logs

By default, every run of a runbook with a risky or blocked step writes a Markdown log to .runbook-logs/. The log is written incrementally, so it survives even if the run is interrupted halfway. Each entry records the step, its status, the command, and the output.
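
"Incrementally" is the important word. One way to get that property, sketched in Python: append one entry per step and close the file each time, so whatever finished before an interruption is already on disk. The entry layout and the `append_log_entry` name are assumptions for illustration, not the skill's actual log format.

```python
from pathlib import Path


def append_log_entry(log_path: Path, step: str, status: str,
                     command: str, output: str) -> None:
    """Append one step's record to a Markdown log, creating the folder if needed."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    indented = "".join(f"    {line}\n" for line in output.splitlines())
    entry = (
        f"## {step}\n\n"
        f"- Status: {status}\n"
        f"- Command: `{command}`\n\n"
        "Output:\n\n"
        f"{indented}\n"
    )
    # Open in append mode and close per step, so earlier entries are never rewritten.
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(entry)
```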

You can pin the behavior per runbook with an audit field in the frontmatter (auto, always, or never), and override it for a single run with --force-log or --no-log. By default it auto-detects: it logs when a runbook has any risky or blocked step, and skips the log when every step is safe. Whatever it decides, it tells you up front and why.
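
Put together, the logging decision looks roughly like this in Python. The function is a hypothetical sketch of the rules above. One assumption of mine: if both flags are passed, --force-log wins.

```python
def should_log(audit: str, buckets: list[str],
               force_log: bool = False, no_log: bool = False) -> bool:
    """Decide whether a run writes an audit log.

    `audit` is the frontmatter value ("auto", "always", or "never");
    `buckets` holds each step's classification ("safe", "risky", "blocked").
    """
    if force_log:          # per-run flags override the frontmatter
        return True
    if no_log:
        return False
    if audit == "always":
        return True
    if audit == "never":
        return False
    # "auto": log only when at least one step is risky or blocked
    return any(bucket in ("risky", "blocked") for bucket in buckets)
```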

Try it on one of yours

The repo is at github.com/shivdeepak/runbook. If your team already keeps runbooks, that's the whole point: you don't have to convert anything. Point it at the messiest one you've got and run it with --dry-run first. The worst case is it tells you what it would have done, which is a decent thing to learn about your own runbook anyway.