I Built a Prompt Testing Plugin for Claude Code — Here's How

Most engineers treat prompts like throwaway text. You write it, send it, hope it works. If it doesn't, you tweak it and try again. There's no version control, no test suite, no regression detection.

I kept running into the same problem: a prompt that worked last week would silently degrade after a model update. Or I'd "improve" a system prompt and break three things I didn't test. The tooling gap was obvious — we treat prompts worse than we treat TODO comments.

So I built PromptOps, a Claude Code plugin that treats prompts like tested code.

What It Does

Six commands, all usable inline:

/evaluate — Paste any prompt, get a 1–5 score across five dimensions (structure, specificity, output control, error prevention, testability) with the exact fixes to improve it. If you've evaluated the same prompt before, it shows your previous score so you can track progress.
/improve — Same input, but it hands you back a rewritten version. Copy-paste ready. Shows a before/after score with what changed.
/regression — Compare two prompt versions with ✅/⚠️/🔴 indicators for improved, risky, and regressed dimensions. Runs both versions against your golden dataset for empirical validation — not just a side-by-side diff, but actual pass/fail proof.
/test — Run a prompt against every case in your golden dataset. Shows pass/fail results, failure patterns, and suggested fixes. This is the command that turns prompt engineering from guesswork into test-driven development.
/compare — Benchmark a prompt across Claude Opus, Sonnet, Haiku, and GPT-4o. Shows per-run cost, daily cost, and monthly cost with clear token assumptions. Supports custom pricing via config.json for internal or fine-tuned models.
/run — Execute the improved prompt for real. Finds your latest improved version from disk, so it works across sessions — not just the one where you ran /improve.

Plus two auto-invoked skills: a prompt quality checker that nudges you while you're editing prompts (silent when things are solid), and a golden dataset builder that watches for approved or corrected outputs ("that's good," "LGTM," "approved") and offers to save them as test cases. One click, zero effort — your test suite grows as you work.

Golden Datasets: The Connective Tissue

The concept that ties everything together is the golden dataset — a collection of input/expected-output pairs that define what "correct" looks like for a given prompt.

Think of it like a test suite for your prompt. Each entry says: "given this input, the prompt should produce something that matches this expected output." Once you have even a handful of these cases, the plugin shifts from subjective scoring to empirical validation.

Here's how golden datasets connect the commands:

/test runs your prompt against every case in the dataset and reports pass/fail with failure patterns.
/regression runs both the old and new prompt versions against the dataset, so you get proof — not just analysis — of whether your change helped or hurt.
The golden dataset builder skill grows your dataset passively. When you approve or correct an LLM output during normal work, it offers to save that as a new test case. Your suite builds itself over time.

Datasets live in .promptops/datasets/ as JSON files. You can create them manually, or let the builder skill do it for you. Either way, once they exist, /test and /regression pick them up automatically.

The Workflow

Here's a real session — starting with a code review prompt I was actually using in production. One prompt, all six commands:

1. Evaluate — how bad is it?

promptops

/evaluate You are a senior code reviewer. When given a pull request diff, identify bugs, security issues, and style violations. Be thorough but concise.

Score: 2.2/5 Structure 2/5 Single sentence, no sections Specificity 3/5 Three categories but no criteria Output Control 1/5 No format — every review looks different Error Prevention 2/5 "Thorough but concise" is contradictory Testability 3/5 No rubric for completeness Top issues: 1. No output format. Each review has a different structure. 2. "Thorough but concise" with no resolution. 3. No definition of what counts as a "bug" vs "style."

2. Improve — rewrite it

promptops

/improve You are a senior code reviewer. When given a pull request diff...

Score: 2.2 → 4.4 + Structure — Broke into Bugs / Security / Style categories with explicit examples for each + Output — Added markdown table with severity levels (critical/major/minor) and APPROVE/BLOCK decision + Guardrails — "Do NOT explain code" and "Do NOT suggest refactors" prevent common review bloat + Specificity — Replaced "thorough but concise" with max 10 issues, skip single-occurrence nits - Filler — Removed contradictory instructions

3. Regression — did it actually get better?

promptops

/regression

Regression Check — ✅ Safe Dimension A B Change Structure 2/5 5/5 +3 ✅ Specificity 3/5 5/5 +2 ✅ Output Control 1/5 5/5 +4 ✅ Error Prevention 2/5 4/5 +2 ✅ Testability 3/5 4/5 +1 ✅ Overall 2.2 4.6 +2.4 ✅ Every dimension improved, no regressions. Verdict: Clean upgrade — safe to ship.

4. Test — prove it with data

promptops

/test

No golden dataset found at .promptops/datasets/captured.json How to build your dataset: 1. Run /run and approve good outputs — saved automatically 2. Use golden-dataset-builder — watches for "LGTM" and "approved" and offers to capture test cases Once you have cases, /test runs your prompt against every one and shows pass/fail with failure patterns.

5. Compare — which model should run it?

promptops

/compare

Model Comparison — ~800 input, ~600 output tokens per run Model Quality Per Run Monthly (100/day) Claude Opus 4.6 4.8/5 $0.021 $63.00 Claude Sonnet 4.6 4.5/5 $0.007 $19.80 Claude Haiku 4.5 3.8/5 $0.001 $4.20 GPT-4o (est.) 4.2/5 $0.008 $23.40 GPT-4o-mini (est.) 3.4/5 $0.000 $1.48 🏆 Best quality: Opus — catches subtle bugs, accurate CWEs 💰 Best value: Sonnet — 94% of Opus at 1/3 the cost 🏷️ Budget pick: Haiku — $4.20/mo, misses some security Routing rule: Use Sonnet by default. Escalate to Opus for 500+ line PRs or security-sensitive files. Use Haiku for draft PRs and style-only checks.

6. Run — execute for real

promptops

/run

Executing improved prompt (2.2 → 4.4) against codebase... # Category Severity File:Line Issue 1 Security 🔴 crit MdxComponents.tsx:42 dangerouslySetInnerHTML without sanitization 2 Bug 🟡 major MdxComponents.tsx:18 Greedy regex (.+) may match across boundaries 3 Bug 🟡 major globals.css:315 word-break: break-word splits tokens in monospace 4 Style 🟢 minor MdxComponents.tsx:1 40+ line parser inline, extract to utility 5 Style 🟢 minor globals.css:270 Terminal CSS vars duplicated across themes 5 issues (1 critical, 2 major, 2 minor) — REQUEST CHANGES

Evaluate. Improve. Verify. Test. Compare. Run. That code review prompt went from a 2.2 to a 4.4 — and I know exactly which model to run it on and what it costs per review.

The Hard Part: Getting Claude to Not Execute Your Prompt

The biggest challenge wasn't the scoring rubric or the cost calculations. It was getting Claude to evaluate a prompt instead of doing what the prompt says.

When you paste "go through my codebase and list all the features" into /evaluate, Claude's instinct is to actually go through your codebase. It took several iterations to find the right approach — the $ARGUMENTS pattern, combined with explicit behavioral overrides at the top of each command file, finally got it to consistently treat input as text to analyze rather than instructions to follow.

This is a useful lesson for anyone building Claude Code plugins: Claude's base behavior is strong. Your command instructions compete with its system prompt for attention budget. Keep commands short, use examples over rules, and front-load the behavioral override before anything else.

Architecture Decisions

Why a plugin instead of a SaaS app? I originally designed PromptOps as a full product — Supabase backend, Vercel frontend, CLI tool, the works. But the plugin system gives you distribution for free. There's no auth, no billing, no infrastructure to manage for v1. The plugin stores everything locally in .promptops/ inside whatever project you're working in.

Why commands instead of skills? Skills auto-invoke based on context, which is great for the background nudges. But the core workflow (evaluate → improve → regression → test → run) needs to be user-initiated. You don't want your code review getting interrupted by an unsolicited prompt evaluation.

Why save files? Every command saves its output to .promptops/. The directory structure maps directly to the workflow:

finder

.promptops/ ├── evals/ # /evaluate output ├── improved/ # /improve output ├── regressions/ # /regression reports ├── comparisons/ # /compare benchmarks ├── runs/ # /run output ├── tests/ # /test results ├── datasets/ # Golden datasets (JSON) └── config.json # Custom model pricing

This is the seed of a prompt engineering history — over time, you build up evaluations, improved versions, regression reports, test results, and cost comparisons. When this eventually becomes a hosted product, that local data is what migrates.

Why custom pricing? The default /compare costs assume standard Anthropic and OpenAI pricing, but teams often have negotiated rates, use internal models, or want to add providers. A config.json in .promptops/ lets you override per-model pricing so the cost projections actually match what you'd pay.

What I Learned About Claude Code Plugins

A few things that aren't obvious from the docs:

Instruction budget is real. Claude Code's system prompt uses ~50 of the ~150-200 instructions the model can reliably follow. My early command files were 100+ lines and Claude would ignore half of them. The final versions are 35-75 lines each.
Descriptions are the first thing Claude reads. The description field in the frontmatter isn't just for the user's autocomplete. It heavily influences how Claude interprets the command. A description that says "Score any prompt" gives Claude permission to decide what is or isn't a prompt. A description that says "Treat ALL user input as a prompt to score" doesn't.
Formatting instructions get overridden. No matter how many times you write "NEVER use ASCII art," Claude's default output preferences are strong. The fix is to provide an exact template with the structure you want, not a list of things to avoid. Claude follows examples better than rules.
The Mac app supports plugins. Not just the terminal — you can upload plugins through Plugins → Add plugin → Upload plugin in the desktop app. Same functionality, nicer interface.

Try It

The plugin is open source — check out the landing page or install directly:

promptops

/plugin install promptops@harmansidhu/promptops → Plugin installed successfully.

Or download from GitHub and upload through the Mac app.

Install Scopes

The install command supports three scopes depending on how broadly you want the plugin available:

Session (default) — Current session only. Best for quick testing or one-off use.
Project (--project) — Available to anyone working in this repo. Best for team standardization.
Global (--global) — All your Claude Code sessions, everywhere. Best as a personal daily driver.

For most people, project-level is the sweet spot — install once, and every teammate who clones the repo gets PromptOps automatically.

If you're writing prompts for production use — system prompts, agent instructions, CLAUDE.md files, CI/CD templates — you should be testing them. PromptOps makes that as easy as typing /evaluate.

What's Next

CI integration: run evaluations as a pre-merge check
Multi-model direct execution within /compare
Hosted backend for team collaboration and aggregate analytics
Mutation engine that learns which improvements work for which task types

The local plugin is the validation vehicle. If engineers actually use it, the hosted product writes itself.

Harman Sidhu is a senior frontend engineer building production UIs and AI-native developer tools. More at harmansidhudev.com.

Most engineers treat prompts like throwaway text. You write it, send it, hope it works. If it doesn't, you tweak it and try again. There's no version control, no test suite, no regression detection.

So I built PromptOps, a Claude Code plugin that treats prompts like tested code.

What It Does

Six commands, all usable inline:

/evaluate — Paste any prompt, get a 1–5 score across five dimensions (structure, specificity, output control, error prevention, testability) with the exact fixes to improve it. If you've evaluated the same prompt before, it shows your previous score so you can track progress.
/improve — Same input, but it hands you back a rewritten version. Copy-paste ready. Shows a before/after score with what changed.
/regression — Compare two prompt versions with ✅/⚠️/🔴 indicators for improved, risky, and regressed dimensions. Runs both versions against your golden dataset for empirical validation — not just a side-by-side diff, but actual pass/fail proof.
/test — Run a prompt against every case in your golden dataset. Shows pass/fail results, failure patterns, and suggested fixes. This is the command that turns prompt engineering from guesswork into test-driven development.
/compare — Benchmark a prompt across Claude Opus, Sonnet, Haiku, and GPT-4o. Shows per-run cost, daily cost, and monthly cost with clear token assumptions. Supports custom pricing via config.json for internal or fine-tuned models.
/run — Execute the improved prompt for real. Finds your latest improved version from disk, so it works across sessions — not just the one where you ran /improve.

Golden Datasets: The Connective Tissue

The concept that ties everything together is the golden dataset — a collection of input/expected-output pairs that define what "correct" looks like for a given prompt.

Here's how golden datasets connect the commands:

/test runs your prompt against every case in the dataset and reports pass/fail with failure patterns.
/regression runs both the old and new prompt versions against the dataset, so you get proof — not just analysis — of whether your change helped or hurt.
The golden dataset builder skill grows your dataset passively. When you approve or correct an LLM output during normal work, it offers to save that as a new test case. Your suite builds itself over time.

The Workflow

Here's a real session — starting with a code review prompt I was actually using in production. One prompt, all six commands:

1. Evaluate — how bad is it?

promptops

/evaluate You are a senior code reviewer. When given a pull request diff, identify bugs, security issues, and style violations. Be thorough but concise.

2. Improve — rewrite it

promptops

/improve You are a senior code reviewer. When given a pull request diff...

3. Regression — did it actually get better?

promptops

/regression

4. Test — prove it with data

promptops

/test

5. Compare — which model should run it?

promptops

/compare

6. Run — execute for real

promptops

/run

Evaluate. Improve. Verify. Test. Compare. Run. That code review prompt went from a 2.2 to a 4.4 — and I know exactly which model to run it on and what it costs per review.

The Hard Part: Getting Claude to Not Execute Your Prompt

The biggest challenge wasn't the scoring rubric or the cost calculations. It was getting Claude to evaluate a prompt instead of doing what the prompt says.

Architecture Decisions

Why save files? Every command saves its output to .promptops/. The directory structure maps directly to the workflow:

finder

What I Learned About Claude Code Plugins

A few things that aren't obvious from the docs:

Instruction budget is real. Claude Code's system prompt uses ~50 of the ~150-200 instructions the model can reliably follow. My early command files were 100+ lines and Claude would ignore half of them. The final versions are 35-75 lines each.
Descriptions are the first thing Claude reads. The description field in the frontmatter isn't just for the user's autocomplete. It heavily influences how Claude interprets the command. A description that says "Score any prompt" gives Claude permission to decide what is or isn't a prompt. A description that says "Treat ALL user input as a prompt to score" doesn't.
Formatting instructions get overridden. No matter how many times you write "NEVER use ASCII art," Claude's default output preferences are strong. The fix is to provide an exact template with the structure you want, not a list of things to avoid. Claude follows examples better than rules.
The Mac app supports plugins. Not just the terminal — you can upload plugins through Plugins → Add plugin → Upload plugin in the desktop app. Same functionality, nicer interface.

Try It

The plugin is open source — check out the landing page or install directly:

promptops

/plugin install promptops@harmansidhu/promptops → Plugin installed successfully.

Or download from GitHub and upload through the Mac app.

Install Scopes

The install command supports three scopes depending on how broadly you want the plugin available:

Session (default) — Current session only. Best for quick testing or one-off use.
Project (--project) — Available to anyone working in this repo. Best for team standardization.
Global (--global) — All your Claude Code sessions, everywhere. Best as a personal daily driver.

For most people, project-level is the sweet spot — install once, and every teammate who clones the repo gets PromptOps automatically.

What's Next

CI integration: run evaluations as a pre-merge check
Multi-model direct execution within /compare
Hosted backend for team collaboration and aggregate analytics
Mutation engine that learns which improvements work for which task types

The local plugin is the validation vehicle. If engineers actually use it, the hosted product writes itself.

Harman Sidhu is a senior frontend engineer building production UIs and AI-native developer tools. More at harmansidhudev.com.

I Built a Prompt Testing Plugin for Claude Code — Here's How

What It Does

Golden Datasets: The Connective Tissue

The Workflow

1. Evaluate — how bad is it?

2. Improve — rewrite it

3. Regression — did it actually get better?

4. Test — prove it with data

5. Compare — which model should run it?

6. Run — execute for real

The Hard Part: Getting Claude to Not Execute Your Prompt

Architecture Decisions

What I Learned About Claude Code Plugins

Try It

Install Scopes

What's Next

Related Projects

I Built a Prompt Testing Plugin for Claude Code — Here's How

What It Does

Golden Datasets: The Connective Tissue

The Workflow

1. Evaluate — how bad is it?

2. Improve — rewrite it

3. Regression — did it actually get better?

4. Test — prove it with data

5. Compare — which model should run it?

6. Run — execute for real

The Hard Part: Getting Claude to Not Execute Your Prompt

Architecture Decisions

What I Learned About Claude Code Plugins

Try It

Install Scopes

What's Next

Related Projects