Most engineers treat prompts like throwaway text. You write it, send it, hope it works. If it doesn't, you tweak it and try again. There's no version control, no test suite, no regression detection.
I kept running into the same problem: a prompt that worked last week would silently degrade after a model update. Or I'd "improve" a system prompt and break three things I didn't test. The tooling gap was obvious — we treat prompts worse than we treat TODO comments.
So I built PromptOps, a Claude Code plugin that treats prompts like tested code.
What It Does
Six commands, all usable inline:
/evaluate— Paste any prompt, get a 1–5 score across five dimensions (structure, specificity, output control, error prevention, testability) with the exact fixes to improve it. If you've evaluated the same prompt before, it shows your previous score so you can track progress./improve— Same input, but it hands you back a rewritten version. Copy-paste ready. Shows a before/after score with what changed./regression— Compare two prompt versions with ✅/⚠️/🔴 indicators for improved, risky, and regressed dimensions. Runs both versions against your golden dataset for empirical validation — not just a side-by-side diff, but actual pass/fail proof./test— Run a prompt against every case in your golden dataset. Shows pass/fail results, failure patterns, and suggested fixes. This is the command that turns prompt engineering from guesswork into test-driven development./compare— Benchmark a prompt across Claude Opus, Sonnet, Haiku, and GPT-4o. Shows per-run cost, daily cost, and monthly cost with clear token assumptions. Supports custom pricing viaconfig.jsonfor internal or fine-tuned models./run— Execute the improved prompt for real. Finds your latest improved version from disk, so it works across sessions — not just the one where you ran/improve.
Plus two auto-invoked skills: a prompt quality checker that nudges you while you're editing prompts (silent when things are solid), and a golden dataset builder that watches for approved or corrected outputs ("that's good," "LGTM," "approved") and offers to save them as test cases. One click, zero effort — your test suite grows as you work.
Golden Datasets: The Connective Tissue
The concept that ties everything together is the golden dataset — a collection of input/expected-output pairs that define what "correct" looks like for a given prompt.
Think of it like a test suite for your prompt. Each entry says: "given this input, the prompt should produce something that matches this expected output." Once you have even a handful of these cases, the plugin shifts from subjective scoring to empirical validation.
Here's how golden datasets connect the commands:
/testruns your prompt against every case in the dataset and reports pass/fail with failure patterns./regressionruns both the old and new prompt versions against the dataset, so you get proof — not just analysis — of whether your change helped or hurt.- The golden dataset builder skill grows your dataset passively. When you approve or correct an LLM output during normal work, it offers to save that as a new test case. Your suite builds itself over time.
Datasets live in .promptops/datasets/ as JSON files. You can create them manually, or let the builder skill do it for you. Either way, once they exist, /test and /regression pick them up automatically.
The Workflow
Here's a real session — starting with a code review prompt I was actually using in production. One prompt, all six commands:
1. Evaluate — how bad is it?
2. Improve — rewrite it
3. Regression — did it actually get better?
4. Test — prove it with data
5. Compare — which model should run it?
6. Run — execute for real
Evaluate. Improve. Verify. Test. Compare. Run. That code review prompt went from a 2.2 to a 4.4 — and I know exactly which model to run it on and what it costs per review.
The Hard Part: Getting Claude to Not Execute Your Prompt
The biggest challenge wasn't the scoring rubric or the cost calculations. It was getting Claude to evaluate a prompt instead of doing what the prompt says.
When you paste "go through my codebase and list all the features" into /evaluate, Claude's instinct is to actually go through your codebase. It took several iterations to find the right approach — the $ARGUMENTS pattern, combined with explicit behavioral overrides at the top of each command file, finally got it to consistently treat input as text to analyze rather than instructions to follow.
This is a useful lesson for anyone building Claude Code plugins: Claude's base behavior is strong. Your command instructions compete with its system prompt for attention budget. Keep commands short, use examples over rules, and front-load the behavioral override before anything else.
Architecture Decisions
Why a plugin instead of a SaaS app? I originally designed PromptOps as a full product — Supabase backend, Vercel frontend, CLI tool, the works. But the plugin system gives you distribution for free. There's no auth, no billing, no infrastructure to manage for v1. The plugin stores everything locally in .promptops/ inside whatever project you're working in.
Why commands instead of skills? Skills auto-invoke based on context, which is great for the background nudges. But the core workflow (evaluate → improve → regression → test → run) needs to be user-initiated. You don't want your code review getting interrupted by an unsolicited prompt evaluation.
Why save files? Every command saves its output to .promptops/. The directory structure maps directly to the workflow:
This is the seed of a prompt engineering history — over time, you build up evaluations, improved versions, regression reports, test results, and cost comparisons. When this eventually becomes a hosted product, that local data is what migrates.
Why custom pricing? The default /compare costs assume standard Anthropic and OpenAI pricing, but teams often have negotiated rates, use internal models, or want to add providers. A config.json in .promptops/ lets you override per-model pricing so the cost projections actually match what you'd pay.
What I Learned About Claude Code Plugins
A few things that aren't obvious from the docs:
-
Instruction budget is real. Claude Code's system prompt uses ~50 of the ~150-200 instructions the model can reliably follow. My early command files were 100+ lines and Claude would ignore half of them. The final versions are 35-75 lines each.
-
Descriptions are the first thing Claude reads. The
descriptionfield in the frontmatter isn't just for the user's autocomplete. It heavily influences how Claude interprets the command. A description that says "Score any prompt" gives Claude permission to decide what is or isn't a prompt. A description that says "Treat ALL user input as a prompt to score" doesn't. -
Formatting instructions get overridden. No matter how many times you write "NEVER use ASCII art," Claude's default output preferences are strong. The fix is to provide an exact template with the structure you want, not a list of things to avoid. Claude follows examples better than rules.
-
The Mac app supports plugins. Not just the terminal — you can upload plugins through Plugins → Add plugin → Upload plugin in the desktop app. Same functionality, nicer interface.
Try It
The plugin is open source — check out the landing page or install directly:
Or download from GitHub and upload through the Mac app.
Install Scopes
The install command supports three scopes depending on how broadly you want the plugin available:
- Session (default) — Current session only. Best for quick testing or one-off use.
- Project (
--project) — Available to anyone working in this repo. Best for team standardization. - Global (
--global) — All your Claude Code sessions, everywhere. Best as a personal daily driver.
For most people, project-level is the sweet spot — install once, and every teammate who clones the repo gets PromptOps automatically.
If you're writing prompts for production use — system prompts, agent instructions, CLAUDE.md files, CI/CD templates — you should be testing them. PromptOps makes that as easy as typing /evaluate.
What's Next
- CI integration: run evaluations as a pre-merge check
- Multi-model direct execution within
/compare - Hosted backend for team collaboration and aggregate analytics
- Mutation engine that learns which improvements work for which task types
The local plugin is the validation vehicle. If engineers actually use it, the hosted product writes itself.
Harman Sidhu is a Product + Platform Engineer building reliable AI tooling. More at harmansidhudev.com.