https://preview.redd.it/4x5vte5n5a3g1.png?width=1536&format=png&auto=webp&s=9c0c35544c51d6dbd78a3c27b7cc271cc11cacae
I keep seeing prompts treated as “magic strings” that people edit in production with no safety net. That works until you have multiple teams and hundreds of flows.
I am trying a simple “prompt as code” model:
- Prompts are versioned in Git.
- Every change passes three gates before it reaches users.
- Heavy tests double as production monitoring for drift in model behavior and cost.
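To make "prompt as code" concrete, here is roughly what a versioned prompt artifact can look like in the repo. It's a minimal sketch: the file layout, the PromptSpec fields, and the example values are made up for illustration, not any particular framework.

```python
# Hypothetical repo layout: prompts/support_reply/v3.yaml checked into Git next to the code.
# A tiny loader turns each file into a typed object that the test gates below can exercise.
from dataclasses import dataclass, field
from string import Formatter


@dataclass
class PromptSpec:
    name: str
    version: str
    template: str                        # e.g. "Summarize {ticket_text} for {audience}."
    required_vars: list[str] = field(default_factory=list)
    output_format: str = "json"          # the contract downstream components parse

    def declared_vars(self) -> set[str]:
        # Variables actually referenced inside the template text.
        return {fname for _, fname, _, _ in Formatter().parse(self.template) if fname}

    def render(self, **variables: str) -> str:
        missing = [v for v in self.required_vars if v not in variables]
        if missing:
            raise ValueError(f"missing variables: {missing}")
        return self.template.format(**variables)


# Example instance, as it might be deserialized from prompts/support_reply/v3.yaml:
SUPPORT_REPLY_V3 = PromptSpec(
    name="support_reply",
    version="v3",
    template="Summarize {ticket_text} for {audience}. Reply as JSON.",
    required_vars=["ticket_text", "audience"],
)
```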
Three gates
- Smoke tests (DEV) (sketch 1 below)
  - Validate syntax, variables, and output format.
  - Tiny set of rule-based checks only.
  - Fast enough to run on every PR, so people can experiment freely without breaking the system.
- Light tests (STAGING) (sketch 2 below)
  - 20 to 50 curated examples per prompt.
  - Designed for behavior and performance:
    - Do we still respect the contracts other components rely on?
    - Is behavior stable for typical inputs and simple edge cases?
    - Are latency and token costs within budget?
- Heavy tests (PROD gate + monitoring) (sketch 3 below)
  - 80 to 150 comprehensive cases that cover:
    - Happy paths.
    - Weird inputs, injection attempts, multilingual inputs, multi-turn flows.
    - Safety and compliance scenarios.
  - Must be 100 percent green for a critical prompt to go live.
  - The same suite is re-run regularly in PROD to track drift in model behavior or cost.
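Sketch 1, the smoke gate: pure rule-based checks, no model calls, fast enough for every PR. It assumes the PromptSpec from the sketch above; load_all_specs() is a hypothetical helper that reads every versioned prompt file in the repo.

```python
# Smoke gate (DEV): rule-based checks only, no model calls, runs on every PR.
import pytest

from prompts.loader import load_all_specs  # hypothetical helper built on the PromptSpec sketch

SPECS = load_all_specs()


@pytest.mark.parametrize("spec", SPECS, ids=lambda s: f"{s.name}:{s.version}")
def test_template_and_declared_vars_agree(spec):
    # Every variable used in the template is declared, and nothing declared is unused.
    assert spec.declared_vars() == set(spec.required_vars)


@pytest.mark.parametrize("spec", SPECS, ids=lambda s: f"{s.name}:{s.version}")
def test_template_renders_with_dummy_values(spec):
    # Catches stray braces and typos in variable names before anything hits a model.
    dummy = {v: "placeholder" for v in spec.required_vars}
    assert spec.render(**dummy)


@pytest.mark.parametrize("spec", SPECS, ids=lambda s: f"{s.name}:{s.version}")
def test_output_format_is_supported(spec):
    # The declared output format must be one our downstream parsers handle.
    assert spec.output_format in {"json", "markdown", "plain"}
```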
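Sketch 2, the light gate: the curated STAGING examples with contract, behavior, and budget assertions. call_model(), the JSONL case file, and the budget numbers are placeholders for whatever client and limits you actually run with.

```python
# Light gate (STAGING): 20 to 50 curated examples per prompt against the real model.
import json
import time
from pathlib import Path

import pytest

from my_llm_client import call_model  # hypothetical: returns (response_text, usage_dict)

CASES = [
    json.loads(line)
    for line in Path("tests/light/support_reply.jsonl").read_text().splitlines()
    if line.strip()
]

LATENCY_BUDGET_SECONDS = 3.0   # illustrative numbers, tune per prompt
TOKEN_BUDGET = 800


@pytest.mark.parametrize("case", CASES, ids=lambda c: c["id"])
def test_support_reply_light(case):
    start = time.monotonic()
    text, usage = call_model(prompt_name="support_reply", variables=case["vars"])
    elapsed = time.monotonic() - start

    # Contract other components rely on: valid JSON with the expected keys.
    reply = json.loads(text)
    assert {"answer", "confidence"} <= reply.keys()

    # Behavior check: phrases each curated case says must (or must not) appear.
    for phrase in case.get("must_contain", []):
        assert phrase.lower() in reply["answer"].lower()
    for phrase in case.get("must_not_contain", []):
        assert phrase.lower() not in reply["answer"].lower()

    # Latency and token cost stay within budget.
    assert elapsed <= LATENCY_BUDGET_SECONDS
    assert usage["total_tokens"] <= TOKEN_BUDGET
```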
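Sketch 3, the heavy suite doubling as monitoring: at release time it gates the deploy (100 percent green required for critical prompts), and the same run is scheduled against PROD and compared to the baseline recorded at release. run_heavy_suite(), alert(), the baseline file, and the thresholds are all illustrative.

```python
# Heavy suite (PROD gate + monitoring): re-run the full case set on a schedule and
# compare pass rate and average token cost against the baseline stored at release time.
import json
from datetime import datetime, timezone
from pathlib import Path

from heavy_suite import run_heavy_suite  # hypothetical: returns [{"passed": bool, "tokens": int}, ...]
from alerting import alert               # hypothetical: pages the owning team

BASELINE = json.loads(Path("baselines/support_reply.json").read_text())
BEHAVIOR_TOLERANCE = 0.02   # allowed drop in pass rate before paging
COST_TOLERANCE = 0.15       # allowed rise in avg tokens per case before paging


def monitor_once() -> dict:
    # At release time the same suite gates the deploy: anything below 100 percent blocks it.
    results = run_heavy_suite(prompt_name="support_reply", env="prod")
    pass_rate = sum(r["passed"] for r in results) / len(results)
    avg_tokens = sum(r["tokens"] for r in results) / len(results)

    report = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "pass_rate": pass_rate,
        "avg_tokens": avg_tokens,
    }
    print(json.dumps(report))  # shipped to whatever dashboard you already have

    if pass_rate < BASELINE["pass_rate"] - BEHAVIOR_TOLERANCE:
        alert(f"Behavior drift: pass rate {pass_rate:.2%} vs baseline {BASELINE['pass_rate']:.2%}")
    if avg_tokens > BASELINE["avg_tokens"] * (1 + COST_TOLERANCE):
        alert(f"Cost drift: {avg_tokens:.0f} avg tokens vs baseline {BASELINE['avg_tokens']:.0f}")
    return report


if __name__ == "__main__":
    monitor_once()
```

In a setup like this, CI runs the suite as the release gate and a scheduler simply calls monitor_once() a few times a day in PROD.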
The attached infographic is what I use to explain this flow to non-engineers.
How are you all handling “prompt regression tests” today?
- Do you have a formal pipeline at all?
- Any lessons on keeping test sets maintainable as prompts evolve?
- Has anyone found a nice way to auto-generate or refresh edge cases?
Would love to steal ideas from people further along.