Prompt as code – A simple 3-gate system for smoke, light, and heavy tests

https://preview.redd.it/4x5vte5n5a3g1.png?width=1536&format=png&auto=webp&s=9c0c35544c51d6dbd78a3c27b7cc271cc11cacae

I keep seeing prompts treated as “magic strings” that people edit in production with no safety net. That works until you have multiple teams and hundreds of flows.

I am trying a simple “prompt as code” model:

  • Prompts are versioned in Git.
  • Every change passes three gates before it reaches users.
  • Heavy tests double as production monitoring for drift in model behavior and cost.
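
For concreteness, here is a minimal sketch of the versioning side. Everything in it is illustrative: the `prompts/<name>/<version>.txt` layout and the helper names are mine, not a standard, and it only needs the stdlib:

```python
from pathlib import Path
from string import Formatter

PROMPTS_DIR = Path("prompts")  # hypothetical layout: prompts/<name>/<version>.txt

def load_prompt(name: str, version: str) -> str:
    """Read a versioned prompt template straight out of the repo."""
    return (PROMPTS_DIR / name / f"{version}.txt").read_text(encoding="utf-8")

def template_variables(template: str) -> set[str]:
    """Extract {placeholder} names so tests can verify every variable is supplied."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}

# Demo with an inline template; in the repo this string would come from load_prompt().
template = "Classify this ticket: {ticket_text}\nCustomer tier: {customer_tier}"
assert template_variables(template) == {"ticket_text", "customer_tier"}
```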

Three gates

  1. Smoke tests (DEV)
    • Validate syntax, variables, and output format.
    • A tiny set of rule-based checks only (see the first sketch after this list).
    • Fast enough to run on every PR, so people can experiment freely without breaking the system.
  2. Light tests (STAGING)
    • 20 to 50 curated examples per prompt (see the second sketch after this list).
    • Designed to check behavior and performance:
      • Do we still respect contracts other components rely on?
      • Is behavior stable for typical inputs and simple edge cases?
      • Are latency and token costs within budget?
  3. Heavy tests (PROD gate + monitoring)
    • 80 to 150 comprehensive cases that cover:
      • Happy paths.
      • Weird inputs, injection attempts, multilingual inputs, and multi-turn flows.
      • Safety and compliance scenarios.
    • Must be 100 percent green for a critical prompt to go live.
    • The same suite is re-run regularly in PROD to track drift in model behavior or cost (see the third sketch after this list).
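
Gate 1 in code: a few rule-based pytest checks that run on every PR without touching a model. `load_prompt` and `template_variables` are the hypothetical helpers from the sketch above, and the expected variable set is made up for illustration:

```python
import json

from prompt_loader import load_prompt, template_variables  # hypothetical module holding the helpers sketched above

EXPECTED_VARS = {"ticket_text", "customer_tier"}  # illustrative contract for one prompt

def test_template_declares_expected_variables():
    template = load_prompt("support_triage", "v3")
    assert template_variables(template) == EXPECTED_VARS

def test_template_renders_cleanly():
    template = load_prompt("support_triage", "v3")
    # Raises KeyError/IndexError on a typo'd or stray placeholder.
    template.format(ticket_text="example", customer_tier="free")

def test_recorded_sample_output_is_valid_json():
    # Format check against a checked-in sample; still no model call at this gate.
    sample = '{"category": "billing", "priority": 2}'
    assert set(json.loads(sample)) == {"category", "priority"}
```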
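
Gate 2 is the same idea with real model calls over the curated cases. `call_model` is a placeholder for whatever client you actually use, the case file path and schema are invented, and the budget numbers are assumptions to tune against your own SLOs:

```python
import json
import time
from pathlib import Path

from prompt_loader import load_prompt  # hypothetical helper from the first sketch

LATENCY_BUDGET_S = 2.0  # assumed budgets; set these from your own SLOs
TOKEN_BUDGET = 800

def call_model(prompt: str) -> tuple[str, int]:
    """Placeholder returning (completion_text, tokens_used); swap in your real client."""
    raise NotImplementedError

def test_light_suite():
    template = load_prompt("support_triage", "v3")
    cases = json.loads(Path("tests/light/support_triage.json").read_text())  # 20 to 50 cases
    for case in cases:
        start = time.monotonic()
        output, tokens = call_model(template.format(**case["inputs"]))
        elapsed = time.monotonic() - start

        parsed = json.loads(output)                   # contract: output must be JSON
        assert parsed["category"] in case["allowed"]  # stable behavior on typical inputs
        assert elapsed <= LATENCY_BUDGET_S            # latency within budget
        assert tokens <= TOKEN_BUDGET                 # token cost within budget
```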
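
And gate 3 doubles as monitoring: the same runner that gates the release is re-run on a schedule in PROD, with the pass rate logged so drift shows up as a trend instead of a surprise. The case format and alert hook here are placeholders:

```python
from datetime import datetime, timezone
from typing import Callable, Iterable

def heavy_suite_pass_rate(cases: Iterable[dict], run_case: Callable[[dict], bool]) -> float:
    """Run every heavy case (happy path, injection, multilingual, multi-turn, safety)."""
    results = [run_case(case) for case in cases]
    return sum(results) / len(results)

def scheduled_drift_check(cases, run_case, alert: Callable[[str], None]) -> dict:
    rate = heavy_suite_pass_rate(cases, run_case)
    record = {"ts": datetime.now(timezone.utc).isoformat(), "pass_rate": rate}
    # Gate rule from above: critical prompts must stay 100 percent green.
    if rate < 1.0:
        alert(f"Heavy suite regression: {record}")
    return record
```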

The attached infographic is what I use to explain this flow to non-engineers.

How are you all handling “prompt regression tests” today?

  • Do you have a formal pipeline at all?
  • Any lessons on keeping test sets maintainable as prompts evolve?
  • Has anyone found a nice way to auto-generate or refresh edge cases?

Would love to steal ideas from people further along.
