How do you evaluate the quality of your prompts/agents? Here’s the strict framework I’m using

I’ve been building a lot of business-specific AI agents recently, and I realized I needed a consistent way to evaluate whether a prompt/agent is actually good, not just “sounds okay”.

So I built a strict evaluation system that I now use to score and improve my agents.
Sharing it here in case it helps someone, and also because I’d love feedback from people who build agents/prompts regularly on what to add or remove.

I evaluate two things:

  1. Sections (the actual agent instructions)

I check for:
• Goal clarity – does the agent know its mission?
• Workflow – step-by-step structure
• Business context – is the info complete?
• Tool usage – does the agent know when/how to trigger tools?
• Error handling – fallback responses defined?
• Edge cases – unexpected scenarios covered?
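One way to keep this check from drifting into gut feel is to store the checklist as plain data so each criterion gets its own score. A minimal sketch (the keys are just illustrative labels, not a fixed standard):

```python
# Minimal sketch: the section checklist as data, so each criterion
# can be scored 1-10 on its own instead of one overall gut-feel number.
SECTION_CRITERIA = {
    "goal_clarity": "Does the agent know its mission?",
    "workflow": "Is there a step-by-step structure?",
    "business_context": "Is the business info complete?",
    "tool_usage": "Does the agent know when/how to trigger tools?",
    "error_handling": "Are fallback responses defined?",
    "edge_cases": "Are unexpected scenarios covered?",
}
```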

  2. Connected Tools

I check whether:
• tools are configured properly
• tools match real business needs
• tools are referenced in the actual instructions
• tool descriptions are explicit (what each tool does and when to use it)
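The tool checks fit the same pattern; again just a sketch with illustrative keys:

```python
# Same idea for the connected-tools checks (keys are illustrative).
TOOL_CRITERIA = {
    "configured": "Is the tool configured properly?",
    "business_fit": "Does the tool match a real business need?",
    "referenced": "Is the tool referenced in the actual instructions?",
    "described": "Is it explicit what the tool does and when to use it?",
}
```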

Scoring (strict)

I use a 1–10 scale but I’m harsh with it:
• 9–10: exceptional, rare
• 7–8: good
• 5–6: functional but needs work (most agents)
• 3–4: critical issues
• 1–2: needs a full rebuild
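To apply the bands consistently, a tiny helper like this (again, only a sketch) keeps the mapping from score to label fixed instead of re-deciding it every review:

```python
def score_band(score: int) -> str:
    """Map a strict 1-10 score to its band label."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    if score >= 9:
        return "exceptional (rare)"
    if score >= 7:
        return "good"
    if score >= 5:
        return "functional but needs work"
    if score >= 3:
        return "critical issues"
    return "needs a full rebuild"
```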

Right now only about 50-60% of the reviews this evaluation agent produces are actually usable. I need help improving/refactoring it.
