So I built a strict evaluation system that I now use to score and improve my agents.
Sharing it here in case it helps someone, and also because I’d love feedback from people who build agents/prompts regularly on what to add or remove.
I evaluate two things:
- Sections (the actual agent instructions)
I check for:
• Goal clarity – does the agent know its mission?
• Workflow – step-by-step structure
• Business context – does the agent have all the business info it needs?
• Tool usage – does the agent know when/how to trigger tools?
• Error handling – fallback responses defined?
• Edge cases – unexpected scenarios covered?
- Connected Tools
I check whether:
• tools are configured properly
• tools match real business needs
• tools are referenced in the actual instructions
• tool descriptions are explicit (what each tool does and when to use it)
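
If it helps, here's a rough sketch of how the checklist above could be encoded as a rubric in Python. The criterion names and structure are purely illustrative (not tied to any framework or to my exact setup):

```python
from dataclasses import dataclass, field

# Criterion names mirror the checklist above (illustrative only).
SECTION_CRITERIA = [
    "goal_clarity",      # does the agent know its mission?
    "workflow",          # step-by-step structure
    "business_context",  # does the agent have the business info it needs?
    "tool_usage",        # does the agent know when/how to trigger tools?
    "error_handling",    # fallback responses defined?
    "edge_cases",        # unexpected scenarios covered?
]

TOOL_CRITERIA = [
    "configured_properly",        # tool is wired up correctly
    "matches_business_need",      # tool maps to a real business need
    "referenced_in_instructions", # instructions actually mention the tool
    "explicit_description",       # what the tool does and when to use it
]

@dataclass
class Evaluation:
    """Scores for one agent, keyed by criterion, each on the 1-10 scale."""
    section_scores: dict = field(default_factory=dict)
    tool_scores: dict = field(default_factory=dict)

    def overall(self) -> float:
        """Unweighted average across all scored criteria."""
        scores = list(self.section_scores.values()) + list(self.tool_scores.values())
        return sum(scores) / len(scores) if scores else 0.0
```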
Scoring (strict)
I use a 1–10 scale but I’m harsh with it:
• 9–10: exceptional, rare
• 7–8: good
• 5–6: functional but needs work (most agents)
• 3–4: critical issues
• 1–2: needs a full rebuild
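
And the bands map to code trivially (again, just a sketch of the scale above):

```python
def band(score: float) -> str:
    """Map a 1-10 score to the verdict bands above."""
    if score >= 9:
        return "exceptional, rare"
    if score >= 7:
        return "good"
    if score >= 5:
        return "functional but needs work"
    if score >= 3:
        return "critical issues"
    return "needs a full rebuild"
```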
Right now I can only trust about 50–60% of the reviews this evaluation agent produces. I need help improving/refactoring it.