Improved prompt variant shows progress, but effectiveness remains uncertain.
A pattern I've seen on more than one team: the weekly eval run finishes, someone sorts the leaderboard, and the worst-performing prompt variant or model checkpoint gets flagged for attention. Someone makes a change: a tweak to the system prompt, a different few-shot example, sometimes just a rewording o










