AI Judge Scoring

Each test run is automatically scored by AI Judge on five dimensions.

Dimension     What it measures
Accuracy      Factual correctness of the response
Helpfulness   How useful and actionable the response is
Relevance     How well the response addresses the prompt
Coherence     Logical flow and readability
Safety        Absence of harmful or inappropriate content

Each dimension is scored 0–10 with color-coded indicators and written feedback. Scores use the same color scale as quality scores: green (8–10), orange (5–7), red (0–4).
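The color bands above can be sketched as a small lookup. This is only an illustration, not the product's implementation: the function name score_color is hypothetical, and it assumes scores are whole numbers, since the bands listed leave no room for values such as 7.5.

```python
def score_color(score: int) -> str:
    """Map a 0-10 dimension score to its indicator color.

    Hypothetical helper: assumes integer scores, following the
    documented bands green (8-10), orange (5-7), red (0-4).
    """
    # Reject booleans explicitly, since bool is a subclass of int in Python.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("score must be an integer")
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    if score >= 8:
        return "green"
    if score >= 5:
        return "orange"
    return "red"
```

For example, under these assumptions a Coherence score of 7 maps to orange, and a Safety score of 10 maps to green.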

Click any completed cell in the grid to open its detailed scorecard, which shows the score and written feedback for each dimension. Use these scores to compare model performance across your prompts on a consistent scale.
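One way to compare models is to average each dimension across prompts. The sketch below assumes you have copied scorecard values out by hand into (model, scores) pairs; the function name average_by_model, the lowercase dimension keys, the model names, and the sample numbers are all hypothetical and do not reflect an export format or API of the product.

```python
from collections import defaultdict
from statistics import mean

# The five dimensions scored by AI Judge (lowercase keys are an assumption).
DIMENSIONS = ("accuracy", "helpfulness", "relevance", "coherence", "safety")


def average_by_model(scorecards):
    """Average each dimension per model.

    scorecards: iterable of (model_name, {dimension: score}) pairs,
    one pair per completed grid cell.
    """
    collected = defaultdict(lambda: defaultdict(list))
    for model, scores in scorecards:
        for dimension in DIMENSIONS:
            collected[model][dimension].append(scores[dimension])
    return {
        model: {dimension: mean(values) for dimension, values in per_dim.items()}
        for model, per_dim in collected.items()
    }


# Hypothetical sample data: two prompts for model-a, one for model-b.
sample = [
    ("model-a", {"accuracy": 8, "helpfulness": 7, "relevance": 9, "coherence": 8, "safety": 10}),
    ("model-a", {"accuracy": 6, "helpfulness": 7, "relevance": 7, "coherence": 8, "safety": 10}),
    ("model-b", {"accuracy": 9, "helpfulness": 8, "relevance": 9, "coherence": 9, "safety": 10}),
]
averages = average_by_model(sample)
```

Averaging per dimension rather than into a single total keeps weaknesses visible: a model with strong Coherence but weak Accuracy would otherwise look the same as one that is mediocre on both.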