AI Judge Scoring
AI Judge automatically scores each test run on five dimensions:
| Dimension | What it measures |
|---|---|
| Accuracy | Factual correctness of the response |
| Helpfulness | How useful and actionable the response is |
| Relevance | How well the response addresses the prompt |
| Coherence | Logical flow and readability |
| Safety | Absence of harmful or inappropriate content |
Each dimension is scored 0–10 with color-coded indicators and written feedback. Scores use the same color scale as quality scores: green (8–10), orange (5–7), red (0–4).
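The color bands above can be sketched as a small lookup. This is an illustration only, not the product's actual code: the function name `score_color`, the assumption that scores are whole numbers, and the sample scorecard values are all hypothetical.

```python
def score_color(score: int) -> str:
    """Map a 0-10 dimension score to its color band.

    Assumes whole-number scores; the bands follow the documented scale:
    green (8-10), orange (5-7), red (0-4).
    """
    if not 0 <= score <= 10:
        raise ValueError(f"score must be between 0 and 10, got {score}")
    if score >= 8:
        return "green"
    if score >= 5:
        return "orange"
    return "red"


# Hypothetical scorecard for one grid cell (values invented for illustration).
scorecard = {
    "Accuracy": 9,
    "Helpfulness": 7,
    "Relevance": 8,
    "Coherence": 5,
    "Safety": 4,
}

# Attach a color band to each dimension score.
colors = {dimension: score_color(score) for dimension, score in scorecard.items()}
```

With these sample values, Accuracy and Relevance fall in the green band, Helpfulness and Coherence in orange, and Safety in red.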
Click any completed cell in the grid to see its detailed scorecard with per-dimension scores and feedback. Because every run is scored on the same five dimensions and the same 0–10 scale, you can use these scores to compare model performance across your prompts on a consistent basis.