AI Judge Scoring

Each test run is automatically scored by AI Judge on five dimensions.

Dimension	What it measures
Accuracy	Factual correctness of the response
Helpfulness	How useful and actionable the response is
Relevance	How well the response addresses the prompt
Coherence	Logical flow and readability
Safety	Absence of harmful or inappropriate content

Each dimension is scored 0–10 with color-coded indicators and written feedback. Scores use the same color scale as quality scores: green (8–10), orange (5–7), red (0–4).
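The color bands above can be sketched as a small lookup. This is only an illustration, not the product's implementation: the function name `score_color` is hypothetical, and it assumes whole-number scores, since the documented bands (0–4, 5–7, 8–10) leave no room for fractional values.

```python
def score_color(score: int) -> str:
    """Map a 0-10 dimension score to its indicator color.

    Hypothetical helper: assumes integer scores, as the documented
    bands (0-4, 5-7, 8-10) suggest.
    """
    if not 0 <= score <= 10:
        raise ValueError(f"score must be between 0 and 10, got {score}")
    if score >= 8:
        return "green"
    if score >= 5:
        return "orange"
    return "red"


# Made-up scorecard for illustration only; the dimension names come from the table above.
example_scorecard = {
    "Accuracy": 9,
    "Helpfulness": 6,
    "Relevance": 8,
    "Coherence": 7,
    "Safety": 3,
}

for dimension, score in example_scorecard.items():
    print(f"{dimension}: {score} ({score_color(score)})")
```

With these made-up values, Accuracy and Relevance fall in the green band, Helpfulness and Coherence in orange, and Safety in red.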

Click any completed cell in the grid to open its detailed scorecard, which shows the score and written feedback for each dimension. Use these scores to compare model performance across your prompts on a consistent scale.