LLM Evaluation Dashboard

General Results

Select a Large Language Model (LLM) and a specific Prompt to load that configuration's evaluation metrics. This section is the starting point for analyzing the performance of a single model-prompt configuration.
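A minimal sketch of the lookup behind this view, assuming evaluation results are stored keyed by (model, prompt) pairs; the model names, prompt labels, and scores below are illustrative placeholders, not real results:

```python
# Hypothetical results store: (model, prompt) -> per-configuration metrics.
# All names and numbers here are placeholders for illustration.
RESULTS = {
    ("model-a", "zero-shot"): {"f1": 0.74, "precision": 0.77, "recall": 0.71},
    ("model-a", "few-shot"):  {"f1": 0.81, "precision": 0.84, "recall": 0.78},
    ("model-b", "few-shot"):  {"f1": 0.88, "precision": 0.90, "recall": 0.86},
}

def load_metrics(model: str, prompt: str) -> dict:
    """Return the evaluation metrics for one model/prompt configuration."""
    try:
        return RESULTS[(model, prompt)]
    except KeyError:
        raise ValueError(f"no evaluation found for model={model!r}, prompt={prompt!r}")
```

Keying on the (model, prompt) pair keeps each configuration's metrics independent, so selecting a different model or prompt in the dashboard is a single dictionary lookup.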

Prompts Heatmap

This section visualizes the performance of all models across all prompts at once as a heatmap. Select a metric (e.g., F1 Score) to generate a color-coded matrix in which each cell shows that metric for one model-prompt pair, making the best-performing combinations easy to spot.
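The matrix behind the heatmap can be sketched as follows, again assuming results keyed by (model, prompt) pairs with placeholder names and scores; rendering the colors is left to any plotting library (e.g., matplotlib's `imshow`):

```python
# Hypothetical results store; names and scores are illustrative placeholders.
RESULTS = {
    ("model-a", "zero-shot"): {"f1": 0.74},
    ("model-a", "few-shot"):  {"f1": 0.81},
    ("model-b", "zero-shot"): {"f1": 0.79},
    ("model-b", "few-shot"):  {"f1": 0.88},
}

def build_matrix(results, metric):
    """Arrange one metric into a models x prompts matrix for heatmap rendering."""
    models = sorted({m for m, _ in results})
    prompts = sorted({p for _, p in results})
    # Missing combinations become None, leaving an empty cell in the heatmap.
    matrix = [[results.get((m, p), {}).get(metric) for p in prompts] for m in models]
    return models, prompts, matrix

def best_combination(results, metric):
    """Return the (model, prompt) pair with the highest value for the metric."""
    return max(results, key=lambda key: results[key][metric])
```

The matrix rows follow the `models` list and columns follow the `prompts` list, so the same ordering can label the heatmap axes, while `best_combination` identifies the top-scoring cell directly.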