LLM Evaluation Dashboard

General Results

Select a Large Language Model (LLM) and a specific Prompt to load that configuration's evaluation metrics. This section is the starting point for analyzing the performance of a single model-prompt configuration.
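A minimal sketch of the lookup behind this view, assuming evaluation results are stored keyed by (model, prompt) pairs; the model names, prompt labels, and scores below are illustrative placeholders, not real results:

```python
# Hypothetical results store: (model, prompt) -> per-configuration metrics.
# All names and numbers here are placeholders for illustration.
RESULTS = {
    ("model-a", "zero-shot"): {"f1": 0.74, "precision": 0.77, "recall": 0.71},
    ("model-a", "few-shot"):  {"f1": 0.81, "precision": 0.84, "recall": 0.78},
    ("model-b", "few-shot"):  {"f1": 0.88, "precision": 0.90, "recall": 0.86},
}

def load_metrics(model: str, prompt: str) -> dict:
    """Return the evaluation metrics for one model/prompt configuration."""
    try:
        return RESULTS[(model, prompt)]
    except KeyError:
        raise ValueError(f"no evaluation found for model={model!r}, prompt={prompt!r}")
```

Keying on the (model, prompt) pair keeps each configuration's metrics independent, so selecting a different model or prompt in the dashboard is a single dictionary lookup.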

Prompts Heatmap

This section visualizes the performance of all models across all prompts at once as a heatmap. Select a metric (e.g., F1 Score) to generate a color-coded matrix in which each cell shows that metric for one model-prompt pair, making the best-performing combinations easy to spot.
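The matrix behind the heatmap can be sketched as follows, again assuming results keyed by (model, prompt) pairs with placeholder names and scores; rendering the colors is left to any plotting library (e.g., matplotlib's `imshow`):

```python
# Hypothetical results store; names and scores are illustrative placeholders.
RESULTS = {
    ("model-a", "zero-shot"): {"f1": 0.74},
    ("model-a", "few-shot"):  {"f1": 0.81},
    ("model-b", "zero-shot"): {"f1": 0.79},
    ("model-b", "few-shot"):  {"f1": 0.88},
}

def build_matrix(results, metric):
    """Arrange one metric into a models x prompts matrix for heatmap rendering."""
    models = sorted({m for m, _ in results})
    prompts = sorted({p for _, p in results})
    # Missing combinations become None, leaving an empty cell in the heatmap.
    matrix = [[results.get((m, p), {}).get(metric) for p in prompts] for m in models]
    return models, prompts, matrix

def best_combination(results, metric):
    """Return the (model, prompt) pair with the highest value for the metric."""
    return max(results, key=lambda key: results[key][metric])
```

The matrix rows follow the `models` list and columns follow the `prompts` list, so the same ordering can label the heatmap axes, while `best_combination` identifies the top-scoring cell directly.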