
🎍 Model Evaluation Report

This section summarizes the results from a recent evaluation of different fine-tuned models. Key metrics include Perplexity, Hallucination Score, Relevance Score, and Accuracy.

Empirical Model Evaluation Report

1. Introduction

The purpose of this report is to present an empirical evaluation of several fine-tuned language models using a set of standardized metrics that assess different dimensions of model quality. These metrics include:

  • Perplexity – Fluency and language modeling capacity

  • Hallucination Score – Factual consistency

  • Relevance Score – Contextual alignment with the prompt

  • Accuracy – Alignment with ground truth or expected output

The evaluation aims to determine which model offers the best trade-off across these metrics and is thus most suitable for downstream deployment in real-world applications.


2. Evaluation Metrics and Methodology

2.1 Perplexity (Language Fluency)

  • Definition: Perplexity measures how confidently the model predicts the next token in a sequence. It is derived from the probability distribution assigned to the predicted tokens and is commonly computed as the exponential of the average negative log-likelihood (a minimal computation sketch follows the thresholds below).

  • Interpretation: Lower perplexity values indicate that the model is better at predicting text that is linguistically coherent and statistically likely.

  • Thresholds:

    • Excellent: < 1.4

    • Acceptable: 1.4 – 1.8

    • Weak: > 1.8
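
As a minimal sketch of how perplexity can be computed, assuming access to the model's per-token log-probabilities (the token_logprobs values below are hypothetical):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(average negative log-likelihood) over the predicted tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token natural-log probabilities for a short generation.
token_logprobs = [-0.21, -0.35, -0.05, -0.48, -0.12]
print(round(perplexity(token_logprobs), 2))  # ~1.27 -> "Excellent" (< 1.4)
```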


2.2 Hallucination Score (Factual Consistency)

  • Definition: This score evaluates how often a model generates information that is not grounded in the input context or real-world knowledge. It is critical for tasks that demand factual accuracy, such as question answering or summarization (an illustrative sketch follows the thresholds below).

  • Interpretation: Lower values are preferred, as they indicate fewer fabricated facts or unsupported claims.

  • Thresholds:

    • Low hallucination: < 0.5

    • Moderate: 0.5 – 0.6

    • High: > 0.6
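
The report does not state exactly how the hallucination score is produced; one common approximation, sketched below under that assumption, is the fraction of generated sentences that are not supported by the source context. The is_supported callback is a hypothetical stand-in for a grounding check such as an NLI entailment model:

```python
def hallucination_score(generated_sentences, is_supported):
    """Fraction of generated sentences judged unsupported by the source context.

    is_supported(sentence) -> bool is a hypothetical callback, e.g. an NLI
    entailment check of the sentence against the prompt or retrieved context.
    """
    if not generated_sentences:
        return 0.0
    unsupported = sum(1 for s in generated_sentences if not is_supported(s))
    return unsupported / len(generated_sentences)

# Toy usage: 1 of 3 sentences is unsupported -> score ~0.33 (below the 0.5 "low" threshold).
sentences = [
    "Paris is the capital of France.",
    "It hosted the 2024 Summer Olympics.",
    "It has a population of 20 million.",
]
print(round(hallucination_score(sentences, lambda s: "20 million" not in s), 2))
```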


2.3 Relevance Score (Contextual Alignment)

  • Definition: This metric assesses the semantic closeness of the model’s response to the original prompt. It is often computed using embedding-based similarity, e.g. cosine similarity between sentence embeddings (see the sketch after the thresholds below).

  • Interpretation: Higher scores indicate that the model better understands and stays on topic.

  • Thresholds:

    • Strong alignment: > 0.7

    • Moderate: 0.5 – 0.7

    • Weak: < 0.5
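
A hedged sketch of embedding-based relevance scoring, assuming the sentence-transformers library and a general-purpose embedding model (all-MiniLM-L6-v2 here is an illustrative choice, not necessarily the scorer used for this report):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Illustrative embedding model; the report's actual scorer may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

prompt = "Summarize the warranty terms for the customer."
response = "The warranty covers parts and labor for 24 months from the date of purchase."

# Cosine similarity between the prompt and response embeddings, in [-1, 1].
embeddings = model.encode([prompt, response], convert_to_tensor=True)
relevance = util.cos_sim(embeddings[0], embeddings[1]).item()
print(round(relevance, 3))  # > 0.7 counts as strong alignment per the thresholds above
```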


2.4 Accuracy (Prediction Correctness)

  • Definition: Accuracy quantifies how often the model’s output matches a predefined correct label or reference answer (a minimal exact-match sketch follows this list).

  • Use Case: Particularly relevant for classification, extraction, and fact-based generation tasks.

  • Interpretation:

    • High: > 70%

    • Acceptable: 60–70%

    • Below Standard: < 60%
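
A minimal sketch of exact-match accuracy; the comparison rule (case-insensitive exact match) is an assumption, and the report's evaluation may use a more lenient matching scheme:

```python
def accuracy(predictions, references):
    """Share of predictions that exactly match their reference (case-insensitive)."""
    matches = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return matches / len(references)

# Toy usage: 2 of 3 answers match -> 66.67%, i.e. "Acceptable" per the thresholds above.
predictions = ["42", "Berlin", "blue"]
references = ["42", "berlin", "green"]
print(f"{accuracy(predictions, references):.2%}")  # 66.67%
```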


3. Evaluation Results

| Model | Perplexity | Hallucination Score | Relevance Score | Accuracy |
| --- | --- | --- | --- | --- |
| Model_Finetuning_Llama3.2_1.0 | 1.61 | 0.555 | 0.565 | 60.77% |
| Model_Finetuning_Llama3.2_1.1 | 1.34 | 0.555 | 0.559 | 60.77% |
| Model_Finetuning_Llama3.2_1.2 | 1.33 | 0.543 | 0.733 | 68.62% |
| Model_Finetuning_Llama3.2_2.0 | 1.40 | 0.565 | 0.714 | 73.81% |
| Model_Finetuning_Llama3.2_2.1 | 1.81 | 0.546 | 0.548 | 55.75% |
| Model_Finetuning_Llama3.2_2.2 | 1.44 | 0.559 | 0.554 | 60.24% |
| Model_Finetuning_RENEW_V1 | 1.47 | 0.566 | 0.720 | 72.45% |
| Model_Finetuning_RENEW_V2 | 1.33 | 0.539 | 0.740 | 67.11% |
| Model_Finetuning_RENEW_V3 | 1.48 | 0.566 | 0.720 | 72.65% |


Visualization charts

  • Line chart of Perplexity by model
  • Line chart of Hallucination Score by model
  • Line chart of Relevance Score by model
  • Line chart of Accuracy (%) by model

4. Detailed Metric-by-Metric Analysis

4.1 Perplexity (Language Fluency)

Perplexity represents how confidently a model can predict the next token in a sequence. Lower values imply higher fluency and more coherent language output.

  • The lowest perplexity was observed in both Model_Finetuning_Llama3.2_1.2 and Model_Finetuning_RENEW_V2, each scoring 1.33, indicating strong fluency and well-learned language structure.

  • Model_Finetuning_Llama3.2_2.1 had the highest perplexity at 1.81, showing weak language modeling and less reliable generation.

  • All three RENEW variants maintain perplexity values between 1.33 and 1.48, suggesting consistently fluent output across the board.


4.2 Hallucination Score (Factual Consistency)

This score assesses the model’s ability to avoid fabricating facts or unsupported claims. Lower scores indicate better grounding in context.

  • Model_Finetuning_RENEW_V2 achieved the lowest hallucination score, 0.539, meaning it hallucinates less than any other model evaluated.

  • Model_Finetuning_Llama3.2_1.2 also scored well at 0.543, the lowest hallucination score among the LLaMA variants.

  • Several models, including Model_Finetuning_RENEW_V1, Model_Finetuning_RENEW_V3, and Model_Finetuning_Llama3.2_2.0, scored ≥ 0.565, indicating a moderate hallucination risk.


4.3 Relevance Score (Contextual Alignment)

This metric quantifies how well the model's output aligns semantically with the given prompt or input.

  • Model_Finetuning_RENEW_V2 leads with a relevance score of 0.740, demonstrating strong understanding of and adherence to the task prompt.

  • The other RENEW versions (_V1 and _V3) also perform well, both scoring 0.720.

  • Among the LLaMA models, Llama3.2_1.2 achieved 0.733, the strongest relevance of any LLaMA-based variant.

  • The weakest model in this aspect was Llama3.2_2.1, scoring only 0.548, indicating a tendency to stray from prompt expectations.


4.4 Accuracy (Prediction Correctness)

Accuracy measures how often the model produces outputs that match the expected or ground-truth values.

  • Model_Finetuning_Llama3.2_2.0 recorded the highest accuracy at 73.81%, followed closely by Model_Finetuning_RENEW_V3 (72.65%) and Model_Finetuning_RENEW_V1 (72.45%).

  • Model_Finetuning_RENEW_V2 reached 67.11%, within the acceptable range and not far behind the leaders.

  • Model_Finetuning_Llama3.2_2.1 was the lowest-performing model on this metric at 55.75%, falling below the 60% standard.

  • The remaining LLaMA variants ranged between roughly 60% and 69%, indicating moderate reliability.


5. Final Recommendation

Model_Finetuning_RENEW_V2 emerges as the most balanced model across the evaluation dimensions, leading on three of the four metrics:

  • Fluency: Perplexity of 1.33, tied for the lowest observed

  • Factual Reliability: Lowest hallucination score (0.539)

  • Contextual Alignment: Highest relevance score (0.740)

  • Prediction Accuracy: 67.11%, within the acceptable range and behind only the top three models (72.45–73.81%)

Given this balance of strengths, Model_Finetuning_RENEW_V2 is recommended as the default production model, particularly in use cases where:

  • High-quality, natural language generation is essential

  • Factual accuracy and hallucination minimization are mission-critical

  • The model must understand and adhere to diverse and complex prompts

  • Reliable correctness impacts business or user-facing outcomes
