dllmforge.rag_evaluation¶
RAGAS Evaluation Module for DLLMForge
This module provides comprehensive evaluation metrics for RAG (Retrieval-Augmented Generation) pipelines using RAGAS-inspired metrics without requiring external dashboards or services.
The module evaluates four key aspects of RAG systems:
1. Context Precision – measures how precisely the retrieved chunks support the reference answer
2. Context Recall – measures the ability to retrieve all necessary information
3. Faithfulness – measures factual accuracy and absence of hallucinations
4. Answer Relevancy – measures how relevant and to-the-point answers are
All evaluations are performed using LLMs to provide human-like assessment without requiring annotated datasets.
Functions
- evaluate_rag_response – Convenience function to evaluate a RAG response.
Classes
- EvaluationResult – Container for evaluation results.
- RAGEvaluationResult – Container for complete RAG evaluation results.
- RAGEvaluator – RAGAS-inspired evaluator for RAG pipelines.
- class dllmforge.rag_evaluation.EvaluationResult(metric_name: str, score: float, explanation: str, details: Dict[str, Any])[source]¶
Container for evaluation results.
- metric_name: str¶
- score: float¶
- explanation: str¶
- details: Dict[str, Any]¶
- __init__(metric_name: str, score: float, explanation: str, details: Dict[str, Any]) None¶
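Since EvaluationResult is a plain container, constructing one is straightforward. A minimal sketch, using a standalone dataclass that mirrors the documented fields rather than importing the real class (the example values are invented):

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class EvaluationResult:
    """Mirror of the documented container: metric name, score, explanation, details."""
    metric_name: str
    score: float
    explanation: str
    details: Dict[str, Any]

# Example: a faithfulness result where 3 of 4 answer statements were supported
result = EvaluationResult(
    metric_name="faithfulness",
    score=0.75,
    explanation="3 of 4 statements in the answer are supported by the context.",
    details={"supported": 3, "total": 4},
)
print(result.score)  # 0.75
```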
- class dllmforge.rag_evaluation.RAGEvaluationResult(context_precision: EvaluationResult, context_recall: EvaluationResult, faithfulness: EvaluationResult, answer_relevancy: EvaluationResult, ragas_score: float, evaluation_time: float, metadata: Dict[str, Any])[source]¶
Container for complete RAG evaluation results.
- context_precision: EvaluationResult¶
- context_recall: EvaluationResult¶
- faithfulness: EvaluationResult¶
- answer_relevancy: EvaluationResult¶
- ragas_score: float¶
- evaluation_time: float¶
- metadata: Dict[str, Any]¶
- __init__(context_precision: EvaluationResult, context_recall: EvaluationResult, faithfulness: EvaluationResult, answer_relevancy: EvaluationResult, ragas_score: float, evaluation_time: float, metadata: Dict[str, Any]) None¶
- class dllmforge.rag_evaluation.RAGEvaluator(llm_provider: str = 'auto', deltares_llm: DeltaresOllamaLLM | None = None, temperature: float = 0.1, api_key: str | None = None, api_base: str | None = None, api_version: str | None = None, deployment_name: str | None = None, model_name: str | None = None)[source]¶
RAGAS-inspired evaluator for RAG pipelines.
This evaluator provides four key metrics:
- Context Precision: measures how precisely the retrieved chunks support the reference answer
- Context Recall: measures the ability to retrieve all necessary information
- Faithfulness: measures factual accuracy and absence of hallucinations
- Answer Relevancy: measures how relevant and to-the-point answers are
Initialize the RAG evaluator.
- Parameters:
llm_provider – LLM provider to use (“openai”, “anthropic”, “deltares”, or “auto”)
deltares_llm – Optional pre-configured DeltaresOllamaLLM instance
temperature – Sampling temperature for the evaluation LLM (default 0.1)
api_key – Optional API key for the LLM provider
api_base – Optional API base URL
api_version – Optional API version
deployment_name – Optional deployment name
model_name – Optional model name to use
- __init__(llm_provider: str = 'auto', deltares_llm: DeltaresOllamaLLM | None = None, temperature: float = 0.1, api_key: str | None = None, api_base: str | None = None, api_version: str | None = None, deployment_name: str | None = None, model_name: str | None = None)[source]¶
Initialize the RAG evaluator.
- Parameters:
llm_provider – LLM provider to use (“openai”, “anthropic”, “deltares”, or “auto”)
deltares_llm – Optional pre-configured DeltaresOllamaLLM instance
temperature – Sampling temperature for the evaluation LLM (default 0.1)
api_key – Optional API key for the LLM provider
api_base – Optional API base URL
api_version – Optional API version
deployment_name – Optional deployment name
model_name – Optional model name to use
- evaluate_context_relevancy(question: str, retrieved_contexts: List[str]) EvaluationResult[source]¶
Evaluate the relevancy of retrieved contexts to the question.
This metric measures the signal-to-noise ratio in the retrieved contexts. It identifies which sentences from the context are actually needed to answer the question.
- Parameters:
question – The user’s question
retrieved_contexts – List of retrieved context chunks
- Returns:
EvaluationResult with score and explanation
- evaluate_context_precision(question: str, retrieved_contexts: List[str], ground_truth_answer: str | None, top_k: int = 5) EvaluationResult[source]¶
Evaluate Context Precision@k following the Ragas implementation.
For each of the top-k retrieved chunks, the LLM judges whether the chunk supports the reference answer. Average Precision (AP) is then computed as:
AP = sum(Precision@i * rel_i) / (# relevant chunks)
- Parameters:
question – The question to evaluate.
retrieved_contexts – Ranked list of retrieved chunks.
ground_truth_answer – The correct or gold answer to judge chunks against.
top_k – Number of top chunks to evaluate.
- Returns:
EvaluationResult with precision@k score and explanation.
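The Average Precision formula above can be illustrated in plain Python. Here the per-chunk relevance judgments, which the evaluator normally obtains from the LLM, are hard-coded as an assumption:

```python
from typing import List

def average_precision(relevance: List[int]) -> float:
    """AP = sum(Precision@i * rel_i) / (# relevant chunks), over ranked chunks.

    `relevance` holds 1 (chunk supports the reference answer) or 0, in
    retrieval order.
    """
    relevant_so_far = 0
    ap_sum = 0.0
    for i, rel in enumerate(relevance, start=1):
        if rel:
            relevant_so_far += 1
            ap_sum += relevant_so_far / i  # Precision@i, counted only at relevant ranks
    total_relevant = sum(relevance)
    return ap_sum / total_relevant if total_relevant else 0.0

# Top-5 chunks judged relevant (1) or not (0): AP = (1 + 2/3 + 3/4) / 3
print(average_precision([1, 0, 1, 1, 0]))
```

Note that a ranking with the same relevant chunks pushed toward the top yields a higher AP, which is what distinguishes precision@k from a plain hit ratio.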
- evaluate_context_recall(question: str, retrieved_contexts: List[str], ground_truth_answer: str) EvaluationResult[source]¶
Evaluate the recall of retrieved contexts against a ground truth answer.
This metric measures the retriever’s ability to retrieve all the information needed to answer the question, by checking whether each statement from the ground truth can be found in the retrieved contexts.
- Parameters:
question – The user’s question
retrieved_contexts – List of retrieved context chunks
ground_truth_answer – The reference answer to compare against
- Returns:
EvaluationResult with score and explanation
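Conceptually, context recall is the fraction of ground-truth statements attributable to the retrieved contexts. The sketch below substitutes a naive case-insensitive substring check for the LLM judgment the module actually uses, purely to show the arithmetic (statements and contexts are invented):

```python
from typing import List

def naive_context_recall(ground_truth_statements: List[str], contexts: List[str]) -> float:
    """Fraction of ground-truth statements found verbatim in any retrieved context."""
    joined = " ".join(contexts).lower()
    found = sum(1 for s in ground_truth_statements if s.lower() in joined)
    return found / len(ground_truth_statements) if ground_truth_statements else 0.0

statements = [
    "the Rhine flows into the North Sea",
    "its length is about 1,230 km",
]
contexts = ["The Rhine flows into the North Sea near Rotterdam."]
print(naive_context_recall(statements, contexts))  # 1 of 2 statements found -> 0.5
```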
- evaluate_faithfulness(question: str, generated_answer: str, retrieved_contexts: List[str]) EvaluationResult[source]¶
Evaluate the faithfulness of the generated answer to the retrieved contexts.
This metric measures the factual accuracy of the generated answer by checking if all statements in the answer are supported by the retrieved contexts.
- Parameters:
question – The user’s question
generated_answer – The answer generated by the RAG system
retrieved_contexts – List of retrieved context chunks
- Returns:
EvaluationResult with score and explanation
- evaluate_answer_relevancy(question: str, generated_answer: str) EvaluationResult[source]¶
Evaluate the relevancy of the generated answer to the question.
This metric measures how relevant and to-the-point the answer is by generating probable questions that the answer could answer and computing their similarity to the actual question.
- Parameters:
question – The user’s question
generated_answer – The answer generated by the RAG system
- Returns:
EvaluationResult with score and explanation
- calculate_ragas_score(context_precision: float, context_recall: float, faithfulness: float, answer_relevancy: float) float[source]¶
Calculate the RAGAS score as the harmonic mean of all four metrics.
- Parameters:
context_precision – Context precision score
context_recall – Context recall score
faithfulness – Faithfulness score
answer_relevancy – Answer relevancy score
- Returns:
RAGAS score (harmonic mean)
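The harmonic mean penalizes imbalance: one weak metric drags the combined score down much more than an arithmetic mean would, and any zero score forces the result to zero. A minimal sketch of the computation:

```python
from typing import Sequence

def harmonic_mean(scores: Sequence[float]) -> float:
    """Harmonic mean of the metric scores; 0.0 if any score is 0."""
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)

# Four balanced metrics vs. three strong ones and one weak one
print(harmonic_mean([0.9, 0.9, 0.9, 0.9]))
print(harmonic_mean([0.9, 0.9, 0.9, 0.3]))  # pulled down to 0.6, not 0.75
```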
- evaluate_rag_pipeline(question: str, generated_answer: str, retrieved_contexts: List[str], ground_truth_answer: str | None = None) RAGEvaluationResult[source]¶
Evaluate a complete RAG pipeline using all four metrics.
- Parameters:
question – The user’s question
generated_answer – The answer generated by the RAG system
retrieved_contexts – List of retrieved context chunks
ground_truth_answer – Optional ground truth answer for context recall evaluation
- Returns:
Complete evaluation results
- print_evaluation_summary(result: RAGEvaluationResult)[source]¶
Print a formatted summary of the evaluation results.
- Parameters:
result – The evaluation results to summarize
- save_evaluation_results(result: RAGEvaluationResult, output_file: str)[source]¶
Save evaluation results to a JSON file.
- Parameters:
result – The evaluation results to save
output_file – Path to the output JSON file
- dllmforge.rag_evaluation.evaluate_rag_response(question: str, generated_answer: str, retrieved_contexts: List[str], ground_truth_answer: str | None = None, llm_provider: str = 'auto', save_results: bool = True, output_file: str | None = None) RAGEvaluationResult[source]¶
Convenience function to evaluate a RAG response.
- Parameters:
question – The user’s question
generated_answer – The answer generated by the RAG system
retrieved_contexts – List of retrieved context chunks
ground_truth_answer – Optional ground truth answer for context recall evaluation
llm_provider – LLM provider to use (“openai”, “anthropic”, “deltares”, or “auto”)
save_results – Whether to save results to a file
output_file – Optional output file path
- Returns:
Complete evaluation results
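Putting it together, a typical call might look like the sketch below. The question, answer, and contexts are invented for illustration, and a configured LLM provider is assumed to be available:

```python
from dllmforge.rag_evaluation import evaluate_rag_response

result = evaluate_rag_response(
    question="What is the main function of the Delta Works?",
    generated_answer="The Delta Works protect the Netherlands from flooding.",
    retrieved_contexts=[
        "The Delta Works are a series of dams and storm surge barriers "
        "built to protect the Netherlands from flooding by the North Sea.",
    ],
    ground_truth_answer="They protect the Netherlands from North Sea flooding.",
    llm_provider="auto",
    save_results=False,
)
print(f"RAGAS score: {result.ragas_score:.2f}")
print(result.faithfulness.explanation)
```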