Technical Documentation

Deepeval: A Comprehensive Guide To LLM Testing


👤 Author: Cosmic Lounge AI Team
📅 Updated: 6/1/2025
⏱️ Read Time: 10 min
Topics: #llm #ai #model #api #installation #configuration #development #code #introduction #design


🌌 DeepEval: A Comprehensive Guide to LLM Testing

Large language models (LLMs) have revolutionized the way we interact with technology, enabling everything from sophisticated chatbots to advanced content generation. However, ensuring the quality, reliability, and safety of these models is paramount. This is where DeepEval comes in, providing a robust framework for evaluating and testing LLMs.



🌟 What is DeepEval?

DeepEval is an open-source evaluation framework specifically designed for LLMs. It simplifies the process of evaluating LLM outputs by allowing users to “unit test” their models, much like using Pytest for traditional software testing 1. DeepEval offers a wide range of features to ensure comprehensive evaluation, including over 14 research-backed LLM evaluation metrics, synthetic dataset generation, LLM benchmarks, red teaming, and real-time evaluations in production 1. DeepEval is developed and maintained by Confident AI 2, a company focused on building tools for AI development and evaluation. The platform boasts a significant user base, with over 300,000 daily evaluations and 100,000 monthly downloads 3.



🌟 Key Features and Capabilities

DeepEval offers a rich set of features that streamline the LLM evaluation process:

  • Modular Design: DeepEval’s modular architecture allows for flexible and customizable evaluation protocols tailored to specific needs and contexts 2. This flexibility ensures the framework can adapt to various LLM architectures and application domains.

  • Comprehensive Metrics: DeepEval offers a collection of plug-and-use metrics, with over 14 LLM-evaluated metrics backed by research 4. These metrics cover a wide range of use cases, from basic performance indicators to advanced measures of coherence, relevance, faithfulness, hallucination, toxicity, bias, summarization, and contextual understanding 4. These metrics are LLM-evaluated, meaning they align better with human expectations than traditional model-based approaches 5. DeepEval’s metrics are also highly reliable because LLMs are only used for tightly scoped tasks during evaluation, reducing stochasticity and flakiness in scores 5. Moreover, DeepEval provides a comprehensive reason for the scores computed by its metrics, aiding in understanding and debugging evaluations 5.

  • Benchmarks: DeepEval offers state-of-the-art, research-backed benchmarks such as HellaSwag, MMLU, HumanEval, and GSM8K, providing standardized ways to measure LLM performance across various tasks 4. You can easily benchmark any LLM on these popular benchmarks in under 10 lines of code 6 (see the sketch after this list).

  • Synthetic Data Generator: Creating comprehensive evaluation datasets can be challenging. DeepEval includes a data synthesizer that uses an LLM to generate and evolve inputs, creating complex and realistic datasets for diverse use cases 4.

  • Real-time and Continuous Evaluation: DeepEval integrates with Confident AI for continuous evaluation, refining LLMs over time 4. This integration enables continuous evaluation in production, centralized cloud-based datasets, tools for tracing and debugging, evaluation history tracking, and summary report generation for stakeholders 4.

  • Red Teaming: DeepEval supports red teaming, a type of security testing for LLMs 2. While standard LLM evaluation tests your LLM on its intended functionality, red teaming tests your LLM application against intentional, adversarial attacks from malicious users. This helps identify vulnerabilities and improve the security of your LLM application.

  • Conversational Metrics: DeepEval supports conversational metrics, such as Knowledge Retention, Conversation Completeness, Conversation Relevancy, and Role Adherence 6. These metrics are specifically designed for evaluating dialogues and chatbot interactions, providing insights into the coherence and consistency of LLM-generated conversations.

  • Integrations: DeepEval integrates with other tools in the LLM ecosystem, including LlamaIndex and Hugging Face 6. This allows you to seamlessly incorporate DeepEval into your existing workflows and leverage its evaluation capabilities within these platforms.

  • Custom Metrics: DeepEval allows you to customize existing metrics or define your own custom metrics to suit specific needs 2. This flexibility enables you to tailor evaluations to specific criteria and application domains.

  • CI/CD Integration: DeepEval integrates seamlessly with any CI/CD environment 6. This allows you to automate LLM evaluation as part of your development pipeline, ensuring that your models meet the desired quality standards before deployment.
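
As a concrete illustration of the benchmark workflow mentioned above, here is a minimal sketch. It assumes you have already wrapped the model you want to benchmark in a DeepEvalBaseLLM subclass (the MyCustomLLM import below is hypothetical); the MMLU class and its evaluate() method follow the DeepEval benchmarks documentation, but verify the exact signature against your installed version.

```python
from deepeval.benchmarks import MMLU

# Hypothetical module containing a DeepEvalBaseLLM subclass that wraps
# the model you want to benchmark (see Example 5 below for the pattern).
from my_models import MyCustomLLM

# Instantiate a research-backed benchmark shipped with DeepEval
benchmark = MMLU()

# Run the benchmark against your model
benchmark.evaluate(model=MyCustomLLM())

# Overall accuracy across all MMLU tasks
print(benchmark.overall_score)
```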



🌟 Installation

Installing DeepEval is a straightforward process. Follow these steps to get started:

1. Set up a Python Environment: Go to the root directory of your project and create a virtual environment. In the CLI, run:

```bash
python3 -m venv venv
source venv/bin/activate
```

2. Install DeepEval: In your newly created virtual environment, run:

```bash
pip install -U deepeval
```

This command installs the latest version of DeepEval. It’s recommended to keep DeepEval updated to leverage the latest features and improvements 7.

3. (Optional) Login to Confident AI: To keep your testing reports in a centralized place on the cloud and access additional features, use Confident AI, the leading evaluation platform for DeepEval 7. You can sign up for free on their website. To log in via the CLI, run:

```bash
deepeval login
```

Follow the instructions displayed on the CLI to create an account, get your Confident API key, and paste it into the CLI 7.



🌟 DeepEval Commands

DeepEval provides a set of commands to facilitate LLM evaluation. Here are some of the key commands:

  • deepeval test run <file_name>.py: This command runs the test cases defined in the specified Python file. You can use flags like -n to specify the number of processes for parallel evaluation, -c to read from the local cache, and -d to display specific test cases (e.g., “failing” test cases only) 6.

  • deepeval set-azure-openai: This command configures your DeepEval environment to use Azure OpenAI for all LLM-based metrics 5. You need to provide parameters like --openai-endpoint, --openai-api-key, --deployment-name, etc. 5. This allows you to leverage Azure OpenAI models for evaluating your LLMs.

  • deepeval set-ollama: This command configures DeepEval to use Ollama models for your metrics 5. You can optionally specify the base URL of your local Ollama model instance 5. This provides flexibility in choosing the LLM used for evaluation.

  • deepeval set-local-model: This command configures DeepEval to use other local providers like LM Studio or vLLM 5. You need to provide parameters like --model-name, --base-url, and --api-key 5. This allows you to utilize various local LLM providers for evaluation.

  • Logging Hyperparameters: DeepEval allows you to log hyperparameters during evaluation using the evaluate() function or by decorating a function with @deepeval.log_hyperparameters 2. This helps track and compare different model configurations and identify the optimal settings for your LLM application (see the sketch after this list).

  • Labeling Test Cases: DeepEval allows you to label test cases, which can be useful for organizing and filtering test cases on Confident AI 8. This helps manage and analyze your evaluation data more effectively.
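
To make hyperparameter logging concrete, here is a minimal sketch using the evaluate() function. The hyperparameters keyword argument and the keys logged here (“model”, “prompt template”, “temperature”) are assumptions based on the Confident AI documentation; substitute whatever configuration values you actually want to track.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."],
)

# Hyperparameters are logged alongside the evaluation results so that runs
# with different model configurations can be compared on Confident AI.
evaluate(
    test_cases=[test_case],
    metrics=[AnswerRelevancyMetric(threshold=0.8)],
    hyperparameters={  # assumed keyword; keys below are illustrative
        "model": "gpt-4",
        "prompt template": "customer-support-v2",  # hypothetical template name
        "temperature": 0.7,
    },
)
```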



🌟 DeepEval Metrics

DeepEval offers a variety of metrics to evaluate different aspects of LLM performance. Here’s a breakdown of some key metric categories:

  • Answer Relevancy: This metric evaluates how relevant the LLM’s response is to the given input and context. It measures the alignment between the generated output and the user’s query or the provided information.

  • G-Eval: G-Eval is a versatile metric that uses chain-of-thought reasoning to evaluate LLM outputs based on custom criteria 1. It allows you to define evaluation criteria in natural language, making it highly adaptable to different use cases.

  • Conversational Metrics: These metrics, such as Knowledge Retention and Conversation Completeness, are designed specifically for evaluating dialogues rather than individual outputs 1. They assess the coherence, consistency, and overall quality of LLM-generated conversations.

  • Other Metrics: DeepEval provides a range of other metrics, including faithfulness, contextual relevance, hallucination, toxicity, bias, and summarization 5. These metrics cover various aspects of LLM performance, allowing for a comprehensive evaluation of your models (see the standalone-usage sketch after this list).
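
Any of these metrics can also be run standalone, outside of assert_test() or evaluate(), to inspect its score and the reason DeepEval gives for it. Here is a minimal sketch using the HallucinationMetric; the input and context values are illustrative.

```python
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We only offer store credit, never refunds.",
    # HallucinationMetric checks the output against the provided context
    context=["All customers are eligible for a 30 day full refund at no extra costs."],
)

metric = HallucinationMetric(threshold=0.3)
metric.measure(test_case)

# Every DeepEval metric exposes a score and a human-readable reason
print(metric.score)
print(metric.reason)
```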



🌟 Example Codes and Usage

Let’s explore some practical examples of how to use DeepEval for testing LLMs:

⚡ Example 1: Evaluating Answer Relevancy

This example demonstrates how to evaluate the answer relevancy of an LLM’s response.

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Define the test case
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

# Define the metric
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.8)

# Run the test
assert_test(test_case, [answer_relevancy_metric])
```

In this code:

  • We define a test case with a user input (input), the LLM’s response (actual_output), and the relevant context (retrieval_context).

  • We use the AnswerRelevancyMetric with a threshold of 0.8 to evaluate the relevancy of the LLM’s response to the input and context.

  • The assert_test function runs the test case against the specified metric.

⚡ Example 2: Using G-Eval for Custom Evaluation

G-Eval allows you to create custom LLM-evaluated metrics using natural language. This example shows how to use G-Eval to evaluate the correctness of an LLM’s response.

```python
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define the test case
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="You have 30 days to get a full refund at no extra cost.",
    expected_output="We offer a 30-day full refund at no extra costs."
)

# Define the metric
correctness_metric = GEval(
    name="Correctness",
    criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.5
)

# Run the test
assert_test(test_case, [correctness_metric])
```

In this code:

  • We define a test case with a user input (input), the LLM’s response (actual_output), and the expected response (expected_output).

  • We use GEval to define a custom metric called “Correctness”. The criteria parameter specifies the evaluation criteria in natural language.

  • The evaluation_params specify which parameters of the test case should be used for evaluation.

  • The assert_test function runs the test case against the G-Eval metric.

⚡ Example 3: Evaluating a Dataset

DeepEval allows you to evaluate entire datasets of test cases. This example shows how to evaluate a dataset against the HallucinationMetric and AnswerRelevancyMetric.

```python
import pytest

from deepeval import assert_test
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

# Define test cases
first_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])
second_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])

# Create a dataset
dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])

# Define the test function
@pytest.mark.parametrize("test_case", dataset)
def test_customer_chatbot(test_case: LLMTestCase):
    hallucination_metric = HallucinationMetric(threshold=0.3)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [hallucination_metric, answer_relevancy_metric])
```

In this code:

  • We define two test cases (first_test_case and second_test_case).

  • We create an EvaluationDataset with the defined test cases.

  • We use pytest.mark.parametrize to run the test_customer_chatbot function for each test case in the dataset.

  • The test function evaluates each test case against the HallucinationMetric and AnswerRelevancyMetric.

⚡ Example 4: Using the evaluate() Function

The evaluate() function provides a way to evaluate test cases or datasets without Pytest integration. This is particularly useful in notebook environments.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Define the test case
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

# Define the metric
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.8)

# Evaluate the test case
evaluate(test_cases=[test_case], metrics=[answer_relevancy_metric])
```

In this code:

  • We define a test case and a metric, similar to the previous examples.

  • We use the evaluate() function to run the evaluation, providing a list of test cases and metrics as arguments.

⚡ Example 5: Offloading Generation to a Separate Thread

For improved performance, especially when dealing with computationally intensive LLMs, you can offload the generation process to a separate thread using asyncio.

```python
import asyncio

from deepeval.models.base_model import DeepEvalBaseLLM

class MyLLM(DeepEvalBaseLLM):

    # ... (existing code) ...

    async def a_generate(self, prompt: str) -> str:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self.generate, prompt)
```

In this code:

  • We define a custom LLM class that inherits from DeepEvalBaseLLM.

  • We implement the a_generate() method, which uses loop.run_in_executor() to offload the generate() method to a separate thread.

⚡ Example 6: Evaluating Multimodal LLMs

DeepEval supports evaluating multimodal LLMs (MLLMs), which can handle both text and images.

```python
from deepeval.test_case import MLLMTestCase, MLLMImage

# Define the test case
mllm_test_case = MLLMTestCase(
    input=["Change the color of the shoes to blue.", MLLMImage(url="./shoes.png", local=True)],
    actual_output=["The original image of red shoes now shows the shoes in blue.", MLLMImage(url="https://shoe-images.com/edited-shoes", local=False)]
)
```

In this code:

  • We define an MLLMTestCase with an input that includes both text and an image (MLLMImage).

  • The actual_output also includes text and an image, representing the MLLM’s response.

⚡ Example 7: “Golden” Test Cases

DeepEval introduces the concept of “Golden” test cases, which are LLMTestCase objects with no actual_output 2. These are used for generating LLM outputs at evaluation time, allowing you to evaluate the LLM’s ability to generate responses without pre-defined outputs.
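
Below is a minimal sketch of that workflow. It assumes the Golden class and the EvaluationDataset.add_test_case() method described in the DeepEval dataset docs (some versions model goldens as a dedicated Golden class rather than an output-less LLMTestCase); my_llm_app is a hypothetical stand-in for your LLM application.

```python
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.test_case import LLMTestCase

def my_llm_app(prompt: str) -> str:
    # Hypothetical placeholder for your actual LLM application
    return "We offer a 30-day full refund at no extra costs."

# Goldens carry inputs (and optionally expected outputs) but no actual_output
dataset = EvaluationDataset(goldens=[
    Golden(input="What if these shoes don't fit?"),
    Golden(input="How long does shipping take?"),
])

# At evaluation time, generate outputs with your application and convert
# each golden into a complete test case ready for evaluation
for golden in dataset.goldens:
    dataset.add_test_case(
        LLMTestCase(input=golden.input, actual_output=my_llm_app(golden.input))
    )
```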



🌟 DeepEval Integration with Haystack

DeepEval can be integrated with Haystack, an open-source framework for building search systems and question answering systems 9. This integration allows you to evaluate Haystack pipelines using DeepEval’s LLM-based metrics. You can use the DeepEvalEvaluator component in Haystack to evaluate a pipeline against metrics like answer relevancy, faithfulness, and contextual relevance. To use the DeepEvalEvaluator, you need to install the deepeval-haystack package and initialize the evaluator with the desired metric and parameters. You can then run the evaluator on its own or as part of a Haystack pipeline.
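
Here is a minimal sketch of the DeepEvalEvaluator running on its own. The import path, the DeepEvalMetric enum value, the metric_params key, and the run() parameter names are assumptions based on the deepeval-haystack integration docs (cited below); verify them against the Haystack documentation before use.

```python
# pip install deepeval-haystack
from haystack_integrations.components.evaluators.deepeval import (
    DeepEvalEvaluator,
    DeepEvalMetric,
)

# Initialize the evaluator with the desired metric and its parameters
evaluator = DeepEvalEvaluator(
    metric=DeepEvalMetric.FAITHFULNESS,
    metric_params={"model": "gpt-4"},  # assumed parameter name for the evaluation model
)

# Run the evaluator standalone; it can also be added to a Haystack pipeline
# via pipeline.add_component("evaluator", evaluator)
results = evaluator.run(
    questions=["What if these shoes don't fit?"],
    contexts=[["All customers are eligible for a 30 day full refund at no extra costs."]],
    responses=["We offer a 30-day full refund at no extra costs."],
)
print(results)
```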




🔧 Works cited

1. Evaluate LLMs Effectively Using DeepEval: A Practical Guide - DataCamp, accessed on March 3, 2025, https://www.datacamp.com/tutorial/deepeval

2. Quick Introduction | DeepEval - The Open-Source LLM Evaluation Framework - Confident AI, accessed on March 3, 2025, https://docs.confident-ai.com/docs/getting-started

3. Confident AI - The DeepEval LLM Evaluation Platform, accessed on March 3, 2025, https://www.confident-ai.com/

4. What is DeepEval? Features & Examples - Deepchecks, accessed on March 3, 2025, https://www.deepchecks.com/glossary/deepeval/

5. Introduction | DeepEval - The Open-Source LLM Evaluation Framework, accessed on March 3, 2025, https://docs.confident-ai.com/docs/metrics-introduction

6. confident-ai/deepeval: The LLM Evaluation Framework - GitHub, accessed on March 3, 2025, https://github.com/confident-ai/deepeval

7. Confident AI QuickStart | DeepEval - The Open-Source LLM Evaluation Framework, accessed on March 3, 2025, https://docs.confident-ai.com/confident-ai/confident-ai-introduction

8. Test Cases | DeepEval - The Open-Source LLM Evaluation Framework - Confident AI, accessed on March 3, 2025, https://docs.confident-ai.com/docs/evaluation-test-cases

9. DeepEvalEvaluator - Haystack Documentation - Deepset, accessed on March 3, 2025, https://docs.haystack.deepset.ai/docs/deepevalevaluator