Technical Documentation

DeepEval and RAGAS with OpenWebUI and Ollama: A Comprehensive Guide

A technical guide to evaluating LLMs locally using DeepEval and RAGAS, with openWebUI as the frontend and Ollama as the backend.

👤
Author
Cosmic Lounge AI Team
📅
Updated
6/1/2025
⏱️
Read Time
14 min
Topics
#llm #ai #model #gpu #api #server #setup #installation #introduction #design


🌌 DeepEval and RAGAS with OpenWebUI and Ollama: A Comprehensive Guide

This manual provides a comprehensive guide to using DeepEval and RAGAS for assessing LLMs locally, with openWebUI as the frontend and Ollama as the backend. It includes step-by-step instructions, code examples, and troubleshooting tips to help you effectively evaluate your LLMs.



🌟 Prerequisites

Before we begin, ensure you have the following:

  • Basic understanding of Python and command-line interface (CLI)

  • Python 3.10 or later installed

  • openWebUI installed and running locally (accessible at 127.0.0.1:8080)

  • Ollama installed and configured to serve your LLMs



🌟 Introduction

🚀 Welcome to this comprehensive guide! This section will give you the foundational knowledge you need. Large Language Models (LLMs) have revolutionized various applications, from chatbots and question-answering systems to code generation and content creation. Evaluating the performance of these models is crucial to ensure their accuracy, reliability, and safety. DeepEval and RAGAS are two powerful frameworks designed for this purpose.

DeepEval offers a range of LLM-based evaluation metrics, including answer relevancy, faithfulness, and hallucination detection. It allows you to test your LLM applications in a structured manner, providing valuable insights into their strengths and weaknesses. Some DeepEval metrics utilize OpenAI models and require an OpenAI API key [1].

RAGAS (Retrieval-Augmented Generation Assessment) focuses on evaluating Retrieval-Augmented Generation (RAG) pipelines, which combine information retrieval with LLMs to generate more comprehensive and informed responses. It provides metrics such as answer correctness and context relevance to assess the effectiveness of your RAG system.

This guide will walk you through the process of setting up DeepEval and RAGAS with openWebUI and Ollama, enabling you to perform comprehensive evaluations of your LLMs locally. You can also use DeepEval with LlamaIndex, a data framework that provides tools for connecting LLMs with external data [2].



🌟 Benefits of Using DeepEval and RAGAS

DeepEval and RAGAS offer several advantages for evaluating LLMs:

  • Efficiency and Automation: These frameworks enable efficient and automated evaluation of LLMs, saving time and effort compared to manual evaluation [3].

  • Local Evaluation and Data Privacy: Using these frameworks with openWebUI and Ollama allows for local evaluation, ensuring data privacy and reducing reliance on external services [5].

  • Standardized and Objective Assessment: DeepEval and RAGAS provide a standardized and objective way to assess LLM performance, facilitating comparison and improvement of different models and configurations [7].



🌟 Setting up the Environment

This section guides you through setting up the necessary environment for using DeepEval and RAGAS with openWebUI and Ollama.

⚡ Installing DeepEval

Use pip to install DeepEval:

Bash

pip install -U deepeval

⚡ Installing RAGAS

Similarly, install RAGAS using pip:

Bash

pip install -U ragas

⚡ Installing Required Dependencies

DeepEval and RAGAS may have additional dependencies. Refer to their respective documentation for a complete list and installation instructions [1].

⚡ Setting up openWebUI and Ollama

1. Ensure openWebUI is running locally at the specified address (127.0.0.1:8080) [10].

2. Install and configure Ollama to serve your LLMs.

3. Configure openWebUI to connect to your Ollama instance. This typically involves specifying the Ollama server URL and API key in the openWebUI settings. Refer to the openWebUI documentation for detailed instructions on connecting to different LLM providers.

To use DeepEval with Ollama, you need to configure DeepEval to use your local Ollama instance as the model provider. You can do this using the following command:

Bash

deepeval set-local-model --model-name=<your_model_name> --base-url="http://localhost:11434/v1/" --api-key="ollama"

Replace <your_model_name> with the name of the LLM you want to use in Ollama [5].
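
Before running evaluations, it can help to confirm that the endpoint DeepEval will call is actually reachable. The sketch below is not part of the original setup steps; it assumes the openai Python package is installed and that a model (here, hypothetically, llama3) has already been pulled into Ollama with ollama pull llama3.

Python

# Minimal sanity check (a sketch, not part of the required setup): verify that Ollama's
# OpenAI-compatible endpoint answers at the base URL configured for DeepEval.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",  # Ollama's OpenAI-compatible API
    api_key="ollama",                       # Ollama ignores the key, but the client requires one
)

response = client.chat.completions.create(
    model="llama3",  # hypothetical model name; use the one passed to deepeval set-local-model
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(response.choices[0].message.content)

If this call fails, fix the Ollama setup before troubleshooting DeepEval itself.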



🌟 Connecting to the openWebUI API

openWebUI provides an API for interacting with its functionalities programmatically. To connect to the API, you’ll need to obtain an API key.

1. Obtain API Key: Navigate to “Settings” -> “Account” in the openWebUI interface. Generate an API key or use a JWT (JSON Web Token) for authentication [11].

2. Authenticate API Requests: Include the API key in the Authorization header of your API requests using the Bearer scheme. For example: Authorization: Bearer <your_api_key>
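
For example, a minimal authenticated request to openWebUI from Python might look like the sketch below. It assumes openWebUI is reachable at http://127.0.0.1:8080 and uses the chat completions endpoint described in the openWebUI API documentation [11]; the model name is a placeholder for whichever model your Ollama instance serves.

Python

# A sketch of calling the openWebUI API with Bearer authentication (assumptions noted above).
import requests

API_KEY = "<your_api_key>"  # generated under Settings -> Account
BASE_URL = "http://127.0.0.1:8080"

response = requests.post(
    f"{BASE_URL}/api/chat/completions",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "model": "llama3",  # placeholder; use a model served by your Ollama instance
        "messages": [{"role": "user", "content": "What if these shoes don't fit?"}],
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])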



🌟 Using DeepEval to Assess LLMs

DeepEval provides a structured approach to evaluating LLMs, similar to unit testing in software development. Here’s how to use it:

1. Create a Test File: Create a Python file (e.g., test_llm.py) to define your test cases.

2. Import Necessary Modules: Import the required modules from DeepEval:

Python

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

3. Define Test Cases: Create test cases using the LLMTestCase class. Each test case should include the input to the LLM, the actual output, and any relevant context. For example:

Python

def test_answer_relevancy():
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
        retrieval_context=[
            "All customers are eligible for a 30-day full refund at no extra cost."
        ],
    )
    assert_test(test_case, [answer_relevancy_metric])

4. Run the Tests: Use the DeepEval CLI to run the tests:

Bash

deepeval test run test_llm.py

DeepEval will execute the tests and provide a report on the performance of your LLM based on the chosen metrics.



🌟 Using RAGAS to Evaluate RAG Pipelines

RAGAS offers a framework for evaluating RAG pipelines, which are designed to provide more contextually relevant and accurate answers by combining LLMs with external knowledge sources. Here’s how to use RAGAS:

1. Define Your RAG Pipeline: Implement your RAG pipeline using your preferred tools and libraries (e.g., LangChain, LlamaIndex).

2. Prepare Evaluation Data: Create a dataset with questions, relevant contexts, and expected answers. The evaluation data should be formatted as a list of dictionaries, where each dictionary represents a test case with the following keys: question, contexts, answer, and ground_truth [12] (see the dataset sketch after this list). You can also use RAGAS for synthetic test generation: provide a set of documents and instructions to RAGAS, and it will automatically generate test cases based on the provided information [13].

3. Integrate RAGAS: Use RAGAS to evaluate your RAG pipeline against the prepared dataset. RAGAS provides various metrics to assess different aspects of your pipeline.

4. Analyze the Results: Review the evaluation results provided by RAGAS to identify areas for improvement in your RAG pipeline.
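
To make the expected data layout concrete, here is a minimal sketch of building such a dataset and scoring it with RAGAS's evaluate function. It assumes the ragas and datasets packages are installed and that RAGAS is configured with a model it can call (by default it uses OpenAI, so either set an API key or wrap a local LLM as shown later); the question, answer, and context strings are illustrative only.

Python

# A sketch of the question/contexts/answer/ground_truth format and a RAGAS run (assumptions above).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, faithfulness

eval_rows = [
    {
        "question": "What is the refund policy if the shoes don't fit?",
        "contexts": ["All customers are eligible for a 30-day full refund at no extra cost."],
        "answer": "We offer a 30-day full refund at no extra cost.",
        "ground_truth": "Customers can return shoes within 30 days for a full refund.",
    },
]

# RAGAS expects a Hugging Face Dataset with these column names.
dataset = Dataset.from_list(eval_rows)

results = evaluate(dataset, metrics=[answer_correctness, faithfulness])
print(results)              # aggregate score per metric
print(results.to_pandas())  # row-level scores for error analysis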



🌟 Code Examples and Explanations

This section provides code examples demonstrating the use of DeepEval and RAGAS with openWebUI and Ollama.

⚡ DeepEval Example

Python

import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_answer_relevancy():
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra costs.",
        retrieval_context=[
            "All customers are eligible for a 30 day full refund at no extra costs."
        ],
    )
    assert_test(test_case, [answer_relevancy_metric])

This example demonstrates a simple test case using the AnswerRelevancyMetric to evaluate whether the LLM’s response is relevant to the input question and provided context. The AnswerRelevancyMetric takes a threshold parameter, which determines the minimum score required for the test to pass. In this case, a threshold of 0.5 means that the LLM’s response must have a relevancy score of at least 0.5 to be considered relevant. The LLMTestCase class is used to define the test case. It takes three parameters:

  • input: The input prompt or question provided to the LLM.

  • actual_output: The actual output generated by the LLM.

  • retrieval_context: A list of relevant documents or contexts that were used by the LLM to generate the response.

The assert_test function is used to run the test case and assert that it passes. It takes two parameters:

  • test_case: The LLMTestCase object defining the test case.

  • metrics: A list of metrics to use for evaluating the test case.
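
If you prefer to run the same test case outside of pytest, DeepEval also exposes an evaluate function. The short sketch below assumes the test_case and answer_relevancy_metric objects defined in the example above and that DeepEval has been configured for your local Ollama model.

Python

# A sketch of running a test case programmatically instead of via `deepeval test run`.
from deepeval import evaluate

evaluate(test_cases=[test_case], metrics=[answer_relevancy_metric])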

⚡ RAGAS Example (using LangChain)

Python

from langchain.chat_models import ChatOllama
from langchain.embeddings import OllamaEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.evaluators import RagasEvaluator
from ragas.metrics import AnswerCorrectness

# Initialize Ollama LLM and embeddings
llm = ChatOllama(model="llama2")
embeddings = OllamaEmbeddings(model="llama2")

# Wrap LLM for RAGAS
llm_wrapper = LangchainLLMWrapper(llm)

# Initialize RAGAS evaluator
evaluator = RagasEvaluator(llm_wrapper=llm_wrapper, metrics=[AnswerCorrectness()])

# Define your RAG pipeline and test data
# (rag_pipeline and test_data are placeholders for your own implementation)

# Evaluate the pipeline
results = evaluator.evaluate(rag_pipeline, test_data)

# Analyze the results
print(results)

This example shows how to integrate RAGAS with a LangChain-based RAG pipeline using Ollama as the LLM and embeddings provider. It uses the AnswerCorrectness metric to evaluate the accuracy of the generated answers. The LangchainLLMWrapper class is used to wrap the LangChain LLM object so that it can be used with RAGAS. The RagasEvaluator class is used to initialize the RAGAS evaluator. It takes two parameters:

  • llm_wrapper: The wrapped LLM object.

  • metrics: A list of metrics to use for evaluating the RAG pipeline.

The evaluate method of the RagasEvaluator class is used to run the evaluation. It takes two parameters:

  • rag_pipeline: The RAG pipeline to evaluate.

  • test_data: The evaluation data.

The evaluate method returns a dictionary of results, where the keys are the names of the metrics and the values are the corresponding scores.



🌟 Advanced DeepEval Usage

DeepEval offers several advanced features for more comprehensive LLM evaluation:

  • Custom Metrics: You can create your own custom metrics by inheriting from DeepEval’s base metric class. This allows you to tailor the evaluation to your specific needs and criteria [1] (see the sketch after this list).

  • Conversational Metrics: DeepEval provides conversational metrics for evaluating multi-turn conversations with LLMs. These metrics assess aspects such as knowledge retention, role adherence, and conversation completeness [14].

  • Red Teaming: DeepEval allows you to “red team” your LLM application by testing it against various adversarial prompts and scenarios. This helps identify vulnerabilities and improve the robustness of your LLM [14].
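
As a rough illustration of the custom-metric idea, the sketch below subclasses DeepEval's BaseMetric with a deliberately simple, hypothetical check (answer length). The hook names shown (measure, a_measure, is_successful) follow the pattern in the DeepEval docs [1] but may differ between versions, so treat this as an assumption to verify rather than a definitive implementation.

Python

# A hypothetical custom metric: passes when the answer stays under a word budget.
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase


class AnswerLengthMetric(BaseMetric):
    def __init__(self, max_words: int = 50, threshold: float = 0.5):
        self.max_words = max_words
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        # Score 1.0 if the answer fits the word budget, otherwise 0.0.
        word_count = len(test_case.actual_output.split())
        self.score = 1.0 if word_count <= self.max_words else 0.0
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Async variant; here it simply reuses the synchronous logic.
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Answer Length"

An instance of this metric can then be passed to assert_test or evaluate alongside the built-in metrics.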



🌟 Advanced RAGAS Usage

RAGAS also provides advanced features for evaluating RAG pipelines:

  • Diverse Metrics: RAGAS offers a variety of metrics for assessing different aspects of RAG pipelines, including answer correctness, context relevance, and faithfulness [12].

  • Test Set Generation: RAGAS can automatically generate test cases for evaluating RAG pipelines, reducing the manual effort required for data preparation [13].

  • Integration with Specific Pipelines: RAGAS can be integrated with various RAG pipeline implementations, including those built with LangChain and LlamaIndex.



🌟 Debugging and Tracing

LangFuse is a platform that can be used for tracing and debugging LLM applications. It provides detailed insights into the execution flow of your LLM, including the prompts, responses, and intermediate steps. LangFuse integrates with DeepEval and RAGAS, allowing you to analyze evaluation results in the context of the LLM’s execution trace [15].
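
As a rough sketch of what this looks like in code, the snippet below wraps an LLM call with Langfuse's observe decorator so the prompt and response appear as a trace. It assumes the langfuse package is installed and LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST are set in the environment; note that older SDK versions expose the decorator as langfuse.decorators.observe instead.

Python

# A sketch of tracing a call with Langfuse (assumptions noted above).
from langfuse import observe


@observe()
def answer_question(question: str) -> str:
    # Call your Ollama-backed model here; inputs and outputs are recorded on the trace.
    return "We offer a 30-day full refund at no extra cost."


print(answer_question("What if these shoes don't fit?"))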



🌟 Community Resources

Here are some valuable community resources that can help you learn more about DeepEval, RAGAS, openWebUI, and Ollama: the open-webui GitHub discussions [16], the r/OpenWebUI [17], r/LocalLLaMA [18], and r/ollama [20] subreddits, and the Hugging Face forums [19].

These forums and communities provide a platform for asking questions, sharing experiences, and getting help from other users and developers.



🌟 Troubleshooting Tips and Best Practices

Here are some specific tips for troubleshooting common issues and best practices for using DeepEval and RAGAS with openWebUI and Ollama:

  • Check API Key: Ensure your openWebUI API key is valid and correctly included in the request headers.

  • Verify Ollama Setup: Make sure Ollama is properly configured and serving the desired LLMs.

  • Consult Documentation: Refer to the official documentation of DeepEval, RAGAS, and openWebUI for detailed information on their functionalities, metrics, and usage [1].

  • Start with Simple Tests: Begin with basic test cases and gradually increase complexity as you become more comfortable with the frameworks.

  • Use a Variety of Metrics: Employ a diverse set of metrics to gain a comprehensive understanding of your LLM’s performance.

  • Analyze Results Carefully: Don’t just focus on the scores; delve into the explanations and insights provided by the evaluation frameworks to identify specific areas for improvement.

  • Troubleshoot ModuleNotFoundError: If you encounter a ModuleNotFoundError for DeepEval, ensure that you have installed it correctly using pip install -U deepeval. If the issue persists, check your Python environment and dependencies [21] (a quick check is sketched after this list).

  • Be Mindful of DeepEval Limitations: While DeepEval is a valuable tool, it’s important to be aware of its potential limitations. For example, some users have reported issues with random failures related to thread locks [22].
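
As a quick first check for the ModuleNotFoundError case mentioned above, you can confirm which package version pip sees and whether the active interpreter can import it; this is a generic Python environment check, not a DeepEval-specific diagnostic.

Bash

pip show deepeval                                            # confirms the package and version pip installed
python -c "import deepeval; print('deepeval imports OK')"    # confirms the active interpreter can import it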



🌟 Conclusion

DeepEval and RAGAS provide powerful tools for evaluating LLMs locally. By integrating them with openWebUI and Ollama, you can streamline your evaluation process and gain valuable insights into the performance of your LLM applications. This guide has provided you with the necessary steps, code examples, and troubleshooting tips to get started. Remember to consult the official documentation for more advanced features and customization options. Evaluating your LLMs is essential for ensuring their quality and reliability. DeepEval and RAGAS offer a structured and comprehensive approach to this evaluation process, enabling you to identify areas for improvement and build more robust and effective LLM applications.

🔧 Works cited

1. DeepEvalEvaluator - Haystack Documentation - Deepset, accessed on January 8, 2025, https://docs.haystack.deepset.ai/docs/deepevalevaluator

2. RAG/LLM Evaluators - DeepEval - LlamaIndex, accessed on January 8, 2025, https://docs.llamaindex.ai/en/stable/examples/evaluation/Deepeval/

3. Evaluation with DeepEval | Milvus Documentation, accessed on January 8, 2025, https://milvus.io/docs/evaluation_with_deepeval.md

4. explodinggradients/ragas: Supercharge Your LLM Application Evaluations - GitHub, accessed on January 8, 2025, https://github.com/explodinggradients/ragas

5. deepeval and api key · Issue #980 - GitHub, accessed on January 8, 2025, https://github.com/confident-ai/deepeval/issues/980

6. Reclaiming Control: The Emerging Open-Source AI Stack : r/LLMDevs - Reddit, accessed on January 8, 2025, https://www.reddit.com/r/LLMDevs/comments/1hfsoaq/reclaiming_control_the_emerging_opensource_ai/

7. AI-App/DeepEval: The Evaluation Framework for LLMs - GitHub, accessed on January 8, 2025, https://github.com/AI-App/DeepEval

8. Ragas | Haystack - Deepset, accessed on January 8, 2025, https://haystack.deepset.ai/integrations/ragas

9. Ragas, accessed on January 8, 2025, https://docs.ragas.io/

10. Open WebUI: Home, accessed on January 8, 2025, https://docs.openwebui.com/

11. API Endpoints | Open WebUI, accessed on January 8, 2025, https://docs.openwebui.com/getting-started/advanced-topics/api-endpoints/

12. Evaluating RAG Applications with RAGAs | by Leonie Monigatti | Towards Data Science, accessed on January 8, 2025, https://towardsdatascience.com/evaluating-rag-applications-with-ragas-81d67b0ee31a

13. Evaluating Rag Quality with Llama3, Gemma2 & Test Generation on Free GPUs - Medium, accessed on January 8, 2025, https://medium.com/@zayedrais/evaluating-rag-quality-with-llama3-gemma2-test-generation-on-free-gpus-09e91bb8125b

14. Introduction | DeepEval - The Open-Source LLM Evaluation Framework - Confident AI, accessed on January 8, 2025, https://docs.confident-ai.com/docs/metrics-introduction

15. Langfuse, accessed on January 8, 2025, https://langfuse.com/

16. open-webui open-webui · Discussions - GitHub, accessed on January 8, 2025, https://github.com/open-webui/open-webui/discussions

17. I’m the Sole Maintainer of Open WebUI — AMA! : r/OpenWebUI - Reddit, accessed on January 8, 2025, https://www.reddit.com/r/OpenWebUI/comments/1gjziqm/im_the_sole_maintainer_of_open_webui_ama/

18. OpenWebUI is absolutely amazing. : r/LocalLLaMA - Reddit, accessed on January 8, 2025, https://www.reddit.com/r/LocalLLaMA/comments/1dh1hcp/openwebui_is_absolutely_amazing/

19. Use RAGAS with huggingface LLM - Intermediate - Hugging Face Forums, accessed on January 8, 2025, https://discuss.huggingface.co/t/use-ragas-with-huggingface-llm/75769

20. r/ollama - Reddit, accessed on January 8, 2025, https://www.reddit.com/r/ollama/

21. Quick Introduction | DeepEval - The Open-Source LLM Evaluation Framework - Confident AI, accessed on January 8, 2025, https://docs.confident-ai.com/docs/getting-started

22. Open Source and Locally Deployable AI Application Evaluation Tool : r/LLMDevs - Reddit, accessed on January 8, 2025, https://www.reddit.com/r/LLMDevs/comments/1hw1tl3/open_source_and_locally_deployable_ai_application/