🌌 Ollama API: An In-Depth Technical Analysis
🌟 1. Introduction to Ollama and its API
Ollama is a framework that enables the execution of large language models (LLMs) locally on a user's machine 1. By bundling model weights, configuration details, and necessary data into a single package, Ollama simplifies the complexities typically associated with setting up and managing LLMs 2. It exposes a REST API, accessible by default on port 11434, which allows for programmatic interaction with these locally hosted models 1. This design lets users leverage the power of sophisticated LLMs while maintaining control over their data and benefiting from reduced latency compared to cloud-based solutions 2. Ollama supports a wide array of open-source LLMs, making it a versatile tool for various natural language processing tasks 2.

The Ollama API serves as the primary interface for developers and researchers to interact programmatically with the locally running LLMs 1. It provides functionality for managing local models, such as downloading, listing, and deleting them 1, and it allows LLM capabilities to be integrated into diverse applications across programming languages, as evidenced by the availability of Python and JavaScript client libraries 3. This report is intended for developers seeking to incorporate LLMs into their applications, researchers experimenting with these models, and technical users who need a thorough understanding of how to use and optimize the Ollama API. It provides a detailed analysis of the API's features, functionalities, and best practices for achieving optimal performance.
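As a quick, hedged illustration of how lightweight this interaction is, the sketch below asks a locally running server which models it already has, using the /api/tags endpoint described in detail in Section 3. It assumes the requests package is installed and that ollama serve is listening on the default port.

```python
import requests

# Ask a locally running Ollama server which models it has pulled.
# Assumes `ollama serve` is listening on the default port 11434.
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
for model in resp.json().get("models", []):
    print(model["name"])
```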
🌟 2. Identifying the Latest Version of Ollama API
Determining the precise "newest" version of the Ollama API can present certain challenges. The API's versioning does not always correspond directly to the version of the overall Ollama software 7, and different components of the Ollama ecosystem, such as the core Go package and the various client libraries, follow their own independent versioning schemes 4. Examining the available version information reveals several key data points. The core Go package, which underpins the Ollama service, is currently at version v0.6.2 7. The HexDocs documentation, on the other hand, refers to version 0.3.0 of an Ollama API client 8. Notably, the Mistral model page mentions version 0.3 of that model, which introduced support for function calling 9. The ollama-api package on PyPI is listed at version 0.1.1 4. Furthermore, the release notes available on GitHub mention software versions such as v0.6.0, which introduced support for Google's Gemma 3 models 10. Considering the version of the core Go package and the information from the GitHub releases, the latest core Ollama API version most likely aligns with the v0.6.x series. To programmatically ascertain the exact version of the Ollama server in use, the API provides a dedicated /api/version endpoint 7, which returns the Ollama server version as a string 7. This offers a reliable way to confirm the specific version being interacted with, which is essential for ensuring compatibility with particular API features and for troubleshooting.
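As a minimal sketch (assuming a server is running on the default localhost:11434 address), the version can be read directly from the /api/version endpoint with a plain HTTP GET:

```python
import requests

# Read the server version from the dedicated /api/version endpoint.
resp = requests.get("http://localhost:11434/api/version", timeout=5)
resp.raise_for_status()
print(resp.json()["version"])  # e.g. "0.6.2"
```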
🌟 3. Comprehensive Guide to Ollama API Endpoints
The Ollama API provides a set of endpoints that enable various interactions with locally hosted language models. These endpoints facilitate tasks ranging from generating text completions and engaging in chat interactions to managing models and generating embeddings. The following table provides a summary of the available API endpoints, their corresponding HTTP methods, and a brief description of their functionality.
Endpoint Path | HTTP Method | Description |
---|---|---|
/api/generate | POST | Generates a response for a given prompt using a specified model. |
/api/chat | POST | Generates the next message in a chat conversation using a specified model. |
/api/create | POST | Creates a new model from a Modelfile or by importing an existing model. |
/api/tags | GET | Lists the models that are available locally. |
/api/show | POST | Shows detailed information about a specific model. |
/api/copy | POST | Creates a copy of an existing model with a new name. |
/api/delete | DELETE | Deletes a model and its associated data. |
/api/pull | POST | Downloads a model from the Ollama library. |
/api/push | POST | Uploads a model to a model library (requires ollama.ai registration). |
/api/embed | POST | Generates embeddings from a model for a given input text. |
/api/ps | GET | Lists the models that are currently loaded into memory. |
/api/version | GET | Retrieves the version of the Ollama server. |
/api/blobs/:digest | HEAD, POST | Checks for or pushes file blobs used in model creation. |
⚡ Detailed Description of Each Endpoint:
- /api/generate (POST): Generate a completion. This endpoint is used for single-turn text generation, where a prompt is provided and the model generates a response based solely on that input 1. By default, the endpoint streams the response back to the client as a series of JSON objects 11. The request requires the model name and the prompt 11. Optional parameters include suffix, which appends text after the model's response, and images, a list of base64-encoded images for multimodal models such as LLaVA 11. Advanced parameters allow further customization: format can be set to "json" or to a JSON schema for structured outputs 11, and the options parameter accepts additional model-specific parameters such as temperature 11. system and template can override the system message and prompt template defined in the model's Modelfile 11. Setting stream to false returns the entire response as a single JSON object 11. The raw parameter, if set to true, disables any formatting of the prompt 11. keep_alive controls how long the model remains loaded in memory after the request, with a default of 5 minutes 11. Usage examples with curl demonstrate various ways to interact with this endpoint. A basic example that asks the llama2 model "Why is the sky blue?" is: curl -X POST http://localhost:11434/api/generate -d '{ "model": "llama2", "prompt": "Why is the sky blue?" }' 12. To disable streaming and receive a single JSON response, set stream to false: curl http://localhost:11434/api/generate -d '{ "model": "llama3.2", "prompt": "How are you today?", "stream": false }' 5. Python examples using the ollama library provide another way to call this endpoint. To generate text with the llama3 model 6:

  ```python
  import ollama

  response = ollama.generate(model='llama3', prompt='What are the benefits of using Ollama?', options={'num_ctx': 2048})
  print(response['response'])
  ```

  This demonstrates the ease of use provided by the client libraries. In streaming mode, the expected response is a series of JSON objects, and the final object in the stream includes statistics about the generation process, such as total_duration, load_duration, prompt_eval_count, and eval_duration 11. If streaming is disabled, the response is a single JSON object containing the full response 11.
- /api/chat (POST): Generate a chat completion. This endpoint is designed for multi-turn conversational interactions, allowing the model to generate the next message in a chat based on the provided history 1. Like /api/generate, it streams responses by default 11. The request requires the model name and an array of messages 11; each message must include a role (e.g., "system", "user", "assistant", or "tool") and the content of the message 11. An optional tools parameter provides a JSON list of tools that the model can use if it supports tool calling 11. A curl example that starts a chat with the smollm2:135m model, without streaming, looks like this 13:

  ```bash
  curl http://localhost:11434/api/chat -d '{ "model": "smollm2:135m", "stream": false, "messages": [ { "role": "user", "content": "Why is the sky blue? Give the shortest answer possible in under 20 words" } ] }'
  ```

  This demonstrates how to structure the messages array to provide the conversation history. In Python, using the ollama library, a chat interaction can be initiated as follows 14:

  ```python
  from ollama import chat

  conversation = [{"role": "user", "content": "Hello, how are you?"}]
  reply = chat(model='llama2', messages=conversation)
  print(reply.message.content)
  ```

  This sends a single user message and prints the assistant's reply; a runnable sketch that consumes the streamed response appears after this endpoint list. The tools parameter can be used to integrate function calling, allowing the model to invoke external functions based on the conversation 14. The response, whether streamed or returned as a single object, includes a message object containing the role and content of the assistant's reply 11. The /api/chat endpoint is essential for building interactive, context-aware applications that maintain a conversation history with the language model, and the tools parameter significantly expands its usefulness for agent-like interactions.
- /api/create (POST): Create a Model. This endpoint enables the creation of new models locally. A new model can be based on an existing model, built from a Modelfile, or produced by importing models in formats such as GGUF or Safetensors 11. The request requires a name for the new model 7. The from parameter can specify an existing model to base the new one on 7. When creating a model from a Modelfile, the contents of the Modelfile are typically included in the request body, although the exact mechanism may vary depending on the client library or tool used. Other parameters, such as files (a dictionary of filenames to SHA256 digests of blobs), adapters, template, license, system, parameters, messages, stream, and quantize, allow further customization of the new model 7. While a direct curl example for creating a model via the API is not explicitly provided in the snippets, the process involves sending a POST request to /api/create with the necessary parameters, potentially including the Modelfile content. The Python library provides a more straightforward way to create models. For example, to create a model named "example" based on "llama3.2" with a custom system prompt:

  ```python
  import ollama

  ollama.create(model='example', from_='llama3.2', system="You are Mario from Super Mario Bros.")
  ```

  The expected response is typically a stream of progress updates as the model is created, culminating in a final success message 11. The /api/create endpoint is fundamental for users who need to customize models, import external models, or optimize them for specific hardware or use cases through techniques like quantization.
- /api/tags (GET): List Local Models. This endpoint retrieves a list of all models currently available locally on the Ollama server 11. It requires no request parameters 11. A curl example to list the local models is: curl http://localhost:11434/api/tags 13. The output is a JSON object containing an array of model objects; each object provides the model's name, the underlying model identifier, the modified_at timestamp, the size in bytes, a unique digest, and further details including the parent model, format (e.g., "gguf"), family (e.g., "llama"), parameter size, and quantization level 13. Using the Python library, the same information can be obtained with 19:

  ```python
  import ollama

  models = ollama.list()
  print(models)
  ```

  The response is a list of dictionaries, each representing a local model with its associated details. This endpoint is crucial for quickly checking which models are downloaded and ready to use with other API calls.
- /api/show (POST): Show Model Information. This endpoint displays detailed information about a specific model 11. The request requires the model name 11, and an optional verbose parameter can be set to true for more extensive output 11. While a direct curl example is not provided, a request would look like: curl http://localhost:11434/api/show -d '{ "model": "llama3.2" }'. The expected response is a JSON object containing various details about the specified model, including its Modelfile contents, template, parameters, license information, and system prompt 11.
- /api/copy (POST): Copy a Model. This endpoint creates a new model that is an exact copy of an existing one 11. The request requires two parameters: source, the name of the model to be copied, and destination, the desired name for the new model 11. A curl request would be structured as: curl http://localhost:11434/api/copy -d '{ "source": "llama3.2", "destination": "llama3.2-copy" }'. The Python library provides a function for this as well: ollama.copy(source='llama3.2', destination='llama3.2-copy') 20. The expected response is a success message 11. This functionality is useful for creating backups of models or for experimenting with modifications on a copy without affecting the original model.
- /api/delete (DELETE): Delete a Model. This endpoint removes a specified model and all its associated data from the local Ollama instance 11. The request requires the name of the model to delete 11. A curl command to delete a model would be: curl -X DELETE http://localhost:11434/api/delete -d '{ "model": "llama3.2-copy" }'. In Python, the ollama library offers the delete function: ollama.delete(model='llama3.2-copy') 20. The expected response is a success message 11. This endpoint is essential for managing storage space by removing models that are no longer needed.
- /api/pull (POST): Pull a Model. This endpoint downloads a model from the Ollama library to the local machine 11. The request requires the name of the model to pull 11; optional parameters include insecure and stream 11. A curl example for pulling the llama3.2 model is not shown directly in the provided snippets, but it would likely resemble: curl -X POST http://localhost:11434/api/pull -d '{ "model": "llama3.2" }'. The Python library provides the pull function: ollama.pull(model='llama3.2') 19. The expected response is a stream of progress updates as the model layers are downloaded, followed by a success message 11. A combined model-management sketch (pull, copy, list, delete) appears after this endpoint list.
- /api/push (POST): Push a Model. This endpoint uploads a locally created model to a model library 8. The operation typically requires registering for an account on ollama.ai and adding a public key 8. The model name must be provided in the form namespace/model:tag 11, and the optional insecure and stream parameters are also available 11. As with the pull endpoint, a direct curl example for pushing a model is not provided; using the Python library, the push function would be used: ollama.push(model='your-namespace/your-model:latest') 20. The expected response is a stream of progress updates during the upload process, followed by a success message 11. This endpoint enables users to share their custom-trained or modified models with the broader Ollama community.
- /api/embed (POST): Generate Embeddings. This endpoint generates vector embeddings for a given input using a specified model 11. The request requires the model name and the input text, which can be a single string or a list of strings 11; optional parameters include truncate, options, and keep_alive 11. A curl example that generates embeddings for the text "The sky is blue because of Rayleigh scattering" using the nomic-embed-text model is: curl http://localhost:11434/api/embeddings -d '{ "model": "nomic-embed-text", "prompt": "The sky is blue because of Rayleigh scattering" }' 21. Note that the /api/embeddings path used in this example has been superseded by /api/embed in the current documentation 11. In Python, the ollama library provides the embed function: ollama.embed(model='llama3.2', input='The sky is blue because of rayleigh scattering') 16. The function also accepts a list of inputs to generate embeddings in batch, for example ollama.embed(model='llama3.2', input=['The sky is blue because of rayleigh scattering', 'Grass is green because of chlorophyll']) 16. The expected response is a JSON object containing the model name and an array of embeddings, where each embedding is a vector representation of the corresponding input 11. A sketch that computes cosine similarity between two embeddings appears after this endpoint list.
- /api/ps (GET): List Running Models. This endpoint lists the models currently loaded into the Ollama server's memory 11. It requires no parameters 11. A curl request to this endpoint would be: curl http://localhost:11434/api/ps. The expected response is a JSON object containing an array of running model objects; each object includes the model's name, a unique id, its size in memory, the processor it is currently running on (e.g., "100% GPU" or "100% CPU"), and the time until it is scheduled to be unloaded 23.
- /api/version (GET): Version. This endpoint retrieves the version of the Ollama server software 11. It requires no parameters 11. A curl request to get the version is: curl http://localhost:11434/api/version. The expected response contains the Ollama server version as a string 7. This is the most direct way to programmatically determine the version of the Ollama server.
- /api/blobs/:digest (HEAD, POST): Manage Model Blobs. These endpoints are used when creating models from local files. The HEAD method checks whether a file blob, identified by its SHA256 digest, exists on the server 11. The POST method uploads a file to the Ollama server to create a blob, likewise identified by its SHA256 digest in the path 11; the request body contains the file content.
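As referenced in the /api/chat and /api/generate descriptions above, streaming responses arrive as newline-delimited JSON objects, with a final object marked "done": true that carries timing statistics. The sketch below is a minimal, hedged example of consuming that stream with the requests package; the model name is an assumption, and any locally pulled chat model can be substituted.

```python
import json
import requests

payload = {
    "model": "llama3.2",  # assumed to be pulled locally; substitute any chat model
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "stream": True,
}

# /api/chat streams newline-delimited JSON objects until "done" is true.
with requests.post("http://localhost:11434/api/chat", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        if chunk.get("done"):
            print()  # end the streamed line of text
            print("eval_count:", chunk.get("eval_count"))
            break
        print(chunk["message"]["content"], end="", flush=True)
```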
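The /api/embed entry above mentions batch inputs; the following hedged sketch sends two sentences in one request and compares the returned vectors with cosine similarity. The embedding model name follows the nomic-embed-text example cited earlier and is assumed to be pulled locally.

```python
import math
import requests

payload = {
    "model": "nomic-embed-text",  # assumed to be pulled locally
    "input": [
        "The sky is blue because of Rayleigh scattering",
        "Grass is green because of chlorophyll",
    ],
}
resp = requests.post("http://localhost:11434/api/embed", json=payload)
resp.raise_for_status()
a, b = resp.json()["embeddings"]  # one vector per input string

dot = sum(x * y for x, y in zip(a, b))
norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
print(f"cosine similarity: {dot / norm:.3f}")
```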
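Finally, as flagged in the /api/pull entry, the model-management endpoints compose naturally into a small round trip. The sketch below, written against the same local-server assumption with plain HTTP calls, pulls a model without streaming, copies it, lists what is available, and deletes the copy; the model names are illustrative.

```python
import requests

BASE = "http://localhost:11434"

# Pull a model (non-streaming for brevity; a real pull can take a while).
requests.post(f"{BASE}/api/pull", json={"model": "llama3.2", "stream": False}).raise_for_status()

# Copy it, then list everything that is available locally.
requests.post(f"{BASE}/api/copy", json={"source": "llama3.2", "destination": "llama3.2-backup"}).raise_for_status()
models = requests.get(f"{BASE}/api/tags").json()["models"]
print([m["name"] for m in models])

# Remove the copy again to free disk space.
requests.delete(f"{BASE}/api/delete", json={"model": "llama3.2-backup"}).raise_for_status()
```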
🌟 4. Optimizing Model Loading and GPU Utilization
To use the Ollama API effectively, it is important to understand how to optimize model loading and how to leverage GPU resources. Best practice for loading models is to use the ollama pull command or the corresponding API endpoint (/api/pull) to download models from the Ollama library 1. Once downloaded, a model is loaded into memory the first time it is used in an inference request (e.g., via /api/generate, /api/chat, or /api/embed). The keep_alive parameter in these endpoints controls how long a model persists in memory after a request completes 11; setting an appropriate keep_alive duration keeps frequently used models loaded and avoids the overhead of reloading them for subsequent requests. Additionally, sending a request with only the model parameter and an empty prompt (for /api/generate and /api/embed) or an empty messages array (for /api/chat) loads the specified model into memory without generating any output. Conversely, setting keep_alive to 0 in such a request unloads the model from memory.

Ollama is designed to use GPU acceleration when compatible hardware is available 2. It supports NVIDIA GPUs with a compute capability of 5.0 and above 6 and also offers support for AMD GPUs 25, which may be in a preview state on certain operating systems 25. When Ollama starts via the ollama serve command 1, it automatically attempts to detect and utilize available GPUs 23. On systems with multiple NVIDIA GPUs, the environment variable CUDA_VISIBLE_DEVICES can restrict Ollama's usage to a specific subset of GPUs, identified by their IDs 23.

The API provides parameters within the options field of the /api/generate, /api/chat, and /api/embed endpoints that directly influence GPU usage during inference 11. The num_gpu option controls how many model layers are loaded onto the GPU 27: a value of -1 appears to let Ollama allocate layers dynamically, 0 forces the model to run on the CPU, and positive integers specify the number of layers to offload to the GPU 27. On systems with multiple GPUs, the main_gpu option designates a specific GPU (by index, typically starting from 0) to handle smaller tensors, since the overhead of distributing these computations across multiple GPUs may not be worthwhile 27. Although these parameters offer fine-grained control over GPU utilization at the API level, their behavior has been reported to be inconsistent across Ollama versions, and users have encountered cases where they do not function as expected 29.

To monitor GPU utilization, the ollama ps command is an invaluable tool 23. Its output includes a PROCESSOR column indicating where the model is currently loaded: 100% GPU means the model is fully on the GPU, 100% CPU means it is running entirely in system memory, and mixed percentages, such as 48%/52% CPU/GPU, indicate the model is partially loaded on both 23. Additionally, standard system monitoring tools such as nvidia-smi (for NVIDIA GPUs) or AMD Adrenalin (for AMD GPUs) provide more detailed insight into GPU load, memory usage, and other performance metrics 23.
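As a hedged illustration of the parameters discussed above (not a guaranteed recipe, since their behavior has varied between Ollama versions), the request below offloads a fixed number of layers to the GPU, pins small-tensor work to GPU 0, and keeps the model resident for ten minutes; the resulting placement should be verified afterwards with ollama ps or nvidia-smi. The model name and layer count are illustrative assumptions.

```python
import requests

payload = {
    "model": "llama3.2",                 # assumed to be pulled locally
    "prompt": "Explain why GPU offloading speeds up inference.",
    "stream": False,
    "keep_alive": "10m",                 # keep the model loaded for 10 minutes
    "options": {
        "num_gpu": 32,                   # number of layers to offload to the GPU
        "main_gpu": 0,                   # GPU index for small tensors (multi-GPU systems)
        "temperature": 0.7,
    },
}
resp = requests.post("http://localhost:11434/api/generate", json=payload)
resp.raise_for_status()
print(resp.json()["response"])
```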
🌟 5. Ollama API Version History and Release Notes: A Detailed Overview of Recent Changes
The Ollama API has been under active development, with numerous updates and enhancements introduced in recent versions. Examining the release notes provides valuable insight into the evolution of the API and the new capabilities it offers.

Key features and improvements in recent Ollama versions include support for several new models, such as Google's Gemma 3 (in various parameter sizes), IBM's Granite 3.0, and Meta's Llama 3.2 Vision 10. Function calling has been introduced and refined, with support in Mistral v0.3 and Llama 3.1 9. The API now supports structured outputs, allowing responses to be constrained to a specific JSON schema 25, and tool support enables models to interact with external tools to perform more complex tasks 25. The introduction of dedicated embedding models further enhances Ollama's utility for tasks like semantic search 25. Hardware support has also been expanded: AMD GPUs are supported in preview on Windows and Linux 25, specific support for AMD Strix Halo GPUs has been added 10, and the Windows preview of Ollama includes built-in GPU acceleration 25. Initial compatibility with the OpenAI Chat Completions API has been implemented, facilitating the use of existing tooling built for OpenAI with local Ollama models 25.

Performance has been a significant focus, with optimizations for Gemma 3 model loading and inference speed 10. New command-line options, such as ollama show -v for more detailed model information, have been added 10, and several bug fixes have addressed memory errors, model loading failures, and permission problems 10. A new environment variable, OLLAMA_CONTEXT_LENGTH, allows users to set the default context length for models 10. Version-specific notes highlight the introduction of Gemma 3 models in version 0.6.0 10, and the most recent information available from the research snippets reflects activity up to February 2025 13. Keeping track of these version-specific changes is important for users who want to take advantage of the latest features or need to address issues present in earlier versions. For a comprehensive overview of the Ollama API's version history and detailed change logs, consult the official Ollama GitHub releases page 30, which provides a chronological list of all releases, including pre-releases, with detailed descriptions of the changes, bug fixes, and new features introduced in each version.
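Because structured outputs are among the more consequential recent additions, the hedged sketch below shows the general shape of such a request: a JSON schema is passed in the format field and the constrained reply is parsed as JSON. The model name and schema are illustrative assumptions, and the feature requires a reasonably recent Ollama release.

```python
import json
import requests

# Illustrative schema; any JSON schema the response should conform to will do.
schema = {
    "type": "object",
    "properties": {
        "country": {"type": "string"},
        "capital": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["country", "capital", "population"],
}

payload = {
    "model": "llama3.2",   # assumed to be pulled locally
    "prompt": "Tell me about Canada.",
    "format": schema,      # constrain the output to the schema above
    "stream": False,
}
resp = requests.post("http://localhost:11434/api/generate", json=payload)
resp.raise_for_status()
print(json.loads(resp.json()["response"]))
```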
🌟 6. Conclusion and Recommendations
The Ollama API presents a comprehensive and evolving interface for interacting with large language models locally. Its extensive set of endpoints enables a wide range of functionalities, from basic text generation to complex chat interactions and model management. The API’s design prioritizes ease of use and flexibility, allowing developers and researchers to seamlessly integrate LLMs into their projects. Based on the analysis, the following recommendations are provided for users of the Ollama API:
- Always refer to the official Ollama API documentation in the GitHub repository for the most up-to-date and accurate information regarding endpoints, parameters, and usage 11.
- Experiment with the options parameter in the inference endpoints, particularly num_gpu and main_gpu, while closely monitoring GPU usage with ollama ps and system monitoring tools, to optimize performance for specific models and hardware configurations.
- Stay informed about the latest features, improvements, and bug fixes by regularly checking the Ollama blog and the release notes on the official GitHub repository 10.
- Leverage the available client libraries for Python and JavaScript to simplify the integration of the Ollama API into applications 4.
- Utilize the /api/version endpoint to programmatically determine the version of the Ollama server, both to ensure compatibility and to take advantage of version-specific features.

Potential future directions for the Ollama API include more detailed and version-specific documentation to address the challenges in tracking the latest changes. Enhancements to the consistency and robustness of the GPU utilization controls, particularly the num_gpu and main_gpu parameters, would be beneficial, and further expansion of support for hardware architectures beyond NVIDIA and AMD GPUs could broaden the accessibility of Ollama.
🔧 Works cited
1. Ollama REST API | Documentation | Postman API Network, accessed on March 27, 2025, https://www.postman.com/postman-student-programs/ollama-api/documentation/suc47x8/ollama-rest-api
2. How to Run LLMs Locally with Ollama AI - GPU Mart, accessed on March 27, 2025, https://www.gpu-mart.com/blog/run-llms-with-ollama
3. Using the Ollama API to run LLMs and generate responses locally - DEV Community, accessed on March 27, 2025, https://dev.to/jayantaadhikary/using-the-ollama-api-to-run-llms-and-generate-responses-locally-18b7
4. ollama-api - PyPI, accessed on March 27, 2025, https://pypi.org/project/ollama-api/
5. Ollama API Usage Examples - GPU Mart, accessed on March 27, 2025, https://www.gpu-mart.com/blog/ollama-api-usage-examples
6. Ollama Api Python Example | Restackio, accessed on March 27, 2025, https://www.restack.io/p/ollama-api-answer-python-example-cat-ai
7. api package - github.com/ollama/ollama/api - Go Packages, accessed on March 27, 2025, https://pkg.go.dev/github.com/ollama/ollama/api
8. Ollama.API - HexDocs, accessed on March 27, 2025, https://hexdocs.pm/ollama/0.3.0/Ollama.API.html
9. mistral - Ollama, accessed on March 27, 2025, https://ollama.com/library/mistral
10. Releases · ollama/ollama - GitHub, accessed on March 27, 2025, https://github.com/ollama/ollama/releases
11. ollama/docs/api.md at main · ollama/ollama - GitHub, accessed on March 27, 2025, https://github.com/ollama/ollama/blob/main/docs/api.md
12. llama2 - Ollama, accessed on March 27, 2025, https://ollama.com/library/llama2
13. Using Ollama APIs to generate responses and much more [Part 3] - Geshan's Blog, accessed on March 27, 2025, https://geshan.com.np/blog/2025/02/ollama-api/
14. Using Ollama with Python: Step-by-Step Guide - Cohorte Projects, accessed on March 27, 2025, https://www.cohorte.co/blog/using-ollama-with-python-step-by-step-guide
15. ollama/README.md at main - GitHub, accessed on March 27, 2025, https://github.com/ollama/ollama/blob/main/README.md
16. Ollama Python library - GitHub, accessed on March 27, 2025, https://github.com/ollama/ollama-python
17. Python & JavaScript Libraries · Ollama Blog, accessed on March 27, 2025, https://ollama.com/blog/python-javascript-libraries
18. How to Use Curl and API Request from Postman to Locally Run AI Models in Ollama | Chat Endpoints - YouTube, accessed on March 27, 2025, https://www.youtube.com/watch?v=QjdHorEwz5E
19. LLM with Ollama Python Library | Data-Driven Engineering - APMonitor, accessed on March 27, 2025, https://apmonitor.com/dde/index.php/Main/LargeLanguageModel
20. Changelog • ollamar - GitHub Pages, accessed on March 27, 2025, https://hauselin.github.io/ollama-r/news/index.html
21. nomic-embed-text - Ollama, accessed on March 27, 2025, https://ollama.com/library/nomic-embed-text
22. generate embedding | Ollama REST API - Postman, accessed on March 27, 2025, https://www.postman.com/postman-student-programs/ollama-api/request/tzimef1/generate-embedding
23. Ollama Gpu Usage Insights | Restackio, accessed on March 27, 2025, https://www.restack.io/p/ollama-answer-gpu-usage-cat-ai
24. Ollama Memory Usage Insights | Restackio, accessed on March 27, 2025, https://www.restack.io/p/ollama-answer-memory-usage-cat-ai
25. Blog · Ollama, accessed on March 27, 2025, https://ollama.com/blog
26. Four Ways to Check if Ollama is Using Your GPU or CPU - YouTube, accessed on March 27, 2025, https://www.youtube.com/watch?v=on3rtyPWSgA&pp=0gcJCfcAhR29_xXO
27. Ollama Chat :: Spring AI Reference, accessed on March 27, 2025, https://docs.spring.io/spring-ai/reference/api/chat/ollama-chat.html
28. How to run Ollama only on a dedicated GPU? (Instead of all GPUs) · Issue #1813 - GitHub, accessed on March 27, 2025, https://github.com/ollama/ollama/issues/1813
29. Stop ollama from running in GPU - Reddit, accessed on March 27, 2025, https://www.reddit.com/r/ollama/comments/1csj43l/stop_ollama_from_running_in_gpu/
30. Ollama Update Notes | Restackio, accessed on March 27, 2025, https://www.restack.io/p/ollama-answer-update-notes-cat-ai
31. Ollama Version Updates | Restackio, accessed on March 27, 2025, https://www.restack.io/p/ollama-answer-version-updates-cat-ai