Should I use Ollama or llama.cpp?

If you’re a beginner or want an easy setup, we recommend starting with Ollama. It’s user-friendly and automates model management and API provisioning. Advanced users requiring fine-grained parameter tuning or custom build options might prefer using llama.cpp directly. Starting with Ollama and switching to llama.cpp as needed is an ideal learning pathway.

How good is the Japanese language capability of local AI?

The latest open-source models (e.g., Llama 3, Qwen 2) have significantly improved in understanding and generating Japanese. However, there are still scenarios where they may fall short compared to cutting-edge cloud-based models like GPT-4 or Claude 3.5. Combining Japanese-optimized models (e.g., Swallow, ELYZA) with Ollama can enhance the quality of Japanese interactions.

What kind of PC specs are needed?

A minimum of 8GB of memory is required, but 16GB or more is recommended for smoother performance. A GPU is not mandatory, but having an NVIDIA graphics card (VRAM 8GB or more) or an Apple Silicon machine will greatly improve response speed. Start with your current PC, and upgrade hardware if needed.

How do I choose the right model for my needs?

The choice of model depends on your use case and hardware specifications. For general conversations, Llama 3 (8B) is an excellent choice. For faster responses, try smaller models like Phi 3 Mini. For code generation, CodeLlama is recommended. Start with a smaller model to evaluate quality and speed, and scale up if your hardware can handle it.

Introduction to Developing Local AI Agents: A Complete Guide to Ollama and llama.cpp

Learn how to create AI agents that run entirely on your PC without sending data to the cloud. From setup to practical use cases, this comprehensive guide covers privacy-focused local AI development with Ollama and llama.cpp.

May 23, 2026 11 min read Reviewed & edited by the SINGULISM Editorial Team

Introduction to Developing Local AI Agents: A Complete Guide to Ollama and llama.cpp — Photo by Ales Nesetril on Unsplash

What is a Local AI Agent?

In recent years, services leveraging large language models like ChatGPT and Claude have rapidly gained popularity.

However, when using these services, all input text and conversation data are sent to cloud servers. Many individuals may feel uneasy about sending corporate confidential information or personal private data to external servers.

A local AI agent is an AI system that operates entirely within your own computer or server, without relying on cloud services. Since data is not transmitted externally, privacy is maximally protected. Additionally, it functions even without an internet connection, eliminating recurring costs like monthly fees or API usage charges.

This article provides a complete guide on building local AI agents using open-source tools “Ollama” and “llama.cpp.” Even those with limited programming experience can follow along, as it explains everything step-by-step, from fundamental concepts to practical implementation.

Why Local AI is Gaining Attention

Local AI is gaining attention for several key reasons:

Protecting Data Privacy

The primary advantage is privacy. When businesses use AI to process customer data or internal documents, there is a risk involved in sending such information to external servers.

By operating AI in a local environment, data remains within the company’s machines. This is a decisive benefit for industries that deal with highly sensitive information, such as medical records, legal documents, or financial data.

Reduced Ongoing Costs

Using cloud-based large language models incurs fees proportional to the number of tokens processed. For tasks involving large volumes of text, API costs can escalate quickly.

In a local environment, the only recurring cost after the initial hardware purchase is electricity. Over the long term, this can result in significant cost savings.

Customization and Flexibility

In a local environment, you have the freedom to adjust the size and parameters of the model you use. You can also fine-tune models for specific applications.

Additionally, since it does not rely on network connectivity, local AI can be useful in offline environments or in situations where security concerns prohibit internet access.

What is Ollama?

Ollama is an open-source tool designed to make running large language models locally both simple and accessible. Introduced in 2023, its community has been rapidly growing.

Key Features of Ollama

The primary appeal of Ollama lies in its ease of use. Traditionally, running large language models locally required multiple steps, such as setting up a Python environment, installing various libraries, and configuring GPU drivers. With Ollama, you can download and run models with a single command.

Here are its main features:

One-command model execution: Simply type ollama run [model name] in the terminal to begin interacting with the AI.
Wide range of supported models: Ollama supports many major open-source models, including Llama 3, Mistral, Gemma 2, Phi 3, and Qwen 2.
Cross-platform support: Available for macOS, Windows, and Linux.
REST API: Easily integrate with custom applications.

How Ollama Works

Under the hood, Ollama operates using a llama.cpp-based inference engine. This engine efficiently loads model weights into memory and performs fast inference using both CPU and GPU resources. Users can interact with the AI via a simple interface without worrying about the underlying complexities.

What is llama.cpp?

llama.cpp is an inference engine for large language models, written in C++ and developed by Georgi Gerganov. It was designed to run Meta’s Llama series models without requiring high-performance GPUs.

Key Features of llama.cpp

The standout feature of llama.cpp is its lightweight, pure C/C++ implementation. It does not require a Python runtime or additional libraries.

By quantizing model weights (e.g., 4-bit or 5-bit, which sacrifices some precision to reduce data size), it significantly reduces memory usage.

For instance, a PC with 8GB of memory can handle a 7-billion-parameter model. The engine is optimized to run relatively fast even in CPU-only environments.

The Relationship Between Ollama and llama.cpp

To clarify their relationship, llama.cpp acts as a low-level inference engine, while Ollama is a user-friendly wrapper built on top of it. While llama.cpp can be used independently, Ollama simplifies tasks such as model management and API provisioning.

Advanced users may prefer to work directly with llama.cpp for finer parameter tuning or to specify custom build options. Less experienced users, however, can benefit greatly from Ollama’s streamlined setup and management.

Preparing the Development Environment

Here’s what you’ll need in terms of hardware and software to build a local AI agent.

Recommended Hardware

The minimum requirement is a PC with 8GB of memory. However, for smoother performance, the following setup is recommended:

Memory: At least 16GB is ideal. A 7-billion-parameter model requires about 4GB of memory, while a 13-billion-parameter model needs about 8GB.
GPU: If equipped with an NVIDIA graphics card, inference speeds will significantly improve. A VRAM of 8GB or more is sufficient for running 7-billion-parameter models comfortably on a GPU.
macOS: Machines with Apple Silicon, such as the M1 chip or later, are extremely efficient for running large language models due to their unified memory architecture.
Storage: Model files can range from several gigabytes to tens of gigabytes, so ensure your SSD has ample free space.

Installing Ollama

Installing Ollama is surprisingly straightforward:

macOS: Download the installer from the official website and run it. Alternatively, use Homebrew by entering brew install ollama in the terminal.
Windows: Download the Windows installer from the official site and follow the instructions. Once installed, you can use the ollama command from Command Prompt or PowerShell.
Linux: Run the one-liner command provided on the official site: curl -fsSL https://ollama.com/install.sh | sh.

Testing the Installation

After installation, type ollama --version in the terminal to verify the version number.

Next, enter ollama run llama3 to automatically download the Llama 3 model. Once the download is complete, you can start interacting with the AI. Try typing “Hello” and check if the response appears — if it does, your setup is successful.

Steps to Build a Local AI Agent

Here, we’ll walk through the steps to create your own AI agent using Ollama’s API.

Step 1: Selecting a Model

Ollama offers a variety of models suited for different purposes:

General conversation: Llama 3, developed by Meta, is ideal for natural language interactions.
Fast responses: Smaller models like Phi 3 Mini or Gemma 2.
Code generation: Models like CodeLlama or DeepSeek Coder.
Japanese-focused tasks: Swallow or ELYZA models are optimized for Japanese.

Choose a model based on task suitability, available hardware resources, and required response speed. Start with a smaller model and scale up if your hardware allows.

Step 2: Designing the System Prompt

The system prompt defines the AI agent’s behavior and “personality.”

For instance, to create an assistant that summarizes internal documents, you could set the following prompt:

“You are an assistant specializing in summarizing internal documents. Please concisely summarize the key points of the given document in three bullet points, including brief explanations for any technical terms.”

You can include this system prompt in Ollama’s Modelfile, specifying the base model with FROM llama3 and defining the system prompt after the SYSTEM keyword. Use ollama create my-agent -f Modelfile to create a custom agent.

**Step 3:

Integrating with Applications Using the API** Ollama provides a built-in REST API that allows interaction with the AI through HTTP requests. The API endpoint runs on http://localhost:11434.

To use the chat API, send a POST request to the /api/chat endpoint, passing the model name and message in JSON format.

Here’s an example of how to use Python’s requests library to send a request and retrieve a response. Specify the model name (e.g., “my-agent”) and your message, then extract the AI’s response from the message field in the JSON response.

**Step 4:

Enabling Tool Integration for Agent Functionality** To turn your chatbot into a functional “agent” capable of interacting with external tools, utilize the Function Calling feature.

For example, if you want the agent to perform calculations, you can include a “tools” field in the API request, informing the AI about available tools. When a question requiring calculation arises, the AI will generate a request to utilize the tool, which your application can then execute and return the result to the AI for further processing.

This enables the AI to perform real-world tasks like database searches, file operations, or even web searches, all while operating locally.

Practical Use Cases

Here are some practical scenarios where local AI agents can be highly valuable:

**Summarizing and Searching Corporate

Documents** Develop a system to summarize and search through large volumes of corporate documents using local AI. Since there’s no need to send sensitive documents externally, it ensures compliance and data security. You can implement Retrieval-Augmented Generation (RAG) locally by storing documents in a vector database and letting the AI summarize relevant documents based on user queries.

Programming Assistance

Receive code reviews and debugging help in a local environment. This eliminates the risk of sending proprietary corporate code to external services while benefiting from AI programming assistance.

Using coding-specific models like CodeLlama, the AI can explain functions, identify bugs, and even generate test code.

Boosting Personal Productivity

Local AI can assist with drafting daily reports, proofreading emails, brainstorming ideas, and more. Since it runs offline, you can freely consult the AI on private matters without worrying about data security or internet connectivity.

Translation and Proofreading

Local AI can also be utilized for translating and proofreading sensitive documents. This is particularly useful in industries like legal services or translation agencies, where data confidentiality is critical.

Advantages and Disadvantages

Advantages

Complete Privacy: Data is never sent externally, making it ideal for handling sensitive information.
Cost Efficiency: With no recurring costs beyond the initial hardware investment, it becomes more cost-effective with higher usage.
Network Independence: Usable even offline or during network issues, such as on airplanes.
Customizability: Models and prompts can be tailored to specific needs.

Disadvantages

Performance Gap: Local models still lag behind cutting-edge cloud-based models (e.g., GPT-4, Claude 3.5) in handling advanced reasoning or complex instructions.
Hardware Limitations: Running high-performance models requires significant memory and GPU resources. Smaller models may not meet certain performance requirements.
Technical Knowledge: Setting up and managing local AI requires some technical proficiency, which may be a barrier for complete beginners.

Tips for Optimizing Performance

Choosing the Right Quantization Level

The quantization level determines the trade-off between model quality and speed.

Q4_K_M (4-bit quantization, medium quality): A balanced option, suitable for most use cases.
Q5_K_M: Slightly higher quality.
Q8_0: High quality but requires significantly more memory.

Begin with Q4_K_M and increase the quality if necessary, as long as your hardware can handle it.

Adjusting Context Length

The context length determines how much text the AI can process at once. While longer context lengths can be useful, they also increase memory usage and reduce speed.

For shorter conversations, 2048 tokens are sufficient. For summarizing long documents, consider 4096 or 8192 tokens. Avoid unnecessarily long context lengths to prevent memory overuse and slow performance.

Using GPU Offloading

If your system has a GPU, be sure to enable GPU offloading. Ollama automatically detects and utilizes GPUs, but when using llama.cpp directly, use the -ngl parameter to specify the number of layers to offload to the GPU. If your GPU has sufficient VRAM, you can achieve maximum speed by offloading the entire model.

Security Considerations

While local AI offers excellent privacy, some security precautions are necessary:

Trusted Sources Only: Download model files only from reputable sources, such as Ollama’s official library or verified models on Hugging Face. Unverified sources may contain malicious code.
API Access Control: By default, Ollama’s API only accepts connections from localhost. Changing this to allow external access increases the risk of unauthorized use. Unless necessary, it’s best to keep the API restricted to the local machine.

Future Outlook

The field of local AI is advancing rapidly. With hardware improvements and more efficient models, it will become easier to run high-performance models in accessible environments.

The proliferation of Apple Silicon, next-generation NVIDIA GPUs, and NPUs (Neural Processing Units) will further accelerate on-device AI processing.

On the model front, smaller yet more intelligent models are continually being developed. Recent open-source models like Llama 3, Phi 3, and Qwen 2 outperform models from just a year ago while maintaining or even reducing size. This trend suggests that in the near future, even smartphones might be capable of running high-quality local AI agents.

There has never been a better time to start developing local AI agents. Use this guide to build your own private, secure AI agent and explore the possibilities.

Frequently Asked Questions

Should I use Ollama or llama.cpp?: If you’re a beginner or want an easy setup, we recommend starting with Ollama. It’s user-friendly and automates model management and API provisioning. Advanced users requiring fine-grained parameter tuning or custom build options might prefer using llama.cpp directly. Starting with Ollama and switching to llama.cpp as needed is an ideal learning pathway.
How good is the Japanese language capability of local AI?: The latest open-source models (e.g., Llama 3, Qwen 2) have significantly improved in understanding and generating Japanese. However, there are still scenarios where they may fall short compared to cutting-edge cloud-based models like GPT-4 or Claude 3.5. Combining Japanese-optimized models (e.g., Swallow, ELYZA) with Ollama can enhance the quality of Japanese interactions.
What kind of PC specs are needed?: A minimum of 8GB of memory is required, but 16GB or more is recommended for smoother performance. A GPU is not mandatory, but having an NVIDIA graphics card (VRAM 8GB or more) or an Apple Silicon machine will greatly improve response speed. Start with your current PC, and upgrade hardware if needed.
How do I choose the right model for my needs?: The choice of model depends on your use case and hardware specifications. For general conversations, Llama 3 (8B) is an excellent choice. For faster responses, try smaller models like Phi 3 Mini. For code generation, CodeLlama is recommended. Start with a smaller model to evaluate quality and speed, and scale up if your hardware can handle it.

Source: Singulism

SINGULISM Editorial Team — Reviewed & edited by the SINGULISM Editorial Team

If you find any factual errors or inaccuracies, we will promptly publish a correction. Please contact us via the contact form to request a correction.

Comments

← Back to Home