
Running DeepSeek V4 Flash Locally on MacBook Pro: Achieving Zero Token Costs

Italian developer antirez has open-sourced 'ds4', an inference engine for DeepSeek V4 Flash that enables local execution on an Apple MacBook Pro at zero token cost.



Token Cost Challenges in the Era of AI Agents

What is the most expensive part of the AI-agent era? Tokens. Heavy users often burn through billions of tokens every month, running up bills in the tens of thousands of yuan. Now, however, a developer has open-sourced a way to run the model locally, deployable on a single Apple laptop. The result is “token freedom”: users can run as many tasks as they want without spending a single yuan on tokens.

The Emergence of the ds4 Project

A few days ago, antirez released a project called “ds4” on GitHub. Built specifically for DeepSeek V4 Flash, this inference engine runs on Apple computers equipped with 128GB of memory, in just a few thousand lines of C code.

Antirez, whose real name is Salvatore Sanfilippo, is an Italian programmer and the creator of the open-source database Redis. Redis later became one of the most widely used in-memory databases in the global internet infrastructure.

Features of DeepSeek V4 Flash and Why It Was Chosen

Why did antirez choose DeepSeek V4 Flash? Simply put, DeepSeek is the best fit for running on a local computer. While it has 284 billion total parameters, only 13 billion are activated per inference, making it far less resource-hungry than conventional dense models. It supports a million-token context, which suits long-horizon tasks such as programming assistance. At the same time, its KV cache is compressed enough to leave room for operation within local memory and the SSD. DeepSeek V4 Flash strikes a rare balance: large enough to be worth building a dedicated engine for, yet compact enough to fit on an Apple laptop.
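To see why the low activation count matters, here is a back-of-envelope sketch in C. It is not taken from ds4; treating per-token cost as weight bytes read per token, and applying the 2-bit figure uniformly, are simplifying assumptions:

```c
/* Back-of-envelope: per-token weight traffic, dense vs. MoE.
 * Illustrative assumptions only, not ds4 code or measurements. */
#include <stdio.h>

int main(void) {
    const double total_params  = 284e9; /* total parameters (from the article) */
    const double active_params = 13e9;  /* activated per inference (from the article) */
    const double bits_per_w    = 2.0;   /* assume ~2-bit weights throughout */

    /* Bytes of weights that must be read to generate one token. */
    double dense_gb = total_params  * bits_per_w / 8.0 / 1e9;
    double moe_gb   = active_params * bits_per_w / 8.0 / 1e9;

    printf("dense: ~%.0f GB of weights read per token\n", dense_gb);
    printf("MoE:   ~%.2f GB of weights read per token\n", moe_gb);
    return 0;
}
```

At roughly 3GB of weight reads per token, the unified-memory bandwidth of Apple silicon is in the right range for interactive speeds; reading all 284 billion parameters for every token would not be.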

The Core of ds4: Optimizing a Dedicated Engine

What exactly is ds4? In short, ds4 is not a model but a “dedicated engine.” Until now, people have used tools like llama.cpp to run large models on their own computers. The advantage of llama.cpp is its support for many models, such as Llama, Qwen, and DeepSeek. The downside is that supporting every model means it cannot execute any specific model at maximum speed.

Antirez took a completely different approach. Instead of focusing on other models, he dedicated himself solely to optimizing DeepSeek V4 Flash to the extreme. He focused on three major improvements:

Asymmetric 2-Bit Quantization

The architecture of DeepSeek V4 Flash uses MoE (Mixture of Experts), with only 13 billion out of a total 284 billion parameters activated per inference. These 13 billion parameters are expert subnetworks selected during routing. It’s akin to having a toolbox with 284 tools and only using 13 for a specific task.
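As a sketch of how such routing works, a simple top-k selector in C might look like the following. This is purely illustrative; the expert count and k are hypothetical, and ds4's actual routing code will differ:

```c
/* Illustrative top-k MoE routing (hypothetical sizes, not ds4 code).
 * A router scores every expert for the current token; only the k
 * best-scoring experts run, so most weights are never touched. */
#define N_EXPERTS 256 /* hypothetical routed-expert count */
#define TOP_K     8   /* hypothetical experts activated per token */

/* Select the indices of the TOP_K highest router scores. */
static void route_top_k(const float scores[N_EXPERTS], int out[TOP_K]) {
    for (int k = 0; k < TOP_K; k++) {
        int best = -1;
        for (int e = 0; e < N_EXPERTS; e++) {
            int taken = 0;
            for (int j = 0; j < k; j++)
                if (out[j] == e) { taken = 1; break; }
            if (!taken && (best < 0 || scores[e] > scores[best]))
                best = e;
        }
        out[k] = best; /* only these experts' weights get loaded and run */
    }
}
```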

Antirez’s approach involved aggressive 2-bit quantization of these routed experts, using IQ2_XXS for the up and gate matrices and Q2_K for the down matrices. Meanwhile, every critical component on the model’s key pathways (shared experts, projections, and the routing network) was preserved at its original precision. In essence, antirez compressed these “idle experts” down to one-fourth of their original size while maintaining the integrity of the components that run in every inference. This is an asymmetric compression scheme: size is traded away only where quality matters least.
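The policy fits in a few lines of C. The sketch below restates the rule described above; the type and field names are invented for illustration and do not come from the ds4 source:

```c
/* Asymmetric quantization policy as described in the article.
 * Names are illustrative; ds4's real code is organized differently. */
typedef enum { QUANT_IQ2_XXS, QUANT_Q2_K, QUANT_FULL } quant_t;

typedef struct {
    int is_routed_expert; /* expert selected on demand by the router */
    int is_down_proj;     /* down matrix, as opposed to up/gate */
} tensor_info_t;

static quant_t pick_quant(const tensor_info_t *t) {
    if (!t->is_routed_expert)
        return QUANT_FULL;   /* shared experts, projections, router */
    return t->is_down_proj
        ? QUANT_Q2_K         /* down matrices */
        : QUANT_IQ2_XXS;     /* up and gate matrices */
}
```

The intuition behind the asymmetry: the hot path that every token passes through cannot afford quantization error, while any individual routed expert is only used occasionally.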

Moving KV Cache to SSD

DeepSeek V4 Flash supports a million-token context, meaning it can remember the entirety of a novel provided to it. However, such a long context means the model must constantly “look back” at everything it has already processed. To keep these look-backs from becoming so slow that the system grinds to a halt, the model stores the intermediate attention data in a cache (the KV cache) for quick retrieval instead of recomputing it.

Previously, this cache was typically kept in memory. But when running DeepSeek V4 Flash on a 128GB MacBook Pro, a long-context cache alone could swallow the available memory, leaving no room for the model itself.

Antirez tackled this by moving the cache onto the SSD. The ds4 engine can write parts of the KV state back to disk, so the model doesn’t have to reprocess everything from scratch during extended prompts or agent tasks. Modern Mac SSDs are fast enough to persist and retrieve the KV cache, and because DeepSeek V4 Flash compresses the cache itself, the read and write volume stays small and disk operations remain efficient.
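A minimal sketch of the idea, assuming a block-structured cache file; the block size, per-token byte count, and function names are all assumptions, since ds4's actual on-disk layout isn't documented here:

```c
/* Sketch of spilling KV-cache blocks to disk (assumed design, not
 * ds4's actual layout). Cold blocks are written out with pwrite()
 * and read back on demand with pread(). */
#include <stdint.h>
#include <unistd.h>

#define BLOCK_TOKENS 256         /* assumed tokens per block */
#define TOKEN_BYTES  (70 * 1024) /* assumed compressed KV bytes per token */
#define BLOCK_BYTES  (BLOCK_TOKENS * TOKEN_BYTES)

/* Write one KV block at its fixed slot in the cache file. */
static int kv_spill(int fd, uint64_t block, const void *kv) {
    off_t at = (off_t)block * BLOCK_BYTES;
    return pwrite(fd, kv, BLOCK_BYTES, at) == BLOCK_BYTES ? 0 : -1;
}

/* Read a previously spilled block back into memory. */
static int kv_fetch(int fd, uint64_t block, void *kv) {
    off_t at = (off_t)block * BLOCK_BYTES;
    return pread(fd, kv, BLOCK_BYTES, at) == BLOCK_BYTES ? 0 : -1;
}
```

A real engine would also track which blocks are resident in RAM and evict the coldest first, but the disk round trip itself is no more complicated than this.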

As a result, memory usage drops substantially, making ultra-long conversations of up to a million tokens possible on a single MacBook. That said, according to the ds4 documentation, the 2-bit model itself occupies approximately 80GB of memory, and a more realistic daily-use context ranges from 100k to 300k tokens.
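A rough budget makes the constraint concrete. In the sketch below, the 128GB and 80GB figures come from the article; the OS headroom and the compressed per-token KV size are assumptions:

```c
/* Back-of-envelope memory budget (assumed figures marked below). */
#include <stdio.h>

int main(void) {
    const double ram_gb     = 128.0; /* MacBook Pro memory (from the article) */
    const double weights_gb = 80.0;  /* 2-bit model size (from the article) */
    const double reserve_gb = 16.0;  /* assumed headroom for OS and buffers */
    const double kv_kb_tok  = 70.0;  /* assumed compressed KV per token, in KB */

    double free_gb = ram_gb - weights_gb - reserve_gb;
    double tokens  = free_gb * 1e6 / kv_kb_tok; /* GB -> KB, then KB per token */
    printf("~%.0f GB left for KV cache -> roughly %.0fk tokens in RAM\n",
           free_gb, tokens / 1e3);
    return 0;
}
```

Under these assumptions, a few hundred thousand tokens fit in RAM; approaching the million-token ceiling requires spilling to the SSD.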

Implementation of Pure Metal Native Paths

Antirez concentrated all of his optimization work on the GPU of Apple computers, writing code paths specifically for Apple silicon so that DeepSeek V4 Flash runs efficiently through Metal. The CPU, by contrast, was not a focus of this project: the README explicitly states that CPU mode is still unstable and may even cause system crashes.

Measured Speed and Practicality

Tests conducted on a MacBook Pro with an M3 Max chip and 128GB of memory revealed that the system could generate approximately 26 characters per second. On a Mac Studio with an M3 Ultra chip and 512GB of memory, this increased to 36 characters per second. While this is not extraordinarily fast, it is sufficiently practical for daily tasks such as coding and debugging.

Another intriguing aspect is that antirez completed the entire project single-handedly by leveraging GPT-5.5.

Ecosystem Significance for DeepSeek

According to reports from international media, DeepSeek is currently seeking funding of up to $7.35 billion. Liang Wenfeng, the founder, is at a critical juncture of shifting the narrative from technical achievements to commercial applications.

What do investors look for? It’s not just about model benchmark scores or API call volumes—it’s about the model’s ecosystem position and irreplaceability. The fact that a renowned developer overseas is willing to write a dedicated engine for the model indicates that DeepSeek has established a significant ecosystem presence abroad.

Over the past year, the primary metric for evaluating Chinese open-source models in global markets has been benchmarks. However, when someone is willing to develop secondary applications around your model, it signifies recognition of its value—a type of validation that arguably surpasses numerical scores.

Frequently Asked Questions

Can ds4 truly eliminate token costs?
Yes, ds4 is a dedicated inference engine for running DeepSeek V4 Flash in local environments. By not relying on cloud APIs, users avoid token-based charges. However, high-performance hardware, such as a MacBook Pro with 128GB of memory, is required.
How does ds4 compare to llama.cpp?
While llama.cpp is a general-purpose tool that supports various models, ds4 is exclusively optimized for DeepSeek V4 Flash. Its advantages include specialized techniques like asymmetric quantization and moving KV cache to SSD, enabling efficient processing of long contexts with limited resources.
Could this technology be applied to other models in the future?
Currently, ds4 is tailored for DeepSeek V4 Flash. However, antirez's approach could inspire other developers to create similarly optimized engines for specific models, depending on community interest and development.
Source: TMTPost (钛媒体)
