AI Inference Brings a New Opportunity to Chip Startups
As the use of AI shifts from training to inference, companies developing specialized chips beyond GPUs are finding new market opportunities. This article explores industry trends, including Nvidia's acquisition of Groq and the emergence of optical inference accelerators.
The Era of AI Inference Opens a New Horizon for Chip Startups
The application of AI technology is swiftly transitioning from the “training” phase to the “deployment” or “inference” phase. This shift holds the potential to disrupt Nvidia’s dominant position in the market.
According to a May 3, 2026 report by The Register, the diversification of inference processing is creating a “now-or-never” opportunity for startups in the AI chip sector.
Inference: More Diverse Workloads Than Training
Compared to training, inference spans far more diverse processing patterns. Large-scale batch inference demands a different mix of compute, memory, and bandwidth than interactive tasks like AI assistants or coding agents. That diversity is a tailwind for specialized hardware beyond general-purpose GPUs.
Inference can be broadly divided into two stages: “pre-fill” and “decode.” Pre-fill processes the entire prompt in parallel and is compute-bound, while decode generates one token at a time and is bound by memory bandwidth, because the full weight set must be re-read for every token. These differing characteristics have spurred the rise of “distributed architectures” that pair each stage with hardware suited to it, as the sketch below illustrates.
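To make the contrast concrete, here is a back-of-envelope arithmetic-intensity sketch in Python. The model size, weight precision, and prompt length are illustrative assumptions, not figures from any vendor.

```python
# Back-of-envelope: arithmetic intensity (FLOPs per byte of weights read)
# for each inference stage. All figures are illustrative assumptions.

PARAMS = 70e9          # assumed parameter count (e.g. a 70B model)
BYTES_PER_PARAM = 2    # fp16/bf16 weights
PROMPT_TOKENS = 4096   # assumed prompt length

# Pre-fill: every prompt token flows through the weights in one parallel
# pass, so each byte of weights read supports ~PROMPT_TOKENS of math.
prefill_flops = 2 * PARAMS * PROMPT_TOKENS
prefill_bytes = PARAMS * BYTES_PER_PARAM
print(f"pre-fill: {prefill_flops / prefill_bytes:.0f} FLOPs/byte")

# Decode: one token per step, yet the full weight set is re-read each
# step, so intensity collapses to ~1 FLOP/byte.
decode_flops = 2 * PARAMS
decode_bytes = PARAMS * BYTES_PER_PARAM
print(f"decode:   {decode_flops / decode_bytes:.0f} FLOPs/byte")
```

Pre-fill’s intensity grows with prompt length, so it saturates compute; decode’s stays near one FLOP per byte, so memory bandwidth sets the ceiling.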
Nvidia’s $20 Billion Acquisition Signals a Strategic Shift
A symbolic example of this trend is Nvidia’s acquisition of Groq. Completed in December 2025 for approximately $20 billion, the deal signals Nvidia’s serious push into the inference market.
Groq’s LPU (Language Processing Unit) is built around a distinctive architecture that leans heavily on high-speed but low-capacity SRAM. Scaled across enough chips, it can achieve token generation speeds surpassing GPUs. However, its limited compute capacity and reliance on an older process technology made efficient scaling a challenge.
Nvidia’s answer neatly sidesteps those issues: delegate compute-heavy pre-fill to its GPUs and bandwidth-hungry decode to Groq’s LPUs, combining the strengths of both kinds of hardware. A minimal sketch of that handoff follows.
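As a rough illustration of how such a split can be wired together, the sketch below routes pre-fill to one hardware pool and decode to another. The function names, the KVCache handoff, and the stand-in sampling step are hypothetical; they are not Nvidia’s or Groq’s actual interfaces.

```python
# Minimal sketch of a disaggregated serving loop: one pool handles
# compute-heavy pre-fill, another handles bandwidth-heavy decode.
# Names, the KVCache handoff, and the toy sampling are hypothetical.
from dataclasses import dataclass

@dataclass
class KVCache:
    """Stand-in for the attention state handed from pre-fill to decode."""
    tokens: list[int]

def prefill_on_gpu(prompt_tokens: list[int]) -> KVCache:
    # Stand-in for the compute-bound pass over the whole prompt:
    # here we just capture the prompt as the cache.
    return KVCache(tokens=list(prompt_tokens))

def decode_on_lpu(cache: KVCache, max_new_tokens: int) -> list[int]:
    # Stand-in for the bandwidth-bound loop that emits one token at a time.
    out: list[int] = []
    for step in range(max_new_tokens):
        next_token = (sum(cache.tokens) + step) % 50_000  # toy "sampling"
        cache.tokens.append(next_token)
        out.append(next_token)
    return out

cache = prefill_on_gpu([101, 2023, 2003, 102])  # hypothetical token IDs
print(decode_on_lpu(cache, max_new_tokens=5))
```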
AWS and Intel Join the Distributed Inference Game
Nvidia isn’t the only player in this arena. Cloud computing giant AWS has announced a similar distributed computing platform, pairing its in-house Trainium accelerators for pre-fill with Cerebras Systems’ massive wafer-scale accelerators for decode.
Intel has also joined this trend, unveiling a reference design that utilizes GPUs for pre-fill and RDU (Reconfigurable Dataflow Unit) chips, developed by AI chip startup SambaNova, for decode tasks.
Startups Shine in the Decode Market
Thus far, AI chip startup success stories have been concentrated on the decode side. SRAM offers limited capacity but exceptional speed; spread across enough chips, or packed into a single wafer-scale part like Cerebras’s, its aggregate bandwidth is well suited to accelerating decode. The rough math is sketched below.
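A crude way to see why aggregate bandwidth matters: if every generated token re-reads the full weight set, decode throughput is roughly bandwidth divided by model size in bytes. All figures in this sketch are illustrative assumptions, not measured numbers for any product.

```python
# If each generated token re-reads all weights, decode speed is roughly
# bandwidth / model bytes. Every number here is an illustrative assumption.

MODEL_BYTES = 70e9 * 2   # assumed 70B parameters at 2 bytes each

def decode_tokens_per_sec(aggregate_bandwidth: float) -> float:
    """Bandwidth-bound ceiling on decode throughput, in tokens/sec."""
    return aggregate_bandwidth / MODEL_BYTES

hbm_gpu = 3e12       # assumed ~3 TB/s of HBM on one GPU
sram_fleet = 80e12   # assumed aggregate SRAM bandwidth across many chips

print(f"single GPU: ~{decode_tokens_per_sec(hbm_gpu):.0f} tok/s")
print(f"SRAM fleet: ~{decode_tokens_per_sec(sram_fleet):.0f} tok/s")
```

On this simple model, raising aggregate bandwidth, whether via many SRAM chips or one wafer-scale part, translates directly into tokens per second.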
However, the potential for startups is not limited to decode alone.
Enter the Next-Generation Optical Accelerators
This week, Lumai revealed details about its optical inference accelerators. This groundbreaking technology uses light, rather than electricity, to perform the matrix calculations central to machine learning workloads. Compared to conventional digital architectures, this approach can significantly reduce power consumption.
Lumai projects that its next-generation “Iris Tetra” system will achieve exaOPS-class AI performance by 2029, operating within a 10kW power budget. While its technology adopts a hybrid electro-optical architecture, the majority of inference calculations are handled by optical tensor cores within the chip.
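Unpacking those headline targets with simple arithmetic shows the efficiency they would imply; this only restates Lumai’s stated goals and is not an independent measurement.

```python
# Unpacking Lumai's stated targets: exaOPS-class performance in 10 kW.
# This only restates the headline figures; it is not a benchmark.
target_ops = 1e18      # 1 exaOPS = 1e18 operations per second
power_watts = 10_000   # 10 kW budget

efficiency = target_ops / power_watts  # operations per second per watt
print(f"implied efficiency: {efficiency:.0e} OPS/W "
      f"(= {efficiency / 1e12:.0f} TOPS per watt)")
```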
Initially, the company plans to position these chips as alternatives to GPUs for compute-heavy batch inference workloads; in the long term, it aims to use optical accelerators for pre-fill as well. For now, the architecture remains at an early stage, running Llama 3.1-class models at the 8B and 70B parameter scales.
A New Era of Distributed AI
With Nvidia aggressively targeting the inference market, cloud providers introducing proprietary chips, and emerging players leveraging technologies like optics and SRAM, the AI chip market is shifting from the era of general-purpose GPUs to a distributed model that combines hardware optimized for specific workloads.
The shift from training to inference could pose a threat even to Nvidia. The diversity of inference workloads suggests that specialized chips may outperform general-purpose GPUs for certain tasks. For AI chip startups, this could be a once-in-a-lifetime opportunity.
Q: What is the difference between AI inference and training?
A: Training is the process of adjusting an AI model’s parameters using large datasets, requiring enormous computational resources. In contrast, inference involves using a trained model to perform tasks in real-world applications. Compared to training, inference features more diverse workloads, ranging from batch processing to real-time interactions.
Q: Why are non-GPU chips well-suited for inference tasks?
A: Inference consists of distinct stages: pre-fill, which is computation-intensive, and decode, which is bandwidth-intensive. Chips with unique characteristics, such as high-speed SRAM or optical accelerators, may be better suited for specific tasks than general-purpose GPUs.
Q: What are the features of Lumai’s optical inference accelerator?
A: By using light for matrix calculations instead of electricity, this technology can significantly reduce power consumption compared to traditional digital architectures. Lumai aims to achieve exaOPS-class performance by 2029 with just a 10kW power budget and plans to expand its applications to pre-fill processing in the future.