If you were looking for the ideal time to IPO as a chip company, May 2026 is hard to beat. Reuters reported over the weekend:
Cerebras Systems is set to raise the size and price of its initial public offering as soon as Monday, as demand for the artificial intelligence chipmaker’s shares continues to climb, two people familiar with the matter told Reuters on Sunday. The company is considering a new IPO price range of $150-$160 a share, up from $115-$125 a share, and raising the number of shares marketed to 30 million from 28 million, said the sources, who asked not to be identified because the information isn’t public yet.
The fundamental driver of the ongoing surge in semiconductor stocks is, of course, AI, particularly the realization that agents are going to need a lot of compute. What Cerebras represents, however, is something broader: while the compute story for AI has been largely about GPUs, particularly from Nvidia, the future is going to look increasingly heterogeneous.
The GPU Era
The story of how Graphics Processing Units became the center of AI is a well-trodden one, but in brief:
- Just as drawing pixels on a computer screen is a parallel process, where more processing units mean faster graphics, making AI-related calculations is a parallel process, where more processing units mean faster calculations.
- Nvidia enabled this dual-usage by making its graphics processors programmable, and created an entire software ecosystem called CUDA to make this programming accessible.
- The big difference between graphics and AI has been the size of the problem being solved — models are a lot bigger than video game textures — which has led to a dramatic expansion in high-bandwidth memory (HBM) per GPU, and dramatic innovations in terms of chip-to-chip networking to allow multiple chips to work together as one addressable system. Nvidia has been the leader in both.
The number one use case for GPUs has been training, which stresses the third point in particular. While the calculations within each training step are massively parallel, the steps themselves are serial: every GPU has to share its results with every other GPU before the next step can begin. This is why training a trillion-parameter model requires tens of thousands of GPUs that can communicate as one system, with the model spread across their aggregate memory. Nvidia dominates both problem spaces, first by securing HBM supply ahead of the rest of the industry, and second through its investments in networking.
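To make that serial synchronization point concrete, here is a minimal sketch of data-parallel training in pure Python; the model, data, and simulated "devices" are toy values for illustration, not how any actual training framework implements this:

```python
# Minimal sketch of data-parallel training: the per-device gradient math is
# embarrassingly parallel, but every step ends with an all-reduce that forces
# all devices to synchronize before the next step can begin.
import random

NUM_DEVICES = 4
weights = [0.0, 0.0]  # one shared model, replicated on every "device"
data_shards = [[(random.random(), random.random()) for _ in range(8)]
               for _ in range(NUM_DEVICES)]

def local_gradient(shard, w):
    """Parallel part: each device computes a gradient on its own data shard."""
    g = [0.0, 0.0]
    for x, y in shard:
        err = (w[0] * x + w[1]) - y
        g[0] += err * x
        g[1] += err
    return [v / len(shard) for v in g]

for step in range(3):
    # Parallel: in a real cluster these run simultaneously on separate GPUs.
    grads = [local_gradient(shard, weights) for shard in data_shards]
    # Serial: the all-reduce; every device must contribute before any can proceed.
    avg = [sum(g[i] for g in grads) / NUM_DEVICES for i in range(2)]
    weights = [w - 0.1 * g for w, g in zip(weights, avg)]
    print(f"step {step}: weights = {weights}")
```

The larger the cluster, the more that synchronization step depends on fast chip-to-chip networking, which is exactly where Nvidia has invested.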
Of course training isn’t the only AI workload: the other is inference. Inference has three main parts:
- Prefill encodes everything the LLM needs to know (the prompt and any other context) into the model’s working state, the KV cache; this is highly parallelizable, so compute is what matters.
- The first part of decode entails reading the KV cache — which stores context, including the output of the prefill step — to make the attention calculation. This is a serial step where bandwidth matters, and the memory requirements are variable and grow with the length of the context.
- The second part of decode is the feed-forward computation over the model weights; this is also a serial step where bandwidth matters, and the memory requirements are defined by the size of the model.
The two decode steps alternate for every layer of the model (attention and feed-forward are interleaved layer by layer, rather than running as two separate passes), which is to say that decode is serial and memory-bandwidth bound. For every token generated, two distinct memory pools must be read: the KV cache, which stores context and grows with each token, and the model weights themselves. Both must be read in full to produce a single output token.
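A back-of-envelope calculation makes the bandwidth bottleneck concrete. The numbers below are illustrative assumptions, a hypothetical 70-billion-parameter model at 8 bits per weight with a 10 GB KV cache served as a single stream with no batching, not measurements of any real system:

```python
# Roofline-style ceiling on decode speed: every output token requires streaming
# the model weights plus the KV cache through the chip, so memory bandwidth
# sets an upper bound no matter how much compute is available.
GB = 1e9
TB = 1e12

def max_tokens_per_sec(weight_bytes, kv_cache_bytes, bandwidth_bytes_per_sec):
    """Upper bound on single-stream decode speed if bandwidth were the only limit."""
    return bandwidth_bytes_per_sec / (weight_bytes + kv_cache_bytes)

weights = 70 * GB          # 70B parameters at 8 bits each (assumed)
kv_cache = 10 * GB         # long-context session (assumed)
hbm_bandwidth = 3.35 * TB  # roughly H100-class HBM bandwidth

print(f"~{max_tokens_per_sec(weights, kv_cache, hbm_bandwidth):.0f} tokens/sec ceiling")
```

Under those assumptions the ceiling is on the order of forty tokens per second per stream; real deployments batch many requests together to amortize the weight reads, but the basic constraint is the same.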
GPUs handle all three needs: high compute for prefill, abundant HBM for KV cache and model weights, and chip-to-chip networking to pool memory across multiple chips when a single GPU isn’t enough. In other words, what works for training works for inference — look no further than the deal SpaceX made with Anthropic. From Anthropic’s blog:
We’ve signed an agreement with SpaceX to use all of the compute capacity at their Colossus 1 data center. This gives us access to more than 300 megawatts of new capacity (over 220,000 NVIDIA GPUs) within the month. This additional capacity will directly improve capacity for Claude Pro and Claude Max subscribers.
SpaceX retains Colossus 2 — presumably for both training of future models and inference of existing ones — and can afford to do both in the same data center precisely because xAI’s models aren’t getting much usage; more pertinently to this piece, they can do both in the same data center because both training and inference can be done on GPUs. Indeed, the GPUs Anthropic is contracting for at Colossus 1 were originally used for training as well; the fact that GPUs are so flexible is a big advantage.
Understanding Cerebras
Cerebras makes something completely different. While a silicon wafer has a diameter of 300mm, the “reticle limit” — the maximum area that a lithography tool can expose on that wafer — is around 26mm x 33mm. This is the effective size limit for chips; going beyond that entails linking two separate chips together over a chip-to-chip interposer, which is exactly what Nvidia has done with the B200. Cerebras, on the other hand, has invented a way to lay down wiring across the so-called “scribe lines” that are the boundary between reticle exposures, making the entire wafer into a single chip with no need for relatively slow chip-to-chip linkages.
The net result is a chip with a lot of compute and a lot of SRAM that is blisteringly fast to access. To put it in numbers, the WSE-3 (Cerebras’ latest chip) has 44GB of on-chip SRAM at 21 PB/s of bandwidth; an H100 has 80GB of HBM at 3.35 TB/s. In other words, the WSE-3 has just over half the memory of an H100, but 6,000 times the memory bandwidth.
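For reference, the arithmetic behind that comparison, using the spec-sheet figures quoted above (treat the ratios as approximate):

```python
# Spec-sheet ratios for the WSE-3 vs. H100 comparison in the text.
wse3_sram_gb, wse3_bw_tb_per_s = 44, 21_000   # 21 PB/s = 21,000 TB/s
h100_hbm_gb, h100_bw_tb_per_s = 80, 3.35

print(f"capacity:  {wse3_sram_gb / h100_hbm_gb:.0%} of an H100's memory")       # ~55%
print(f"bandwidth: {wse3_bw_tb_per_s / h100_bw_tb_per_s:,.0f}x an H100's HBM")  # ~6,270x
```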
The reason to compare the WSE-3 to an H100 is that the H100 is the chip most used for inference, and inference is clearly what Cerebras is best suited for. You can use Cerebras chips for training, but the chip-to-chip networking story isn’t very compelling, which is to say that all of that compute and on-chip memory spends much of its time waiting on communication; what is much more interesting is the idea of getting a stream of tokens at dramatically faster speeds than you can from a GPU.
Note, however, that the limitation that applies to training also potentially applies to inference: as long as everything fits in on-chip memory, Cerebras’ speed is an incredible experience; the moment you need more memory, whether for a larger model or, more likely, a larger KV cache, Cerebras doesn’t make much sense, particularly given the price. The whole-wafer-as-chip technique makes yield a massive challenge, which hugely drives up costs.
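To put rough numbers on where that wall sits, here is an illustrative capacity check; the parameter counts, precision, and context length are assumptions for the sketch, not Cerebras benchmarks:

```python
# Illustrative capacity math: when does a workload outgrow 44 GB of on-chip SRAM?
GB = 1e9
SRAM_CAPACITY = 44 * GB

def weight_bytes(params_billion, bytes_per_param):
    return params_billion * 1e9 * bytes_per_param

def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bytes_per_value):
    # Keys and values, for every layer, for every token in the context.
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value

workloads = {
    "8B-parameter weights (FP16)":  weight_bytes(8, 2),                            # ~16 GB
    "70B-parameter weights (FP16)": weight_bytes(70, 2),                           # ~140 GB
    "200k-token KV cache (assumed shape)": kv_cache_bytes(80, 8, 128, 200_000, 2), # ~66 GB
}

for name, size in workloads.items():
    verdict = "fits in" if size <= SRAM_CAPACITY else "exceeds"
    print(f"{name}: {size / GB:.0f} GB ({verdict} 44 GB of SRAM)")
```

A small model with a short context fits comfortably; a frontier-scale model, or even just a very long session’s KV cache, does not, at which point the workload has to spill across multiple wafers and the economics get much harder.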
At the same time, I do think there will be a market for Cerebras-style chips: right now the company is highlighting the usefulness of speed for coding — reasoning means a lot of tokens, which means that dramatically scaling up tokens-per-second equals faster thinking — but I think this is a temporary use case, for reasons I’ll explain in a bit. What will continue to matter is how long humans are waiting for an answer, and as products like AI wearables become more of a thing, the speed of interaction, particularly for voice — which will be a function of token generation speed — will have a tangible effect on the user experience.
Agentic Inference
I have previously made the case, including in Agents Over Bubbles, that we have gone through three inflection points in the LLM era:
- ChatGPT demonstrated the utility of token prediction.
- o1 introduced the idea of reasoning, where more tokens meant better answers.
- Opus 4.5 and Claude Code introduced the first usable agents, which could actually accomplish tasks, using a combination of reasoning models and a harness that utilized tools, verified work, etc.
All of this falls under the banner of “inference”, but I think it will be increasingly clear that there is a difference between providing an answer — what I will call “answer inference” — and doing a task — what I will call “agentic inference.” Cerebras’ target market is “answer inference”; in the long run, I think the architecture for “agentic inference” will look a lot different, not just from Cerebras’ approach, but from the GPU approach as well.
I mentioned above that fast inference for coding is a temporary use case. Specifically, coding with LLMs today requires a human in the loop. It’s the human that defines what is to be coded, checks the work, merges the pull request, etc.; it’s not hard to envision a future, however, where all of this is completely handled by machines. This will apply to agentic work broadly: the true power of agents will not be that they do work for humans, but rather that they do work without human involvement at all.
This, by extension, means that the best approach to agentic inference will likely look a lot different than the best approach to answer inference. The most important aspect of answer inference is token speed; the most important aspect of agentic inference, however, is memory. Agents need context, state, and history. Some of that will live as active KV cache; some will live in host memory or SSDs; much of it will live in databases, logs, embeddings, and object stores. The important point is that agentic inference will be less about GPUs answering a question and more about the memory hierarchy wrapped around a model.
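To make that hierarchy concrete, here is a purely conceptual sketch; the tier names, capacities, and latencies are assumptions for illustration, not any vendor’s actual stack:

```python
# Conceptual sketch of tiered agent memory: hot context in accelerator memory,
# warm state in host DRAM or SSD, long-lived history in a database or object
# store. An agent tolerates a slow hit because no human is waiting on it.
from dataclasses import dataclass, field

@dataclass
class Tier:
    name: str
    capacity_gb: float
    approx_latency_us: float
    data: dict = field(default_factory=dict)

class AgentMemory:
    def __init__(self):
        # Ordered fastest/smallest to slowest/largest (all figures assumed).
        self.tiers = [
            Tier("accelerator KV cache", 80, 1),
            Tier("host DRAM", 2_000, 10),
            Tier("local SSD", 60_000, 100),
            Tier("object store / database", 1_000_000, 10_000),
        ]

    def put(self, key, value, tier_name):
        next(t for t in self.tiers if t.name == tier_name).data[key] = value

    def get(self, key):
        # Walk the hierarchy from fastest to slowest.
        for tier in self.tiers:
            if key in tier.data:
                return tier.data[key], tier.name
        return None, None

mem = AgentMemory()
mem.put("current task plan", "refactor billing module", "host DRAM")
mem.put("last quarter's logs", "...", "object store / database")
print(mem.get("current task plan"))
print(mem.get("last quarter's logs"))
```

The specifics will surely differ, but the shape is the point: most of what an agent "knows" lives in tiers that are orders of magnitude slower and cheaper than HBM.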
Critically, this articulation of an agentic-specific memory hierarchy implies a necessary trade-off of speed for capacity. Here’s the thing, though: lower speed isn’t nearly as important a consideration if there isn’t a human in the loop. If an agent is waiting around for a job that is being run overnight, the agent doesn’t know or care about the user experience impact; what is most important is being able to accomplish a task, and if entirely new approaches to memory make that possible, then delays are fine.
Meanwhile, if delays are fine, then all of the focus on pure compute power and high-bandwidth memory seems out of place: if latency isn’t the top priority, then slower and cheaper memory — like traditional DRAM, for example — makes a lot more sense. And if the entire system is mostly waiting on memory, then chips don’t need to be as fast as the cutting edge either. This represents a profound shift in future architectures, but it also doesn’t mean that current architectures are going away:
- Training will continue to matter, and Nvidia’s current architecture, including high-speed compute, large amounts of high-bandwidth memory, and high-speed networking, will likely continue to dominate.
- Answer inference will be a meaningful market, albeit a relatively small one, and speed from chips like Cerebras or Groq (I explained how Nvidia is deploying Groq’s LPUs here) will be very useful.
- Agentic inference will gradually unbundle the GPU, which alternates between stranding high-bandwidth memory (during the prefill process) and stranding compute (during the decode process), in favor of increasingly sophisticated memory hierarchies dominated by high capacity and relatively lower cost memory types, with “good enough” compute; indeed, if anything it will be the speed of CPUs for things like tool use that will matter more than the speed of GPUs. (The sketch below puts rough numbers on the stranding dynamic.)
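That stranding claim follows from arithmetic intensity: prefill reuses each weight across every prompt token, while decode reads every weight for a single token. The chip figures below are approximate H100-class numbers and the model figures are assumptions, so treat this as a sketch of the dynamic rather than a benchmark:

```python
# Arithmetic intensity (FLOPs per byte of weights read) vs. the chip's balance
# point: above it, memory bandwidth sits idle; below it, compute sits idle.
flops_available = 1.0e15    # ~1 PFLOP/s of dense low-precision compute (approx.)
bandwidth = 3.35e12         # ~3.35 TB/s of HBM bandwidth (approx.)
balance_point = flops_available / bandwidth   # ~300 FLOPs per byte

weight_bytes = 70e9         # hypothetical 70B-parameter model at 8 bits per weight
flops_per_token = 2 * 70e9  # ~2 FLOPs per parameter per token

def arithmetic_intensity(tokens_processed_together):
    return (flops_per_token * tokens_processed_together) / weight_bytes

print(f"balance point:                ~{balance_point:.0f} FLOPs/byte")
print(f"prefill (2,048-token prompt): {arithmetic_intensity(2048):.0f} FLOPs/byte -> compute-bound, HBM underused")
print(f"decode (one token at a time): {arithmetic_intensity(1):.0f} FLOPs/byte -> bandwidth-bound, compute underused")
```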
At the same time, these categories won’t be equal in size or importance. Specifically, agentic inference will be the largest market by far, because that is the market that won’t be limited by humans or time. Today’s agents are fancy answer inference; in the future true agentic inference will be work done by computers according to dictates given by other computers, and the market size scales not with humans but with compute.
The Implications of Agentic Inference on Compute
To date the invocation of “scaling with compute” has implicitly meant Nvidia bullishness. However, much of Nvidia’s relative advantage has been a function of latency: Nvidia chips have fast compute, but keeping that compute busy has required big investments in ever-expanding HBM and networking. If latency isn’t the key constraint, however, then Nvidia’s approach seems less worth paying a premium for.
Nvidia does recognize this shift: the company launched an inference framework called Dynamo that helps disaggregate different parts of inference, and is shipping products like standalone memory and CPU racks to enable increasingly large KV caches and faster tool use, the better to keep their expensive GPUs busy. Ultimately, however, it’s easy to see cost and simplicity being increasingly attractive to hyperscalers for agentic inference that isn’t remotely GPU-bound.
China, meanwhile, for all of its lack of leading edge compute, has everything it needs for agentic inference: fast-enough (but not leading-edge) GPUs, fast-enough (but not leading-edge) CPUs, DRAM, hard drives, etc. The challenge, of course, is compute for training; it’s also possible that answer inference is more important for national security, at least when it comes to military applications.
The other interesting angle is space: slower chips actually make space data centers more viable for a number of reasons. First, if memory can be offloaded, chips can be made much simpler and run much cooler. Second, older nodes, by virtue of being physically larger, will better withstand space radiation. Third, slower chips draw less power, which means there will be less heat to dissipate via radiation. Fourth, not being on the bleeding edge will mean higher reliability, an important consideration given that satellites won’t be repairable.
Nvidia CEO Jensen Huang regularly says that “Moore’s Law is Dead”; what he means is that the future of computing speed-ups will be a function of systems innovation, which is exactly what Nvidia has done. Maybe the most profound implication of agents that act without humans in the loop, however, will be that Moore’s Law doesn’t matter, and that the way we get more compute is by realizing that the compute we have is already good enough.