Groq Costs, Gemini Pro 1.5, Google’s Timidity

Good morning,

There will be a new Sharp Tech in a couple of hours where I give my final Xbox verdict after the previous two Updates. In short, I think my initial Xbox call was right: Microsoft should have gotten out years ago, and they are pretty stuck because they didn’t. I think this was a good one, so add the podcast to your podcast player using the link at the bottom of this email.

On to the update:

Groq Costs

I noted in the conclusion to yesterday’s Article that an architecture like Groq’s may not end up being cheap enough for widespread inferencing; time will tell, but I brought up the chip startup because, for me, the demo captured something essential about AI specifically and tech broadly: speed matters, and it’s exciting to see what might be possible, even if it’s not yet at scale.

That noted, Dylan Patel and Daniel Nishball did a deep dive into Groq over at SemiAnalysis:

Groq has a genuinely amazing performance advantage for an individual sequence. This could enable techniques such as chain of thought to be far more usable in the real world. Furthermore, as AI systems become autonomous, output speeds of LLMs need to be higher for applications such as agents. Likewise, codegen also needs token output latency to be significantly lower as well. Real time Sora style models could be an incredible avenue for entertainment. These services may not even be viable or usable for end market customers if the latency is too high.

That last sentence right there is exactly why I wanted to discuss Groq in the first place. Groq’s memory approach in particular, though, has trade-offs:

Groq’s chip has a fully deterministic VLIW architecture, with no buffers, and it reaches ~725mm2 die size on Global Foundries 14nm process node. It has no external memory, and it keeps weights, KVCache, and activations, etc all on-chip during processing. Because each chip only has 230MB of SRAM, no useful models can actually fit on a single chip. Instead, they must utilize many chips to fit the model and network them together. In the case of the Mixtral model, Groq had to connect 8 racks of 9 servers each with 8 chips per server. That’s a total of 576 chips to build up the inference unit and serve the Mixtral model. Compare that to Nvidia where a single H100 can fit the model at low batch sizes, and two chips have enough memory to support large batch sizes.
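Some back-of-envelope math makes the chip count intuitive (these are my rough numbers, not SemiAnalysis’s): Mixtral 8x7B has roughly 47 billion parameters, which at 16-bit precision is about 94GB of weights before you account for the KV cache and activations. At 230MB of SRAM per chip, 576 chips supply roughly 132GB on-chip, so the model only just fits; a single 80GB H100, by contrast, can hold the same weights at 8-bit precision with room to spare.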

This doesn’t just mean you have to buy a large system: it also means you need a lot of power, which is a massive consideration in the total cost of ownership. Moreover, as Patel and Nishball note, larger models require larger systems, as do longer context lengths (more on that below); meanwhile, higher-latency but higher-throughput systems benefit from improvements in things like speculative decoding (think branch prediction, but at a higher abstraction level; there is a sketch of the idea at the end of this section). Patel and Nishball conclude:

The question that really matters though, is if low latency small model inference is a large enough market on its own, and if it is, is it worth having specialized infrastructure when flexible GPU infrastructure can get close to the same cost and be redeployed for throughput or large model applications fairly easily.

I firmly believe the answer to the first question is yes; the second is very much up in the air. In the meantime, do check out the post at SemiAnalysis for specific numbers.
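As for speculative decoding, the idea is that a small, cheap “draft” model guesses several tokens ahead, and the large model verifies the guesses, accepting however many were right. This is a minimal sketch, not Groq’s or Nvidia’s actual implementation, and the model interfaces are hypothetical:

```python
def speculative_decode(draft_model, target_model, prompt, k=4, max_new=64):
    # Hypothetical interfaces (for illustration only):
    #   draft_model.propose(tokens, k) -> list of k cheaply guessed next tokens
    #   target_model.next_token(tokens) -> the token the big model would emit
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        draft = draft_model.propose(tokens, k)   # small model runs ahead
        for i, guess in enumerate(draft):
            # In a real system all k verifications happen in one batched
            # forward pass of the big model; that batching is the speedup.
            verified = target_model.next_token(tokens + draft[:i])
            if verified != guess:
                tokens += draft[:i] + [verified]  # keep the accepted prefix plus the correction
                break
        else:
            tokens += draft  # every guess was right: k tokens for one big-model "step"
    return tokens
```

The branch prediction analogy holds: when the cheap predictor is usually right, you get multiple tokens per expensive step; when it is wrong, you pay a small penalty and fall back to the correct answer.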

Gemini Pro 1.5

Speaking of performance relative to total cost of ownership, that, more than anything, is a reason to bet on Google, who had the other big AI announcement last week. From The Verge:

Barely two months after launching Gemini, the large language model Google hopes will bring it to the top of the AI industry, the company is already announcing its successor. Google is launching Gemini 1.5 today and making it available to developers and enterprise users ahead of a full consumer rollout coming soon. The company has made clear that it is all in on Gemini as a business tool, a personal assistant, and everything in between, and it’s pushing hard on that plan…

There’s one new thing in Gemini 1.5 that has the whole company, starting with CEO Sundar Pichai, especially excited: Gemini 1.5 has an enormous context window, which means it can handle much larger queries and look at much more information at once. That window is a whopping 1 million tokens, compared to 128,000 for OpenAI’s GPT-4 and 32,000 for the current Gemini Pro. Tokens are a tricky metric to understand (here’s a good breakdown), so Pichai makes it simpler: “It’s about 10 or 11 hours of video, tens of thousands of lines of code.” The context window means you can ask the AI bot about all of that content at once.

This is a very big deal. First, here are two tweets that captured two interesting use cases:

OK, granted, slightly altering The Great Gatsby isn’t a use case per se, but it is a useful demonstration of how powerful a large context window is: you can find the proverbial needle in a haystack in a massive amount of information. Moreover, importing The Great Gatsby via RAG — Retrieval-Augmented Generation — isn’t really what RAG is designed for. The canonical use case for RAG is getting specific information in order to answer a question accurately; when you ask an LLM with web access a question about current events, for example, it uses RAG to fetch the relevant information, imports that information into the context window, and then generates an answer. That’s not really what is happening in Mollick’s example.
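To make the mechanics concrete, here is a minimal sketch of that canonical RAG loop; the retriever and model calls are stand-ins, not any particular library’s API:

```python
def answer_with_rag(question, retriever, llm, top_k=3):
    # Hypothetical interfaces: retriever.search returns text snippets,
    # llm.complete returns a string. Real systems typically use an
    # embedding model plus a vector database for the search step.
    snippets = retriever.search(question, top_k)   # fetch relevant passages
    context = "\n\n".join(snippets)                # import them into the window
    prompt = (f"Answer using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return llm.complete(prompt)                    # generate a grounded answer
```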

Mollick’s example, though, helps make the point: a massively larger context window makes it possible to do silly stuff like alter The Great Gatsby, which is to say it lets you do things that never seemed possible previously. Indeed, a larger context window would make RAG work better: instead of importing snippets or summarizing information you can simply import entire documents — or books, or more. The end result is an LLM that is simply a lot easier to use for a lot more use cases, without having to build or tune RAG inputs.
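With a million-token window, the retrieval step can simply disappear for anything book-sized; the same hypothetical llm interface as above, no retriever required:

```python
def answer_with_long_context(question, document, llm):
    # No chunking, no embeddings, no retrieval: if the whole document fits
    # in the context window, just hand it over. A ~120,000-word novel is on
    # the order of 150-200k tokens, well under a 1M-token window.
    prompt = f"Document:\n{document}\n\nQuestion: {question}"
    return llm.complete(prompt)
```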

It’s also very expensive: everything in the context window needs to be held in memory, and every token in the context window factors into every subsequent calculation. One way Google is managing the cost is a “Mixture of Experts” (MoE) approach (which GPT-4 is also reported to use): each token is routed to only a few specialized “expert” sub-networks, and their outputs are weighted and mixed together to achieve the final result. This reduces overall computation (in that large parts of the model are not used if the router deems them unnecessary) and also increases speed, as it brings parallelism to bear on the answer.
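Here is a minimal illustration of the routing idea, with tiny made-up experts rather than anything resembling Gemini’s actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2

# Each "expert" is a tiny feed-forward transform; the router is a learned
# linear layer that scores which experts should see a given token.
experts = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_experts)]
router = rng.standard_normal((d, n_experts)) * 0.1

def moe_layer(x):
    scores = x @ router                     # router scores, one per expert
    chosen = np.argsort(scores)[-top_k:]    # route to the top-k experts only
    weights = np.exp(scores[chosen])
    weights /= weights.sum()                # softmax over the chosen experts
    # Only top_k of n_experts actually run: compute drops even as total model
    # capacity grows, and the chosen experts can execute in parallel.
    return sum(w * np.tanh(x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.standard_normal(d)
print(moe_layer(token).shape)  # (8,) -- the mixed output of the selected experts
```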

Google’s infrastructure team, meanwhile, has been heavily focused on enabling parallelism at every level of the stack, from chips to clusters to data centers, with everything built around their TPU line of chips. These chips are simpler than a GPU, but more flexible than something like what Groq has built; they are also cheaper than both, which makes them well-suited to workloads that can effectively run across a huge number of chips at a time. I mentioned performance/total cost-of-ownership at the beginning, and Google is really starting to bring their massive advantage in that metric to bear.

Google’s Timidity

Let me reiterate: Gemini Pro 1.5 is really remarkable, and to expand on the last point, Google is not only bringing its infrastructure to bear, it is also bringing its integration to bear: Gemini Pro 1.5 is in many respects only possible because it is running on Google’s infrastructure. And yet, these tweets kind of resonate:

Yesterday on X folks were discovering that Gemini has some interesting ideas about history; all of the screenshots below are my reproductions of prompts that I saw posted (note: two of these outputs had the photos stacked vertically; I put them back in a 2×2 grid):

Gemini's depiction of a 17th century physicist

Gemini's depiction of Roman emperors

Gemini's depiction of British royalty

Due to what I presume was a bug, Gemini briefly showed the prompts it used for the image of British royalty:

Gemini's prompts for British royalty

There are other images Gemini refused to create:

Gemini refuses to depict WW2-era German soldiers

Gemini refuses to depict Tiananmen Square in 1989

Gemini refuses to depict white men

The excuse in that last one, by the way, is not universal:

Gemini depicts black men

Gemini depicts Asian men

This is, needless to say, a sort of virtual reality I am — contra yesterday — considerably less excited about and impressed by (I should also note that there are reports Gemini has been changing its results for some of the more widely criticized outputs).

Stepping back, I don’t, as a rule, want to wade into politics, and definitely not into culture war issues. At some point, though, you just have to state plainly that this is ridiculous. Google specifically, and tech companies broadly, have long been sensitive to accusations of bias; that has extended to image generation, and I can understand the sentiment in terms of depicting theoretical scenarios. At the same time, many of these images are about actual history; I’m reminded of George Orwell in 1984:

Every record has been destroyed or falsified, every book has been rewritten, every picture has been repainted, every statue and street and building has been renamed, every date has been altered. And that process is continuing day by day and minute by minute. History has stopped. Nothing exists except an endless present in which the Party is always right. I know, of course, that the past is falsified, but it would never be possible for me to prove it, even when I did the falsification myself. After the thing is done, no evidence ever remains. The only evidence is inside my own mind, and I don’t know with any certainty that any other human being shares my memories.

Even if you don’t want to go so far as to invoke the political implications of Orwell’s book, the most generous interpretation of Google’s over-aggressive RLHF of their models is that they are scared of being criticized. That, though, is just as bad: Google is blatantly sacrificing its mission to “organize the world’s information and make it universally accessible and useful” by creating entirely new realities because it’s scared of some bad press. Moreover, there are implications for business: Google has the models and the infrastructure, but winning in AI given their business model challenges will require boldness; this shameful willingness to change the world’s information in an attempt to avoid criticism reeks — in the best case scenario! — of abject timidity.


This Update will be available as a podcast later today. To receive it in your podcast player, visit Stratechery.

The Stratechery Update is intended for a single recipient, but occasional forwarding is totally fine! If you would like to order multiple subscriptions for your team with a group discount (minimum 5), please contact me directly.

Thanks for being a subscriber, and have a great day!