I interviewed Daniel Gross and Nat Friedman just two-and-a-half months ago to talk about the then-stunning developments in AI in 2022; then, since that interview, ChatGPT happened, and the innovation spurred by Stable Diffusion started to show up in completely unexpected ways. Given that I wrote in the Year in Review that AI is the most important story of not just this year but of the past several years, it felt appropriate to have them back for the final Interview of 2022.
Gross founded Cue, a search engine that was bought by Apple and incorporated into iOS, and led machine learning efforts at Apple from 2013-2017, before becoming a partner at Y Combinator and then transitioning into angel investing. Nat Friedman co-founded Xamarin, an open-source cross-platform SDK which was bought by Microsoft in 2016; Friedman led Microsoft’s acquisition of GitHub in 2018, and was CEO of the developer-focused company until last year; he too is now focused on angel investing. Gross and Friedman now collaborate, primarily on AI-focused investments, and are one of my personal resources for understanding what is happening in the space; I’m pleased to once again share their insight with you.
To listen to this interview as a podcast, click the link at the top of this email to add Stratechery to your podcast player.
On to the interview:
An Interview with Daniel Gross and Nat Friedman about ChatGPT and the Near-Term Future of AI
This interview is lightly edited for clarity.
Teleprompter and AI Toys | RLHF and AI Alignment | Truth, Hallucination, and Bias | Brand and Commodity Information | Apple and the Local Opportunity | Stable Diffusion and Thinking in Images | AI as Platform | The Future of AI | The GPU Paradox
Teleprompter and AI Toys
Daniel and Nat, thanks for coming back for another Stratechery interview. This is an unusually short amount of time to have a repeat guest, but given that I just proclaimed in my Year in Review that AI is the biggest development in tech, not just this year, but since mobile and cloud computing, it felt right to talk about AI for the last Stratechery interview. And needless to say a lot has happened. You would think I would want to start with ChatGPT, but in fact you two just tweeted a link to a GitHub repository about an AI product. I don’t know even what to call it…
Nat Friedman: It’s a toy.
It’s a toy that is a teleprompter that prompts you for the next thing to say. So I do need to be confident that you are not using this teleprompter in this interview. Can you both assure me of that?
Daniel Gross: I do so affirm.
I mean, Daniel, what is it? How does it work?
DG: First off, thanks for having us back on. The main shortage in the market right now, as I think we talked about last time, is actual AI products that are fun, useful and engaging to have in our lives. And we started to see a little bit more of that in the Stable Diffusion world with apps like Lensa exploding on the App Store, where people are able to make virtual avatars of themselves and obviously ChatGPT, which was launched by OpenAI.
Those are two good topics. We’ll definitely get to them shortly!
DG: Anyway, there should be 10 times more tinkering with different types of models and different products. So Nat and I have a hobby of sort of ping-ponging these ideas back and forth. And yeah, we had this funny idea of making sort of a teleprompter that would make you more charismatic by giving you charismatic things to say in particular styles. So you can ask for it in the style of Jerry Seinfeld if you want it to be funnier, or in the style of Bobby Axelrod, or JFK, whatever you wanted. Obviously you do that entirely privately and use a local model so you don’t have to send to a server all the things you’re saying.
So yeah, so we made a very silly, stupid demo of this and used some of the latest and greatest in the natural language processing and large language models world to do it. I mean, it’s a total joke, but I think it’s really meant to show what can be done with little effort. That’s really the point we were trying to make. But yeah, it’s a lot of fun to use, but no, I am not using it right now.
I was going to say, you should be using it right now, then just drop that disclosure like a bomb: “So everything I just said was actually produced by AI.” And the funny thing is — this is part of the reason I want to talk about this — is that there’s been a real sort of awakening I think in the general population about AI, but I think you might be able to trick people: they’re not sure what this is all sort of capable of.
The biggest news since we talked has been the explosion in ChatGPT. I mean, as you just said, Daniel, and Nat, you were talking about this the last time we were on, it definitely seems to validate the take that good enough technology is available. There’s just a product hole. Because yes, there are differences between the GPT-3 model and the ChatGPT model, and we can maybe talk about that a little bit, but at the end of the day, this is built on two year old technology.
NF: Yeah, I think it is still the case that, as Daniel said, researchers have raced ahead. They’ve created this incredible bounty of capabilities and the product folks, the tinkerers, the hackers, for whatever reason, they’re still lagging way behind. They’re starting to catch up. That’s exciting to see, but we still see a huge dearth of new hacks, new products, great new user interfaces, and just the desire to tinker with these things.
Why are they behind? Is this a lack of awareness? Is it a cost issue? Because you do have to pay to use GPT-3, what do you think is the hole here?
NF: Yeah, good question. I’ve sort of debated this a lot. I think it’s a little bit mysterious, but here are some theories. One theory is that when people hear about getting involved in AI and large language models, they assume a lot of specialized knowledge is needed. In order to work with these things I have to know deep learning and gosh, I probably have to know calculus or at least linear algebra, and I’m just sort of not into that sort of thing. Do I need to know how to program CUDA kernels for Nvidia hardware? It’s intimidating.
And what I think they’re missing is that I think this is a fallacy, and I think the fallacy is sort of saying, gosh, to make paint you have to be a chemist. And so if I want to be a painter, I have to learn chemistry. And the reality is you can be a great painter and not know anything about making paint. And I think you can build great products with large models without knowing exactly how they’re made. And in fact, just using them a lot, tinkering with them in the way that we try to do.
Certainly people like Riley Goodside have done an amazing job of developing a kind of level of expertise in the way that you can build things with these models that I think making the models themselves doesn’t actually teach you. So you may understand them as a platform better than the people who made them. They’re like these, I don’t know, sort of alien artifacts that have been discovered, and now we have to poke at them and prod at them and play with them to figure out what they’re really capable of. And it’s exciting. It’s a video game, and you’re on the mini map, there’s a fog of war, you get to go explore. And for whatever reason the ratio of people talking about it to people actually just going and exploring is way off. So I think that’s sort of one theory.
I think the other is it’s very high status to get into AI and do research. And so if you’re super smart and you know you want to do prestigious things in the AI field, then you want to train models and write papers to get lots of citations. And so that’s the leaderboard you just sort of jump onto naturally.
I would think those are probably two different people though. The people who don’t find it intimidating, who understand calculus or whatever it might be, yeah, status may be in their sort of reward function as it is. And what I’m hearing from you is if your reward function is actually I want to build a company and make a whole bunch of money, there is actually this massive opportunity that no one seems to be seizing.
NF: Yeah. I mean, definitely building companies, making lots of money is great and we encourage people to do that. And if they’re going to do that, we’d love to be involved, but also just making awesome things that people respond to, making great products. So I think the product people are starting to arrive now. It’s been too exciting. I think ChatGPT is a kind of starting gun on the text side, the race is on, the game is afoot. People have noticed.
That by the way is what I don’t think OpenAI expected when they released ChatGPT. In fact, I think for them this was sort of incremental step 37. We’re going to do this model, we’ll improve it, we’ll do this and then I guess we’ll add a chat interface to it. And it just so happened that in this case incremental step 37 was like, okay, then they will open the curtains on the play and everyone will see what’s inside. And it was the introduction, it was the AI moment for millions of people who were not aware of the capabilities of these models.
It’s exciting to see people’s reaction to it, but if you were really close to this stuff, you would see many of the positive examples or screenshots of ChatGPT that people were producing were things that you could have done in the OpenAI Playground a year ago, but having this sort of dialogue form factor and this easy accessibility meant that a lot more people could figure that out.
RLHF and AI Alignment
So there’s been a lot of skepticism about chat interfaces. We talked about it last time where you explored a chat interface for the GitHub Copilot stuff at GitHub and realized, actually no, that didn’t work very well. And part of it was because it would just frequently be wrong, and we can get to that point in a little bit. But in the case of ChatGPT, the chat interface worked.
This raises a question about what makes a good product: is it the interface? Is it the UI element or is it the fact that what makes ChatGPT different from GPT-3 is this reinforcement learning with human feedback (RLHF)? It’s interesting because we were talking about this before ChatGPT came out, and the thought was people would use RLHF because they wanted to control what the AI says, and the worry was that people that want to police everything want to police yet another thing, and sure, that is happening.
But what has been interesting is that from a product perspective RLHF seems to have made ChatGPT much more accessible because the answers are much more predictable and relatively concise as opposed to the sort of endless stream of text from GPT-3 that starts out really strong and then goes in some crazy directions. Has that been a surprise to you that this reinforcement learning actually seems to have been essential to the product component of what makes ChatGPT a success?
NF: OpenAI introduced InstructGPT in January of this year. And so that was actually our first introduction to the idea that you could fine tune and guide these models to follow instructions better. And interestingly the challenge with the sort of pre-instruction guided models was that they were unwieldy and unreliable and you’d sort of try to get them to answer a question and they would just continue your question. For example, they would auto-complete your question instead of going into question answering mode.
And so InstructGPT was a really big breakthrough and I would say it was the first major practical application of any quote-unquote “AI alignment” technology, which is this idea of saying how can we get the models to do more of what we want? And so the way you do that is you come up with a bunch of samples of questions and answers that you like or you ask humans to rate them and then you just basically fine tune the model.
Well, you basically have a model training a model, the human rating model training the derived-from-human answers model.
NF: If you have a lot of money you can just make samples and fine tune on those. And if you want to be super efficient, you can have a model rating a model and get away with generating less super high quality data in that case.
Well, let’s put a check on that because I actually want to get to that cost component of humans, but sorry, continue.
NF: So there’s no question that fine tuning and RLHF techniques are great at making the models just sort of more fun and easier for people to talk to. When Daniel and I were doing the teleprompter, we tried using some not very well instruction tuned models to do it. And I think the prompt I was using at one point said, “You are a teleprompter. Here are the words that appear on the screen right now. You want to make sure the next words are funny in the style of Jerry Seinfeld, say the next words.” And a really nice instruction-tuned model would do that. One that wasn’t instruction-tuned would just say, “Teleprompter, Jerry Seinfeld.” It just would not fully understand the instructions. And so it’s definitely helpful.
So that’s sort of inner alignment. So once you’ve decided what you want the model to do, you can fine tune it to do that. Then there’s the question of what should those goals be and is it just about the usability of the model, or do you want it to be inoffensive in specific ways or protect the reputation of the company or product that you’re shipping? These are also goals that companies have when they ship these things.
Daniel, can you explain what is alignment? It’s a word that comes up all the time, how does it relate to coherence? Because alignment as Nat described, isn’t that basically coherence, or am I getting things mixed up that shouldn’t be mixed up?
DG: Well, Silicon Valley has sort of forever been the land of spiritual people that have found religion in software. And I’m not using the teleprompter by the way. And so you tend to get these people working really not for money, not for fame, not for power, but because it’s a really important thing. And as a byproduct of that, companies develop cultures that are cult-like and develop various worries about how the world might go. And everyone feels like they’re working on an existential threat for the world. And I think to some extent this was the case with the internet.
Even if it’s SaaS software.
DG: SaaS software is probably the most commercial era, but I think the early days of the Tim Berners-Lee era of the internet definitely thought they were going to create world peace and everyone is going to get connected with each other and really like everyone. And maybe that will be true at some point, but these AI companies sort of do believe they are the modern day Manhattan project and are creating these large language models that some people would say are just sort of outputting the next statistically-most-likely token. And others believe they’re sort of really sentient beings.
And so this whole field of alignment was created at the extreme to ensure that we don’t extinct ourselves as a species with the sort of spiritual belief that we’re working on this really powerful thing. Everyone sort of wants to be Oppenheimer in that documentary after the Manhattan project where he says, “I am death, destroyer of worlds.” Or whatever. Everyone has that view that they’re creating a very powerful entity or being. And so this term alignment means different things at different companies, but at the extreme, to the most profound zealots it means aligning an AI to not sort of do bad and nefarious things.
I think to more pragmatic people it means basically not having a bug. So you ask a model to do a thing and it should do that thing and not something else. And there’s a whole continuum sort of in-between. These different large language model foundation companies have different views on what they’re trying to achieve with alignment. And there’s been this whole cottage industry called AI safety that’s been created around it that also has tie-ins to this sort of effective altruism movement. And a lot of it really I think is part religious movement and in part pragmatism.
The most pragmatic view of alignment, though, I think is a real one, which is these models today have a tendency sometimes to emit the actual answer to the question, but sometimes to emit an extremely deceitful sort of very plausible looking thing that’s not the answer. The simplest way to articulate this is if you ask any of the models today to do three or four digit math. They’ll give you a result that won’t be off by an order of magnitude.
Right. It’ll be off by 10.
DG: It’ll be off by 10. And so that is true not just in the domain of math. So if you ask a model to generate a database query for you, it’ll generate a sensible seeming database query that might have an extremely subtle bug. And so alignment is really, I think in this most pragmatic way, just sort of making sure that doesn’t happen. And it’s rooted in this deep issue where these models they have this effect where they quote-unquote “hallucinate”, is the industry term.
In reality, I think Nat has pointed out it’s not hallucination. I mean, the term hallucination involves so many other things these models aren’t doing. The models are just roaming through this embedded space they’ve created looking for things that are near each other in this multi-dimensional world. And sometimes the statistically most-next-likely thing isn’t the thing we as humans sort of agree on is truth. And that creates all sorts of issues. So there’s all sorts of ways to get around this. And we mentioned InstructGPT and RLHF, all these techniques to basically try and get the model to say the right thing.
I happen to think there’s this fundamental issue with the training data that the models are created with, where I think ultimately as I’m talking to you now and as I’m communicating to you, there’s all sorts of words I’m saying that are the final words, which are the words I’m saying, but there’s all sorts of things I’m not saying because I have a model of a world, of the listener, of Nat, and I’m trying to say things that will satisfy all those constraints. And I’m really reasoning through sort of word-by-word and finally you get some tokens submitted. These models ultimately were just trained on the final output by humans, not on the actual reasoning step.
And so I think they lack a real fundamental piece of our psyche, and we can see hints in just how valuable this is in the sense that some of the instruct tuning that Nat talked about, all that it really is are examples of what’s called thinking step-by-step, where the model is literally shown a problem, shown the solution to the problem, shown the problem again, and then shown the solution, but literally thinking about it step-by-step. A very simple logic problem…
You get your classic sort of math logic problems that you would get in middle school. And I mean ChatGPT does a lot better than previous models, but it will still fail in really hilarious ways.
DG: And it has. It’s very odd to reason about this in biological terms, because on the one hand it sort of has all the words of the internet, on the other hand it doesn’t know how to activate the sort of very basic mobilization of the words and the right way to do thinking. And again, I think math is the simplest, purest form of where you can see its errors and its flaws, but it’s also true when it speaks English as well.
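As a rough illustration of the "thinking step-by-step" examples Gross describes, the sketch below assembles a prompt in which a worked example shows its reasoning before its answer, followed by a new problem. The example problem, the `build_step_by_step_prompt` helper, and the exact wording are all hypothetical; this formats the idea as a prompt and makes no claim about what any lab's actual tuning data looks like.

```python
# Hypothetical worked example; the wording is illustrative, not from any real training set.
EXAMPLES = [
    {
        "problem": "Tom has 3 boxes with 4 pens in each box. How many pens does he have?",
        "steps": ["There are 3 boxes.", "Each box holds 4 pens.", "3 * 4 = 12."],
        "answer": "12",
    },
]


def build_step_by_step_prompt(examples, new_problem):
    """Assemble a few-shot prompt whose examples show reasoning before the answer."""
    parts = []
    for ex in examples:
        parts.append(f"Problem: {ex['problem']}")
        parts.append("Let's think step by step.")
        # Each intermediate step is written out, so the reasoning is visible text.
        parts.extend(f"- {step}" for step in ex["steps"])
        parts.append(f"Answer: {ex['answer']}")
        parts.append("")  # blank line between examples
    # The new problem ends with the same cue, inviting the model to reason first.
    parts.append(f"Problem: {new_problem}")
    parts.append("Let's think step by step.")
    return "\n".join(parts)


prompt = build_step_by_step_prompt(
    EXAMPLES, "A train has 5 cars with 20 seats each. How many seats are there?"
)
print(prompt)
```

The resulting text would be sent to whatever model is being used; the point is only that the reasoning steps, which are normally missing from the final written output, appear explicitly in the examples.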
I think this ties back, Nat, maybe into why there aren’t products yet, because I think the hardest thing to wrap your head around with these models is just the fundamental difference between probabilistic thinking and deterministic thinking. To your point, Daniel, there is a deterministic process that might be going through your head when you’re thinking and you’re going through a logic chain, but that logic chain is not being exposed. It’s not on the internet, it’s only ever in your head.
And so the output comes out and then this has to backwards engineer it, but it doesn’t backwards engineer it by discerning the logic. It’s purely probabilistic. And my suspicion is that you will see the same dynamic you encounter every time a new technology comes along. My go-to example always is advertising on the web where the first version of advertising on the web was literally putting an ad next to text because that’s what we do with newspapers. And I mean it worked, but then you had digital dimes compared to print dollars and had to come up with something that was fundamentally unique to the internet, which was the feed where you had this dynamic bit of content, and that suddenly made perfect sense for dynamically inserting ads that might be of interest to you.
And so, I’m not surprised there’s this struggle for products, because if this is true, and I would argue that this is just as big a shift as the Internet as far as products are concerned, it requires a completely different way of thinking. We’re going to have to build a whole bunch of bad products because we’re building products that would actually be better served with a deterministic model before we even figure out the use cases where probabilistic models make sense.
NF: Yeah, I think that’s right, but there are techniques that can be used to dramatically improve the predictability and reliability of what these models do. And the surprising thing I think to us is maybe they’re not quite ready for products, but we’re definitely ready for a huge wave of people tinkering with these and experimenting in super practical ways.
We have to go through that V1 before we can get to V2 anyway so please get building!
NF: Yeah. And just the proof of concepts, and discover what you can do if you push this technique a little further. And right now all that work is in the hands of researchers who write papers about it afterwards. And I think we’re ready for the hackers to show up. And maybe they won’t formalize the things they discover in papers, maybe they just show up in tweets or they just show up in GitHub repos.
Truth, Hallucination, and Bias
But there are so many ways you can sort of take the model, wield it as a paintbrush and do pretty amazing things with it that no one’s done before already. And for example, on the question of how do you get the models to stop hallucinating? It would be extremely surprising to me if it were possible to invent an algorithm that only says true things. I don’t think there’s going to be a mathematical solution that is an oracle that only outputs true things, maybe, but it’s hard to imagine.
Well no, there’s not in the real world.
NF: Right. Exactly. So is this a research problem that requires some kind of breakthrough, that requires deep levels of math, or is it just that we need to improve the data, as Daniel says? Improve the fine tuning steps that we use. Maybe have certain tricks around prompting? There’s a technique that we can do now where we do sort of document embedding and retrieval where you give a model some samples of documents from the web that you think might be relevant to the question that it’s trying to answer. And then you, say, synthesize a paragraph answer to this question, but make sure that you’re citing sentences or facts from these documents that I’ve retrieved and make sure that what you’re saying is fully factually consistent with what’s in there. And this technique, it was sort of pioneered in WebGPT, it shows up in other places too. It works really well. It’s not perfect, but it massively reduces hallucination and you get these sort of citations. So if something’s wrong, at least it’s wrong because the source material you’re citing was also wrong.
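To make the retrieve-then-cite idea Friedman describes more concrete, here is a minimal sketch. It is not how WebGPT or any particular product works: the three-sentence corpus is made up for illustration, the `retrieve` and `build_cited_prompt` helpers are hypothetical, and simple word overlap stands in for real document embeddings so the example stays self-contained.

```python
import re

# Hypothetical mini corpus standing in for documents retrieved from the web.
DOCUMENTS = [
    "InstructGPT was introduced by OpenAI in January 2022.",
    "PageRank ranks web pages using the links between them.",
    "Stable Diffusion is an open image generation model.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=2):
    """Rank documents by word overlap with the question.

    Word overlap is a stand-in here; a real system would compare embeddings.
    """
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]


def build_cited_prompt(question, sources):
    """Ask for an answer grounded only in numbered sources, with citations."""
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sources, start=1))
    return (
        "Answer the question using only the sources below. "
        "Cite a source number like [1] after each claim.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )


question = "When was InstructGPT introduced?"
sources = retrieve(question, DOCUMENTS)
prompt = build_cited_prompt(question, sources)
print(prompt)
```

The prompt would then go to a language model. Because the answer is constrained to the numbered sources, a wrong answer can at least be traced back to the document it cited.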
Well, what’s the issue here between, oh, on one hand we say the model is wrong and so we want to fix that problem, but also there’s sometimes situations where the model is right and we just don’t like the result? And I think that’s one of the big questions that’s sort of raised here is, how is there any distinguishing between the two? I mean, there was that bit going around the other day I think from Reason where they gave ChatGPT those political compass tests and it sort of consistently landed on the liberal/progressive quadrant of the axes.
And on one hand it’s like, well okay, we don’t know who the actual trainers were, but you can certainly imagine there’s a particular point of view, particularly given the makeup of the tech industry. On the other hand, you go back to this point about can you even create a good product, at least at this point, that doesn’t have a particular point of view, given how important it seems this reinforcement learning has become? I mean, is bias fundamental to making a good product?
NF: It’s interesting to think about this historically. So when Google was created, the big innovation was ranking. It was PageRank. And it was showing you the right link on top. And often you are asking Google questions. Sometimes you’re searching for pages, but sometimes you’re asking questions. And their answer to the problem, at least in the form of PageRank was an algorithm which basically said we’re taking a bunch of human feedback in the form of links to pages and we’re using those to determine what the right answer is in this case.
And so I do think we’re probably in this era of humans in the loop one way or another, and it’ll be through human feedback, it’ll be through maybe citing documents, maybe even realtime documents that are out there. I was talking to someone earlier today about this idea of a kind of a Hacker News site or a Quora site where language models and people who are maybe experts are interacting kind of back and forth. And I think the technology will allow us to guide these things towards many different sets of values. And it’ll just be a question of whose values those are. Is it just the creators of the models? How many models would there be? If models are only created in San Francisco, you can expect maybe one set of values from them. But, if lots of people have the ability to fine tune and refine these over time, I think you’ll be able to have models that reflect the goals of different groups.
DG: It would seem to me that a critical mistake made in this first iteration of fine-tuning was the idea of asking the question in a binary way, sort of is this output good or bad, versus is this something that has absolute consensus or not? I think it’s very natural to predict the next step beyond this where even one centralized model company I think can have a model that has a very good grasp on the things that are generally agreed upon facts, like dates, times, math, physics; and basically everything else that I do think it can sort of cluster and personalize to the user that it’s talking to. I think that’s sort of the direction that these companies are going to go in, and their mistake, which I think is sort of a cultural San Francisco issue, was to just sort of assume, initially at least, that some of the answers to these questions are just yes and no. Whereas in reality, the question is, “Is this a yes and no question?”
Yeah, that’s a good way to put it. Here’s the question. If this sort of human reinforcement aspect is a key part to it being a product — and setting aside the bias question, although this does get into it — is the human component actually where a moat may be? Maybe it ends up being the case that you can train very large models and compute comes down and the ability to do this increases, and maybe there’s even open source possibilities, but at the end of the day, paying humans to do this training is going to be very expensive and that’s actually what will be a distinguishing characteristic of models that gain large usage and ones that don’t.
DG: So, certainly this was the case for the prior generation of, effectively, ML or AI, or search companies in Google’s case. I think the real moat was everyone became an effective employee of Google, because when you clicked on the fourth result instead of the first, it made their algorithms a little bit better. But, I think one difference between the bag-of-words 2002 variant of machine learning and the embedding space 2022 version of machine learning, is if you imagine in your head a Richter scale or blast radius chart, a single point of feedback in that prior era is maybe a Richter two earthquake, a very small blast radius in terms of how it affects the model. Whereas in the embedding space, because it’s so much more dense, a single point of user feedback has a massive effect on model quality. So, sort of imagine a massive blast radius every time someone clicks on something, which creates this open question that I don’t know that we know the answer to, which is how many feedback data points do you need in order to make a useful product?
We know that in order to get something out the door, you don’t need that many, you may need tens of thousands, maybe a hundred thousand, but not hundreds of millions, and sort of in the old school collaborative filtering or bag-of-words type of machine learning, you did need much more, because you weren’t able to infer that much from every single data point. So, to me, I don’t know if Nat has a different view, but it’s still an open question as to whether data and feedback is a moat in this world, just because the way these models work is by having a much denser representation of things, and so it’s able to learn so much more from feedback.
Wait, I’m actually curious about that because obviously humans can’t go through and pre-answer every single question. At that point, you might as well just skip the whole machine learning model and just ship the human answers. So, how does human feedback, to whatever comes up in that training sort of set, actually make other answers better that weren’t necessarily surfaced in that set?
DG: Ultimately, the model is sort of interpolating through this embedding space, which is basically a lot of dense mathematical representations of words and concepts where similar things sit together, and are near each other. So, Jerry Seinfeld might be near Louis C.K., so to speak, but even sentences that are sort of similar in their semantic meaning are near each other, and so it’s able to infer things beyond the specific Q&A session that it had. So if it gets in fine tuning or in its RLHF process a point of feedback about a very right-leaning viewpoint, it’s going to learn broader things about whether right viewpoints are good or bad. Just because it’s able to sort of interpolate, not just across the words that are similar, which is how we used to do things, but literally the underlying meaning of the words.
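One way to picture "similar things sit together" is cosine similarity between vectors. The sketch below uses made-up three-dimensional vectors and hypothetical labels purely for illustration; real embeddings come from a trained model and have hundreds or thousands of dimensions.

```python
import math

# Made-up 3-dimensional vectors; the labels and numbers are purely illustrative.
EMBEDDINGS = {
    "comedian_a": [0.9, 0.1, 0.0],
    "comedian_b": [0.8, 0.2, 0.1],
    "tax_form": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(name, embeddings):
    """Return the other item whose vector is most similar to the named one."""
    others = (key for key in embeddings if key != name)
    return max(others, key=lambda key: cosine_similarity(embeddings[name], embeddings[key]))


print(nearest("comedian_a", EMBEDDINGS))
```

In this toy space the two comedian vectors point in nearly the same direction, so each is the other's nearest neighbor, while the unrelated item sits far away. Feedback that shifts one point in such a space plausibly affects its neighbors too, which is the intuition behind a single data point having a large "blast radius".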
Right, because you have so much more context than you ever did before.
NF: Yeah, and it can also learn meta lessons from fine tuning. For example, it can learn that you like answers where you kind of explain how you arrived at that answer, as opposed to just sort of telling you the answer, and when the models do explain how they arrived at the answer, they tend to have better answers as well, because they can spend a little bit more compute calculating the in between step tokens, which allow them to think longer, essentially, about the answer, because the amount of compute per token is fixed for the model, and so it can learn these meta lessons. It doesn’t just learn the answer to a question, it learns how it should think about answering questions and how it should answer questions to some extent as well.
DG: If you have an executive assistant, a simple executive assistant, if you tell them, “That meeting at 2:00 PM, move it to 3:00”, would just move the meeting, and that would be the end of it, and then you’d maybe have to repeat that next week. But, a really smart executive assistant would say, “Wait, what’s going on? Why did you move it?” Maybe this wasn’t an important person. Maybe this is a bad time of day for you. Maybe you’d like to eat lunch. They would start looking at other meetings. That’s what the model’s able to do, and I think sort of the big, one of the many, innovations of the current era of machine learning is that we’re able to work both in images and in text, in this sort of latent or embedding space, which is a much more pure representation of the underlying concepts, which is honestly how we developed language, I think, as a way to emit. All that we do with these words is we transfer, ultimately, these mathematical vectors to one another. That’s sort of what I’m doing now, and you could listen to words and your brain is retranslating them to this, hopefully, a similar mathematical vector. In fact, if it doesn’t, that would be a miscommunication. So, we’ve managed to now represent, in software, I think it’s really interesting, the same underlying representation of reality that our minds use, and so when we learn lessons, we learn the broader lesson, because we’re sort of training on concepts, not letters and words.
So, what will change with GPT-4? What do you guys think is going to happen?
NF: It’s better than GPT-3. I mean, I haven’t had a chance to use it extensively. I’ve seen some outputs, looked pretty cool. My guess is that the people who just discovered AI with ChatGPT, it won’t be as big a moment in the sense that it won’t be as big a leap. To go from zero to ChatGPT is quite a large step, and then to go from ChatGPT to whatever GPT-4 ends up being, whenever it ends up coming out — Q1 is my guess, but I don’t know for sure — will be a smaller step, but it’ll also be, for them, a very fast step. Because if you just discovered ChatGPT in December and then there’s this noticeable quality improvement that takes place in, I don’t know, March or February, then it will definitely seem as if things are improving very, very quickly.
My guess, without knowing, is that it’s just going to be kind of better at everything. It’ll probably give better answers, be more thoughtful, know more things. It’ll probably still hallucinate, but less, I would hope. What we’ve seen is that as things improve, and you could say they improve somewhat linearly, there are these non-linear capabilities that just pop out where it couldn’t do X and now it can do X.
Like how GPT-3 was terrible at math, and ChatGPT is actually significantly better, even though you can still get it to make mistakes.
NF: Yeah, and by the way, I do think it’s not quite right to say GPT-3 is a two year old model, because of instruct tuning, there was this step in January. I think that was a big improvement. Then OpenAI’s referred to GPT-3.5 as the sort of improved model that we’re sort of running on. So, I think they’ve been continuously shipping unversioned improvements. We’ve left the major version at 3, but yeah, I think 4 will be a noticeable improvement. I’m really excited to get to play with it extensively and see what are the things that GPT-3 cannot do at all, that 4 can do very well, and I think that’s the big question in everybody’s mind, but I’m not expecting a hundred times better or anything like that.
Brand and Commodity Information
It’s interesting to see how this plays out. There’s an aspect where, as a writer, should I feel threatened? Well, I don’t feel threatened because I’d like to think that what I do is fairly creative. Of course, everyone in my space would think that. But, I could also see this world where the output is not just that sort of uniqueness continues to be rewarded, but actually there’s a decrease in trust in everything you read on the internet because you don’t know if it’s AI generated or not, and the actual return to someone like myself or publications is less about the quality of the output and more about the reputation. It’s like, “Oh, yeah. At least I know this is a human that wrote this for sure.”
You mentioned GPT-3.5, well, that’s a branding thing, right? There’s a bit where the return to branding could actually increase in this world because you go back to consumer packaged goods, like what’s the difference between this deodorant and that laundry detergent and this X, Y, Z? Actually, there’s not really a difference, it’s more a matter of how do you derive value from a commodity, and if text and images are all commodities, that value increasingly comes not from the item itself but from the brand surrounding it. I mean, does that seem like a reasonable way this might play out?
NF: I think it does increase the returns to things that you trust, and I think it increases the returns to thoughtfulness, insight, surprising ideas that are true.
Or just surprising ideas that are not necessarily true.
NF: Yeah, I mean, if the models cause us to downgrade the appearance of authoritativeness, then that might be an excellent thing for society. If our societal adaptation is, just because it sounds formal and authoritative, maybe we shouldn’t trust it, that would be probably great. We become altogether more truth seeking. It’s like you can no longer judge people based on whether they wear a suit because everyone can afford a suit, and so wearing a suit may not be the perfect signal of reliability. I think that’s sort of where we are too. Just because you wrote four paragraphs full of complete sentences, doesn’t mean necessarily that you have an original or really thoughtful idea here.
The model so far cannot produce these big out of distribution kind of insights that kind of cause you to rewrite your whole model of the world in your head. I’m not finding that. I do occasionally find myself using ChatGPT for brainstorming and it’s like, “Gosh, how should I solve this problem?” And it’ll come up with sort of five obvious ideas. The problem sometimes is that I haven’t tried two of them.
Well, also, sometimes it’s actually it can be very helpful when someone is wrong because it’s much easier to be inspired to correct someone than it is to actually generate the idea for myself. I know I certainly work that way. Just give me something to start with and I’ll start out by telling you why you’re wrong and then I will actually be on a roll and then can produce something new.
DG: Nat, in the situations where you’re using it to brainstorm, is what it’s doing effectively sort of semantic search where it’s just pulling together an interesting related word or idea? Could you have gotten there just with semantic embeddings or search or is it doing something beyond just the related concept extraction?
NF: Usually it’s something like, “We have an ant infestation, what should we do?” And it’ll give you four ideas. Or, actually, here’s just a funny fact, I’ve talked to three people in the last 24 hours who said they used it for brainstorming, all of whom were using it to brainstorm baby names.
The other thing I think is really interesting about these models is I don’t think we’ve actually figured out fully what they’re capable of. It’s like a command line tool with no documentation, and we haven’t explored the full space. One of the things I’m doing with ChatGPT is using it to program and I’ll say, “Gosh, I need to do this and this. How would I do that in Python, on my Mac?” And it’ll generate some code, and that won’t work, and I’ll say, “That didn’t work because I got this error message.” And it’ll say, “Oh, that’s probably because of this. Here, I’ll correct it for you.” It’ll correct it, and then you’ll say, “Wait a minute, the code that you just generated doesn’t quite make sense. This isn’t consistent with” whatever, and it’ll say, “Oh, yeah. Sorry about that mistake,” and it’ll sort of correct it.
So, this idea of self-correcting iterative invocations of the model to get closer to what you want, having it kind of self debug, there are zero products today that do this, there are zero public demonstrations that you can download and use to do this. People are doing it kind of manually with ChatGPT, which is exciting. But, there’s clearly the opportunity to produce things here that leverage this ability of the model to respond to itself, and self correct.
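The loop Nat describes doing by hand can be sketched in a few lines. This is a minimal illustration, not a real product or API: the `Model` callable and `fake_model` are hypothetical stand-ins for a language model, and running generated code with `exec` is only acceptable in a toy like this (real use would need a sandbox).

```python
import traceback
from typing import Callable, List, Optional, Tuple

# HYPOTHETICAL interface: takes the conversation so far, returns Python source.
Model = Callable[[List[str]], str]


def try_run(code: str) -> Optional[str]:
    """Execute code; return None on success or the error text on failure."""
    try:
        # WARNING: exec on untrusted code is unsafe outside a sandbox.
        exec(compile(code, "<generated>", "exec"), {})
        return None
    except Exception:
        return traceback.format_exc(limit=1)


def self_correct(model: Model, task: str,
                 max_rounds: int = 3) -> Tuple[Optional[str], int]:
    """Ask for code, run it, feed any error back; return (code, rounds used)."""
    conversation = [task]
    for round_no in range(1, max_rounds + 1):
        code = model(conversation)
        error = try_run(code)
        if error is None:
            return code, round_no
        conversation += [
            code,
            f"That didn't work because I got this error message:\n{error}",
        ]
    return None, max_rounds  # gave up without working code


def fake_model(conversation: List[str]) -> str:
    """Stand-in for a real model: first attempt is buggy, second is fixed."""
    if len(conversation) == 1:
        return "total = sum([1, 2, '3'])"  # raises TypeError when run
    return "total = sum([1, 2, 3])"


code, rounds = self_correct(fake_model, "Sum the numbers 1, 2 and 3 in Python.")
print(rounds, code)
```

The loop only checks that the code runs without raising; whether the output is actually what the user wanted is a separate, harder check.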
I think an opportunity that has made sense to me all along is basically a compiler for AI, like a mechanism such that it can self-check, and the cool thing is a compiler will just throw an error and then you have to go back and actually figure out what actually generated this error. But, in this case, if you can tighten that loop and have it be sort of feeding back on itself, it could be very powerful. But, then the challenge is, “How do you know if it’s right and if it works?” But, I mean, I want to get more to the products.
Apple and the Local Opportunity
I think one of the most interesting products that has come out is Lensa. I wrote about it in an Update last week, and this is downstream from Stable Diffusion. I think what’s interesting about this is a few things. Number one is it took a really very clearly defined problem that was very narrow, which is you are giving it 20 images and it has to generate variations of those images. So the problem space is very small and very constrained and there’s a nice overlap there with, “What’s the number one thing people are the most interested in?” Themselves, right? Right there the product market fit seems very, very compelling, and I don’t know, is that a generalizable lesson? I mean I think the first part is for sure, very narrowly constrain the problem space, but then find something that people are just inherently interested in or willing to experiment with. I mean, Daniel, I’ve seen your Lensa images. Nat, I’m not sure if we’ve gotten any from you.
NF: I mean, anything that makes me look awesome, I like.
What would you say, Daniel? I think you were the first one to try it out.
DG: I think Steve Jobs originally created the iPhone App Store as a way to let these small developers, that he originally met through the Mac platform, really make a living off of these small fun features, and those were the early iPhone apps. We all remember the demos at WWDC of a dedicated app that would find, if you took a pill, and you didn’t know what you took, but you only knew the shape, and it would go through the menus and figure it out. That’s what the App Store was built on, and then there was this era of, I would say, modest innovation, and then it sort of went dormant, and you had these massive platform companies effectively dominate the App Store leaderboards for about a decade. I mean, it’s basically been Facebook, TikTok, Instagram, whatnot.
Now, with this new capability, I sort of feel like we’re back in the 2010 App Store era where you’re getting these one-off, cool, often single developers doing millions of dollars of revenue in some cases, which is the original iPhone vision, making these interesting capabilities. Look, you have long-term questions like, “Where does it go and how do you get a moat and does it become a feature in something else?” But, the whole beauty of the original iPhone vision was you don’t have to worry about that. You could just get out and go and we’ll take care of distribution and hosting for you. We started seeing with Lensa, I think, that original vision return, which I think is really great.
The interesting thing to me about Lensa is it wasn’t the first to do this sort of Stable Diffusion-based fine tuning on an image of yourself, but it just did it right and it packaged it and it was a great lesson in the last mile of user experience, which researchers often overlook. I think actually it’s another thing I learned from watching Nat make Copilot, which is a related idea in the domain of text: speed and responsiveness matter so much more than accuracy. I guarantee you in the hallways of Lensa, to the extent that they have hallways (I’m pretty sure it’s like a desk), they were thinking about the trade-off between better quality and results coming up faster, and the company that wins is the company that just goes for speed and is pragmatic, and that’s another thing they just got right. Winning markets is often a battle of inches on these user experience things, and I think Lensa was not the first, but they were the best, and that’s often the case in the tale of winners. Now, it’s really up to them to try and maintain the lead.
I mean, Stable Diffusion, we’ve seen some pretty amazing demos of what’s to come, and I actually think the clarity one could have of understanding where the future is going in images is much more clear than it is in text, because the problem is actually solved. It’s solved in a single frame at this point, it’s not solved in terms of high quality photos, but for Marvel-style image generation, it is basically fully feature complete.
Well, I want to get into the Stable Diffusion thing in a bit, but one bit you mentioned is the speed. Lensa was very fast when you used it. By the time most people encountered it, it was actually taking days to get your images back.
So, there’s a couple of things here. Number one, you talked about the App Store making it super easy to get that product out there. I completely agree, it’s a great point. Meanwhile, in the backend, Stable Diffusion was not a product because it was open source, it was generally available. They could modify it and set it up to use it in the way they wanted to. Now, they did, they ran it in the cloud, and so you would upload your stuff and then they would make it in the cloud. At the same time, there was also one developer that got it working on the iPhone. It’s actually in the App Store and it’s called Draw Things. Around the same time, Apple released these optimizations for Stable Diffusion to run on their silicon, and it’s not just that, they actually implemented that into the operating system itself. So, you can get these hooks into it, it will work better.
Now, that’s a good way to solve the speed problem. If you’re running it locally, then you’re not in a queue, you’re not waiting for time on the server, and number two, it fixes the cost problem, which we haven’t had a chance to talk about. I mean, the problem with OpenAI and ChatGPT being such a huge success is they’re paying cents per query, which adds up phenomenally fast, right? But, if it’s local, it’s basically zero cost from the developer’s perspective. I mean for the user too, it only costs a bit of battery.
So, is local where this is going to go, or is the relative performance in the cloud just going to be so great that it’s going to stay ahead? It seems from a Lensa perspective in particular, this is where the actual victory is: you could get feedback immediately and then you can get that feedback loop going also. For example, there was one image that I really liked from Lensa, but one of my eyes wasn’t quite in focus. It would be great if I could fix that, right? So, where does this balance between local and cloud end up? I mean, I guess it’s going to be both, but certainly the local nature of Stable Diffusion has been one of the biggest surprises, and it seems like it’s going to, to your App Store point, lead to a whole explosion of things.
DG: I very much agree with you that the local story of both larger language models and Stable Diffusion is completely underrepresented in the market now. Apple has massive incentives to make this a huge story. So, they will, they will make the bindings perfect. In the case of the Apple Neural Engine already, there were torch bindings for it. M2 will come out on MacBook Pros, I imagine next year. The M1 Ultra or whatever is sort of V100 comparable on a flops basis, obviously has much less memory. But, in general, I think really interesting capabilities should emerge for local, where you want high response rate, or it’s just a consumer good where you don’t want to pay for the servers.
I think it is the M4s that will be really interesting. I mean, they’re probably already close to taping out the M3, but you could see down the line where their silicon doesn’t just have a generic neural engine, it’s a Stable Diffusion-tuned neural engine at a very deep level.
NF: I was just going to say the H100, I think one of the things we’re excited to see from that, from Nvidia, is this transformer engine that it has, which is this specialized CUDA kernel and optimizations for transformers, and I think you’re right. Yeah, the subsequent generations will have these for Stable Diffusion, for diffusion models, for transformers, et cetera.
Yeah. When were Transformers introduced again? Was it 2018 or-
DG: It’s 2017 is when they-
2017, and so, if the A100 came out around 2019, 2020, yeah, it wasn’t incorporated. And so, yeah, I think that’s really interesting to see how the H100, when it’s tuned for this specific sort of application, we could see a real step change in performance.
NF: Today actually is 120 days since the Stable Diffusion release. And so, it was sort of interesting to reflect on what 120 days of open source is and means. And I think… I don’t know. Daniel, were you surprised that Apple actually did this optimization themselves? That’s extremely fast, from introduction of a new technology to Apple optimizing it for their hardware.
DG: Yeah, I was surprised that it was an Apple.com release. I sort of expected them to maybe encourage the open source community or to help out with PyTorch through some proxy. But I think there is actually a surprising amount of information in that. Apple has, until recently, sort of really struggled. We all remember those iPhone launch keynotes where they’re saying they have all this hardware and it’s not really clear what you’re going to use it for. New iPads are just running laps around the last iPad, which is still running laps around the competition.
But now, the transformer and Stable Diffusion have given them this massive gift, which is we are finally CPU-bound. Software has not been CPU or GPU-bound for a long time. It’s mostly been memory bound, and now that’s flipping again. And I think it’s a fantastic thing for anyone who makes hardware for TSMC, for Nvidia, for Apple. And so, yeah, I imagine that if they did Stable Diffusion that quickly, I can imagine they have a lot more than that.
Yeah. I’ll tell you why I am not surprised and why I’m pleasantly surprised, both. So, from a pure strategic perspective, there’s the classic line, commoditize your complements. Open source is sort of inherently, by definition, commoditized, and it’s a phenomenal complement, to your point, to Apple’s actual differentiation. So, it was the greatest gift possible to Apple. Your point about their performance advantage being brought to bear, and not just that, but their design advantage, their integration advantage, all of these tie tightly together.
The pleasant surprise comes from the reason to be skeptical: while the opportunity was obvious, it wasn’t clear that Apple still had it in them to seize it.
And I think, if you’re just a broad observer of Apple, it’s perfectly fine to be cynical and say, “Yeah, this is the most obvious thing in the world to do.” But by the way, there are obvious things in the world for very large companies to do that they don’t do or they take forever to sort of get done. And to me, again, the strategy here makes absolute sense. It is the biggest gift in the world for Apple. To make everything local, to make everything delivered through the App Store, aligns with basically every single part of their business. They can sell a whole privacy story around this. Everything about this aligns with what Apple wants to do, but you could see them either moving slowly or getting hung up on these prospective fears. We didn’t talk about the artist backlash that’s sort of been emerging around AI art. You could imagine Apple being like, “We’re the artist company, we’re concerned about that.” And no, they didn’t. They went in the opposite direction. They moved with incredible speed that frankly, I didn’t know if they still had that speed in them.
DG: It did feel like Steve temporarily stepped out of his grave, pushed play on that WordPress that runs the website, and stepped right back in. I think this is public now, it’s in one of the interviews Craig did recently, but Apple themselves were surprised when M1 numbers came out internally. They did not expect it to be anywhere near as good as it was. And I actually think this speaks to how bad Intel is, that they were able to completely knock it out of the park on the first implementation and surprise themselves.
I was going to make the analogy earlier, the M1 was like ChatGPT, where the first version just blows you away.
DG: Yeah. Right.
And M2 is better, but it’s like, it’s not quite the same leap.
Stable Diffusion and Thinking in Images
Let’s talk more about Stable Diffusion. Stable Diffusion remains probably the most surprising aspect broadly, just because, again, if you were in this space, ChatGPT isn’t necessarily a surprise. I think maybe the degree to which it resonates, the degree to which it really is well done — I’ve been surprised to this point, because I thought about human reinforcement learning as being something that would nerf it, because it would just be people worried about giving it the wrong answer which would actually make it worse. But no, it’s actually integral to making it a better product that people like to use. So, that’s been a surprise. But the biggest surprise by far is that there is actually this open source product that is good and runs locally, and just what a tremendous difference this makes. What is some of the stuff that has come out that people probably have no idea about?
NF: Yeah, I think the sort of obvious things have been, first, once open source developers got their hands on it, they took an already small model that was fairly efficient, and they were able to massively optimize it to the point where, as you said, they got it down to I think 400 megabytes. They got it to run on an iPhone.
And this was before Apple’s optimizations.
NF: Yeah, exactly. I’ve seen versions of Stable Diffusion that can make a hundred frames a second, running on server hardware.
DG: Which is great for video.
NF: We’re great at local hill climbing in the open source world, which is really exciting. I think that the other thing is once you have the weights to the model, it’s very different from having API access to the model, and suddenly you can do all of these things. We saw some fun hacks where people were sort of exploring latent space and generating these videos of nearby images. But we also had this whole fine-tuning revolution. The Stable Diffusion concept library showed up on Hugging Face where people were taking the model and fine-tuning its awareness of objects into it. That was the first idea. That preceded this idea of fine-tuning the model not on arbitrary objects, but on yourself. And then DreamBooth, Avatar AI, Lensa: it turns out the object we’re most interested in is ourselves.
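One common way to produce a video of "nearby images" is to walk between two latent vectors and decode each step. The sketch below shows only the walk, using spherical interpolation on toy two-dimensional vectors. Whether the hacks mentioned here worked this way is an assumption, and a real pipeline would pass each latent through the model's decoder, which is omitted.

```python
import math
from typing import List


def slerp(a: List[float], b: List[float], t: float) -> List[float]:
    """Spherical interpolation between vectors a and b at fraction t in [0, 1]."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))  # angle between a and b
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        # Nearly parallel or opposite vectors: fall back to a straight line.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    weight_a = math.sin((1 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


def latent_walk(a: List[float], b: List[float], steps: int) -> List[List[float]]:
    """Evenly spaced latents from a to b; each would be decoded into one frame."""
    return [slerp(a, b, i / (steps - 1)) for i in range(steps)]


# Toy 2-D "latents"; real ones have thousands of dimensions.
start, end = [1.0, 0.0], [0.0, 1.0]
frames = latent_walk(start, end, steps=5)
for latent in frames:
    print([round(v, 3) for v in latent])
```

Spherical rather than straight-line interpolation keeps the intermediate vectors at a similar length to the endpoints, which matters when latents are drawn from a Gaussian.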
NF: But the fine-tuning broadly is still useful. Let’s say you’re doing advertiser images, you want to make ads. We’ve seen a series of companies that are taking Stable Diffusion, you take some images of your ad, it will fine tune it, and now we can produce, whatever object it is that you’re selling, your merchandise, we can produce all kinds of advertising images related to it that faithfully reproduces the way it looks, but beautifully in a ski lodge or whatever.
And then you plug it into Meta’s models where it’ll test the heck out of it, see which one performs the best. It’s models upon models.
NF: There’s an amazing one I was playing with last night, a company that we were lucky to get involved in, that has basically built a kind of prompt-to-puppet pipeline. So, you can type a text prompt describing a character, and then it will create a 3D controllable puppet out of them using Stable Diffusion and a pipeline of other models.
You mentioned Riffusion. What’s that?
NF: Oh, yeah, okay. So, this is one of the wildest things. This is really unexpected. So, the craziest example of this is these two guys took Stable Diffusion, which generates images from text, and they decided to fine-tune the model on images of spectrograms of music. So, they would take music, so, for example, upbeat jazzy brunch music or whatever, and they would take the textual description of it, and then they would generate the spectrogram image of it and fine-tune the Stable Diffusion model on that.
And then, having done that, they would then prompt the model with a description of music that doesn’t exist, and it would generate a spectrogram image, and then they would convert that spectrogram image into audio, and it works ridiculously well. The audio sounds fantastic. And so, they turned this image generation model into a music generation model simply by fine-tuning it on spectrogram images. So, that this works, that it’s so kind of obvious and simple and that it works so well is rather shocking.
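For readers unfamiliar with spectrograms, this sketch shows the audio-to-image half of the idea: a short tone is turned into a grid of magnitudes with frequency on one axis and time on the other. It is a pure-Python illustration with assumed frame sizes, not the Riffusion authors' code. The reverse step (image back to audio) additionally needs a phase-reconstruction method such as Griffin-Lim and is omitted.

```python
import cmath
import math
from typing import List


def spectrogram(samples: List[float], frame: int = 64,
                hop: int = 32) -> List[List[float]]:
    """Return magnitudes indexed [frequency_bin][time_frame], like image rows/columns."""
    # Hann window reduces leakage between neighbouring frequency bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]
    n_bins = frame // 2 + 1
    columns = []
    for start in range(0, len(samples) - frame + 1, hop):
        chunk = [samples[start + n] * window[n] for n in range(frame)]
        column = []
        for k in range(n_bins):
            # Naive discrete Fourier transform of one windowed frame.
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame)
                      for n in range(frame))
            column.append(abs(acc))
        columns.append(column)
    # Transpose so each row is one frequency bin and each column one time step.
    return [list(row) for row in zip(*columns)]


SAMPLE_RATE = 8000  # assumed values for the demo
TONE_HZ = 1000
tone = [math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE) for n in range(512)]

image = spectrogram(tone)
brightest_row = max(range(len(image)), key=lambda k: image[k][0])
print(len(image), "rows x", len(image[0]), "columns")
print("brightest row corresponds to", brightest_row * SAMPLE_RATE / 64, "Hz")
```

A steady tone shows up as a single bright horizontal line in this grid, which is why a model trained on such images can learn musical structure as visual structure.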
Wait, so I just had a realization that just occurred to me. Okay? So, we talk about basically these image generation models and it’s like, “Oh, that’s nice,” and it’s neat that they can work in much less memory space, in part because image is actually a more constrained problem than text where you need this coherency and you want all this context. And so, one of the reasons you can’t run these text models locally is you just don’t have the memory, is basically one of the limitations there. But, inherent in that judgment is that text is going to be the most important, right? After all, look around. Text is actually what matters.
There’s a few points here. Number one, text is what matters in part that’s downstream from the printing press. It’s downstream from that being the way we work. And text is inherently a good match with deterministic thinking, because you can lay down explicitly what you mean. Yes, you don’t have the person’s internal monologue and thoughts, but they can articulate a whole lot of what is important and get that down on the page and walk you through their logic chain. And text lets you do that to a much greater sense than images. Images, a lot of it is your interpretation of the image. It’s you perceiving what it is. And interestingly, from a biological perspective, vision itself is probabilistic, right? That’s why you get those optical illusions, because your brain is filling in all the different pieces that go into it.
And this makes me wonder, maybe the real difference between deterministic computing and probabilistic computing is in fact the difference between text-based computing or text-based thinking and visual-based thinking. And this visual stuff is in fact not just a toy, it’s not just out there first because it’s easier to do. It actually might be the future of all of this. And there’s just more affordances biologically, there are more affordances in everything that comes from images, for exploration and hallucinating and when you say hallucinating, you think about being on an LSD trip or something like that.
NF: You think visual hallucinations.
Yeah, exactly. And so, I don’t know, it makes me wonder, maybe the game really all is the visual stuff.
NF: Yeah, I think you point out this surprising thing about Stable Diffusion, it being the big surprise, and I think that the models are small, they can run locally. Now we are finding out they can generalize to produce music. That’s a sort of second surprise. I think it maybe comes down to this idea that aesthetics requires fewer bits than we realized. We had this statement, a picture is worth a thousand words…
Because you’re tapping into the viewer’s own understanding of the world. Again, it’s how it works biologically and how it works sort of in practice.
NF: Yeah, and I think that the third surprise might be that videos won’t actually need much larger models. There’s even fewer bits. There’s more redundancy between frames than we expect. And so, we probably think, “My God, to produce video, it’s this gigantic leap. You’ll have coherence.” Maybe that’s true, but maybe not. It similarly has fewer bits in it, effectively, than we think of it as having. So, I was going to say, a picture worth a thousand words might mean a picture is only 3K worth of information. A thousand words is not that much data potentially.
Yeah. Text is kind of a hack anyway, right? The picture is worth a thousand words is a really great way to put it. Humans are visually oriented. If the printing press had never been created, well, there’s a reason why text was the province of monks in monasteries copying things over, and the vast majority of interaction was oral interaction, was visual sort of interaction. It’s fascinating because in general, the impact of the Internet and computing in general feels in some respects like a return to a different sort of world, where you have this globalization effect where everyone is online in the same space, but they have their distinct physical world fiefdoms or whatever it might be.
And I don’t know, it’s kind of wild, I’m probably hallucinating here, to use the word, but there’s almost like this return to a world where you couldn’t trust anything. You couldn’t look stuff up. There was just things that were passed down and people had good reputations and you believed what they actually said. It’s a weird sort of return to the past, whereas previously we operated that way because of a deficit of knowledge, now, because there’s too much knowledge out there that’s basically unnavigable by normal people, that some of the same sort of organizing principles in society might actually return to the forefront.
DG: Yeah. Generally, I think the key issue with text that images don’t have is the lack of error correcting. And the type of text that GPT and ChatGPT do really well at is skimmable text, honestly, that you’re consuming a little bit more like an image than you are carefully reading. And math, of course, at the limit is the extreme version of where it struggles because it’s incredibly information dense and errors are very expensive or catastrophic. Whereas, a long paragraph, or write me a movie script about whatever, whatever. If it gets a couple of things wrong in there, it still satisfies their request.
And so, I think these models, ultimately, the Stable Diffusion model, the reason humanity’s entire collection of pixels fits into 400 megabytes is because it drastically has compressed our representations of reality. I think that is basically because it has figured out what the minimum viable thing is to output an image that our error correcting machine can sort of see through and piece together. And it’s not perfect. And you can sort of see if you look carefully at the Stable Diffusion images, like getting hands right is very tricky, and you look carefully at the hands and they’re not quite right.
There are a lot of six-finger hands out there.
DG: Six-finger hands. But as they fix that, the models will get bigger because we will have sort of realized, “Well, okay, you can’t error correct over there.” We really care about these things. So, I think the fact that it worked on music, though, is mind boggling to me, and makes me wonder if Stable Diffusion is a little bit more of a discovery than an invention. Clearly, there’s some natural harmonic relationship between whatever our brains have in concept space and words and error correcting, and I don’t know how to reason about it, but I think it’s really amazing…
Well, I think text is unnatural, or at least permanent text, right?
Because if you go back to how humans evolved, and it was all oral, which changes every time you tell it, and it’s retained in the lossy space that is our brain. This idea that you can write something down and it’s immutable and doesn’t change, that is the weirdness. That’s not actually human. That’s the weird bit. So, in a weird way, this probabilistic bit is actually making these more human in some regards. Humans hallucinate on the internet all the time. You’re looking for an answer, you go to some random forum. Someone very authoritatively and very surely and often writes out an answer, “This is exactly what the issue is, what you should do,” and it’s completely wrong.
NF: Well, and that’s what we trained the models on, too. Exactly that text.
It’s a good point.
DG: What we didn’t train the models on is — it goes back to your point, Ben, about authoritativeness in a world of abundant and free information factories. When you go to Google and you put in a term, and the first result is en.wikipedia.org, you feel really good about that. And when the first result is ten-vacation-ideas.blogspot.com, you sort of know that’s going to be garbage. And I don’t think models fully understand that. That’s why, for example, if you ask any of these models, “What’s the 10th wonder of the world?” It’ll confidently tell you it’s the Roman baths because there are websites on the internet that say that. Why are there all those websites? Well, because there’s this whole thing with SEO and Google, and you’ve got to get out there. This whole information factory is cutting the wrong way.
Maybe Google’s big, big plot to make search results increasingly bad is to confound these models that threaten their business model.
DG: I wish that the market was that efficient.
AI as Platform
Yeah. Once again, we are going long, but there’s a couple of things I want to get to. We talked a little bit last time about AI as a platform. And you think about a platform, you think of something like Windows. What was really compelling is if you built your application to the Windows API, you would get the performance improvement that came from new processor technology for free because that was all abstracted away. So, you could write an application that barely functioned on the 386, then the 486 comes out, you’re like, “Oh my God, my application’s so much better,” and nothing changed about the application. It was just a faster processor underneath. What’s interesting is, in contrast to your point earlier, Nat, products outpaced technology in the early days of computing. Software was too heavy, but it was written too heavy knowing that processors were increasing so quickly that they would catch up. And so, it was almost product pulled it forward in some respects.
The question here is, when Stable Diffusion 2 came out, one of the big bits of backlash was — how did Lensa work? Lensa actually came up with a list of prompts that would produce in this constrained space the right output they wanted every time. The problem is, the prompts didn’t work on Stable Diffusion 2. And so, one of the first updates that Hugging Face came out with was actually making it possible for people to use Stable Diffusion 1 prompts on Stable Diffusion 2 to get the output that they wanted. I’m curious, what was entailed in that? To what extent can there be a separation of prompt and model, or is there a bit where this is so intertwined it’s very hard to pull it apart?
NF: In the case of these Stable Diffusion models, if you ever go through the architecture of the model, it’s interesting. It has a lot of components. And in this case, it’s the CLIP model that does the textual representation of the image. And with Stable Diffusion 2, they switched CLIP models from OpenAI’s preexisting open source CLIP model to the one from the LAION group. And I think it turned out, first, that OpenAI’s CLIP was very good, had great datasets, but also, the LAION one was just a little bit different, sort of like you used different terms, didn’t recognize certain terms, that sort of thing.
I think the other change with Stable Diffusion 2 is that they filtered the dataset a lot more aggressively. Stable Diffusion 1’s dataset contained a fair amount of porn. They wanted to avoid that with Stable Diffusion 2. I don’t know, there’s arguments I’ve heard that they may have been too aggressive, like if there’s a thigh or whatever, a bare leg, that image would get thrown away. And so, that also changed the type of results that people were getting compared to what they had previously expected from Stable Diffusion 1.5. So, yeah, I think your question is, gosh, is this almost like an API, and is this a breaking change in the API? Or user interface, and the user interface is completely revamped.
Right, right, because I think the answer to this is really important to figuring out where value is going to be created in the long run.
If the models can just be subbed out, then there will be a massive value for companies that build that API layer, for lack of a better term, number one. And number two, you can invest in building very large product-led companies because you can have confidence in the underlying structure. On the other hand, if every app needs to be rewritten for a new model, that is going to make the underlying model makers much more valuable, but I would argue is probably going to make the broader ecosystem smaller because you can’t make large investments because you’re going to have to throw it away in a year.
NF: Yeah. This question of where the moat is is this perennial market structure question everyone’s asking, and there’s been this theory that maybe the moat’s in the model, and if you build these super cutting-edge models that are state of the art, maybe they’re very expensive to build, for some reason. You need a ton of compute, so maybe your compute costs are a part of your moat. Maybe you have just the best dataset and it’s really hard to build that dataset, or you have some exclusive access to it. It’s part of your telemetry from your users. That’s a part of it. I think a lot of people have been operating under the theory that the moat could be some technical innovation that they come up with, some breakthrough model architecture or some idea that they keep secret. And I think so far, the evidence is that none of those things feel really durable long term.
Yeah, it’s like saying that the long-term vision of computing is who can make the best new processor every couple years. And with the assumption that all software is going to be rewritten for that processor. When in reality, Intel benefited just as much as anyone else from Windows establishing a stable layer on top of their progress.
NF: There’s lots of interesting structural reasons why those things just don’t apply in this case. One is that the technical secrets are pretty simple, and people jump from company to company, and they tend to get out. So, that’s one. Another is that if I have access to your model — this is sort of interesting. One of the big things that happened since we last talked was OpenAI released this Whisper model, which is a speech recognition model. It’s similar to Stable Diffusion, I think, in that the performance is really surprisingly good. I’m sure you’ve played with it, but if you run it over your podcasts, the large model, it’s probably near flawless in terms of its transcription. And you can run it locally. And OpenAI was open about how they trained it. They’re brilliant. They did a lot of brilliant things. But one of the things they did was they got a dataset of YouTube videos, I think hundreds of thousands of hours of YouTube videos, and the captions from those videos. And so, exactly how they did it, I’m not sure. I can’t speak for OpenAI obviously, but-
So, basically, again, models upon models.
NF: Well, it’s models upon models, but it’s also just this observation that if I have access to a lot of inputs and outputs from your model, which I do have in the case of YouTube captions, which I could scrape from you, then I can distill from your model my own model. I can use those samples as teachers for my model. And so, the more access I have to outputs from your model, the more I have your model in a way. And so, Whisper is better, I would say, than the YouTube captioning model, but it probably got a big leg up from being able to start from that dataset.
That’s what I mean. Google basically helped train OpenAI’s model.
NF: Yeah. You can sort of suck somebody else’s model over the internet through a straw and then improve it yourself. And I think that’s pretty interesting.
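A toy sketch of the distillation idea Nat describes here: treat someone else's model as a black box, collect its input/output pairs, and fit your own model to them. Everything in it is hypothetical and for illustration only (`teacher`, `fit_student`, `student`); the "teacher" is a simple linear function standing in for a real captioning model, and nothing here is a claim about how Whisper was actually trained.

```python
import random


def teacher(x: float) -> float:
    """Hypothetical black-box model we can only query, never inspect."""
    return 3.0 * x + 1.0


def fit_student(pairs):
    """Fit y = w*x + b by ordinary least squares on (input, teacher_output) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var = sum((x - mean_x) ** 2 for x, _ in pairs)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b


random.seed(0)
inputs = [random.uniform(-10, 10) for _ in range(200)]

# "Scrape" the teacher: we never see its weights, only its outputs.
samples = [(x, teacher(x)) for x in inputs]
w, b = fit_student(samples)


def student(x: float) -> float:
    """Our own model, trained purely on the teacher's outputs."""
    return w * x + b
```

The more samples the student sees, the closer it tracks the teacher, which is the sense in which access to outputs amounts to partial access to the model.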
What are your thoughts on this middle layer, Daniel? I mean, obviously, I’m reaching for an analogy that’s in a completely different frame of computing, but is there the prospect of there being this sort of stable layer in the middle between this stuff?
DG: There is. In the particular world of images, I do think there was this temporary moment where, because they changed the underlying CLIP model, people had to rewrite their prompts or use the thing Hugging Face put out. That, in particular, I don’t think was as painful as like a breaking API change. But in the world of language models, there’s a related thing going on that I think might help, which is if you have a language model, you have really three options. You can fine tune to accomplish whatever you want to do with an open source model or with OpenAI’s model, at one extreme. At the other extreme, you can just use the API, just putting in your query and hoping that the result comes back. But in the middle there’s this idea of what’s called k-shot learning, sort of prompt engineering, giving the model a couple of examples, letting the model know that it can write code in a couple of cases, then executing that code, returning it to the model. And there are a bunch of people that have built libraries around that middle layer. And that turns out to be pretty useful just because it sort of creates a new aperture of use case that didn’t exist before. It turns out there’s a lot you can do with these models, with this aforementioned k-shot learning.
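As a rough illustration of the k-shot idea, the sketch below assembles a prompt from a handful of worked examples followed by the new query. The function name `build_k_shot_prompt` and the `Q:`/`A:` layout are assumptions made for this example, not any particular library's format.

```python
def build_k_shot_prompt(examples, query, instruction=""):
    """Assemble a few-shot prompt: optional instruction, k worked examples, then the query."""
    parts = []
    if instruction:
        parts.append(instruction)
    for question, answer in examples:
        parts.append(f"Q: {question}\nA: {answer}")
    # Leave the final answer blank so the model completes it.
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)


examples = [
    ("Translate 'bonjour' to English.", "hello"),
    ("Translate 'merci' to English.", "thank you"),
]
prompt = build_k_shot_prompt(examples, "Translate 'au revoir' to English.")
print(prompt)
```

The resulting string would be sent to whichever model is in use; no weights change, which is what separates this middle option from fine-tuning.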
It’s kind of like that compiler idea that I was talking about earlier.
DG: It’s exactly your compiler idea where you basically tell the model ahead of time, “Look, you can call out to Python if you need to. If you’re not sure what the answer is, you can call out to Python. And use Python however you want, otherwise give me the answer.” And then the library, if it calls out to Python, will execute that Python and then bring it back to the model. And this whole idea of, it’s called prompt chaining, has taken off quite a bit. And there are popular, at this point, libraries for that. Maybe there’ll be companies built around that. And I think those will remain popular abstraction points, whether they’re libraries or companies sort of remains to be seen.
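A minimal sketch of that call-out loop, under loud assumptions: `fake_model` is a hypothetical stand-in for a real language model call, the `PYTHON:` / `RESULT:` / `ANSWER:` markers are invented for this example, and the "Python" the model may call is restricted to plain arithmetic so that nothing untrusted gets executed.

```python
import ast
import operator

# Only these arithmetic operators are allowed in tool calls.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr: str):
    """Evaluate a plain arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: asks for the tool once, then answers."""
    if "RESULT:" in prompt:
        return "ANSWER: " + prompt.rsplit("RESULT:", 1)[1].strip()
    return "PYTHON: 1234 * 5678"


def run_chain(question: str, model=fake_model, max_steps: int = 3) -> str:
    prompt = (
        "You may reply 'PYTHON: <expression>' to run a calculation, "
        "otherwise reply 'ANSWER: <text>'.\n" + question
    )
    for _ in range(max_steps):
        reply = model(prompt)
        if reply.startswith("PYTHON:"):
            result = safe_eval(reply[len("PYTHON:"):].strip())
            # Feed the tool result back to the model and ask again.
            prompt += f"\n{reply}\nRESULT: {result}"
        else:
            return reply[len("ANSWER:"):].strip()
    raise RuntimeError("model did not produce an answer")


answer = run_chain("What is 1234 times 5678?")
```

Swapping `fake_model` for a real model call is the only change the loop would need, which is roughly why this layer lends itself to being packaged as a library.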
But I think there will always be a layer in between, because the large language model companies to date, maybe this changes over time, but to date, OpenAI’s primary goal is to create AGI [artificial general intelligence]. Say what you want, but that’s sort of what they talk about around the water cooler, that’s the cultural mission, that’s the all hands, that’s the conversation. And as a result, the API I don’t think is as great as it could be. And it reminds me a lot of hyperscalers, where at the end of the day, AWS was sort of built to serve at scale enterprise customers and their API was just not as pleasant for startups. It never has been. It never will be.
So you have these companies like Vercel, Heroku that pop up as sort of the middleware in between AWS or GCP, serving the person that just wants to get going. And this sort of exists in the payment space with Stripe, and back in the day, authorize.net, instead of going directly to Visa. So I think it might exist in a large language model space, not necessarily because of backwards-incompatible changes, but just because the API that these companies have created is actually not that easy to use. If you want to accomplish use cases, you have to do this k-shot prompting trick. And I don’t think those companies, Anthropic or OpenAI, are going to bother sort of really building the most bespoke API. I think they do, they care about it. I don’t mean to insult anyone at those companies that’s working on it, but…
It gets to your cultural point before, if the all-in census-
DG: The culture is AGI.
…is research and is papers, there’s a different focus. Basically what we’re waiting for is middleware is what I’m hearing from you. Because the product people can build products, the research people can do research, and someone in the middle needs to translate it all.
DG: Yeah, I think that’s right. And by the way, I think there’s more middleware to be built also in the open source language model, fine-tuning world. For this teleprompter thing that we were discussing throughout this podcast, I fine-tuned a model. And it’s not as easy as it could be. There’s no reason it shouldn’t be as easy as just upload a bunch of text, and you get your fine-tuned model and it’s done.
There are people working on this, but I do think there is another level of abstraction that’s yet to be built, both in the open source world and in the closed source world. And those companies will be very interesting to watch over time. One of the challenges I think that you have in the very early days of an ecosystem, if you’re middleware, if you try to build middleware when the whole marketplace is evolving, because you’re sort of in the not good enough Clayton Christensen phase, is that your customers are themselves vertically integrated.
DG: They’re not great customers, they’re going to switch to the next greatest thing tomorrow. You’re not going to be able to get a lot of money out of them. And over time you switch to sort of modular dynamics when things get commoditized, but that’s the one issue those guys will have.
The Future of AI
Yeah, I think you touched on an important point. I want to ask you guys about what you’re looking forward to next year and going forward. But the reality is an implication of this being on the same scale as mobile or cloud computing is that it takes time. These shifts take a long time, and I think the dynamic you’re talking about is a perfect example where, when the underlying stuff’s changing very rapidly and when it’s not good enough, you have to be fully integrated. But you can’t build large companies that way. And there’s always this washout period where tons of stuff needs to be explored, needs to be figured out, and then suddenly problem spaces emerge. And the market pulls in the product as opposed to the product sort of defining the market.
DG: That is definitely true, and I definitely think we’re in the vertical integration phase.
I mean, what’s going to happen next year other than GPT-4?
DG: Nat, what do you think?
NF: I think, first of all, I’d say this trend of capabilities outpacing products and tinkerers is going to continue. And I think it’s not just because tinkerers will still be catching up. I think it’s because capabilities are going to keep racing ahead. And I think the pace of change will not slow down. I think we’ll just see a lot more capabilities coming out. So that’ll be on the image side. The media side, I think video certainly is a thing we can predict a lot of results from next year. I think we’ll also see multimodal models, so models that are doing text and images…
NF: You’re chatting with ChatGPT for example. It says, “Well, it’d be easier to explain this in a diagram,” and it just gives you a diagram. Or you give it a screenshot of something and it can read the text in the screenshot and answer it. So it can sort of consume and produce images just like words in its conversations with you. I think we’ll start to see multimodal models. I don’t know how capable they’ll be, but I think they’ll show up next year, maybe by the end of the year. Yeah, GPT-4 will come out at some point, and it’ll be another sort of step improvement. It’ll increase the sense of the heat increasing, a hot space, getting hotter. And then I think one of the trends we did touch on here will continue, which is the sort of ratio of cost of data to cost of compute will keep growing. So compute will keep growing, but I think people will recognize that getting super valuable human feedback data or really curating and cleaning their pre-training data, these are very valuable things. High quality data sets matter a ton.
Right, which seems to be a cyclical thing. First everyone believed the inputs have to be super high quality data. Then like, “No, actually quantity is more important.” Then, it’s like, “Well, actually no, quality turns out to be pretty important.”
NF: Some of the backlash we’ve started to see, I’m named in a lawsuit about Copilot and code generation. We had this sort of ArtStation rebellion against AI, where the ArtStation front page was sort of covered by anti-AI images for a day. I think it’s going to continue and maybe sort of enter some new domains. I think it’s possible we’ll see text models start to make some people feel threatened. Or just the ability to produce fake content. You can have this kind of industrial accident where your script goes wrong and it publishes a million articles about some topic on the internet.
Or it goes right, depending on your point of view.
NF: And I think the backlash has been relatively small so far. I haven’t seen any large content creators suing anyone yet, for example.
Well, I think what’s interesting, there are two differences between the art backlash and text. One is a matter of timing. Art has been out there. I think Midjourney is the one that really opened the gates by being widely available, and that was in the summer, so we’re talking six months ago or something like that. Whereas GPT was out there, but ChatGPT didn’t really open people’s eyes until relatively recently, that’s one.
But number two, I would imagine there’s a bit where people who are text oriented, there’s probably a bit of complacency that comes from the fact that ChatGPT is easily shown to be wrong. And so it’s like, “Well, okay, yeah it’s pretty impressive but it does get stuff wrong all the time.” Whereas art, what’s wrong art? Well, okay, six fingers, that’s wrong. But by and large, it’s an interpretive thing, so that feels in some respects more threatening.
One of my takes about how this AI stuff will impact companies is that those doing stuff that is non-deterministic are going to have a huge benefit. So Meta ad serving is I think an obvious example. If you serve someone the wrong ad, it’s an infinitesimal opportunity cost and it’s not any sort of real cost. But if you can get marginally better targeting and serving it’s a huge gain. Whereas you talk about something, and I almost feel like your success in Copilot sort of maybe led me astray a little bit, the reality is stuff that does require accuracy: you take a legal brief, for example, lots of busy work generating text. Oh, obvious use for AI. Well, you really don’t want to get that wrong, right?
NF: Yeah. I think the sort of fiction to non-fiction ratio is a really interesting question for next year. It’s clear that we’re way better at fiction than non-fiction today. The fiction stuff is getting much better, actually, at a very high rate, and the non-fiction getting better too, but, gosh, it’s awfully expensive and it’s awfully hard. And there do seem to be some really hard problems in there, and the models can’t run locally and you can’t tinker with them. And we don’t yet have great open source ones and so maybe it just keeps going that way. Maybe we continue to just be amazing at this fiction, error tolerant space next year. And non-fiction takes longer to catch up.
I think paradoxically that actually means a larger startup opportunity for non-fiction. And the reason is that the incumbent companies in this space will not incorporate AI because it’s clearly worse, and it makes bad mistakes that are bad for your reputation. And so they will double down on the human component. Whereas if you’re starting from scratch and you say, “Look, okay, we’re going to be mostly right and sometimes wrong, but our costs are going to be 1% of the alternative, or even lower.” You talked about Clay Christensen earlier, that’s the classic disruption story, where you take advantage of a new technical paradigm to fundamentally transform your cost structure to produce a product that is worse than the incumbents, but is way cheaper and is on a much steeper slope of improvement over time.
DG: Yeah, I mean it definitely seems like we have not yet fully seen the sort of AI native thing that couldn’t exist before. The iPhone launches, and it takes a while for Uber and Instacart to emerge. I don’t think we’ve yet sort of seen that. Maybe we’ll see that next year. I think that would be really exciting.
Riffusion might be an example.
DG: Yeah, something like just SoundCloud, but it’s sort of automated, and it’s the music you want and it’s on tap. But I think it should be even weirder and more native. By the way, there’s a whole space around sort of therapy and there’s a loneliness epidemic in America. And I do sort of think the degree of creativity and imagination required in having an entertaining conversation with someone is a problem that AI is excellent at. You don’t need to be very factual.
Forum seeding, right? How do you get a forum off the ground? You need to get a critical mass of people. We already talked about how forums are not necessarily accurate. What if there were forum participants that are always there and always replying?
NF: By the way, though, I will say the nonfiction use cases may just be a bit more invisible, because they may be embedded use cases. I did hear this morning about a company that’s using a trained transformer to calculate sales tax. They take a product description and information about the customer’s location and so forth, and they output sales tax and it outperformed their previously built deterministic systems in accuracy and their prior models. They’ll crow about that, but it won’t make quite the same splash because it’s embedded, but they happen to have a ton of data they could use for training and got a great result from a transformer.
Yeah, that’s interesting.
DG: That reminds me, I do think we were talking about this idea that the 175 billion parameter GPT-3 sized model sort of might be this uncanny valley, where it’s a little bit too unwieldy, you can’t run it on one GPU today, it’s hard to reason about, and it’s also not an AGI that answers questions truthfully. And so, sure, a 10 trillion parameter model that can do all the things is really useful and a one billion parameter model that is calculating sales tax is probably really useful. It’s doing one specific thing, and it does it really well. We’ve heard of companies doing very simple things, named entity recognition, sentiment analysis, much better than…
The constrained problem space, if you have a very narrow problem.
DG: Exactly. So I think billion parameter functions and 100 trillion parameter AGIs are interesting. It’s unclear if the space in between is actually that useful.
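A back-of-the-envelope sketch of why a 175 billion parameter model does not fit on a single GPU. It counts weights only (no activations or caches), and the 2 bytes per parameter (16-bit weights) and the 80 GB card are assumptions chosen for illustration; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GB (10^9 bytes); 2 bytes/param assumes 16-bit weights."""
    return num_params * bytes_per_param / 1e9


GPU_MEMORY_GB = 80  # assumption: a single 80 GB accelerator

for name, params in [("1B", 1e9), ("175B", 175e9), ("10T", 10e12)]:
    gb = weight_memory_gb(params)
    fits = "fits" if gb <= GPU_MEMORY_GB else "does not fit"
    print(f"{name}: {gb:,.0f} GB of weights, {fits} on one {GPU_MEMORY_GB} GB GPU")
```

Under these assumptions the one billion parameter model needs about 2 GB and runs almost anywhere, while the 175 billion parameter model needs about 350 GB and has to be split across several cards.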
The GPU Paradox
Last question, what do you make of the GPU paradox? By this I mean if AI is exploding then GPUs should be in exceptionally high demand. But Nvidia’s not simply writing down their existing inventory, but also writing down their future purchase orders from TSMC. So we’re talking future production. We’re not talking about A100s, we’re talking about H100s. They are planning on making fewer of them than they thought they would.
There are a number of possibilities as to why this is the case. It may be that these models are turning out to be much more efficient than we thought, so GPU power is less important. It could be maybe Meta’s actually succeeding in building their own silicon, so they’re buying a lot fewer, I don’t know. Maybe it’s just AI is very big for us because we’re right in the middle of it, and it’s not that big of a deal broadly. What do you think explains this?
DG: I think my variant on that would be your last point, but not that it’s not a big deal for everyone else, I think we’re just really early, in hindsight. And this whole idea of large companies and nation states having to have an answer to GPT-3 or 4, and having to have clusters of tens of thousands of GPUs will happen. I just think it’ll take a really long time. And I actually think ChatGPT was very significant in that regard. But to us, in Silicon Valley, it was actually not a big event.
Its main significance was, as I think about this Christmas, we’re recording this before Christmas, this is going to be one of those things where American families are going to be sitting at the table and discussing AI and having conversations about what is human creativity really? And people are developing opinions, that’s happening. So if you give that a couple of months, I think you’re going to end up seeing, at some point, regulators, conversations, it’s going to get on C-SPAN.
So I think the GPU paradox sort of will be solved. I always think of one of the worst things one could do is be a trader in the public market sort of using West Coast mindsets, because in your West Coast head, everything happens extremely quickly. You’re very early to things.
Diffusion, no pun intended, takes a long time.
NF: I think it’s that too. Hardware demand comes from inference, not from training. Inference is going to dominate. Inference comes from products that have wide usage, and we just don’t have very many products with wide usage yet. We do have some that have grown a lot and you saw Lensa ran out of GPUs, at least at some point.
Honestly, I think Lensa is on the level of ChatGPT for importance. And not because Lensa’s going to be a big company in the long run.
But what it signifies about what products that leverage AI actually look like, and how quickly demand can ramp.
NF: And I do think around the dinner table at Christmas, the phone gets passed around, “Check this out. Check out ChatGPT. Check out Lensa.”
And everyone’s going to want these images immediately. I honestly think Nvidia’s in a really hard spot, because they obviously dramatically over forecasted for the previous generation, and they’ve just gotten destroyed financially. And I do wonder if they over-corrected. And Lensa is kind of my example for why. No one saw this coming because no one saw Stable Diffusion coming, and to just explode like that and to utterly and completely hit the wall as far as GPU availability goes, I don’t think that’s a one-off. I see that as being the first of many explosive uses like this.
DG: Yeah, I think that’s right. And I think that use case was a relatively small model. I think things will get much more complicated from some of these larger models.
Well, I mean, the question too is if it ends up running locally then that doesn’t help Nvidia very much.
DG: Right. Nvidia’s hope is that the really valuable stuff requires running inference on eight GPUs. That’s their hope.
Yeah, we’re back to the local question. I mean, Apple is so well placed. All this stuff is downstream to Stable Diffusion. That probably is, I think, the most important moment of 2022 when we look backwards, because all this sort of stuff is so downstream from that. That lends itself to product exploration. The other thing is that OpenAI costs money, so it literally costs money to experiment, and Stable Diffusion does not. And when you scale that across all the potential explorers and hackers and tinkerers in the world, it’s a massive difference. And then you add on the potential to run locally, the potential to expand it to other places. That’s the breakthrough. And it may be even a bigger one if my on-the-spot idea that vision actually might be more important than text, contra to all our expectations, is true.
DG: It’s a big area of debate. It’s good you have an opinion.
Well, happy to change it, if necessary. It’s good to talk to you guys. Merry Christmas, Happy Hanukkah, Happy Holidays, Happy New Year. What a year. The great thing is it’s really exciting. At the end of last year, personally, I was kind of feeling like, “Man, I just don’t know if I want to keep doing this, what is in the future other than regulation and antitrust?” which I was just sick of. And this is something completely new and it’s really exciting. And maybe there’s confirmation bias, because I want there to be new and exciting, but this does feel very real.
DG: Yeah, I think that’s right. The healthy vibrancy Silicon Valley has right now I don’t think has been felt since the iPhone gold rush days. And so it feels good.
NF: And I have nothing against people who are into Web3, but one of the things that was sad about the Web3 era for me was that the demos weren’t very exciting. Someone would tell you about their Web3 idea, and you were never that impressed by the screenshot or the demo. It seemed to do something that you already know how to do but in a different way.
The demos are unbelievable in AI. It is like the iPhone app from, I don’t know, 2009 that you kind of passed around, “Look at this.” And I think there’s something maybe sort of deep about that. You want that wow moment, want that magic moment from technology where you didn’t know that was possible, and now you can do it in your hand.
If a picture’s worth 1,000 words, a demo is worth a million dollars.
All right. Nat, Daniel, thanks for coming on again, and look forward to seeing what 2023 brings.
DG: Same here. Thank you.
NF: Thanks, Ben.
This Daily Update Interview is also available as a podcast. To receive it in your podcast player, visit Stratechery.