DALL-E, the Metaverse, and Zero Marginal Content

Last week OpenAI released DALL-E 2, which produces (or edits) images based on textual prompts; this Twitter thread from @BecomingCritter has a whole host of example output, including Teddy bears working on new AI research on the moon in the 1980s:

A photo of a quaint flower shop storefront with a pastel green and clean white facade and open door and big window:

And, in the most on-the-nose example possible, A human basking in the sun of AGI utopia:

OpenAI has a video describing DALL-E on its website:

While the video does mention a couple of DALL-E’s shortcomings, it is quite upbeat about the possibilities; some excerpts:

DALL-E 2 is a new AI system from OpenAI that can take simple text descriptions like “A koala dunking a basketball” and turn them into photorealistic images that have never existed before. DALL-E 2 can also realistically edit and re-touch photos…

DALL-E was created by training a neural network on images and their text descriptions. Through deep learning it not only understands individual objects like koala bears and motorcycles, but learns from relationships between objects, and when you ask DALL-E for an image of a “koala bear riding a motorcycle”, it knows how to create that or anything else with a relationship to another object or action.

The DALL-E research has three main outcomes: first, it can help people express themselves visually in ways they may not have been able to before. Second, an AI-generated image can tell us a lot about whether the system understands us, or is just repeating what it’s been taught. Third, DALL-E helps humans understand how AI systems see and understand our world. This is a critical part of developing AI that’s useful and safe…

What’s exciting about the approach used to train DALL-E is that it can take what it learned from a variety of other labeled images and then apply it to a new image. Given a picture of a monkey, DALL-E can infer what it would look like doing something it has never done before, like paying its taxes while wearing a funny hat. DALL-E is an example of how imaginative humans and clever systems can work together to make new things, amplifying our creative potential.

That last line may raise some eyebrows: at first glance DALL-E looks poised to compete with artists and illustrators; there is another point of view, though, where DALL-E points towards a major missing piece in a metaverse future.

Games and Medium Evolution

Games have long been at the forefront of technological development, and that is certainly the case in terms of medium. The first computer games were little more than text:

Images followed, usually of the bitmap variety; I remember playing a lot of “Where in the World Is Carmen Sandiego?” at the library:

Soon games included motion as you navigated a sprite through a 2D world; 3D followed, and most of the last 25 years has been about making 3D games ever more realistic. Nearly all of those games, though, are 3D images on 2D screens; virtual reality offers the illusion of being inside the game itself.

Still, this evolution has had challenges: creating ever more realistic 3D games means creating ever more realistic image textures to decorate all of those polygons; this problem is only magnified in virtual reality. This is one of the reasons even open-world games are ultimately limited in scope, and gameplay is largely deterministic: it is through knowing where you are going, and all of your options to get there, that developers can create all of the assets necessary to deliver an immersive experience.

That’s not to say that games can’t have random elements, above and beyond roguelike games that are procedurally generated: the most obvious way to deliver an element of unpredictability is for humans to play each other, albeit in well-defined and controlled environments.

Social and User-Generated Content

Social networking has undergone a medium evolution similar to that of games, with a two-decade delay. The earliest forms of social networking on the web were text-based bulletin boards and USENET groups; then came widespread e-mail, AOL chatrooms, and forums. Facebook arrived on the scene in the mid-2000s; one of the things that helped it explode in popularity was the addition of images. Instagram was an image-only social network that soon added video, which is all that TikTok is. And, over the last couple of years in particular, video conferencing through apps like Zoom or FaceTime has delivered 3D images on 2D screens.

Still, medium has always mattered less for social networking, just because the social part of it was so inherently interesting. Humans like communicating with other humans, even if that requires dialing up a random BBS to download messages, composing a reply, and dialing back in to send it. Games may be mostly deterministic, but humans are full of surprises.

Moreover, this means that social networking is much cheaper: instead of the platform having to generate all of the content, users generate all of the content themselves. This makes it harder to get a new platform off of the ground, because you need users to attract users, but it also makes said platform far stickier than any game (or, to put it another way, the stickiest games have a network effect of their own).

Feeds and Algorithms

The first iterations of social networking had no particular algorithmic component other than time: newer posts were at the top (or bottom). That changed with Facebook’s introduction of the News Feed in 2006. Now instead of visiting all of your friends’ pages you could simply browse the feed, which from the very beginning made decisions about what content to include, and in what order.

Over time the News Feed evolved from a relatively straightforward algorithm to one driven by machine learning, with results so inscrutable that it took Facebook six months to fix a recent rankings bug. The impact has been massive: not just Facebook but also Instagram saw huge increases in engagement and increased growth the better their algorithmically-driven feeds became; it was also great for monetization, as the same sort of signals that decided what content you saw also influenced what ads you were presented.

However, the reason this discussion of algorithmically-driven feeds sits in a different section than social networking is that the ultimate example of their power isn’t a social network at all: it’s TikTok. TikTok, of course, is all user-generated content, but the crucial distinction from Facebook is that you aren’t limited to content from your network: TikTok pulls in the videos it thinks you specifically are most interested in from across its entire network. I explained why this was a blindspot for Facebook in 2020:

What is interesting to point out is why it was inevitable that Facebook missed this: first, Facebook views itself first-and-foremost as a social network, so it is disinclined to see that as a liability. Second, that view was reinforced by the way in which Facebook took on Snapchat. The point of The Audacity of Copying Well is that Facebook leveraged Instagram’s social network to halt Snapchat’s growth, which only reinforced that the network was Facebook’s greatest asset, making the TikTok blindspot even larger.

TikTok combines the zero cost nature of user-generated content with a purely algorithmic feed that is divorced from your network; there is a network effect, in that TikTok needs lots of content to choose from, but it doesn’t need your specific network.
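The distinction between a network-bound feed and a purely algorithmic one can be made concrete with a minimal sketch. Everything here is hypothetical: the `Post` type, the `predicted_interest` score (standing in for whatever a real ranking model would output), and the sample data are illustrative assumptions, not a description of how Facebook or TikTok actually rank content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    author: str
    # Hypothetical per-viewer score from a ranking model, in [0, 1].
    predicted_interest: float


def network_feed(posts, following, limit=3):
    """Rank only posts whose author is in the viewer's follow graph."""
    candidates = [p for p in posts if p.author in following]
    return sorted(candidates, key=lambda p: p.predicted_interest, reverse=True)[:limit]


def global_feed(posts, limit=3):
    """Rank every post on the platform, ignoring the follow graph."""
    return sorted(posts, key=lambda p: p.predicted_interest, reverse=True)[:limit]


# Illustrative data: two followed accounts and three strangers.
posts = [
    Post("friend_a", 0.40),
    Post("friend_b", 0.55),
    Post("stranger_x", 0.95),
    Post("stranger_y", 0.80),
    Post("stranger_z", 0.10),
]
following = {"friend_a", "friend_b"}

net = network_feed(posts, following)   # drawn from 2 candidate posts
glob = global_feed(posts)              # drawn from all 5 candidate posts
```

Under these assumed scores the network-bound feed can only surface the two friends' posts, while the global feed leads with the two highest-scoring strangers. The larger the pool of content, the more the second approach has to choose from, which is the sense in which the network effect depends on total content rather than on any one user's connections.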

The Machine Learning Metaverse

I get that metaverses were so 2021, but it strikes me that the examples from science fiction, including Snow Crash and Ready Player One, were very game-like in their implementation. Their virtual worlds were created by visionary corporations or, in the case of the latter, a visionary developer who also included a deterministic game for ultimate ownership of the virtual world. Yes, third parties could and did build experiences with strong social components, most famously Da5id’s Black Sun club in Snow Crash, but the core mechanic — and the core economics — were closer to a multi-player game than anything else.

That, though, is exceptionally challenging in the real world: remember, creating games, particularly their art, is expensive, and the expense increases the more immersive the experience is. Social media, on the other hand, is cheap because it uses user-generated content, but that content is generally stuck on more basic mediums — text, pictures, and only recently video. Of course that content doesn’t necessarily need to be limited to your network — an algorithm can deliver anything on the network to any user.

What is fascinating about DALL-E is that it points to a future where these three trends can be combined. DALL-E is ultimately a product of human-generated content, just like its GPT-3 cousin. The latter, of course, is about text, while DALL-E is about images. Notice, though, that progression from text to images; it follows that machine learning-generated video is next. This will likely take several years, of course; video is a much more difficult problem, and responsive 3D environments more difficult yet, but this is a path the industry has trod before:

Game developers pushed the limits on text, then images, then video, then 3D
Social media drives content creation costs to zero first on text, then images, then video
Machine learning models can now create text and images for zero marginal cost

In the very long run this points to a metaverse vision that is much less deterministic than your typical video game, yet much richer than what is generated on social media. Imagine environments that are not drawn by artists but rather created by AI: this not only increases the possibilities, but crucially, decreases the costs.

Zero Marginal Content

There is another way to think about DALL-E and GPT and similar machine learning models, and it goes back to my longstanding contention that the Internet is a transformational technology matched only by the printing press. What made the latter revolutionary was that it drastically reduced the marginal cost of consumption; from The Internet and the Third Estate:

Meanwhile, the economics of printing books was fundamentally different from the economics of copying by hand. The latter was purely an operational expense: output was strictly determined by the input of labor. The former, though, was mostly a capital expense: first, to construct the printing press, and second, to set the type for a book. The best way to pay for these significant up-front expenses was to produce as many copies of a particular book that could be sold.

How, then, to maximize the number of copies that could be sold? The answer was to print using the most widely used dialect of a particular language, which in turn incentivized people to adopt that dialect, standardizing languages across Europe. That, by extension, deepened the affinities between city-states with shared languages, particularly over decades as a shared culture developed around books and later newspapers. This consolidation occurred at varying rates — England and France several hundred years before Germany and Italy — but in nearly every case the First Estate became not the clergy of the Catholic Church but a national monarch, even as the monarch gave up power to a new kind of meritocratic nobility epitomized by Burke.

The Internet has had two effects: the first is to bring the marginal cost of consumption down to zero. Even with the printing press you still needed to print a physical object and distribute it, and that costs money; meanwhile it costs effectively nothing to send this post to anyone in the world who is interested. This has completely upended the publishing industry and destroyed the power of gatekeepers.
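The cost structure described above can be worked through with a small sketch. The function and every number in it are illustrative assumptions chosen only to show the shape of the curve: hand copying is modeled as all marginal cost, the printing press as a large up-front cost plus a small per-copy cost, and Internet distribution as an up-front cost with zero per-copy cost.

```python
def average_cost_per_copy(fixed_cost, marginal_cost, copies):
    """Average cost of one copy: up-front cost spread over all copies, plus the per-copy cost."""
    if copies <= 0:
        raise ValueError("copies must be positive")
    return fixed_cost / copies + marginal_cost


# Hand copying: no up-front cost, so more copies never get cheaper.
hand_copy = average_cost_per_copy(fixed_cost=0, marginal_cost=100, copies=1000)  # 100.0

# Printing press: the up-front cost dominates until volume is high.
press_small = average_cost_per_copy(fixed_cost=1000, marginal_cost=2, copies=10)    # 102.0
press_large = average_cost_per_copy(fixed_cost=1000, marginal_cost=2, copies=1000)  # 3.0

# Internet: with zero marginal cost the average keeps falling toward zero.
internet_large = average_cost_per_copy(fixed_cost=1000, marginal_cost=0, copies=1_000_000)
```

With a nonzero marginal cost the average can never fall below that per-copy floor, however many copies are sold; with a marginal cost of zero there is no floor, which is the sense in which distribution becomes effectively free.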

The other impact, though, has been on the production side; I wrote about TikTok in Mistakes and Memes:

That phrase, “Facebook is compelling for the content it surfaces, regardless of who surfaces it”, is oh-so-close to describing TikTok; the error is that the latter is compelling for the content it surfaces, regardless of who creates it…To put it another way, I was too focused on demand — the key to Aggregation Theory — and didn’t think deeply enough about the evolution of supply. User-generated content didn’t have to be simply pictures of pets and political rants from people in one’s network; it could be the foundation of a new kind of network, where the payoff from Metcalfe’s Law is not the number of connections available to any one node, but rather the number of inputs into a customized feed.

Machine learning generated content is just the next step beyond TikTok: instead of pulling content from anywhere on the network, GPT and DALL-E and other similar models generate new content from content, at zero marginal cost. This is how the economics of the metaverse will ultimately make sense: virtual worlds need virtual content created at virtually zero cost, fully customizable to the individual.

Of course there are many other issues raised by DALL-E, many of them philosophical in nature; there has already been a lot of discussion of that over the last week, and there should be a lot more. Still, the economic implications matter as well, and after last week’s announcement the future of the Internet is closer, and weirder, than ever.

Stratechery by Ben Thompson
