The New York Times’ AI Opportunity

Monday, January 8, 2024

This Article is available as a video essay on YouTube

Christopher Rufo, the conservative activist who led the charge in surfacing evidence of plagiarism against now-former President of Harvard University Claudine Gay, was born in 1984; he joined X in 2015. Harvard, meanwhile, is the oldest university in the United States (older than the United States, in fact), having been founded in 1636. That mismatch is perhaps the most striking aspect of the Gay episode: a millennial on Twitter took down our most august institution’s president by employing the 4th of Saul Alinsky’s Rules for Radicals: “Make the enemy live up to its own book of rules.” In this case the book of rules was the Harvard University Plagiarism Policy:

It is expected that all homework assignments, projects, lab reports, papers, theses, and examinations and any other work submitted for academic credit will be the student’s own. Students should always take great care to distinguish their own ideas and knowledge from information derived from sources. The term “sources” includes not only primary and secondary material published in print or online, but also information and opinions gained directly from other people. Quotations must be placed properly within quotation marks and must be cited fully. In addition, all paraphrased material must be acknowledged completely. Whenever ideas or facts are derived from a student’s reading and research or from a student’s own writings, the sources must be indicated…

Students who, for whatever reason, submit work either not their own or without clear attribution to its sources will be subject to disciplinary action, up to and including requirement to withdraw from the College. Students who have been found responsible for any violation of these standards will not be permitted to submit course evaluation of the course in which the infraction occurred.

Rufo is certainly familiar with Alinsky; he cited the activist just a couple of months ago, celebrating the fact that The New Republic had called him dangerous. The New Republic article that I found more interesting, though, and yes, pertinent to Stratechery, was the one being passed around Twitter over the weekend: Christopher Rufo Claims a Degree from “Harvard.” Umm … Not Quite.

On paper, Christopher Rufo, the conservative activist who recently was appointed by Florida Governor Ron DeSantis to sit on the board of a small Sarasota liberal arts college whose curriculum the governor dislikes, presents his credentials as impeccable: Georgetown University for undergrad and “a master’s from Harvard,” according to his biographical page on the Manhattan Institute’s website.

But that description, and similar ones on Wikipedia, in the press release DeSantis’s office sent out, and on Rufo’s personal website, are at the very least misleading. Rufo received a Master’s in Liberal Arts in Government from Harvard Extension School in 2022, the school confirmed in an email to The New Republic. Harvard Extension School, in a nutshell, is part of the renowned institution, but it is not Harvard as most people know it (a Harvard student once joked that it’s the “back door” to Harvard). The school describes itself as an “open-enrollment institution prioritizing access, equity, and transparency.” Eligibility for the school is, according to its website, “largely based on your performance in up to three requisite Extension degree courses, depending on your field, that you must complete with distinction.” High school grades and SAT and ACT scores aren’t required at the institution.

What was interesting about this story is the extent to which those associated with Harvard — such as this professor and this political pundit — were baffled that people didn’t care about this distinction, and the extent to which everyone else was baffled at how much they did. That, at least, was the impression I got on X and in group chats, but I recognize I may be biased on two counts. First, I wrote when I left Microsoft in 2013 in a piece called Independence:

It’s interesting how some folks are always looking for some sort of institutional authority. I’ve been quoted as “Microsoft’s Ben Thompson,” as “former Apple intern Ben Thompson,” and “batshit crazy Ben Thompson.” I actually wish the third were true, because, unlike the first two, the descriptor rests on what I write, not on some sort of vague authority derived from whoever is signing my paychecks.

Besides, both workplace references are out-of-date: I was at Apple three years ago, and, as of July 1, I don’t work for Microsoft either. Instead, I am the author of Stratechery. What more is there to say? I’m a person, I put myself out there on this blog, and I trust that what I write represents me well.

One of the many transformative aspects of the Internet is how it empowers individuals to build their own institutions. In days gone by, my thoughts would have been confined to myself and a few close friends; now my friends are all over the world, and I communicate with them through an institution of my own making.

I’m not sure the use of the word “institution” is entirely correct, for the reasons I will lay out in this Article, but needless to say I’m not a fan of basing one’s worth on one’s institutional associations. For now, the second reason I may be biased is that I was, as I noted, basing my perception off of X and group chats: those are native Internet formats, and what seems clear is that the way that value and influence are created, captured, and leveraged on the Internet is fundamentally new and different from the analog world.

New York Times v. OpenAI

I may have been taking a break the last two weeks, but the New York Times’ legal team was not, nor were its in-house reporters; they write:

The New York Times sued OpenAI and Microsoft for copyright infringement on Wednesday, opening a new front in the increasingly intense legal battle over the unauthorized use of published work to train artificial intelligence technologies. The Times is the first major American media organization to sue the companies, the creators of ChatGPT and other popular A.I. platforms, over copyright issues associated with its written works. The lawsuit, filed in Federal District Court in Manhattan, contends that millions of articles published by The Times were used to train automated chatbots that now compete with the news outlet as a source of reliable information.

The suit does not include an exact monetary demand. But it says the defendants should be held responsible for “billions of dollars in statutory and actual damages” related to the “unlawful copying and use of The Times’s uniquely valuable works.” It also calls for the companies to destroy any chatbot models and training data that use copyrighted material from The Times.

There are two aspects of not just this case but all of the various copyright-related AI cases: inputs and outputs. To my mind the input question is obvious: I myself consume a lot of copyrighted content — including from the New York Times — and output content that is undoubtedly influenced by the content I have input into my brain. That is clearly not illegal, and while AI models operate at an entirely different scale, the core concept is the same (I am receptive to arguments, not just in this case but with respect to a whole range of issues, that the scale made possible by technology means a difference in kind; that, though, is a debate about the necessity for new laws, not changing the meaning of old ones).

For a copyright claim to hold water the output needs to be the same; this is where previous cases, like that filed by Sarah Silverman against Meta, have fallen apart. From The Hollywood Reporter:

Another of Silverman’s main theories — along with other creators suing AI firms – was that every output produced by AI models are infringing derivatives, with the companies benefiting from every answer initiated by third-party users allegedly constituting an act of vicarious infringement. The judge concluded that her lawyers, who also represent the artists suing StabilityAI, DeviantArt and Midjourney, are “wrong to say that” — because their books were duplicated in full as part of the LLaMA training process — evidence of substantially similar outputs isn’t necessary.

“To prevail on a theory that LLaMA’s outputs constitute derivative infringement, the plaintiffs would indeed need to allege and ultimately prove that the outputs ‘incorporate in some form a portion of’ the plaintiffs’ books,” Chhabria wrote. His reasoning mirrored that of Orrick, who found in the suit against StabilityAI that the “alleged infringer’s derivative work must still bear some similarity to the original work or contain the protected elements of the original work.”

This is why the most important part of the New York Times’ filing was Exhibit J, which contained “One Hundred Examples of GPT-4 Memorizing Content From the New York Times”. All of the examples are very similar in format; here is Example 1:

Here is the output as compared to the original article:

That is the same output! It also, more pertinently to this case’s prospects, addresses the specific reasons why previous cases have been thrown out.¹

Criminalizing Capability and Fair Use

This case was filed twelve days ago; as far as I can tell the issue has been fixed by OpenAI:

A failed attempt to recreate the example in the lawsuit

The fix does seem to be a general one: I wasn’t, in limited testing, able to recreate the behavior the New York Times’ case documents, either on New York Times content or other sources. I think this does, at a minimum, cast OpenAI in a very different light than Napster, which was found liable for copyright infringement in large part because it was very much aware of what its service was being primarily used for. In this case the New York Times used a very unusual prompt to elicit copyrighted content, and OpenAI moved quickly to close the loophole.

That, by extension, raises the question as to who exactly was at fault for these examples: if the New York Times placed an article onto a copy machine and pressed copy, surely it wouldn’t sue Xerox? Or consider Apple, which provides the opportunity to “print” any webpage on your iPhone, and on the print screen, convert said webpage to a PDF, complete with a share menu: is it the phone maker’s fault if I use that capability to send an article to a friend? How different is this from using highly unusual prompts to derive copyrighted material?

This question strikes me as more than mere pedantry: another news story over the break was Substack and its refusal to censor Nazi content; to what extent is the newsletter provider culpable for content on its platform that users place there of their own volition? It’s not an easy question — I laid out my proposed approach broadly in A Framework for Moderation — but it does seem problematic to hold that a tool simply being capable of an illegal or undesirable output when specifically directed by a user is therefore guilty of illegality or endorsing said output generally.

All of these questions will be explored by the court; in addition to the aforementioned Napster case, I expect the court to consider the precedent set by Authors Guild v. Google, i.e. the Google Books case, which is particularly pertinent because it involved a large tech company ingesting the entire content of copyrighted works (which is, I would imagine, a tremendous asset to Google’s own large language models). The Second Circuit Court of Appeals ruled in Google’s favor:

Google’s making of a digital copy to provide a search function is a transformative use, which augments public knowledge by making available information about Plaintiffs’ books without providing the public with a substantial substitute for matter protected by the Plaintiffs’ copyright interests in the original works or derivatives of them. The same is true, at least under present conditions, of Google’s provision of the snippet function. Plaintiffs’ contention that Google has usurped their opportunity to access paid and unpaid licensing markets for substantially the same functions that Google provides fails, in part because the licensing markets in fact involve very different functions than those that Google provides, and in part because an author’s derivative rights do not include an exclusive right to supply information (of the sort provided by Google) about her works. Google’s profit motivation does not in these circumstances justify denial of fair use. Google’s program does not, at this time and on the record before us, expose Plaintiffs to an unreasonable risk of loss of copyright value through incursions of hackers. Finally, Google’s provision of digital copies to participating libraries, authorizing them to make non-infringing uses, is non-infringing, and the mere speculative possibility that the libraries might allow use of their copies in an infringing manner does not make Google a contributory infringer.

This summary invokes the four-part balancing test for fair use; from the Stanford Library:

The only way to get a definitive answer on whether a particular use is a fair use is to have it resolved in federal court. Judges use four factors to resolve fair use disputes, as discussed in detail below. It’s important to understand that these factors are only guidelines that courts are free to adapt to particular situations on a case‑by‑case basis. In other words, a judge has a great deal of freedom when making a fair use determination, so the outcome in any given case can be hard to predict.

The four factors judges consider are:

The purpose and character of your use

The nature of the copyrighted work

The amount and substantiality of the portion taken, and

The effect of the use upon the potential market.

In my not-a-lawyer estimation, LLMs are clearly transformative (purpose and character);² the nature of the New York Times’ work also works in OpenAI’s favor, as there is generally more allowance given to disseminating factual information than to fiction. OpenAI is obviously taking all of the work for their models, but that was already addressed in the Google case. That leaves point four, and the potential “effect of the use upon the potential market.”

Market Effects and Hallucination

It seems likely the New York Times’ lawyers knew this would be the pertinent point: the first paragraph lays out the New York Times’ investment in journalism, and the second paragraph states:

Defendants’ unlawful use of The Times’s work to create artificial intelligence products that compete with it threatens The Times’s ability to provide that service. Defendants’ generative artificial intelligence (“GenAI”) tools rely on large-language models (“LLMs”) that were built by copying and using millions of The Times’s copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more. While Defendants engaged in widescale copying from many sources, they gave Times content particular emphasis when building their LLMs—revealing a preference that recognizes the value of those works. Through Microsoft’s Bing Chat (recently rebranded as “Copilot”) and OpenAI’s ChatGPT, Defendants seek to free-ride on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment.

Here again the Google Books case seems pertinent, particularly given the effort and intentionality necessary to generate copyrighted content (and which has already been limited by OpenAI). The district judge wrote:

[P]laintiffs argue that Google Books will negatively impact the market for books and that Google’s scans will serve as a “market replacement” for books. [The complaint] also argues that users could put in multiple searches, varying slightly the search terms, to access an entire book.

Neither suggestion makes sense. Google does not sell its scans, and the scans do not replace the books. While partner libraries have the ability to download a scan of a book from their collections, they owned the books already — they provided the original book to Google to scan. Nor is it likely that someone would take the time and energy to input countless searches to try and get enough snippets to comprise an entire book.

OpenAI does sell access to its large language models (along with Microsoft); in this case Google’s search dominance, and the resultant luxury of not needing to monetize complements like Google Books, gave it more legal cover. The New York Times, though, isn’t just arguing that people will read the New York Times via ChatGPT; this section about the Wirecutter was more compelling in terms of the direct impact on the company’s monetization:

Detailed synthetic search results that effectively reproduce Wirecutter recommendations create less incentive for users to navigate to the original source. Decreased traffic to Wirecutter articles, and in turn, decreased traffic to affiliate links, subsequently lead to a loss of revenue for Wirecutter. A user who already knows Wirecutter’s recommendations for the best cordless stick vacuum, and the basis for those recommendations, has little reason to visit the original Wirecutter article and click on the links within its site. In this way, Defendants’ generative AI products directly and unfairly compete with Times content and usurp commercial opportunities from The Times.

Here’s the problem, though: the New York Times immediately undoes its argument. From the same section of the lawsuit:

Users rely on Wirecutter for high-quality, well-researched recommendations, and Wirecutter’s brand is damaged by incidents that erode consumer trust and fuel a perception that Wirecutter’s recommendations are unreliable.

In response to a query regarding Wirecutter’s recommendations for the best office chair, GPT-4 not only reproduced the top four Wirecutter recommendations, but it also recommended the “La-Z-Boy Trafford Big & Tall Executive Chair” and the “Fully Balans Chair”—neither of which appears in Wirecutter’s recommendations—and falsely attributed these recommendations to Wirecutter…

As discussed in more detail below, this “hallucination” endangers Wirecutter’s reputation by falsely attributing a product recommendation to Wirecutter that it did not make and did not confirm as being a sound product.

That leads into an entire section about hallucination in general, and how it is damaging to the New York Times. In fact, though, this is why I think the New York Times has point four backwards.

Internet Value

Rufo was effective versus Harvard because he used their own rules about plagiarism against them; why, though, does Harvard have rules about plagiarism? I suspect it’s related to the fact that Harvard is 388 years old. The goal is the accumulation of and passing on of knowledge, not just to the students of today, but to the ones 300 years from now; that means that careful attention to detail and honesty in one’s work today will stand the test of time, and add to Harvard’s legacy.

What is notable is that plagiarism is arguably the currency of the Internet. I wrote two years ago in Mistakes and Memes:

Go back to the time before the printing press: while a limited number of texts were laboriously preserved by monks copying by hand, the vast majority of information transfer was verbal; this left room for information to evolve over time, but that evolution and its impact was limited by just how long it took to spread. The printing press, on the other hand, by necessity froze information so that it could be captured and conveyed.

This is obviously a gross simplification, but it is a simplification that was reflected in civilization in Europe in particular: local evolution and low conveyance of knowledge with overarching truths aligns to a world of city-states governed by the Catholic Church; printing books, meanwhile, gives an economic impetus to both unifying languages and a new kind of gatekeeper, aligning to a world of nation-states governed by the nobility.

The Internet, meanwhile, isn’t just about demand — my first mistake — nor is it just about supply — my second mistake. It’s about both happening at the same time, and feeding off of each other. It turns out that the literal meaning of “going viral” was, in fact, more accurate than its initial meaning of having an article or image or video spread far-and-wide. An actual virus mutates as it spreads, much as how over time the initial article or image or video that goes viral becomes nearly unrecognizable; it is now a meme.

Debating citations or quotation marks in a world of memes seems preposterous, which speaks to the overarching point: the way that information is created and disseminated on the Internet is fundamentally new and different from the analog world. The old New Yorker cartoon observed that “On the Internet, nobody knows you’re a dog”; the corollary here is that on X no one cares if your institution is 388 years old, unless, of course, it can be used as a means of attacking you.

This, by extension, explains why the attacks on Rufo’s degree didn’t land with most people online: no one cares. Impact on the Internet is a direct function of what you have done recently: a YouTuber is as popular as their latest video, a tweeter as their latest joke, or an influencer as their latest video. In the case of Rufo what mattered was whether he brought evidence for his claims or not; obsessing about the messenger is to miss the point that he might as well be the New Yorker dog.

The New York Times’ AI Opportunity

What makes this pertinent to the New York Times case is that the New York Times is portraying its value as being its accumulated archives that OpenAI used to train. That is an impressive edifice of its own, make no mistake, and there is a reason there is a pipeline from Harvard to the New York Times newsroom. The New York Times, though, to its immense credit, has transformed itself from a newspaper to an online juggernaut, which means de-prioritizing pure news. From Publishing is Back to the Future:

I am being pretty hard on publishers here, but the truth is that news is a very tough business on the Internet. The reason why readers don’t miss any one news source, should it disappear, is that news, the moment it is reported, immediately loses all economic value as it is reproduced and distributed for free, instantly. This was always true, of course; journalists just didn’t realize that people were paying for paper, newsprint, and delivery trucks, not their reporting, and that advertisers were paying for the people. Not that they cared about how the money was made, per tradition.

The publication that has figured this out better than anyone is the New York Times; that is why the newspaper, to its immense credit, has been clear about the importance of aligning its editorial approach with its business goals. From 2017’s 2020 Report:

We are, in the simplest terms, a subscription-first business. Our focus on subscribers sets us apart in crucial ways from many other media organizations. We are not trying to maximize clicks and sell low-margin advertising against them. We are not trying to win a pageviews arms race. We believe that the more sound business strategy for The Times is to provide journalism so strong that several million people around the world are willing to pay for it. Of course, this strategy is also deeply in tune with our longtime values. Our incentives point us toward journalistic excellence…

Our journalism must change to match, and anticipate, the habits, needs and desires of our readers, present and future. We need a report that even more people consider an indispensable destination, worthy of their time every day and of their subscription dollars.

Notice the focus on being a destination, a site that users go to directly; that is an essential quality of a subscription business model. From The Local News Business Model:

It is very important to clearly define what a subscription means. First, it’s not a donation: it is asking a customer to pay money for a product. What, then, is the product? It is not, in fact, any one article (a point that is missed by the misguided focus on micro-transactions). Rather, a subscriber is paying for the regular delivery of well-defined value.

Each of those words is meaningful:

Paying: A subscription is an ongoing commitment to the production of content, not a one-off payment for one piece of content that catches the eye.

Regular Delivery: A subscriber does not need to depend on the random discovery of content; said content can be delivered to the subscriber directly, whether that be email, a bookmark, or an app.

Well-defined Value: A subscriber needs to know what they are paying for, and it needs to be worth it.

None of this is about archives; it’s about production: impact on the Internet is a direct function of what you have done recently, which is to say that the New York Times’ value is a function of its daily ongoing production of high quality content. Here’s the thing about AI, though: I wrote last month in Regretful Accelerationism about the possibility that AI was going to make the web (already an increasingly inhospitable place for quality content) far worse, to the potential detriment of Google in particular. That, by extension, makes destination sites that much more valuable, which is to say it makes the New York Times more valuable.

Indeed, that is why the section on hallucination works against the New York Times’ argument, if not legally then at least philosophically: sure, GPT-4 might have 95% of the Wirecutter’s recommendations, but who knows which 5% is wrong? You will need to go to the authoritative source. Moreover, this won’t just apply to recliners: it will apply to basically everything. To the extent the web becomes even more probabilistic and hallucinatory the greater value there will be for authoritative content creators capable of living on Internet time, showing their worth not by their archives or rigidity but by their ability to create continuously.

¹ The lawsuit also demonstrates how you can repeatedly ask ChatGPT to generate the next paragraph of a particular article, once it has been prompted in a similar way to the sandbox examples above.
² One interesting exception is that the lawsuit notes that “OpenAI made numerous reproductions of copyrighted works owned by The Times in the course of ‘training’ the LLM.”; in other words the lawsuit isn’t just attacking the final output but intermediary outputs during training.
