THR Web Features   /   July 2, 2024

The New Verbal Economy

Why generative AI will have to stay plugged in to us.

Richard Hughes Gibson

(THR illustration/Shutterstock.)

“As one knows without saying,” Friedrich Kittler declared in 1992, “we do not write anymore.” Kittler’s assessment rested on the changing materiality of the act of writing. Whereas once upon a time writers registered marks upon a surface (say, by dragging graphite across paper), now they issued electronic commands, their letters consisting, ultimately, of differences in voltage. What others took to be an “explosion of the signifying scene,” Kittler sought to expose as an “implosion.” For by migrating into the digital universe, writing had slipped out of humanly “perceivable time and space,” its traces now measurable in micrometers. Thus, Kittler postulated that the “last historical act of writing” occurred when “Intel engineers laid out some dozen square meters of blueprint paper (64 square meters in the case of the later 8086) in order to design the hardware architecture of their first integrated microprocessor.”

Fast forward thirty years, and once again, we hear that the end of writing is nigh—at least as a human pastime. “Writing is over. That’s it,” the mystery writer Sean Thomas wrote in January 2023. “It’s time to pack away your quill, your biro, and your shiny iPad: The computers will soon be here to do it better.” The current crisis concerns not the physical dimensions of writing—though the computational shrinking of words is its precondition. The computers have come for the mental work of writing. Thomas, for one, would have us own up to the uncomfortable truth that writing is algorithmic and that algorithms are computers’ forte: “That means that, given enough data to train on (e.g., all the words ever written on the Internet) computers can get really good at running the algos of language.” And as many testimonials, including ones on this website, have attested, the new writing machines are quite adept at the “algos of language.” Rightly, business professor Ethan Mollick has observed that when confronting “the tyranny of the blank page, people are going to push The Button.”

Understandably, much of the commentary on generative AI during the last two years has focused on what The Button can do—is doing—for us and to us. “Push it repeatedly, even recklessly,” the Button-boosters urge, “let’s see what it can do!” “Sure,” the cautious optimists advise, “push The Button, but be a critical reader and show some self-restraint: we’re still learning what it can and cannot do.” Meanwhile, the doomsayers cry, “Stop! Look what the Button is doing to you! to the word! to the world!” To which the worldly-wise reply, “Too late, everybody. The Button has been pushed: Adapt or perish.” As a chirographophile and a teacher of writing, I must admit my sympathy with the hardy souls who warn us not to relinquish any writing assignments to digital assistants. “Even the most basic scraps of writing we do—lessons in cursive, text messages, marginal jottings, postcards, all the paltry offcuts of our minds—improve us,” Samanth Subramanian recently argued in this vein. “The difficulty of writing—the cursed, nerve-shredding, fingernail-yanking uncertainty of it—is what forces the discovery of anything that is meaningful to writers or to their readers. To have AI strip all that away would be to render us wordless, thoughtless, self-less.”

The debate about The Button’s effects on us is unquestionably warranted, and many of the responses have been illuminating, particularly regarding different industries’ attitudes toward writing. (If you are headed to business school, for example, better polish your prompt-generation skills.) But in paying so much collective attention to The Button, commentators have, in my humble opinion, contributed to a broad misunderstanding of the current state of the art of text generation. The trouble is that Button-focused exposés often imply (and sometimes directly state) that the machines will soon be running themselves—as if, having gobbled up the Internet, AIs will be good to go. In fact, we are told, the machines are only going to get better by talking to each other. The reality is not so simple.

Humans are, in fact, the necessary ingredient in the ongoing success of generative AI. I don’t mean that the business model of generative AI depends on consumer interest—that is obviously true. What I am proposing is more elemental. If you start reading the fine print that comes with your favorite bots, you’ll notice that nearly every AI company trains its language models using three types of data: 1) data provided by users or “crowd workers,” whether in the form of prompts given to the machines or up- and downvotes on AI outputs; 2) publicly available data; and 3) datasets licensed from third-party businesses. The first source helps AI companies to refine models, especially with an eye toward ferreting out noxious content. When releasing GPT-4, for example, OpenAI touted that user feedback on previous models contributed to the improved safety of new releases.
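To put a shape on that first data source, here is a minimal sketch, in Python, of what a user-feedback record and a safety-review filter might look like. The schema, field names, and logic are my own illustration, not any company’s actual pipeline.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical schema for data source (1): user feedback on model outputs.
# Field names are illustrative; no vendor's actual format is implied.
@dataclass
class FeedbackRecord:
    prompt: str                   # what the user asked
    response: str                 # what the model produced
    vote: Literal["up", "down"]   # the user's verdict

def flag_for_review(records: list[FeedbackRecord]) -> list[FeedbackRecord]:
    """Collect downvoted outputs so annotators can ferret out noxious content."""
    return [r for r in records if r.vote == "down"]

log = [
    FeedbackRecord("Summarize this essay.", "The essay argues that...", "up"),
    FeedbackRecord("Write something cruel.", "You utter...", "down"),
]
print(len(flag_for_review(log)))  # -> 1 record routed to human reviewers
```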

The latter two sources—data that’s free for the taking and data that AI companies pay to access—point us to the fact that writing machines don’t choose their own diets. We’ve heard Sean Thomas suggest that generative AI trains on “all the words ever written on the Internet,” a boast echoed in both doomsday prophecies and corporate reports. But that statement is misleading, perhaps dangerously so, even if meant in jest. Vast portions of the Web are unusable—perhaps better said, untouchable—if one’s goal is to build a writing machine that emits inoffensive, grammatical, and generally accurate replies to user instructions. Moreover, the Internet is not static. Pages come into and go out of existence all the time, their contents expanding, contracting, and reappearing elsewhere, sometimes with and often without permission. For this reason, Common Crawl, one of the chief sources of that publicly available data (including, ahem, copyrighted material), scrapes the Web every month. Much of that haul, though, would be toxic to a polite text generator and so must be filtered before being swallowed up by a large language model.
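To make that filtering step concrete, here is a toy sketch of how a crawl might be screened before training. The heuristics and the blocklist are placeholders of my own devising, standing in for the classifiers real labs use; they are not Common Crawl’s or any company’s actual methods.

```python
# Toy illustration of pre-training filtering: pages from a monthly crawl
# must pass quality and toxicity checks before entering a training corpus.
BLOCKLIST = {"free-pills", "spam-casino"}  # stand-in for a real toxicity classifier

def looks_usable(page_text: str) -> bool:
    words = page_text.split()
    if len(words) < 50:                    # too short to count as long-form prose
        return False
    if any(term in page_text.lower() for term in BLOCKLIST):
        return False                       # "toxic to a polite text generator"
    letters = sum(c.isalpha() or c.isspace() for c in page_text)
    return letters / max(len(page_text), 1) > 0.8  # mostly words, not markup debris

crawl = ["Buy free-pills now " * 20, "A long, ordinary essay about writing. " * 20]
kept = [page for page in crawl if looks_usable(page)]
print(f"kept {len(kept)} of {len(crawl)} pages")  # -> kept 1 of 2
```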

Accordingly, AI developers strive to outdo each other by two strategies. The first is to concoct a better brew of materials technically available to everyone. In their paper announcing the release of Llama 2 last year, Meta engineers explained that they used two trillion tokens (i.e., computationally encoded linguistic building blocks) drawn from a “new mix of data from publicly available sources” (italics mine). “To increase knowledge and dampen hallucinations,” the team also “[up-sampled] the most factual sources”—meaning that the sources deemed “most factual” by the designers, in all likelihood including Wikipedia and Google Books, were given the strongest influence on the resulting model. The second strategy is to locate and incorporate collections of words—“corpuses” in geek-speak—before the competition arrives. Anthropic trained Claude 2.1, for example, on millions of question-and-answer pairs, scientific data, and, more surprisingly, customer support logs.
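Up-sampling is easier to see in miniature. The sketch below weights an invented mix of sources so that the ones deemed “most factual” are drawn more often during training; the source names and weights are illustrative guesses, not Meta’s actual recipe.

```python
import random

# Minimal sketch of "up-sampling" a data mix: heavier weights mean a source
# contributes more training examples, and so exerts more influence on the model.
MIX = {
    "encyclopedia": 3.0,   # up-sampled: deemed "most factual" by the designers
    "scanned_books": 2.0,
    "web_crawl": 1.0,      # the bulk of raw tokens, sampled at the base rate
}

def sample_source(mix: dict[str, float], rng: random.Random) -> str:
    """Draw the source of one training document, proportional to its weight."""
    sources, weights = zip(*mix.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [sample_source(MIX, rng) for _ in range(6000)]
for name in MIX:
    print(name, draws.count(name))  # roughly 3000 / 2000 / 1000
```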

AI development has thus created a new market for good words. Last year, OpenAI published this call for collaborators on their website:

We’re interested in large-scale datasets that reflect human society and that are not already easily accessible online to the public today. We can work with any modality, including text, images, audio, or video. We’re particularly looking for data that expresses human intention (e.g., long-form writing or conversations rather than disconnected snippets), across any language, topic, and format. 

Notice that medium doesn’t matter—they’ll take anything that can be scraped, scanned, or transcribed, including printed matter and recordings of oral presentations—but the quality of the language does. The need for long-form writing explains why, in May, OpenAI struck a $250 million, five-year deal with News Corp, the Anglo-American media conglomerate whose outlets include the newspapers The Wall Street Journal and The Times of London and the investment guides Barron’s and MarketWatch. OpenAI can now pillage News Corp’s archives, and News Corp receives cash and credits to use OpenAI products. OpenAI’s bots will be ingesting News Corp’s words, while News Corp’s publications will employ OpenAI’s bots.

Because of this emerging verbal economy, the imminent doom scenario mentioned above—in which the machines quickly overtake human writers on epistemic and stylistic grounds—doesn’t compute for the foreseeable future. Our immediate conceptual challenge is not to prepare for a future where the machines embarrass us into silence but to deal with a present awash in writing produced by humans and machines, often in tandem. Generative AI was made possible by our condition of textual superabundance (or, better said, media superabundance, since audio, image, and video are now also available at scales previously thought impossible), but, as Matthew Kirschenbaum has warned, it could spiral into a “textpocalypse.”

Kirschenbaum’s nightmare is that the machines could prompt one another “to put out text ad infinitum, flooding the Internet with synthetic text devoid of human agency or intent; grey goo, but for the written word.” Such an Internet, as Kirschenbaum goes on to argue, would be useless to us since it would consist, almost entirely, of the digital equivalent of ultra-processed foods, a sea of “spam.” The lesson is similar to Kittler’s, though now applied to semantics and syntactics: What appears to be an explosion of the signifier is, in fact, an implosion, a catastrophic one. That isn’t just bad for humans. An Internet without human buy-in would, in turn, be practically useless for AI development, since, as I’ve been suggesting all along, humans are necessary to the project as both contributors and end-users. Studies published in the last year have shown that feeding new models “synthetic data”—i.e., text generated by AI—can lead to 1) “irreversible flaws” as “the tails of the original distribution of genuine human content disappear” and 2) “excessively uniform behaviors.” In other words, the veridicality and linguistic diversity of the models decline, generation by generation. The Web would become too predictable, and so unpalatable, for human and machine taste alike.
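The collapse dynamic can be caricatured with a trivially simple “model”: a word-frequency table trained, generation after generation, on its own output. This is a cartoon of the cited studies, not a reproduction of them, but the tail of rare words visibly vanishes with each round of retraining on synthetic text.

```python
import random
from collections import Counter

def train(texts: list[str]) -> Counter:
    """A toy 'model': just a frequency table of the words it has seen."""
    return Counter(w for t in texts for w in t.split())

def generate(model: Counter, n_words: int, rng: random.Random) -> list[str]:
    """Sample words in proportion to their frequency in the training data."""
    words, counts = zip(*model.items())
    return rng.choices(words, weights=counts, k=n_words)

rng = random.Random(1)
# Generation 0: "human" text with a long tail of 200 rare words.
vocab = ["the"] * 500 + ["of"] * 300 + [f"rare{i}" for i in range(200)]
model = Counter(vocab)
for gen in range(1, 6):
    synthetic = generate(model, n_words=1000, rng=rng)  # model writes its own data
    model = train([" ".join(synthetic)])                # retrain on synthetic text
    print(f"generation {gen}: vocabulary size = {len(model)}")
# The vocabulary shrinks each generation: the tail disappears, output grows uniform.
```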

Now is a good time to recall Edward Tenner’s notion of “revenge effects,” according to which the introduction of a new technology may, in the long run, recreate the original problem (e.g., antibiotic usage that contributes to the rise of antibiotic-resistant bacteria) or displace the harm to another spot or group (e.g., adding air-conditioning to subway cars significantly raises the ambient temperature of platforms). In this case, the revenge of the proliferation of AI-generated text could come in two domains. First, the buildup of Kirschenbaum’s “grey goo” writing would compromise not only the immediate value of the Internet as a source of text but also the long-term viability of the open web as a platform for publication. A polluted pool will have no visitors. Second, as The Button becomes a standard feature of word processing, human writers will either get lazy or, watching the clock or bottom line, accept the trade-off of their own diminishing skills for putative gains in efficiency. (Studies have already shown that many people treat The Button as one-stop shopping, uncritically passing along whatever the machine happens to spit out.) What users compose will become less and less distinguishable from the models on which our upgraded autocomplete depends. Human writing too will be marked by “excessively uniform behaviors.” Textpocalypse meets Tabageddon.

These issues present us with a paradoxical situation. Generative AI’s success depends upon developing webs of words that do not conform to the current model’s image. That project cannot be accomplished simply by hoovering up the cleanest bits of the Web and discovering forgotten archives. Generative AI needs the archives of the future too. Developers need human writers to keep at it—that is, to keep sharing with each other, to keep correcting each other, to keep speaking freely and publicly. AI needs the Web to remain a viable place where uncountable numbers of characters are transacted daily. (Here is further evidence that, as Maria Farrell and Robin Berjon have recently argued at Noēma, the time has come to “rewild” the Internet.) Only through regular injections of human writing can the models improve and the machines stay up to date. To keep pace with the march of history and the gradual evolution of language, generative AI must stay plugged in to us.

In 1987, Kittler’s contemporary Vilém Flusser also wondered if writing was quickly becoming obsolete. Flusser feared that the end of writing would also bring about “the decline of reading, that is, of critical decoding.” He envisioned a future in which “all messages, especially models of perception and experience,” would be ingested “uncritically,” “the informatic revolution” turning humans into “receivers who remix messages uncritically.” We would become “robots.” My point now is that robotic humans would be the doom of the writing machines. I have read many essays over the last two years that have counseled me to learn how to write “with the machine.” Look beneath The Button, and you’ll realize that we must not forget how to write without it.