More than four decades ago, the Italian historian and critic Carlo Ginzburg argued that the modern disciplines of knowledge that had arisen in the late nineteenth century relied on the interpretation of clues. With medicine—symptomatology—as the paradigm, these disciplines were concerned with deciphering signs, wrangling indications from seemingly mute traces. The target of the interpretation was individual persons and behaviors, as the example of criminology showed. The means was hypothesis, a sort of divination with an “inevitable margin of hazardousness.” [1: Carlo Ginzburg, “Clues: Roots of a Scientific Paradigm,” Theory and Society 7 (1979): 281.] The result was social control—empirical knowledge of even unintended behaviors allowed for prediction and correction. Clues, Ginzburg concluded, had become paradigmatic for the sciences in the 1870s and 1880s. And the most proximate complement to medicine in this regard was philology. Criminology and literary interpretation had a logic in common. Today we would call it the logic of data.
Alan Turing never used the word data in the 1936 paper that defined computation and launched us into the digital world we live in today. [2: Alan M. Turing, “On Computable Numbers, with an Application to the Entscheidungsproblem,” Proceedings of the London Mathematical Society, ser. 2, 42, no. 1 (published 1937, written 1936): 230–65; https://www.csee.umbc.edu/courses/471/papers/turing.pdf.] The term appears occasionally in the work of Turing’s peer John von Neumann, after whom modern computer architecture is named. The term, which already referred to information stored on paper, the results of bureaucratic labor, seeped slowly into computer discourse in the 1940s and ’50s, even as data became literal inputs into room-sized machines like the ENIAC, entered on punched cards. Seven decades later, those input numbers have gained a life of their own. They swirl around us, only sometimes touching down long enough for us to make any sense of them. We use these numbers as signs to navigate the world, relying on them to tell us where traffic is worst and what things cost. And because we do this, data has become a crucial part of our infrastructure, enabling commercial and social interactions. Rather than just tell us about the world, data acts in the world.
The Latin meaning of data is “givens,” but data in its modern meaning refers not to gifts of nature but to the input and the output—endlessly feeding back into one another—of digital machines. These machines send us messages—push notifications in the form of little hermeneutic puzzles, signs to read off screens. Data is both representation and infrastructure, sign and system. Think of the just-in-time logistics of Amazon’s delivery game. You click on a few icons to complete a purchase, and a series of events begins—involving robots, deplorably underpaid and overworked laborers, and parcel tracking. Data was the channel along which the prices were set and the items offered to you as icons on your screen. But it’s also the channel in which all the supply-side decisions are made, often automatically. Warehouse stocks and delivery routes change, and so do prices. Data makes all of this possible, but it is also the medium in which it is carried out—as media theorist Wendy Chun puts it, data “puts in place the world it discovers.” [3: Wendy Chun, “Queerying Homophily,” in Pattern Discrimination, ed. Clemens Apprich et al. (Minneapolis, MN: University of Minnesota, 2019), 62.] Even the labor is done at the command of data, which both represents and determines the process. The numbers Turing put into the machine have become an array of signs about the world that also act in the world. We read them and act according to them; algorithms predict and influence our behaviors by means of indexes wrung from data.
The Subtlety of Data
Data’s assumption of the power to sign should be a Roman triumph for the interpretive wing of the humanities. Suddenly the problem of interpretation is unavoidable, the signs to be interpreted the result of digital data processing. One might expect, in this circumstance, a renaissance of semiotics, the study of how signs function. The ubiquity of digital information now rests on petabytes of data in circulation combined with powerful algorithms required to distill that data into readable or usable forms. This situation should underwrite a resurgent humanities, emboldened by its prediction of a postmodern world filled with unpredictable and fragmented signs, sure of its capacity to write systemic critique in a period apparently more suited to its tools than any other in human history. The core competencies of the humanities are the analysis of representational forms and the systemic critique of the meanings and values at least implicitly embedded in those forms. The study of data, as sign and as infrastructure, combines these vocations. Yet an approach to the metaphysical subtlety of data remains elusive.
The most recent in an exhausting, seemingly endless series of controversies about the digital humanities is symptomatic of how far we are from developing that approach. In her elegant 2019 essay “The Computational Case against Computational Literary Studies,” Nan Z. Da has asserted the existence of a “fundamental mismatch between the statistical tools that are used and the objects to which they are applied.” [4: Nan Z. Da, “The Computational Case against Computational Literary Studies,” Critical Inquiry 45 (2019): 601.] Computational literary studies, or CLS, Da’s own term, is a limited field that attempts to develop informative results in literary interpretation using data-processing techniques, which she distinguishes from a big-tent “digital humanities” that might include critical and theoretical approaches to data. Scholars such as Ted Underwood and Andrew Piper run natural language processing (NLP) algorithms on large literary corpora to ask questions about form (see Piper’s work on Augustine’s Confessions and the eighteenth-century novel) or genre (as prompted by Underwood’s interest in detective fiction and science fiction). [5: Andrew Piper, “Novel Devotions: Conversional Reading, Computational Modeling, and the Modern Novel,” New Literary History 46 (Winter 2015): 63–98; Ted Underwood, “The Life Cycles of Genres,” Journal of Cultural Analytics, May 23, 2016, culturalanalytics.org/2016/05/the-life-cycles-of-genres/.] As Da points out, the algorithms thus far in use work nearly exclusively by breaking texts into words or word pairs, counting these, then visualizing relationships among the units. The question is how we get from counting and predicting words to a sense of literature that can sustain interpretation.
Da doesn’t think we can. If we parse data enough to make certain it is telling us what we want to know, she says, then it gives us nothing we could not learn by reading; if data processing tells us something about literature we could not otherwise have known, it is either statistically insignificant or plain wrong. Da develops a range of cases that divide into “no-result” papers and “wrong result” papers. Her conclusion is that there can be no stable relationship between literary interpretation and data.
The controversy that followed the publication of Da’s paper was instructively dull. [6: See, in particular, “Computational Literary Studies: A Critical Inquiry Online Forum,” Critical Inquiry (blog), April 1–3, 2019, https://critinq.wordpress.com/2019/03/31/computational-literary-studies-a-critical-inquiry-online-forum/.] A number of respondents attempted to show technical errors in Da’s presentation of data science or in her reproduction of the boutique algorithms used in her case studies. To date, these skirmishes seem to me to have come to a draw. The underlying complaint, however, is more interesting. Those who use CLS methods allege that Da has a “rigid” understanding of data science, and that her proposal is a sort of “policing” of disciplinary boundaries that would restrain innovation. This objection gestures at the internal interdisciplinarity of data science, which itself has an open-ended understanding of the relationship between data and domain. For the data scientist, however, everything rests on fitting technique to object, on finding the right representational form for the algorithm to take to explain something about the domain. But what are the rules of the domain in the case of literature? Because literary studies long ago gave up the project of establishing a single stock of stable terms to apply to interpretation—one can think back to structuralist projects like Vladimir Propp’s Morphology of the Folktale—the CLS approach seems to sneak this stock of interpretive categories through the back door. Demonstrate some data stability in the word count in a genre or a period, then connect the stability to that genre or period, creating a tidy interpretive schema that rests entirely on the validity of the genre or period concept.
Some scholars, such as Richard Jean So and Hoyt Long, have tried to go beyond this, exploring what the algorithm seems to “get wrong” in terms of these categories as a way to explore, for example, the long-held thesis of the influence of haiku on Modernist poetry. But Da argues that they, too, get caught in the bottleneck. Although their goal is to let their algorithm “learn” the distribution of both haikus and other short poems, they are forced to set strict parameters that “overfit” the data, so that when Da runs the algorithm on another data set—of Chinese haikus—it badly misclassifies them. I would hardly call this disciplinary policing; it’s more like empirical testing. But it reveals a larger set of issues.
Data scientists have long distinguished between work that explicitly models data for the given domain and algorithms meant to explore the domain and turn up any patterns that might lurk within. A stronger model will give you a closer sense of how to interpret your data; a more flexible one—say, a neural net—will give you better predictive accuracy, but maybe leave your hermeneutic efforts foundering. [7: Leo Breiman, “Statistical Modeling: The Two Cultures,” Statistical Science 16, no. 3 (2001): 199–231.] If an algorithm finds patterns in data without explicitly stated parameters, it might take a generation of scientists to figure out what that stability means. Da’s experiment suggests that there’s no real signal in So and Long’s data set, but even if there is, the interpretive question remains wide open, because the relationship between the algorithmic representation and the semiotics of the poems has never been analyzed. CLS adds forms of representation to the already difficult question of literary interpretation. It’s hard to see how this is supposed to reduce the complexity of hermeneutics or set it on any stable basis.
Data in Perpetual Motion
Venturing into the no man’s land between top-down programs and learning algorithms makes the semiotic problem that much more obvious, and that much more difficult. As the filmmaker and theorist Florian Cramer puts it, interpretation has become “a battleground between quantitative analytics and critical theory.” [8: Florian Cramer, “Crapularity Hermeneutics: Interpretation as the Blind Spot of Analytics, Artificial Intelligence, and Other Algorithmic Producers of the Postapocalyptic Present,” in Apprich et al., Pattern Discrimination, 37.] Digital humanist Johanna Drucker makes a similar point about graphics: “The representation of knowledge is as crucial to its cultural force as any other facet of its production. The graphical forms of display that have come to the fore in digital humanities in the last decade are borrowed from a mechanistic approach to realism, and the common conception of data in those forms needs to be completely rethought for humanistic work.” [9: Johanna Drucker, “Humanities Approaches to Graphical Display,” Digital Humanities Quarterly 5, no. 1 (2011), http://www.digitalhumanities.org/dhq/vol/5/1/000091/000091.html.] CLS experiments seem like an unintentional imitation of the digital situation more broadly, adding computationally produced signs to the world in the hopes of making sense of all the data already in circulation, like a diagram for a perpetual motion machine.
Data added to data will never produce its own parameters or set its own interpretive stakes, and for this reason, digital humanists cannot excuse themselves from the problem of what counts as literary in the first place, what we used to call “literariness.”
To be sure, words can be counted and statistical relationships can be stated within and across texts and corpora. It’s just that to make that data useful for literary interpretation, we’d have to be able to distinguish, among the data, patterns that belong to language as such, to individual languages in historical contexts, and to text production across a nearly infinite variety of generic styles of prose—the vast majority of which are not literary. Then, at the tail end of all this, we would have to distinguish the data patterns that are solely literary in nature. Data patterns must be unequivocally attached to a single object or set of objects to be analytically useful. But language itself—not to speak of mood, irony, or allegory—would not serve its purpose if it were equally unequivocal. We can restrict language technically and conceptually, but this is the exception: The very porosity of language means that datafication will never capture it entirely. Equivocation is what makes language useful, and what the linguist Roman Jakobson called the “poetic function” of language is a sort of play with or meditation on that equivocation. Jakobson himself was deeply influenced by early information theory and cybernetics, but he used these to make sense of reference and representation, not to pretend that culture was an entirely empirical object. [10: See, e.g., Roman Jakobson, “Linguistics and Communication Theory,” in Selected Writings II: Word and Language (Paris, France: Mouton, 1971), 570–80.]
While data sets about language do not reduce interpretive complexity, they are certainly proliferating. Powerful NLP algorithms like the so-called generative pretrained transformer, or GPT-2, can write convincingly in many genres, including that of the newspaper article, which suggests that we will soon be dealing with mind-bending effects stemming from the increasingly large data sets of natively digital language. [11: See Jacob Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (version 2), arXiv, May 24, 2019, https://arxiv.org/pdf/1810.04805.pdf. For a nontechnical introduction, see Rami Horev, “BERT Explained: State of the Art Language Model for NLP,” Towards Data Science (blog), November 10, 2018, https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270.] GPT-2 and other models are capable of using unprecedented amounts of text for training, and there is no doubt that they are both finding patterns in large linguistic data sets and reproducing them. We are set to experience a world in which such algorithms have a ubiquitous presence in our informational infrastructure. The extended lesson from Da’s work, for me, is that we need to study the relationship among algorithmic forms of representation, digital signs, and literary language. Nothing suggests that algorithms themselves will help us do this.
Because data is both representation and infrastructure, the semiotic problem is immediately tied to the political problem, and even to metaphysical problems. But there is little clarity about the relationships among these areas, even as a sense of urgency permeates both the humanistic fields that study them and international political discourse. Here, too, a core competency of the humanities is called for: systemic critique. Ginzburg made a prescient connection when he suggested that clues were used as a form of social control. Now we find clues in data and, using an automated form of divination called “prediction algorithms,” train them back on the world, identifying faces, targeting advertisements, and predicting elections. These systems put the world in place as much as they represent it, making representations into consequences, signs into supply chains. The digital humanist must study this process of transformation, the semiotic channels along which so much bad politics actually gets done. For that, condemnation is not enough. Just as in CLS, there is a tendency to look away from the forms of representation when it comes to the politics of data.
Data: Abstract and Concrete
Shoshana Zuboff has proposed that we live in an “age of surveillance capitalism,” [12: Shoshana Zuboff, The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power (New York, NY: PublicAffairs, 2019).] in which the massive amount of data we generate in our constant use of electronic devices is turned to profit by large corporations—the Four, or the Five, as they are sometimes called: Amazon, Apple, Alphabet (Google’s parent company), Facebook, and occasionally Microsoft. To do this, these digital giants create maps of behavior, both granular and broad, effectively enclosing consumer habits and social commerce alike in the manipulative web of capital.
Zuboff tells the story of how Google’s chief economist (then a consultant), Hal Varian, and others became aware, in the aftermath of the dot-com bust of the early aughts, that they were sitting on a gold mine. People using the search function alone were generating untold potential value in the form of behavioral data—interests, purchases, and so on—if only it could be realized. The principal way to do that would be to target ads at users, and for that Google needed to track those users. Zuboff shows how companies like Verizon made nonoptional the different forms of identification that allowed for this tracking, and how even when opt-outs (with impossibly unreadable terms) were offered, invisible tags remained. Zuboff calls the result “behavioral surplus,” leading to “behavioral prediction markets.” [13: Ibid., 100ff.] Targeting aspects of individual behavior by scouring real-time data for clues, surveillance capitalism reads like an expanded version of Ginzburg’s conceptualization of “clues” as a tool for social control.
The story is disturbing, but hardly surprising. As a steady stream of studies has shown, digital data processing has very much traversed the boundary from cyberspace to meatspace and is being used to make policy and managerial decisions from the university to the municipality to the global supply chain. This data cuts across social categories such as race and class. [14: As scholars such as Safiya Noble, Virginia Eubanks, and Frank Pasquale have shown, sites of social struggle like race and class are now filtered by search, targeted advertisements, and judicial algorithms. See Safiya Umoja Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (New York, NY: New York University, 2018); Virginia Eubanks, Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor (New York, NY: St. Martin’s Press, 2017); Frank Pasquale, The Black Box Society: The Secret Algorithms That Control Money and Information (Cambridge, MA: Harvard University Press, 2015).] Zuboff’s account is instructive for its generality, capturing a whole paradigm of critical approaches that think of data as cutting abstractly, even arbitrarily, through our lives and communities, along the common path of the market. As political and commercial currents coalesce, the question of data impresses itself all the more heavily on the humanities. But data is not only “arbitrary.” It has also gained the feeling of necessity, since we have given it agency in our infrastructures.
Data is both abstract and not. On the one hand, data is numbers, utterly indifferent to the reality it distills. The informational value of these numbers derives from clustering them, giving them syntax, and automating the exploration of that syntax. But their value also comes from their origin, as the recent mantra “All data are local” is meant to illustrate. [15: Yanni Alexander Loukissas, All Data Are Local: Thinking Critically in a Data-Driven Society (Cambridge, MA: MIT Press, 2019).] The picture of the world that data processing delivers is always imperfect, often with disastrous effects for the most marginalized: those subjected to partly automated judicial decisions that turn out to be racially biased, the redlined, the poor. [16: See Julia Angwin et al., “Machine Bias,” ProPublica, May 23, 2016, https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing; Ruha Benjamin, Race After Technology: Abolitionist Tools for the New Jim Code (Medford, MA: Polity, 2019).] Yet the idea that digital abstraction is the causal factor in the disaster, that data systems create bias, is a faulty supposition. Google’s search function may deliver blatantly skewed results that advance the twisted logic of racism, while automated loan decisions perpetuate racialized poverty. But these injustices did not suddenly spring into being with the digital computer or the database. Data does not create bias on its own, but extends and morphs preexisting bias captured in things like credit reports and actuarial tables as it is fed into search engines.
What matters is not whether data is abstract but how it interacts with other representational systems and the bodies and infrastructures engaged in them. Preexisting representational systems—the bureaucratic capture of census data, market indexes, and the like—already shot through with problems of misrepresentation, combine with the warp of digital semiosis to make a multilayered abstraction, a set of overlapping yet muddled representations in which we nevertheless place enormous institutional and political trust. Yet mere condemnation may be answered by the techno-solutionists of Silicon Valley with optimism: There’s a problem? Let’s make the tools better! The humanities must go beyond the deadlock of accusation and boosterism.
“Behavioral surplus” names the warp of digital data but cannot distinguish it from the weft of capitalism. Although Zuboff makes little of the connection, the very phrase invokes Karl Marx’s notion of “surplus value”: the difference between the value conferred on the commodity by labor and the value realized in exchange. Zuboff writes that “digital connection is now a means to others’ commercial ends,” replacing Marx’s “old image of capitalism as a vampire that feeds on labor” with a “surveillance capitalism [that] feeds on every aspect of every human’s experience.” [17: Zuboff, The Age of Surveillance Capitalism, 9.] Yet Zuboff’s suggestion of a road not taken, what she calls “advocacy capitalism,” [18: See, e.g., Zuboff’s presentation, “Making Sense of the Information Economy: A Mixed Record?” (video), at “What’s Wrong with the Economy—and with Economics?,” New York Review of Books conference, New York, New York, March 2015, https://www.nybooks.com/daily/2015/03/29/whats-wrong-with-the-economy/. On the weakness of the “advocacy capitalism” conception, see the comprehensive review by Evgeny Morozov, “Capitalism’s New Clothes,” The Baffler, February 4, 2019, https://thebaffler.com/latest/capitalisms-new-clothes-morozov.] rings as hollow as the proposal put forward this year at Davos to shift from shareholder to “stakeholder” capitalism. Although it seems like a policy suggestion, it is more like a utopian ideal, an all too blithe notion that capitalism can be simply restructured from the ground up. In the face of a digitalizing global economy, policy proposals often have this empty feeling. Worse, they distract us from the representational groove in which capital now travels: the courses it follows in making signs into dollar signs.
Data: Representation and Infrastructure
The vampire metaphor betrays Zuboff’s hand, and points to a deeper symptom of this kind of critique. It suggests that we view data manipulation as a modern Moloch consuming the bodies of workers, as in the iconic scene in Fritz Lang’s 1927 film Metropolis. But this cedes too much to the “machine,” which is really a dispersed and uneven set of global infrastructures. This system does not “see” so much as it captures, to use a distinction made by the media scholar Philip Agre. [19: See Agre’s influential essay, “Surveillance and Capture: Two Models of Privacy,” The Information Society 10, no. 2 (1994): 101–27, https://www.tandfonline.com/doi/abs/10.1080/01972243.1994.9960162.]
Cameras and facial recognition software are certainly forms of visual surveillance, but digital infrastructure really runs on data capture, which locates and configures packages, individuals, and behaviors. Think of the Amazon supply chain, a closed loop of data generation and interpretation, with the workers, producers, and consumers almost incidental to the profitable circulation of data and signs. This is not exactly surveillance, which is a visual metaphor about bodies. Rather, following the capture metaphor, it is about the inscription and manipulation of information as data, not in any way limited to physical movements. We are not just being tracked; we are being immersed in an unevenly deployed system of data capture. But that system is not exactly a panopticon. A lot of what goes wrong in data-driven effects—misrepresentation as much as misallocation—has to do with overlapping forms of captured bits of abstract data misapplied to each other and to us. We live in a sort of intersectionality of abstractions, overlapping systems of representation and infrastructure that are often badly out of sync with one another. [20: Chun’s exploration of “homophily,” the data practice that groups like with like, approaches the problem in this way, treating data science as entangled layers of signification. See Chun, “Queerying Homophily,” 78–79.] Neither condemnation nor utopia will do.
Data, unlike Ginzburg’s clues, targets more than individuals. It can be used to simulate systems and events, populations and epidemics. For this reason, it is able to cast a far finer, and far less clear, web of social control than the disciplines of the turn of the twentieth century. So while data quantifies us, making us feel abstract, alienated, and faced with a more powerful “system” than ever before, perhaps it is not with regard to the ever present feeling of dehumanization that our analytical skills are most needed. To be sure, our political energy should be directed there. But data’s dual aspect as both representation and infrastructure constitutes a sort of metaphysical fulfillment of the prophecies of postmodernism. What the postmodernists described was a world of simultaneous fragmentation and lockdown, with signs floating chaotically through virtual spaces, seeming to gain the upper hand over “material” infrastructures, but ultimately reinforcing systems of control and channels of power—or even inventing new ones. That world seemed improbable, even fanciful, to many during the heyday of the theory, but it is utterly obvious now, so fundamental that it somehow still evades our conceptual grasp. Our most immediate task is to take the measure of this semiotic metaphysics—to calibrate it in terms of its digitally processed and circulated signs.
We have automated the society of clues to act on its own divinations, with consequences far beyond the individual. We are not dealing with one system anymore, but instead with widely diverging systems, from industrial production to last-mile delivery to the political economics of platforms to the political speech that takes place on platforms, piling abstraction upon abstraction. Both the politics and the use of algorithms need something like what the young Jean Baudrillard called “a critique of the political economy of the sign.” [21: Jean Baudrillard, For a Critique of the Political Economy of the Sign (St. Louis, MO: Telos Press, 1981). First published 1972.] This work, which we must take up in spite or even because of downward pressure on the humanities and the headwinds of capital interests, will define our generation.