Few technological or scientific innovations resemble DALL·E 2, the natural language processing and computer vision model. It can be thought of, in a sense, as basic research: It has no real commercial use case, but its creators expect the program to be a building block of future research and a framework for the development of new ideas. In short, they want us to take it and go have fun with it. Most building blocks of basic research, however, do not have a wildly popular public laboratory like OpenAI, the company that developed DALL·E 2 (as well as the now much-discussed AI text generator, ChatGPT). Since July, DALL·E 2’s public beta has attracted enormous interest. And parallel or competing models, such as Stable Diffusion and Midjourney, have also built communities of hobbyists and enthusiasts on Reddit and in Discord to share tips, tricks, and their most successful results.
Such natural language processing and computer vision models raise questions about what advances in artificial intelligence might mean for the practice of art and design. OpenAI explains: “Our hope is that DALL·E 2 will empower people to express themselves creatively.” Though many of its supporters elide the point, much could be said about the significant differences between two modes of creativity: One creates a work of art, the other composes text prompts. But OpenAI’s second, and less discussed, claim is that “DALL·E 2 also helps us understand how advanced AI systems see and understand our world, which is critical to our mission of creating AI that benefits humanity.” It is this crucial phrase, “see and understand,” that has preoccupied researchers in AGI (artificial general intelligence) for some time. What does a computer see, and when it does, what does it help us see?
However fantastically complicated its machinery for playing games with image combinations, any model is only the sum of what it is trained on. In the case of DALL·E 2, the model is trained on a comprehensive corpus of humanity’s collected visual output. Some hopeful proponents call it a kind of imaginative mirror: something that can reflect our own mental images back to us. We could think about our own visual imaginations in terms of models, too; they are trained on some smaller, less complete set of that same visual output. There is a potential gap, though, between our imagination and the computer’s “dreams about electric sheep.” Can this model cross that gap?
Spend any time tapping prompts into a generative art engine and you will inevitably discover a sense of the uncanny in the images it returns. The nearest analogue I can think of is the slightly unsettling sense you have when someone isn’t quite looking you in the eye, an almost-but-not-quite connection of understanding that feels as if it’s running two or three parallel tracks to the left. The images it produces recall the semiconscious and fretful thoughts one has when sick and feverish and trying to sleep. The fevered mind can sometimes grind along in a partly awake stupor, spinning repetitive, busy, always-whirling and never-resolving dreams about some technical, frivolous fragment of the past day. You can sense the shape of a meaning on the other side of some foggy barrier, but it never quite resolves.
I am reminded of an elderly man I knew in the latter days of his terminal battle with dementia, shambling out to his barn, the place where he had done so much and built so much. In his final months, all he could do was fidget with his tools and bric-a-brac, impotently, repetitively, frustratedly, now and again passing into a rage because he couldn’t quite find that, just, I was trying, there was that one, if we put this here, but....
What I am trying to say, indirectly, is that in the image output of computer vision, the technique is there but the techne is not. In other words, the technical mastery of the generated image is unquestioned. It has antecedents in the best of humankind’s art. But the underlying chords of intention between the subject of a work and the techniques used to represent it are optional, arbitrary, and easily swappable. Any work of art that conveys meaning carries that latent meaning within it for the viewer to discover. That meaning is reconstituted somewhere between what the artist brought to the work and what the viewer brings as well. A generated image, too, carries a hidden meaning. But in this case, we know for certain that the hidden meaning was written down somewhere by the prompter.
Instead of asking what inspired a work or what attracted an artist to a particular form, medium, or choice of content and context, we think about what text prompt could possibly have made the program produce that. It would be like looking at a great work by Caravaggio and wondering, How did he pull off those brushstrokes? Did he manage to create that pigment from the admixture of such and such ingredients? Could the material and quality of the canvas have induced this visual texture? These are all worthwhile inquiries, but foregrounding them impoverishes art. The inquiry and engagement with the piece never move beyond technique.
As a trained architect, I joke sometimes that buildings are ruined for me because I tend to look for the expansion joints or the sloppy, hidden details. It is rare that I am so struck breathless by a building that I cannot but wonder at it. I fear that the same is becoming true of media in general, but I am also one of these artists-as-prompters. I received my OpenAI key a few months ago and have since spent time poking the machine to see what happens. Curiously, when first faced with that blank search bar, surrounded by pat, twee suggestions for Instagram-ready square images, I drew a blank. What could I ask for that I could not already see? It was like being the discoverer of the genie’s bottle, faced with the impossible task of choosing three (or in my case, 50 GPU credits’ worth of) wishes. What did I want? I will confess that my first prompts were written by family—“Hey, Dad, give me a scenario.” But I eventually turned to my old graduate thesis, inserting selections from the story I wrote of an island in the Mississippi River, paragraph by paragraph, into the prompt.
The results took my breath away. There is a strange and alien hunger that arises when you see your inchoate mental images rendered before your eyes, surreal and warped and depthless and square but, most importantly, passing that gut test. The details weren’t important and weren’t even right, but taken as a whole, the images brought to simulated almost-life the way I myself had seen and imagined the island. Whether these images actually achieved a representation of the mental picture itself or merely settled for a kind of Frankenstein’s monster, stitched together and synthesized from disparate parts, didn’t matter. If I tried to represent my mental picture of the island myself, with my own hands, would I be able to convey it any more effectively? I had actually tried, across collage and drawing and rendering and model and even computer simulation, to do just that, to capture the lightning of that island in a bottle. And with four grainy images of murky provenance, I was struck by that same lightning.
Perhaps the thing DALL·E 2 does for us most of all is listen to us, with its far-seeing knowledge and its synthetic wisdom, and when we ask it, Do you know what I mean?, it can nod and say, “Something like this?”—a private and inscrutable conversation that also happens to yield something publicly examinable. I don’t know exactly what you mean when you describe something to me. But maybe the machine has an idea, because so much of what you have gone through—your influences and literature and favorite art—it has, too. It may be that, if nothing else, what computer vision has done is given us a way to look into our own selves and reflect, in some churned and gnawed and reconstituted way, an explicit, pixel-by-pixel representation back to us and anyone else who cares to look.
Where this mechanical imagination is lacking, and what makes it incomplete as an imagination but perhaps sufficient as a mirror, is that it is incapable of the intuitive leaps that are so familiar to our own imagination. The body of human visual production—centuries of paintings, drawings, photographs, etc.—on which these models were trained was not formed in a merely combinatorial and recapitulative way. Leaps of intuition and strokes of genius produced them, of course, but also the pressures of scarcity and necessity, and the endlessly differentiated crafts and cultures that sprang up in those contexts. This humanmade corpus has accreted over time, building up a foundation that forms the strata both for our imagination and now also for an AI training set. That foundation is not made merely from the recombination of pre-existing blocks. It has been fashioned slowly and painfully. To the extent that a computer vision model might contribute to that foundation, it can only do so insofar as it unintentionally prompts reflection or consideration in a human mind toward new means or ends.
It is, perhaps, possible that a computer vision model will one day construct new means, methods, or concepts of art production, but progress toward that goal will not be measured by higher-resolution images, faster generation times, or ribbons at art fairs. If it happens, it may instead take the combined effort of many humans to pull the veil back from the computer’s eyes and teach it to create.