The Archive and the Memory

There is a distinction, rarely observed carefully, between an archive and a memory. An archive is a storage system. A memory is a living process — selective, lossy, transformative, and embedded in a community that interprets it over time. Libraries are archives. Languages are memories. The confusion between them has a long history, and it is usually benign until a new preservation technology arrives and the distinction becomes urgent.

The printing press did not preserve culture. It changed what culture could be transmitted, at what scale, and by whom. The result was not the stable diffusion of existing knowledge but a centuries-long upheaval in what counted as authoritative, what could be challenged, and who had standing to speak. We are not inside a comparable moment. But we are inside a moment that requires the same precision about what is actually happening.

WHAT AI TRAINING DOES TO A CORPUS

When a large language model trains on the textual record of human culture — fiction, journalism, correspondence, philosophy, law, science — it does not preserve that record. It produces a statistical compression of its patterns. The model learns which words follow which words in which contexts, with sufficient complexity that the result resembles, in many cases, genuine understanding.

What it does not retain is provenance. A well-trained model can reproduce the rhythm of a Montaigne essay and the argument structure of a legal brief without knowing — in any sense that would satisfy an archivist — that these are different kinds of knowledge produced in different contexts for different purposes. The compression discards the metadata that tells a reader how to read.

Culture does not pass between generations as undifferentiated content. It passes as argument — one work responding to another, one generation reinterpreting what the previous considered settled. The argument is the transmission.

This matters for cultural transmission in a specific way. Culture does not pass between generations as undifferentiated content. It passes as argument, as contestation, as commentary — one work responding to another, one generation reinterpreting what the previous generation considered settled. The Romantic poets were not simply continuing the Augustan poets. They were arguing with them. Walter Ong’s work on orality and literacy makes clear that every major shift in preservation technology has been, at depth, a shift in the structure of argument itself.

A system that learns the statistical patterns of culture without the argumentative structure that generated those patterns has not learned culture. It has learned its surface.

THE LAWSUITS AND WHAT THEY REVEAL

The legal disputes between AI developers and content producers — The New York Times’s lawsuit against OpenAI and Microsoft, the Authors Guild class-action suit — are typically framed as intellectual property disputes. This framing is not wrong, but it is insufficient.

What these disputes also reveal is a conflict between two models of what a cultural corpus is. For publishers and authors, the corpus is a community of texts in argument with each other — and with their readers — over time. For AI developers, the corpus is training data: a large, undifferentiated input that produces useful model weights.

Neither view is purely cynical. Both views are incomplete. The question that neither the courts nor the regulators have yet properly asked is what happens to cultural transmission when its raw material — the written record — is processed through a system that systematically discards the relational, argumentative, and contextual structure that makes transmission possible.

UNESCO’s 2021 Recommendation on the Ethics of AI came closest when it noted that AI systems trained on cultural data can “amplify existing biases in cultural representation” — but this frames the problem as representational, not structural. The issue is not that some voices are underrepresented in the training corpus. The issue is that the corpus, once processed, loses the very feature that makes representation meaningful: the fact that the texts are arguing with each other.

WHAT ENDURES, AND WHAT DOESN’T

The Alexandrian library burned, and the texts it contained did not all survive. What survived of ancient culture survived in active use — taught in schools, translated and mistranslated, argued over by scholars who disagreed about what it meant. The transmission was lossy, partial, politically inflected. And it was alive.

An AI training corpus is not alive in this sense. It is comprehensive in a way no archive has ever been, and it is correspondingly static in a way that no living cultural tradition has ever been. The texts are all there. The argument between them is not.

This is not an argument against AI, or against large-scale text processing, or against any of the genuine capabilities these systems have developed. It is an argument for precision about what is being preserved and what is being processed — and for scepticism toward any claim that the two are the same thing.

Borges understood this. His Library of Babel contains every possible text. It contains nothing meaningful. The archive is not the memory. The compression is not the transmission. This distinction has always mattered. It matters more now.

SOURCES

1. The New York Times Company v. Microsoft Corporation. Case 1:23-cv-11195, SDNY, December 2023. Complaint PDF
2. Authors Guild et al. v. OpenAI Inc. Class action complaint, September 2023. authorsguild.org

Tags: WHAT ENDURES