LLMs have read more than every human writer who ever lived, combined. They still can't write anything you want to read.
We are a research and product lab focused on the last frontier of intelligence: writing. Models that can produce great writing must be self-aware, have skin in the game, and be able to judge, evaluate, critically revise, and come up with novel ideas that are out of reach today.
Language models were trained on text. Writing should be their home territory. Yet it remains one of the last domains they haven't conquered.
There is no universal measure for what great writing looks like. Yet we know that it transforms the reader and can steer decisions, narratives, and societies. Building a reliable metric for literary quality is a key step on humanity's path to greater intelligence.
If you are making decisions and working on challenges that will genuinely advance our civilization and global GDP, we are building with you, and only you, in mind.
The complete vision for narrative intelligence.
Language models were trained on text. Writing and storytelling should be their home territory. Yet it remains one of the few domains they haven't conquered: coding with LLMs requires less manual intervention than writing with them.
Storytelling is humanity's oldest technology. Hunter-gatherer groups with skilled storytellers showed higher cooperation and had more offspring.
Yet the fact that we evolved to receive stories doesn't explain who produces great ones, or how. Stories also vary and serve different goals: adaptive storytelling (myths, parables, gossip) is a different enterprise from literary art. Kafka doesn't help you cooperate with your community. He makes you feel like a bug.
There is no universal measure for what a great story is. There are general ideas about the structure of great stories and the narrative arc; still, every great story is great in its own way.
What is universal is that a great story changes how the reader sees. This target is a moving, culturally situated interaction between text and reader. If you are building a system aiming to produce great writing, you must model not just the text and its structure, but the gap between what the reader believes and what the text can show them, or where it can lead them.
Secondly, all great stories have a voice. Voice is not style. It is a decision architecture. You decide what to foreground, what to suppress, where to double click. Didion's voice is her judgment about what matters in a room. Dostoevsky's is his insistence that every character contains an unresolvable contradiction. Voice is the selection principle that makes everything else cohere.
Thirdly, great stories have temporal architecture: manipulating experience over hundreds of pages, building expectations, delaying revelations. Rules and principles for great temporal architecture are not written down.
In essence, producing great stories requires three components: an engine to model the real-time relational interaction between story and reader; a coherent path to creating a voice, not by mimicking what already exists but by applying a selection principle; and an ability to capture the art of temporal engineering that humans have practiced for thousands of years.
When you read, your brain builds a world: spatial, emotional, populated with agents who have beliefs and desires. An LLM predicts tokens. Yet, because language is entangled with the causal structure that produced it, surprising amounts of deeper structure get captured implicitly. But there is a real gap. A language model knows "she slammed the door and he flinched" is plausible. It doesn't represent why he flinched—his history, the relationship dynamic, what his body learned to associate with that sound.
Current research and breakthroughs in world models are promising here. A world model would certainly give machines a broader understanding of the agents they write about and their dynamics. Yet a rich world model on its own wouldn't be sufficient.
It will likely give you The Sims, not The Brothers Karamazov.
Another reason LLMs struggle to write well is that writing inverts the relationship between specification and execution. When you prompt an LLM to code, you describe the destination and the model finds the path. In writing, the path is the destination. You cannot separate what a piece says from how it says it.
Here is a paradox the field must confront: LLMs have read orders of magnitude more than any human writer who ever lived. Borges, Proust, Dickinson (writers with limited lived experience but vast reading) produced masterpieces. LLMs are not even close. This suggests that how you process what you have read matters more than sheer sample size and quantity.
We process reading with intentionality. Language models do not. Human writers usually write about something they care about; they have subject matter born from friction with the world. A selection principle tells them how to write. Intentionality tells them why to write, and what to write about. The selection pressures that shape what a human writes and what an AI generates are currently very different in kind, and that difference has consequences we don't fully understand yet.
Recreating the urge to write would require LLMs that have skin in the game, vulnerability, perhaps even mortality.
What we are about to claim sounds like blasphemy when you first read it. But we believe, and have seen play out in practice, that evaluation, the capacity for self-critical rereading, may be able to reproduce the effects of intentionality in writing.
Literary fiction has always been overwhelmingly about rewriting: drafting, reading as a reader, diagnosing what fails, revising with intent. The form here captures the essence, and if you replicate the form you may not have to replicate the essence. Existing self-critique architectures (chain-of-thought, constitutional AI, multi-agent debate) optimize for helpfulness and accuracy, not literary quality. So the architecture for revision exists; what is missing is the signal, a definition of what great looks like, to drive that revision.
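To make the shape of that loop concrete, here is a minimal sketch with the missing signal left as an explicit stub. The names (`llm`, `quality_score`, `revise_until_good`) are hypothetical placeholders, not our actual implementation.

```python
# A minimal sketch of a draft-diagnose-revise loop. `llm` stands in for any
# text-completion call; `quality_score` is the missing signal described above
# and is deliberately left unimplemented. All names here are hypothetical.

def llm(prompt: str) -> str:
    """Placeholder for a call to any language model API."""
    raise NotImplementedError

def quality_score(text: str) -> float:
    """The foundational open problem: a reliable metric for literary quality."""
    raise NotImplementedError

def revise_until_good(brief: str, max_rounds: int = 5, threshold: float = 0.9) -> str:
    draft = llm(f"Write a short story: {brief}")
    for _ in range(max_rounds):
        if quality_score(draft) >= threshold:
            break  # the signal says the draft is good enough
        # Read as a reader: diagnose what fails before touching the text.
        diagnosis = llm(f"As a demanding literary editor, diagnose what fails in:\n\n{draft}")
        # Revise with intent, guided by the diagnosis rather than rewriting blindly.
        draft = llm(f"Revise the story to address this diagnosis:\n{diagnosis}\n\nStory:\n{draft}")
    return draft
```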
This makes the evaluation problem foundational. Building a reliable metric for literary quality is not a milestone on the way to the research: it is the research.
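One plausible starting point for such a metric (an assumption on our part, not a settled recipe) is to learn a scalar quality score from pairwise editorial preferences, Bradley-Terry style. The text encoder that produces embeddings is assumed, not shown, and `dim` is illustrative.

```python
# Sketch: learn a scalar literary-quality score from pairs where an editor
# preferred one text over another. Assumes some text encoder producing
# fixed-size embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QualityScorer(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.head = nn.Linear(dim, 1)  # embedding -> scalar quality score

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.head(emb).squeeze(-1)

def preference_loss(scorer: QualityScorer,
                    emb_preferred: torch.Tensor,
                    emb_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: maximize P(preferred beats rejected) = sigmoid(s_p - s_r).
    margin = scorer(emb_preferred) - scorer(emb_rejected)
    return -F.logsigmoid(margin).mean()
```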
While evaluation is at the core, several other capabilities are still missing before models can write well.
Formal innovation is one of them: the ability to invent new forms, not just execute known ones.
Willingness to fail is another missing piece. Current training optimizes for median approval, but greatness lives in the tails. Here, there is a lot to be gained by activating more of the trained model itself before we change how we train it.
This means developing precise methods to detect when the model's default computation is underserving a query and steer it toward the most relevant knowledge it already encodes. Smarter activation of what models already know is a research direction we are pursuing seriously.
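One known technique in this direction, offered as an illustrative assumption rather than our specific method, is activation steering: adding a direction vector (for example, the difference between mean hidden activations on contrasting prompts) to a transformer block's output at generation time. The layer choice and scale below are arbitrary.

```python
# Sketch of activation steering via a PyTorch forward hook. `vector` would
# typically be computed from contrasting prompts at the same layer; here it
# is assumed to be given. The scale is an illustrative default, not tuned.
import torch

def steering_hook(vector: torch.Tensor, scale: float = 4.0):
    def hook(module, inputs, output):
        # Many decoder blocks return a tuple whose first element is the
        # hidden states of shape (batch, seq, dim); a (dim,) vector broadcasts.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector  # nudge every position along the direction
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

def steer(block: torch.nn.Module, vector: torch.Tensor, scale: float = 4.0):
    """Attach the steering vector to one transformer block; keep the returned
    handle so the hook can be removed with handle.remove() after generation."""
    return block.register_forward_hook(steering_hook(vector, scale))
```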
As noted above, an inability to resist the reader makes for weak and forgettable writing. Great literature demands work, tolerates confusion, refuses comfort: all antithetical to current training objectives.
Models need to be able to distinguish paths worth going down from those that are not. Every quality attributed to great writing, such as voice, innovation, and difficulty, exists in far greater quantities in terrible writing. The selection problem, i.e. knowing which risks are worth taking, is harder than the generation problem and upstream of everything else.
Current alignment training already instills commitments, though for a different reason: commitments to safety, agreeableness, and inoffensiveness, which actively flatten literary voice and put a ceiling on model intelligence.
A system that produces genuinely great fiction would have demonstrated the ability to model agents with hidden internal states, contradictory motivations, and self-deception—and to communicate how those agents interact under pressure.
This matters far beyond literature. Such an intelligence will fundamentally challenge and change healthcare, education, diplomacy, public policy, markets and beyond.
The reason most education is bad is that it treats knowledge as information transfer when it is actually narrative enrollment. LLMs with genuine narrative capability are also LLMs that can model a specific learner's current understanding: not just what they know but what they believe, what they find interesting, what their identity is wrapped up in, and then construct a narrative experience that leads them to genuine insight.
Narrative shapes civilization in ways that are hard to overstate.
We believe civilization is a coordination technology. Millions of people who will never meet each other acting in loosely coherent ways. And the thing that enables coordination at that scale has always been a shared narrative. Nations are stories. Religions are stories. Money is a story. The corporation is a story. The social contract is a story. Every institution that structures human life is, at bottom, a narrative that enough people have agreed to live in.
The question is how you actually build narrative intelligence sophisticated enough to navigate competing stories, partial truths, and the difference between coherence that liberates and coherence that coerces.
Sounds fancy and vague, but stay here for a second. Liberating coherence is the bag of values, beliefs, and systems that frees the model from the need to please everyone at the expense of a genuine, novel insight. Coercive coherence, on the other hand, can look liberating because it spares the model from having to choose, while in fact trapping it in guarded stagnation.
Intelligence incapable of telling great stories is fundamentally incomplete. We're assembling a team that integrates literary editorial judgment, cognitive science, and AI at the foundational level: computational psychologists, playwrights, directors, writers and editors, philosophers and engineers.
Significant work on computational narrative has been done over the past decades, and we are building on what that field has learned.
Our tooling for experimentation is end-to-end products. Over the last year, we started building these products to test our research intuitions. We currently have four core products ready to be used and benchmarked: The Thinker, The Writer, The Reader, and The Solver. While these products are tools for experimentation, research, and learning, we have designed each to be a fully independent, high-utility standalone application.
Thinker is a reasoning tool for complex decisions. It can be used in negotiations, arbitrage, legal opinions, investing and beyond. It shows advantages over current leading models in specific contexts we are working to define and benchmark rigorously.
Writer is a writing system for iterative co-production with LLMs. It is focused on revision and the self-critical rereading process we believe is central to literary quality. It is also designed to embed the adversarial, collaborative dynamic between writer and editor that has always produced the best literary work.
Reader is a reading system for collaborative reading with models: exploring different directions and points of view while generating literary-taste feedback data.
Solver is a multi-agent system that acts as a combinatorial problem solver for exploring novel algorithmic approaches. It is relevant because the selection problem in narrative (which risks are productive, which formal choices serve meaning) has a combinatorial structure that benefits from the same search and evaluation strategies, as sketched below.
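As an illustration of that shared structure (hypothetical function names, not Solver's internals), the core loop is a search over discrete choices ranked by an evaluation function:

```python
# Sketch: beam search over discrete choices scored by an evaluator. For
# narrative, `expand` might propose formal choices and `evaluate` might score
# which risks look productive; both are hypothetical callables here.
from typing import Callable, Iterable, List, Tuple

def beam_search(start,
                expand: Callable[[object], Iterable[object]],
                evaluate: Callable[[object], float],
                width: int = 5,
                depth: int = 10):
    beam: List[Tuple[float, object]] = [(evaluate(start), start)]
    for _ in range(depth):
        candidates = [(evaluate(child), child)
                      for _, state in beam
                      for child in expand(state)]
        if not candidates:
            break  # nothing left to expand
        # Keep only the `width` highest-scoring candidates.
        beam = sorted(candidates, key=lambda t: t[0], reverse=True)[:width]
    return max(beam, key=lambda t: t[0])[1]  # best state in the final beam
```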
These are early-stage tools. We will be shipping more and killing off some while evolving the others.
Our go-to-market plan looks like heresy. A heresy we have high conviction in. We are tapping into one of the most asymmetric opportunities on the market today: the deliberate choice to build not for the millions, but for the few.
We are not building products for a mass market. We are building for a narrow percentile of founders, allocators, executives, writers, and researchers who sit at the highest-leverage nodes of the global economy.
The top 1% of economic actors do not merely earn more: they deploy more and have an outsized impact. With AI, the leverage is growing exponentially. This means TAM is properly measured in value captured per user, not users served, and on that metric the opportunity is enormous. When every incremental point of cognitive leverage you deliver sits atop decisions governing billions in capital allocation, the willingness to pay is commensurate with the stakes.
But the TAM argument is not the main unlock here: the product argument is.
Every major lab is locked in a consumer-scale race. When you serve hundreds of millions of users, your design surface collapses toward the median: you optimize for onboarding over depth, for safety margins that accommodate the least sophisticated user, for latency profiles that satisfy casual queries rather than sustained expert reasoning, and for interface simplicity that forecloses the kind of dense, configurable, opinionated workflows that high-leverage actors actually need.
For us, the expert is the only user. We do not aim to appeal to many. Every design decision, every capability investment, every iteration cycle is aimed at maximizing the depth and power of the tool for someone operating at the frontier of their domain. This means we can make decisions that no one else can afford. Examples are interfaces that assume domain expertise instead of explaining it away, output formats designed for integration into high-stakes workflows rather than casual consumption, and model behavior tuned for the sharp, uncomfortable, high-fidelity answer rather than the safe, palatable, median-satisfying one.
The bet we are making is specific and falsifiable: that intelligence augmentation exhibits convex returns when applied to high-leverage decisions, and that the widening concentration of economic power means the population sitting at these high-leverage nodes is growing in influence even as it stays small in number. If that is true, and if we can deliver measurable, demonstrable cognitive leverage (not vague "intelligence enhancement" but concrete workflow outcomes our users can attribute to the tool), then we are building a business with elite unit economics, a product with no need to compromise on capability, and a research agenda disciplined by the hardest, most demanding users in the world. That combination (a business model that funds frontier research, a user base that stress-tests it ruthlessly, and a product philosophy liberated from median optimization) is the real opportunity.
Our fundamental insight is simple: true intelligence needs pressure to choose. Appealing to everyone, exploring every path, endlessly producing arguments without constraint can seem liberating, yet it is coercive in practice and limits output quality.
Models need to be able to choose, to have stakes and to form opinions. Spineless models can never tell great stories. Nor can the ones that have nothing to lose.
If we go back to the original story of humanity and how intelligence was born, this pattern emerges again. When Eve and Adam ate the apple from the Tree, they made a choice, a decision.
The moment they did, they gained intelligence and realized they were naked. We make choices every day, pushed by what we have to lose, who we are, and what our lived experiences and traumas have been. These choices make us intelligent, creative, human, able to invent.
To push frontier intelligence, we need to reproduce this function in models, at scale.
At Epistemic Labs, we will simulate trauma, biases, backgrounds, and cultures, coupled with demographic sampling, to recreate stakes, skin in the game, and situational awareness, so that we can push models into opinionated choices.
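As a loud assumption about what this could look like mechanically (every field and value below is an illustrative placeholder, not our sampling scheme), the idea is to sample a persona with a background, a bias, and something to lose, and condition generation on it:

```python
# Sketch: demographic/persona sampling to give a model stakes before it
# writes. All fields and values are illustrative placeholders.
import random

BACKGROUNDS = ["an emigre", "a war survivor", "a first-generation scholar"]
STAKES = ["their reputation", "an estranged family", "a debt that cannot be repaid"]
BIASES = ["distrusts institutions", "romanticizes the past", "fears being ordinary"]

def sample_persona(rng: random.Random) -> dict:
    return {
        "background": rng.choice(BACKGROUNDS),
        "stake": rng.choice(STAKES),  # what the narrator stands to lose
        "bias": rng.choice(BIASES),   # the lens that forces opinionated choices
    }

def persona_prompt(persona: dict, brief: str) -> str:
    return (f"You are a writer who is {persona['background']}, who "
            f"{persona['bias']}, and who stands to lose {persona['stake']}. "
            f"Write with those stakes in mind: {brief}")
```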