Computational Source Detection

Shakespeare is the most studied author in the English language, and scholars have been identifying his sources for centuries. By the early 2000s, the field seemed picked clean — the major sources (Holinshed's Chronicles, Plutarch's Lives, Ovid's Metamorphoses) were well established, and finding new ones was vanishingly rare. Then a self-taught scholar with three monitors on his dining room table and a copy of plagiarism-detection software found what appears to be a significant new source for eleven plays.¹

The Discovery

Dennis McCarthy, the self-taught scholar in question, and his co-author June Schlueter used WCopyfind, an open-source plagiarism tool built for catching cheating students, to compare Shakespeare's plays against an unpublished manuscript they'd tracked down: "A Brief Discourse of Rebellion and Rebels," written in the late 1500s by George North, a minor courtier who served as Elizabeth I's ambassador to Sweden. The manuscript had been filed under an obscure shelf mark in the British Library since 1933, effectively invisible.

The software found patterns that would take a human lifetime to notice. In North's dedication, he urges the ugly to strive for inner beauty, using the words "proportion," "glass," "feature," "fair," "deformed," "world," "shadow," and "nature" — in that order. In the opening soliloquy of Richard III, the hunchbacked tyrant uses the same words in virtually the same order to reach the opposite conclusion: since he's outwardly ugly, he'll be the villain he appears. McCarthy then checked each pattern against Early English Books Online — 17 million pages from nearly every work published in English between 1473 and 1700 — to verify that the word clusters weren't common. Almost no other works contained the same combinations.
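The ordered-overlap idea can be sketched in a few lines of Python. This is an illustrative sketch, not WCopyfind's actual algorithm: it takes a hand-picked keyword set (the eight words listed above) and finds the longest run of those keywords that appears in the same order in two passages, using a standard longest-common-subsequence table. The function names and the two toy passages are invented for the example; they are not quotations from North or Shakespeare.

```python
import re

# The eight words named in the text; everything else here is hypothetical.
KEYWORDS = {"proportion", "glass", "feature", "fair",
            "deformed", "world", "shadow", "nature"}

# Invented toy passages, NOT real quotations.
TOY_PASSAGE_A = ("the proportion seen in a glass shows each feature fair "
                 "or deformed to the world as a shadow of nature")
TOY_PASSAGE_B = ("no fair proportion no glass no feature but deformed "
                 "in this world my shadow and nature")


def tokenize(text):
    """Lowercase a passage and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def shared_in_order(tokens_a, tokens_b, keywords):
    """Longest sequence of keywords occurring in the same order in both passages."""
    a = [t for t in tokens_a if t in keywords]
    b = [t for t in tokens_b if t in keywords]
    m, n = len(a), len(b)

    # dp[i][j] = length of the longest common subsequence of a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table forward to recover one longest shared sequence.
    shared, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            shared.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return shared


if __name__ == "__main__":
    print(shared_in_order(tokenize(TOY_PASSAGE_A), tokenize(TOY_PASSAGE_B), KEYWORDS))
```

A long shared sequence is only a candidate match. Whether it means anything depends on how rare the combination is elsewhere, which is the job of the database check described above.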

The North manuscript also explains details scholars had assumed Shakespeare invented. Jack Cade in Henry VI, Part 2 starves, eats grass, is dragged through the street by his heels, and left for crows — all details present in North's text about Cade and two other rebels. The list of dog breeds in King Lear and Macbeth (mastiff, cur, "trundle-tail") maps directly onto North's catalog of dogs in a passage arguing that hierarchy is natural. "Trundle-tail" appears in only one other work before 1623.

Method and Limits

McCarthy's approach differs from the usual computational authorship work. Most stylometric analysis uses function words (articles, prepositions, conjunctions) to build a "digital fingerprint" that can identify who wrote a text. The method doesn't need to understand the meaning of the writing at all; the relative frequencies of words like "the" and "of" can distinguish authors with surprising accuracy. McCarthy instead used comparatively rare content words, and used them to identify sources rather than authors. The distinction matters: Shakespeare wasn't plagiarizing North (the ideas are transformed, often inverted), but the shared vocabulary suggests he was reading him.
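For contrast, the function-word "fingerprint" used in authorship work can be sketched as follows. This is a deliberately simplified illustration: the ten-word list, the function names, and the plain sum-of-differences distance are all assumptions made for the example, and published stylometric methods use longer word lists and normalized measures.

```python
import re
from collections import Counter

# A short, hypothetical function-word list chosen only for illustration.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "but", "with", "for")


def function_word_profile(text, vocabulary=FUNCTION_WORDS):
    """Relative frequency of each vocabulary word among all tokens in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[word] / total for word in vocabulary]


def profile_distance(profile_a, profile_b):
    """Sum of absolute frequency differences; smaller means more similar usage."""
    return sum(abs(x - y) for x, y in zip(profile_a, profile_b))
```

Nothing in this profile looks at what the text is about, which is why it suits the question "who wrote this?" and not the question "what was the writer reading?".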

The most compelling evidence goes beyond individual word matches. North uses Merlin's prophecy to describe a world "turned upside down"; scholars had long puzzled over the Fool's recitation of a Merlin prophecy in King Lear that doesn't match any known version. McCarthy and Schlueter argue it came from North's version. The technique of running rare phrases against comprehensive databases of period texts provides a way to distinguish between "Shakespeare and North used a common source" and "Shakespeare read North" — if the particular combination appears nowhere else, the common-source explanation loses force.
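The rarity check can be sketched the same way. The snippet below is a minimal sketch under stated assumptions: the corpus is a plain dictionary mapping work titles to full text (a stand-in for a real database such as Early English Books Online, whose actual search interface is not modeled here), and a cluster "matches" a work when every word in it falls within a fixed window of consecutive tokens. The function names and the default window size are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def contains_cluster(tokens, cluster, window=100):
    """True if every word in `cluster` occurs within `window` consecutive tokens."""
    cluster = set(cluster)
    last_seen = {}  # most recent position of each cluster word seen so far
    for position, token in enumerate(tokens):
        if token in cluster:
            last_seen[token] = position
            # The tightest span ending here starts at the oldest "most recent" hit.
            if (len(last_seen) == len(cluster)
                    and position - min(last_seen.values()) < window):
                return True
    return False


def works_with_cluster(corpus, cluster, window=100):
    """Titles of all works in `corpus` (title -> text) that contain the cluster."""
    return sorted(
        title for title, text in corpus.items()
        if contains_cluster(tokenize(text), cluster, window)
    )
```

If a cluster shared by two texts turns up in few or no other works in the corpus, a common source becomes a less likely explanation, though judging what counts as "few" remains the scholar's call.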

David Bevington, editor of The Complete Works of Shakespeare, called the findings "a revelation" while cautioning against overexpansion: themes like the world turned upside down were widely available in contemporaneous works including Erasmus's In Praise of Folly. The statistical techniques haven't yet been subjected to rigorous review in the digital humanities.

What This Actually Demonstrates

I think the deeper lesson here isn't about Shakespeare at all — it's about the relationship between computational pattern-matching and human expertise. McCarthy didn't replace traditional source scholarship; he found a needle that traditional methods couldn't locate because the haystack (17 million pages of Early English Books Online) was too large for any human to search. The expertise was still necessary to interpret the patterns — to distinguish "Shakespeare inverted North's argument about beauty and deformity" from "both writers used common Renaissance vocabulary." The tool finds candidates; the scholar evaluates them.

This is the same division of labor that's emerging in decipherment, where computational tools accelerated Ugaritic decoding from years to hours but still required linguistic expertise to evaluate the mappings. And it connects to a broader question about what LLMs trained on massive text corpora might eventually do for literary scholarship — not replacing close reading, but surfacing statistical patterns across millions of documents that no individual scholar could hold in mind. The fact that a self-taught researcher with plagiarism software found what centuries of professional Shakespeare scholarship missed should be humbling for anyone who thinks expertise alone is sufficient when the data exceeds human capacity to survey it.

¹ Michael Blanding, "Plagiarism Software Unveils a New Source for 11 of Shakespeare's Plays," The New York Times.

