Did I Just Solve a Centuries-Old Mystery?

A week ago I started a new natural language processing (NLP) project in the area of machine translation. Essentially, I’m building a scaled-down version of Google Translate. Google Translate will translate between virtually any pair of languages, but my translator will only translate from Old Norse into English.

Why, you ask? For many years I have been interested in the history of the Viking Age and have read many of the Icelandic Sagas. Some of these works of literature are available online, but not in English. So I thought it would be a fun project to see how well I could code an algorithm to do the translation for me.

This is still a work in progress, but while working on it I completed an interesting side project. The Icelandic Sagas were written anonymously. I guess that since these stories were based on the lives and adventures of real people, the person or persons who wrote them down did not feel they had to claim authorship. Snorri Sturluson, a 13th-century Icelandic poet and politician, is believed to have written one of the sagas, Egil’s Saga, but as far as I know no one knows for sure who wrote this or many of the other sagas.

Since it is known that Snorri wrote Heimskringla (the sagas of the kings of Norway), we should be able to use some NLP techniques to identify which other sagas he may have written. Statistical text analysis has been used to attribute authorship for many works of literature and historical writings. For example, The Federalist Papers were published under a shared pseudonym, and the authorship of a subset of them was long disputed between Alexander Hamilton and James Madison. Using writing samples known to be by each man, statisticians compared word usage in the disputed papers against both authors and concluded that Madison most likely wrote them. Essentially, the disputed papers match the known examples of Madison’s writing statistically, which makes him the most probable author.

That work serves as a proof of principle that the saga authors could be identified using NLP. However, there is one problem. There is only one known saga author, Snorri Sturluson, who wrote Heimskringla. So when the sagas are matched on authorship, we can say with high certainty that the works that match Heimskringla were written by Snorri. The other sagas that cluster together will have different authors, but we will not know who those authors are.

To cluster the sagas, I first created a term vector, composed of every word and number (term) as it appeared in the versions of the saga texts that I acquired online (Icelandic Sagas, Heimskringla). Then, in a process known as TF-IDF (term frequency-inverse document frequency), each term in each saga was assigned a weight based on how often it appeared in that saga relative to all the other sagas. These weights were placed in a matrix whose rows represent the sagas and whose columns represent the terms. Now that we’ve turned pages of literature into a matrix of numbers, we can do math on it! The mathematical techniques I employed were non-negative matrix factorization (NMF) and KMeans clustering. These techniques group the saga texts by similarity of word usage, which serves here as a proxy for authorship.
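
As a rough illustration of the TF-IDF step described above, here is a minimal sketch in plain Python. It is not the code behind the analysis: the three toy "sagas" are made-up strings rather than quotations from any real text, the function names (tfidf_matrix, cosine) are hypothetical, and the weighting is just one common TF-IDF variant (term frequency times the log of inverse document frequency). The NMF and KMeans steps are not shown; they would take the rows of this matrix as input.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, matrix): one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Weight = (share of this document's tokens) * log(N / document frequency).
        row = [
            (counts[t] / total) * math.log(n_docs / df[t]) if total else 0.0
            for t in vocab
        ]
        matrix.append(row)
    return vocab, matrix


def cosine(u, v):
    """Cosine similarity between two rows of the matrix."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up toy documents standing in for full saga texts (assumption).
toy_sagas = [
    "konungr for til noregs ok konungr maelti",
    "konungr maelti ok for til noregs",
    "skip sigldi vestr um haf ok skip braut",
]

vocab, matrix = tfidf_matrix(toy_sagas)
sim_ab = cosine(matrix[0], matrix[1])
sim_ac = cosine(matrix[0], matrix[2])
print(sim_ab, sim_ac)
```

In this toy setup the first two texts share most of their vocabulary and so score as more similar to each other than to the third. A term that appears in every text (here "ok") gets a weight of zero, which is why very common words contribute nothing to the comparison under this weighting.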

To my surprise, my analysis found that virtually all of the sagas clustered with Snorri’s Heimskringla, which suggests that he authored not only Egil’s Saga but almost all of the sagas. Thus the centuries-old mystery of who wrote the sagas may have been solved! Since I’m not an expert in the Old Norse language, I’m interested in hearing from those who are: what do you think, and how would you go about setting up a language model to cluster the saga texts?

Written on June 5, 2017