Crowdsourcing the Medieval Text: New Avenues for Examining Leaves and Fragments

I recently presented a paper at the 39th Saint Louis Conference on Manuscript Studies on crowdsourcing the description of medieval manuscript fragments. The paper draws upon my project at the Harry Ransom Center to survey medieval manuscript fragments used as binding waste in early modern books. A transcript of the presentation is provided below:

“I would like to begin my talk with an anecdote which I hope will provide some context to my presentation. By now I’m sure most of you have been exposed to the term “crowdsourcing.” It is in danger of being overhyped, but it is nonetheless an important movement which has been gaining speed in the digital humanities since it first achieved mass recognition around 2010.

Some of the more notable examples include Project Gutenberg, the Papers of the War Department, and the Zooniverse/Citizen Science Alliance projects. I was personally exposed to crowdsourcing two years ago in my Information Studies program at the University of Texas while taking a course on digital curation.

What truly inspired me was not one of the larger institutional projects just mentioned but a remarkable exchange among WWI enthusiasts on Flickr, Yahoo’s image hosting site. In 2005 a Flickr member, Jens-Olaf Walter, posted a photograph from WWI of German soldiers scrambling across railroad tracks somewhere in Finland. He accompanied the title with the simple phrase: “official army photo, German-Finnish Sign “Haltpunkt“?” It was not long before another user with the rather odd screen name “timonoko” posted a comment identifying the sign in the photograph and noting that it was near Helsinki but that he did not recognize the scenery.

Almost two and a half years later, another member took up the challenge, using Google Maps and a variety of clues in the photo to suggest a possible location. In response, both Jens-Olaf and the other contributor posted photos of old Finnish maps, one of which indicated a change in the railway line through that particular town. Amazingly, “timonoko” posted a video clip from a Finnish TV series, “Memories of 1918,” showing the exact same scene of the soldiers in the photograph crossing the railroad, except that it had been caught on film! (Sadly, I think the clip has now been removed).

The stream of comments and image posts does not end there. Multiple researchers began to chime in, offering related photos of the train tracks, the nearby train station, and the German military unit in the photograph. The amount of collaborative research recorded for this single photo verges on the absurd. All this to say, examining the photograph and reading the comments was an eye-opening experience for me. It points to the power of sites like Flickr to facilitate the collaborative process of describing and identifying historical artifacts.

But what is perhaps most instructive about this exchange is that the image also resides in the Great War Archive, a digital repository of WWI images. And there it sits, where I first found it, cut off from its stream of comments.

So how do we build upon the successes of repositories of digitized medieval manuscripts? How do we build a platform where anyone with a genuine interest in medieval manuscript fragments and a basic grasp of medieval paleography and codicology can offer input?

When archives and special collections began digitizing in the late nineties we produced what is referred to as shallow digitization. The basic institutional model has always been to make high quality scans, then put them in a repository online and hope that researchers take a look at them. One of the problems with this model is that the metadata is fairly limited. Yes, we’ve developed rigorous standards, but the descriptions still tend to be minimal and institutionally focused. Most notably, there has been no time to provide transcriptions or extensive keywords, and more often than not, the images reside in software platforms that are inaccessible to Google. This is where the dramatic growth in the use of online social media, image hosting sites, and blogs offers a whole new realm of possibilities for extracting and sharing information about historical artifacts.

Historically, medieval manuscript leaves, fragments, and binding waste have received considerably less attention from academics and librarians than bound codices. And yet, here in the United States especially, institutions that hold rare materials are more likely to have medieval leaves and binding waste in their collections than whole volumes. Despite the abundance of such artifacts, few have been extensively researched, surveyed, and described (the Ege leaves being one notable exception). This is at least one reason why leaves, fragments, and binding waste hold the most potential to benefit from a more collaborative digitization model.

While formal archive and special collection websites remain essential for highly professional and stable long-term projects, broader popular social media and photo-sharing sites such as Flickr and Facebook offer the potential to provide an easy, inexpensive, and more widely accessible platform for crowdsourcing the description of medieval manuscript fragments and binding waste.

Because I am the lone archivist with a master’s in medieval studies at the Harry Ransom Center, for the past year or so I have had the sole responsibility of surveying and describing the medieval and early modern leaf collection and manuscript binding waste in the Ransom’s book collection. The project has been challenging but thoroughly enjoyable. On occasion I have made totally unexpected discoveries, such as finding an impression of circa 16th-century leather-rimmed spectacles on the manuscript-waste endpapers of an early printed book.

Unfortunately, cataloging these objects is NOT outlined in my primary responsibilities as an archivist. This inconvenient fact, combined with the unique challenges of describing fragments, forced me to recognize that I needed the assistance of others in the rare book and manuscript community if I was going to make any substantive progress on the project. Inspired by the example of WWI enthusiasts on Flickr and other collaborative transcription projects, I suggested we try something similar with the Ransom fragments. Having been given the green light by the proper authorities, early in June of this year I began posting digital images on Flickr and inviting members of the rare book community to examine them and share insights. Notifications about new images were posted on a related Facebook page.

The first week was promising. We made an initial announcement on ExLibris and Facebook, and the 34 images posted drew 659 views and 4 potential text identifications.

The real increase in traffic came when the Ransom Center ran a story about the project on its blog, Cultural Compass, late in July, which brought us up to 2,422 views and around 16 total contributions. The next big jump in traffic came in mid-August after posting an announcement to the Early Book Society, which brought us to nearly 7,000 all-time views and several new contributors. By the time I had finished posting images of all known fragments on October 5th we were well above 13,000 views. Early on I predicted that all fragments in the survey would be identified by the time I finished posting images of every item. Although this did not quite come true, the statistics are nevertheless encouraging.

As of October 10, 2012 the collection had been viewed over 14,000 times.

So what exactly do these numbers tell us? Well, for starters they tell us there is a healthy interest out there in images of medieval manuscript fragments from the Ransom Center. And that’s an encouraging thought. But this has to be tempered by the fact that over a four-month period, the ratio of contributions to views was quite low, although it is highly probable that viewers will continue to make contributions in the months and years to come.

My colleague Ben Brumfield, a programmer who developed the successful transcription software FromThePage, already considers Ransom Center Fragments a crowdsourcing success given the highly specialized knowledge required to even approach these objects. Whether or not the project is a success purely in terms of scholarship remains to be seen.

The survey comprises 78 books containing a little over 79 fragments. The items span circa eight centuries and at least 8 geographic regions, and include a diverse representation of bookhands and documentary scripts, along with a variety of texts. A diverse selection of binding styles is also represented, although the limp vellum structure and other forms of parchment binding make up the majority. At least 7 of the 79 fragments are too heavily abraded for their texts to be identified under normal light.

All images have been viewed at least once and 73 out of 225 images have comments. Twenty-one of the texts on the fragments have been positively identified or at least attributed to print editions available online, while another 13 fragments now include rough transcriptions or other relevant information. That means that within 4 months, contributors provided relevant information for 51% of all identifiable fragments.

A closer look at the contributions confirms what others have learned from crowdsourced projects: they follow what is called a power-law distribution, in which most of the contributions are made by a handful of “well-informed enthusiasts.” In our case there were around 10 total contributors, and 22 of the 72 contributions were made by one rare book enthusiast from San Antonio. Similar outcomes seem to occur in both small and large projects.

I think it’s important to note that most contributions involved identifications of fragments via text-string searches in Google Books. A couple of things should be said about this. First, it’s amazing what you can do with Google Books. When I received training in manuscript studies just six years ago, the primary method of identifying texts was to use good old-fashioned off-the-shelf reference sources. This usually required a very good memory and access to excellent bibliographies. Google Books now shortens quite a bit of this work. The only serious drawback of using Google Books, or relying on others who do so, is that just because a few lines of script on a fragment match a string of text in a book online doesn’t mean they are manifestations of the same work, or that the online book represents the critical edition of the work. Regardless, it’s still a useful and immensely powerful tool which allows for some fascinating discoveries.

I hope that by now my audience is asking the question: so what’s next? Well, from my perspective there are at least three avenues for going forward:

The first is to use Flickr as the primary content manager and access point for these objects.

The second would be to create or use a different content manager and software platform that participating institutions can use to upload images, with an interface customized for fragments and binding waste and a more tightly controlled collaborative environment.

The third option would be to upload images to Digital Scriptorium or other individual institutional repositories.

Personally, I would like to see number 2 implemented while encouraging institutions to continue using Flickr as a broader and cheaper access point.

The folks at Integrating Digital Papyrology and their platform, Papyrological Navigator, provide the closest approximation to what serious manuscript scholars might want. It’s an impressive collaboration between several institutions and senior scholars and represents probably the most granular and tightly controlled environment for collaborative work.

There are a few problems I see with this prototype. First, to my knowledge no institutions in the United States currently hold large databases of already digitized binding waste and fragments. Second, the papyri project is designed around small flat fragments of texts on papyrus. Binding waste is a far more complex 3-dimensional structure, often with multiple fragments in a single binding, sometimes pasted one on top of another. Third, the structures themselves are of interest to scholars and not just the text they contain. And finally, Integrating Digital Papyrology is designed by academics, for academics. Unless we offer images on public sites like Flickr in addition to something like this, we will not be contributing to lowering the wall between the broader public and cultural heritage institutions. Option 3 is a perfectly good route to take but, like the Flickr option, is not a total solution.

I want to be absolutely clear: I am not advertising for Flickr. Admittedly it does have some incredibly useful features and management capabilities, and in many ways it is superior to other collaborative platforms in terms of binding waste specifically. What I envision as a useful interface for paleographers, codicologists, and manuscript scholars is very close indeed to what Flickr has to offer. But it has some major drawbacks. For instance, there is no detailed zoom feature, which is completely standard in other professional platforms. Large chunks of small dense script are virtually unworkable. Users can’t easily link comments to multiple images, and it’s also an awkward platform for creating transcriptions: the comments section is a little too far below the image for visitors to transcribe while looking at the text. Finally, Flickr is owned by Yahoo, and there’s no company that is too big to fail. If we trust our images to Flickr we still need to back them up ourselves (Digital Scriptorium is a good option), unless of course our purpose is only access and not preservation.

In a recent e-mail exchange, a prominent medievalist and manuscript scholar (who shall remain anonymous) explained what was absolutely essential to the feasibility and success of an operation like this. And I quote:

“The papyrus folks have broad support both from institutions with major holdings and from senior scholars.  There is also a strong ethos of quality control which manifests itself in rigorous vetting of comments and the ability NOT to post things that are not well substantiated.  So there is expertise plus process plus the commitment of a core group of people who are known to be seasoned papyrologists.”

I think the underlying message of this statement is that a broad-based manuscript fragment database with a platform for user contributions should NOT be open and accessible to just anybody off the street. I would like to challenge this assumption by asking a question. Is it essential to the long-term integrity of our discipline that we provide resources to which only the very best may contribute? Perhaps. However, maybe there is room for both the amateur and the expert in a single forum. Or at the very least, in parallel forums.

The grand narrative of this age of information is that the lines are blurring between those who have privileged access to knowledge and those who do not. As manuscript scholars it is our responsibility to affirm valid contributions when they are made, no matter where they come from, while at the same time diligently exposing error when and where it occurs. We can do this without creating barriers for those not inside the academy.

The Ransom Center Flickr project no doubt has user-contributed errors. But we never set out to provide a monolithic body of inerrant data. The Flickr site, and hopefully all collaborative projects, are as much about the process as the final product. Open-ended crowdsourced projects should be able to exist comfortably alongside enterprises like Digital Scriptorium.

At the risk of being overly pedantic, I would like to conclude with a quote from Plutarch’s Life of Alexander. Alexander, being the man of ambition that he was, found himself rather put out by Aristotle’s decision to publish certain doctrines which were traditionally passed on via oral communication to the initiated. According to Plutarch, Alexander states:

“You have not done well to publish your books of oral doctrine; for what is there now that we excel others in, if those things which we have been particularly instructed in be laid open to all?”

It is my hope that as we move forward we can rise above the temptations of exclusivity which afflicted Alexander the Great.

And one final note for all you catalogers out there: please include as much information as possible about manuscript binding waste in the notes field of MARC 21 records. This is currently the most efficient way to locate these items unless you want to make us slog through old auction catalogs or (heaven forbid) physically browse the closed stacks!”


  1. Reis Fontanals, November 9, 2012 at 9:11 am

    Very interesting! Totally agree. In Catalonia we have so many medieval documents, that often we don’t reference fragments and binding waste in the catalogs. From now, I will remember to do.

