A New Declaration of Rights: Open Content Mining

In a recent investment report, analyst Claudio Aspesi concluded that a new front had opened up in the Open Access (OA) debate. Writing in April, Aspesi noted that academics are “increasingly protesting the limitations to the usage of the information and data contained in the articles published through subscription models, and — in particular — to the practice of text mining articles.” Aspesi is right, and a central figure in this battleground is University of Cambridge chemist Peter Murray-Rust. A long-time advocate for open data, Murray-Rust is now spearheading an initiative to draft a “Content Mining Declaration”. What is the background to this?
When I interviewed Peter Murray-Rust in 2008, he expressed considerable frustration at the difficulties he was experiencing in trying to extract and reuse the data published in scholarly journals, even where his university had paid for an electronic licence to access the content.

What Murray-Rust wanted to do, he explained, was to capture the “embedded data” contained in the tables, charts, and images published in science papers, along with the “supplemental information” that often accompanies papers. To do this, he had developed a variety of software tools to mine large quantities of digital text. Having extracted the data he then wanted to aggregate them, compare them, input them into programs, use them to create predictive models, and reuse them in a variety of other ways.

However, he was having huge problems achieving this, not because of any technical issue, but because of uncertainty over copyright and publishers’ insistence that a licence to read journals does not encompass the right to mine them with software.

To add to Murray-Rust’s frustration, many of his colleagues were either unsympathetic or uncomprehending. Even more galling, the Open Access movement, which should have been a natural ally, was more interested in making papers freely available to eyeballs than to software. Even papers published in OA journals, he noted, are often released under licences that do not come with reuse rights.

In pursuit of his dream, Murray-Rust became a formative voice in the creation of the open data movement. Open data, Murray-Rust explained to me in 2008, is data “free of any restraint on access and on reuse.” Recently, however, governments have tended to lead the way in pushing for open data, spawning a generation of data wranglers; open scientific information has often lagged behind, but it is now beginning to be seen as a central issue.

Four years later, Murray-Rust is still frustrated. He is not, however, a man to give up, and he continues his advocacy today under the rubric of “open content mining”. Essentially, this is text mining plus. As Murray-Rust explains it, the mining of scholarly journals is a hierarchical activity, with content mining encompassing not just the mining of text and data, but other types of content too, including images, tables, graphs, audio, and video.

Simply using the term “text mining”, he adds, “might imply that anything other than text should be protected by the ‘content provider’. However, I and others can extract factual information from a wide range of material.”

The good news is that the research community is finally beginning to understand what Murray-Rust has been “banging on about” for all these years, as are research funders and governments, and Murray-Rust believes the door to what he wants is finally beginning to open.

However, he says, it is imperative that text mining advocates push hard at that open door if they want to achieve their objectives. To this end, Murray-Rust recently convened an ad hoc group of interested parties to draft what he calls a “Content Mining Declaration” (disclosure: I am a member of the group).


