Cells compute - Let’s program them.

Open Science

Announcing the arXiv.org API

I cannot resist mentioning something that I think will really add another huge boost to the open science movement. arXiv.org has now opened up its massive troves of open access articles via an Application Programming Interface, or API.

You might think that you have never heard of arXiv.org and that you have no connection to it whatsoever. But more than likely you do have a connection to it, especially if you have ever visited a journal’s website, read a table of contents online, looked at an online abstract, or downloaded a PDF of a paper.

What was eventually to become arXiv.org was started by Paul Ginsparg in 1991 as an electronic means of continuing a practice already common among high-energy physicists: sharing pre-prints before they came out in the journals. At that time, high-energy physicists already recognized that the long delays associated with print publication hurt the pace of research, so they went ahead and sent pre-prints of all their papers to their friends. Realizing 1) that this could be done much better using the then increasingly popular email, and 2) that the existing buddy lists prevented those not in the know from ever being in the know, Ginsparg decided to digitalize this culture in the form of a centralized repository of digital pre-prints that anyone could access. The word ‘e-print’ was born.

Ginsparg’s original creation was based on SMTP, the protocol that enables email. (Remember that the ‘web’ was only just getting started in 1991.) Eventually HTTP took the world by storm, and in 1994 Ginsparg and colleagues made the repository accessible via an HTML interface that eventually evolved into the present-day arXiv.org.

From the very start, the arXiv project has provided full and open access to all of its e-prints. While it was inspired by and initially used by theoretical high-energy physicists, it quickly spread to all sub-disciplines of physics, and now serves the communities of physics, math, and computer science, and most recently quantitative biology and statistics. For the most part, the arXiv allows anyone in these disciplines to post their work with only a smidgeon of peer review. Despite not having a rigorous peer review process, the arXiv has articles of astonishingly high quality, mainly because the common practice is for authors to post their articles to the arXiv as well as submit them to a regular journal.

Needless to say, it was the first of its kind, and it has done much to promote both the digitalization of scientific journals and open access.

The arXiv has been going strong all this time, but the interface has centered around HTML web pages meant to be accessed by humans. That means that while it is easy to type in http://www.arxiv.org and click around to search and retrieve the articles that you are interested in, it is not easy to write a program to do this. (Unless you want to screen scrape, and no one wants to do that.) Well, this has just changed with the release of the arXiv API.

But why would you want to write programs to do this stuff? The first thing that comes to mind is to relieve yourself of the tedium of all those clicks, especially if you want to do things in bulk. But once you realize this, you immediately begin to think of all sorts of other uses, and herein lies the beauty of the API - once you have a way to program against all that information, you can do anything you want. More importantly, you can convince your local nerd herd to program what you want. It’s probably a good time to make friends again.
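To give a feel for what programming against the arXiv might look like, here is a minimal Python sketch. It assumes the query endpoint is `http://export.arxiv.org/api/query`, that it accepts `search_query`, `start`, and `max_results` parameters, and that it answers with an Atom feed; check the API documentation before relying on any of that. The function names are my own and are not part of the API.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Assumed endpoint and Atom namespace; verify against the arXiv API docs.
ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"


def build_query_url(search_query, start=0, max_results=5):
    """Build the request URL for a search such as 'all:electron'."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    return ARXIV_API + "?" + urlencode(params)


def parse_feed(atom_xml):
    """Pull the id, title, and author names out of each Atom entry."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        papers.append({
            "id": entry.findtext(ATOM + "id", default="").strip(),
            # Titles can be wrapped across lines; collapse the whitespace.
            "title": " ".join(title.split()),
            "authors": [
                author.findtext(ATOM + "name", default="").strip()
                for author in entry.findall(ATOM + "author")
            ],
        })
    return papers


def search(search_query, start=0, max_results=5):
    """Fetch one page of results over the network and parse it."""
    with urlopen(build_query_url(search_query, start, max_results)) as response:
        return parse_feed(response.read())
```

Something like `search("all:electron", max_results=3)` would then hand back a short list of dictionaries that you can loop over, which is exactly the kind of bulk work that gets tedious by hand.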

Despite the API being only a few days old, some people have already stepped up to develop clients, including OpenWetWare’s Bill Flanagan. Pretty soon you will be able to use the extremely convenient biblio plugin on OpenWetWare to create bibliographies using arXiv articles. Given that there is a growing number of arXiv posts about quantitative biology, I suspect that there will be a growing number of OWW users that use this feature.

But I should mention that the arXiv is not the first scientific literature source to open up its information via an API. To my knowledge, this milestone was first achieved by the National Center for Biotechnology Information with their Entrez e-utils system. This system allows programmatic access to all of PubMed, PubMed Central, and the data warehouses at NCBI such as GenBank. In fact, the current biblio plugin uses this API.
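For comparison, a PubMed search through e-utils can be sketched in much the same way. This assumes the `esearch.fcgi` endpoint at `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/` with `db`, `term`, and `retmax` parameters, and an XML reply that lists matching IDs under `IdList`; the helper names are hypothetical and this is not how the biblio plugin itself is written.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Assumed e-utils search endpoint; verify against the NCBI documentation.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pubmed", retmax=5):
    """Build a search URL for the given database and search term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


def parse_id_list(esearch_xml):
    """Return the record IDs listed in an esearch XML reply."""
    root = ET.fromstring(esearch_xml)
    return [id_node.text for id_node in root.findall("./IdList/Id")]


def search_pubmed(term, retmax=5):
    """Run the search over the network and return the matching IDs."""
    with urlopen(build_esearch_url(term, retmax=retmax)) as response:
        return parse_id_list(response.read())
```

The returned IDs are only pointers; fetching the actual citation details takes a second request to another e-utils endpoint, which is left out here.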

But the arXiv API brings the physics, math, and computer science communities into the mix, so that someone can really make a mashup with all of that open access content. I tried to do this a while ago, before the arXiv API existed, and let me tell you that I sorely missed having one. The arXiv API is a much-needed addition to the open science infrastructure. Just as the arXiv has inspired others in the past, I hope this inspires a wave of API building by journal publishers and others with valuable data, so that we have the tools necessary to creatively combine all these knowledge sources and improve the way science is done.

For more details on the arXiv API, you can go to the arXiv API homepage, where you will find information on how to participate in the lively developer community, as well as info on how to get started using the API for your purposes.
