Cells compute - Let’s program them.

Announcing the arXiv.org API

I cannot resist mentioning something that I think will really add another huge boost to the open science movement. arXiv.org has now opened up its massive troves of open access articles via an Application Programming Interface, or API.

You might think that you have never heard of arXiv.org and have no connection to it whatsoever. But more than likely you do have a connection to it, especially if you have ever visited a journal’s website, read a table of contents online, looked at an online abstract, or downloaded a PDF of a paper.

What eventually became arXiv.org was started by Paul Ginsparg in 1991 as an electronic means of continuing a practice already common among high-energy physicists: sharing pre-prints before they came out in the journals. At that time, high-energy physicists already recognized that the long delays associated with print publication actually hurt the pace of research, so they went ahead and sent pre-prints of all their papers to their friends. Realizing that 1) this could be done much better using the then increasingly popular email, and 2) the existing buddy lists prevented those not in-the-know from ever being in-the-know, Ginsparg decided to digitalize this culture in the form of a centralized repository of digital pre-prints that anyone could access. The word ‘e-print’ was born.

Ginsparg’s original creation was based on SMTP, the protocol that enables email. (Remember the ‘web’ did not yet exist in 1991.) Eventually HTTP took the world by storm and in 1994 Ginsparg and co made the repository accessible via an HTML interface that eventually evolved into the present day arXiv.org.

From the very start, the arXiv project has provided full and open access to all of its e-prints. While it was inspired by and initially used by theoretical high-energy physicists, it quickly spread to all sub-disciplines of physics, and it now serves the communities of physics, math, and computer science, and most recently quantitative biology and statistics. For the most part, the arXiv allows anyone in these disciplines to post their work with only a smidgeon of peer review. Despite not having a rigorous peer review process, the arXiv has articles of astonishingly high quality, mainly because the common practice is for everyone to post their articles to the arXiv as well as submit them to a regular journal.

Needless to say, it was the first of its kind, and it has done much to promote both the digitalization of scientific journals, and open access.

The arXiv has been going strong all this time, but the interface has centered around HTML web pages meant to be accessed by humans. That means that while it is easy to type in http://www.arxiv.org and click around to search and retrieve the articles that you are interested in, it is not easy to write a program to do this. (Unless you want to screen scrape, and no one wants to do that.) Well, this has just changed with the release of the arXiv API.

But why would you want to write programs to do this stuff? The first thing that comes to mind is relieving yourself from the tedium of all those clicks, especially if you want to do things in bulk. But once you realize this, you immediately begin to think of all sorts of other uses, and herein lies the beauty of the API - once you have a way to program against all that information, you can do anything you want. More importantly, you can convince your local nerd herd to program what you want. It’s probably a good time to make friends again.
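To give a flavor of what programming against the API might look like, here is a minimal Python sketch. It assumes the query endpoint is http://export.arxiv.org/api/query, that it accepts search_query, start and max_results parameters, and that results come back as an Atom feed; check the API documentation before relying on any of these. The function names (build_query_url, parse_feed, search) are made up for this illustration and are not part of any arXiv client.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed endpoint and XML namespace; confirm both against the arXiv API docs.
API_URL = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"


def build_query_url(search_query, start=0, max_results=5):
    """Build a query URL such as ...?search_query=all%3Aelectron&start=0&max_results=5."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_feed(atom_xml):
    """Turn an Atom feed (str or bytes) into a list of dicts, one per entry."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall(ATOM + "entry"):
        # Titles are often wrapped across lines, so collapse the whitespace.
        title = " ".join(entry.findtext(ATOM + "title", default="").split())
        authors = [
            author.findtext(ATOM + "name", default="")
            for author in entry.findall(ATOM + "author")
        ]
        papers.append({
            "id": entry.findtext(ATOM + "id", default=""),
            "title": title,
            "authors": authors,
        })
    return papers


def search(search_query, max_results=5):
    """Fetch one page of results over HTTP and parse it (needs network access)."""
    url = build_query_url(search_query, max_results=max_results)
    with urllib.request.urlopen(url) as response:
        return parse_feed(response.read())
```

Under those assumptions, something like search("all:electron", max_results=3) would fetch three matching entries, and you could loop over the result to print titles or feed them into a bibliography tool. No clicking required.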

Despite the API being only a few days old, there have already been some people that have stepped up to develop clients, including OpenWetWare’s Bill Flanagan. Pretty soon you will be able to use the extremely convenient biblio plugin on OpenWetWare to create bibliographies using arXiv articles. Given that there is a growing number of arXiv posts about quantitative biology, I suspect that there will be a growing number of OWW users that use this feature.

But I should mention that the arXiv is not the first scientific literature source to open up its information via an API. To my knowledge, this milestone was achieved by the National Center for Biotechnology Information with their Entrez e-utils system. This system allows programmatic access to all of PubMed, PubMed Central, and the data warehouses at NCBI such as GenBank. In fact, the current biblio plugin uses this API.

But the arXiv API puts the physics, math and computer science communities in the mix, so that someone can really make a mashup with all of that open access content. I tried to do this a while ago, before the arXiv API existed, and let me tell you that I sorely missed it. The arXiv API is a much needed addition to the open science infrastructure. Just as the arXiv has inspired others in the past, I hope this inspires a wave of API building by journal publishers and others with valuable data so that we can have the tools necessary to creatively combine all these knowledge sources to improve the way science is done.

For more details on the arXiv API, you can go to the arXiv API homepage, where you will find information on how to participate in the lively developer community, as well as info on how to get started using the API for your purposes.

Comments

Comment from austen
Time: October 25, 2007, 9:41 pm

Arxiv or any other preprint service is bound to fail for biology and chemistry. Why? Because journals like Science, American Chemical Society Journals, and New England Journal of Medicine prevent the use of preprints.

The solution?

Take advantage of the self archive loophole and inform everyone and link to your self archived papers here: http://materialtransfer.org/self-archive if you do materials based science.

I don’t know the rules quant bio journals have, but I know most of them aren’t very good.

Usually the good quantitative bio stuff has a connection to reality i.e. it involves lab work. So all the good quantitative bio stuff could probably link to materialtransfer.org, since materials are involved in the research.

Arxiv is great for communities like physics and math where preprints are accepted by journals, but not for communities like bio and chem where they aren’t.

Hopefully at the end of all this everyone will be able to put their stuff on Arxiv…but until the day the closed access journals fall materialtransfer.org is the best solution currently available.

Comment from Julius B. Lucks
Time: October 25, 2007, 10:02 pm

This is a very interesting comment, and what you are highlighting is that the real barrier to full open access is cultural. The success of arXiv.org is in large part due to the fact that it put a digital face to an already strong pre-print culture - Ginsparg did not have to convince people that pre-prints were the way to go. As we can see, the cultural change is the hard part. That is also an interesting point about self-archiving of your papers. However, I don’t completely agree that preprint servers are impossible for the biological crowd. PNAS in particular allows submissions that have been posted on arXiv, for example, and Nature itself just launched Nature Precedings, which incidentally was heavily influenced by the arXiv. And according to the Nature Precedings website (http://precedings.nature.com/about): “Nature Precedings hosts manuscripts that may be submitted to any journal of any publisher. Nature and all Nature journals have a policy that permits such posts on recognized pre- or e-print servers such as Nature Precedings and arXiv without affecting their eligibility for publication, whether or not such postings result in discussion on other sites and in the media. We cannot take responsibility for the possibility of scooping by competitors. Authors submitting to other journals are advised to check their policies about prior postings before sending manuscripts to Nature Precedings.” Nature is such a market force that I can see other publishers allowing submission of articles that have been posted on pre-print servers in the near future. Interestingly enough, there seem to be chemistry papers up on Nature Precedings (http://precedings.nature.com/subjects/chemistry). But in the end it is all about the community. As you can see from a recent blog post, it looks like Springer tried a chemistry pre-print server a while ago and failed (http://blogs.openaccesscentral.com/blogs/ccblog/entry/anewpreprintserverfrom).

Comment from austen
Time: October 26, 2007, 4:46 pm

I’m aware of Nature Precedings. I’m also aware of the failures like Elsevier’s Preprint service http://www.sciencedirect.com/preprintarchive which failed because the ACS blocked it. What you have to understand about the ACS people is that they are all going to hell. They are bastards of the worst kind. They are the guys that hired loads of lobbyists to shut down pubchem: http://pubchem.ncbi.nlm.nih.gov/ and tried to sue Google for Google Scholar indexing their papers (btw are you fucking kidding me?). Both attempts they luckily failed at.

I am skeptical about the Nature thing both because of Elsevier’s failure and because Nature survives by blocking access to government paid for data. I really can’t get enthusiastic about a company that’s blocking my access to scientific knowledge.

Comment from austen
Time: October 26, 2007, 7:33 pm

Also I don’t think it’s a “cultural thing”. I think it’s just that chem/bio people are sluttier than their physics/math counterparts. In chem/bio the science can be less important than the image. But in math/phys there isn’t really a difference between the two.

Comment from Julius
Time: October 30, 2007, 2:08 pm

See Peter Suber’s post on the arXiv API: http://www.earlham.edu/~peters/fos/2007/10/arxiv-opens-its-api.html

Pingback from Announcing the arXiv API « arXiv API Developments
Time: October 30, 2007, 4:17 pm

[…] has been some discussion in the open access community (see Open Access News and Programmable Cells) about the API already, and we hope to cultivate a lively developer […]

Comment from Whatever-ishere
Time: November 21, 2007, 2:14 pm

thanks for the GREAT post! Very useful…

Comment from Alexander Wait Zaranek
Time: February 12, 2008, 5:13 pm

At least in Physics, scientists routinely publish on the arXiv then submit to Nature, Science etc. I’m not sure that much more than a change of article name is required to pay lip-service to their “no-pre-prints” policy. Maybe this isn’t really such a big obstacle in Biology after all? I could try to look into it further, if anyone is interested. –AWZ
