An openwetware blog on the challenges of open and connected science


Connecting the dots - the well posed question and code as a liability

Just a brief thought prompted by two, partly related, things streaming past my nose. Firstly Michael Nielsen discussed the views of Aristotle and Sunstein on collective intelligence. The thing that caught my attention was the idea that deliberation can make group functioning worse, leading to a collective decision that is muddled rather than actually identifying the best answer presented by members of the community. The exception to this is well posed questions, where deliberation can help. In science we are familiar with the idea that getting the question right (correct design of experiment, well organised theory) can be more important than the answer.

The second item was a blog post entitled “Data is good, code is a liability” from Greg Linden that was shared by Deepak Singh. Greg discussed a talk given by Peter Norvig which focusses on the idea that it is better to get a good sized dataset and use very sparing code to get at an answer rather than attempt to get at the answer de novo via complex code. Quoting from the post:

In one of several examples, Peter put up a slide showing an excerpt for a rule-based spelling corrector. The snippet of code, that was just part of a much larger program, contained a nearly impossible to understand let alone verify set of case and if statements that represented rules for spelling correction in English. He then put up a slide containing a few line Python program for a statistical spelling correction program that, given a large data file of documents, learns the likelihood of seeing words and corrects misspellings to their most likely alternative. This version, he said, not only has the benefit of being simple, but also easily can be used in different languages.
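To make the idea in the quoted passage concrete, here is a minimal sketch of a statistical spelling corrector in Python. It is not the program from Peter Norvig's slide: the function names, the restriction to candidates one edit away, and the tiny inline corpus are all assumptions made for illustration. A real version would learn its word counts from a large file of documents.

```python
import re
from collections import Counter

def words(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def train(corpus):
    """Count how often each word occurs in the corpus."""
    return Counter(words(corpus))

def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word

# Toy corpus standing in for the "large data file of documents" (an assumption).
CORPUS = "the quick brown fox jumps over the lazy dog and the dog sleeps"
COUNTS = train(CORPUS)
```

With this toy corpus, `correct("teh", COUNTS)` picks "the" because it is the only known word one edit away. Nothing in the code is specific to English; swapping the corpus (and the alphabet) is all that changes for another language.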

What struck me was the connection between being able to write a short, readable snippet of code, and the “well posed question”. The dataset provides the collective intelligence. So is it possible to propose the following?

“A well posed question is one which, given an appropriate dataset, can be answered by easily prepared and comprehensible code”

This could also possibly be turned on its head as “a good programming environment is one in which well posed questions can be readily converted to programs”. But it also raises an important point about how the structure of datasets relates to the questions you want to ask. The challenge in recording data is to structure it in such a way that the widest possible set of questions can be asked of that data. Data models all pre-suppose the kind of questions that will be asked. And any sufficiently general data model will be inefficient for most specific types of query.

Rajarshi Guha and Pierre Lindenbaum have been busy preparing different datastores for the solubility data being generated as part of the Open Notebook Science Challenge announced by Jean-Claude Bradley (more on this later). Rajarshi’s form based input has an SQL backend while Pierre has been working to extract the information as RDF. The point is not that one approach is better than the other, but that we need both, and possibly many more formats - and ideally we need to interconvert between them on the fly. A well posed question can easily founder on an inappropriately structured dataset (this is actually just a rephrasing of the Saunders Principle). It will be by enabling easy conversion between different formats that we might approach a situation where the aphorism I have suggested could become true.
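As a rough illustration of what converting between formats on the fly might look like, the sketch below reads rows from an SQL table and emits them as simple subject-predicate-object statements in an N-Triples-like form. The table layout, the column names, the `http://example.org/` namespace, and the single data row are all hypothetical placeholders; they are not the schema or data used by Rajarshi or Pierre.

```python
import sqlite3

BASE = "http://example.org/solubility/"  # hypothetical namespace

def rows_to_triples(conn):
    """Turn each row of the (hypothetical) measurement table into three triples."""
    triples = []
    query = ("SELECT id, solute, solvent, concentration "
             "FROM measurement ORDER BY id")
    for row_id, solute, solvent, conc in conn.execute(query):
        subject = f"<{BASE}measurement/{row_id}>"
        triples.append(f'{subject} <{BASE}solute> "{solute}" .')
        triples.append(f'{subject} <{BASE}solvent> "{solvent}" .')
        triples.append(f'{subject} <{BASE}concentration> "{conc}" .')
    return triples

def demo():
    """Build an in-memory table with one placeholder row and convert it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE measurement ("
        "id INTEGER PRIMARY KEY, solute TEXT, solvent TEXT, concentration REAL)"
    )
    # Placeholder values only, not a real measurement.
    conn.execute("INSERT INTO measurement VALUES (1, 'compound A', 'solvent B', 2.5)")
    return rows_to_triples(conn)
```

The conversion is only a few lines because the question ("give me every measurement as statements about that measurement") fits the structure of the table. A question that cut across the table differently would need more code, which is the point of the aphorism.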

Survey on scientists’ use of software

Greg Wilson and others are running a survey on how scientists’ perception and use of software is changing. The wider the range of people who do this the better, and it will only take you about 10 minutes (five if you’re really clever).

The survey is at:
http://softwareresearch.ca/seg/SCS/scientific-computing-survey.html

Where does Open Access stop and ‘just doing good science’ begin?

I had been getting puzzled for a while as to why I was being characterised as an ‘Open Access’ advocate. I mean, I do advocate Open Access publication and I have opinions on the Green versus Gold debate. I am trying to get more of my publications into Open Access journals. But I’m no expert, and I’ve certainly been around this community for a much shorter time and know a lot less about the detail than many other people. The giants of the Open Access movement have been fighting the good fight for many years. Really I’m just a late comer cheering from the sidelines.

This came to a head recently when I was being interviewed for a piece on Open Access. We kept coming round to the question of what it was that motivated me to be ‘such a strong’ advocate of open access publication. I must have a very strong motivation to have such strong views surely? And I found myself thinking that I didn’t. I wasn’t that motivated about open access per se. It took some thinking, and going back over how I had got here, to realise that this was because of the direction I was coming from.

I guess most people come to the Open Science movement firstly through an interest in Open Access. The frustration of not being able to access papers, followed by the realisation that for many other scientists it must be much worse. Often this is followed by the sense that even when you’ve got the papers they don’t have the information you want or need, that it would be better if they were more complete, the data or software tools available, the methodology online. There is a logical progression from ‘better access to the literature helps’ to ‘access to all the information would be so much better’.

I came at the whole thing from a different angle. My Damascus moment came when I realised the potential power of making everything available; the lab book, the data, the tools, the materials, and the ideas. Once you connect the idea of the read-write web to science communication, it is clear that the underlying platform has to be open, accessible, and re-useable to get the benefits. Science is perhaps the ultimate open platform available to build on. From this perspective it is immediately self evident that the current publishing paradigm and subscription access publication in particular is broken. But it is just one part of the puzzle, one of the barriers to communication that need to be attacked, broken down, and re-built. It is difficult, for these reasons, for me to separate out a bit of my motivation that relates just to Open Access.

Indeed in some respects Open Access, at least in the form in which it is funded by author charges can be a hindrance to effective science communication. Many of the people I would like to see more involved in the general scientific community, who would be empowered by more effective communication, cannot afford author charges. Indeed many of my colleagues in what appear to be well funded western institutions can’t afford them either. Sure you can ask for a fee waiver but no-one likes to ask for charity.

But I think papers are important. Some people believe that the scientific paper as it exists today is inevitably doomed. I disagree. I think it has an important place as a static document, a marker of what a particular group thought at a particular time, based on the evidence they had assembled. If we accept that the paper has a place then we need to ask how it is funded, particularly the costs of peer and editorial review, and the costs maintaining that record into the future. If you believe, as I do, that in an ideal world this communication would be immediately available to all then there are relatively few viable business models available. What has been exciting about the past few months, and indeed the past week has been the evidence that these business models are starting to work through and make sense. The purchase of BioMedCentral by Springer may raise concerns for the future but it also demonstrates that a publishing behemoth has faith in the future of OA as a publishing business model.

For me, this means that in many ways the discussion has moved on. Open Access, and Open Access publication in particular, has proved its viability. The challenges now lie in widening the argument to include data, to include materials, to include process. To develop the tools that will allow us to capture all of this in a meaningful way and to make sense of other people’s record. None of which should in any way belittle the achievement of those who have brought the Open Access movement to its current point. Immense amounts of blood, sweat, and tears, from thousands of people have brought what was once a fringe movement to the centre of the debate on science communication. The establishing of viable publishers and repositories for pre-prints, the bringing of funders and governments to the table with mandates, and of placing the option of OA publication at the fore of people’s minds are huge achievements, especially given the relatively short time it has taken. The debate on value for money, on quality of communication, and on business models and the best practical approaches will continue, but the debate about the value of, indeed the need for, Open Access has essentially been won.

And this is at the core of what Open Access means for me. The debate has placed, or perhaps re-placed, right at the centre of the discussion of how we should do science, the importance of the quality of communication. It has re-stated the principle of placing the claims that you make, and the evidence that supports them, in the open for criticism by anyone with the expertise to judge, regardless of where they are based or who is funding them. And it has made crystal clear where the deficiencies in that communication process lie and exposed the creeping tendency of publication over the past few decades to become more an exercise in point scoring than communication. There remains much work to be done across a wide range of areas but the fact that we can now look at taking those challenges on is due in no small part to the work of those who have advocated Open Access from its difficult beginnings to today’s success. Open Access Day is a great achievement in its own right and it should be a celebration of the efforts of all those people who have contributed to making it possible as well as an opportunity to build for the future.

High quality communication, as I and others have said, and will continue to say, is Just Good Science. The success of Open Access has shown how one aspect of that communication process can be radically improved. The message to me is a simple one. Without open communication you simply can’t do the best science. Open Access to the published literature is simply one necessary condition of doing the best possible science.

The trouble with semantics…

…is knowing what you mean…

I posted last week about the spontaneous CMLReact hackfest held around Peter Murray-Rust’s dining room table the day after Science Blogging in London. There were a number of interesting things that came out of the exercise for me. The first was that it would be relatively easy to design a moderately strict, but pretty standard, description format for a synthetic chemistry lab notebook that could be automatically scraped into CMLReact.

Automatic conversions from lab book to machine readable XML

CMLReact files have (roughly) three sections. In the first, all the molecules that are relevant to the description are described, or in the ideal semantic web world pointed to at an external authority such as Chemspider, PubChem, or other source. In the second section the relationships between input materials, solvents, products, and samples are described. In general all of these will be molecules which are referred to in the first section but this is not absolutely required (and this will be important later). The final section describes observables, procedures, yields, and other descriptions of what happened or what was measured.

If we take a look at the UsefulChem experiment that we converted to CMLReact you can see that most of this information is available in one form or another. The molecules are described via InChI/InChIKey at the bottom of the page. These could be used as they are to populate the molecules section. A little additional markup to distinguish between reactants, solvents, reagents, and products would make it possible to start populating the second section describing the relationships between these molecules.

The third section is the most tricky, and this will always be an 80:20 game. The object is to abstract as much information as can be reasonably garnered without putting in the vast amount of work required to get close to 100% retrieval. At the end of the day, if someone wants the real detail they can go back to the lab book. Peter has demonstrated text scraping tools that do a pretty good job of extracting a lot of this information. In combination with a bit of markup it is reasonable to expect that some basic information (amounts of reagents, yield, temperature of reaction, some descriptive terms) could be extracted. Again, getting 80-90% of a subset of regularly used terms would be very powerful.

But what are we describing?

There is a problem with grabbing this descriptive information from the lab notebook however, and it is a problem that is very general and something I believe we need to grapple with urgently. There is a fundamental question as to what it is that this file is describing. Does it describe the plan of the experiment? The record of carrying out a specific example of this experiment? An ‘averaged’ description of a set of equivalent experiments? A general description of the reaction? Or a description of a model of what we expect or think is happening?

If you look closely at the current version of the CMLReact file you will see that the yield is expressed as a percentage with a standard deviation. This is actually describing the average of three independent reactions but that is not actually made explicit anywhere in this file. Is this important? Well I think it is because it has an effect on what any outward links back to the lab book mean. There is a significant difference between – ‘this link points to an example of this kind of reaction’ (which might in fact be significantly different in the details) and ‘this link points to this exact experiment’ or indeed ‘this link points to an index of relevant experimental results’. Those distinctions need to be encoded in the links, or perhaps more likely made explicit in the abstracted file.
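One way to picture what "making it explicit" could mean is sketched below in Python. This is not CMLReact and does not follow its schema; the class names, field names, and URLs are hypothetical. The sketch only shows a yield record that states whether it describes a single run or an average of several, and that labels each outward link with what the link means.

```python
from dataclasses import dataclass, field
from enum import Enum
from statistics import mean, stdev
from typing import List, Optional, Tuple

class Describes(Enum):
    """What the abstracted record is a description of."""
    PLAN = "plan"
    SINGLE_RUN = "single run"
    AVERAGE_OF_RUNS = "average of runs"
    GENERAL_REACTION = "general reaction"
    MODEL = "model"

class LinkMeaning(Enum):
    """What an outward link back to the lab book points to."""
    EXACT_EXPERIMENT = "this exact experiment"
    EXAMPLE_OF_KIND = "an example of this kind of reaction"
    INDEX_OF_RESULTS = "an index of relevant results"

@dataclass
class YieldRecord:
    describes: Describes
    percent: float
    std_dev: Optional[float]  # None when there is only one run
    n_runs: int
    links: List[Tuple[str, LinkMeaning]] = field(default_factory=list)

def summarise(run_yields, run_urls):
    """Build a yield record that says whether it is one run or an average."""
    if not run_yields or len(run_yields) != len(run_urls):
        raise ValueError("need one lab book URL per run, and at least one run")
    # Each link is to one exact run that contributed to the figure.
    links = [(url, LinkMeaning.EXACT_EXPERIMENT) for url in run_urls]
    if len(run_yields) == 1:
        return YieldRecord(Describes.SINGLE_RUN, run_yields[0], None, 1, links)
    return YieldRecord(Describes.AVERAGE_OF_RUNS, mean(run_yields),
                       stdev(run_yields), len(run_yields), links)
```

A reader of such a record can tell from `describes` and `n_runs` that a percentage with a standard deviation came from several independent reactions, and can follow each link knowing it leads to one specific run rather than to a loosely similar example.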

The CMLReact file is an abstraction of the experimental record. It is therefore important to make it clear what the level of abstraction is and what has been abstracted out of that description. This relates to the distinction I have made before between the flexibility required to record an experiment versus the ability to use a more structured vocabulary to describe the experiment after it has happened. My impression is that people who work in developing these controlled vocabularies are focussed on description rather than recording and don’t often make the distinction between the two. There is also often a lack of distinction between describing an experiment and describing a model of what happened in that experiment. This is important because the model may need to be modified in the future whereas the description of the experiment should be accurate.

Summary

My view remains that when recording an experiment the system used should be as flexible as possible. Structure can be added to this primary record when convenient to make the process of abstracting from this primary record to a controlled vocabulary easier. The primary goal for me, for the moment, remains making a human readable record available. The process of converting the primary record into a controlled vocabulary, such as CMLReact, FuGE, or workflow system such as Taverna, should be enabled via domain specific automated or semi-automated tools that help the user to structure their description of the experiment in a way that makes it more directly useful to them but maintains the links with the primary record. Where the same controlled vocabulary is used for more abstracted descriptions of studies, experiments, or the models that purport to describe them, this distinction must be made clear.

Semantics depends absolutely on being clear about what you are describing. There is absolutely no point in having absolute clarity about the description of an object if the nature of that object is fuzzy. Get it right and we could have a very sophisticated description of the scientific record. Get it wrong and that description could be at best unclear and at worst downright misleading.

Notes from Scifoo

I am too tired to write anything even vaguely coherent. As will have been obvious there was little opportunity for microblogging, I managed to take no video at all, and not even any pictures. It was non-stop, at a level of intensity that I have very rarely encountered anywhere before. The combination of breadth and sharpness that many of the participants brought was, to be frank, pretty intimidating but their willingness to engage and discuss and my realisation that, at least in very specific areas, I can hold my own made the whole process very exciting. I have many new ideas, have been challenged to my core about what I do, and how; and in many ways I am emboldened about what we can achieve in the area of open data and open notebooks. Here are just some thoughts that I will try to collect some posts around in the next few days.

  • We need to stop fretting about what should be counted as ‘academic credit’. In another two years there will be another medium, another means of communication, and by then I will probably be conservative enough to dismiss it. Instead of just thinking that diversifying the sources of credit is a good thing we should ask what we want to achieve. If we believe that we need a more diverse group of people in academia then that is what we should articulate - Courtesy of a discussion with Michael Eisen and Sean Eddy.
  • ‘Open Science’ is a term so vague as to be actively dangerous (we already knew that). We need a clear articulation of principles or a charter. A set of standards that are clear, and practical in the current climate. As these will be lowest common denominator standards at the beginning we need a mechanism that enables or encourages a process of incrementally raising those standards. The electronic Geophysical Year Declaration is a good working model for this - Courtesy of session led by Peter Fox.
  • The social and personal barriers to sharing data can be codified and made sense of (and this has been done). We can use this understanding to frame structures that will make more data available - session led by Christine Borgman
  • The Open Science movement needs to harness the experience of developing the open data repositories that we now take for granted. The PDB took decades of continuous work to bring to its current state and much of it was a hard slog. We don’t want to take that much time this time round - Courtesy of discussion led by Sarah Berman
  • Data integration is tough, but it is not helped by the fact that bench biologists don’t get ontologies, and that ontologists and their proponents don’t really get what the biologists are asking. I know I have an agenda on this but social tagging can be mapped after the fact onto structured data (as demonstrated to me by Ben Good). If we get the keys right then much else will follow.
  • Don’t schedule a session at the same time as Martin Rees does one of his (aside from anything else you miss what was apparently a fabulous presentation).
  • Prosthetic limbs haven’t changed in 100 years and they suck. Might an open source approach to building a platform be the answer? - discussion with Jon Kuniholm, founder of the Open Prosthetics Project.
  • The platform for Open Science is very close and some of the key elements are falling into place. In many ways this is no longer a technical problem.
  • The financial system backing academic research is broken when the cost of reproducing or refuting specific claims rises to 10 to 20-fold higher than the original work. Open Notebook Science is a route to reducing this cost - discussion with Jamie Heywood.
  • Chris Anderson isn’t entirely wrong - but he likes being provocative in his articles.
  • Google run a fantastically slick operation, down to the fact that the chocolate coated oatmeal biscuit icecream sandwiches are specially ordered in, made with proper sugar instead of high fructose corn syrup.

Enough. Time to sleep.

First neutrons from ISIS TS-2!

In a break from your regularly scheduled programme on Open Science we bring you news from deepest darkest Oxfordshire. I am based at ISIS, the UK’s neutron source, where my job is to bring in and support more biological science that uses neutrons. Neutron scattering, while it has made a number of crucial contributions to the biological sciences, has always been a bit player in comparison to x-ray crystallography and NMR. My job is to try and build and strengthen this activity and to see the potential of neutron scattering in structural biology realised.

The Second Target Station project at ISIS is a huge part of this, and the reason I have a job here. TS-2 is designed specifically to provide a high flux of low energy neutrons, which are ideally suited to looking at large scale structures and biological molecules. The energy characteristics of the neutrons mean they have wavelengths ranging from angstroms up to around 2 nm, meaning they will be well suited to looking at the overall shape and size of biomolecules and their complexes. The increase in flux, probably about 10-20 fold over the existing target station, means that experiments can be faster, or smaller, or more dilute. All things that make the bioscientist’s job easier. Over £140M has been spent on building the target and the instruments that will make use of these neutrons.

At 1308 yesterday the first neutrons were detected on the Inter beamline with a spectrum and flux pretty much dead on what was expected. In fact the first shot flipped out the detector it was so strong. This has been a massive project that despite news coverage to the contrary has been delivered essentially on time and on budget. Congratulations are due to all those involved in pulling this off. As the instruments themselves start to come fully online now we are going to get the chance to do many things that were either difficult or impossible before. In particular I am excited about what we will be able to do with the new small angle instrument SANS2d and INTER, the reflectometer, particularly in the area of membrane biology.

A new way of looking at science?

I’ve spent a long time talking about two things that our LaBLog enables, or rather that it should enable. One is that by changing the way we view the record we can look at our results and materials in a new way. The second is that we want to enable a machine to read the lab book. Andrew Milsted, the main developer of the LaBLog and a PhD student in Jeremy Frey’s group, has just enabled a significant step in that direction. He’s managed to dump my lab book as RDF, which enables us to look at it in an RDF viewer such as Welkin, developed by the Simile group at MIT.

At the moment this just shows each post as a node and the links between posts as edges. But there are a number of things that are immediately obvious. The first is that I start a lot of things and don’t necessarily manage to get very far with them and that I do a number of (currently) unrelated things (isolated subgraphs aren’t connected). Also that there are some materials that get widely re-used and some that don’t. There are also clearly things that I haven’t finished entering properly (isolated nodes). Finally, that we need a more sophisticated tool for playing with the view because building a human readable version of the graph will require some manipulation, grabbing subgraphs and moving them around. Welkin is great but after 30 minutes playing I have a bunch of feature requests. But this is what I’ve done so far. I am sure there are many things that can be done with this kind of view - but for the moment what is important is that it is an entirely new kind of way of looking at the record.
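The observations above (isolated nodes, unconnected subgraphs) do not strictly need a viewer; they can also be pulled out of the post-and-link graph with a few lines of code. The Python sketch below is an illustration only: it assumes the dump has already been reduced to a list of post identifiers and a list of (post, post) links, treats links as undirected, and uses made-up identifiers. It is not how the LaBLog or Welkin work internally.

```python
from collections import defaultdict

def components(nodes, edges):
    """Group posts into connected subgraphs, treating links as undirected."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen = set()
    result = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first walk from an unvisited post collects its whole subgraph.
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(neighbours[node] - comp)
        seen |= comp
        result.append(comp)
    return result

def isolated(nodes, edges):
    """Posts with no links at all, e.g. entries that were never finished."""
    return [c.copy().pop() for c in components(nodes, edges) if len(c) == 1]
```

For example, with posts `p1` to `p5` and links `p1-p2` and `p2-p3`, `components` returns one subgraph of three posts and two single-post subgraphs, and `isolated` lists `p4` and `p5`.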

For those interested in following progress on another story, the data and analysis built on the model that Pawel Szczesny built for us is in the bottom right hand corner of the graph. You can see that at the moment it is isolated from the rest of the graph because we haven’t yet compared these models with our experimental results (actually the relevant experiments aren’t on this graph because it was dumped before we did them). That’s something we should be doing in the next few days. If the data matches the model (current indications are that it does, but data quality is an issue) then we will have something very interesting to say about the structural changes on ligand binding in ligand gated ion channels.

Practical communications management in the laboratory – getting semantics from context

Rule number one: Never give your students your mobile number. They have a habit of ringing it.

Our laboratory is about a ten minute walk from my office. Some of the other staff have offices five minutes away in the other direction and soon we will have another lab which is another ten minute walk away in a third direction. I am also offsite a lot of the time. Somehow we need to keep in contact between the labs and between the people. This is a question of passing queries around but also of managing the way these queries interrupt what I and others are doing.

Having broken rule #1 I am now trying to manage my attention when my phone keeps going off with updates, questions, and details. Much of it at inconvenient times and much of it things that other people could answer. So what is the best way to spread the load and manage the inbox?

What I am going to propose is to set up a lab account on Twitter. If we get everyone to follow this account and set updates to be sent via SMS to everyone’s phones we have a nice simple notification system. We just set up a Twitter client on each computer in the lab, logged into that account, agree a partly standardised format for Tweets (primarily including the person’s name) and go from there. This will enable people to ask questions (and anyone to answer them), provide important updates or notices (equipment broken, or working again), and to keep people updated with what is happening. It also means that we will have a log of everyone’s queries, answers, and notices that we can go back to and archive.

So a fair question at this point would be why don’t we do this through the LaBLog? Surely it would be better to keep all these queries in one place? Well one answer is that we are still struggling to deploy the LaBLog at RAL, but that’s a story for a separate post. But there is a fundamental difference in the way we interact with Twitter/SMS and notifications through the LaBLog via RSS. Notification of new material on the LaBLog via RSS is slow, but more importantly it is fundamentally a ‘pull’ interaction. I choose when to check it. Twitter and specifically the SMS notification is a ‘push’ interaction which will be better when you need people to notice, such as when you’re asking an urgent question, or need to post an urgent notice (e.g. don’t use the autoclave!). However, both allow me to see the content before deciding whether to answer, a crucial difference with a mobile phone call, and they give me options over what medium to respond with. They return the control over my time back to me rather than my phone.

The point is that these different streams have different information content, different levels of urgency, and different currency (how long they are important for). We need different types of action and different functionality for each. Twitter provides forwarding to our mobile devices, regardless (almost) of where in the world we are currently located, providing a mechanism for direct delivery. One of the fundamental problems with all streaming protocols and applications is that they have no internal notion of priority, urgency, or currency. We are rapidly approaching the point where simply skimming all of our incoming streams (currently often in many different places) is not an option. Aggregating things into one place where we can triage them will help but we need some mechanism for encoding urgency, importance, and currency. The easiest way for us to achieve this at the moment is to use multiple services.

One approach to this problem would be a single portal/application that handled all these streams and understood how to deal with them. My guess is that Workstreamr is aiming to fit into this niche as an enterprise solution to handling all workstreams from the level of corporate governance and strategic project management through to the office watercooler conversation. There is a challenging problem in implementing this. If all content is coming into one portal, and can be sent (from any appropriate device) through the same portal, how can the system know what to do with it? Does it pop up as an urgent message demanding the boss’s attention or does it just go into a file that can be searched at a later date? This requires that the system either infer or have users provide an understanding of what should be done with a specific message. Each message therefore requires a rich semantic content indicating its importance, possibly its delivery mechanism, and whether this differs for different recipients. The alternative approach is to do exactly what I plan to do – use multiple services so that the semantic information about what should be done with each post is encoded from its context. It’s a bit crude but the level of urgency or importance is encoded in the choice of messaging service.
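The "semantics from context" idea can be sketched in a few lines of Python. In the sketch below the message itself carries no urgency field at all; how it should be handled is looked up from the service it arrived on. The channel names and the policy table are hypothetical, chosen to mirror the Twitter/SMS versus LaBLog/RSS split described above.

```python
from dataclasses import dataclass

# Hypothetical policy table: handling is a property of the channel,
# not of the individual message.
CHANNEL_POLICY = {
    "twitter_sms": ("push", "urgent"),
    "lablog_rss": ("pull", "routine"),
}

@dataclass
class Message:
    channel: str
    author: str
    text: str

def handling(message):
    """Infer delivery mode and urgency purely from the channel used."""
    delivery, urgency = CHANNEL_POLICY[message.channel]
    return {"delivery": delivery, "urgency": urgency}

def triage(messages):
    """Urgent messages first; arrival order is kept within each group."""
    return sorted(messages, key=lambda m: CHANNEL_POLICY[m.channel][1] != "urgent")
```

It is as crude as the approach it illustrates: the sender's only way to mark something as urgent is to pick the urgent service. A single portal would instead need something like an explicit urgency field on every `Message`, supplied by the user or inferred by the system.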

This may seem like rather a lot of weight to give to the choice between tweeting and putting up a blog post but this is part of a much larger emerging theme. When I wrote about data repositories I mentioned the implicit semantics that comes from using repositories such as slideshare and Flickr (or the PDB) that specialise in a specific kind of content. We talk a lot about semantic publishing and complain that people ‘don’t want to put into the metadata’ but if we recorded data at source, when it is produced, then a lot of the metadata would be built in. This is fundamentally the publish@source concept that I was introduced to by the group of Jeremy Frey at Southampton University. If someone logs into an instrument, we know who generated the data file and when, and we know what that datafile is about and looks like. The datafile itself will contain date and instrument settings. If the sample list refers back to URIs in a notebook then we have all the information on the samples and their preparation. If we know when and where the datafile was recorded and we are monitoring room conditions then we have all of that metadata built in as well.
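A minimal sketch of what capturing metadata at source might look like is given below. It is not the publish@source implementation from the Southampton group; the function name, the field names, and the example values in the usage note are assumptions. The point it illustrates is that every field comes from something the system already knows when the data file is written, so nobody has to type it in afterwards.

```python
from datetime import datetime, timezone

def capture_at_source(user, instrument, settings, sample_uris,
                      room_conditions, recorded_at=None):
    """Assemble metadata from context available when a data file is written."""
    if recorded_at is None:
        recorded_at = datetime.now(timezone.utc)
    return {
        "generated_by": user,                      # from the instrument login
        "instrument": instrument,                  # which machine wrote the file
        "settings": dict(settings),                # copied from the data file
        "samples": list(sample_uris),              # URIs back into the notebook
        "room_conditions": dict(room_conditions),  # from room monitoring
        "recorded_at": recorded_at.isoformat(),    # when the file was recorded
    }
```

A call such as `capture_at_source("a_user", "instrument-1", {"wavelength_nm": 0.6}, ["http://example.org/notebook/sample/1"], {"temp_c": 21.0})` returns a complete record without any manual annotation; all of the values in that call are placeholders.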

The missing piece is the tools that bring all this together and a more sophisticated understanding of how we can bring all these streams together and process them. But at the core, if we capture context, capture user focus, and capture the connections to previous work then most of the hard work will be done. This will only become more true as we start to persuade instrument manufacturers to output data in standard formats. If we try and put the semantics back in after the fact, after we’ve lost those connections, then we are just creating more work for ourselves. If the suite of tools can be put together to capture and collate it at source then we can make our lives easier – and that in turn might actually persuade people to adopt these tools.

The key question of course…which Twitter client should I use? :)

The full Web2.0 experience - My talk tomorrow at IWMW in Aberdeen

Tomorrow I am giving a talk at the UKOLN Institutional Web Managers workshop starting at 12:45 British Summer Time (GMT+1). In principle you will be able to see the talk video cast at the links on the video streaming page. The page also has a liveblogging tool (OpenID enabled apparently!). I won’t be liveblogging my own talk but I will be attempting to respond to comments or questions either on that tool, via FriendFeed, or @cameronneylon on Twitter. I make no promises that this will work but if it all fails then I will record a live screencast.

What I missed on my holiday or Why I like refereeing for PLoS ONE

I was away last week having a holiday and managed to miss the whole Declan Butler/PLoS/Blogosphere dustup. Looked like fun. I don’t want to add to the noise as I think there were a lot of knee-jerk reactions and significantly more heat than light. For anyone coming here without having heard about this I will point at the original article, Bora’s summary of reactions, and Timo Hannay’s reply at Nature. What I wanted to add to the discussion was a point that I haven’t seen in my quick skimming of the whole debate (which is certainly not complete so if I missed this then please drop in a comment).

No-one as far as I can see has really twigged as to just how disruptive PLoS ONE really is. In this I agree with Timo, in that I think publishers, from BMC, to Elsevier, ACS and Nature Publishing Group itself, should be very worried about the impact that it will have and think very hard about what it means for their future business models. Where we disagree, I think, is that I find this very exciting and think that it shows the way towards a scientific publishing industry that will look very different from today’s. Differentiating on quality prior to publication was always difficult, and certainly expensive. The question for the future is whether we are prepared to pay for it, and whether we are getting value for money.

The criticism levelled at PLoS ONE is that it uses a ‘light touch’ refereeing process with the only criterion for publication being that a paper is methodologically sound. This, it is implied, leads to a ‘low quality journal’ or perhaps rather a journal with a large number of relatively uncited articles. However there are very strong positives to this ‘light touch’ approach. It is fast. And it is cheap. The issue here is business models and the business model of PLoS ONE is highly disruptive. And financially successful. To me this is the big news. People are flocking to PLoS ONE because it is a quick and straightforward way of getting interesting (but perhaps not career making) results out there.

From an author’s perspective PLoS ONE cuts out the crap in getting papers published. The traditional approach (send to Nature/Science/Cell, get rejected, send to Nature/Science/Cell baby journal, get rejected, send to top tier specific journal, get rejected, end up eventually going to a journal that no-one subscribes to) takes time and effort and by the time you win someone else has usually published it anyway. It also costs the authors money in staff time to re-format, re-jig, appease referees, and re-jig again to appease a different set of referees. I haven’t done the sums but in the worst case scenario this could probably cost as much as a PLoS ONE publication charge. Save time, save money, still get indexed in PubMed. It starts to sound good, especially for all that material that you are not quite sure where to pitch.

But what about that stuff that is really hard hitting? That you know is important. Here you now have an interesting choice. You can send to Nature/Science/Cell/PLoS Biology and if you get past the initial editorial review stage and get to referees then you are probably looking at around six to nine months before publication. You will be in a high profile journal, can generate good publicity, and will have a great paper on your CV. Alternatively you can send to PLoS ONE and have it on the web and in PubMed in perhaps two to four weeks. If the paper is as strong as you believe then you will still get your hundreds of citations, still have a great paper, still get good publicity. It probably doesn’t look quite so good on the traditional CV, but try putting the number of citations for each paper on your CV - that puts it in perspective. And it will be out a lot faster, you will be ahead of the game and you can apply for your next grant with ‘paper published and already cited three times’ not ‘paper submitted’ (read ‘about to be rejected’, there is an art in submitting papers just before the grant deadline). This makes for an interesting choice and one which cuts directly across the usual high impact/low impact criterion. It puts speed and convenience on the table as market differentiators in a way they haven’t been before.

For a referee, PLoS ONE has a lot of appeal as well. You are being asked a very specific question. I recently refereed one paper for PLoS ONE at the same time as one for another (fairly low impact) journal. The PLoS ONE paper was a very simple case: the methodological detail was exemplary; easy to read, clear, and detailed. You get the impression the authors took care over it, possibly because they knew that was what it would be judged on (it is of course entirely possible that this group just writes good papers). The other paper was a distinct case of salami slicing - but I was left trying to figure out whether it had been cut too thin for this specific journal. This is not just a difficult judgement to make. It is a highly subjective and probably meaningless one. The data was still useful and publishable, just probably not in that specific journal. Which one do you think took me longer? And which one left me with a warm feeling?

What about the reader? There is a lot of interesting stuff in PLoS ONE. There is also a lot of dross. But why should that matter? I don’t look at the dross; I often don’t even know that it exists. I can’t remember the last time I actually looked at a journal table of contents. It doesn’t matter to me whether a paper is in Nature, Science, PLoS ONE, or Journal of the society for some highly specific thing in some rather small place. If it is searchable, and I have access to it then that’s all I need. If it is not both of these then for me it simply does not exist. And I don’t judge the value or reliability of an article based on where it is, I judge the article on what it contains. PLoS ONE actually wins here because its hard focus on being ‘methodologically sound’ tends to lead referees and editors (as well as authors) to focus on this aspect.

To me the truly radical thing about PLoS ONE is that it has redefined the nature of peer review and that people have bought into this model. The idea of dropping any assessment of ‘importance’ as a criterion for publication had very serious and very real risks for PLoS. It was entirely possible that the costs wouldn’t be usefully reduced. It was more than possible that authors simply wouldn’t submit to such a journal. PLoS ONE has successfully used a difference in its peer review process as the core of its appeal to its customers. The top tier journals have effectively done this for years at one end of the market. The success of PLoS ONE shows that it can be done in other market segments. What is more, it suggests it can be done across existing market segments. That radical shift in the way scientific publishing works that we keep talking about? It’s starting to happen.