An openwetware blog on the challenges of open and connected science

Open Research: The personal, the social, and the political

Next Tuesday I’m giving a talk at the Institute for Science Ethics and Innovation in Manchester. This is a departure for me in terms of talk subjects, inasmuch as it is much more to do with policy and politics. I have struggled quite a bit with it so this is an effort to work it out on “paper”. Warning, it’s rather long. The title of the talk is “Open Research: What can we do? What should we do? And is there any point?”

I’d like to start by explaining where I’m coming from. This involves explaining a bit about me. I live in Bath. I work at the Rutherford Appleton Laboratory, which is near Didcot. I work for STFC but this talk is a personal view so you shouldn’t take any of these views as representing STFC policy. Bath and Didcot are around 60 miles apart so each morning I get up pretty early, I get on a train, then I get on a bus which gets me to work. I work on developing methodology to study complex biological structures. We have a particular interest in trying to improve methods for looking at proteins that live in biological membranes and protein-nucleic acid complexes. I have also done work on protein labelling that lets us make cool stuff and pretty pictures. This work involves an interesting mixture of small scale lab work and work on big instruments at large, often multi-national, facilities. It also involves far too much travelling.

A good question to ask at this point is “Why?” Why do I do these things? Why does the government fund me to do them? Actually it’s not so much why the government funds them as why the public does. Why does the taxpayer support our work? Even that’s not really the right question because there is no public. We are the public. We are the taxpayer. So why do we as a community support science and research? Historically science was carried out by people sufficiently wealthy to fund it themselves, or in a small number of cases by people who could find wealthy patrons.
After the second world war there was a political and social consensus that science needed to be supported and that consensus has supported research funding more or less to the present day. But with the war receding in public memory we seem to have retained the need to frame the argument for research funding in terms of conflict or threat. The War on Cancer, the threat of climate change. Worse, we seem to have come to believe our own propaganda, that the only way to justify public research funding is that it will cure this, or save us from that. And the reality is that in most cases we will probably not deliver on this.

These are big issues and I don’t really have answers to a lot of them but it seems to me that they are important questions to think about. So here are some of my ideas about how to tackle them from a variety of perspectives. First the personal.

A personal perspective on why and how I do research

My belief is we have to start with being honest with ourselves, personally, about why and how we do research. This sounds like some sort of self-help mantra I know but let me explain what I mean. My personal aim is to maximise my positive impact on the world, either through my own work or through enabling the work of others. I didn’t come at this from first principles but it has evolved. I also understand I am personally motivated by recognition and reward and that I am strongly, perhaps too strongly, motivated by others’ opinions of me. My understanding of my own skills and limitations means that I largely focus my research work on methodology development and enabling others. I can potentially have a bigger impact by building systems and capabilities that help others do their research than I can by doing that research myself. I am lucky enough to work in an organization that values that kind of contribution to the research effort.

Because I want my work to be used as far as is possible I make as much as possible of it freely available.
Again I am lucky that I live now when the internet makes this kind of publishing possible. We have services that enable us to easily publish ideas, data, media, and process and I can push a wide variety of objects onto the web for people to use if they so wish. Even better than that I can work on developing tools and systems that help other people to do this effectively. If I can have a bigger impact by enabling other people’s research then I can multiply that again by helping other people to share that research. But here we start to run into problems. Publishing is easy. But sharing is not so easy. I can push to the web, but is anyone listening? And if they are, can they understand what I am saying?

A social perspective (and the technical issues that go with it)

If I want my publishing to be useful I need to make it available to people in a way they can make use of. We know that networks increase in value much more than linearly as they grow. If I want to maximise my impact, I have to make connections and maximise the ability of other people to make connections. Indeed Merton made the case for this in scientific research 20 years ago.

I propose the seeming paradox that in science, private property is established by having its substance freely given to others who might want to make use of it.

This is now a social problem but a social problem with a distinct technical edge to it. Actually we have two related problems. The issue of how I make my work available in a useful form and the separate but related issue of how I persuade others to make their work available for others to use.

The key to making my work useful is interoperability. This is at root a technical issue but at a purely technical level it is one that has been solved. We can share through agreed data formats and vocabularies. The challenges we face in actually making it happen are less technical problems than social ones but I will defer those for the moment. We also need legal interoperability. Science Commons amongst others has focused very hard on this question and I don’t want to discuss it in detail here except to say that I agree with the position that Science Commons takes; that if you want to maximise the ability of others to re-use your work then you must make it available with liberal licences that do not limit fields of use or the choice of licence on derivative works. This means CC-BY, BSD etc. but if you want to be sure then your best choice is explicit dedication to the public domain.

But technical and legal interoperability are just subsets of what I think is more important: process interoperability. If the objects we publish are to be useful then they must be able to fit into the processes that researchers actually use. As we move to the question of persuading others to share and build the network this becomes even more important. We are asking people to change the way they do things, to raise their standards perhaps. So we need to make sure that this is as easy as possible and fits into their existing workflows. The problem with understanding how to achieve technical and legal interoperability is that the temptation is to impose it and I am as guilty of this as anyone.
What I’d like to do is use a story from our work to illustrate an approach that I think can help us to make this easier.

Making life easier by capturing process as it happens: Objects first, structure later

Our own work on web based laboratory recording systems, which really originates in the group of Jeremy Frey at Southampton, came out of earlier work on a fully semantic RDF backed system for recording synthetic chemistry. In contrast we took an almost completely unstructured approach to recording work in a molecular biology laboratory, not because we were clever or knew it would work out, but because it was a contrast to what had gone before. The LaBLog is based on a Blog framework and allows the user to put in completely free text, completely arbitrary file attachments, and to organize things in whichever way they like. Obviously a recipe for chaos.

And it was to start with as we found our way around but we went through several stages of re-organization and interface design over a period of about 18 months. The key realization we made was that while a lot of what we were doing was difficult to structure in advance, there were elements within it, specific processes and specific types of material, that were consistently repeated, even stereotyped, and that structuring these gave big benefits. We developed a template system that made producing these repeated processes and materials much easier. These templates depended on how we organized our posts, and the metadata that described them, and the metadata in turn was driven by the need for the templates to be effective. A virtuous circle developed around the positive reinforcement that the templates and associated metadata provided. More surprisingly the structure that evolved out of this mapped well, in many cases, onto existing ontologies. In specific cases where it didn’t we could see that either the problem arose from the ontology itself, or the fact that our work simply wasn’t well mapped by that ontology.
But the structure arose spontaneously out of a considered attempt to make the user/designer’s life easier. And was then mapped onto the external vocabularies.

I don’t want to suggest that our particular implementation is perfect. It is far from it, with gaping holes in the usability and our ability to actually exploit the structure that has developed. But I think the general point is useful. For the average scientist to be willing to publish more of their research, that process has to be made easy and it has to recognise the inherently unstructured nature of most research. We need to apply structured descriptions where they make the user’s life easier but allow unstructured or semi-structured representations elsewhere. But we need to build tools that make it easy to take those unstructured or semi-structured records and mold them into a specific structured narrative as part of a reporting process that the researcher has to do anyway. Writing a report, writing a paper. These things need to be done anyway and if we could build tools so that the easiest way to write the report or paper is to bring elements of the original record together and push those onto the web in agreed formats through easy to use filters and aggregators then we will have taken an enormous leap forward.

Once we’ve insinuated these systems into the researcher’s process then we can start talking about making that process better. But until then technical and legal interoperability are not enough - we need to interoperate with existing processes as well. If we could achieve this then much more research material would flow online, connections would be formed around those materials, and the network would build.

And finally - the political

This is all very well. With good tools and good process I can make it easier for people to use what I publish and I can make it easier for others to publish. This is great but it won’t make others want to publish.
I believe that more rapid publication of research is a good thing. But if we are to have a rational discussion about whether this is true we need to have agreed goals. And that moves the discussion into the political sphere.

I asked earlier why it is that we do science as a society, why we fund it. As a research community I feel we have no coherent answer to these questions. I also talked about being honest with ourselves. We should be honest with other researchers about what motivates us, why we choose to do what we do, and how we choose to divide limited resources. And as recipients of taxpayers’ money we need to be clear with government and the wider community about what we can achieve. We also have an obligation to optimize the use of the money we spend. And to optimize the effective use of the outputs derived from that money.

We need at core a much more sophisticated conversation with the wider community about the benefits that research brings; to the economy, to health, to the environment, to education. And we need a much more rational conversation within the research community as to how those different forms of impact are and should be tensioned against each other. We need in short a complete overhaul if not a replacement of the post-war consensus on public funding of research. My fear is that without this the current funding squeeze will turn into a long term decline. And that without some serious self-examination the current self-indulgent bleating of the research community is unlikely to increase popular support for public research funding.

There are no simple answers to this but it seems clear to me that at a minimum we need to be demonstrating that we are serious about maximising the efficiency with which we spend public money. That means making sure that research outputs can be re-used, that wheels don’t need to be re-invented, and innovation flows easily from the academic lab into the commercial arena.
And it means distinguishing between the effective use of public money to address market failures and subsidising UK companies that are failing to make effective investments in research and development.

The capital generated by science is in ideas, capability, and people. You maximise the effective use of capital by making it easy to move, by reducing barriers to trade. In science we can achieve this by maximising the ability to transfer research outputs. If we are to be taken seriously as guardians of public money and to be seen as worthy of that responsibility our systems need to make ideas, data, methodology, and materials flow easily. That means making our data, our process, and our materials freely available and interoperable. That means open research.

We need a much greater engagement with the wider community on how science works and what science can do. The web provides an immense opportunity to engage the public in active research as demonstrated by efforts as diverse as Galaxy Zoo with 250,000 contributors and millions of galaxy classifications and the Open Dinosaur Project with people reading online papers and adding the measurements of thigh bones to an online spreadsheet. Without the publicly available Sloan Digital Sky Survey, without access to the paleontology papers, and without the tools to put the collected data online and share them these people, this “public”, would be far less engaged. That means open research.

And finally we need to turn the tools of our research on ourselves. We need to critically analyse our own systems and processes for distributing resources, for communicating results, and for apportioning credit. We need to judge them against the value for money they offer to the taxpayer and where they are found wanting we need to adjust. In the modern networked world we need to do this in a transparent and honest manner. That means open research.

But even if we agree these things are necessary, or a general good, they are just policy.
We already have policies which are largely ignored. Even when obliged to by journal publication policies or funder conditions researchers avoid, obfuscate, and block attempts to gain access to data, materials, and methodology. Researchers are humans too with the same needs to get ahead and to be recognized as anyone else. We need to find a way to map those personal needs, and those personal goals, onto the community’s need for more openness in research. As with the tooling we need to “bake in” the openness to our processes to make it the easiest way to get ahead. Policy can help with cultural change but we need an environment in which open research is the simplest and easiest approach to take. This is interoperability again but in this case the policy and process have to interoperate with the real world. Something that is often a bit of a problem.

So in conclusion…

I started with a title I’ve barely touched on. But I hope with some of the ideas I’ve explored we are in a position to answer the questions I posed. What can we do in terms of Open Research? The web makes it technically possible for us to share data, process, and records in real time. It makes it easier for us to share materials though I haven’t really touched on that. We have the technical ability to make that data useful through shared data formats and vocabularies. Many of the details are technically and socially challenging but we can share pretty much anything we choose to on a wide variety of timeframes.

What should we do? We should make that choice easier through the development of tools and interfaces that recognize that it is usually humans doing and recording the research, and that exploit the ability of machines to structure that record when they are doing the work. These tools need to exploit structure where it is appropriate and allow freedom where it is not. We need tools to help us map our records onto structures as we decide how we want to present them.
Most importantly we need to develop structures of resource distribution, communication, and recognition that encourage openness by making it the easiest approach to take. Encouragement may be all that’s required. The lesson from the web is that once network effects take hold they can take care of the rest.

But is there any point? Is all of this worth the effort? My answer, of course, is an unequivocal yes. More open research will be more effective, more efficient, and provide better value for the taxpayer’s money. But more importantly I believe it is the only credible way to negotiate a new consensus on the public funding of research. We need an honest conversation with government and the wider community about why research is valuable, what the outcomes are, and how they contribute to our society. We can’t do that if the majority cannot even see those outcomes. The wider community is more sophisticated than we give it credit for. And in many ways the research community is less sophisticated than we think. We are all “the public”. If we don’t trust the public to understand why and how we do research, if we don’t trust ourselves to communicate the excitement and importance of our work effectively, then I don’t see why we deserve to be trusted to spend that money.

An open letter to Lord Mandelson

Lord Mandelson is the UK minister for Business, Innovation and Skills, which includes the digital infrastructure remit. He recently announced that a version of the “three strikes” approach to combatting illegal filesharing, with the sanction being removal of internet access, would be applied in the UK. This is a copy of a letter I have sent to Lord Mandelson via the wonderful site www.writetothem.com that provides an easy way to write to UK parliamentarians. If you have an interest in the issue I suggest you do the same.

Lord Mandelson

House of Lords

Palace of Westminster

4 September 2009

Dear Lord Mandelson

I am writing to protest the decision taken by yourself to impose a “three strikes” approach to online rights and monopoly violations with an ultimate sanction requiring service providers to remove internet access. I am not a UK citizen but have lived in the UK for ten years and regard it as my home. I have a direct interest in the use of new technologies for communication, particularly in scientific research, and a vested interest in the long term competitiveness of the UK and its ability to support continued innovation in this area.

Your decision is wrong. Not because copyright violation should be allowed or respected and not because the mainstream content industry should be ashamed that it makes money. It is wrong because it will stifle the development of new forms of creativity and the development of entirely new industries. As an advocate of Open Access scientific publication and copyright reform I am critical of the current system of rights and monopolies but I work hard to respect the rights of content producers. And it is very hard work. Even for someone with some expertise in copyright and licensing, doing this right requires time and effort. When I write, or prepare presentations, I spend significant amounts of time identifying work I can re-use, checking that licences are compatible, and making sure I license my own derivative work in a way that respects the rights of those people whose work I have built on.

New forms of creativity are developing that re-use and re-purpose existing content but in fact this is not new at all. Re-use and re-purposing in culture has a grand tradition from Homer, via Don Quixote to Romeo and Juliet, from Brahms’ Haydn variations to Hendrix’s version of the Star Spangled Banner. In my own field all science and technology is derivative. It builds constantly on the work of others. But the internet makes new forms of re-use possible. New types of value creation are also made possible. Re-use of images, video, and text, as well as ideas and data, is enabling the development of new forms of business and new types of innovation in ways that are very challenging to predict. Your proposal will stifle this innovation by creating an environment of fear around re-use and by privileging certain classes and types of content and producer over the generators of new and innovative products. Those who do not care will ignore and circumvent the rules by technical means. And those who are exploring new types of derivative work, new types of innovative content, will be discouraged by the atmosphere of fear and uncertainty created by your policy.

Nonetheless it is important that the rights of content producers are respected. The key is finding the right balance between the needs of existing industries and those of individuals involved in the creation of new content and new industries. I would suggest that the key to any protection mechanism is parity. Large and traditional content producers, if given additional rights over those currently provided by law, must also respect equivalent rights for the small and new media producer.

This can be simply achieved by providing a similar three strikes mechanism for traditional media. Thus if a television broadcaster uses, without appropriate attribution or licensing, video, images, or text taken by an individual then they should have their broadcast licence revoked. Similarly if print media utilise text from bloggers or Wikipedia without appropriate licensing or attribution, then the rights holders should be able to revoke their paper supply. Paper suppliers to the print media would be required to implement systems to enable online authors to register complaints and would be responsible for imposing these sanctions.

Clearly such a system is farcical, creating a nightmare of bureaucracy and heavy handed sanctions that stifle experimentation and economic activity. Yet it is analogous to what you have proposed. Only you are imposing this to protect a mature set of industries with no real long term growth potential while stifling the potential of a whole new class of industries and innovation with massive growth potential over the next few decades.

Your proposal is wrong for purely economic reasons. It is wrong because it will stifle a major opportunity for economic growth right at the point where we need it most. And it is wrong because as a government your role is not to legislate to protect business models but to regulate in a way that balances the risks of damage in one sector against the potential for encouraging new sectors to develop. I respectfully suggest that you have got that balance wrong. I disagreed with much in Lord Carter’s report but perhaps the best measure of its balance was the equally vociferous criticism it received from both sides of the debate. This to me suggests that it forms a productive basis on which to move forward.

Yours sincerely

Cameron Neylon

Watching the future…student demos at University of Toronto

On Wednesday morning I had the distinct pleasure of seeing a group of students in the Computer Science department at the University of Toronto giving demos of tools and software that they have been developing over the past few months. The demos themselves were of a consistently high standard throughout, in many ways more interesting and more real than some of the demos that I saw the previous night at the “professional” DemoCamp 21. Some, and I emphasise only some, of the demos were less slick and polished but in every case the students had a firm grasp of what they had done and why, and were ready to answer criticisms or explain design choices succinctly and credibly. The interfaces and presentation of the software were consistently not just good, but beautiful to look at, and the projects generated real running code that solved real and immediate problems. Steve Easterbrook has given a rundown of all the demos on his blog but here I wanted to pick out three that really spoke to problems that I have experienced myself.

I mentioned Brent Mombourquette’s work on Breadcrumbs yesterday (details of the development of all of these demos are available on the students’ linked blogs). John Pipitone demonstrated this Firefox extension that tracks your browsing history and then presents it as a graph. This appealed to me immensely for a wide range of reasons, firstly because I am very interested in trying to capture, visualise, and understand the relationships between online digital objects. The graphs displayed by Breadcrumbs immediately reminded me of visualisations of thought processes with branches, starting points, and the return to central nodes all being clear. In the limited time for questions the applications in improving and enabling search, recording and sharing collections of information, and even in identifying when thinking has got into a rut and needs a swift kick were all covered. The graphs can be published from the browser and the possibilities that sharing and analysing these present are still popping up as new ideas in my head several days later. In common with the rest of the demos my immediate response was, “I want to play with that now!”

The second demo that really caught my attention was a MediaWiki extension called MyeLink written by Maria Yancheva that aimed to find similar pages on a wiki. This was particularly aimed at researchers keeping a record of their work and wanting to understand how one page, perhaps describing an experiment that didn’t work, was different to a similar page, describing an experiment that did. The extension identifies similar pages in the wiki based on either structure (based primarily on headings I think) or on the text used. Maria demonstrated comparing pages as well as faceted browsing of the structure of the pages in line with the extension. The potential here for helping people manage their existing materials is huge. Perhaps more exciting, particularly in the context of yesterday’s post about writing up stories, is the potential to assist people with preparing summaries of their work. It is possible to imagine the extension first recognising that you are writing a summary based on the structure, and then recognising that in previous summaries you’ve pulled text from a different specific class of pages, all the while helping you to maintain a consistent and clear structure.
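
To make the idea of structure-based similarity concrete, here is a minimal sketch in Python. It is not MyeLink’s actual method, which I haven’t seen in detail; it simply assumes that “structure” means the set of section headings on a page and scores two pages by the overlap of those sets (Jaccard similarity). The function names are my own, for illustration only.

```python
import re

# Matches MediaWiki section headings such as "== Aim ==" or "=== Results ===".
HEADING = re.compile(r"^=+[ \t]*(.+?)[ \t]*=+[ \t]*$", re.MULTILINE)


def headings(wikitext):
    """Return the set of lower-cased heading titles found in a page."""
    return {title.lower() for title in HEADING.findall(wikitext)}


def heading_similarity(page_a, page_b):
    """Jaccard overlap of the two pages' heading sets, from 0.0 to 1.0."""
    a, b = headings(page_a), headings(page_b)
    if not a and not b:
        return 0.0  # no headings at all: treat as no structural evidence
    return len(a & b) / len(a | b)


failed = "== Aim ==\nsome text\n== Method ==\n== Result ==\n"
worked = "== Aim ==\n== Method ==\n== Notes ==\n"
score = heading_similarity(failed, worked)  # two shared headings out of four
```

A score like this could rank candidate pages before any comparison of the text itself, which is the more expensive step.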

The last demo I want to mention was from Samar Sabie of a second MediaWiki extension called VizGraph. Anyone who has used a MediaWiki or a similar framework for recording research knows the problem. Generating tables, let alone graphs, sucks big time. You have your data in a CSV or Excel file and you need to transcribe it, by hand, into a fairly incomprehensible, but more importantly badly fault intolerant, syntax to generate any sort of sensible visualisation. What you want, and what VizGraph supplies, is a simple Wizard that allows you to upload your data file (CSV or Excel naturally), steps you through a few simple questions that are familiar from the Excel chart wizards, and then drops that back into the page as structured text data that is then rendered via the GoogleChart API. Once it is there you can, if you wish, edit the structured markup to tweak the graph.
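
As an illustration of the hand transcription this kind of tool removes, here is a small Python sketch that turns CSV text into standard MediaWiki table markup. It is not VizGraph’s code and says nothing about the chart markup VizGraph generates; the function name is hypothetical and the first CSV row is assumed to be the header.

```python
import csv
import io


def csv_to_wikitable(csv_text):
    """Convert CSV text into MediaWiki table markup (first row = header)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    lines = ['{| class="wikitable"']
    lines.append("! " + " !! ".join(header))  # header cells on one line
    for row in body:
        lines.append("|-")  # row separator
        lines.append("| " + " || ".join(row))  # data cells on one line
    lines.append("|}")
    return "\n".join(lines)


markup = csv_to_wikitable("time,od\n0,0.1\n")
```

Cells that themselves contain a pipe character would need escaping, which this sketch does not attempt.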

Again, this was a great example of just solving the problem for the average user, fitting within their existing workflow and making it happen. But that wasn’t the best bit. The best bit was almost a throwaway comment as we were taken through the Wizard; “and check this box if you want to enable people to download the data directly from a link on the chart…”. I was sitting next to Jon Udell and we both spontaneously did a big thumbs up and just grinned at each other. It was a wonderful example of “just getting it”. Understanding the flow, the need to enable data to be passed from place to place, while at the same time making the user experience comfortable and seamless.

I am sceptical about the rise of a mass “Google Generation” of tech savvy and sophisticated users of web based tools and computation. But what Wednesday’s demos showed to me in no uncertain terms was that when you provide a smart group of people, who grew up with the assumption that the web functions properly, with tools and expertise to effectively manipulate and compute on the web then amazing things happen. That these students make assumptions of how things should work, and most importantly that they should, that editing and sharing should be enabled by default, and that user experience needs to be good as a basic assumption was brought home by a conversation we had later in the day at the Science 2.0 symposium.

The question was “what does Science 2.0 mean anyway?”. A question that is usually answered by reference to Web 2.0 and collaborative web based tools. Steve Easterbrook’s opening gambit in response was “well you know what Web 2.0 is don’t you?” and this was met with slightly glazed stares. We realized that, at least to a certain extent, for these students there is no Web 2.0. It’s just the way that the web, and indeed the rest of the world, works. Give people with these assumptions the tools to make things and amazing stuff happens. Arguably, as Jon Udell suggested later in the day, we are failing a generation by not building this into a general education. On the other hand I think it is pretty clear that these students at least are going to have a big advantage in making their way in the world of the future.

Apparently screencasts for the demoed tools will be available over the next few weeks and I will try and post links here as they come up. Many thanks to Greg Wilson for inviting me to Toronto and giving me the opportunity to be at this session and the others this week.

Sci - Bar - Foo etc. Part III - Google Wave Session at SciFoo

Google Wave has got an awful lot of people quite excited. And others are more sceptical. A lot of SciFoo attendees were therefore very excited to be able to get an account on the developer sandbox as part of the weekend. At the opening plenary Stephanie Hannon gave a demo of Wave and, although there were numerous things that didn’t work live, that was enough to get more people interested. On the Saturday morning I organized a session to discuss what we might do and also to provide an opportunity for people to talk about technical issues. Two members of the Wave team came along and kindly offered their expertise, receiving a somewhat intense grilling as thanks for their efforts.

I think it is now reasonably clear that there are two short to medium term applications for Wave in the research process. The first is the collaborative authoring of documents and the conversations around those. The second is the use of Wave as a recording and analysis platform. Both types of functionality were discussed, with many ideas for each. Martin Fenner has also written up some initial impressions.

Naturally we recorded the session in Wave and even as I type, over a week later, there is a conversation going in real time about the details of taking things forward. There are many things to get used to, not least when it is polite to delete other people’s comments and clean them up, but the potential (and the weaknesses and areas for development) are becoming clear.

I’ve pasted our functionality brainstorm at the bottom to give people an idea of what we talked about but the discussion was very wide ranging. Functionality divided into a few categories. Firstly Robots for bringing scientific objects, chemical structures, DNA sequences, biomolecular structures, videos, and images into the wave in a functional form with links back to a canonical URI for the object. In its simplest form this might just provide a link back to a database. So typing “chem:benzene” or “pdb:1ecr” would trigger a robot to insert a link back to the database entry. More complex robots could insert an image of the chemical (or protein structure) or perhaps rdf or microformats that provide a more detailed description of the molecule.
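As a rough sketch of the simplest behaviour described above, the text-matching step might look something like the following. This is not the Wave Robots API, only the token recognition a robot would need to do. The URL templates are hypothetical placeholders rather than real database endpoints, and the function name is made up for illustration.

```python
import re

# Hypothetical URL templates: placeholders, not real database endpoints.
LINK_TEMPLATES = {
    "chem": "https://chem.example.org/compound/{id}",
    "pdb": "https://pdb.example.org/structure/{id}",
}

# A token is a known prefix, a colon, then an identifier, e.g. "pdb:1ecr".
TOKEN = re.compile(r"\b(chem|pdb):([A-Za-z0-9_-]+)")


def link_tokens(text):
    """Follow each recognised token with a link to its (assumed) canonical URI."""
    def expand(match):
        prefix, identifier = match.group(1), match.group(2)
        uri = LINK_TEMPLATES[prefix].format(id=identifier)
        return f"{match.group(0)} <{uri}>"
    return TOKEN.sub(expand, text)


print(link_tokens("We soaked pdb:1ecr crystals in chem:benzene."))
```

A more complex robot would swap the `expand` step for something that fetches an image or a structured description, but the trigger would presumably stay the same.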

Taking this one step further we also explored the idea of pulling data or status information from laboratory instruments to create a “laboratory dashboard” and perhaps controlling them. This discussion was helpful in getting a feel for what Wave can and can’t do as well as how different functionalities are best implemented. A robot can be built to populate a wave with information or data from laboratory instruments and such a robot could also pass information from the wave back to the instrument in principle. However both of these will still require some form of client running on the instrument side that is capable of talking to the robot web service. So the actual problem of interfacing with the instrument will remain. We can hope that instrument manufacturers might think of writing out nice simple XML log files at some point but in the meantime this is likely to involve hacking things together. If you can manage this then a Gadget will provide a nice way of providing a visual dashboard type interface to keep you updated as to what is happening.

Sharing data analysis is something of significant interest to me and the fact that there is already a robot (called Monty) that will interpret Python is a very interesting starting point for exploring this. There is some basic graphing functionality (Graphy naturally). For me this is where some of the most exciting potential lies; not just sharing printouts or the results of data analysis procedures but the details of the data and a live representation of the process that led to the results. Expect much more from me on this in the future as we start to take it forward.

The final area of discussion, and the one we probably spent the most time on, was looking at Wave in the authoring and publishing process. Formatting of papers, sharing of live diagrams and charts, automated reference searching and formatting, as well as submission processes, both to journals and to other repositories, and even the running of the peer review process were all discussed. This is the area where the most obvious and rapid gains can be made. In a very real sense Wave was designed to remove the classic problem of sending around manuscript versions with multiple figure and data files by email so you would expect it to solve a number of the obvious problems. The interesting thing in my view will be to try it out in anger.

Which was where we finished the session. I proposed the idea of writing a paper, in Wave, about the development and application of tools needed to author papers in Wave. As well as the technical side, such a paper would discuss the user experience, and any of the social issues that arise out of such a live collaborative authoring experience. If it were possible to run an actual peer review process in Wave that would also be very cool however this might not be feasible given existing journal systems. If not we will run a “mock” peer review process and look at how that works. If you are interested in being involved, drop a note in the comments, or join the Google Group that has been set up for discussions (or if you have a developer sandbox account and want access to the Wave drop me a line).

There will be lots of details to work through but the overall feel of the session for me was very exciting and very positive. There will clearly be technical and logistical barriers to be overcome. Not least that a significant quantity of legacy tooling may not be a good fit for Wave. Some architectural thinking on how to most effectively re-use existing code may be required. But overall the problem seems to be where to start on the large set of interesting possibilities. And that seems a good place to be with any new technology.


What would you say to Elsevier?

In a week or so’s time I have been invited to speak as part of a forward planning exercise at Elsevier. To some this may seem like an opportunity to go in for an all guns blazing OA rant or perhaps to plant some incendiary device but I see it more as an opportunity to nudge, perhaps cajole, a big player in the area of scholarly publishing in the right direction. After all if we are right about the efficiency gains for authors and readers that will be created by Open Access publication and we are right about the way that web based systems utterly change the rules of scholarly communication then even an organization of the size of Elsevier has to adapt or wither away. Persuading them to move in the right direction because it is in their own interests would be an effective way of speeding up the process of positive change.

My plan is to focus less on the arguments for making more research output Open Access and more on what happens as a greater proportion of those outputs become freely available, something that I see as increasingly inevitable. Where that proportion may finally be is anyone’s guess but it is going to be a much bigger proportion than it is now. What will authors and funders want and need from their publication infrastructure and what are the business opportunities that arise from those? For me these fall into four main themes:

  • Tracking via aggregation. Funders and institutions want more and more to track the outputs of their research investment. Providing tools and functionality that will enable them to automatically aggregate and slice and dice these outputs is a big business opportunity. The data themselves will be free but providing it in the form that people need it rapidly and effectively will add value that they will be prepared to pay for.
  • Speed to publish as a market differentiator. Authors will want their content out and available and being acted on fast. Speed to publication is potentially the biggest remaining area for competition between journals. This is important because there will almost certainly be fewer journals with greater “quality” or “brand” differentiation. There is a plausible future in which there are only two journals, Nature and PLoS ONE.
  • Data publication, serving, and archival. There may be fewer journals but there will be much greater diversity of materials being published through a larger number of mechanisms. There are massive opportunities in providing high quality infrastructure and services to funders and institutions to aggregate, publish, and archive the full set of research outputs. I intend to draw heavily on Dorothea Salo’s wonderful slideset on data publication for this part.
  • Social search. Literature searching is the main area where there are plausible efficiency gains to be made in the current scholarly publications cycle. According to the Research Information Network’s model of costs, search accounts for a very significant proportion of the non-research costs of publishing. Building the personal networks (Bill Hooker’s Distributed Wetware Online Information Filter [down in the comments], or DWOIF) that make this feasible may well be the new research skill of the 21st century. Tools that make this work effectively are going to be very popular. What will they look like?

But what have I missed? What (constructive!) ideas and thoughts would you want to place in the minds of the people thinking about where to take one of the world’s largest scholarly publication companies and its online information and collaboration infrastructure?

Full disclosure: Part of the reason for writing this post is to disclose publicly that I am doing this gig. Elsevier are covering my travel and accommodation costs but are not paying any fee.

Now that’s what I call social networking…

So there’s been a lot of antagonistic and cynical commentary about Web2.0 tools particularly focused on Twitter, but also encompassing Friendfeed and the whole range of tools that are of interest to me. Some of this is ill informed and some of it more thoughtful but the overall tenor of the comments is that “this is all about chattering up the back, not paying attention, and making a disruption” or at the very least that it is all trivial nonsense.

The counter argument for those of us who believe in these tools is that they offer a way of connecting with people, a means for the rapid and efficient organization of information, but above all, a way of connecting problems to the resources that can let us make things happen. The trouble has been that the best examples that we could point to were flashmobs, small scale conversations and collaborations, strangers meeting in a bar, the odd new connection made. But overall these are small things; indeed in most cases trivial things. Nothing that registers on the scale of “stuff that matters” to the powers that be.

That was two weeks ago. In the last couple of weeks I have seen a number of remarkable things happen and I wanted to talk about one of them here because I think it is instructive.

On Friday last week there was a meeting held in London to present and discuss the draft Digital Britain Report. This report, commissioned by the government is intended to map out the needs of the UK in terms of digital infrastructure, both physical, legal, and perhaps even social. The current tenor of the draft report is what you might expect, heavy on the need of putting broadband everywhere, to get content to people, and heavy on the need to protect big media from the rising tide of piracy. Actually it’s not all that bad but many of the digerati felt that it is missing important points about what happens when consumers are also content producers and what that means for rights management as the asymmetry of production and consumption is broken but the asymmetry of power is not. Anyway, that’s not what’s important here.

What is important is that the sessions were webcast, a number of people were twittering from the physical audience, and a much larger number were watching and twittering from outside, aggregated around a hashtag #digitalbritain. There was reportage going on in real time from within the room and a wideranging conversation going on beyond the walls of the room. In this day and age nothing particularly remarkable there. It is still relatively unusual for the online audience to be bigger than the physical one for these kind of events but certainly not unheard of.

Nor was it remarkable when Kathryn Corrick tweeted the suggestion that an unconference should be organized to respond to the forum (actually it was Bill Thomson who was first with the suggestion but I didn’t catch that one). People say “why don’t we do something?” all the time; usually in a bar. No, what was remarkable was what followed this as a group of relative strangers aggregated around an idea, developed and refined it, and then made it happen. One week later, on Friday evening, a website went live, with two scheduled events [1, 2], and at least two more to follow. There is an agreement with the people handling the Digital Britain report on the form an aggregated response should take. And there is the beginning of a plan as to how to aggregate the results of several meetings into that form. They want the response by 13 May.

Let’s rewind that. In a matter of hours a group of relative strangers, who met each other through something as intangible as a shared word, agreed on, and started to implement a nationwide plan to gather the views of maybe a few hundred, perhaps a few thousand people, with the aim, and the expectation of influencing government policy. Within a week there was a scalable framework for organizing the process of gathering the response (anyone can organize one of the meetings) and a process for pulling together a final report.

What made this possible? Essentially the range of low barrier communication, information, and aggregation tools that Web2.0 brings us.

  1. Twitter: without twitter the conversation could never have happened. Friendfeed never got a look in because that wasn’t where this specific community was. But much more than just twitter, the critical aspect was;
  2. The hashtag #digitalbritain: the hashtag became the central point of a conversation between people who didn’t know each other, weren’t following each other, and without that link would never have got in contact. As the conversation moved to discussing the idea of an unconference the hashtags morphed first to #digitalbritain #unconference (an intersection of ideas) and then to #dbuc09. In a sense it became serious when the hashtag was coined. The barrier to a group of sufficiently motivated people to identify each other was low.
  3. Online calendars: it was possible for me to identify specific dates when we might hold a meeting at my workplace in minutes because we have all of our rooms on an online calendar system. Had it been more complex I might not have bothered. As it was it was easy to identify possible dates. The barrier to organization was low.
  4. Free and easy online services: A Yahoo Group was set up very early and used as a mailing list. Wordpress.com provides a simple way of throwing up a website and giving specified people access to put up material. Eventbrite provides an easy method to manage numbers for the specific events. Sure someone could have set these up for us on a private site but the almost zero barrier of these services makes it easy for anyone to do this.
  5. Energy and community: these services lead to low barriers, not zero barrier. There still has to be the motivation to carry it through. In this case Kathryn provided the majority of the energy and others chipped in along the way. Higher barriers could have put a stop to the whole thing, or perhaps stopped it going national, but there needs to be some motivation to get over the barriers that do remain. What was key was that a small group of people had sufficient energy to carry these through.
  6. Flexible working hours: none of this would be possible if the people who would be interested in attending such meetings couldn’t come on short notice. The ability of people to either arrange their own working schedule or to have the flexibility to take time out of work is crucial, otherwise no-one could come. Henry Gee had a marvelous riff on the economic benefits of flexible working just before the budget. The feasibility of our meetings is an example of the potential efficiency benefits that such flexibility could bring.

The common theme here is online services making it easy to aggregate the right people and the right information quickly, to re-publish that information in a useful form. We will use similar services, blogs, wikis, online documents to gather back the outputs from these meetings to push back into the policy making process. Will it make a big difference? Maybe not, but even in showing that this kind of response, this kind of community consultation can be done effectively in a matter of days and weeks, I think we’re showing what a Digital Britain ought to be about.

What does this mean for science or research? I will come back to more research related examples over the next few weeks but one key point was that this happened because there was a pretty large audience watching the webcast and communicating around it. As I and others have recently argued in research the community sizes probably aren’t big enough in most cases for these sort of network effects to kick in effectively. Building up community quantity and quality will be the main challenge of the next 6 - 12 months but where the community exists and where the time is available we are starting to see rapid, agile, and bursty efforts in projects and particularly in preparing documents.

There is clearly a big challenge in taking this into the lab but there is a good reason why when I talk to my senior management about the resources I need that the keywords are “capacity” and “responsiveness”. Bursty work requires the capacity to be in place to resource it. In a lab this is difficult, but it is not impossible. It will probably require a reconfiguring of resource distribution to realize its potential. But if that potential can be demonstrated then the resources will almost certainly follow.

Capturing the record of research process - Part II

So in the last post I got all abstract about what the record of process might require and what it might look like. In this post I want to describe a concrete implementation that could be built with existing tools. What I want to think about is the architecture that is required to capture all of this information and what it might look like.

The example I am going to use is very simple. We will take some data and do a curve fit to it. We start with a data file, which we assume we can reference with a URI, and load it up into our package. That’s all, keep it simple. What I hope to start working on in Part III is to build a toy package that would do that and maybe fit some data to a model. I am going to assume that we are using some sort of package that utilizes a command line because that is the most natural way of thinking about generating a log file, but there is no reason why a similar framework can’t be applied to something using a GUI.

Our first job is to get our data. This data will naturally be available via a URI, properly annotated and described. In loading the data we will declare it to be of a specific type, in this case something that can be represented as two columns of numbers. So we have created an internal object that contains our data. Assuming we are running some sort of automatic logging program on our command line our logfile will now look something like:
> start data_analysis
...loading data_analysis
...Version 0.1
...Package at: http://mycodeversioningsystem.org/myuserid/data_analysis/0.1
...Date is: 01/01/01
...Local environment is: Mac OS10.5
...Machine: localhost
...Directory: /usr/myuserid/Documents/Data/some_experiment
> data = load_data(URI)
...connecting to URI
...found data
...created two column data object "data"
...pushed "data" to http://myrepository.org/myuserid/new_object_id
..."data" aliased to http://myrepository.org/myuserid/new_object_id

Those last two lines are important because we want all of our intermediates to be accessible via a URI on the open web. The load_data routine will include the pushing of the newly created object in some useable form to an external repository. Existing services that could provide this functionality include a blog or wiki with an API, a code repository like GitHub, GoogleCode, or SourceForge, an institutional or disciplinary repository, or a service like MyExperiment.org. The key thing is that the repository must then expose the data set in a form which can be readily extracted by the data analysis tool being used. The tool then uses that publicly exposed form (or an internal representation of the same object for offline work).
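To make the load-log-push idea a little more concrete, here is a minimal Python sketch of what a load_data routine of this kind might do. Everything beyond the name load_data and the log wording is an assumption for illustration: the "repository" is just an in-memory dictionary, the fetch function is a stand-in for a real HTTP request, and the three data points are made up so the sketch runs on its own.

```python
import uuid

# Hypothetical stand-ins: an in-memory "repository" and log, so the sketch
# runs without any network access or real repository API.
REPOSITORY = {}
LOG = []
REPOSITORY_BASE = "http://myrepository.org/myuserid/"


def log(line):
    """Append one line to the running log file (here just a list)."""
    LOG.append(line)


def parse_two_column(text):
    """Parse whitespace-separated text into a list of (x, y) float pairs."""
    rows = []
    for line in text.strip().splitlines():
        x, y = line.split()
        rows.append((float(x), float(y)))
    return rows


def load_data(uri, fetch, name="data"):
    """Load two-column data from `uri`, push a copy, and log each step.

    `fetch` is any callable returning the raw text behind a URI; it is
    passed in so the sketch does not depend on a particular service.
    """
    log(f"> {name} = load_data({uri})")
    log("...connecting to URI")
    rows = parse_two_column(fetch(uri))
    log("...found data")
    log(f'...created two column data object "{name}"')
    new_uri = REPOSITORY_BASE + uuid.uuid4().hex  # fresh URI on every run
    REPOSITORY[new_uri] = rows
    log(f'...pushed "{name}" to {new_uri}')
    log(f'..."{name}" aliased to {new_uri}')
    return rows, new_uri


def fake_fetch(uri):
    """Stand-in for an HTTP GET; returns three made-up data points."""
    return "0.0 1.0\n1.0 2.5\n2.0 4.1"


data, data_uri = load_data("http://example.org/raw_data", fake_fetch)
print("\n".join(LOG))
```

In a real tool the dictionary would be replaced by a call to whichever repository API is being used, and the log lines would be written to a file rather than held in a list.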

At the same time a script file is being created that if run within the correct version of data_analysis should generate the same results.
# Script Record
# Package: http://mycodeversioningsystem.org/myuserid/data_analysis/0.1
# User: myuserid
# Date: 01/01/01
# System: Mac OS 10.5
data = load_data(URI)

The script might well include some system scripting that would attempt to check whether the correct environment (e.g. Python) for the tool is available and to download and start up the tool itself if the script is directly executed from a GUI or command line environment. The script does not care what the new URI created for the data object was because when it is re-run it will create a new one. The Script should run independently of any previous execution of the same workflow.

Finally there is the graph. What we have done so far is to take one data object and convert it to a new object which is a version of the original. That is then placed online to generate an accessible URI. We want our graph to assert that http://myrepository.org/myuserid/new_object_id is a version of URI (excuse my probably malformed RDF).

<data_analysis:data_object
    rdf:about= "http://myrepository.org/myuserid/new_object_id">
  <data_analysis:data_type>two_column_data</data_analysis:data_type>
  <data_analysis:generated>
    <data_analysis:generated_from rdf:resource="URI"/>
    <data_analysis:generated_by_command>load_data</data_analysis:generated_by_command>
    <data_analysis:generated_by_version rdf:resource="http://mycodeversioningsystem.org/myuserid/data_analysis/0.1"/>
    <data_analysis:generated_in_system>Mac OS 10.5</data_analysis:generated_in_system>
    <data_analysis:generated_by rdf:resource="http://myuserid.name"/>
    <data_analysis:generated_on_date dc:date="01/01/01"/>
  </data_analysis:generated>
</data_analysis:data_object>
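One way to avoid hand-written markup slips is to let a library build the XML. The sketch below uses Python's standard xml.etree.ElementTree to produce a fragment with the same shape as the one above. The data_analysis namespace URI and the example.org source URI are hypothetical placeholders, since neither is defined here. Note that this only guarantees well-formed XML that mirrors the structure above; it says nothing about whether the triples are modelled sensibly as RDF.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
# Hypothetical vocabulary namespace; no real one is defined for this example.
DA = "http://example.org/data_analysis#"

ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)
ET.register_namespace("data_analysis", DA)


def describe_step(new_uri, source_uri, command, package_uri, system,
                  user_uri, date):
    """Build an rdf:RDF element recording how one object was generated."""
    root = ET.Element(f"{{{RDF}}}RDF")
    obj = ET.SubElement(root, f"{{{DA}}}data_object",
                        {f"{{{RDF}}}about": new_uri})
    ET.SubElement(obj, f"{{{DA}}}data_type").text = "two_column_data"
    gen = ET.SubElement(obj, f"{{{DA}}}generated")
    ET.SubElement(gen, f"{{{DA}}}generated_from",
                  {f"{{{RDF}}}resource": source_uri})
    ET.SubElement(gen, f"{{{DA}}}generated_by_command").text = command
    ET.SubElement(gen, f"{{{DA}}}generated_by_version",
                  {f"{{{RDF}}}resource": package_uri})
    ET.SubElement(gen, f"{{{DA}}}generated_in_system").text = system
    ET.SubElement(gen, f"{{{DA}}}generated_by",
                  {f"{{{RDF}}}resource": user_uri})
    ET.SubElement(gen, f"{{{DA}}}generated_on_date", {f"{{{DC}}}date": date})
    return root


graph = describe_step(
    new_uri="http://myrepository.org/myuserid/new_object_id",
    source_uri="http://example.org/raw_data",  # placeholder for "URI"
    command="load_data",
    package_uri="http://mycodeversioningsystem.org/myuserid/data_analysis/0.1",
    system="Mac OS 10.5",
    user_uri="http://myuserid.name",
    date="01/01/01",
)
xml_text = ET.tostring(graph, encoding="unicode")
print(xml_text)
```

Because the elements are built as a tree, mismatched closing tags and unterminated attribute values cannot occur, which removes one class of error from the automatically generated graph.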

Now this is obviously a toy example. It is relatively trivial to set up the data analysis package so as to write out these three different types of descriptive files. Each time a step is taken, that step is then described and appended to each of the three descriptions. Things will get more complicated if a process requires multiple inputs or generates multiple outputs but this is only really a question of setting up a vocabulary that makes reasonable sense. In principle multiple steps can be collapsed by combining a script file and the rdf as follows:

<data_analysis:generated_by_command
    rdf:resource="http://myrepository/myuserid/location_of_script"/>

I don’t know anything much about theoretical computer science but it seems to me that any data analysis package that works through a stepwise process running previously defined commands could be described in this way. And given that this is how computer programs run, this suggests that any data analysis process can be logged this way. It obviously has to be implemented to write out the files but in many cases this may not even be too hard. Building it in at the beginning is obviously better. The hard part is building vocabularies that make sense locally and are specific enough but are appropriately wired into wider and more general vocabularies. It is obvious that the reference to data_analysis:data_type = “two_column_data” above should probably point to some external vocabulary that describes generic data formats and their representations (in this case probably a Python pickled two column array). It is less obvious where that should be, or whether something appropriate already exists.

This then provides a clear set of descriptive files that can be used to characterise a data analysis process. The log file provides a record of exactly what happened, that is reasonably human readable, and can be hacked using regular expressions if desired. There is no reason in principle why this couldn’t be in the form of an XML file with a style sheet appropriate for human readability. The script file provides the concept of what happened as well as the instructions for repeating the process. It could usefully be compared to a plan which would look very similar but might have informative differences. The graph is a record of the relationships between the objects that were generated. It is machine readable and can additionally be used to automate the reproduction of the process, but it is a record of what happened.
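As an illustration of hacking the log with regular expressions, the following sketch pulls the commands back out of a log in the format shown earlier and reassembles them into a script record. The function name and the decision to drop the "start" line are my own assumptions, not part of any existing tool.

```python
import re

# A shortened copy of the example log, in the same format as above.
LOG_TEXT = """\
> start data_analysis
...loading data_analysis
...Version 0.1
...Package at: http://mycodeversioningsystem.org/myuserid/data_analysis/0.1
> data = load_data(URI)
...connecting to URI
...found data
"""

# Commands are the lines starting with "> "; everything else is output.
COMMAND = re.compile(r"^> (.+)$", re.MULTILINE)
PACKAGE = re.compile(r"^\.\.\.Package at: (\S+)$", re.MULTILINE)


def script_from_log(log_text):
    """Rebuild a replayable script record from a log file's text."""
    commands = COMMAND.findall(log_text)
    package = PACKAGE.search(log_text)
    header = ["# Script Record"]
    if package:
        header.append(f"# Package: {package.group(1)}")
    # The "start ..." line launches the package rather than acting on data,
    # so it is left out of the replayable script (an assumption).
    body = [c for c in commands if not c.startswith("start ")]
    return "\n".join(header + body)


print(script_from_log(LOG_TEXT))
```

Going the other way is not possible: the script holds only the commands, so details such as the URI assigned to each pushed object cannot be recovered from it.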

The graph is immensely powerful because it can be ultimately used to parse across multiple sets of data generated by different versions of the package and even completely different packages used by different people (provided the vocabularies have some common reference). It enables the comparison of analyses carried out in parallel by different people.

But what is most powerful about the idea of an rdf based graph file of the process is that it can be automated and completely hidden from the user. The file may be presented to the user in some pleasant and readable form but they need never know they are generating rdf. The process of wiring the dataweb up, and the following process of wiring up the web of things in experimental science, will rest on having the connections captured from and not created by the user. This approach seems to provide a way towards making that happen.

What does this tell us about what a data analysis tool should look like? Well ideally it will be open source, but at a minimum there must be a set of versions that can be referenced. Ideally these versions would be available on an appropriate code repository configured to enable an automatic download. They must provide, at a minimum a log file, and preferably both script and graph versions of this log (in principle the script can be derived from either of the other two which can be derived from each other, the log and graph can’t be derived from the script). The local vocabulary must be available online and should preferably be well wired into the wider data web. The majority of this should be trivial to implement for most command line driven tools and not terribly difficult for GUI driven tools. The more complicated aspects lie in the pushing out of intermediate objects and the finalized logs onto appropriate online repositories.

A range of currently available services could play these roles, from code repositories such as Sourceforge and Github, through to the internet archive, and data and process repositories such as MyExperiment.org and Talis Connected Commons, or to locally provided repositories. Many of these have sensible APIs and/or REST interfaces that should make this relatively easy. For new analysis tools this shouldn’t be terribly difficult to implement. Implementing it in existing tools could be more challenging but not impossible. It’s a question of will rather than severe technical barriers as far as I can see. I am going to start trying to implement some of this in a toy data fitting package in Python, which will be hosted at Github, as soon as I get some specific feedback on just how bad that RDF is…

Best practice for data availability – the debate starts…well over there really

The issue of licensing arrangements and best practice for making data available has been brewing for some time but has just recently come to a head. John Wilbanks and Science Commons have a reasonably well established line that they have been developing for some time. Michael Nielsen has a recent blog post and Rufus Pollock, of the Open Knowledge Foundation, has also just synthesised his thoughts in response into a blog essay. I highly recommend reading John’s article on licensing at Nature Precedings, Michael’s blog post, and Rufus’ essay before proceeding. Another important document is the discussion of the license that Victoria Stodden is working to develop. Actually if you’ve read them go and read them again anyway – it will refresh the argument.

To crudely summarize, Rufus makes a cogent argument for the use of explicit licenses applied to collections of data, and feels that share-alike provisions in licenses or otherwise do not cause major problems and that the benefit that arises from enforcing re-use outweighs the problem. John’s position is that it is far better for standards to be applied through social pressure (“community norms”) rather than licensing arrangements. He also believes that share-alike provisions are bad because they break interoperability between different types of objects and domains. One point that I think is very important and (I think) is a point of agreement is that some form of license or at least a dedication to the public domain will be crucial to developing best practice. Even if the final outcome of debate is that everything will go in the public domain it should be part of best practice to make that explicit.

Broadly speaking I belong to John’s camp but I don’t want to argue that case with this post. What is important in my view is that the debate takes place and that we are clear about what the aims of that debate are. What is it we are trying to achieve in the process of coming to (hopefully) some consensus of what best practice should look like?

It is important to remember that anyone can assert a license (or lack thereof) on any object that they (assert they) own or have rights over. We will never be able to impose a specific protocol on all researchers, all funders. Therefore what we are looking for is not the perfect arrangement but a balance between what is desired, what can be practically achieved, and what is politically feasible. We do need a coherent consensus view that can be presented to research communities and research funders. That is why the debate is important. We also need something that works, and is extensible into the future, where it will stand up to the development of new types of research, new types of data, new ways of making that data available, and perhaps new types of researchers altogether.

I think we agree that the minimal aim is to enable, encourage, and protect into the future the ability to re-use and re-purpose the publicly published products of publicly funded research. Arguments about personal or commercial work are much harder and much more subtle. Restricting the argument to publicly funded researchers makes it possible to open a discussion with a defined number of funders who have a public service and public engagement agenda. It also makes the moral arguments much clearer.

In focussing on research that is being made public we short circuit the contentious issue of timing. The right, or the responsibility, to commercially exploit research outputs and the limitations this can place on data availability is a complex and difficult area and one in which agreement is unlikely any time soon. I would also avoid the word “Open”. This is becoming a badly overloaded term with both political and emotional overtones, positive and negative. Focussing on what should happen after the decision has been made to go public reduces the argument to “what is best practice for making research outputs available”. The question of when to make them available can then be kept separate. The key question for the current debate is not when but how.

So what I believe the debate should be about is the establishment, if possible, of a consensus protocol or standard or license for enabling and ensuring the availability of the research outputs associated with publicly published, publicly funded research. Alongside this is the question of establishing mechanisms for researchers to implement, and be supported in observing, these standards, as well as for “enforcement”. These might be trademarks, community standards, or legal or contractual approaches as well as systems and software to make all of this work, including trackbacks, citation aggregators, and effective data repositories. In addition we need to consider the public relations issue of selling such standards to disparate research funders and research communities.

Perhaps a good starting point would be to pinpoint the issues where there is general agreement and map around those. If we agree some central principles then we can take an empirical approach to the mechanisms. We’re scientists after all aren’t we?

Third party data repositories - can we/should we trust them?

This is a case of a comment that got so long (and so late) that it probably merited its own post. David Crotty and Paul (Ling-Fung Tang) note some important caveats in comments on my last post about the idea of the “web native” lab notebook. I probably went a bit strong in that post with the idea of pushing content onto outside specialist services in my effort to try to explain the logic of the lab notebook as a feed. David notes an important point about any third party service (do read the whole comment at the post):

Wouldn’t such an approach either:
1) require a lab to make a heavy investment in online infrastructure and support personnel, or
2) rely very heavily on outside service providers for access and retention of one’s own data? […]

Any system that is going to see mass acceptance is going to have to give the user a great deal of control, and also provide complete and redundant levels of back-up of all content. If you’ve got data scattered all over a variety of services, and one goes down or out of business, does that mean having to revise all of those other services when/if the files are recovered?

This is a very wide problem that I’ve also seen in the context of the UK web community that supports higher education (see for example Brian Kelly’s risk assessment for use of third party web services). Is it smart, or even safe, to use third party services? The general question divides into two sections: is the service more or less reliable than your own hard drive or locally provided server capacity (technical reliability, or uptime); and what is the long term reliability of the service remaining viable (business/social model reliability). Flickr probably has higher availability than your local institutional IT services but there is no guarantee that it will still be there tomorrow. This is why data portability is very important. If you can’t get your data out, don’t put it in there in the first place.

In the context of my previous post these data services could be local, they could be provided by the local institution, or by a local funder, or they could even be a hard disk in the lab. People are free to make those choices and to find the best balance of reliability, cost, and maintenance that suits them. My suspicion is that after a degree of consolidation we will start to see institutions offering local data repositories as well as specialised services on the cloud that can provide more specialised and exciting functionality. Ideally these could all talk to each other so that multiple copies are held in these various services.

David says:

I would worry about putting something as valuable as my own data into the “cloud” […]

I’d rather rely on an internally controlled system and not have to worry about the business model of Flickr or whether Google was going to pull the plug on a tool I regularly use. Perhaps the level to think on is that of a university, or company–could you set up a system for all labs within an institution that’s controlled (and heavily backed up) by that institution? Preferably something standardized to allow interaction between institutions.

Then again, given the experiences I’ve had with university IT departments, this might not be such a good approach after all.

Which I think encapsulates a lot of the debate. I actually have greater faith in Flickr keeping my pictures safe than my own hard disk. And more faith in both than institutional repository systems that don’t currently provide good data functionality and that I don’t understand. But I wouldn’t trust any of them in isolation. The best situation is to have everything everywhere, using interchange standards to keep copies in different places; specialised services out on the cloud to provide functionality (not every institution will want to provide a visualisation service for XAFS data), IRs providing backup archival and server space for anything that doesn’t fit elsewhere, and ultimately still probably local hard disks for a lot of the short to medium term storage. My view is that the institution has the responsibility of aggregating, making available, and archiving the work of its staff, but I personally see this role as more harvester than service provider.

All of which will turn on the question of business models. If the data stores are local, what is the business model for archival? If they are institutional, how much faith do you have that the institution won’t close them down? And if they are commercial or non-profit third parties, or even directly government funded services, do the economics make sense in the long term? We need a shift in science funding if we want to archive and manage data in the longer term. As with any market, some services will rise and some will die. The money has to come from somewhere and ultimately that will always be the research funders. Until there is a stronger call from them for data preservation and the resources to back it up I don’t think we will see much interesting development. Some funders are pushing fairly hard in this direction so it will be interesting to see what develops. A lot will turn on who has the responsibility for ensuring data availability and sharing. The researcher? The institution? The funder?

In the end you get what you pay for. Always worth remembering that sometimes even things that are free at point of use aren’t worth the price you pay for them.

Connecting the dots - the well posed question and code as a liability

Just a brief thought prompted by two, partly related, things streaming past my nose. Firstly Michael Nielsen discussed the views of Aristotle and Sunstein on collective intelligence. The thing that caught my attention was the idea that deliberation can make group functioning worse, leading to a collective decision that is muddled rather than actually identifying the best answer presented by members of the community. The exception to this is well posed questions, where deliberation can help. In science we are familiar with the idea that getting the question right (correct design of experiment, well organised theory) can be more important than the answer.

The second item was a blog post entitled “Data is good, code is a liability” from Greg Linden that was shared by Deepak Singh. Greg discussed a talk given by Peter Norvig which focusses on the idea that it is better to get a good sized dataset and use very sparing code to get at an answer rather than attempt to get at the answer de novo via complex code. Quoting from the post:

In one of several examples, Peter put up a slide showing an excerpt for a rule-based spelling corrector. The snippet of code, that was just part of a much larger program, contained a nearly impossible to understand let alone verify set of case and if statements that represented rules for spelling correction in English. He then put up a slide containing a few line Python program for a statistical spelling correction program that, given a large data file of documents, learns the likelihood of seeing words and corrects misspellings to their most likely alternative. This version, he said, not only has the benefit of being simple, but also easily can be used in different languages.
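To give a feel for how short such a program can be, here is a minimal sketch in the spirit of the statistical corrector described in the quote. It is not the code from the talk. The one-line corpus is a stand-in I have made up purely so the example runs (a real version would read a large text file), and the function names are my own. It only considers candidates a single edit away from the input.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def train(text):
    """Count how often each lowercase word appears in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def edits1(word):
    """All strings one delete, swap, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + swaps + replaces + inserts)


def correct(word, counts):
    """Return the most frequent known word within one edit, else the word itself."""
    word = word.lower()
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word


# Made-up stand-in corpus (an assumption); the data, not the code, carries the knowledge.
CORPUS = "the data is good and the code is a liability the data provides the answer"
COUNTS = train(CORPUS)

if __name__ == "__main__":
    for typo in ("teh", "dat", "codee"):
        print(typo, "->", correct(typo, COUNTS))
```

Nothing in the code knows anything about English. Swap in a corpus in another language (and a suitable alphabet) and the same few lines apply, which is the point being made in the quote.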

What struck me was the connection between being able to write a short, readable snippet of code, and the “well posed question”. The dataset provides the collective intelligence. So is it possible to propose the following?

“A well posed question is one which, given an appropriate dataset, can be answered by easily prepared and comprehensible code”

This could also possibly be turned on its head as “a good programming environment is one in which well posed questions can be readily converted to programs”. But it also raises an important point about how the structure of datasets relates to the questions you want to ask. The challenge in recording data is to structure it in such a way that the widest possible set of questions can be asked of that data. Data models all pre-suppose the kind of questions that will be asked. And any sufficiently general data model will be inefficient for most specific types of query.

Rajarshi Guha and Pierre Lindenbaum have been busy preparing different datastores for the solubility data being generated as part of the Open Notebook Science Challenge announced by Jean-Claude Bradley (more on this later). Rajarshi’s form based input has an SQL backend while Pierre has been working to extract the information as RDF. The point is not that one approach is better than the other, but that we need both, and possibly many more formats - and ideally we need to interconvert between them on the fly. A well posed question can easily founder on an inappropriately structured dataset (this is actually just a rephrasing of the Saunders Principle). It will be by enabling easy conversion between different formats that we might approach a situation where the aphorism I have suggested could become true.
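As a rough illustration of what converting between formats on the fly might involve, the sketch below reads rows from a small SQL table and writes each one out as RDF triples in N-Triples syntax. Everything specific in it is an assumption made for the example: the table and column names, the example.org namespace, and the placeholder rows. None of it reflects the actual schema or data that Rajarshi and Pierre are working with.

```python
import sqlite3

# Hypothetical namespace for the example; not a real vocabulary.
BASE = "http://example.org/solubility/"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def make_store():
    """Build an in-memory SQL store with a made-up schema and placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE measurement ("
        "id INTEGER PRIMARY KEY, solute TEXT, solvent TEXT, conc_molar REAL)"
    )
    conn.executemany(
        "INSERT INTO measurement (solute, solvent, conc_molar) VALUES (?, ?, ?)",
        [("compound-A", "solvent-X", 1.5), ("compound-B", "solvent-Y", 0.25)],
    )
    return conn


def literal(value):
    """Format a Python value as an N-Triples literal."""
    if isinstance(value, float):
        return f'"{value}"^^<{XSD_DOUBLE}>'
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def rows_to_ntriples(conn):
    """Emit one triple per non-key column of every row in the table."""
    cursor = conn.execute(
        "SELECT id, solute, solvent, conc_molar FROM measurement ORDER BY id"
    )
    columns = [description[0] for description in cursor.description]
    lines = []
    for row in cursor:
        subject = f"<{BASE}measurement/{row[0]}>"
        for name, value in zip(columns[1:], row[1:]):
            lines.append(f"{subject} <{BASE}{name}> {literal(value)} .")
    return lines


if __name__ == "__main__":
    print("\n".join(rows_to_ntriples(make_store())))
```

The conversion itself is a handful of lines. The hard part in practice is agreeing what the column names and identifiers should map to, and that is a question about the data model rather than about the code.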