An openwetware blog on the challenges of open and connected science

e-notebook

Google Wave in Research - Part II - The Lab Record

In the previous post  I discussed a workflow using Wave to author and publish a paper. In this post I want to look at the possibility of using it as a laboratory record, or more specifically as a human interface to the laboratory record. There has been much work in recent years on research portals and Virtual Research Environments. While this work will remain useful in defining use patterns and interface design my feeling is that Wave will become the environment of choice, a framework for a virtual research environment that rewrites the rules, not so much of what is possible, but of what is easy.

Again I will work through a use case but I want to skip over a lot of what is by now I think becoming fairly obvious. Wave provides an excellent collaborative authoring environment. We can explicitly state and register licenses using a robot. The authoring environment has all the functionality of a wiki already built in so we can assume that and granular access control means that different elements of a record can be made accessible to different groups of people. We can easily generate feeds from a single wave and aggregate content in from other feeds. The modular nature of the Wave, made up of Wavelets, themselves made up of Blips, may well make it easier to generate comprehensible RSS feeds from a wiki-like environment. Something which has up until now proven challenging. I will also assume that, as seems likely, both spreadsheet and graphing capabilities are soon available as embedded objects within a Wave.

Let us imagine an experiment of the type that I do reasonably regularly, where we use a large facility instrument to study the structure of a protein in solution. We set up the experiment by including the instrument as a participant in the wave. This participant is a Robot which fronts a web service that can speak to the data repository for the instrument. It drops into the Wave a formatted table which provides options and spaces for inputs based on a previously defined structured description of the experiment. In this case it calls for a role for this particular run(is it a background or an experimental sample?) and asks where the description of the sample is.

The purification of the protein has already been described in another wave. As part of this process a wavelet was created that represents the specific sample we are going to use. This sample can be directly referenced via a URL that points at the wavelet itself making the sample a full member of the semantic web of objects. While the free text of the purification was being typed in another Robot, this one representing a web service interface to appropriate ontologies, automatically suggested using specific terms adding links back to the ontology where suggestions were accepted, and creating the wavelets that describe specific samples.

The wavelet that defines the sample is dragged and dropped into the table for the experiment. This copying process is captured by the internal versioning system and creates in effect an embedded link back to the purification wave, linking the sample to the process that it is being used in. It is rather too much at this stage to expect the instrument control to be driven from the Wave itself but the Robot will sit and wait for the appropriate dataset to be generated and check with the user it has got the right one.

Once everyone is happy the Robot will populate the table with additional metadata captured as part of the instrumental process, create a new wavelet (creating a new addressable object) and drop in the data in the default format. The robot naturally also writes a description of the relationships between all the objects in an appropriate machine readable form (RDFa, XML, or all of the above) in a part of the Wave that the user doesn’t necessarily need to see. It may also populate any other databases or repositories as appropriate. Because the Robot knows who the user is it can also automatically link the experimental record back to the proposal for the experiment. Valuable information for the facility but not of sufficient interest to the user for them to be bothered keeping a good record of it.

The raw data is not yet in a useful form though, we need to process it, remove backgrounds, that kind of thing. To do this we add the Reduction Robot as a participant. This Robot looks within the wave for a wavelet containing raw data, asks the user for any necessary information (where is the background data to be subtracted) and then runs a web service that does the subtraction. It then writes out two new wavelets, one describing what it has done (with all appropriate links to the appropriate controlled vocab obviously), and a second with the processed data in it.

I need to do some more analysis on this data, perhaps fit a model to start with, so again I add another Robot that looks for a wavelet with the correct data type, does the line fit, once again writes out a wavelet that describes what it has done, and a wavelet with the result in it. I might do this several times, using a range of different analysis approaches, perhaps doing some structural modelling and deriving some parameter from the structure which I can compare to my analytical model fit. Creating a wavelet with a spreadsheet embedded I drag and drop the parameter from the model fit and from the structure and format the cells so that it shows green if they are within 5% of each other.

Ok, so far so cool. Lots of drag and drop and using of groovy web services but nothing that couldn’t be done with a bit of work with a Workflow engine like Taverna and a properly set up spreadsheet. I now make a copy of the wave (properly versioned, its history is clear as a branch off the original Wave) and I delete the sample from the top of the table. The Robots re-process and realize there is not sufficient data to do any processing so all the data wavelets and any graphs and visualizations, including my colour-coded spreadsheet  go blank. What have I done here? What I have just created is a versioned, provenanced, and shareable workflow. I can pass the Wave to a new student or collaborator simply by adding them as a participant. I can then work with them, watching as they add data, point out any mistakes they might make and discuss the results with them, even if they are on the opposite side of the globe. Most importantly I can be reasonably confident that it will work for them, they have no need to download software or configure anything. All that really remains to make this truly powerful is to wrap this workflow into a new Robot so that we can pass multiple datasets to it for processing.

When we’ve finished the experiment we can aggregate the data by dragging and dropping the final results into a new wave to create a summary we can share with a different group of people. We can tweak the figure that shows the data until we are happy and then drop it into the paper I talked about in the previous post. I’ve spent a lot of time over the past 18 months thinking and talking about how we capture what is going and at the same time create granular web-native objects  and then link them together to describe relationships between them. Wave does all of that natively and it can do it just by capturing what the user does. The real power will lie in the web services behind the robots but the beauty of these is that the robots will make using those existing web services much easier for the average user. The robots will observe and annotate what the user is doing, helping them to format and link up their data and samples.

Wave brings three key things; proper collaborative documents which will encourage referring rather than cutting and pasting; proper version control for documents; and document automation through easy access to webservices. Commenting, version control and provenance, and making a cut and paste operation actually a fully functional and intelligent embed are key to building a framework for a web-native lab notebook. Wave delivers on these. The real power comes with the functionality added by Robots and Gadgets that can be relatively easily configured to do analysis. The ideas above are just what I have though of in the last week or so. Imagination really is the limit I suspect.

OMG! This changes EVERYTHING! - or - Yet Another Wave of Adulation

Yes, I’m afraid it’s yet another over the top response to yesterday’s big announcement of Google Wave, the latest paradigm shifting gob-smackingly brilliant piece of technology (or PR depending on your viewpoint) out of Google. My interest, however is pretty specific, how can we leverage it to help us capture, communicate, and publish research? And my opinion is that this is absolutely game changing - it makes a whole series of problems simply go away, and potentially provides a route to solving many of the problems that I was struggling to see how to manage.

Firstly, lets look at the grab bag of generic issues that I’ve been thinking about. Most recently I wrote about how I thought “real time” wasn’t the big deal but giving the user control back over the timeframe in which streams came into them. I had some vague ideas about how this might look but Wave has working code. When the people who you are in conversation with are online and looking at the same wave they will see modifications in real time. If they are not in the same document they will see the comments or changes later, but can also “re-play” changes. But a lot of thought has clearly gone into thinking about the default views based on when and how a person first comes into contact with a document.

Another issue that has frustrated me is the divide between wikis and blogs. Wikis have generally better editing functionality, but blogs have workable RSS feeds, Wikis have more plugins, blogs map better onto the diary style of a lab notebook. None of these were ever fundamental philosophical differences but just historical differences of implementations and developer priorities. Wave makes most of these differences irrelevant by creating a collaborative document framework that easily incorporates much of the best of all of these tools within a high quality rich text and media authoring platform. Bringing in content looks relatively easy and pushing content out in different forms also seems to be pretty straightforward. Streams, feeds, and other outputs, if not native, look to be easily generated either directly or by passing material to other services. The Waves themselves are XML which should enable straightforward parsing and tweaking with existing tools as well.

One thing I haven’t written much about but have been thinking about is the process of converting lab records into reports and onto papers. While there wasn’t much on display about complex documents a lot of just nice functionality, drag and drop links, options for incorporating and embedding content was at least touched on. Looking a little closer into the documentation there seems to be quite a strong provenance model, built on a code repository style framework for handling document versioning and forking. All good steps in the right direction and with the open APIs and multitouch as standard on the horizon there will no doubt be excellent visual organization and authoring tools along very soon now. For those worried about security and control, a 30 second mention in the keynote basically made it clear that they have it sorted. Private messages (documents? mecuments?) need never leave your local server.

Finally the big issue for me has for some time been bridging the gap between unstructured capture of streams of events and making it easy to convert those to structured descriptions of the intepretation of experiments.  The audience was clearly wowed by the demonstration of inline real time contextual spell checking and translation. My first thought was - I want to see that real-time engine attached to an ontology browser or DbPedia and automatically generating links back to the URIs for concepts and objects. What really struck me most was the use of Waves with a few additional tools to provide authoring tools that help us to build the semantic web, the web of data, and the web of things.

For me, the central challenges for a laboratory recording system are capturing objects, whether digital or physical, as they are created, and then serve those back to the user, as they need them to describe the connections between them. As we connect up these objects we will create the semantic web. As we build structured knowledge against those records we will build a machine-parseable record of what happened that will help us to plan for the future. As I understand it each wave, and indeed each part of a wave, can be a URL endpoint; an object on the semantic web. If they aren’t already it will be easy to make them that. As much as anything it is the web native collaborative authoring tool that will make embedding and pass by reference the default approach rather than cut and past that will make the difference. Google don’t necessarily do semantic web but they do links and they do embedding, and they’ve provided a framework that should make it easy to add meaning to the links. Google just blew the door off the ELN market, and they probably didn’t even notice.

Those of us interested in web-based and electronic recording and communication of science have spent a lot of the last few years trying to describe how we need to glue the existing tools together, mailing lists, wikis, blogs, documents, databases, papers. The framework was never right so a lot of attention was focused on moving things backwards and forwards, how to connect one thing to another. That problem, as far as I can see has now ceased to exist. The challenge now is in building the right plugins and making sure the architecture is compatible with existing tools. But fundamentally the framework seems to be there. It seems like it’s time to build.

A more sober reflection will probably follow in a few days ;-)

What are the services we really want for recording science online?

There is an interesting meta-discussion going on in a variety of places at the moment which touch very strongly on my post and talk (slides, screencast) from last week about “web native” lab notebooks. Over at Depth First, Rich Apodaca has a post with the following little gem of a soundbite:

Could it be that both Open Access and Electronic Laboratory Notebooks are examples of telephone-like capabilities being used to make a better telegraph?

Web-Centric Science offers a set of features orthoginal to those of paper-centric science. Creating the new system in the image of the old one misses the point, and the opportunity, entirely.

Meanwhile a discussion on The Realm of Organic Synthesis blog was sparked off by a post about the need for a Wikipedia inspired chemistry resource (thanks again to Rich and Tony Williams for pointing the post and discussion out respectively). The initial idea here was something along the lines of;

I envision a hybrid of Doug Taber’s Organic Chemistry Portal, Wikipedia and a condensed version of SciFinder.  I’ll gladly contribute!  How do we get the ball rolling?

This in turn has led into a discussion of whether Chemspider and ChemMantis partly fill this role already. The key point being made here is the problem of actually finding and aggregating the relevant information. Tony Williams makes the point in the comments that Chemspider is not about being a central repository in the way that J proposes in the original TROS blog post but that if there are resources out there they can be aggregated into Chemspider. There are two problems here, capturing the knowledge into one “place” and then aggregating.

Finally there is an ongoing discussion in the margins of a post at RealClimate. The argument here is over the level of “working” that should be shown when doing analyses of data, something very close to my heart. In this case both the data and the MatLab routines used to process the data have been made available. What I believe is missing is the detailed record, or log, of how those routines were used to process the data. The argument rages over the value of providing this, the amount of work involved, and whether it could actually have a chilling effect on people doing independent validation of the results. In this case there is also the political issue of providing more material for antagonistic climate change skeptics to pore over and look for minor mistakes that they will then trumpet as holes. It will come as no suprise that I think the benefits of making the log available outweigh the problems. But that we need the tools to do it. This is beautifully summed up in one comment by Tobias Martin at number 64:

So there are really two main questions: if this [making the full record available - CN] is hard work for the scientist, for heaven’s sake, why is it hard work? (And the corrolary: how are you confident that your results are correct?)

And the answer - it is hard work because we still think in terms of a paper notebook paradigm, which isn’t well matched to the data analysis being done within MatLab. When people actually do data analysis using computational systems they very rarely keep a systematic log of the process. It is actually a rather difficult thing to do - even though in principle the system could (and in some cases does) keep that entire record for you.

My point is that if we completely re-imagine the shape and functionality of the laboratory record, in the way Rich and I, and others, have suggested; if the tools are built to capture what happens and then provide that to the outside world in a useful form (when the researcher chooses), then not only will this record exist but it will provide the detailed “clickstream” records that Richard Akerman refers to in answer to a Twitter proposal from Michael Barton:

Michael Barton: Website idea: Rank scientific articles by relevance to your research; get uptakes on them via citations and pubmed “related articles”

Richard Akerman: This is a problem of data, not of technology. Amazon has millions of people with a clear clickstream through a website. We’ve got people with PDFs on their desktops.

Exchange “data files” and “Matlab scripts” for PDFs and you have a statement of the same problem that they guys at RealClimate face. Yes it is there somewhere, but it is a pain in the backside to get it out and give it to someone.

If that “clickstream” and “file use stream” and “relationship” stream was automatically captured then we get closer to the thing that I think many of us a yearning for (and have been for some time). The Amazon recommendation tool for science.

The integrated lab record - or the web native lab notebook

At Science Online 09 and at the Smi Electronic Laboratory Notebook meeting in London later in January I talked about how laboratory notebooks might evolve. At Science Online 09 the session was about Open Notebook Science and here I wanted to take the idea of what a “web native” lab record could look like and show that if you go down this road you will get the most out if you are open. At the ELN meeting which was aimed mainly at traditional database backed ELN systems for industry I wanted to show the potential of a web native way of looking at the laboratory record, and in passing to show that these approaches work best when they are open, before beating a retreat back to the position of “but if you’re really paranoid you can implement all of this behind your firewall” so as not to scare them off too much. The talks are quite similar in outline and content and I wanted to work through some of the ideas here.The central premise is one that is similar to that of many web-service start ups: “The traditional paper notebook is to the fully integrated web based lab record as a card index is to Google”. Or to put it another way, if you think in a “web-native” way then you can look to leverage the power of interlinked networks, tagging, social network effects, and other things that don’t exist on a piece of paper, or indeed in most databases. This means stripping back the lab record to basics and re-imagining it as thought it were built around web based functionality.

So what is a lab notebook? At core it is a journal of events, a record of what has happened. Very similar to a Blog in many ways. An episodic record containing dates, times, bits and pieces of often disparate material, cut and pasted into a paper notebook. It is interesting that in fact most people who use online notebooks based on existing services use Wikis rather than blogs. This is for a number of reasons; better user interfaces, sometimes better services and functionality, proper versioning, or just personal preference. But there is one thing that Wikis tend to do very badly that I feel is crucial to thinking about the lab record in a web native way; they generate at best very ropey RSS feeds. Wikis are well suited to report writing and formalising and sharing procedures but they don’t make very good diaries. At the end of the day it ought to possible to do clever things with a single back end database being presented as both blog and wiki but I’ve yet to see anything really impressive in this space so for the moment I am going to stick with the idea of blog as lab notebook because I want to focus on feeds.

So we have the idea of a blog as the record – a minute to minute and day to day record. We will assume we have a wonderful backend and API and a wide range of clients that suit different approaches to writing things down and different situations where this is being done. Witness the plethora of clients for Twittering in every circumstance and mode of interaction for instance. We’ll assume tagging functionality as well as key value pairs that are exposed as microformats and RDF as appropriate. Widgets for ontology look up and autocompleton if it is desired and the ability to automatically generate input forms from any formal description of what an experiment should look like. But above all, this will be exposed in a rich machine readable format in an RSS/Atom feed.What we don’t need is the ability to upload data. Why not? Because we’re thinking web native. On a blog you don’t generally upload images and video directly, you host them on an appropriate service and embed them on the blog page. All of the issues are handled for you and a nice viewer is put in place. The hosting service is optimised for handling the kind of content you need; Flickr for photos, YouTube (ViddlerBioscreencast) for video, Slideshare for presentations etc. In a properly built ecosystem there would be a data hosting service, ideally one optimised for your type of data, that would provide cut and paste embed codes providing the appropiate visualisations. The lab notebook only needs to point at the data; doesn’t need to know anything much about that data beyond the fact that it is related to the stuff going on around it and that it comes with some html code to embed a visualisation of some sort.

That pointing is the next thing we need to think about. In the way I use the Chemtools LaBLog I use a one post, one item system. This means that every object gets its own post. Each sample, each bottle of material, should have its own post and its own identity. This creates a network of posts that I have written about before. What it also means is that it is possible to apply page rank style algorithms and link analysis more generally in looking at large quantities of posts. Most importantly it encodes the relationship between objects, samples, procedures, data, and analysis in the way the web is tooled up to understand: the relationships are encoded in links. This is a lightweight way of starting to build up a web of data – it doesn’t matter so much to start with whether this is in hardcore RDF as long as there is enough contextual data to make it useful. Some tagging or key-value pairs would be a good start. Most importantly it means that it doesn’t matter at all where our data files are as long as we can point at them with sufficient precision.

But if we’re moving the datafiles off the main record then what about the information about samples? Wouldn’t it be better to use the existing Laboratory Information Management System, or sample management system or database? Well again, as long as you can point at each sample independently with the precision you need then it doesn’t matter. You can use a GoogleSpreadsheet if you want to – you can give URL for each cell, there is a powerful API that would let you build services to make putting the links in easy. We use the LaBLog to keep information on our samples because we have such a wide variety of different materials put to different uses, that the flexibility of using that system rather than a database with a defined schema is important for our way of working. But for other people this may not be the case. It might even be better to use multiple different systems, a database for oligonucleotides, a spreadsheet for environmental samples, and a full blown LIMS for barcoding and following the samples through preparation for sequencing. As long; as it can be pointed at, it can be used. Similar to the data case, it is best to use the system that is best suited to the specific samples. These systems are better developed than they are for data – but many of the existing systems don’t allow a good way of pointing at specific samples from an external document – and very few make it possible to do this via a simple http compliant URL.

So we’ve passed off the data, we’ve passed off the sample management. What we’re left with is the procedures which after all is the core of the record, right? Well no. Procedures are also just documents. Maybe they are text documents, but perhaps they are better expressed as spreadsheets or workflows (or rather the record of running a workflow). Again these may well be better handled by external services, be they word processors, spreadsheets, or specialist services. They just need to be somewhere where we can point at them.

What we are left with is the links themselves, arranged along a timeline. The laboratory record is reduced to a feed which describes the relationships between samples, procedures, and data. This could be a simple feed containing links or a sophisticated and rich XML feed which points out in turn to one or more formal vocabularies to describe the semantic relationship between items. It can all be wired together, some parts less tightly coupled than others, but in principle it can at least be connected. And that takes us one significant step towards wiring up the data web that many of us dream of

The beauty of this approach is that it doesn’t require users to shift from the applications and services that they are already using, like, and understand. What it does require is intelligent and specific repositories for the objects they generate that know enough about the object type to provide useful information and context. What it also requires is good plugins, applications, and services to help people generate the lab record feed. It also requires a minimal and arbitrarily extensible way of describing the relationships. This could be as simple html links with tagging of the objects (once you know an object is a sample and it is linked to a procedure you know a lot about what is going on) but there is a logic in having a minimal vocabulary that describes relationships (what you don’t know explicitly in the tagging version is whether the sample is an input or an output). But it can also be fully semantic if that is what people want. And while the loosely tagged material won’t be easily and tightly coupled to the fully semantic material the connections will at least be there. A combination of both is not perfect, but it’s a step on the way towards the global data graph.

Recording the fiddly bits of experimental and data analysis work

We are in the slow process of gearing up within my group at RAL to adopting the Chemtools LaBLog system and in the process moving properly to an Open Notebook status. This has taken much longer than I had hoped but there have been some interesting lessons along the way. Here I want to think a bit about a problem that has been troubling me for a while.

I haven’t done a very good job of recording what I’ve been doing in the times that I have been in a lab over the past couple of months. Anyone who has been following along will have seen small bursts of apparently unrelated activity where nothing much ever seems to come to a conclusion. This has been divided up mainly into a) a SANS experiment we did in early November which has now moved into a data analysis phase, b) some preliminary, and thus far fairly unconvincing experiments, attempting to use a very new laser tweezers setup at the Central Laser Facility to measure protein-DNA interactions at the single molecule level and c) other random odds and sods that have come by. None of these have been very well recorded for a variety of reasons.

Data analysis, particularly when it uses a variety of specialist software tools, is something I find very challenging to record. A common approach is to take some relatively raw data, run it through some software, and repeat, while fiddling with parameters to get a feel for what is going on. Eventually the analysis is run “for real” and the finalised (at least for the moment) structure/number/graph is generated. The temptation is obviously just to formally record the last step but while this might be ok as a minimum standard if only one person is involved, when more people are working through data sets it makes sense to try and keep track of exactly what has been done and which data has been partially processes in which ways. This helps us both in terms of being able to quickly track where we are with the process but also reduces the risk of replicating effort.

The laser tweezers experiment involves a lot of optimising of buffer conditions, bead loading levels, instrumental parameters and whatnot. Essentially a lot of fiddling, rapid shifts from one thing to another and not always being too sure exactly what is going on. We are still at the stage of getting a feel for things rather than stepping through a well ordered experiment. Again the recording tends to be haphazard as you try on thing  and then another. We’re not even absolutely sure what we should be recording for each “run” or indeed really what a “run” is yet.

The common theme here is “fiddling” and the difficulty of recording it efficiently, accurately, and usefully. What I would prefer to be doing is somehow capturing the important aspects of what we’re doing as we do it. What is less clear is what the best way to do that is. In the case of data analysis we have good model for how to do this well. Good use of repositories and the use of versioned scripts for handling data conversions, in the way the Michael Barton in particular has talked about provide an example of good practice. Unfortunately it is good practice that is almost totally alien to experimental biochemists and is also not easily compatible with a lot of the software we use.

The ideal would be a work bench using a graphical representation of data analysis tools and data repositories that would automatically generate scripts and deposit these and versioned data files into an appropriate repository. This would enable the “docking” of arbitrary web services, software packages and whatever, as well as connection to shared data stores. The purpose of the workbench would be to record what is done, although it might also provide some automation tools. In many ways this is what I think of when look at work flow engines like Taverna and platforms for sharing workflows like MyExperiment.

Its harder in the real world. Here the workbench is, well, the workbench but the idea of recording everything along with contextual metadata is pretty similar. The challenge lies in recording enough different aspects of what is going on to capture the important stuff without generating a huge quantity or data that can never be searched effectively. It is possible to record multiple video streams, audio, screencast any control computers , but it will be almost impossible to find anything in these data streams.

A challenge that emerges over and over again in laboratory recording is that you always seem to not be recording the thing that you really now need to have. Yet if you record everything you still won’t have it because you won’t be able to find it. Video, image, and audio search will one day make a huge difference to this but in the meantime I think we’re just going to keep muddling on.

The distinction between recording and presenting - and what it means for an online lab notebook

Something that has been bothering me for quite some time fell into place for me in the last few weeks. I had always been slightly confused by my reaction to the fact that on UsefulChem Jean-Claude actively works to improve and polish the description of the experiments on the wiki. Indeed this is one of the reasons he uses a wiki as the process of making modifications to posts on blogs is generally less convenient and in most cases there isn’t a robust record of the different versions. I have always felt uncomfortable about this because to me a lab book is about the record of what happened - including any mistakes in recording you make along the way. There is some more nebulous object (probably called a report) which aggregates and polishes the description of the experiments together.

Now this is fine, but point is that the full history of a UsefulChem page is immediately available from the history. So the full record is very clearly there - it is just not what is displayed. In our system we tend to capture a warts and all view of what was recorded at the time and only correct typos or append comments or observations to a post. This tends not be very human readable in most cases - to understand the point of what is going on you have to step above this to a higher level - one which we are arguably not very good at describing at the moment.

I had thought for a long time that this was a difference between our respective fields. The synthetic chemistry of UsefulChem lends itself to a slightly higher level description where the process of a chemical reaction is described in a fairly well defined, community accepted, style. Our biochemistry is more a set of multistep processes where each of those steps is quite stereotyped. In fact for us it is difficult to define where the ‘experiment’ begins and end. This is at least partly true, but actually if you delve a little deeper and also have a look at Jean-Claude’s recent efforts to use a controlled vocabulary to describe the synthetic procedures a different view arises. Each line of one these ‘machine readable’ descriptions actually maps very well onto each of our posts in the LaBLog. Something that maps on even better is the log that appears near the bottom of each UsefulChem page. What we are actually recording is rather similar. It is simply that Jean-Claude is presenting it at a different level of abstraction.

And that I think is the key. It is true that synthetic chemistry lends itself to a slightly different level of abstraction than biochemistry and molecular biology, but the key difference actually comes in motivation. Jean-Claude’s motivation from the beginning has been to make the research record fully available to other scientists; to present that information to potential users. My focus has always been on recording the process that occurs in the lab and particular to capture the connections between objects and data files. Hence we have adopted a fine grained approach that provides a good record, but does not necessarily make it easy for someone to follow the process through. On UsefulChem the ideal final product contains a clear description of how to repeat the experiment. On the LaBLog this will require tracking through several posts to pick up the thread.

This also plays into the discussion I had some months ago with Frank Gibson about the use of data models. There is a lot to be said for using a data model to present the description of an experiment. It provides all sorts of added value to have an agreed model of what these descriptions look like. However it is less clear to me that it provides a useful way of recording or capturing the research process as it happen, at least in a general case. Stream of consciousness recording of what has happened, rather than stopping halfway through to figure out how what you are doing fits into the data model, is what is required at the recording stage. One of the reasons people feel uncomfortable with electronic lab notebooks is that they feel they will lose the ability to scribble such ‘free form’ notes - the lack of any presuppositions about what the page should loook like is one of the strengths of pen and paper.

However, once the record, or records, have been made then it is appropriate to pull these together and make sense of them - to present the description of an experiment in a structured and sensible fashion. This can of course be linked back to the primary records and specific data files but it provides a comprehensible and fine grained descriptionof the rationale for and conduct of the experiment as well as placing the results in context. This ‘presentation layer’ is something that is missing from our LaBLog but could relatively easily be pulled together by writing up the methodology section for a report. This would be good for us and good for people coming into the system looking for specific information.

Person Frank Gibson

Right click for SmartMenu shortcuts

The trouble with semantics…

…is knowing what you mean…

I posted last week about the spontaneous CMLReact hackfest held around Peter Murray-Rust’s dining room table the day after Science Blogging in London. There were a number of interesting things that came out of the exercise for me. The first was that it would be relatively easy to design a moderately strict, but pretty standard, description format for a synthetic chemistry lab notebook that could be automatically scraped into CMLReact.

Automatic conversions from lab book to machine readable XML

CMLReact files have (roughly) three sections. In the first, all the molecules that are relevant to the description are described, or in the ideal semantic web world pointed to at an external authority such as Chemspider, PubChem, or other source. In the second section the relationships between input materials, solvents, products, and samples are described. In general all of these will be molecules which are referred to in the first session but this is not absolutely required (and this will be important later). The final section describes observables, procedures, yields, and other descriptions of what happened or what was measured.

If we take a look at the UsefulChem experiment that we converted to CMLReact you can see that most of this information is available in one form or another. The molecules are described via InChi/InChiKey at the bottom of the page. This could be used as they are to populate the molecules section. A little additional markup to distinguish between reactants, solvents, reagents, and products would make it possible to start populating the second section describing the relationships between these molecules.

The third section is the most tricky, and this will always be an 80:20 game. The object is to abstract as much information as can be reasonably garnered without putting in the vast amount of work required to get close to 100% retrieval. At the end of the day, if someone wants the real detail they can go back to the lab book. Peter has demonstrated text scraping tools that do a pretty good job of extracting a lot of this information. In combination with a bit of markup it is reasonable to expect that some basic information (amounts of reagents, yield, temperature of reaction, some descriptive terms) could reasonably be extracted. Again, getting 80-90% of a subset of regularly used terms  would be very powerful.

But what are we describing?

There is a problem with grabbing this descriptive information from the lab notebook however, and it is a problem that is very general and something I believe we need to grapple with urgently. There is a fundamental question as to what it is that this file is describing. Does it describe the plan of the experiment? The record of carrying out a specific example of this experiment? An ‘averaged’ description of a set of equivalent experiments? A general description of the reaction? Or a description of a model of what we expect or think is happening?

If you look closely at the current version of the CMLReact file you will see that the yield is expressed as a percentage with a standard deviation. This is actually describing the average of three independent reactions but that is not actually made explicit anywhere in this file. Is this important? Well I think it is because it has an effect on what any outward links back to the lab book mean. There is a significant difference between – ‘this link points to an example of this kind of reaction’ (which might in fact be significantly different in the details) and ‘this link points to this exact experiment’ or indeed ‘this link points to an index of relevant experimental results’. Those distinctions need to be encoded in the links, or perhaps more likely made explicit in the abstracted file.

The CMLReact file is an abstraction of the experimental record. It is therefore important to make it clear what the level of abtraction is and what has been abstracted out of that description. This relates to the distinction I have made before between the flexibility required to record an experiment versus the ability to use a more structured vocabulary to describe the experiment after it has happened. My impression is that people who work in developing these controlled vocabularies are focussed on description rather than recording and don’t often make the distinction between the two. There is also often a lack of distinction between describing an experiment and describing a model of what happened in that experiment.  This is important because the model may need to be modified in the future whereas the description of the experiment should be accurate.

Summary

My view remains that when recording an experiment the system used should be as flexible as possible. Structure can be added to this primary record when convenient to make the process of abstracting from this primary record to a controlled vocabulary easier. The primary goal for me, for the moment, remains making a human readable record available. The process of converting the primary record into a controlled vocabulary, such as CMLReact, FuGE, or workflow system such as Taverna, should be enabled via domain specific automated or semi-automated tools that help the user to structure their description of the experiment in a way that makes it more directly useful to them but maintains the links with the primary record. Where the same controlled vocabulary is used for more abstracted descriptions of studies, experiments, or the models that purport to describe them, this distinction must be made clear.

Semantics depends absolutely on being clear about what you are describing. There is absolutely no point in having absolute clarity about the description of an object if the nature of that object is fuzzy. Get it right and we could have a very sophisticated description of the scientific record. Get it wrong and that description could be at best unclear and at worst downright misleading.

Another lab notebook framework

OpenWetWare logo, designed by Jennifer Cook-ChrysosImage via Wikipedia

Just a very brief note, which really follows on from the vigorous discussion in Jennifer Rohn’s blog at Nature Networks this week, to say that the guys at OpenWetWare appear to have gone live with some of the new functionality for laboratory notebooks on the wiki. Check it out from the OWW main page. I will have a closer look and make some comments as and when I have a bit of time but it looks like a good start.

Workshop on Blog Based Notebooks

DUE TO SEVERE COMMENT SPAM ON THIS POST I HAVE CLOSED IT TO COMMENTS

On February 28/29 we held a workshop on our Blog Based notebook system at the Cosener’s House in Abingdon, Oxfordshire. This was a small workshop with 13 people including biochemists (from Southampton, Cardiff, and RAL), social scientists (from Oxford Internet Institute and Computing Laboratory), developers from the MyGrid and MyExperiment family and members of the the Blog development team. The purpose of the workshop was to try and identify both the key problems that need to be addressed in the next version of the Blog Based notebook system and also to ask the question ‘Does this system deliver functionality that people want’. Another aim was to identify specific ways in which we could interact with MyExperiment to deliver enhanced functionality for us as well as new members for MyExperiment.

Read more »

A quick update

I have got very behind. I’ve only just realised just how far behind but my excuse is that I have been rather busy. How far behind I was was brought home by the fact that I hadn’t actually commented as yet that the proposal for an Open Science session at PSB that was driven primarily by Shirley Wu has gone in and the proposal is now up at Nature Precedings. The posting there has already generated some new contacts.

On Tuesday I gave a talk at UKOLN at the University of Bath. Brian Kelly kindly videoed the first 10 minutes of the presentation when my attempts to record a screencast failed miserably and has blogged about the talk and on recording talks more generally for public consumption. Jean-Claude does this very effectively but this is something we should perhaps all be more putting a lot more effort into (and can someone tell me what the best software for recording screencasts is?!?). I got a lot of the talk on audio recording and will attempt to record a screencast when I can find time.
The talk was interesting; this was to a group of library/repository/curation/technical experts rather than the usual attempt to convince a group of sceptical scientists. Many of them are already ‘open’ advocates but are focused on technical issues. Lots of smart question on how do you really manage secure identities across multiple systems; how do we make data on the cloud stable for the long term; how do you choose between competing standards for describing and collating data; fundamentally how do you actually make all this work. Interesting discussion all in all and great to meet the people at UKOLN and finally meet Liz Lyon in person.

The other thing happening this week is that tomorrow and Friday we are running a small workshop introducing potential users to our Blog based notebook. Our aim is to see how other people’s working processes do or don’t fit into our system. This is still focused on biochemistry/molecular biology but it will be very interesting to see what comes out of this. I will try to report as soon as possible.

Finally; I think there is something in the air. This week has seen a rush of emails from people who have seen Blog posts, proposals, and other things writing to offer support, and perhaps more crucially access to more contacts.

And further on the PLoS front the biggest story in the UK news on Tuesday morning was about the paper in PLoS Medicine reporting on the results of a meta-study of the effectiveness of SSRIs in treating depression. I woke up to this story on BBC radio and by the time I gave my talk at 10:30 I’d had a chance at least to read the paper abstract. If I’d been on SSRIs this could be really important to me. Perhaps more to the point, if I were a doctor realising I’d be fielding phone calls from concerned patients all day, I could have read the paper. This story tells us a lot about why Open Access and Open Data are crucial. But more on that in another post sometime…I promise.