An openwetware blog on the challenges of open and connected science

format

The distinction between recording and presenting - and what it means for an online lab notebook

Something that has been bothering me for quite some time fell into place for me in the last few weeks. I had always been slightly confused by my reaction to the fact that on UsefulChem Jean-Claude actively works to improve and polish the description of the experiments on the wiki. Indeed this is one of the reasons he uses a wiki as the process of making modifications to posts on blogs is generally less convenient and in most cases there isn’t a robust record of the different versions. I have always felt uncomfortable about this because to me a lab book is about the record of what happened - including any mistakes in recording you make along the way. There is some more nebulous object (probably called a report) which aggregates and polishes the description of the experiments together.

Now this is fine, but point is that the full history of a UsefulChem page is immediately available from the history. So the full record is very clearly there - it is just not what is displayed. In our system we tend to capture a warts and all view of what was recorded at the time and only correct typos or append comments or observations to a post. This tends not be very human readable in most cases - to understand the point of what is going on you have to step above this to a higher level - one which we are arguably not very good at describing at the moment.

I had thought for a long time that this was a difference between our respective fields. The synthetic chemistry of UsefulChem lends itself to a slightly higher level description where the process of a chemical reaction is described in a fairly well defined, community accepted, style. Our biochemistry is more a set of multistep processes where each of those steps is quite stereotyped. In fact for us it is difficult to define where the ‘experiment’ begins and end. This is at least partly true, but actually if you delve a little deeper and also have a look at Jean-Claude’s recent efforts to use a controlled vocabulary to describe the synthetic procedures a different view arises. Each line of one these ‘machine readable’ descriptions actually maps very well onto each of our posts in the LaBLog. Something that maps on even better is the log that appears near the bottom of each UsefulChem page. What we are actually recording is rather similar. It is simply that Jean-Claude is presenting it at a different level of abstraction.

And that I think is the key. It is true that synthetic chemistry lends itself to a slightly different level of abstraction than biochemistry and molecular biology, but the key difference actually comes in motivation. Jean-Claude’s motivation from the beginning has been to make the research record fully available to other scientists; to present that information to potential users. My focus has always been on recording the process that occurs in the lab and particular to capture the connections between objects and data files. Hence we have adopted a fine grained approach that provides a good record, but does not necessarily make it easy for someone to follow the process through. On UsefulChem the ideal final product contains a clear description of how to repeat the experiment. On the LaBLog this will require tracking through several posts to pick up the thread.

This also plays into the discussion I had some months ago with Frank Gibson about the use of data models. There is a lot to be said for using a data model to present the description of an experiment. It provides all sorts of added value to have an agreed model of what these descriptions look like. However it is less clear to me that it provides a useful way of recording or capturing the research process as it happen, at least in a general case. Stream of consciousness recording of what has happened, rather than stopping halfway through to figure out how what you are doing fits into the data model, is what is required at the recording stage. One of the reasons people feel uncomfortable with electronic lab notebooks is that they feel they will lose the ability to scribble such ‘free form’ notes - the lack of any presuppositions about what the page should loook like is one of the strengths of pen and paper.

However, once the record, or records, have been made then it is appropriate to pull these together and make sense of them - to present the description of an experiment in a structured and sensible fashion. This can of course be linked back to the primary records and specific data files but it provides a comprehensible and fine grained descriptionof the rationale for and conduct of the experiment as well as placing the results in context. This ‘presentation layer’ is something that is missing from our LaBLog but could relatively easily be pulled together by writing up the methodology section for a report. This would be good for us and good for people coming into the system looking for specific information.

Person Frank Gibson

Right click for SmartMenu shortcuts

The trouble with semantics…

…is knowing what you mean…

I posted last week about the spontaneous CMLReact hackfest held around Peter Murray-Rust’s dining room table the day after Science Blogging in London. There were a number of interesting things that came out of the exercise for me. The first was that it would be relatively easy to design a moderately strict, but pretty standard, description format for a synthetic chemistry lab notebook that could be automatically scraped into CMLReact.

Automatic conversions from lab book to machine readable XML

CMLReact files have (roughly) three sections. In the first, all the molecules that are relevant to the description are described, or in the ideal semantic web world pointed to at an external authority such as Chemspider, PubChem, or other source. In the second section the relationships between input materials, solvents, products, and samples are described. In general all of these will be molecules which are referred to in the first session but this is not absolutely required (and this will be important later). The final section describes observables, procedures, yields, and other descriptions of what happened or what was measured.

If we take a look at the UsefulChem experiment that we converted to CMLReact you can see that most of this information is available in one form or another. The molecules are described via InChi/InChiKey at the bottom of the page. This could be used as they are to populate the molecules section. A little additional markup to distinguish between reactants, solvents, reagents, and products would make it possible to start populating the second section describing the relationships between these molecules.

The third section is the most tricky, and this will always be an 80:20 game. The object is to abstract as much information as can be reasonably garnered without putting in the vast amount of work required to get close to 100% retrieval. At the end of the day, if someone wants the real detail they can go back to the lab book. Peter has demonstrated text scraping tools that do a pretty good job of extracting a lot of this information. In combination with a bit of markup it is reasonable to expect that some basic information (amounts of reagents, yield, temperature of reaction, some descriptive terms) could reasonably be extracted. Again, getting 80-90% of a subset of regularly used terms  would be very powerful.

But what are we describing?

There is a problem with grabbing this descriptive information from the lab notebook however, and it is a problem that is very general and something I believe we need to grapple with urgently. There is a fundamental question as to what it is that this file is describing. Does it describe the plan of the experiment? The record of carrying out a specific example of this experiment? An ‘averaged’ description of a set of equivalent experiments? A general description of the reaction? Or a description of a model of what we expect or think is happening?

If you look closely at the current version of the CMLReact file you will see that the yield is expressed as a percentage with a standard deviation. This is actually describing the average of three independent reactions but that is not actually made explicit anywhere in this file. Is this important? Well I think it is because it has an effect on what any outward links back to the lab book mean. There is a significant difference between – ‘this link points to an example of this kind of reaction’ (which might in fact be significantly different in the details) and ‘this link points to this exact experiment’ or indeed ‘this link points to an index of relevant experimental results’. Those distinctions need to be encoded in the links, or perhaps more likely made explicit in the abstracted file.

The CMLReact file is an abstraction of the experimental record. It is therefore important to make it clear what the level of abtraction is and what has been abstracted out of that description. This relates to the distinction I have made before between the flexibility required to record an experiment versus the ability to use a more structured vocabulary to describe the experiment after it has happened. My impression is that people who work in developing these controlled vocabularies are focussed on description rather than recording and don’t often make the distinction between the two. There is also often a lack of distinction between describing an experiment and describing a model of what happened in that experiment.  This is important because the model may need to be modified in the future whereas the description of the experiment should be accurate.

Summary

My view remains that when recording an experiment the system used should be as flexible as possible. Structure can be added to this primary record when convenient to make the process of abstracting from this primary record to a controlled vocabulary easier. The primary goal for me, for the moment, remains making a human readable record available. The process of converting the primary record into a controlled vocabulary, such as CMLReact, FuGE, or workflow system such as Taverna, should be enabled via domain specific automated or semi-automated tools that help the user to structure their description of the experiment in a way that makes it more directly useful to them but maintains the links with the primary record. Where the same controlled vocabulary is used for more abstracted descriptions of studies, experiments, or the models that purport to describe them, this distinction must be made clear.

Semantics depends absolutely on being clear about what you are describing. There is absolutely no point in having absolute clarity about the description of an object if the nature of that object is fuzzy. Get it right and we could have a very sophisticated description of the scientific record. Get it wrong and that description could be at best unclear and at worst downright misleading.

The trouble with institutional repositories

A tag cloud with terms related to Web 2.I spent today at an interesting meeting at Talis headquarters where there was a wide range of talks. Most of the talks were liveblogged by Andy Powell and also by Owen Stephens (who has written a much more comprehensive summary of Andy’s talk) and there will no doubt be some slides and video available on the web in future. The programme is also available. Here I want to focus on Andy Powell’s talk (slides), partly because he obviously didn’t liveblog it but primarily because it crystallised for me many aspects of the way we think about Institutional Repositories. For those not in the know, these are warehouses that are becoming steadily more popular, run generally by unversities to house their research outputs, in most cases peer reviewed papers. Self archiving of some version of published papers is the so called ‘Green Route’ to open access.

The problem with institutional repositories in their current form is that academics don’t use them. Even when they are being compelled there is massive resistance from academics. There are a variety of reasons for this: academics don’t like being told how to do things; they particularly don’t like being told what to do by their institution; the user interfaces are usually painful to navigate. Nonetheless they are a valuable part of the route towards making more research results available. I use plenty of things with ropey interfaces because I see future potential in them. Yet I don’t use either of the repositories in the places where I work – in fact they make my blood boil when I am forced to. Why?

So Andy was talking about the way repositories work and the reasons why people don’t use them. He had already talked about the language problem. We always talk about ‘putting things in the repository’ rather than ‘making them available on the web’. He had mentioned already that the institutional nature of repositories does not map well onto the social networks of the academic users which probably bear little relationship with institutions and are much more closely aligned to discipline and possibly geographic boundaries (although they can easily be global).

But for me the key moment was when Andy asked ‘How many of you have used SlideShare’. Half the people in the room put their hands up. Most of the speakers during the day pointed to copies of their slides on SlideShare. My response was to mutter under my breath ‘And how many of them have put presentations in the institutional repository?’ The answer to this; probably none. SlideShare is a much better ‘repository’ for slide presentations than IRs. There are more there, people may find mine, it is (probably) Google indexed. But more importantly I can put slides up with one click, it already knows who I am, I don’t need to put in reams of metadata, just a few tags. And on top of this it provides added functionality including embedding in other web documents as well as all the social functions that are a natural part of a ‘Web2.0’ site.

SlideShare is a very good model of what a Repository can be. It has issues. It is a third party product, it may not have long term stability, it may not be as secure as some people would like. But it provides much more of the functionality that I want from a service for making my presentations available on the web. It does not serve the purpose of an archive – and maybe an institutional repository is better in that role. But for the author, the reason for making things available is so that people use them. If I make a video that relates to my research it will go on YouTube, Bioscreencast, or JoVE, not in the institutional repository, I put research related photos on Flickr, not in the institutional repository, and critically, I leave my research papers on the websites of the journal that published them, and cannot be bothered with the work required to put them in the institutional repository.

Andy was arguing for global discipline specific repositories. I would suggest that the lesson of the Web2.0 sites is that we should have data type specific repositories. FlickR is for pictures, SlideShare for presentations. In each case the specialisation enables a sort of implicit metadata and for the site to concentrate on providing functionality that adds value to that particular data type. Science repositories could win by doing the same. PDB, GenBank, SwissProt deal with specific types of data. Some might argue that GenBank is breaking under the strain of the different types and quantities of data generated by the new high throughput sequencing tools. Perhaps a new repository is required that is specially designed for this data.

So what is the role for the institutional repository? The preservation of data is one aspect. Pulling down copies of everything to provide an extra backup and retain an institutional record. If not copying then indexing and aggregating so as to provide a clear guide to the institutions outputs. This needn’t be handled in house of course and can be outsourced. As Paul Miller suggested over lunch, the role of the institution need not be to keep a record of everything, but to make sure that such a record is kept. Curation may be another, although that may be too big a job to be tackled at institutional level. When is a decision made that something isn’t worth keeping anymore? What level of metadata or detail is worth preserving?

But the key thing is that all of this should be done automatically and must not require intervention by the author. Nothing drives me up the wall more than having to put the same set of data into two subtly different systems more than once. And as far as I can see there is no need to do so. Aggregate my content automatically, wrap it up and put it in the repository, but I don’t want to have to deal with it. Even in the case of peer reviewed papers it ought to be feasible to pull down the vast majority of the metadata required. Indeed, even for toll access publishers, everything except the appropriate version of the paper. Send me a polite automated email and ask me to attach that and reply. Job done.

For this to really work we need to take an extra step in the tools available. We need to move beyond files that are simply ‘born digital’ because these files are in many ways still born. This current blog post, written in Word on the train is a good example. The laptop doesn’t really know who I am, it probably doesn’t know where I am, and it has not context for the particular word document I’m working on. When I plug this into the Wordpress interface at OpenWetWare all of this changes. The system knows who I am (and could do that through OpenID). It knows what I am doing (writing a Blog post) and the Zemanta Firefox plug in does much better than that, suggesting tags, links, pictures and keywords.

Plugins and online authoring tools really have the potential to automatically generate those last pieces of metadata that aren’t already there. When the semantics comes baked in then the semantic web will fly and the metadata that everyone knows they want, but can’t be bothered putting in, will be available and re-useable, along with the content. When documents are not only born digital but born on and for the web then the repositories will have probably still need to trawl and aggregate. But they won’t have to worry me about it. And then I will be a happy depositor.


Related articles


Zemanta Pixie

More on FuGE and data models for lab notebooks

Frank Gibson has posted again in our ongoing conversation about using FUGE as a data model for laboratory notebooks. We have also been discussing things by email and I think we are both agreed that we need to see what actually doing this would look like. Frank is looking at putting some of my experiments into a FUGE framework and we will see how that looks. I think that will be the point where we can really make some progress. However here I wanted to pick up on a couple of points he has made in his last post. Read more »

Semantics in the real world? Part II - Probabilistic reasoning on contingent and dynamic vocabularies

Rendering of human brain.And other big words I learnt from mathematicians…

The observant amongst you will have realised that the title of my previous post pushing a boat out into the area of semantics and RDF implied there was more to come. Those of you who followed the reaction [comments in original post, 1, 2, 3] will also be aware that there are much smarter and more knowledgeable people out there thinking about these problems. Nonetheless, in the spirit of thinking aloud I want to explore these ideas a little further because they underpin the way I think about the LaBLog and its organization. As with the last post this comes with the health warning that I don’t really know what I’m talking about. Read more »

Data models for capturing and describing experiments - the discussion continues

Frank Gibson has continued the discussion that kicked off here and has continued here [1, 2, 3, 4] and in other places [1, 2] along the way. Frank’s exposition on using FuGE as a data model is very clear in what it says and does not say and some of his questions have revealed sloppiness in the way I originally described what I was trying to do. Here I will respond to his responses and try to clarify what it is that I want, and what I want it to achieve. I still feel that we are trying to describe and achieve different things, but that this discussion is a great way of getting to the bottom of this and achieving some clarity in our description and language. Read more »