An openwetware blog on the challenges of open and connected science


How to make Connotea a killer app for scientists

So Ian Mulvaney asked, and as my solution did not fit into the margin I thought I would post here. Following on from the two rants of a few weeks back and many discussions at Scifoo I have been thinking about how scientists might be persuaded to make more use of social web based tools. What does it take to get enough people involved so that the network effects become apparent? I had a discussion with Jamie Heywood of Patients Like Me at Scifoo because I was interested in why people with chronic diseases were willing to share detailed and very personal information in a forum that is essentially public. His response was that these people had an ongoing and extremely pressing need to optimise their treatment regime and lifestyle as far as possible, and that by correlating their experiences with others they got to the required answers quicker. Essentially, successful management of their life required rapid access to high quality information, sliced and diced in a way that made sense to them and presented in as efficient and timely a manner as possible. Which obviously left me none the wiser as to why scientists don’t get it…

Nonetheless there are some clear themes that emerge from that conversation and others looking at uptake and use of web based tools. So here are my 5 thoughts. These are framed around the idea of reference management but the principles I think are sufficiently general to apply to most web services.

  1. Any tool must fit within my existing workflows. Once adopted I may be persuaded to modify or improve my workflow but to be adopted it has to fit to start with. For citation management this means that it must have one click filing (ideally from any place I might find an interesting paper) and must also monitor the other ways I mark papers, e.g. shared items from Google Reader, ‘liked’ items on FriendFeed, or tags scraped from del.icio.us.
  2. Any new tool must clearly outperform all the existing tools that it will replace in the relevant workflows without the requirement for network or social effects. It’s got to be absolutely clear on first use that I am going to want to use this instead of e.g. Endnote. That means I absolutely have to be able to format and manage references in a word processor or publication document. Technically a nightmare I am sure (you’ve got to worry about integration with Word, OpenOffice, Google Docs, TeX) but an absolute necessity to get widespread uptake. And this has to be absolutely clear the first time I use the system, before I have created any local social network and before you have a large enough user base for those effects to kick in.
  3. It must be near 100% reliable with near 100% uptime. Web services have a bad reputation for going down. People don’t trust their network connection and are still much happier with local applications. Don’t give them an excuse to go back to a local app because the service goes down. Addendum: make sure people can easily back up and download their stuff in a form that will be useful even if your service disappears. Obviously they’ll never need to but it will make them feel better (and don’t scrimp on this because they will check if it works).
  4. Provide at least one (but not too many) really exciting new feature that makes people’s lives better. This is related to #2 but is taking it a step further. Beyond just doing what I already do better I need a quick fix of something new and exciting. My wishlist for Connotea is below.
  5. Prepopulate. Build in publicly available information before the users arrive. For a publications database this is easy and this is something that BioMedExperts got right. You have a pre-existing social network and pre-existing library information. Populate ‘ghost’ accounts with a library that includes people’s papers (doesn’t matter if it’s not 100% accurate) and connections based on co-authorships. This will give people an idea of what the social aspect can bring and encourage them to bring more people on board.
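
To make point 5 a little more concrete, here is a rough sketch of how ‘ghost’ accounts might be prepopulated from a list of publications. It is only an illustration: the records, the author names and the `prepopulate` function are all made up for this post and have nothing to do with how Connotea or BioMedExperts actually work. A real service would also have to deal with author name disambiguation, which this ignores entirely.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical publication records. A real service would load these from
# a public bibliographic source; the titles and names here are invented.
PAPERS = [
    {"title": "Paper one", "authors": ["A. Author", "B. Writer"]},
    {"title": "Paper two", "authors": ["A. Author", "C. Scholar"]},
    {"title": "Paper three", "authors": ["B. Writer"]},
]


def prepopulate(papers):
    """Build a ghost library and a co-author network for every author."""
    libraries = defaultdict(list)    # author -> titles of their papers
    connections = defaultdict(set)   # author -> set of co-authors
    for paper in papers:
        for author in paper["authors"]:
            libraries[author].append(paper["title"])
        # Every pair of authors on the same paper becomes a connection.
        for first, second in combinations(paper["authors"], 2):
            connections[first].add(second)
            connections[second].add(first)
    return libraries, connections


libraries, connections = prepopulate(PAPERS)
for author in sorted(libraries):
    print(author, libraries[author], sorted(connections[author]))
```

When a real person turns up and claims their ghost account they find a library and a set of contacts already waiting, which is the whole point of the exercise.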

So that is so much motherhood and apple pie. And nothing that Ian didn’t already know (unlike some other developers who I shan’t mention). But what about those cool features? Again I would take a back to basics approach. What do I actually want?

Well what I want is a service that will do three quite different things. I want it to hold a library of relevant references in a way I can search and use and I want to use this to format and reference documents when I write them. I want it to help me manage the day to day process of dealing with the flood of literature that is coming in (real time search). And I want it to help me be more effective when I am researching a new area or trying to get to grips with something (offline search). Real time search I think is a big problem that isn’t going to be solved soon. The library and document writing aspects I think are a given and need to be the first priority. The third problem is the one that I think is amenable to some new thinking.

What I would really like to see here is a way of pivoting my view of the literature around a specific item. This might be a paper, a dataset, or a blog post. I want to be able to click once and see everything that item cites, click again and see everything that cites it. Pivot away from that to look at what GoPubmed thinks the paper is about and see what related items it has, and then pivot back and see how many items those two sets have in common. What are the papers in this area that this review isn’t citing? Is there a set of authors this paper isn’t citing? Have they looked at all the datasets that they should have? Are there general news media items in this area, books on Amazon, books in my nearest library, books on my bookshelf? Are they any good? Have any of my trusted friends published or bookmarked items in this area? Do they use the same tags or different ones for this subject? What exactly is Neil Saunders doing looking at that gene? Can I map all of my friends’ tags onto a controlled vocabulary?

Essentially what I am asking for is to be able to traverse the graph of how all these things are interconnected. Most of these connections are already explicit somewhere but nowhere are they all brought together in a way that the user can slice and dice them the way they want. My belief is that if you can start to understand how people use that graph effectively to find what they want then you can start to automate the process, and that this will be the route towards real time search that actually works.
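
As a toy illustration of the kind of pivoting I mean, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the item identifiers, the citation edges and the topic labels are invented, and the `TOPICS` table merely stands in for whatever a service like GoPubmed might say an item is about. It is not how any existing service works.

```python
from collections import defaultdict

# Hypothetical citation edges as (citing item, cited item) pairs.
CITATIONS = [
    ("review", "paper_a"),
    ("review", "paper_b"),
    ("paper_c", "paper_a"),
    ("blog_post", "review"),
]

# Hypothetical topic labels for each item.
TOPICS = {
    "review": {"kinases"},
    "paper_a": {"kinases"},
    "paper_b": {"kinases"},
    "paper_c": {"kinases"},
    "paper_d": {"kinases"},
    "blog_post": {"open science"},
}


def build_index(edges):
    """Index the edges in both directions so each pivot is one lookup."""
    cites, cited_by = defaultdict(set), defaultdict(set)
    for source, target in edges:
        cites[source].add(target)
        cited_by[target].add(source)
    return cites, cited_by


def related_by_topic(item, topics):
    """Other items that share at least one topic with the given item."""
    own = topics.get(item, set())
    return {other for other, labels in topics.items()
            if other != item and labels & own}


def uncited_related(item, cites, topics):
    """Items on the same topic that the given item does not cite."""
    return related_by_topic(item, topics) - cites[item]


cites, cited_by = build_index(CITATIONS)
print(sorted(cites["review"]))                           # what the review cites
print(sorted(cited_by["review"]))                        # what cites the review
print(sorted(uncited_related("review", cites, TOPICS)))  # what it is missing
```

The interesting questions are all just set operations once the connections sit in one place. The hard part is gathering the edges from all the places where they currently live.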

…but you’ll struggle with uptake…

The trouble with institutional repositories

I spent today at an interesting meeting at Talis headquarters where there was a wide range of talks. Most of the talks were liveblogged by Andy Powell and also by Owen Stephens (who has written a much more comprehensive summary of Andy’s talk) and there will no doubt be some slides and video available on the web in future. The programme is also available. Here I want to focus on Andy Powell’s talk (slides), partly because he obviously didn’t liveblog it but primarily because it crystallised for me many aspects of the way we think about Institutional Repositories. For those not in the know, these are warehouses that are becoming steadily more popular, run generally by universities to house their research outputs, in most cases peer reviewed papers. Self archiving of some version of published papers is the so called ‘Green Route’ to open access.

The problem with institutional repositories in their current form is that academics don’t use them. Even when they are being compelled there is massive resistance from academics. There are a variety of reasons for this: academics don’t like being told how to do things; they particularly don’t like being told what to do by their institution; the user interfaces are usually painful to navigate. Nonetheless they are a valuable part of the route towards making more research results available. I use plenty of things with ropey interfaces because I see future potential in them. Yet I don’t use either of the repositories in the places where I work – in fact they make my blood boil when I am forced to. Why?

So Andy was talking about the way repositories work and the reasons why people don’t use them. He had already talked about the language problem. We always talk about ‘putting things in the repository’ rather than ‘making them available on the web’. He had also mentioned that the institutional nature of repositories does not map well onto the social networks of the academic users, which probably bear little relationship to institutions and are much more closely aligned to discipline and possibly geographic boundaries (although they can easily be global).

But for me the key moment was when Andy asked ‘How many of you have used SlideShare?’ Half the people in the room put their hands up. Most of the speakers during the day pointed to copies of their slides on SlideShare. My response was to mutter under my breath ‘And how many of them have put presentations in the institutional repository?’ The answer to this: probably none. SlideShare is a much better ‘repository’ for slide presentations than IRs. There are more presentations there, people may find mine, and it is (probably) Google indexed. But more importantly I can put slides up with one click, it already knows who I am, and I don’t need to put in reams of metadata, just a few tags. And on top of this it provides added functionality including embedding in other web documents as well as all the social functions that are a natural part of a ‘Web2.0’ site.

SlideShare is a very good model of what a Repository can be. It has issues. It is a third party product, it may not have long term stability, it may not be as secure as some people would like. But it provides much more of the functionality that I want from a service for making my presentations available on the web. It does not serve the purpose of an archive – and maybe an institutional repository is better in that role. But for the author, the reason for making things available is so that people use them. If I make a video that relates to my research it will go on YouTube, Bioscreencast, or JoVE, not in the institutional repository. I put research related photos on Flickr, not in the institutional repository. And critically, I leave my research papers on the websites of the journals that published them, and cannot be bothered with the work required to put them in the institutional repository.

Andy was arguing for global discipline specific repositories. I would suggest that the lesson of the Web2.0 sites is that we should have data type specific repositories. Flickr is for pictures, SlideShare for presentations. In each case the specialisation enables a sort of implicit metadata and lets the site concentrate on providing functionality that adds value to that particular data type. Science repositories could win by doing the same. PDB, GenBank, SwissProt deal with specific types of data. Some might argue that GenBank is breaking under the strain of the different types and quantities of data generated by the new high throughput sequencing tools. Perhaps a new repository is required that is specially designed for this data.

So what is the role for the institutional repository? The preservation of data is one aspect. Pulling down copies of everything to provide an extra backup and retain an institutional record. If not copying then indexing and aggregating so as to provide a clear guide to the institution’s outputs. This needn’t be handled in house of course and can be outsourced. As Paul Miller suggested over lunch, the role of the institution need not be to keep a record of everything, but to make sure that such a record is kept. Curation may be another, although that may be too big a job to be tackled at institutional level. When is a decision made that something isn’t worth keeping anymore? What level of metadata or detail is worth preserving?

But the key thing is that all of this should be done automatically and must not require intervention by the author. Nothing drives me up the wall more than having to put the same set of data into two subtly different systems more than once. And as far as I can see there is no need to do so. Aggregate my content automatically, wrap it up and put it in the repository, but I don’t want to have to deal with it. Even in the case of peer reviewed papers it ought to be feasible to pull down the vast majority of the metadata required. Indeed, even for toll access publishers, everything except the appropriate version of the paper. Send me a polite automated email and ask me to attach that and reply. Job done.

For this to really work we need to take an extra step in the tools available. We need to move beyond files that are simply ‘born digital’ because these files are in many ways stillborn. This current blog post, written in Word on the train, is a good example. The laptop doesn’t really know who I am, it probably doesn’t know where I am, and it has no context for the particular Word document I’m working on. When I plug this into the Wordpress interface at OpenWetWare all of this changes. The system knows who I am (and could do that through OpenID). It knows what I am doing (writing a blog post) and the Zemanta Firefox plug in does much better than that, suggesting tags, links, pictures and keywords.

Plugins and online authoring tools really have the potential to automatically generate those last pieces of metadata that aren’t already there. When the semantics comes baked in then the semantic web will fly and the metadata that everyone knows they want, but can’t be bothered putting in, will be available and re-useable, along with the content. When documents are not only born digital but born on and for the web then the repositories will probably still need to trawl and aggregate. But they won’t have to worry me about it. And then I will be a happy depositor.

