An openwetware blog on the challenges of open and connected science

Some slides for granting permissions (or not) in presentations

A couple of weeks ago there was a significant fracas over Daniel MacArthur’s tweeting from a Cold Spring Harbor Laboratory meeting. This was followed in pretty quick succession by an article in Nature discussing the problems that could be caused when the details of presentations no longer stop at the walls of the conference room. All of this led to a discussion (see also friendfeed discussions) about how to make it clear whether or not you are happy with your presentation being photographed, videoed, or live blogged. A couple of suggestions were made for logos or icons that might be used.

I thought that, rather than a single logo, it might be helpful to have a panel that allows the presenter to permit some activities but not others, so I put together a couple of mockups.

[Image: Permission to do whatever with presentation]
[Image: Permission to do less with presentation]

I’ve also uploaded a PowerPoint file with the two of these as slides to Slideshare which should enable you to download, modify, and extract the images as you wish. In both cases they are listed as having CC-BY licences but feel free to use them without any attribution to me.

In some of the Friendfeed conversations there are some good comments about how best to represent these permissions, along with suggestions on possible improvements. In particular Anders Norgaard suggests a slightly more friendly “please don’t” rather than my “do not”. Entirely up to you, but I just wanted to get these out. At the moment these are really just to prompt discussion but if you find them useful then please re-post modified versions for others to use.

[Ed. The social media icons are from Chris Ross and are by default under a GPL license. I have a request in to make them available to the Public Domain or as CC-BY at least for re-use. And yes I should have picked this up before.]

Conferences as Spam? Liveblogging science hits the mainstream

I am probably supposed to be writing up some weighty blog post on some issue of importance but this is much more fun. Last year’s International Conference on Intelligent Systems for Molecular Biology (ISMB) kicked off one of the first major live blogging exercises in a mainstream biology conference. It was so successful that the main instigators were invited to write up the exercise and the conference in a paper in PLoS Comp Biol. This year, the conference organizers, with significant work from Michael Kuhn and many others, have set up a Friendfeed room and publicised this from the off, with the idea of supporting a more “official”, or at least coordinated process of disseminating the conference to the wider world. Many have been waiting in anticipation for the live blogging to start due to logistical or financial difficulties in attending in person.

However, there were also concerns. Many of the original ring leaders were not attending. With the usual suspects confined to their home computers would the general populace take up the challenge and provide the rich feed of information the world was craving? Things started well, then moved on rapidly as the room filled up. But the question as to whether it was sustainable was answered pretty effectively when the Friendfeed room went suddenly quiet. Fear gripped the microbloggers. Could the conference go on? Gradually the technorati figured out they could still post by VPNing to somewhere else. Friendfeed was blocking the IP corresponding to the conference wireless network. So much traffic was being generated it looked like spam! This has now been corrected, and normal service resumed, but in a funny and disturbing kind of way it seems to me like a watershed. There were enough people, and certainly not just the usual suspects, live blogging a scientific conference that the traffic looked like spam. Ladies and Gentlemen. Welcome to the mainstream.

Talking to the next generation - NESTA Crucible Workshop

Yesterday I was privileged to be invited to give a talk at the NESTA Crucible Workshop being held in Lancaster. You can find the slides on slideshare. NESTA, the National Endowment for Science, Technology, and the Arts, is an interesting organization funded via a UK government endowment to support innovation and enterprise and more particularly the generation of a more innovative and entrepreneurial culture in the UK. Among the programmes it runs in pursuit of this is the Crucible programme, where a small group of young researchers, generally looking for or just in their first permanent or independent positions, attend a series of workshops to get them thinking broadly about the role of their research in the wider world and to help them build new networks for support and collaboration.

My job was to talk about “Science in Society” or “Open Science”. My main theme was the question of how we justify taxpayer expenditure on research; that to me this implies an obligation to maximise the efficiency of how we do our research. Research is worth doing but we need to think hard about how and what we do. Not surprisingly I focussed on the potential of using web based tools and open approaches to make things happen cheaper, quicker, and more effectively. To reduce waste and try to maximise the amount of research output for the money spent.

Also not surprisingly there was significant pushback - much of it where you would expect. Concerns over data theft, over how “non-traditional” contributions might appear (or not) on a CV, and over the costs in time were all mentioned. However what surprised me most was the pushback against the idea of putting material on the open web versus traditional journal formats. There was a real sense that the group had a respect for the authority of the printed, versus online, word which really caught me out. I often use a gotcha moment in talks to try and illustrate how our knowledge framework is changed by the web. It goes “how many people have opened a physical book for information in the last five years?”. Followed by “and how many haven’t used Google in the last 24 hours”. This is shamelessly stolen from Jamie Boyle incidentally.

Usually you get three or four sheepish hands going up admitting a personal love of real physical books. Generally it is around 5-10% of the audience, and this has been pretty consistent amongst mid-career scientists in both academia and industry, and people in publishing. In this audience about 75% put their hands up.  Some of these were specialist “tool” books, mathematical forms, algorithmic recipes, many of them were specialist texts and many referred to the use of undergraduate textbooks. Interestingly they also brought up an issue that I’ve never had an audience bring up before; that of how do you find a good route into a new subject area that you know little about, but that you can trust?

My suspicion is that this difference comes from three places. Firstly, these researchers were already biased towards being less discipline bound by the fact that they’d applied for the workshop. They were therefore more likely to be discipline hoppers, jumping into new fields where they had little experience and needed a route in. Secondly, they were at a stage of their career where they were starting to teach, again possibly slightly outside their core expertise, and were therefore looking for good, reliable material to base their teaching on. Finally though there was a strong sense of respect for the authority of the printed word. The printing of the German Wikipedia was brought up as evidence that printed matter was, or at least was perceived to be, more trustworthy. Writing this now I am reminded of the recent discussion on the hold that the PDF has over the imagination of researchers. There is a real sense that print remains authoritative in a way that online material is not. Even though the journal may never be printed the PDF provides the impression that it could or should be. I would guess also that the group were young enough to be slightly less cynical about authority in general.

Food for thought, but it was certainly a lively discussion. We actually had to be dragged off to lunch because it went way over time (and not I hope just because I had too many slides!). Thanks to all involved in the workshop for such an interesting discussion and thanks also to the twitter people who replied to my request for 140 character messages. They made a great way of structuring the talk.

Pub-sub/syndication patterns and post publication peer review

I think it is fair to say that even those of us most enamored of post-publication peer review would agree that its effectiveness remains to be demonstrated in a convincing fashion. Broadly speaking there are two reasons for this; the first is the problem of social norms for commenting. As in there aren’t any. I think it was Michael Nielsen who referred to the “Kabuki Dance of scientific discourse”. It is entirely allowed to stab another member of the research community in the back, or indeed the front, but there are specific ways and forums in which it is acceptable to do so. No-one quite knows what the appropriate rules are for commenting on online fora, as best described most recently by Steve Koch.

My feeling is that this is a problem that will gradually go away as we evolve norms of behaviour in specific research communities. The current “rules” took decades to build up. It should not be surprising if it takes a few years or more to sort out an adapted set for online interactions. The bigger problem is the one that is usually surfaced as “I don’t have any time for this kind of thing”. This in turn can be translated as, “I don’t get any reward for this”. Whether that reward is a token for putting on your CV, actual cash, useful information coming back to you, or just the warm feeling that someone else found your comments useful, rewards are important for motivating people (and researchers).

One of the things that links these two together is a sense of loss of control over the comment. Commenting on journal web-sites is just that, commenting on the journal’s website. The comment author has “given up” their piece of value, which is often not even citeable, but also lost control over what happens to their piece of content. If you change your mind, even if the site allows you to delete it, you have no way of checking whether it is still in the system somewhere.

In a sense, when the Web 2.0 world was built it got things nearly precisely wrong for personal content. For me Jon Udell has written most clearly about this when he talks about the publish-subscribe pattern for successful frameworks. In essence I publish my content and you choose to subscribe to it. This works well for me, the blogger, at this site, but it is not so great for the commenter who has to leave their comment to my tender mercies on my site. It would be better if the commenter could publish their comment and I could syndicate it back to my blog. As things stand, commenting on other people’s sites creates all sorts of problems; it is challenging for you to aggregate your own comments together and you have to rely on the functionality of specific sites to help you follow responses to your comments. Jon wrote about this better than I can in his blog post.

So a big part of the problem could be solved if people streamed their own content. This isn’t going to happen quickly in the general sense of everyone having a web server of their own - it still remains too difficult for even moderately skilled people to be bothered doing this. Services will no doubt appear in the future but current broadcast services like twitter offer a partial solution (it’s “my” twitter account, I can at least pretend to myself that I can delete all of it). The idea of using something like the twitter service at microrevie.ws as suggested by Daniel Mietchen this week can go a long way towards solving the problem. This takes a structured tweet of the form @hreview {Object};{your review} followed optionally by a number of asterisks for a star rating. This doesn’t work brilliantly for papers because of problems with the length of references for the paper, even with shortened DOIs, the need for sometimes lengthy reviews, and the shortness of tweets. Additionally the twitter account is not automatically associated with a unique research contributor ID. However the principle of the author of the review controlling their own content, while at the same time making links between themselves and that content in a linked open data kind of way, is extremely powerful.
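To make the structured tweet format concrete, here is a minimal Python sketch of how such a message might be parsed. The exact grammar accepted by microrevie.ws is not given here, so the regular expression, the `Review` class, and the `parse_review` function are all assumptions made purely for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed grammar: "@hreview {Object};{your review}" with optional trailing
# asterisks as a star rating. The real service may accept something different.
PATTERN = re.compile(
    r"^@hreview\s+(?P<obj>[^;]+);(?P<text>.*?)\s*(?P<stars>\*+)?$",
    re.DOTALL,
)


@dataclass
class Review:
    """Hypothetical container for one parsed micro-review."""
    obj: str               # the thing being reviewed, e.g. a shortened DOI
    text: str              # the free-text review
    stars: Optional[int]   # number of trailing asterisks, or None if absent


def parse_review(tweet: str) -> Optional[Review]:
    """Return a Review if the tweet follows the assumed format, else None."""
    match = PATTERN.match(tweet.strip())
    if match is None:
        return None
    stars = match.group("stars")
    return Review(
        obj=match.group("obj").strip(),
        text=match.group("text").strip(),
        stars=len(stars) if stars else None,
    )
```

Because the reviewer's own account is the source of the message, anything that parses the stream in this way only ever holds a copy; the original stays under the author's control.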

Imagine a world in which your email outbox or local document store is also a webserver (via any one of an emerging set of tools like Wave, DropBox, or Opera Unite). You can choose who to share your review with and change that over time. If you choose to make it public the journal, or the authors, can give you some form of credit. It is interesting to think that author-side charges could perhaps be reduced for valuable reviews. This wouldn’t work in a naive way, with $10 per review, because people would churn out large amounts of rubbish reviews, but if those reviews are out on the linked data web then their impact can be measured by their page rank and the authors rewarded accordingly.

Rewards and control linked together might provide a way of solving the problem - or at least of solving it faster than we are at the moment.

Why the Digital Britain report is a missed opportunity

A few days ago the UK Government report on the future of Britain’s digital infrastructure, co-ordinated by Lord Carter, was released. I haven’t had time to read the whole report, I haven’t even really had time to skim it completely. But two things really leapt out at me.

On page four:

“If, as expected, the volume of digital content will increase 10x to 100x over the next 3 to 5 years then we are on the verge of a big bang in the communications industry that will provide the UK with enormous economic and industrial opportunities”

And on page 18:

“Already today around 7.5% of total UK music album purchases are digital and a smaller but rapidly increasing percentage of film and television consumption is streamed online or downloaded…User-generated and social content will be very significant but should not be the main or only content” - this brought to my attention by Brian Kelly.

The first extract is, to me, symptomatic of a serious, even catastrophic, lack of ambition and understanding of how the web is changing. If the UK’s digital content only increases by 10-100 fold over the next three years then we will be living in a country lagging behind those that will be experiencing huge economic benefits from getting the web right for their citizens.

But that is just a lack of understanding at core. The Government’s lack of appreciation for how fast this content is growing isn’t really an issue because the Government isn’t an effective content producer online. It would be great if it were, pushing out data, making things happen but they will probably catch up one day, when forced to by events. What is disturbing to me is that second passage. “User generated and social content should not be the main or only content”? It probably already is the main content on the open web, at least by volume, and the volume and traffic rates of user generated content are rising exponentially. But putting that aside, the report appears to be saying that basically the content generated by British citizens, is not, and will not be “good enough”; that it has no real value. Lord Carter hasn’t just said that he doesn’t believe that enough useful content could be produced by “non-professionals”, but that it shouldn’t be produced.

The Digital Britain Unconferences were a brilliant demonstration of how the web can enable democracy by bringing interested people together to debate and respond to specific issues. Rapid, high quality, and grass roots, they showed the future of how governments could actually interact effectively with their citizens. The potential economic benefits from the web are not in broadcast, are not in professional production, but are in many to many communication and sharing. Selling a few more videos will not get us out of this recession. Letting millions of people add a small amount of value, or have more efficient interactions, could. This report fails to reflect that opportunity. It is a failure of understanding and a failure of imagination. The only saving grace is that, aside from the need for physical infrastructure, the Government is becoming increasingly irrelevant to the debate anyway. The world will move on, and the web will enable it, faster or slower than we expect, and in ways that will be surprising. It will just go that much slower in the UK.

Google Wave in Research - Part II - The Lab Record

In the previous post  I discussed a workflow using Wave to author and publish a paper. In this post I want to look at the possibility of using it as a laboratory record, or more specifically as a human interface to the laboratory record. There has been much work in recent years on research portals and Virtual Research Environments. While this work will remain useful in defining use patterns and interface design my feeling is that Wave will become the environment of choice, a framework for a virtual research environment that rewrites the rules, not so much of what is possible, but of what is easy.

Again I will work through a use case but I want to skip over a lot of what is by now I think becoming fairly obvious. Wave provides an excellent collaborative authoring environment. We can explicitly state and register licenses using a robot. The authoring environment has all the functionality of a wiki already built in so we can assume that and granular access control means that different elements of a record can be made accessible to different groups of people. We can easily generate feeds from a single wave and aggregate content in from other feeds. The modular nature of the Wave, made up of Wavelets, themselves made up of Blips, may well make it easier to generate comprehensible RSS feeds from a wiki-like environment. Something which has up until now proven challenging. I will also assume that, as seems likely, both spreadsheet and graphing capabilities are soon available as embedded objects within a Wave.

Let us imagine an experiment of the type that I do reasonably regularly, where we use a large facility instrument to study the structure of a protein in solution. We set up the experiment by including the instrument as a participant in the wave. This participant is a Robot which fronts a web service that can speak to the data repository for the instrument. It drops into the Wave a formatted table which provides options and spaces for inputs based on a previously defined structured description of the experiment. In this case it calls for a role for this particular run (is it a background or an experimental sample?) and asks where the description of the sample is.

The purification of the protein has already been described in another wave. As part of this process a wavelet was created that represents the specific sample we are going to use. This sample can be directly referenced via a URL that points at the wavelet itself making the sample a full member of the semantic web of objects. While the free text of the purification was being typed in another Robot, this one representing a web service interface to appropriate ontologies, automatically suggested using specific terms adding links back to the ontology where suggestions were accepted, and creating the wavelets that describe specific samples.

The wavelet that defines the sample is dragged and dropped into the table for the experiment. This copying process is captured by the internal versioning system and creates in effect an embedded link back to the purification wave, linking the sample to the process that it is being used in. It is rather too much at this stage to expect the instrument control to be driven from the Wave itself but the Robot will sit and wait for the appropriate dataset to be generated and check with the user it has got the right one.

Once everyone is happy the Robot will populate the table with additional metadata captured as part of the instrumental process, create a new wavelet (creating a new addressable object) and drop in the data in the default format. The robot naturally also writes a description of the relationships between all the objects in an appropriate machine readable form (RDFa, XML, or all of the above) in a part of the Wave that the user doesn’t necessarily need to see. It may also populate any other databases or repositories as appropriate. Because the Robot knows who the user is it can also automatically link the experimental record back to the proposal for the experiment. Valuable information for the facility but not of sufficient interest to the user for them to be bothered keeping a good record of it.

The raw data is not yet in a useful form though; we need to process it, remove backgrounds, that kind of thing. To do this we add the Reduction Robot as a participant. This Robot looks within the wave for a wavelet containing raw data, asks the user for any necessary information (where is the background data to be subtracted?) and then runs a web service that does the subtraction. It then writes out two new wavelets, one describing what it has done (with all appropriate links to the appropriate controlled vocab obviously), and a second with the processed data in it.
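The shape of that reduction step can be sketched in a few lines of Python. This is only an illustration of the pattern (take raw and background data, subtract, emit a description plus a result); it does not use any real Wave API, and the `Wavelet` class, the `reduce_data` function, and the field names in the description are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Wavelet:
    """Hypothetical stand-in for a wavelet: a typed, addressable chunk of content."""
    kind: str
    content: Any


def reduce_data(raw: Wavelet, background: Wavelet) -> List[Wavelet]:
    """Subtract the background point by point and return two new wavelets:
    one describing the operation and one holding the processed data."""
    if len(raw.content) != len(background.content):
        raise ValueError("raw and background data must have the same number of points")
    processed = [r - b for r, b in zip(raw.content, background.content)]
    description = {
        "operation": "background subtraction",
        "inputs": [raw.kind, background.kind],
        "points": len(processed),
    }
    return [Wavelet("description", description), Wavelet("processed-data", processed)]
```

Keeping the description and the processed data as separate objects mirrors the idea that each is independently addressable and can be linked to on its own.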

I need to do some more analysis on this data, perhaps fit a model to start with, so again I add another Robot that looks for a wavelet with the correct data type, does the line fit, once again writes out a wavelet that describes what it has done, and a wavelet with the result in it. I might do this several times, using a range of different analysis approaches, perhaps doing some structural modelling and deriving some parameter from the structure which I can compare to my analytical model fit. Creating a wavelet with a spreadsheet embedded I drag and drop the parameter from the model fit and from the structure and format the cells so that it shows green if they are within 5% of each other.
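The spreadsheet rule at the end of that paragraph amounts to a one-line comparison. A minimal Python sketch follows; the text does not say how "within 5%" is measured or what colour is shown otherwise, so the choice of the larger magnitude as the reference and the "red" fallback are both assumptions, as is the function name.

```python
def agreement_colour(model_value: float, structure_value: float,
                     tolerance: float = 0.05) -> str:
    """Return "green" when the two parameters agree to within `tolerance`
    (relative to the larger magnitude), otherwise "red" (assumed fallback)."""
    scale = max(abs(model_value), abs(structure_value))
    if scale == 0:
        return "green"  # both values are zero, so they agree exactly
    relative_difference = abs(model_value - structure_value) / scale
    return "green" if relative_difference <= tolerance else "red"
```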

Ok, so far so cool. Lots of drag and drop and using of groovy web services but nothing that couldn’t be done with a bit of work with a Workflow engine like Taverna and a properly set up spreadsheet. I now make a copy of the wave (properly versioned, its history is clear as a branch off the original Wave) and I delete the sample from the top of the table. The Robots re-process and realize there is not sufficient data to do any processing so all the data wavelets and any graphs and visualizations, including my colour-coded spreadsheet  go blank. What have I done here? What I have just created is a versioned, provenanced, and shareable workflow. I can pass the Wave to a new student or collaborator simply by adding them as a participant. I can then work with them, watching as they add data, point out any mistakes they might make and discuss the results with them, even if they are on the opposite side of the globe. Most importantly I can be reasonably confident that it will work for them, they have no need to download software or configure anything. All that really remains to make this truly powerful is to wrap this workflow into a new Robot so that we can pass multiple datasets to it for processing.

When we’ve finished the experiment we can aggregate the data by dragging and dropping the final results into a new wave to create a summary we can share with a different group of people. We can tweak the figure that shows the data until we are happy and then drop it into the paper I talked about in the previous post. I’ve spent a lot of time over the past 18 months thinking and talking about how we capture what is going on and at the same time create granular web-native objects and then link them together to describe relationships between them. Wave does all of that natively and it can do it just by capturing what the user does. The real power will lie in the web services behind the robots but the beauty of these is that the robots will make using those existing web services much easier for the average user. The robots will observe and annotate what the user is doing, helping them to format and link up their data and samples.

Wave brings three key things: proper collaborative documents which will encourage referring rather than cutting and pasting; proper version control for documents; and document automation through easy access to webservices. Commenting, version control and provenance, and making a cut and paste operation actually a fully functional and intelligent embed are key to building a framework for a web-native lab notebook. Wave delivers on these. The real power comes with the functionality added by Robots and Gadgets that can be relatively easily configured to do analysis. The ideas above are just what I have thought of in the last week or so. Imagination really is the limit I suspect.

Google Wave in Research - the slightly more sober view - Part I - Papers

I, and many others, have spent the last week thinking about Wave and I have to say that I am getting more, rather than less, excited about the possibilities that this represents. All of the below will have to remain speculation for the moment but I wanted to walk through two use cases and identify how the concept of a collaborative automated document will have an impact. In this post I will start with the drafting and publication of a paper because it is an easier step to think about. In the next post I will move on to the use of Wave as a laboratory recording tool.

Drafting and publishing a paper via Wave

I start drafting the text of a new paper. As I do this I add the Creative Commons robot as a participant. The robot will ask what license I wish to use and then provide a stamp, linked back to the license terms. When a new participant adds text or material to the document, they will be asked whether they are happy with the license, and their agreement will be registered within a private blip within the Wave controlled by the Robot (probably called CC-bly, pronounced see-see-bly). The robot may also register the document with a central repository of open content. A second robot could notify the authors’ respective institutional repositories, creating a negative click repository in, well, one click. More seriously this would allow the IR to track, and if appropriate modify, the document as well as harvest its content and metadata automatically.

I invite a series of authors to contribute to the paper and we start to write. Naturally the inline commenting and collaborative authoring tools get a good workout and it is possible to watch the evolution of specific sections with the playback tool. The authors are geographically distributed but we can organize scheduled hacking sessions with inline chat to work on sections of the paper. As we start to add references the Reference Formatter gets added (not sure whether this is a Robot or a Gadget, but it is almost certainly called “Reffy”). The formatter automatically recognizes text of the form (Smythe and Hoofback 1876) and searches the Citeulike libraries of the authors for the appropriate reference, adds an inline citation, and places a formatted reference in a separate Wavelet to keep it protected from random edits. Chunks of text can be collected from reports or theses in other Waves and the tracking system notes where they have come from, maintaining the history of the whole document and its sources and checking licenses for compatibility. Terminology checkers, based on the existing Spelly extension (although at the moment this works on the internal not the external API - Google say they are working to fix that), can be run over the document to check for incorrect or ambiguous use of terms, or to identify gene names, structures etc. and automatically format them and link them to the reference database.
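The recognition step that “Reffy” would perform can be illustrated with a short Python sketch. Nothing here touches Wave or Citeulike: the `library` dictionary is a stand-in for an author’s reference library, the citation pattern covers only the simple “(Name 1876)” and “(Name and Name 1876)” forms, and the function name is hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Assumed citation form: "(Surname 1876)" or "(Surname and Surname 1876)".
CITATION = re.compile(
    r"\((?P<authors>[A-Z][\w'-]+(?: and [A-Z][\w'-]+)?) (?P<year>\d{4})\)"
)


def format_references(text: str,
                      library: Dict[Tuple[str, str], str]) -> Tuple[str, List[str]]:
    """Replace recognised citations with numbered markers and return the
    rewritten text together with the ordered list of formatted references."""
    references: List[str] = []

    def replace(match: "re.Match[str]") -> str:
        key = (match.group("authors"), match.group("year"))
        entry = library.get(key)
        if entry is None:
            return match.group(0)  # unknown citation: leave the text untouched
        if entry not in references:
            references.append(entry)
        return f"[{references.index(entry) + 1}]"

    return CITATION.sub(replace, text), references
```

Returning the reference list separately from the text reflects the idea of holding formatted references in their own Wavelet, away from random edits to the body.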

It is time to add some data and charts to the paper. The actual source data are held in an online spreadsheet. A chart/graphing widget is added to the document and formats the data into a default graph which the user can then modify as they wish. The link back to the live data is of course maintained. Ideally this will trigger the CC-bly robot to query the user as to whether they wish to dedicate the data to the Public Domain (therefore satisfying both the Science Commons Data protocol and the Open Knowledge Definition - see how smoothly I got that in?). When the user says yes (being a right thinking person) the data is marked with the chosen waiver/dedication and CKAN is notified and a record created of the new dataset.

The paper is cleaned up - informal comments can be easily obtained by adding colleagues to the Wave. Submission is as simple as adding a new participant, the journal robot (PLoSsy obviously) to the Wave. The journal is running its own Wave server so referees can be given anonymous accounts on that system if they choose. Review can happen directly within the document with a conversation between authors, reviewers, and editors. You don’t need to wait for some system to aggregate a set of comments and send them in one hit and you can deal with issues directly in conversation with the people who raise them. In addition the contribution of editors and referees to the final document is explicitly tracked. Because the journal runs its own server, not only can the referees and editors have private conversations that the authors don’t see, those conversations need never leave the journal server and are as secure as they can reasonably be expected to be.

Once accepted the paper is published simply by adding a new participant. What would traditionally happen at this point is that a completely new typeset version would be created, breaking the link with everything that has gone before. This could be done by creating a new Wave with just the finalized version visible and all comments stripped out. What would be far more exciting would be for a formatted version to be created which retained the entire history. A major objection to publishing referees’ comments is that they refer to the unpublished version. Here the reader can see the comments in context and come to their own conclusions. Before publishing any inline data will need to be harvested and placed in a reliable repository along with any other additional information. Supplementary information can simply be hidden under “folds” within the document rather than buried in separate documents.

The published document is then a living thing. The canonical “as published” version is clearly marked but the functionality for comments or updates or complete revisions is built in. The modular XML nature of the Wave means that there is a natural means of citing a specific portion of the document. In the future citations to a specific point in a paper could be marked, again via a widget or robot, to provide a back link to the citing source. Harvesters can traverse this graph of links in both directions easily wiring up the published data graph.

Based on the currently published information none of the above is even particularly difficult to implement. Much of it will require some careful study of how the work flows operate in practice and there will likely be issues of collisions and complications but most of the above is simply based on the functionality demonstrated at the Wave launch. The real challenge will lie in integration with existing publishing and document management systems and with the subtle social implications of changing the way that authors, referees, editors, and readers interact with the document. Should readers be allowed to comment directly in the Wave or should that be in a separate Wavelet? Will referees want to be anonymous and will authors be happy to see the history made public?

Much will depend on how reliable and how responsive the technology really is, as well as how easy it is to build the functionality described above. But the bottom line is that this is the result of about four days’ occasional idle thinking about what can be done. When we really start building and realizing what we can do, that is when the revolution will start.

Part II is here.

What would you say to Elsevier?

In a week or so’s time I have been invited to speak as part of a forward planning exercise at Elsevier. To some this may seem like an opportunity to go in for an all guns blazing OA rant or perhaps to plant some incendiary device but I see it more as an opportunity to nudge, perhaps cajole, a big player in the area of scholarly publishing in the right direction. After all, if we are right about the efficiency gains for authors and readers that will be created by Open Access publication and we are right about the way that web based systems utterly change the rules of scholarly communication then even an organization of the size of Elsevier has to adapt or wither away. Persuading them to move in the right direction because it is in their own interests would be an effective way of speeding up the process of positive change.

My plan is to focus less on the arguments for making more research output Open Access and more on what happens as a greater proportion of those outputs become freely available, something that I see as increasingly inevitable. Where that proportion may finally be is anyone’s guess but it is going to be a much bigger proportion than it is now. What will authors and funders want and need from their publication infrastructure and what are the business opportunities that arise from those needs? For me these fall into four main themes:

  • Tracking via aggregation. Funders and institutions want more and more to track the outputs of their research investment. Providing tools and functionality that will enable them to automatically aggregate and slice and dice these outputs is a big business opportunity. The data themselves will be free but providing them in the form that people need, rapidly and effectively, will add value that they will be prepared to pay for.
  • Speed to publish as a market differentiator. Authors will want their content out and available and being acted on fast. Speed to publication is potentially the biggest remaining area for competition between journals. This is important because there will almost certainly be fewer journals with greater “quality” or “brand” differentiation. There is a plausible future in which there are only two journals, Nature and PLoS ONE.
  • Data publication, serving, and archival. There may be fewer journals but there will be much greater diversity of materials being published through a larger number of mechanisms. There are massive opportunities in providing high quality infrastructure and services to funders and institutions to aggregate, publish, and archive the full set of research outputs. I intend to draw heavily on Dorothea Salo’s wonderful slideset on data publication for this part.
  • Social search. Literature searching is the main area where there are plausible efficiency gains to be made in the current scholarly publications cycle. According to the Research Information Network’s model of costs, search accounts for a very significant proportion of the non-research costs of publishing. Building the personal networks (Bill Hooker’s Distributed Wetware Online Information Filter [down in the comments] or DWOIF) that make this feasible may well be the new research skill of the 21st century. Tools that make this work effectively are going to be very popular. What will they look like?

But what have I missed? What (constructive!) ideas and thoughts would you want to place in the minds of the people thinking about where to take one of the world’s largest scholarly publication companies and its online information and collaboration infrastructure?

Full disclosure: Part of the reason for writing this post is to disclose publicly that I am doing this gig. Elsevier are covering my travel and accommodation costs but are not paying any fee.

OMG! This changes EVERYTHING! - or - Yet Another Wave of Adulation

Yes, I’m afraid it’s yet another over the top response to yesterday’s big announcement of Google Wave, the latest paradigm shifting gob-smackingly brilliant piece of technology (or PR depending on your viewpoint) out of Google. My interest, however, is pretty specific: how can we leverage it to help us capture, communicate, and publish research? And my opinion is that this is absolutely game changing - it makes a whole series of problems simply go away, and potentially provides a route to solving many of the problems that I was struggling to see how to manage.

Firstly, let’s look at the grab bag of generic issues that I’ve been thinking about. Most recently I wrote about how I thought that “real time” wasn’t the big deal; what mattered was giving users back control over the timeframe in which streams come in to them. I had some vague ideas about how this might look but Wave has working code. When the people who you are in conversation with are online and looking at the same wave they will see modifications in real time. If they are not in the same document they will see the comments or changes later, but can also “re-play” changes. A lot of thought has clearly gone into the default views based on when and how a person first comes into contact with a document.

Another issue that has frustrated me is the divide between wikis and blogs. Wikis generally have better editing functionality, but blogs have workable RSS feeds; wikis have more plugins, but blogs map better onto the diary style of a lab notebook. None of these were ever fundamental philosophical differences but just historical differences of implementations and developer priorities. Wave makes most of these differences irrelevant by creating a collaborative document framework that easily incorporates much of the best of all of these tools within a high quality rich text and media authoring platform. Bringing in content looks relatively easy and pushing content out in different forms also seems to be pretty straightforward. Streams, feeds, and other outputs, if not native, look to be easily generated either directly or by passing material to other services. The Waves themselves are XML which should enable straightforward parsing and tweaking with existing tools as well.

One thing I haven’t written much about but have been thinking about is the process of converting lab records into reports and on into papers. While there wasn’t much on display about complex documents, a lot of simply nice functionality (drag and drop links, options for incorporating and embedding content) was at least touched on. Looking a little closer into the documentation there seems to be quite a strong provenance model, built on a code repository style framework for handling document versioning and forking. All good steps in the right direction, and with the open APIs and multitouch as standard on the horizon there will no doubt be excellent visual organization and authoring tools along very soon now. For those worried about security and control, a 30 second mention in the keynote basically made it clear that they have it sorted. Private messages (documents? mecuments?) need never leave your local server.

Finally, the big issue for me has for some time been bridging the gap between unstructured capture of streams of events and making it easy to convert those to structured descriptions of the interpretation of experiments. The audience was clearly wowed by the demonstration of inline real time contextual spell checking and translation. My first thought was - I want to see that real-time engine attached to an ontology browser or DBpedia and automatically generating links back to the URIs for concepts and objects. What really struck me most was the use of Waves with a few additional tools to provide authoring tools that help us to build the semantic web, the web of data, and the web of things.
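As a rough illustration of what that kind of automatic linking might look like, here is a small Python sketch. It is not based on any Wave robot API: the `VOCABULARY` table and the `link_concepts` function are hypothetical, and a real tool would look terms up in an ontology service rather than a hand-made dictionary.

```python
import re

# Hypothetical hand-made vocabulary mapping terms to concept URIs.
# A real tool would query an ontology browser or DBpedia instead.
VOCABULARY = {
    "lysozyme": "http://dbpedia.org/resource/Lysozyme",
    "pcr": "http://dbpedia.org/resource/Polymerase_chain_reaction",
}


def link_concepts(text, vocabulary=VOCABULARY):
    """Return (term, uri, start, end) for each known term found in text."""
    if not vocabulary:
        return []
    # Longest terms first so that multi-word terms win over their prefixes.
    terms = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.group(0), vocabulary[m.group(0).lower()], m.start(), m.end())
        for m in pattern.finditer(text)
    ]


note = "Ran PCR on the lysozyme construct."
for term, uri, start, end in link_concepts(note):
    print(f"{term!r} at {start}-{end} -> {uri}")
```

The character offsets are returned alongside the URIs so that an editor could attach the link to the exact span of text as it is typed, leaving the author free to accept or reject each suggestion.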

For me, the central challenges for a laboratory recording system are capturing objects, whether digital or physical, as they are created, and then serving those back to the user, as they need them, to describe the connections between them. As we connect up these objects we will create the semantic web. As we build structured knowledge against those records we will build a machine-parseable record of what happened that will help us to plan for the future. As I understand it each wave, and indeed each part of a wave, can be a URL endpoint; an object on the semantic web. If they aren’t already, it will be easy to make them so. As much as anything it is the web native collaborative authoring tool, which will make embedding and pass by reference the default approach rather than cut and paste, that will make the difference. Google don’t necessarily do semantic web but they do links and they do embedding, and they’ve provided a framework that should make it easy to add meaning to the links. Google just blew the door off the ELN market, and they probably didn’t even notice.

Those of us interested in web-based and electronic recording and communication of science have spent a lot of the last few years trying to describe how we need to glue the existing tools together: mailing lists, wikis, blogs, documents, databases, papers. The framework was never right so a lot of attention was focused on moving things backwards and forwards, on how to connect one thing to another. That problem, as far as I can see, has now ceased to exist. The challenge now is in building the right plugins and making sure the architecture is compatible with existing tools. But fundamentally the framework seems to be there. It seems like it’s time to build.

A more sober reflection will probably follow in a few days ;-)

It’s not easy being clear…

There has been some debate going backwards and forwards over the past few weeks about licensing, people’s expectations, and the extent to which researchers can be expected to understand, or want to understand, the details of legal terms, licensing and other technical minutiae. It is reasonable for scientific researchers not to wish to get into the details. One of the real successes of Creative Commons has been to provide a relatively small set of reasonably clear terms that enable people to express their wishes about what people can do with their work. But even here there is the potential for significant confusion as demonstrated by the work that CC is doing on the perception of what “non commercial” means.

The end result of this is two-fold. Firstly, people are genuinely confused about what to do and as a result they give up. Secondly, in giving up there is often an unspoken assumption that “people will understand what I want/mean”. Two examples yesterday illustrated exactly how misguided this can be and showed the importance of thinking about, and being clear about, what you want people to do with your content and information.

The first was pointed out by Paulo Nuin who linked to a post on The Matrix Cookbook, a blog and PDF containing much useful information on matrix transforms. The post complained that Amazon were selling a Kindle version of the PDF, apparently without asking permission or even bothering to inform the authors. So far, so big corporation. But digging a little deeper I went to the front page of the site and found this interesting “license”:

“License? No, there is no license. It is provided as a knowledge sharing project for anyone to use. But if you use it in an academic or research like context, we would love to be cited appropriately.”

Now I would interpret this as meaning that the authors had intended to place the work in the public domain. They clearly felt that while educational and research re-use was fine, commercial use was not. I would guess that someone at Amazon read the statement “there is no license” and felt that it was free to re-use. It seems odd that they wouldn’t email the authors to notify them but if it were public domain there is no requirement to. Rude, yes. Theft? Well it depends on your perspective. Going back today the authors have made a significant change to the “license”:

It is provided as a knowledge sharing project for anyone to use. But if you use it in an academic or research like context, we would love to be cited appropriately. And NO, you are not allowed to make money on it by reselling The Matrix Cookbook in any form or shape.

Had the authors made the content CC-BY-NC then their intentions would have been much clearer. My personal belief is that an NC license would be counter-productive (meaning the work couldn’t be used for teaching at a fee charging college or for research funded by a commercial sponsor for instance) but the point of the CC licenses is to give people these choices. What is important is that people make those choices and make them clear.

The second example related to identity. As part of an ongoing discussion involving online commenting, genereg, a Friendfeed user, linked to their blog which included their real name. Mr Gunn, the nickname used by Dr William Gunn online, wrote a blog post in which he referred to genereg’s contribution by linking to their blog from their real name [subsequently removed on request]. I probably would have done the same, wanting to ascribe the contribution clearly to the “real person” so they get credit for it. Genereg objected to this, feeling that as their real name wasn’t directly in that conversational context it was inappropriate to use it.

So in my view, “Genereg” was a nickname that someone was happy to have connected with their real name, while in their view this was inappropriate. No-one is right or wrong here; we are evolving the rules of conduct more or less as we go and frankly, identity is a mess. But this wasn’t clear to me or to Mr Gunn. I am often uncomfortable with trying to tell whether a specific person who has linked two apparently separate identities is happy with that link being public, has linked the two by mistake, or just regards one as an alias. And you can’t ask in a public forum, can you?

What links these, and this week’s other fracas, is confusion over people’s expectations. The best way to avoid this is to be as clear as you possibly can. Don’t assume that everyone thinks the same way that you do. And definitely don’t assume that what is obvious to you is obvious to everyone else. When it comes to content, make a clear statement of your expectations and wishes, preferably using a widely recognized and understood license. If you’re reading this at OWW you should be seeing my nice shiny new cc0 waiver in the right hand navbar (I haven’t figured how to get it into the RSS feed yet). Most of my slidesets at Slideshare are CC-BY-SA. I’d prefer them to be CC-BY but most include images with CC-BY-SA licenses which (I try to make sure) I respect. Overall I try to make the work I generate as widely re-usable as possible and aim to make that as clear as possible.

There are no such tools to make clear statements about how you wish your identity to be treated (and perhaps there should be). But a plain English statement on the appropriate profile page might be useful: “I blog under a pseudonym because…and I don’t want my identity revealed” or “Bunnykins is the Friendfeed handle of Professor Serious Person”. Consider whether what you are doing is sending mixed messages or potentially confusing. Personally I like to keep things simple so I just use my real name or variants of it. But that is clearly not for everyone.

Above all, try to express clearly what you expect and wish to happen. Don’t expect others necessarily to understand where you’re coming from. It is very easy for one person’s polite and helpful to be another person’s deeply offensive. When you put something online, think about how you want people to use it, think about how you don’t want people to use it (and remember you may need to balance the allowing of one against the restricting of the other) and make those as clear as you possibly can, where possible using a statement or license that is widely recognized and has had some legal attention at some point like the CC licenses, cc0 waiver, or the PDDL. Clarity helps everyone. If we get this wrong we may end up with a web full of things we can’t use.

And before anyone else gets in to tell me: yes, I’ve made plenty of unjustified, and plain wrong, assumptions about other people’s views before. Pot. Kettle. Black. Welcome to being human.