An openwetware blog on the challenges of open and connected science

citation

Links as the source code of our thinking - Tim O’Reilly

I just wanted to point to a post that Tim O’Reilly wrote just before the US election a few weeks back. There was an interesting discussion about the rights and wrongs of him posting on his political views and the rights and wrongs of that being linked to from the O’Reilly Media front page. In amongst the abuse that you have come to expect in public political discussions there is some thought-provoking stuff. But what I wanted to point out, and hopefully revive a discussion of, is a point he makes right near the bottom.

[I have conflated two levels of comments here (Tim is quoting his own comment) - see the original post for the context]

“Thanks to everyone for wading in, especially those of you who are marshalling reasoned arguments and sharing actual sources and references, showing you’ve done your homework, and helping other people to see the data that helped to shape your point of view. We need a LOT more of that in this discussion, rather than slinging unsupported allegations back and forth.

Bringing this back to tech - showing the data behind your argument is a lot like open source. It’s a way of verifying the “code” that’s inside your head. If you can’t show us your code, it’s a lot harder to trust your results!”

Links as source code for your thinking: that’s a meme that should survive the particulars of this particular debate!

In a sense Tim is advocating the wholesale adoption of the very strong attribution culture we (like to think we) have in academic research. The importance of acknowledging your sources is clear but it also has much more value than that. By tracing back the influences that have brought someone to a specific conclusion or belief it is possible for other people to gain a much deeper insight into how those ideas evolved. Being able to parse the dependencies between ideas, data, samples, papers, and knowledge in an automatic, machine-readable way is the promise of the semantic web, but in the meantime just helping the poor old humans to trace back and understand where someone is coming from is very helpful.

The problem of academic credit and the value of diversity in the research community

This is the second in a series of posts (first one here) in which I am trying to process and collect ideas that came out of Scifoo. This post arises out of a discussion I had with Michael Eisen (UC Berkeley) and Sean Eddy (HHMI Janelia Farm) at lunch on the Saturday. We had drifted from a discussion of the problem of attribution stacking and citing datasets (and datasets made up of datasets) into the problem of academic credit. I had trotted out the usual spiel about the need for giving credit for data sets and for tool development.

Michael made two interesting points. The first was that he felt people got too much credit for datasets already and that making them more widely citeable would actually devalue the contribution. The example he cited was genome sequences. This is a case where, for historical reasons, the publication of a dataset as a paper in a high ranking journal is considered appropriate.

In a sense I agree with this case. The problem here is that for this specific case it is allowable to push a dataset-sized peg into a paper-sized hole. This has arguably led to an overvaluing of the sequence data itself and an undervaluing of the science it enables. Small molecule crystallography is similar in some regards, with the publication of crystal structures in paper form bulking out the publication lists of many scientists. There is a real sense in which having a publication stream for data, making the data itself directly citeable, would lead to a devaluation of these contributions. On the other hand it would lead to a situation where you would cite what you used, rather than the paper in which it was, perhaps peripherally, described. I think more broadly that the publication of data will lead to greater efficiency in research generally and more diversity in the streams to which people can contribute.

Michael’s comment on tool development was more telling though. As people at the bottom of the research tree (and I count myself amongst this group) it is easy to say ‘if only I got credit for developing this tool’, or ‘I ought to get more credit for writing my blog’, or any one of a thousand other things we feel ‘ought to count’. The problem is that there is no such thing as ‘credit’. Hiring decisions and promotion decisions are made on the basis of perceived need. And the primary needs of any academic department are income and prestige. If we believe that people who develop tools should be more highly valued then there is little point in giving them ‘credit’ unless that ‘credit’ will be taken seriously in hiring decisions. We have this almost precisely backwards. If a department wanted tool developers then it would say so, and would look at CVs for evidence of this kind of work. If we believe that tool developers should get more support then we should be saying that at a higher, strategic level, not just trying to get it added as a standard section in academic CVs.

More widely there is a question as to why we might think that blogs, or public lectures, or code development, or more open sharing of protocols are something for which people should be given credit. There is often a case to be made for the contribution of a specific person in a non-traditional medium, but that doesn’t mean that every blog written by a scientist is a valuable contribution. In my view it isn’t the medium that is important, but the diversity of media and the concomitant diversity of contributions that they enable. In arguing for these contributions being significant what we are actually arguing for is diversity in the academic community.

So is diversity a good thing? The tightening and concentration of funding has, in my view, led to a decrease in diversity, both geographical and social, in the academy. In particular there is a tendency to large groups clustered together in major institutions, generally led by very smart people. There is a strong argument that these groups can be more productive, more effective, and crucially offer better value for money. Scifoo is a place where those of us who are less successful come face to face with the fact that there are many people a lot smarter than us and that these people are probably more successful for a reason. And you have to question whether your own small contribution with a small research group is worth the taxpayer’s money. In my view this is something you should question anyway as an academic researcher – there is far too much comfortable complacency and sense of entitlement, but that’s a story for another post.

So the question is: do I make a valid contribution? And does that provide value for money? And again for me Scifoo provides something of an answer. I don’t think I spoke to any person over the weekend without at least giving them something new to think about, a slightly different view on a situation, or just an introduction to something that they hadn’t heard of before. These contributions were in very narrow areas, ones small enough for me to be expert, but my background and experience provided a different view. What does this mean for me? Probably that I should focus more on what makes my background and experience unique – that I should build out from that in the directions most likely to provide a complementary view.

But what does it mean more generally? I think that it means that a diverse set of experiences, contributions, and abilities will improve the quality of the research effort. At one session of Scifoo, on how to support ground-breaking science, I made the tongue-in-cheek comment that I thought we needed more incremental science, more filling in of tables, more laying of the foundations properly. The more I think about this the more I think it is important. If we don’t have proper foundations, filled out with good data and thought through in detail, then there are real risks in building new skyscrapers. Diversity adds reinforcement by providing better tools, better datasets, and different views from which to examine the current state of opinion and knowledge. There is an obvious tension between delivering radical new technologies and knowledge and the incremental process of filling in, backing up, and checking over the details. But too often the discussion is purely about how to achieve the first, with no attention given to the importance of the second. This is about balance not absolutes.

So to come back around to the original point, the value of different forms of contribution is not due to the fact that they are non-traditional or because of the medium per se, it is because they are different. If we value diversity at hiring committees, and I think we should, then by looking at a diverse set of contributions, and at the contribution that a given person is likely to make in the future based on their CV, we can assess more effectively how they will differ from the people we already have. The tendency of ‘the academy’ to hire people in its own image is well established. No monoculture can ever be healthy; certainly not in a rapidly changing environment. So diversity is something we should value for its own sake, something we should try to encourage, and something that we should search CVs for evidence of. Then the credit for these activities will flow of its own accord.

Data is free or hidden - there is no middle ground

Science Commons and others are organising a workshop on Open Science issues as a satellite meeting of the EuroScience Open Forum meeting in July. This is pitched as an opportunity to discuss issues around policy, funding, and social issues with an impact on the ‘Open Research Agenda’. In preparation for that meeting I wanted to continue to explore some of the conflicts that arise between wanting to make data freely available as soon as possible and the need to protect the interests of the researchers that have generated data and (perhaps) have a right to the benefits of exploiting that data.

John Cumbers proposed the idea of a ‘Protocol’ for open science that included the idea of a ‘use embargo’; the idea that when data is initially made available, no-one else should work on it for a specified period of time. I proposed more generally that researchers could ask that others leave data alone for any particular period of time, but that there ought to be an absolute limit on this type of embargo to prevent data being tied up. These kinds of ideas revolve around the need to forge community norms – standards of behaviour that are expected, and to some extent enforced, by a community. The problem is that these need to evolve naturally, rather than be imposed by committee. If there isn’t community buy-in then proposed standards have no teeth.

An alternative approach to solving the problem is to adopt some sort of ‘license’: a legal or contractual framework that creates obligations about how data can be used and re-used. This could impose embargoes of the type that John suggested, perhaps as flexible clauses in the license. One could imagine an ‘Open data – six month analysis embargo’ license. This is attractive because it apparently gives you control over what is done with your data while also allowing you to make it freely available. This is why people who first come to the table with an interest in sharing content always start with CC-BY-NC. They want everyone to have their content, but not to make money out of it. It is only later that people realise what other effects this restriction can have.

I had rejected the licensing approach because I thought it could only work in a walled garden, something which goes against my view of what open data is about. More recently John Wilbanks has written some wonderfully clear posts on the nature of the public domain, and the place of data in it, that make clear that it can’t even work in a walled garden. Because data is in the public domain, no contractual arrangement can protect your ability to exploit that data, it can only give you a legal right to punish someone who does something you haven’t agreed to. This has important consequences for the idea of Open Science licences and standards.

If we argue as an ‘Open Science Movement’ that data is in and must remain in the public domain then, if we believe this is in the common good, we should also argue for the widest possible interpretation of what is data. The results of an experiment, regardless of how clever its design might be, are a ‘fact of nature’, and therefore in the public domain (although not necessarily publicly available). Therefore if any person has access to that data they can do whatever they like with it as long as they are not bound by a contractual arrangement. If someone breaks a contractual arrangement and makes the data freely available there is no way you can get that data back. You can punish the person who made it available if they broke a contract with you. But you can’t recover the data. The only way you can protect the right to exploit data is by keeping it secret. This is entirely different to creative content where if someone ignores or breaks licence terms then you can legally recover the content from anyone that has obtained it.

Why does this matter to the Open Science movement? Aren’t we all about making the data available for people to do whatever anyway? It matters because you can’t place any legal limitations on what people do with data you make available. You can’t put something up and say ‘you can only use this for X’ or ‘you can only use it after six months’ or even ‘you must attribute this data’. Even in a walled garden, once there is one hole, the entire edifice is gone. The only way we can protect the rights of those who generate data to benefit from exploiting it is through the hard work of developing and enforcing community norms that provide clear guidelines on what can be done. It’s that or simply keep the data secret.

What is important is that we are clear about this distinction between legal and ethical protections. We must not tell people that their data can be protected because essentially it can’t. And this is a real challenge to the ethos of open data because it means that our only absolutely reliable method for protecting people is by hiding data. Strong community norms will, and do, help but there is a need to be careful about how we encourage people to put data out there. And we need to be very strong in condemning people who do the ‘wrong’ thing. Which is why a discussion on what we believe is ‘right’ and ‘wrong’ behaviour is incredibly important. I hope that discussion kicks off in Barcelona and continues globally over the next few months. I know that not everyone can make the various meetings that are going on, but between them and the blogosphere and the ‘streamosphere’ we have the tools, the expertise, and hopefully the will, to figure these things out.

Attribution for all! Mechanisms for citation are the key to changing the academic credit culture

Once again a range of conversations in different places have collided in my feed reader. Over on Nature Network, Martin Fenner posted on Researcher ID, which led to a discussion about attribution and in particular Martin’s comment that there was a need to be able to link to comments and the necessity of timestamps. Then DrugMonkey posted a thoughtful blog about the issue of funding body staff introducing ideas from unsuccessful grant proposals they have handled to projects which they have a responsibility in guiding.