
As a keen observer of the digital ecosystem, I am fascinated by the evolution of Linked Data…

The Web took off the way it did due to many factors. Crucially it was easy to consume information but also it was easy to produce it – by the latter I mean that a unix systems administrator could, without specialist knowledge, put up a Web server and people could easily write some HTML to be served by it. To use our meme from last September, there was a “ramp” for consumers and one for producers too.

Something very interesting is happening with Linked Data: the producer ramp seems to have arrived before the consumer ramp; i.e. circumstances are such that there are incentives to produce before the consumers are fully tooled up. This is different but not unnatural – the case for consumption is clearly better if there is something to consume (e.g. DVD players wouldn’t have been useful without DVDs!) The downside is that production practice might miss some usability requirements but, in the fluid world of the Web, we can expect to see this co-evolution.

How did this early producer ramp come about?  Some of it is due to the openness “wave” that the Linked (Open) Data community is both encouraging and surfing, such that data providers get a tick in a box for publishing this way.  Some of it is because of lobbying and hard work by key players – activists, academics, academic activists – with influence.

Sometimes the business case might not be founded entirely on delivering publicly open data. This is perhaps most evident in a corporation or enterprise with a complexity of information systems (I quite like “complexity” as the collective noun for information systems…) There are clear internal efficiencies in a common data-sharing technology which facilitates internal reuse, and perhaps extension to business partners too. We used to call this the RDF bus; it works well, and now it’s emerging as public transport!

The BBC has demonstrated this admirably by using RDF internally to deliver web sites, while also delivering linked data externally and bringing in other sources (e.g. Wildlife Finder). By the same argument, open government data represents a cost efficiency in-house together with an empowerment of the citizen through access to open data. Whichever incentive dominates, each attracts applause and encourages progress towards a culture change in data publishing.

Interestingly, in linked data, consumers may be producers too. Some linked data apps consume multiple sources and communicate the result to a human – many mashups are really visualisations where the integration of information occurs somewhere in the cognitive workflow. But the producer mindset (i.e. it’s better to publish linked data for everyone to use than just put up yet another web site) reminds us that it’s valuable to integrate and republish, not just juxtapose in the UI. This culture of republication is already evident in the web for news aggregation and we can expect to see it in data too.
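
To make “integrate and republish” concrete, here is a minimal sketch using the Python rdflib library. It is illustrative only: the two source URLs are hypothetical placeholders standing in for any published linked data sources.

```python
# A minimal sketch of "integrate and republish" rather than juxtapose in the UI.
# Assumes the rdflib library; the source URLs below are hypothetical placeholders.
from rdflib import Graph

sources = [
    "http://example.org/programmes/some-programme.rdf",   # hypothetical programme data
    "http://example.org/wildlife/species/lion.rdf",       # hypothetical wildlife data
]

merged = Graph()
for url in sources:
    merged.parse(url)  # fetch and parse each RDF document into one graph

# Serialise the integrated graph so that others can consume it in turn,
# rather than leaving the integration locked up in a visualisation.
merged.serialize(destination="merged.ttl", format="turtle")
print(f"Republished {len(merged)} triples to merged.ttl")
```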

Anyway, the tooling for the consumer ramp is emerging now, powered by the energy of communities like the consumers of open.gov.uk.  This is good news for Linked Data, because we need both ramps to flourish.

Data analysts might reasonably be concerned that there is more to the consumer ramp than software tooling: there is also a question of “data literacy”. This was less of an issue last time round because we humans are rather good at processing images and text – even making rapid assessments of information quality from a glance at a list of search results. We can do this with BBC programme data too, but working with datasets can be more specialist, and getting it wrong could ultimately be damaging.

Examples of data misinterpretation abounded long before linked data, and we can hope that opening up analysis will lead to more challenge and debate, and to the emergence of better understanding and practice. Good practice is already emerging. We can publish data with a set of caveats to protect the producer from accusations resulting from misinterpretation, or, far better, we can publish data with an online tutorial so that people learn how to interpret it – building capability for sustained understanding rather than (or as well as) planning for inevitable misinterpretation. This is part of the consumer ramp too.

So there we are. I hope the producers are responsive to the emerging needs of the consumer as they see what is possible, that the wave of openness and business efficiency drives a culture change in data use, and that citizen analysts lead to better public understanding. In my next post I will explain how to achieve some of this!

Did you know you can run remote computations from your Windows/Mac/Linux box without any special client software installed, just by dragging and dropping? And it doesn’t even matter if your machine isn’t online all the time…

It’s a great idea from Ian Cottam at The University of Manchester, and it makes some powerful points.

The trick uses Dropbox, which is software that syncs your files across your computers. This is incredibly handy – as time goes on we all use more PCs, laptops (and indeed iPhones!) and Dropbox synchronises the contents of your Dropbox folder across all these for you. Note this is quite different from having some centralised filestore (or WebDAV drive) mounted on everything – it doesn’t need you to be online at time of use and it doesn’t need a sysadmin to set it up. Dropbox is very easy to install and incredibly easy to use – there really is no need to read a manual and the benefits are immediate. (Other synchronising software exists, but Ian prefers the simplicity and ease of Dropbox.)

With the “Drop and Compute” model you just drag and drop your “job” into the appropriate Dropbox folder. Later Dropbox notifies you about new files and when you look you find the results. This is a totally familiar interface for file and data management. Behind the scenes, the server spots the job – via a simple monitoring script – and does its thing. To find out all the details of how Ian makes this work with Condor job submission, check out the Drop and Compute Wiki page for instructions and a video.
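
For illustration only – this is not Ian’s actual script – the server side could be as simple as the sketch below, which polls a Dropbox-synced folder and hands any new submit file to the standard condor_submit command. The folder path, file suffix and polling interval are all assumptions.

```python
# Illustrative sketch of the "Drop and Compute" server side (not Ian Cottam's code).
# It watches a Dropbox-synced folder and submits any new Condor submit file it finds.
# The folder path, file suffix and polling interval are all assumptions.
import subprocess
import time
from pathlib import Path

INBOX = Path.home() / "Dropbox" / "drop-and-compute"   # hypothetical shared folder
seen = set()

while True:
    for submit_file in sorted(INBOX.glob("*.submit")):
        if submit_file in seen:
            continue
        seen.add(submit_file)
        # condor_submit is the standard HTCondor submission command; output files
        # written alongside the job are synced back to the user by Dropbox.
        subprocess.run(["condor_submit", submit_file.name],
                       cwd=submit_file.parent, check=False)
    time.sleep(30)   # poll every half minute
```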

Couldn’t we have done this before? Yes, but nothing like as easily for all concerned. If we say to researchers “mount this network drive and put all your files there on every machine you use” then we are creating an extra burden and perhaps worsening a version control problem, and obliging them to be online to use it. However, if we say “you can use Dropbox to keep your files in sync between all your machines – and your iPhone too” then the user has an immediate benefit for next to no extra work. Dropbox is a solution that makes things simpler while other solutions make things more complicated – this is the only acceptable direction!

I think there’s some interesting psychology involved here too. Asking people to put their files somewhere central stops them feeling they are their personal files any longer, whereas syncing them across personal machines keeps them close and personal. Of course, behind the scenes, Dropbox is indeed putting them somewhere central, but that’s an implementation detail (and has the benefit you can also manage them from a  Web browser).  Fundamentally, the user model is empowering rather than disempowering.

Is there a catch?  Just one little issue at the moment – if Dropbox puts its data in a different territory then it may be subject to different laws, the best known example being the Patriot Act in the US.

The Condor example has exercised the model well, and in principle this approach could be used for any remote processing. But best of all it’s an example of understanding that ease of use really matters: above all, it’s a solution which actually makes people’s lives simpler.

The focus of our trip was on exploring changing research practice in a world where we have substantially more data and the emergence of new data-intensive techniques. Fundamentally we knew that  (a) we are in a digital ecosystem and (b) we are in a digital revolution.

The ecosystem point is important because we are two computer scientists who understand that we are witnessing (and participating in) the co-evolution of technology and society, and we were anxious to avoid the technological determinism that computer scientists might sometimes be guilty of. (Well, actually, neither of us was a computer scientist originally…!) Actually the co-evolution point was already well understood in most of the groups we visited, and sometimes even the subject of scholarly study. So we didn’t have trouble making the point, which is at odds with my experiences with some other audiences. There were a couple of examples of thinking along the lines of “high performance computing will solve the challenges of multidisciplinary collaboration” but they were rare and even disputed. Malcolm’s characterisation of Supercomputing as “males displaying to one another” was greeted with recognition and humour.

By the end of the trip we were working with three metaphors and they were great objects for discussion – all three inherently socio-technical. We knew they were working because people would reach for their notepads when we used them, and typically people would start using at least one of them themselves in the discussions:

1. Intellectual Access Ramps.  This is the notion that researchers need to be able to engage incrementally with the tools, methods and practices of data intensive research.  (We used myExperiment as an example.)

2. Telescopes for the Mind. This is the notion of new instruments to reveal things in data that we couldn’t see before. Telescopes changed our understanding of our position in the universe (literally!)

3. Going the Last Mile. This reflects the need to communicate insights so that they have influence – “insights with impact” if you like. Too often we stop at the paper or the screen, but that’s only half the picture.

In our visits we presented examples of these and sought more – we heard of both successes and failures, of the importance of the rules of the road and how they need to change with increasingly digital practice. We found excellent examples of good practice. In particular the geo community is mature, inherently international and has well-established practice and standards, and perhaps they may provide a beacon for some others. We also noted the leadership role of libraries in the US, effectively reinventing themselves in the digital age, and the maturity of study of co-shaping and digital scholarship. The technology backdrop was fascinating, especially for me the new architectures which do the compute alongside the data (like Graywulf) as well as the inevitable cloud computing (quite literally in the case of the meteorologists!)

We had many useful discussions about effective alignment of researchers, research drivers, community, data, resources and innovation in technology and method – and the consequent need for alignment from funding agencies too. We proposed a hypothesis for discussion: “If we spent 10% less on hardware and put that investment instead into equipping researchers to work more effectively with data then we would make greater progress in our research.” There was broad agreement and reinforcement for this statement. The two new DataNet projects are particularly significant in terms of investment (5 years, renewable to 10) and a careful alignment which emphasises the research data users rather than the computer science but still understands the role of the computer scientists.

We also met some absolutely amazing scientists, who really brought home Alex Szalay’s quote “a scientist needs to be able to live within the data”. They were inspirational, humbling, and gave the greatest clarity to what data intensive science is all about!

e-Science was kind of a utopian vision and now we know the realities. I don’t think our report will be utopian: it will say there are hard and important decisions to be made in society, that these can be informed by data-intensive research in ways never thought possible before, and to achieve this we need to align our investments in the researcher’s capacity to understand data as well as in the big iron to shovel it. Resources are not infinite.

It’s time to sleep. The last 3.5 weeks have been incredibly intense. They’ve also been incredibly valuable, and I want to thank everyone who made time to meet with us, everyone who’s hosted us in their institutions, cities and homes, and especially Jo Newman and Ruth Lee who ensured the flawless organisation of this epic journey!

Flight BA216, London Heathrow

The second half of the tour has taken us down the west coast from Seattle to San Francisco to LA, then by road through Irvine to San Diego, and back eastward through Boulder. We are now masters of the Hertz “NeverLost” satnavs! This post comes to you from rainy Washington where tomorrow we meet with NSF.

The visits continue to be hugely informative and thought-provoking, and as the trip goes on our thinking is converging and now consolidating. Our hosts have been fabulous and we have enjoyed stimulating meetings from 45 minute chats to 3 hour discussions. And somewhere in the background, when we have a moment to glance up, we’ve seen fabulous views of lakes, mountains and the ocean.

A very brief synopsis of the second half:

Monday 20th Microsoft. Hosted by Tony Hey, we spent the day in the Executive Briefing Centre being briefed executively and then briefly executing our talk. Significantly we intersected with another expedition – Prof Doug Kell of BBSRC and his officers, who were at the end of a similar tour. Check out the open source Word Add-in For Ontology Recognition and the Creative Commons Add-in for Microsoft Office 2007. And check out Doug’s blog too.

Tuesday 21st SLAC. In the Stanford Linear Accelerator Center, Jacek Becla introduced us to his world of eXtremely Large DataBases and the XLDB events (the 3rd workshop was held recently). We enjoyed a demo of SciDB (see Tuesday 8th in a previous post!)

Wednesday 22nd ISI.  A dynamic day of meetings (the meetings were dynamic and so was the schedule!) at the Information Sciences Institute with Yolanda Gil, Ewa Deelman, Ann Chervenak, Carl Kesselman and members of his team. Lots of examples of Computer Science coming to the aid of real users.

Thursday 23rd Irvine and UCSD. Hosted by Paul Dourish at Irvine, we had a fascinating meeting with Gary Olson, collaboratory guru and one of the editors of Scientific Collaboration on the Internet.  Then Malcolm and I forked: I stayed at Irvine to meet with Richard Taylor and see the work of Hazel Asuncion in traceability, workflows, and software architectures (software provenance is important too!); Malcolm visited Mark Ellisman of the National Center for Microscopy and Imaging Research at UCSD, where he spotted many microscopes and multiple ramps.

Friday 24th NCAR. Hosted by Don Middleton, we had a super day with his team in the National Center for Atmospheric Research in Boulder (at 5400 feet!) With serious attention to data management through its lifecycle, and delivery of tools as well as data, the day was full of examples of best practice – not just in data but in teamwork.

We have meetings in Washington now and one more technical visit – to former rock star Alex Szalay.

Ramping up

Giving a researcher sudden access to a wealth of powerful new tools, techniques and methods is no more likely to lead to a successful journey than putting someone who hasn’t learned to drive at the wheel of a strange new vehicle. Not even the instruction manual is going to help much! It’s challenging, it requires investment of effort and it’s even dangerous. It should be no surprise if they would rather get out and resume more familiar modes of transport.

The way we learn to be a proficient driver is incremental and assisted. At the end we can drive by ourselves. And some go on to more advanced driving challenges like fast sports cars or heavy goods vehicles. Some teach other drivers.

Similarly researchers need a means of incremental engagement with the tools, techniques and methods of e-Science, so that they can meet their needs on their own terms. To use another transport metaphor, we can think of this as an “on ramp”: it is the intellectual access ramp for data-intensive science.

Successful projects understand this and have built all sorts of ramps to meet the different needs in different communities. On our trip the ramp has turned out to be a powerful socio-technical metaphor. Once people see the ramp as an object in its own right we can look at its shape, how it’s built and how well it works – we have become ramp-spotters and I’m thinking of compiling the Observer’s Book of Ramps! Some have gentle curves. Some have activation energy. Some look like tall brick walls – those tend to be the ones that don’t work well…

We come bearing ramps. The UK Virtual Research Environment projects are ramps: myExperiment provides a gentle and familiar ramp into the “World of Workflows”, where workflows are easy to run but tricky to write. Some of the efforts to hide complex infrastructure behind simple APIs are ramps for developers, like SAGA (Simple API for Grid Applications) and other offerings from OMII-UK (whose business is ramp re-engineering!) For the scientist, a simple drag-and-drop interface to running e-Science computations is a gentle ramp – elegantly demonstrated by the Dropbox drag-and-drop interface to Condor job submission developed by Ian Cottam in the Manchester Interdisciplinary Biocentre.

We’ve seen some great ramps on the trip, including scientist-focused tool provision alongside data products in operations like BIRN and Unidata. The Science Gateways endeavour to be ramps and NanoHub is a great example. It’s interesting, and perhaps no coincidence, that successful ramps like NanoHub and Unidata have “education” in their mission statements.

Some ramps have facilitators to guide researchers up the ramp – think of librarians using their skills to assist a researcher who is then able to help themselves. We’ve also seen models where a layer of abstraction is implemented to protect the researchers from the underlying details – intermediaries rather than facilitators – but there is wariness that these may also serve to hide the computational thinking from the researchers and deny potential for new practice and new accomplishments.

It seems that successful ramps are characterised by an effective alignment of community, data and software, and they have a role in developing research skills as well as conducting research. It follows that ramp construction needs an alignment of interests and funding from a community of users in research and education, their data providers, service providers and software tools providers. This combination might not be in the remit of any one funder but it’s in the interests of all, because everyone stands to gain from researchers ascending the intellectual access ramps to achieve new outcomes and build new know-how.

This is the first draft of the executive summary of the report from the trip, posted here to invite your comment…

Today’s challenges demand the best quality decisions that can be achieved. The growing wealth of available data should be used to improve those decisions. Data are the catalysts in research, engineering and diagnosis. Data fuel analysis to produce key evidence and supply the information for compelling communication. Data connect computational systems and capture the work of large collaborative endeavours.

Data should be used fluently in research, investigation, planning and policy formulation to equip those responsible with the necessary information, knowledge and wisdom. The present cornucopia of data is under-exploited. Effort and resources should be rallied to deliver a data-use initiative that harnesses the potential of data by delivering a new focus and capability.

The data-use initiative will increase the impact of research on the quality of decisions by:

  • enabling all researchers to analyse an order of magnitude more data without intermediaries, and
  • much improving the methods used to “go the last mile” to achieve influence.

An over-riding requirement is to engage the next generation of talented minds in the creative processes of distilling information from data, establishing knowledge and developing wisdom.

Principles

  • Support for research data should be in harmony with the evolving digital ecosystem.
  • Investment in collecting, preserving or generating data should be balanced with investment in analysing and interpreting data.
  • Facilitate the co-evolution of research practices, new methods and their supporting software.
  • Democratise the means for undertaking research by improving education and access to data, tools and facilities.
  • Align innovation, preservation and support for sustained use.
  • All users of data need to see that resources are neither free nor infinite.

Recommendations

  • Stimulate new thinking in the next generation to drive the international data-use impetus.
  • Invest in creating and sharing methods and software for exploiting data.
  • Increase data and method use by building “intellectual access ramps” and education.
  • Coordinate, improve and sustain the foundations for exploiting data.
  • Align creation of methods, aids to adoption, and provision of sustained infrastructure.

The data-use initiative will require that a greater proportion of research effort and investment is allocated to using data than hitherto. There are already large quantities of under-used data. The available data are growing rapidly through research investment and as the by-product of many other activities. A modest change in relative priorities will yield significant dividends.

Survival in the digital revolution depends on rapid and appropriate adaptation.

There are many global and local challenges that will overwhelm society unless we improve the quality of our decisions. Key to this is making the best use of all available data.

We’re half way through the trip. Geographically we’ve been to Boston, Chicago, Ann Arbor, Madison, Champaign and Albuquerque. Intellectually we’ve seen tremendous multidisciplinary research, exciting technologies, many examples of different alignments of data, hardware, people and software – and fields driven as much by library schools as computer scientists. Every visit has been superb.  We have met so many people, seen so many projects, had great discussions and arguments.  Our hosts have been fabulous:  every day has been incredibly rich and full. And, most importantly, our thinking evolves at every stop.

A very brief synopsis of our journey so far:

Tuesday 8th MIT. Hosted by Eric Prud’hommeaux & Philippe Le Hegaret in W3C, we caught up on standards including RDFa and HTML5 in the morning, gave a lunchtime talk to Carlo Ratti’s SENSEable Cities lab and met with Sam Madden and Michael Stonebraker in the afternoon to learn about SciDB – “a project in serious danger of succeeding”. In the evening I went to a Semantic Web gathering, where Oshani Seneviratne presented her study of Creative Commons attribution violations.

Wednesday 9th Started the day at the British Consulate with Jacqueline Ashborne, Science and Innovation officer, and learnt about their help for visits and collaboration in research. John Wilbanks of Science Commons kindly gave us a ride to Harvard, alerting us to the legal issues of derivative works in the context of Web and database queries. The rest of the day was hosted by Alyssa Goodman and Roslin Reid: we had a really interesting mix of meetings, including Pepi Fabbiano who participated in the excellent “Harnessing the Power of Digital Data for Science and Society” document. We inaugurated our mission and this year’s IIC seminar series simultaneously with our first talk: “The DataQuest”.

Thursday 10th Chicago. Met with Ian Foster and his Computation Institute colleagues in University of Chicago. Great new building and amazing seminar room which we also inaugurated (I’ve never seen so many projectors, pointing in different ways and even at each other!) and enjoyed a round table discussion. Great people and projects, from analysing news to systems biology. Our meeting over dinner with Ian and Steve Tuecke recalled the early days of the UK e-Science programme.

Friday 11th Chicago. Spent the morning at the University of Illinois at Chicago with Bob Grossman, a whiteboard and caffeine, talking about the Open Cloud Consortium and its testbed. I applaud Bob’s principle of using the minimum software necessary! Northwestern in the afternoon to meet with David Martin, Noshir Contractor and Jim Chen, where we also enjoyed a tour of the Starlight facility in all its amazing technicolour connectivity.

Sunday 12th – Monday 13th. University of Michigan, hosted by Dan Atkins. Lots of really useful meetings, with particular relevance from an e-Social Science viewpoint. We had a roundtable discussion over pizza in the Daniel Atkins Conference Room (in the same building where Arpanet was conceived!) and gave the next version of our talk. The day shone with interdisciplinarity and sophistication in e-Science thinking, from collaboratories to socio-technical design.

Tuesday 14th. University of Wisconsin - Madison, hosted by Miron Livny, the creator of Condor. Miron shared his many insights into technology adoption with a clarity for which he is famed. Met the team, learned about HDFS in Condor as a SAN-alternative, and demoed Ian Cottam’s very compelling Condor and Dropbox integration.

Wednesday 15th. Early start at the University of Illinois at Urbana-Champaign where we met a group of scholars in the Center for Informatics Research in Science and Scholarship with a great understanding of the people side of the picture, and then to NCSA to give our talk and enjoy round tables on e-humanities (including eDream) and e-Science, all hosted by Jim Myers. We debated the Semantic Web! It was also a great chance to catch up on HASTAC.

Thursday 16th- Friday 17th University of New Mexico at Albuquerque, hosted by Bill Michener. The first visit where everyone we met was completely focused on data! Our talk and discussion were in the library – we soon overcame our instinct to talk quietly, as more and more rows of seats were added at the back :-) This was a visit with an emphasis on production research data and it was impressive to see the balance of skills involved in its delivery.

Half time, change ends. From Albuquerque we’re heading to Seattle then hopping down the west coast through San Francisco, LA and San Diego before coming back through Boulder to Washington. Watch this space for the second half!

When you watch the world of e-Research through the lenses of papers and websites you glimpse a world of new capabilities and changing practice. But how pervasive is this shift? “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete” shouts Wired! Is that true? What’s it really like…how exactly is scientific practice changing? Understanding this last question is crucial if we are to succeed in creating the tools and techniques so that the new scientific methods can thrive. Our focus is in understanding how researchers will be using data in the future.

A good way to understand both the adoption and the trajectory of data intensive science is to go and ask, and this is the purpose of our “US fact-finding expedition”. On our three week tour, Malcolm Atkinson (UK e-Science Envoy) and I will visit key institutions and projects on a trip that takes us through Cambridge, Chicago, Michigan, Wisconsin, Urbana-Champaign, Albuquerque, Seattle, Palo Alto, Los Angeles, Irvine, San Diego, Boulder, Washington and Baltimore. We’ll be talking to practising researchers ranging from astronomers, biologists and chemists to computational musicologists and social scientists, and experts delivering technology from optical networks, clouds and databases to workflow systems and Web solutions.

I feel a bit like an explorer on a quest to discover new species! In fact there is truth to that, as we will observe and explain the things that we discover as we explore the ecosystem of technology and researchers. I earned my explorer badge on my ad hoc mission two years ago when I came back with “the new e-Science”; now I’ll see what’s changed. But it’s not just a survey, because we’re really after discussion and insight from the experts at the cutting edge(s) – especially about change and future needs, which don’t get captured elsewhere. And at the same time we’re sharing our e-Science experiences and doing the groundwork for future collaborations.

Watch this space!

Destinations on the Expedition

I believe that the academic paper is now obsolescent as the fundamental sharable description of a piece of research. In the future we will be sharing some other form of scholarly artefact, something which is digital, designed for reuse, able to drop easily into the tooling of e-Research, and better suited to the emerging practices of data-centric researchers. These could be called Knowledge Objects or Publication Objects or whatever: I shall refer to them as Research Objects, because they capture research.

Many people are coming at this by tweaking what we have already in the scholarly knowledge lifecycle – like publishers with supplemental materials on a web site.  But for a minute let’s do a thought experiment and let go of this augmentation of an archaic form.  Forget papers: How would we define a Research Object instead?

At school we just had the “Three Rs” – Reading, writing and arithmetic. I  suggest that in e-Research there are Six Rs and they are the essential characteristics of the research record in contemporary research.  Research Objects should have these key properties:

  1. Replayable – go back and see what happened. Whether observing the planet, the population or an automated experiment, data collection can occur over milliseconds or months. The ability to replay the experiment, and to focus on crucial parts, is essential for human understanding of what happened.
  2. Repeatable – run the experiment again. Enough information for the original researcher or others to be able to repeat the experiment, perhaps years later, in order to verify the results or validate the experimental environment. This also helps scale to the repetition of processing demanded by data intensive research.
  3. Reproducible – an independent experiment to reproduce the results. To reproduce (or replicate) a result is for someone else to start with the description of the experiment and see if a result can be reproduced. This is one of the tenets of the scientific method as we know it.
  4. Reusable – use as part of new experiments. One experiment may call upon another, and by assembling methods in this way we can conduct research, and ask research questions, at a higher level.
  5. Repurposable – reuse the pieces in a new experiment. An experiment which is a black box is only reusable as a black box. By opening the lid we find parts, and combinations of parts, available for reuse, and the way they are assembled is a clue to how they can be reassembled.
  6. Reliable – robust under automation. This applies to the robustness of science provided by systematic processing with the human out of the loop, and to the comprehensive handling of failure demanded in complex systems where success may be the exception not the norm.

These points of definition have evolved over a series of talks and the numbers vary. An interesting contender for number 7 is reflective – you can run a Research Object like a program but you can also look inside it like data; in other words it needs to be self-contained and self-describing. But that’s a means to an end rather than the end. And contender number 8 is replicatable, but to a computer scientist this is like repeatable and to a scientist it is like reproducible. I’m not sure how many of the six you have to score before something really is a Research Object, but maybe these six are actually necessary and sufficient.

How do we do this? In the Open Repositories world, the Object Reuse and Exchange standard is using RDF graphs to describe collections of things – like all the pieces that make up an experiment – even if they are distributed across the Web. It’s a great starting point for describing Research Objects – especially because, if we’re right, it is Research Objects rather than papers that will be collected in our repositories in the future. One day people will be saying “could I have a copy of that <Research Object> please?”
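
As a rough illustration of that starting point, here is how one might describe a Research Object as an ORE aggregation using Python’s rdflib. It is a sketch only: the component URIs are hypothetical, and a real Research Object vocabulary would no doubt be richer than a bare aggregation.

```python
# A sketch of a Research Object described as an OAI-ORE aggregation using rdflib.
# The workflow, dataset and paper URIs are hypothetical placeholders.
from rdflib import Graph, Namespace, URIRef, RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

ro = URIRef("http://example.org/research-objects/exp42")
parts = [
    URIRef("http://example.org/workflows/exp42.t2flow"),     # the workflow
    URIRef("http://example.org/data/exp42-results.csv"),     # the results
    URIRef("http://example.org/papers/exp42-preprint.pdf"),  # the narrative
]

g = Graph()
g.bind("ore", ORE)
g.add((ro, RDF.type, ORE.Aggregation))
for part in parts:
    # ore:aggregates ties together the distributed pieces that make up the experiment
    g.add((ro, ORE.aggregates, part))

print(g.serialize(format="turtle"))
```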

In my panel position at the European Semantic Web conference I suggested papers are an archaic, linear, human-readable form of Research Object and will be superseded. Actually, there will continue to be value in a human-readable narrative of an experiment, and of course we have a massive corpus in that form – though even today papers are increasingly read by machines rather than by people!

Heraklion, Crete, June 2009 (revised August 2009)

The Web is simply the biggest, most successful, most usable distributed systems architecture ever.

So why is it that people keep proposing alternatives that struggle to succeed?  First I had this with the Grid, watching people giving talks saying “we’re nearly there”, “it’s a long game”, “distributed systems are difficult” and there’s me thinking “but… but… the Web works!”  Cloud services are making my point nicely now. But I’m still getting it with SOA, where on the one hand some surprisingly unquestioning people tell me it’s clearly the right way to go, while on the other I hear stories of real SOA achievements really quite low down in the maturity model.  Is the Emperor’s wardrobe so full of new clothes? Ok, I’m the first to say it’s horses for courses, but the neglect of the Web baffles me.

I think the reason for the dominant acceptance of the SOA architectural style is that it’s programmed into the profession. We know how to write programs with procedure calls so we know how to do remote procedure calls. We understand objects so we can do remote objects. SOA is principled and you can buy training courses and get certificates. And it’s become a buzzword. And I think the reason it struggles is that first you have to servicify legacy apps – not so easy – and then later try to recompose services (flexibly, dynamically, … you know the mantra) – which turns out actually to be quite difficult because SOA services are complicated.

By their nature Web apps are also client-server, and people have become rather good at coupling things together. But there seems to be confusion over the principles of the Web architectural style, which is a pity given it’s the most successful distributed system ever. There have been efforts to capture it, starting with Fielding’s REST model. REST has become misunderstood to a point where it is damaging, so introducing a new word is probably a good idea – I like Resource Oriented Architecture. For example, the O’Reilly RESTful Web Services book attempts to establish some Resource Oriented Architecture principles beyond Fielding.

Actually ROA is very easy to understand. You have a resource (think URI) and a small number of predefined methods that you can use with it – like GET and POST. The web infrastructure is massively optimised to make these methods work really well. When you want to do other stuff you don’t add more methods but rather you add more resources. Voila – Resource Oriented Architecture. Every time you navigate the Web you are using it.
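
As a small, hedged illustration – using Flask, with invented resource names – “adding more resources rather than more methods” might look like this:

```python
# Illustrative ROA sketch using Flask; the resources and data are invented.
# New capability is exposed as a new resource (a new URI), not a new method:
# the verbs stay within the small, well-optimised set (GET, POST, ...).
from flask import Flask, jsonify, request

app = Flask(__name__)
experiments = {}   # toy in-memory store keyed by experiment id

@app.route("/experiments/<exp_id>", methods=["GET"])
def get_experiment(exp_id):
    return jsonify(experiments.get(exp_id, {}))

@app.route("/experiments/<exp_id>", methods=["POST"])
def create_experiment(exp_id):
    experiments[exp_id] = request.get_json(silent=True) or {}
    return jsonify({"created": exp_id}), 201

# "Doing more" means adding a /results resource, not a getResults() operation.
@app.route("/experiments/<exp_id>/results", methods=["GET"])
def get_results(exp_id):
    return jsonify(experiments.get(exp_id, {}).get("results", []))

if __name__ == "__main__":
    app.run()
```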

In SOA you instead add methods and thereby create complex interfaces to individual entities, interfaces that must be maintained client- and server-side.  In pursuit of simplicity this can add complexity. Great if you want lots of code (though ironically the idea of SOA is to avoid the alternative of writing lots of code!)

So why is it we don’t have principles of ROA that architects can wield like those of SOA?  Well actually I think we could, and I would love to help people capture them! I think the challenge is that the success of the Web architecture is precisely because it has been allowed to be organic.  So the ROA principles will have a different nature to the inorganic principles of SOA.

I think we could start by doing what the Web 2.0 design patterns did.  Web 2.0 wasn’t created by a bunch of architects sitting down and saying “ok we have Web 1.0, let’s design the next version”.  Rather the design patterns are a set of observations on how the Web is actually being used, by people, today.  I think we could come up with a set of ROA patterns in a similar vein (indeed, surely, they would be related!)

Until we have these, SOA will dominate in any software design process that involves trained engineers. SOA isn’t broken – it’s an entirely plausible way of writing a lot of code to build systems when you don’t have a better way. But I think it has to be challenged – for any given system there may be a better architecture, and ROA is a candidate. Maybe ROA is good for people coupling things together and SOA is good for machines. Or maybe we need a resource-oriented variant of SOA.

So I think we should do two things:

1. Identify the design patterns that characterise ROA.

2. Compare and contrast ROA and SOA systems (preferably using the same systems built both ways!) to work out where and when we get the benefits.

My favourite example at the moment is a big sensor network project (SG4E) which is going to have an SOA middleware and REST on top for rapid application development. There’s a compelling argument for this classic 3-tier model – you know the one: Web 2.0 is flaky and needs robust services underneath so let’s put a properly-engineered SOA there. But this is very deeply a resource-oriented project (it’s about things, in the physical world), and surely could be ROA all the way down. What’s more, the project uses RDF (i.e. Resource Description Framework) – one might expect that to sit most comfortably in a resource-oriented model! This pervasive ROA question gives rise to a great thought experiment: What is the REST API for planet earth?

DDeR, Zakopane
