Connecting the dots - the well posed question and code as a liability
Just a brief thought prompted by two, partly related, things streaming past my nose. Firstly Michael Nielsen discussed the views of Aristotle and Sunstein on collective intelligence. The thing that caught my attention was the idea that deliberation can make can make group functioning worse, leading to a collective decision that is muddled rather than actually identifying the best answer presented by members of the community. The exception to this is well posed questions, where deliberation can help. In science we are familiar with the idea that getting the question right (correct design of experiment, well organised theory) can be more important than the answer.
The second item was a blog post entitled “Data is good, code is a liability” from Greg Linden that was shared by Deepak Singh. Greg discussed a talk given by Peter Norvig which focusses on the idea that it is better to get a good sized dataset and use very sparing code to get at an answer rather than attempt to get at the answer de novo via complex code. Quoting from the post:
In one of several examples, Peter put up a slide showing an excerpt for a rule-based spelling corrector. The snippet of code, that was just part of a much larger program, contained a nearly impossible to understand let alone verify set of case and if statements that represented rules for spelling correction in English. He then put up a slide containing a few line Python program for a statistical spelling correction program that, given a large data file of documents, learns the likelihood of seeing words and corrects misspellings to their most likely alternative. This version, he said, not only has the benefit of being simple, but also easily can be used in different languages.
What struck me was the connection between being able to write a short, readable snippet of code, and the “well posed question”. The dataset provides the collective intelligence. So is it possible to propose the following?
“A well posed question is one which, given an appropriate dataset, can be answered by easily prepared and comprehensible code”
This could also possibly be turned on its head as “a good programming environment is one in which well posed questions can be readily converted to programs”. But it also raises an important point about how the structure of datasets relates to the questions you want to ask. The challenge in recording data is to structure it in such a way that the widest possible set of questions can be asked of that data. Data models all pre-suppose the kind of questions that will be asked. And any sufficiently general data model will be inefficient for most specific types of query.
Rajarshi Guha and Pierre Lindenbaum have been busy preparing different datastores for the solubility data being generated as part of the Open Notebook Science Challenge announced by Jean-Claude Bradley (more on this later). Rajarshi’s form based input has an SQL backend while Pierre has been working to extract the information as RDF. The point is not that one approach is better than the other, but that we need both, and possibly many more formats - and ideally we need to interconvert between them on the fly. A well posed question can easily founder on an inappropriately structured dataset (this is actually just a rephrasing of the Saunders Principle). It will be by enabling easy conversion between different formats that we might approach a situation where the aphorism I have suggested could become true.
Posted: November 6th, 2008 under Uncategorized, semantics.
Comments: none

darkest Oxfordshire. I am based at
ings that are immediately obvious. The first is that I start a lot of things and don’t necessarily manage to get very far with them and that I do a number of (currently) unrelated things (isolated subgraphs aren’t connected). Also that there are some materials that get widely re-used and some that don’t. There are also clearly things that I haven’t finished entering properly (isolated nodes). Finally, that we need a more sophisticated tool for playing with the view because building a human readable version of the graph will require some manipulation, grabbing subgraphs and moving them around. Welkin is great but after 30 minutes playing I have a bunch of feature requests. But this is what I’ve done so far. I am sure there are many things that can be done with this kind of view - but for the moment what is important is that it is an entirely new kind of way of looking at the record.