Thursday, February 28, 2008

Encyclopedia of Life: An interesting social experiment in the sciences

EOL launched with huge fanfare: the best people in the field, a distinguished board of advisers, a huge award for biodiversity, and the backing of TED. Yet it faltered because the enormous interest it generated could not be handled: 11.5M hits in the first five hours (Spiegel Online).

Insights are starting to trickle out on blogs by colleagues who actually looked at EOL before it went down (Rod Page, Vince Smith), and on Slashdot, whose members take a very different view of it.

It is also interesting that the New York Times took down its article on EOL unusually quickly, in my experience.

For all the often very critical comments, nobody doubts that we need an infrastructure like EOL, one that has to be an even more sophisticated mashup engine than Google. Many of us have been working on solutions since the beginning of the Internet, because we all want access, and want to provide the public with access, to this most thrilling knowledge.

For me, the failure of the EOL launch is not bad luck but symptomatic of the approach.

Google has a huge, highly motivated, and skilled staff working on its basic tool, the search engine. EOL is still trying to fill some of its very limited developer positions. Judging from the comments on Slashdot, they are not up to speed in programming and producing adequate code.

EOL has a highly organized structure with directors, various boards, and distinguished scientists, overshadowed by E. O. Wilson. This is the typical autocratic structure, where authorities rule: each page will be authorized by a scientist, and these scientists have, it seems, already reserved more than a million pages, each about one species. Google has no authorities, only algorithms that do the work.

This structure also does not represent the scientific landscape. Biodiversity is marred by the problem that there are zillions of small databases, most of them with very little usage beyond their few creators. The institutions taking the lead in EOL have done very little to build institutional support for a more comprehensive system with adequate metadata that would actually be used, similar to the large databases in astrophysics or particle physics. More importantly, creating content is considered a voluntary contribution by the scientists, and if they don't contribute, the citizen scientists (aka amateurs) are expected to chip in.

Megascience projects are scientific projects. You have a hypothesis you would like to falsify. It starts as a theoretical concept; then experiments are developed and adequate tools are created to run them. The tool can be a satellite or a cyclotron, but it includes an entire science team that can build, run, analyze, and publish the findings. It takes years to build up such complex endeavors. It might be worthwhile to study how NASA develops new satellites for science missions. The best scientists in the field are involved, because if there is a flaw, hundreds of millions are lost. Behind it all is a data sharing and archiving policy allowing access to the baseline data.

Satellites and cyclotrons are not built on the strength of a letter from a 'big shot' to MacArthur, without a properly written and thoroughly reviewed proposal.

This might work in traditional biological science, where at the very beginning of a new scientific topic you could write a nice summary, like the genetic basis of behavior (sociobiology), and then a few articles expanding into other areas such as sociology; similarly "island biogeography". These are all nice catchwords, sometimes picked up by scientists, sometimes not (Consilience), and sometimes causing a huge disaster (bio-prospecting, which is essentially one of the main reasons why it is now extremely complicated to collect species in the wild).

The history of biodiversity informatics is thus littered with corpses, most of them victims of misguidance, a wrong understanding of how this science works, and the pitfall of wanting to create the mother of all systems, which evidently comes with a lot of fame attached.

What is it that EOL wants to do? The goal is to provide authoritative access to species information and to stimulate research by exposing content and opening it to data mining, the current hype for new discoveries.
But who the authorities are is already a political decision; see, for example, the ant case.

What is science, and what is Wikipedia or public opinion? Science is about citing. If web pages become the element to which only authorized people can contribute, for free (at least for now), and the element that ought to be cited, in this case using a DOI, then this means some radical changes.
What EOL ought to do is think of innovative ways to measure scientific accomplishment and implement them at its own institutions. If Harvard came up with a metric for its taxonomists that included how often a specimen they discovered has been cited, or how often an image they produced has been used, that would have an effect. But neither is this being debated, nor is a system of 'deep citations' in place.
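To make the idea concrete, here is a minimal sketch of what such a 'deep citation' metric could look like. The function name and the weights are entirely hypothetical illustrations, not an EOL or Harvard scheme; the point is only that specimen citations and image reuse could be counted alongside paper citations.

```python
# Illustrative sketch only: one possible way to fold specimen citations
# and image reuse into a single productivity score. All names and
# weights here are hypothetical assumptions, not an existing metric.
def deep_citation_score(paper_citations, specimen_citations, image_uses,
                        w_paper=1.0, w_specimen=0.5, w_image=0.25):
    # A simple weighted sum; real metrics would need field normalization.
    return (w_paper * paper_citations
            + w_specimen * specimen_citations
            + w_image * image_uses)

# A taxonomist with 40 paper citations, 12 specimen citations,
# and 80 image reuses:
print(deep_citation_score(40, 12, 80))  # → 66.0
```

However crude, even a sketch like this shows that the data needed (specimen and image usage counts) would itself require an infrastructure that does not yet exist.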

A scientist is measured by her productivity, which at the moment is linked to citation indices, etc. Writing pages and contributing to pages is not part of the system and thus will not count, and thus contributions will be very limited, even though a million-odd pages are already reserved.
What is science? If you can just use a slider to get the information at your level, from basic K-12 to science, then where is the science behind it? Or is it just a compilation written by a skilled writer?

What about web publications in taxonomy? The Codes do not allow them. So how do you deal with the 90% of new species still to be found out there? Since the commissioners of those Codes are very sluggish, will the system of Codes and some control over naming just fall apart?

Most of the information on species is published, but EOL has no vision of how to access it. Their interpretation of copyright rules is very conservative: essentially, nothing younger than 75 years is scanned unless there is positive evidence that they are authorized to do so. If you consider that we double our knowledge every 10 years, what do we get when most of the newer material is not covered?
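The doubling claim makes the consequence of a 75-year cutoff easy to quantify. A back-of-the-envelope calculation, taking the post's 10-year doubling period at face value:

```python
# Back-of-the-envelope: if the literature doubles every 10 years,
# what share of today's knowledge predates a 75-year copyright cutoff?
# The 10-year doubling period is the post's own assumption.
doubling_period = 10   # years
cutoff_age = 75        # years

fraction_older = 2 ** (-cutoff_age / doubling_period)
print(f"share older than {cutoff_age} years: {fraction_older:.1%}")
# → share older than 75 years: 0.6%
```

In other words, under this assumption, scanning only pre-cutoff works would cover well under one percent of the accumulated literature.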
But even if copyright were not an issue, the current working model at BHL would not be able to deliver the descriptions, since they operate on a volume and page basis, and so have trouble even knowing where a publication begins and ends.
If existing nomenclators or name servers are used, the original pages could be postulated, but all the many redescriptions would slip through, not to speak of the ca. 500,000 species for which there is no nomenclator entry yet (so, in a way, if somebody comes up with a list of 1.2M species names, they are all copies of the same lists; again, nobody spends the money to catalogue the last 600K species). Automatically discovering species descriptions has so far proved a costly process, since it needs the involvement of specialists. One project is aiming at solving this issue, but it comes at a cost, which might only be covered through community involvement. And that again needs access to promising literature (not the old stuff that excites few taxonomists), high-quality OCR, and a legal framework that doesn't scare people off.
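The name-server approach described above can be sketched in a few lines: scan OCR text for candidate binomials and keep only those confirmed by a nomenclator. The name list and page text here are hypothetical stand-ins; a real pipeline would query an actual name server and would still miss redescriptions, OCR errors, and abbreviated genus names.

```python
import re

# Minimal sketch, assuming a tiny in-memory name list in place of a
# real nomenclator or name server.
nomenclator = {"Formica rufa", "Lasius niger"}  # hypothetical stand-in

def find_names(ocr_text):
    # Naive binomial pattern: capitalized genus + lowercase epithet.
    # This already fails on "F. rufa" or OCR-garbled spellings.
    candidates = re.findall(r"\b[A-Z][a-z]+ [a-z]{3,}\b", ocr_text)
    return [c for c in candidates if c in nomenclator]

page = "A redescription of Formica rufa Linnaeus, 1761, from alpine habitats."
print(find_names(page))  # → ['Formica rufa']
```

Even this toy version shows the limits: only names already in the nomenclator are found, which is exactly why the missing 500,000-odd species fall through the net.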

