Thursday, February 28, 2008

Encyclopedia of Life: An interesting social experience in the sciences

EOL was launched with huge fanfare: the best people in the world got together with a distinguished board of advisers, an award that is huge for biodiversity, and the backing of TED - but it faltered because the enormous interest it generated could not be handled (11.5M hits in the first five hours, according to Spiegel Online).

Insights are starting to trickle out on various blogs, from colleagues who actually looked at EOL before it went down (Rod Page, Vince Smith), and on Slashdot, whose members have a very different view of it.

It is also interesting to observe that the New York Times took down its article on EOL unusually quickly, in my experience.

For all the often very critical comments, nobody doubts that we need an infrastructure like EOL, which has to be an even more sophisticated mashup engine than Google is. Many of us have been working on solutions since the beginning of the Internet, because we all want to have access to this most thrilling knowledge - and to provide access to it for the public.

For me the failed start of EOL is not bad luck, but symptomatic of this approach.

Google has a huge, highly motivated and skilled staff working on its basic tool, the search engine. EOL is still trying to fill some of its very few developer positions. Judging by the comments on Slashdot, its developers are not up to speed with programming and producing adequate code.

EOL has a highly organized structure with directors, various boards and distinguished scientists, overshadowed by EO Wilson. This is the typical autocratic structure, where authorities rule - each page will be authorized by a scientist, and these scientists have, it seems, already blocked off more than a million pages, each about one species. Google has no authorities, but algorithms that do the work.

This structure also does not represent the scientific landscape. Biodiversity is marred by the problem that there are zillions of small databases, most of them with very little usage beyond their few creators. The institutions taking the lead in EOL have done very little to come up with institutional support for creating a more comprehensive system with adequate metadata that would actually be used - similar to the large databases in astrophysics or particle physics. But more importantly, creating content is considered a voluntary contribution by the scientists - and if they don't contribute, the citizen scientists (aka amateurs) will chip in.

Megascience projects are scientific projects. You have a hypothesis you would like to falsify. It starts as a theoretical concept, then experiments are developed and adequate tools are created to run them. The tool can be a satellite or a cyclotron, but the project includes an entire science team that can build and run it, and analyze and publish the findings. It takes years to build up such complex endeavors. It might be worthwhile to study NASA's development of new satellites for science missions. The best scientists in the field are involved, because if there is a flaw, hundreds of millions are lost. There is a data sharing and archiving policy behind it, allowing access to the baseline data.

Satellites and cyclotrons are not built on the strength of a letter from a 'big shot' to MacArthur, without a proper written proposal that has been thoroughly reviewed.

This might work in traditional biological science, where you could nicely summarize a new scientific topic at its very beginning, like the genetic basis of behavior (= sociobiology), and then write a few articles to expand into other areas like sociology; similarly "island biogeography". These are all nice catchwords, sometimes picked up by scientists, sometimes not (Consilience), and sometimes causing a huge disaster (bio-prospecting), which is essentially one of the main reasons why it is extremely complicated to collect species in the wild.

The history of biodiversity informatics is thus littered with many corpses - most of them the result of misguidance, a wrong understanding of how this science works, and the pitfall of wanting to create the mother of all systems - which evidently is linked to a lot of fame.

What is it that EOL wants to do? The goal is to provide authoritative access to species information, and to stimulate research by exposing content and opening it to data mining, which is the current hype about new discoveries.
But who the authorities are is already a political decision - see for example the ant case.

What is science, and what is Wikipedia or public opinion? Science is about citing. If web pages are to be the unit to which only authorized people can contribute, and for free (at least for now), and also the unit that ought to be cited, using a DOI in this case, then this means some radical changes.
What EOL ought to do is to think of innovative ways to measure scientific accomplishment, and implement them at its own institutions. If Harvard came up with a metric for its taxonomists that includes how often a specimen they discovered has been cited and how often an image they produced has been used, then that would have an effect. But neither is this being debated, nor is a system of 'deep citations' in place.

A scientist is measured by her productivity, which is at the moment linked to citation indices, etc. Writing pages and contributing to pages is not part of the system, and thus will not count, and thus contributions are very limited, even though a million-odd pages are reserved already.
What is science? If you can just use a slider to get the information at your level, from basic K-12 to science, then where is the science behind it - or is it just a compilation written by a skilled writer?

What about web publications in taxonomy? The Codes do not allow them. So how do you deal with the 90% of new species still to be found out there? Since the commissioners of those Codes are very sluggish, will that system of Codes and some control on naming just fall apart?

Most of the information on species is published. But there is no vision in EOL on how to access this. Their interpretation of copyright rules is that they have to be very conservative and essentially do not scan anything that is younger than 75 years, unless there is positive evidence that they are authorized to do so. If you consider that we double our knowledge every 10 years, then what do we gain when most of the newer material is not covered?
But even if copyright were not an issue, the current working model at BHL will not be able to deliver the descriptions, since they operate on a volume and page basis, so they even have a problem knowing where a publication begins and ends.
If existing nomenclators or name servers are used, then the original pages could be postulated - but all the many redescriptions would slip through, not to speak of the ca 500,000 species for which there is no nomenclator yet (so in a way, if somebody comes up with a list of 1.2M species names, they are all copies of the same - again, nobody spends the money to catalogue the last 600K species). Discovering species descriptions automatically has so far proved a costly process, since it needs the involvement of specialists. Plazi.org is one such project aiming at solving this issue, but it comes at a cost - which might only be paid for by community involvement. But that again needs access to literature that is promising (not the old stuff that excites few taxonomists), high quality OCR and a legal framework that doesn't scare people off.
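The doubling argument above can be made concrete with a back-of-the-envelope calculation. This is only a sketch: it assumes steady exponential growth with a 10-year doubling time, and the function name share_older_than is made up for the illustration. Under that assumption, the share of today's knowledge that already existed 75 years ago is 0.5 to the power of 75/10.

```python
def share_older_than(cutoff_years: float, doubling_years: float = 10.0) -> float:
    """Fraction of today's knowledge that already existed cutoff_years ago,
    assuming steady exponential growth with the given doubling time
    (an illustrative assumption, not a measured figure)."""
    return 0.5 ** (cutoff_years / doubling_years)


old_share = share_older_than(75)
print(f"older than 75 years:   {old_share:.2%}")
print(f"younger than 75 years: {1 - old_share:.2%}")
```

With these assumptions, well under one percent of what is known today would fall on the safe side of a 75-year cutoff. The real growth curve of the taxonomic literature is certainly less regular, so the number should be read as an order of magnitude only.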

Wednesday, February 27, 2008

The launch of the Encyclopedia of Life

Yesterday the Encyclopedia of Life was launched. No doubt, we need this infrastructure in one way or the other.

The question, though, will be: do we need the content they think is the right one, or do we need the potential underlying infrastructure which assembles all the relevant information for us? Google's algorithms are what is called for, not the old authorities once again. What is called for is not talk but deliverables, and that means what is on the Web and, in a scientific context, what can be cited and is thus openly accessible. And once things can be cited, we know what the community considers important.

Unfortunately, EOL is already tied to one single man, EO Wilson. Though he wrote the now widely cited article in TREE in 2003, EOL was not his idea but came out of a meeting including Smithsonian scientists. It would thus be better to tie EOL to the community than to one single man, especially given Wilson's track record as an advocate of copyright for descriptive work: even after five years his Pheidole revision is still copyrighted. He also never joined the movement to make taxonomic descriptions open access, similar to what happened with gene sequences and subsequently turned out to be a huge scientific success.

The last two paragraphs in the New York Times coverage of EOL just show the problematic, authoritarian position of EOL and its founders: "... he and other ant experts will be meeting at Harvard to plan how they can take advantage of the Encyclopedia of Life." There are enough tools out there to measure what's relevant in our science; we do not need to go back to the old pre-open access and pre-e-publication period, where other opinions could simply be suppressed by not citing them.

So, why then should Harvard, with a dismal track record on the Web, and furthermore being the anachronistic champion of producing copyrighted and non-open access material, be given the lead by EOL to produce authoritative content on ants? Since 2003, neither Wilson's Pheidole, which even made it into Nature online because of its copyright issues, nor the new Bolton Catalogue has been put online, whilst there is a huge community out there using the existing Internet based resources on ants.
Plazi.org is just the latest one, which already provides access to well over 3,500 descriptions of ants, and for the first time provides a platform for anyone to get a first glimpse of and entry to what is known about a species, as much as it shows the power of using LSIDs and other standards to link to external databases. Most of the ant systematics literature is also online (>4,000 pdfs) on antbase.org, some of it paid for by a grant from the Smithsonian Institution. There are 184,479 records of ant specimens available through GBIF, but none from Harvard.

But there is also a huge, growing community of ant taxonomists in the South. More than 300 people, among them many taxonomists, attended last November's Simpósio de Mirmecologia in São Paulo - but nobody from Harvard. Their publications on Neotropical ants are online, and they are working on a new electronic catalogue of the ants of the world. There is a growing community in Asia (ANeT) with their bi-annual meeting just held in India. All this is hardly mentioned in Ward's recent summary on ant taxonomy in Zootaxa (Ward being closely allied to Harvard):
"The literature on ant taxonomy is highly dispersed, however, and sometimes difficult to locate. Bolton’s (2003) monograph on ant classification provides an excellent entrée into this literature, including identification guides and keys. Ant identification resources are becoming increasingly available online, through sites such as AntWeb (www.antweb.org), Antbase (www.antbase.org), Australian Ants Online (www.ento.csiro.au/science/ants), Ants of Costa Rica (http://academic.evergreen.edu/projects/ants/AntsofCostaRica.html) and Japanese Ant Image Database (http://ant.edb.miyakyo-u.ac.jp/E). Several technological developments hold the promise of facilitating ant species-level taxonomy. These include improvements in imaging (e.g., Automontage system), specimen measurement, distribution mapping, and electronic organization of data."

What we need are resources, especially funding, to help image all types, open up all the literature and provide platforms like Scratchpads that allow us to assemble our systematics information. If this is done well, including underlying markup and LSIDs, then it can be a source for EOL, acknowledging that we systematists deliver only a fraction of what is known about a species.

We need co-operation, not the building of further divides in a highly fractured community. EOL should provide the tools, not politics, and those tools should decide what is relevant and what is not, who has something to say and who does not.