Open Data in Science

Peter Murray-Rust (University of Cambridge)

14:00 Wednesday 16 May
Open data, Amphitheatre C
Chair: Matt Biddulph (hackdiary.com)
Science is increasingly based on the re-use of existing published data. Traditionally such data have been published with primary journal articles, either within the “fulltext” or attached as supplementary information. In some disciplines (biosciences, crystallography) there has been strong community pressure to publish data in open, machine-accessible form, either into data centres (e.g. bioinformatics institutes) or as supplemental data. Where this is accepted practice, data mining and text mining have generated new areas of knowledge-driven science. It is particularly valuable to be able to link data from different disciplines (e.g. biological function and molecular structure). In principle the new web technologies can access distributed data and create syntheses from which new insights arise.

However, the successful areas are the exception. Most publishers make no effort to encourage the machine-readable publication of data, and several actively oppose it through restrictive licenses, copyright claims and bans on robotic downloads. For example, in chemistry, Open databases have been resisted by publishers on the basis that they are a commercial challenge. Many publishers require authors to hand over copyright on data, even though it can be argued that data are facts and so not subject to copyright.

There are encouraging signs of progress: the ALPSP and STM publishers have argued that data (as opposed to fulltext) should be Open, and funders such as Wellcome are requiring not only Open Access to text but also Creative Commons licenses enabling re-use of data. The development of Science Commons is also very timely.

The presentation will advocate the following:
publishers should adopt a positive policy of making scientific data openly available and removing restrictions on its re-use.
authors, editors and publishers should recognise the value of publications in semantic form (“machine-understandable”).
funders should require data to be semantic and open.
authors should deposit data in institutional repositories.
Demonstrations: The presentation will show the value of extracting semantic data from publishers' sites (including by text-mining) using automated high-throughput spidering.

References: All relevant references can be found in: http://en.wikipedia.org/wiki/Open_Data.