On Reproducibility

I just got back from a fascinating one-day workshop on “Data and Code Sharing in Computational Sciences” that was organized by Victoria Stodden of Yale Law School’s Information Society Project. The workshop had a wide-ranging collection of contributors, including representatives of the computational and data-driven science communities (everything from Astronomy and Applied Math to Theoretical Chemistry and Bioinformatics), intellectual property lawyers, the publishing industry (Nature Publishing Group and Seed Media, but no society journals), foundations, funding agencies, and the open access community. The general recommendations of the workshop will be closely aligned with open science principles, as any meaningful definition of reproducibility requires public access to the code and data.

There were some fascinating debates at the workshop on foundational issues: What does reproducibility mean? How stringent a reproducibility test should be required of scientific work? Reproducible by whom? Should resolution of reproducibility problems be required for publication? What are good roles for journals and funding agencies in encouraging reproducible research? Can we agree on a set of reproducible-science guidelines that we can encourage our colleagues and scientific communities to adopt?

Each of the attendees was asked to prepare a thought piece on the subject, and I’ll be breaking mine down into a series of single-topic posts over the next few days and weeks.

The topics are roughly:

  • Being Scientific: Falsifiability, Verifiability, Empirical Tests, and Reproducibility
  • Barriers to Computational Reproducibility
  • Data vs. Code vs. Papers (they aren’t the same)
  • Simple ideas to increase openness and reproducibility

Before I jump in with the first piece, I thought it would be helpful to jot down a minimal idea about science that most of us can agree on, which is “Scientific theories should be universal”. That is, multiple independent scientists should be able to subject these theories to similar tests in different locations, on different equipment, and at different times and get similar answers. Reproducibility of scientific observations is therefore going to be required for scientific universality. Once we agree on this, we can start to figure out what reproducibility really means.

Posted in Conferences, Open Data, open science, Policy, Science | 3 Comments

Sad news about Warren DeLano

I just heard the sad news about Warren DeLano, one of the giants of open source scientific software and the author of PyMOL. Warren passed away suddenly a few days ago. Like everyone else, I’m stunned and saddened by this news. For those of you who don’t know, PyMOL is a fantastic piece of software that has produced some of the highest-quality journal covers of the past decade. I met Warren only a few times, but over the years we had a few fantastic conversations about Jmol, PyMOL, and how to make open source sustainable in the scientific world. I’ll miss his voice and his contributions to the community.

Warren’s family has created an “In Memoriam” page and blog. Please share your memories there!

Posted in open science, Science, Software | Leave a comment

Debellor

Debellor is an open source Java framework for scalable data mining and machine learning. You can implement your own algorithms within Debellor’s architecture and combine them with pre-existing ones to create complex data-processing networks and experimental setups. Debellor’s distinctive feature is data streaming, which enables efficient processing of massive data sets and is essential for the scalability of algorithms.
Find Debellor at: http://www.debellor.org
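
Debellor’s actual Java API isn’t reproduced here, but the data-streaming idea itself is easy to sketch: each stage of a processing network consumes samples one at a time and passes results downstream, so the full data set never has to sit in memory at once. Here is a minimal, hypothetical Python illustration (none of these names come from Debellor):

    # A hypothetical sketch of the data-streaming idea: each stage
    # consumes one sample at a time and yields results downstream.
    def read_samples(path):
        """Stream numeric samples from a file, one line at a time."""
        with open(path) as f:
            for line in f:
                yield float(line)

    def standardize(samples, mean, std):
        """A processing stage: transform each sample as it flows through."""
        for x in samples:
            yield (x - mean) / std

    def label(samples, cutoff=0.0):
        """A terminal stage: classify each standardized sample."""
        for x in samples:
            yield 1 if x > cutoff else 0

    # Stages compose into a processing network; data streams through it:
    # labels = label(standardize(read_samples("data.txt"), mean=0.0, std=1.0))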

Posted in Knowledge Discovery and Data Mining | Leave a comment

Science Laboratory Inventory

Science Laboratory Inventory is a web-based system for managing chemical and equipment inventories and for handling science-lab orders from teachers to technicians for their practical sessions.
Find Science Laboratory Inventory at: http://sciencelabinv.sourceforge.net

Posted in Tools | Leave a comment

Webolab

Webolab is a web-based platform for collaborative research and an interface to high-performance computing, intended as a tool for implementing virtual labs. It supports the Condor cluster management engine.
Find Webolab at: http://www.webolab.net

Posted in Tools | Leave a comment

OpenOpt

OpenOpt is a universal, cross-platform (Windows, Linux, Mac OS) numerical optimization toolbox written in Python and released under the new BSD license. Supported problem classes include NLP, LP, QP, SDP, SOCP, DFP (nonlinear data fit), NSP (nonsmooth), MILP, LSP, LLSP, MMP, GLP, and MINLP, among others. It connects to dozens of solvers (some written in C or Fortran), provides graphical output of convergence along with other numerical optimization “must have” features, and can use automatic differentiation of FuncDesigner (http://openopt.org/FuncDesigner) models.
Find OpenOpt at: http://openopt.org
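
For a sense of the interface, here is a minimal sketch of solving a small nonlinear program; it assumes a standard OpenOpt installation, and the objective function is invented purely for illustration:

    # Minimize the Rosenbrock function with OpenOpt (assumes OpenOpt
    # and its NumPy dependency are installed).
    from openopt import NLP

    f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
    p = NLP(f, x0=[0.0, 0.0])

    # 'ralg' is one of OpenOpt's bundled solvers; other solvers can be
    # selected by name if the corresponding packages are installed.
    r = p.solve('ralg')
    print(r.xf, r.ff)  # solution vector and objective value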

Posted in Optimization | Leave a comment

What, exactly, is Open Science?

I was recently asked to define what Open Science means. It would have been relatively easy to fall back on a litany of “Open Source, Open Data, Open Access, Open Notebook”, but these are just shorthand for four fundamental goals:

  • Transparency in experimental methodology, observation, and collection of data.
  • Public availability and reusability of scientific data.
  • Public accessibility and transparency of scientific communication.
  • Using web-based tools to facilitate scientific collaboration.

The idea I’ve been most involved with is the first one, since granting access to source code is really equivalent to publishing your methodology when the kind of science you do involves numerical experiments. I’m an extremist on this point, because without access to the source for the programs we use, we rely on faith in the coding abilities of other people to carry out our numerical experiments. In some extreme cases (e.g., when simulation codes or parameter files are proprietary or are hidden by their owners), numerical experimentation isn’t even science. A “secret” experimental design doesn’t give skeptics the ability to repeat (and hopefully verify) your experiment, and the same is true of numerical experiments. Science has to be “verifiable in practice” as well as “verifiable in principle”.

In general, we’re moving towards an era of greater transparency in all of these topics (methodology, data, communication, and collaboration). The problems we face in gaining widespread support for Open Science are really about incentives and sustainability. How can we design or modify the scientific reward systems to make these four activities the natural state of affairs for scientists? Right now, there are some clear disincentives to participating in these activities. Scientists are people, and we’re motivated by most of the same things as normal people:

  • Money, for ourselves, for our groups, and to support our science.
  • Reputation, which is usually (but not necessarily) measured by citations, h-indices, download counts, placement of students, etc.
  • Sufficient time, space, and resources to think and do our research (which is, in many ways, the most powerful motivator).

Right now, the incentive network that scientists work under seems to favor “closed” science. Scientific productivity is measured by the number of papers in traditional journals with high impact factors, and the importance of a scientist’s work is measured by citation count. Both of these measures help determine funding and promotions at most institutions, and doing open science is either neutral or damaging by these measures. Time spent cleaning up code for release, setting up a microscopy image database, or writing a blog is time spent away from writing a proposal or paper. The “open” parts of doing science just aren’t part of the incentive structure.

Michael Faraday’s advice to his junior colleague, “Work. Finish. Publish.”, needs to be revised. It shouldn’t be enough to publish a paper anymore. If we want open science to flourish, we should raise our expectations to: “Work. Finish. Publish. Release.” That is, your research shouldn’t be considered complete until the data and metadata are put up on the web for other people to use, until the code is documented and released, and until the comments start coming in on the blog post announcing the paper. If our general expectations of what it means to complete a project are raised to this level, the scientific community will start doing these activities as a matter of course.

If you meet a scientist who tells you that they did a fantastic experiment and have wonderful data, you naturally ask them to email you a reprint. Any working scientist would be perplexed if the response was: “Oh, I’m not going to be writing this work up for publication.” It would be absolute nonsense in the culture of science not to publish a report in a journal on the work you have done. And yet, no one seems surprised when scientists are too busy or too secretive to release their data to the community. We should be just as perplexed by this. Instead of complaining about the reward and incentive systems, we should be setting the standard higher: “What do you mean you haven’t got around to putting your data on the web? You aren’t done yet!” Or: “How can I possibly review this paper if I can’t see the code they were using? There’s no way for me to tell if they did the calculation right.” We’re going to have to raise the expectations on completing a scientific project if we want to change the culture of science.

Posted in Open Data, open science, Policy, Science | 56 Comments

Adevs

Adevs (A Discrete EVent System simulator) is a C++ library for constructing discrete event simulations based on the Parallel DEVS and Dynamic DEVS (dynDEVS) formalisms.
Find Adevs at: http://www.ornl.gov/~1qn/adevs
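
Adevs itself is C++, and the DEVS formalisms add considerably more structure (atomic models with internal and external transitions), but the core discrete-event idea, a clock that jumps from one scheduled event to the next instead of ticking uniformly, is easy to sketch. Here is a tiny, hypothetical Python illustration (none of these names come from Adevs):

    # A minimal discrete event loop: pop events in time order; each
    # handler may schedule further events. Purely illustrative.
    import heapq

    def simulate(initial_events, end_time):
        queue = list(initial_events)
        heapq.heapify(queue)
        while queue:
            time, name, handler = heapq.heappop(queue)
            if time > end_time:
                break
            print(f"t={time:.2f}: {name}")
            for event in handler(time):
                heapq.heappush(queue, event)

    # Example: each arriving job finishes 1.5 time units later.
    def arrival(t):
        return [(t + 1.5, "job done", lambda t2: [])]

    simulate([(0.0, "job arrives", arrival), (2.0, "job arrives", arrival)], 10.0)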

Posted in Simulation and Modeling | Leave a comment

Saros: Distributed Pair Programming

[Image: the PairOn pair-programming chair]

I’m a big fan of pair programming, which is one of the primary modes of software development in my research group. Usually, two people sitting together can spot errors that one alone can’t, and the pace of coding and debugging is often much higher than when the same two people sit separately. I don’t know if my graduate students are as appreciative of this technique as I am — how many students want their advisor right next to them for the entire afternoon, taking over their keyboard, and seeing all the IM requests coming across the screen? But as a researcher, I find it gives me a much greater feel for what we’re actually doing in the lab, sort of like a small group meeting where we’re both looking at the same data or the same plot. It would be great if there were a way to separate the pair programming from the “sitting in the same cramped cubicle” part of the equation.

Christopher Oezbek just let us know about a cool open source Eclipse plugin called Saros. It lets two people sitting in different locations collaboratively edit and work on the same project. I’ve seen similar things in the editor SubEthaEdit (which is not open source), but Saros lets two programmers do this at the project level (with multiple files open), not just at the file level. It looks like a very cool tool for avoiding those overcrowded cubicles (or the famous PairOn chair pictured above).

Saros is listed in our Software Engineering and Tools sections.

Posted in Science, Software | 1 Comment

Saros

Saros is a research project to enable distributed pair programming (also called remote pair programming) in the Java IDE Eclipse. Saros supports real-time collaboration among two or more peers and adds many features to increase awareness of each peer’s presence and activity.
Find Saros at: http://www.saros-project.org/

Posted in Software Engineering, Tools | Leave a comment