I just got back from a fascinating one-day workshop on “Data and Code Sharing in Computational Sciences” that was organized by Victoria Stodden of the Yale Internet Society Project. The workshop had a wide-ranging collection of contributors, including representatives of the computational and data-driven science communities (everything from Astronomy and Applied Math to Theoretical Chemistry and Bioinformatics), intellectual property lawyers, the publishing industry (Nature Publishing Group and Seed Media, but no society journals), foundations, funding agencies, and the open access community. The general recommendations of the workshop are going to be closely aligned with open science suggestions, as any meaningful definition of reproducibility requires public access to the code and data.
There were some fascinating debates at the workshop on foundational issues: What does reproducibility mean? How stringent a reproducibility test should be required of scientific work? Reproducible by whom? Should resolution of reproducibility problems be required for publication? What are good roles for journals and funding agencies in encouraging reproducible research? Can we agree on a set of reproducible science guidelines that we can encourage our colleagues and scientific communities to take up?
Each of the attendees was asked to prepare a thought piece on the subject, and I’ll be breaking mine down into a couple of single-topic posts over the next few days and weeks.
The topics are roughly:
- Being Scientific: Falsifiability, Verifiability, Empirical Tests, and Reproducibility
- Barriers to Computational Reproducibility
- Data vs. Code vs. Papers (they aren’t the same)
- Simple ideas to increase openness and reproducibility
Before I jump in with the first piece, I thought it would be helpful to jot down a minimal idea about science that most of us can agree on: “Scientific theories should be universal.” That is, multiple independent scientists should be able to subject these theories to similar tests in different locations, on different equipment, and at different times, and get similar answers. Reproducibility of scientific observations is therefore a requirement for scientific universality. Once we agree on this, we can start to figure out what reproducibility really means.
I’m a big fan of pair programming, which is one of the primary modes of software development in my research group. Usually, two people sitting together can spot errors that one alone can’t, and the pace of coding and debugging is often much faster than when two people sit separately. I don’t know if my graduate students are as appreciative of this technique as I am — how many students want their advisor right next to them for an entire afternoon, taking over their keyboard, and seeing all the IM requests coming across the screen? But as a researcher, I find it gives me a much greater feel for what we’re actually doing in the lab, sort of like a small group meeting where we’re both looking at the same data or the same plot. It would be great if there were a way to separate the pair programming from the “sitting in the same cramped cubicle” part of the equation.