Resources

What makes a research project reproducible is not a simple question… Nonetheless, I do believe that one should be able to reproduce one’s own analysis without pain, even in the future. This may sound obvious, but it is not so easy to achieve in practice!

Here are some tips for students, based on my iterations toward more reproducible practices:

  • Start from the beginning: although it may sometimes feel like a waste of time, setting things up so that you can readily redo your analysis will make your life as a researcher easier (and at the very least help you rerun chunks of earlier work).

  • For each usual task (data analysis, plotting, citing references), you should master one tool down to its dirty details.

  • Relying on text-based formats (e.g. Markdown, LaTeX, CSV) is critical in order to be able to use version control to maintain your code, to write manuscripts, etc. This may indeed guide your choice of tools. A good starting point to learn version control with git is the Software Carpentry tutorial. GitHub’s documentation provides help on more advanced topics.

handling data

  • For data analysis and plotting, I very much enjoy (most of the time!) working with R’s tidyverse, in particular dplyr and ggplot2. If you are new to R or to data analysis from the command line, the companion book R for Data Science (available online) is the best introduction you can dream of. My advice: study sections 2 to 8 thoroughly; the later ones will be useful to go deeper on specific topics based on your needs.
    Hint: if you need to speed up your analysis with dplyr, have a look at its parallelized counterpart multidplyr.

  • Don’t overlook RStudio’s cheatsheets!

  • Follow a well-established coding style guide. If you don’t know which one to pick, use the lintr package (along with styler for existing code) to follow the tidyverse’s style guide.

  • Follow simple guidelines when recording data in spreadsheets.

  • Use regular expressions whenever you can. Regexes are great, regexes are tough, and regexes are poorly taught (if at all!) unless you have a computer science background: luckily Damian Conway’s presentations are eye-opening (e.g. this 50-minute video) and there is a great cheatsheet for R. You will also want to see what you are doing, e.g. within RStudio using the RegExplain addin or online with RegExr.com.

  • To share large datasets, Zenodo is a great (free) service. If you use another one, make sure that your dataset gets a DOI.
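
To make the regular-expression tip above more concrete, here is a minimal sketch, written in Python only to keep the example self-contained (the idea carries over to R, although the syntax for named groups differs). It pulls metadata out of data file names. The naming scheme (date_strain_repN.csv), the file names and the function parse_filename are all hypothetical, made up for illustration.

```python
import re

# Hypothetical file names; the naming scheme is an assumption for illustration.
filenames = [
    "2019-03-14_strainA_rep2.csv",
    "2019-03-15_strainB_rep1.csv",
    "notes.txt",
]

# One named group per piece of metadata encoded in the file name.
pattern = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})_"   # date written as YYYY-MM-DD
    r"(?P<strain>[A-Za-z0-9]+)_"       # strain label (letters and digits only)
    r"rep(?P<replicate>\d+)\.csv$"     # replicate number, then the extension
)


def parse_filename(name):
    """Return a dict of metadata, or None if the name does not fit the scheme."""
    match = pattern.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["replicate"] = int(fields["replicate"])
    return fields


# Names that do not follow the scheme (here "notes.txt") yield None.
records = [parse_filename(name) for name in filenames]
```

Spelling the pattern out one group per line, with a comment on each, is one way to keep a regex readable when you come back to it months later.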

handling text

  • Despite LaTeX’s popularity in quantitative fields, I believe that the time is ripe to leave it to advanced editing where microtypography matters… Simpler syntaxes (in particular Markdown) are sufficient for literate data analysis (e.g. with R Markdown or R notebooks) and even for more advanced tasks like writing a dissertation or an article; I have put online a template to render a manuscript and its companion supplementary material with cross-references. Whatever format you choose to rely on, don’t miss that pandoc is an incredibly powerful conversion tool between most formats (.md, .tex, .rtf, .docx, etc).
    Hint: the Markdown converter used by RStudio (pandoc, with pandoc-citeproc) is able to handle citations just like BibTeX would (and in fact more simply!).

  • For storing and citing articles, Zotero is the most versatile open-source software.

handling DNA sequences

  • Benchling is the 21st-century sequence editor to design and keep track of your molecular biology experiments: design primers, align sequencing chromatograms, test your next cloning in silico. It even has an integrated lab notebook!
    Big drawback: your data must be hosted on their servers…

handling microscopy data

academia survival kit

  • On research integrity: I was a founder of the Scientific Red Cards initiative (discontinued). Related issues are currently addressed by Retraction Watch and PubPeer.

  • Need a practical guide to research integrity? I find the guidelines of Piet Borst among the most insightful! In particular, his “advice is to be generous towards your colleagues and tough towards yourself. Nearly everybody has the tendency to overestimate his own contribution to the research of others and underestimate the contribution of other people to his own work.” Probably worth meditating on every time one is upset by the status of a collaboration…