ARANGS15

Automated and reproducible analysis of NGS data

Downloadable poster in PDF

   IMPORTANT DATES for this Course
   Deadline for applications: May 7th 2015 (NEW)
   Latest notification of acceptance: May 7th 2015
   Course date: May 11th - May 15th 2015

Candidates with an adequate profile will be accepted within 72 hours of applying, until we reach 20 participants.

Instructors:

Rutger Vos studied biology at the University of Amsterdam, where he graduated in 2000. He then embarked on his PhD research under Professor Arne Mooers at Simon Fraser University in Vancouver, Canada, where he defended his thesis on phyloinformatic problems in 2006. As a self-taught programmer he then became involved in several open-source scientific software development projects while continuing his research career through a postdoctoral fellowship at the University of British Columbia (Vancouver, Canada) and a Marie Curie research fellowship at the University of Reading (Reading, UK). In the spring of 2012 he took up the position of bioinformaticist at the Naturalis Biodiversity Center, a role in which he combines novel research with bioinformatics contributions to various research programmes within the organization. In his spare time he also contributes to various open-source software projects (TreeBASE, Bio::Phylo, NeXML) and is co-PI of the PhyloTastic project. In addition, Rutger has taught bioinformatics workshops in the US, Japan, China and Kenya, and three times before at GTPB (PHYLOINF09, ARANGS12 and ARANGS13).

Affiliation: Naturalis Biodiversity Center, Leiden, the Netherlands

Darin London studied biology at Texas Tech University in the United States, graduating with a Master of Science in 1999. He has been providing computational support for biological research for 12 years, starting at GlaxoSmithKline in 2000, then moving to the European Bioinformatics Institute to work on the BioMart system, and finally joining the Institute for Genome Sciences and Policy at Duke University, Durham, USA in 2005. He has developed automated analysis pipelines to help researchers analyze over 25Tb of vertebrate NGS data (GAII and HiSeq), mostly for research on the Encyclopedia of DNA Elements (ENCODE). Darin has taught numerous workshops on programming and automation with the Perl programming language, including the phyloinformatics workshops at NESCent, Durham, USA, and at GTPB (PHYLOINF09, ARANGS12 and ARANGS13).

Affiliation: Institute for Genome Sciences and Policy, Duke University Medical Center, Durham (NC), USA.

Course Description

Introduction

Next generation sequencing (NGS) technologies for DNA have produced an ever larger deluge of data. Researchers are learning that analyzing the data efficiently requires the creation of sophisticated pipelines, typically using commandline tools in a Linux or other open-source Unix-like compute environment. Many researchers have created such pipelines to successfully analyze their data. Now they are faced with the challenge of making these pipelines available to their colleagues. Reproducibility has emerged as a major issue (Challenges in irreproducible research, Nature), as researchers, peer reviewers, and even pharmaceutical companies discover that the software and data used to produce a particular research finding are either not available, poorly documented, or targeted to specific compute infrastructures that are not available to the wider research community. To remedy this, funding agencies and journals are creating policies to promote software reproducibility.

In this brief workshop we will establish several best practices of reproducibility in the (comparative) analysis of data obtained by NGS. In doing so we will encounter the commonly used technologies that enable these best practices by working through use cases that illustrate the underlying principles. Building on an existing pipeline of commandline utilities, we will illustrate how the entire compute environment used to run the pipeline can be packaged into a unit that can be shared with other researchers, such that they can make full use of the environment on their own machines or on standard cloud compute environments such as Amazon or Google.

Best practices

  • Commandline scripting of analysis steps
  • Provisioning systems to standardize software environment requirements
  • Packaging of compute environment into static, portable units
  • Sharing of compute environment packages

Technologies

  • Next generation sequencing platforms
  • Command-line executables, command-line scripting and batching
  • Provisioning Systems: Puppet, Dockerfile
  • Virtualization with VirtualBox and Vagrant
  • Containerization with Docker

Target audience

This course is aimed at researchers who have developed pipelines to analyze NGS data and who, faced with new reproducibility requirements, would like to learn how to package their analysis pipeline in a reproducible (and shareable) way. The course will start with a very basic NGS pipeline that runs in a Linux commandline environment and develop it into two packages that can be shared with, and used by, other researchers. The ideal attendee is a scientist who is already comfortable developing scripted pipelines on the commandline, or who is not afraid to get their hands dirty acquiring the computer-literacy skills needed for the informatics side of data analysis.

Pre-requisites

The course assumes that attendees are not intimidated by the prospect of gaining experience working on UNIX-like operating systems (including the shell and shell scripting). Attendees should understand some of the science behind high-throughput DNA sequencing and sequence analysis, as we will not go deeply into the underlying theory (or the mechanics of given algorithms, for example). What will be taught are technical solutions for automating such analyses and sharing them in reusable compute environments, which will include (but are not limited to) beginner-level programming and basic Linux provisioning. General computer literacy (e.g. editing plain text data files, navigating the command line) will be assumed.

Detailed Program

Instituto Gulbenkian de Ciência,

Apartado 14, 2781-901 Oeiras, Portugal

GTPB Homepage

IGC Homepage

Last updated:  Mar 22nd 2015