Genomics at scale – building reproducible, scalable workflows for genome-based medicine


May 6, 2020

Contemporary computational genomics is inseparable from the technical hurdles that arise when handling data volumes measured in petabytes. Nevertheless, the sheer volume of data is just the tip of the iceberg. In clinical applications, the true challenge stems from both the nature of the processed data and the purpose behind processing it. In a clinical context, analyses are useful only if their results meet rigorously defined standards and can be delivered on time.

Scalability to tens of thousands of samples is a key issue in genetics. To achieve it, access to scalable computing power is not enough. It is also important to automate the process, including variant prioritization, variant description, and reporting, following recommendations from bodies like the Association for Molecular Pathology and the American College of Medical Genetics and Genomics.

Reproducibility and precise versioning are crucial requirements according to the College of American Pathologists (CAP). The requirements of scalability and pipeline version traceability are closely intertwined. If a pipeline is precisely versioned and complete, it is ready to run in the cloud without supervision by a bioinformatician. Versioning also makes it possible to manage upgrades and documentation, both of which are covered by the CAP NGS Laboratory Requirements.

The ecosystem of tools for scalable genomics is growing rapidly. In recent years, several languages have appeared that are dedicated specifically to defining bioinformatic workflows in a platform-agnostic manner, such as WDL, CWL, or Nextflow DSL. Pipelines defined in those languages can now be run on cloud computing platforms including AWS, Google Cloud Platform, Alibaba Cloud, and more, thanks to engines like Cromwell or Nextflow. At Intelliseq we use WDL, which allows our pipelines to be run not only on generic public clouds but also on genomics-specific platforms like DNAnexus and DNAstack, or by providers like ixLayer.

One of the crucial ideas that pushed the field of computational genomics forward was containerization. Tools can now be packaged inside lightweight, isolated environments called Docker containers. This allows tools to be run without separate installation on any platform that has Docker Engine installed. It is enough to download a Docker image with all the tools required for a computation and run it in a local or cloud environment. At Intelliseq, we put inside Docker containers not only all the tools but also the data sources required for computation, such as the reference genome or variant annotations. This allows us to version our pipelines precisely and thus meet CAP requirements for bioinformatic pipelines.
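To make the versioning idea concrete, here is a minimal sketch of how a pipeline's exact composition could be reduced to a single identifier. It is an illustration only, not Intelliseq's actual mechanism: the function `pipeline_fingerprint`, the manifest layout, and every image and data-source name below are hypothetical placeholders.

```python
import hashlib
import json


def pipeline_fingerprint(components):
    """Return a SHA-256 hex digest identifying an exact set of pinned components.

    `components` maps a component name to its pinned version or content digest.
    Keys are sorted before hashing, so the result does not depend on dict order.
    """
    canonical = json.dumps(components, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical manifest: names and versions are placeholders, not real releases.
manifest = {
    "image/aligner": "example-registry/aligner:1.2.0",
    "image/variant-caller": "example-registry/variant-caller:0.9.1",
    "data/reference-genome": "reference-build-A",
    "data/variant-annotations": "annotations-2020-01",
}

fingerprint = pipeline_fingerprint(manifest)

# Changing any single tool or data source yields a different fingerprint.
updated = dict(manifest, **{"data/variant-annotations": "annotations-2020-04"})
updated_fingerprint = pipeline_fingerprint(updated)
```

Because tools and data sources are hashed together, a report stamped with such a fingerprint could be traced back to the exact software and reference data that produced it.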

How can you prepare for scaling in advance? Rather than developing pipelines from scratch each time, use ones that have already been developed; doing otherwise is reinventing the wheel. Once the base pipeline is established, it can be customized for a client at a fraction of the development cost.

Establishing a specialized team of bioinformaticians inside a lab to develop genomic pipelines can be compared to establishing a team of electrical engineers to set up a power generator for the lab. It is much more cost-effective to cooperate with companies like Intelliseq that already have an established portfolio of workflows. Those workflows can be easily customized according to laboratory requirements and then integrated with Laboratory Information Systems or run by external providers like ixLayer.

What does the process of workflow development look like? It starts with a specification of requirements. Then, the team of Intelliseq scientists proposes an outline of the workflow, including already developed computational tasks as well as those that need to be developed. Intelliseq has a large collection of tasks performing procedures such as quality checks, alignment, variant calling, variant annotation, imputation, and polygenic score computation. Tasks are connected into fully functional workflows. The initial proposal also includes pricing and a development timeline. Report generation can be included in the pipeline, or reports can be produced by other vendors based on the results coming out of the pipeline.
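The way reusable tasks are connected into a workflow can be sketched as a dependency graph. The example below is an assumption-laden illustration in Python rather than WDL: the task names mirror the procedures listed above, but the dependency structure and the helper `execution_order` are hypothetical and do not describe an actual Intelliseq workflow.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each task maps to the set of tasks it depends on.
TASK_DEPENDENCIES = {
    "quality_check": set(),
    "alignment": {"quality_check"},
    "variant_calling": {"alignment"},
    "variant_annotation": {"variant_calling"},
    "report": {"variant_annotation"},
}


def execution_order(dependencies):
    """Return the tasks in an order where every task follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(TASK_DEPENDENCIES)
print(" -> ".join(order))
```

In a sketch like this, customizing a workflow for a client amounts to adding or swapping entries in the dependency map, while the existing tasks stay unchanged.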

Intelliseq was established in 2014 by a group of scientists fascinated with genomics. At Intelliseq, we understand that in-depth data analysis and interpretation lie at the very heart of successful genome-wide research. The company consists of an interdisciplinary team of experts in the fields of genomics, molecular biology, bioinformatics, mathematics, neuropsychology, and software development. We specialize in building complete workflows from FASTQ to report, combining recognized software under public and commercial licenses with proprietary software. We offer expert consultancy on the optimal implementation model.

<h2>Want to know more?</h2>

Get in touch with us.