Why a unified approach to big data is key

According to Frank Austin Nothaft, GTM lead for Genomics, Databricks, a unified approach to big data is the best way to get the most value. Here, he tells us more.

With the cost of sequencing dropping dramatically over the last decade, genomic data volumes have grown exponentially. By 2020, genomic data is expected to reach upwards of 40 exabytes per year.1 Pharmaceutical companies are rushing to tap into this genomic data goldmine in hopes of accelerating the costly long tail of drug development.

Early population sequencing efforts, such as the Geisinger/Regeneron DiscovEHR collaboration,2 have demonstrated the ability to identify novel links between genomic variants and phenotypes of interest. By integrating genomic data with phenotypic sources, pharmaceutical development organisations can more precisely target treatments to the underlying biology driving a disease.

This can result in more effective drugs with reduced side effects and accelerate time to market. However, a number of technology hurdles must be overcome for organisations to fully draw value from their genomics data with advanced analytics and machine learning.

The challenges of large-scale genomic analysis

The sheer scale of genomic data is daunting, as a whole genome study can have upwards of 100GB of data per individual. While the data from one individual’s genomic profile provides insight into their biology, genomic data is most powerful when viewed at a population scale and with the added context of phenotypic datasets like EMR and imaging studies.

Bioinformatics tools have traditionally been designed for on-premises High Performance Computing (HPC) architectures, but it is difficult to scale the storage systems behind those architectures to petabyte/exabyte volumes of data in a cost-effective manner.
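To make the scale concrete, here is a back-of-the-envelope sketch in Python. It takes the 100GB-per-individual figure above as a lower bound and multiplies it out for a few cohort sizes. The cohort sizes and the function name are illustrative assumptions, not figures from any particular study, and the estimate ignores compression, replication and derived data.

```python
# Rough storage estimate for whole-genome cohorts.
# Assumption: 100 GB of raw data per individual (the lower bound quoted above).
GB_PER_GENOME = 100

def cohort_storage_pb(num_individuals, gb_per_genome=GB_PER_GENOME):
    """Return raw storage in petabytes, using decimal units (1 PB = 1,000,000 GB)."""
    return num_individuals * gb_per_genome / 1_000_000

# Hypothetical cohort sizes, chosen only to show how quickly volumes grow.
for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} individuals -> {cohort_storage_pb(n):,.1f} PB")
```

Under these assumptions, a cohort of 100,000 individuals already needs around 10 petabytes of raw storage before any phenotypic data is added.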

For a pharmaceutical R&D team, the real value in advanced analytics is obtained once these techniques can be made easily accessible to domain scientists. Following the example of the Regeneron Genetics Center, advanced analytics can provide tremendous value when deployed through a web portal that allows bench scientists to rapidly drill down on the data that supports a link between a gene and a phenotype.3,4

Achieving agility and scale in the cloud

One approach that pharmaceutical manufacturers have traditionally leveraged for large-scale bioinformatics is on-premises HPC architectures. Unfortunately, the financial model for managing an HPC installation requires a significant up-front CapEx investment, which does not eliminate OpEx. An alternative approach is to use cloud computing.

Cloud computing allows elasticity around data infrastructure. Large public cloud companies have all made huge strides in readying their services for enterprises. These public cloud platforms now have more security and controls implemented around data that is being stored in them, essential when managing highly sensitive data.

Genomic data needs unified analytics

As commoditised sequencing allows more organisations to gain access to massive -omic datasets, success will be determined by an organisation’s ability to rapidly turn raw genomic data into actionable biological insight. A disjointed approach to data forces bioinformaticians, computational biologists and bench scientists to work in silos, which hampers the discovery and analytics process.

Building a homegrown patchwork solution of tools and technologies pulls valuable cycles from domain scientists whose expertise is in the biology rather than large-scale data. By leveraging cloud computing technologies to rapidly analyse genomic data along with bioinformatics platforms that bring together large-scale data processing and advanced analytics in one connected toolset, disparate teams across bioinformatics, computational biology, and bench science can be unified to accelerate R&D.

  1. http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
  2. http://www.discovehrshare.com/
  3. https://databricks.com/session/building-the-future-of-drug-discovery
  4. https://databricks.com/session/insights-from-building-the-future-of-drug-discovery-with-apache-spark
  5. https://datascience.nih.gov/blog/cloud