Big data has the potential to provide scientists with important insights but there are still some barriers to its widespread use that need to be overcome. In this article, Paul Denny-Gouldson from IDBS tells us more…
Big data can provide scientists with valuable insights that might otherwise be inaccessible, but it is crucial to capture the outcome of the analysis in a way that allows it to be reused. Choosing the right technology ecosystem is essential for scientists aiming to leverage big data analytics. The time-consuming task of aggregating and analysing data can distract scientists and engineers from high-value tasks that require greater human input: they end up spending valuable time copying and pasting data between applications.
Some of the obvious and perennial scientific questions present well-known challenges, and they are now automatically filed under the term ‘big data’. Scientists have always asked these tricky questions, for instance ‘is there a relationship between these two apparently disparate things?’ or ‘provide me with a data set that is an aggregation of all the data we have collected over the past 15 years so I can do some analysis’. They are starting to become tractable thanks to advances in cloud-based storage and high-performance computing (HPC and GPU technologies), which can hold large data sets in memory and run many compute cycles per second on them, producing results tantalisingly quickly: what took weeks just a few years ago now takes seconds.
Accessible to all?
However, what is less talked about is how we initially access and integrate the scientific data, and how we make sense of the information that we gain from ‘big data analysis’ in a way that makes it usable by all. Currently, data science is a specialist function, and it may well stay that way, but the goal is to take what these specialists do and democratise it, making it accessible to all. Essentially, how can we make sure that the data aggregation and analysis process can be verified and repeated by anyone, without overloading the compute systems, while making sure that the data going in is of good quality?
This ‘democratising of big data analysis’ is the next hurdle the scientific informatics community has to tackle. There are two parts to the hurdle: the historical data and the new data being collected going forward. With historical data, a number of approaches can be used to ‘tag’ it and make it more accessible and consumable: retrospectively ingesting data and using ontologies and metadata tagging platforms to give a view of the data landscape. Other tools then sit on these platforms, consume the data, massage it and present it to analytics tools (visualisations, machine learning etc.). This type of data tagging is a step in the right direction, but it is important to note that it does not capture the knowledge of the user. The tags are inferred, so the ‘level of trust’ is lower.
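As a rough illustration of retrospective tagging, the Python sketch below matches a legacy free-text record against a tiny synonym-based vocabulary and marks every resulting tag as inferred rather than user-supplied. The vocabulary, the term names and the `tag_record` function are all hypothetical, invented for this example; a real ontology platform would be far richer.

```python
import re

# Hypothetical mini-vocabulary: canonical term -> synonyms seen in legacy notes.
VOCABULARY = {
    "assay:elisa": ["elisa", "enzyme-linked immunosorbent assay"],
    "technique:hplc": ["hplc", "high-performance liquid chromatography"],
}


def tag_record(text):
    """Return inferred tags for a legacy free-text record."""
    lowered = text.lower()
    tags = []
    for term, synonyms in VOCABULARY.items():
        for synonym in synonyms:
            # Whole-word match so a synonym does not fire inside another token.
            if re.search(r"\b" + re.escape(synonym) + r"\b", lowered):
                # The tag is a guess made by software, so it is labelled as such.
                tags.append({"term": term, "source": "inferred", "matched": synonym})
                break
    return tags
```

Keeping the `source` field on every tag is what lets downstream tools treat inferred tags with less trust than ones a scientist has entered or confirmed.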
Importance of data description
Working with ‘current data’ requires that all the data (contextual and structured) is captured and tagged as well as possible at the point of creation. That means getting the user to make sure the data is fully described with metadata or, if algorithms are used to suggest tags, at least getting the user to confirm what is and is not correct. While this has the advantage of increasing the level of trust in the data, reaching 100% is unachievable. The same requirement exists for both historical and new data: an organisational semantic taxonomy and ontology of data types and tags that describes the science being done in a way that allows both humans and computers to interrogate the data and consume it effectively (search, aggregate, analyse etc.).
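One way to picture capture at the point of creation is the minimal Python sketch below. It assumes a small controlled vocabulary (`ALLOWED_TERMS`) and two hypothetical classes, `Tag` and `Record`, none of which come from any particular product: tags entered by the user count as confirmed, while tags suggested by an algorithm stay unconfirmed until a user reviews them.

```python
from dataclasses import dataclass, field

# Hypothetical controlled vocabulary agreed by the organisation.
ALLOWED_TERMS = {"assay:elisa", "technique:hplc", "outcome:failed", "outcome:succeeded"}


@dataclass
class Tag:
    term: str
    source: str  # "user" or "algorithm"
    confirmed: bool = False


@dataclass
class Record:
    record_id: str
    tags: list = field(default_factory=list)

    def add_tag(self, term, source):
        # Reject anything outside the agreed vocabulary at capture time.
        if term not in ALLOWED_TERMS:
            raise ValueError(f"unknown term: {term}")
        # User-entered tags count as confirmed; algorithmic ones await review.
        self.tags.append(Tag(term, source, confirmed=(source == "user")))

    def confirm(self, term):
        # Called when a user reviews and accepts a suggested tag.
        for tag in self.tags:
            if tag.term == term:
                tag.confirmed = True

    def trusted_terms(self):
        return [tag.term for tag in self.tags if tag.confirmed]
```

Rejecting unknown terms when the record is created keeps the vocabulary consistent, and the `confirmed` flag lets search and analytics tools separate reviewed tags from mere suggestions.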
Whilst this sounds easy, it is not. Why? Because the science and the data landscape are continually evolving, so the systems used to aggregate, tag and distil data need to do the same. These corporate scientific data ecosystems should be viewed as a living system, one that needs to be fed, curated and managed in the same way as any other ‘live system’.
It’s about what didn’t work too…
Successful R&D is not just about being able to validate and defend the results of experiments that worked, the ones published in journals. It is also about enabling other researchers to utilise existing findings and apply them to their own research, and that includes experiments that didn’t work. The knowledge that can be gained from experiments that didn’t work is just as critical as the data about those that ‘worked’.
This is why internal knowledge systems are so critical: basing decisions only on curated data from experiments that worked (typically those published in journals) does not give a true picture of the scientific knowledge landscape.
Barriers still to overcome
Using big data analytics, researchers can now begin to explore and expand their data sets and the types of analysis they use. A key to all of this, however, is choosing the right ecosystem of applications, tools and data management infrastructure to manage these tasks. Making big data analytics, machine learning and other algorithm-based AI techniques available to all is still a barrier that needs to be overcome.
We have seen similar hype over the years with other computational techniques, and these are only now becoming ‘democratised’ and integrated into scientists’ working practices, 20 years after the initial push and introduction to the masses. We may see a similar trajectory with big data analytics and AI in science: early adopters and pioneers are already looking at how to leverage these technologies, and most large pharma and biotech companies have programmes of work looking at these problems.
The majority of the problems people are tackling with big data and AI are in the clinical trials, drug repurposing and real-world evidence space — not so much in early research and development. But given the amount of investment and continual simplification of deployment it is almost inevitable that more and more case studies will be available showing how big data analytics and AI tools can be used to help research and development.