
As the pre-eminent biologist Meg Lowman described in her account of life as an impoverished PhD student in The Arbornaut, she never worried about anyone breaking into her apartment because the only thing she had of value was her data. Offered as a colourful anecdote to bring her hardscrabble beginnings into focus, it also provides insight into the enormous division between researchers, businesses, and the general public when considering the intrinsic value of data. Long before data becomes woven into knowledge or an eventual tapestry of understanding and insight, it is worth more than strands of gold to the investigator seeking truth.
What Value Data?
If the division between science and philosophy is demarcated by regarding facts over judgements, then it is perhaps paradoxical to require scientists to ascribe value to their data. Data, which the researcher considers indivisible from fact, must be evaluated not only for its intrinsic potential to reveal the reality of the universe, but also against the costs of its retention or loss. Within institutional ledgers we must consider not only irrevocable destruction or deletion, but also the consequences of burial beneath a mountain of other data, duplication, inadequate curation, or inoperability as technology evolves. Yet to ask researchers which of their data is most valuable is tantamount to asking them to pick their favourite child; choosing what to throw out is simply unconscionable.
While most people and institutions typically diverge from the scientist in their axiological appraisal of research data, much as parents differ in their affections for others’ children, it is worth considering what makes data valuable to an entire research organisation. Where does data sit in the ledgers relative to infrastructure, accolades, public perception, liability, and other intellectual property? What accounting metrics capture its present and future value? How do institutions assess the state of their research data health? And how do we untangle the economic and ethical considerations knotted throughout such issues?
Exploring this landscape of research data value, appropriate data management emerges as the keystone. Data retained within a hoard of undocumented, undiscoverable, unusable items is no more valuable than a gold coin buried in a dragon’s lair. While the researcher may judge their data as intrinsically valuable, it is ultimately the transformation of data into knowledge and, optimally, improved actions that imparts data’s extrinsic merit. Its worth, therefore, can be calculated as a function of utility. Without proper curation, including extensive metadata documentation and community-wide access, archived data retains little intellectual value, a value perhaps even outweighed by the cost of keeping it.
With the push for open science gaining momentum among funding agencies, (inter)governmental bodies, and the collective research community, data is increasingly available to those who know what they want, where to look for it, and have the proper credentials to gain access. Availability, however, does not necessarily equate to discoverability, viability, or interoperability. Emerging repositories exemplifying these three characteristics are generally domain- or region-specific. Research areas that are inherently multidisciplinary, global, and temporally and spatially variable, largely dependent upon single-shot data collection opportunities and under pressure to solve humanity’s most urgent problems, lead the way by virtue of necessity. These tools provide access to extraordinary volumes of georeferenced data from satellites, ground- and ocean-based sensors, and direct observation. Yet linkages with other types of data beyond the broad primary domains of Earth and space sciences, genomics, and ecology remain in their infancy.
But what if we were to unlock all types of research data from their disciplinary (or even sub-disciplinary) shackles via secure, interconnected, and exhaustively documented data lakes? What insights might we gain if any researcher could easily locate and incorporate all verified data related to a particular point in space or time? What patterns could be uncovered by setting AI/ML free in a playground of unbounded data: anthropological evidence for cognitive leaps spawned by exceptional meteorological conditions controlling crop yields; gene mutation rates as a function of microplastic, heavy metal, and hydrocarbon concentrations in soil, air, and water? The possibilities of questions yet unconceived lie beyond the imagination’s horizon.
There are certainly caveats to consider before embarking on a journey as audacious as this, such as data sensitivity, provenance, veracity, and ownership. Leaving the respective fields to grapple with satisfactory domain-specific solutions to those issues, we instead look at how to overcome the technical limitations: unfettering data via a legitimate global namespace and leveraging metadata for discoverability, insight, and controlled access. We challenge the pervasive perspective that metadata curation is a low-impact, altruistic act.
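To make the idea concrete, here is a minimal sketch, in Python, of what metadata-driven discovery and controlled access across a shared namespace could look like. Every record, field, and identifier below is hypothetical and illustrative only; it is not the API of Mediaflux or of any particular repository, merely a toy model of filtering richly described assets by space, time, and a requester’s credentials.

```python
from dataclasses import dataclass, field

@dataclass
class AssetRecord:
    """Hypothetical metadata record for an asset in a shared namespace."""
    uri: str                      # globally unique, persistent identifier
    domain: str                   # originating discipline
    lat: float                    # georeference (decimal degrees)
    lon: float
    year: int                     # temporal coverage, simplified to one year
    keywords: set[str] = field(default_factory=set)
    access_groups: set[str] = field(default_factory=set)  # who may retrieve it

# A toy catalogue; in practice this would be a queryable metadata index.
CATALOGUE = [
    AssetRecord("pid:clim/0001", "climatology", -37.8, 144.9, 1998,
                {"rainfall", "crop-yield"}, {"public"}),
    AssetRecord("pid:soil/0042", "soil-science", -37.8, 144.9, 1998,
                {"heavy-metals", "hydrocarbons"}, {"consortium-x"}),
    AssetRecord("pid:anthro/0007", "anthropology", -37.9, 145.0, 1998,
                {"settlement", "crop-yield"}, {"public"}),
]

def discover(lat: float, lon: float, year: int, radius_deg: float,
             requester_groups: set[str]) -> list[AssetRecord]:
    """Return assets near a point in space and time that the requester may access."""
    hits = []
    for rec in CATALOGUE:
        near = abs(rec.lat - lat) <= radius_deg and abs(rec.lon - lon) <= radius_deg
        in_time = rec.year == year
        permitted = ("public" in rec.access_groups
                     or bool(rec.access_groups & requester_groups))
        if near and in_time and permitted:
            hits.append(rec)
    return hits

if __name__ == "__main__":
    # A researcher with consortium credentials asks: what is known about this
    # place and time, across every domain that has described its data?
    for rec in discover(-37.8, 144.9, 1998, radius_deg=0.5,
                        requester_groups={"consortium-x"}):
        print(rec.uri, rec.domain, sorted(rec.keywords))
```

The point of the sketch is that none of the filtering touches the underlying data at all: discovery, cross-domain linkage, and access control are all carried by the metadata, which is precisely why its curation is anything but a low-impact act.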
Our previous blog posts have focussed on business use cases for metadata management in data lakes; however, data discovery, quality, integration, usability, and security are similarly critical facets of research data management. Mediaflux’s capacity to deftly navigate the deep and diverse waters of unstructured data, enforce data governance and compliance via access controls, and integrate with existing infrastructure makes it a peerless solution for research data management challenges. Leading research institutions around the world, pushing the boundaries of innovation and excellence, are leveraging this powerful tool to hone the edges of their data management systems.
Read the Research Data Management White Paper here.