For Research

As the volume of research data grows, so does the challenge of harnessing and exploiting that data to drive research outcomes.

The ability to access and analyse data from all aspects of a research project (subject to authorisation) on a common platform provides an integrated view of experimental data, and is likely to generate additional insights by revealing new relationships.

Mediaflux Desktop is a web operating system for metadata and data, enabling individuals and distributed groups to ingest, discover, and share any type of data. It utilises contemporary Web 2.0 technology, dynamically constructing menus, panes and pop-ups according to the underlying data.

Mediaflux Desktop provides advanced features not normally associated with a browser-based interface, enabling the upload and download of terabytes of data. It allows the ingestion of directories, performing file system pattern matching and packaging, and it will continue to run even if the browser is closed, allowing large data transfers to complete.

Key Features

  • Multi-user platform for ingesting and storing any type of data - upload and download data, from a single file to thousands, from a single byte to terabytes
  • Geospatial features - associate n-dimensional point data and polygons with data
  • Automated metadata extraction - metadata can be automatically extracted from data or added manually at any time; metadata enables discovery of data and can conform to any standard or to a customised schema
  • Flexible metadata management - metadata document definitions can be updated at any time on a live system, without the need to take tables, or the database, off-line
  • Discovery - easy to use and powerful web query tool
  • Designed for large data - scales to billions of files and petabytes of data; large datasets may be packaged based on patterns defining which files to process or ignore, which to coalesce and which are related
  • Replication - data can be automatically shared to multiple systems
  • Federation - search across multiple, loosely coupled systems; scales out as additional systems are federated
  • Workflow - support for simple or complex workflows and processes; deliver data to external applications and systems for transformation
  • Versioning - earlier versions are automatically preserved when data is updated
  • Traceable - trace results back to source data
  • High performance - parallel I/O for ingestion and replication
  • Auditing - all operations captured in an audit trail
  • Access control - flexible access control based on hierarchical (actor, control, subject) triplets; access control lists and fine-grained control down to the metadata document level (a simplified sketch of the triplet model follows this list)
  • Integration - can integrate with any other system
  • Low cost of ownership with bounded and known costs and commercial grade support.
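
The (actor, control, subject) triplet mentioned in the access control feature can be pictured as a set of rules evaluated against each request. The Python sketch below is a simplified, hypothetical illustration of that idea only; the role names, rule set and exact-match evaluation are assumptions, not a description of Mediaflux's actual access control implementation.

    # Hypothetical sketch of (actor, control, subject) access-control triplets.
    # Names and evaluation rules are illustrative assumptions, not Mediaflux's API.

    from typing import NamedTuple

    class AccessRule(NamedTuple):
        actor: str      # who (a user or role)
        control: str    # what they may do (e.g. "read", "modify")
        subject: str    # what it applies to (e.g. a namespace or metadata document)

    RULES = [
        AccessRule("role:project-x-member", "read",   "namespace:/projects/x"),
        AccessRule("role:project-x-admin",  "modify", "namespace:/projects/x"),
        AccessRule("role:project-x-member", "read",   "document:x.sample.metadata"),
    ]

    def allowed(actor: str, control: str, subject: str) -> bool:
        """Grant access only if an explicit triplet matches the request."""
        return any(r == (actor, control, subject) for r in RULES)

    print(allowed("role:project-x-member", "read", "namespace:/projects/x"))    # True
    print(allowed("role:project-x-member", "modify", "namespace:/projects/x"))  # False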

Dealing with Large Data

Special techniques need to be employed to deal with large data. For example, genome sequencers can generate single files that are 20TB in size. Fortunately these are text and typically compress at around 3:1. Nevertheless, compressing such data can consume days: that's why Mediaflux Desktop employs parallel compression tools.
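
The following sketch shows the general technique of parallel compression: splitting input into independent blocks and compressing them on several CPU cores at once. It uses Python's standard zlib and multiprocessing modules purely as an illustration; it is not Mediaflux's compression tooling, and the block size, pool size and file names are arbitrary assumptions.

    # Illustrative parallel block compression (not Mediaflux's actual tooling).
    # Each block is compressed independently, so all CPU cores can work at once.

    import zlib
    from multiprocessing import Pool

    BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB blocks - an arbitrary choice

    def read_blocks(path):
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                yield block

    def compress_file(src, dst, workers=8):
        with Pool(workers) as pool, open(dst, "wb") as out:
            # imap preserves block order while compressing blocks in parallel
            for compressed in pool.imap(zlib.compress, read_blocks(src)):
                out.write(len(compressed).to_bytes(8, "big"))  # length prefix
                out.write(compressed)

    if __name__ == "__main__":
        compress_file("reads.fastq", "reads.fastq.cz")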

Some researchers need to share images that are 110,000 x 65,000 pixels in size. Images of this scale cannot be shared over the web without computation by an active system. That's why Mediaflux Desktop utilises image pyramids to pan and zoom these images.
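
As a rough illustration of why pyramids help, the short calculation below works out how many half-resolution levels a 110,000 x 65,000 pixel image needs before it fits into a single 256-pixel tile, and how many tiles each level contains. The tile size and the halving scheme are common conventions, not a description of Mediaflux's internal format.

    # Pyramid levels for a 110,000 x 65,000 pixel image (illustrative only).
    # Each level halves both dimensions until the image fits in one 256 px tile.

    import math

    TILE = 256
    width, height = 110_000, 65_000

    level = 0
    w, h = width, height
    while w > TILE or h > TILE:
        tiles = math.ceil(w / TILE) * math.ceil(h / TILE)
        print(f"level {level}: {w} x {h} px, {tiles} tiles")
        w, h = math.ceil(w / 2), math.ceil(h / 2)
        level += 1
    print(f"level {level}: {w} x {h} px, 1 tile")

A viewer then fetches only the handful of tiles covering the current view at the current zoom level, rather than the full multi-gigabyte image.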

Information systems also need to deal with large numbers of files (hundreds of millions) and billions of objects. That's why Mediaflux Desktop utilises a high-performance database that outscales traditional relational databases for the types of problems addressed.

All in all, there are many issues that need to be addressed when dealing with data that is both large in volume and large in cardinality.

Mediaflux Desktop can add significant value as a common platform for researchers, offering far more capability than shared or distributed file systems.

Alignment of Structured and Unstructured Data

A common requirement is the management of both structured and unstructured data. If structured and unstructured data describing some aspect of the same fact are separated, research errors arise. Those errors may not be discovered for years; had the entire data set been managed as one unit, the errors would not have arisen.

Historically, data has been stored in file systems because that was the only option. However, this has significant drawbacks. Searching and analysis (of structured data) must be carried out in a database before getting to a file system (where unstructured data is stored). File systems are too slow - the wrong tool - for searching and quickly finding the requisite data from large data sets.

Mediaflux Desktop circumvents these drawbacks by treating structured and unstructured data as one unit, with a fast and powerful search capability, removing the need for users to directly store and manage data in file systems.
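
One way to picture this is an object that carries its structured metadata and its unstructured content together, with queries running over the metadata alone. The sketch below is a deliberately minimal, hypothetical model of that idea in Python; the field names and query form are assumptions and bear no relation to Mediaflux's actual schema or query language.

    # Minimal, hypothetical model of metadata and content managed as one unit.
    # Field names and the query are illustrative assumptions only.

    from dataclasses import dataclass, field

    @dataclass
    class Asset:
        content_path: str           # unstructured data (e.g. an image or sequence file)
        metadata: dict = field(default_factory=dict)  # structured data describing it

    assets = [
        Asset("scans/brain_001.tif", {"subject": "S001", "modality": "MRI", "slice_um": 5}),
        Asset("scans/brain_002.tif", {"subject": "S002", "modality": "MRI", "slice_um": 10}),
    ]

    # A query runs over the structured metadata and returns the matching content,
    # so there is no separate database lookup followed by a file system search.
    hits = [a.content_path for a in assets
            if a.metadata["modality"] == "MRI" and a.metadata["slice_um"] <= 5]
    print(hits)  # ['scans/brain_001.tif']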

Storage Workflow

Storage processes are orders of magnitude slower than computational processes. Mediaflux Desktop includes many features designed to optimise storage workflow:

  • File Compilation Profiles (FCP)
  • Arcitecta Archives (AAR)
  • Parallel I/O
  • Storage
  • Hierarchical Storage Management (HSM).

File Compilation Profiles (FCP)

Researchers need to upload and download data, from a single file to thousands, from a single byte to terabytes. File compilation profiles simplify and automate this task.

Users can simply drag directories (or files) into the system, select an FCP, and then have the system automatically interpret and optimally package the contents of the input data.

The file system compiler can:

  • Traverse directories of files
  • Group data into archives of varying levels of compression
  • Assign logical MIME types to the data
  • Extract metadata, including geospatial extents, from file paths, sidecar XML files and other contextual sources on the client, before any server-side content analysers extract further information.

For example, the military imaging format CIB normally has 1,000-1,500 files per image. An FCP can be configured to detect the occurrence of a CIB data set, and automatically ingest and bundle it into a single archive. This significantly reduces both network transmission overhead, and storage overhead, since only one (atomic) file is transmitted and stored instead of up to 1,500 separate files. The Arcitecta CIB content analyser extracts geospatial extents and other metadata.
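
The sketch below imitates the kind of client-side work described above: traversing a dropped directory, deciding from simple patterns which files to ignore and which to coalesce, assigning a logical type, and pulling metadata out of file paths and sidecar XML files. The patterns, type names, metadata fields and example path are invented for illustration and are not an actual FCP definition.

    # Illustrative file-compilation pass (not an actual FCP definition).
    # Traverses a directory, skips ignored files, groups the rest for archiving,
    # and extracts simple metadata from paths and sidecar XML files.

    import fnmatch
    import os
    import xml.etree.ElementTree as ET

    IGNORE = ["*.tmp", "Thumbs.db", ".*"]   # patterns to skip (assumed)
    SIDECAR_SUFFIX = ".xml"                 # sidecar metadata files (assumed)

    def compile_directory(root):
        members, metadata = [], {"source.directory": os.path.basename(root)}
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if any(fnmatch.fnmatch(name, pat) for pat in IGNORE):
                    continue
                if name.endswith(SIDECAR_SUFFIX):
                    # Pull metadata out of the sidecar rather than archiving it
                    for elem in ET.parse(path).getroot():
                        metadata[elem.tag] = (elem.text or "").strip()
                    continue
                members.append(path)
        # A logical type for the bundle as a whole, chosen by a simple rule
        mime = ("application/x-image-set"
                if any(m.lower().endswith((".tif", ".jpg")) for m in members)
                else "application/octet-stream")
        return {"members": members, "metadata": metadata, "type": mime}

    print(compile_directory("/data/survey_2023"))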

Arcitecta Archive

The Arcitecta Archive (AAR) is an archive format developed by Arcitecta to coalesce large numbers of small or large files. This can significantly reduce the number (and potentially the size) of files under management, leading to substantial disk I/O and network I/O savings.

A typical example: 10,237 files containing Long Mate Pair genome sequence data, totalling 269GB, are packaged into a single file of less than 92GB.

AAR files support the following:

  • Coalesce any number of files into a single archive (AAR) - up to 2^63 bytes
  • Automatic re-inflation on extraction
  • Parallel compression and decompression, which drastically improves compression and decompression times on multi-CPU machines
  • Extraction of individual files from an AAR and transmission to a client without intermediate decompression
  • Table of contents can be extracted and stored locally
  • Archives can be split and merged to create derivative archives without additional decompression/recompression steps.

Figure 1: Arcitecta Archive (AAR) file supports parallel compression and decompression

AAR detects errors at the partial file level, enabling recovery of parts of an archive (non-corrupt files, or non-corrupt segments of files) in the event that part of the archive has become corrupted by the storage system(s).
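
The AAR format itself is proprietary, but the general technique behind these properties, compressing each member independently, recording a table of contents of offsets, and checksumming each member so corruption is contained, can be sketched in a few lines. The layout below is an invented, simplified stand-in used only to illustrate those properties; it is not the AAR on-disk format.

    # Invented, simplified container illustrating AAR-style properties:
    # independently compressed members, a table of contents, and per-member CRCs.
    # This is NOT the actual AAR on-disk format.

    import json, os, zlib

    def write_archive(paths, archive):
        toc = []
        with open(archive, "wb") as out:
            for path in paths:
                with open(path, "rb") as f:
                    raw = f.read()
                comp = zlib.compress(raw)        # each member compressed independently
                toc.append({"name": os.path.basename(path),
                            "offset": out.tell(),
                            "length": len(comp),
                            "crc32": zlib.crc32(raw)})
                out.write(comp)
        # The table of contents can live outside the data, so it can be browsed
        # without reading (or decompressing) the archive itself.
        with open(archive + ".toc", "w") as f:
            json.dump(toc, f)

    def read_member(archive, name):
        with open(archive + ".toc") as f:
            toc = json.load(f)
        entry = next(e for e in toc if e["name"] == name)
        with open(archive, "rb") as f:
            f.seek(entry["offset"])              # only this member's bytes are read
            raw = zlib.decompress(f.read(entry["length"]))
        if zlib.crc32(raw) != entry["crc32"]:
            raise IOError(f"member {name} is corrupt; other members remain recoverable")
        return raw

Because each member is compressed on its own, members can be compressed or decompressed in parallel, extracted individually, and a corrupt member does not prevent recovery of the others.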

Archive formats including ZIP, TAR.GZ and ISO images are also supported.

Parallel I/O

Mediaflux Desktop utilises parallel I/O for transmitting data to the Mediaflux server. This works both for direct file transmission and for streaming output during on-the-fly archive generation.

Parallel I/O increases performance for wide area network transmission, and also improves performance on local area networks by ensuring the network is fully utilised. The parallel I/O system can be configured to specify the number of concurrent packet transmissions.

During upload to the server, data can be transmitted and arrive out of sequence - it is properly re-ordered within Mediaflux before being passed to the service that receives and processes the data.
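
The sketch below shows the underlying idea of parallel transmission with re-ordering on arrival: the sender splits data into offset-addressed chunks and pushes them over several workers, and the receiver writes each chunk at its own offset so arrival order does not matter. The chunk size, worker count, use of threads and file names are arbitrary illustrative choices, not Mediaflux's transport protocol.

    # Illustrative parallel transfer with out-of-order arrival (not Mediaflux's protocol).
    # Chunks carry their own offsets, so the receiver can write them in any order.

    import os
    from concurrent.futures import ThreadPoolExecutor

    CHUNK = 8 * 1024 * 1024   # 8 MB chunks - an arbitrary choice

    def receive(dst, offset, data):
        # Stand-in for the server side: place the chunk at its offset.
        with open(dst, "r+b") as f:
            f.seek(offset)
            f.write(data)

    def parallel_send(src, dst, workers=4):
        size = os.path.getsize(src)
        with open(dst, "wb") as f:             # pre-size the destination
            f.truncate(size)
        def send(offset):
            with open(src, "rb") as f:
                f.seek(offset)
                receive(dst, offset, f.read(CHUNK))
        with ThreadPoolExecutor(workers) as pool:
            list(pool.map(send, range(0, size, CHUNK)))

    parallel_send("dataset.aar", "received.aar")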

Hierarchical Storage Management

Hierarchical storage management (HSM) offers a tiered virtual storage environment, allowing data to be automatically directed to the most cost effective storage tier. Mediaflux has the following capabilities in conjunction with third party HSM:

  • Reporting in which tier particular data is located
  • Explicitly migrating data from one tier to another.

These capabilities can be used with queries to allow metadata to drive HSM migration. For example, a query that utilises geospatial and other metadata might be constructed to migrate all data relating to a particular geospatial area for a project to high speed disk.

When migrating a file on-line, either the entire file or specific byte ranges may be moved.
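
As a sketch of how metadata can drive migration in the example above, the fragment below selects assets whose metadata places them inside a project's bounding box and asks an assumed migrate() call to move them to a faster tier. The asset records, bounding-box fields and migrate() function are hypothetical placeholders; the actual Mediaflux services and HSM commands are not shown here.

    # Hypothetical metadata-driven tier migration. The asset list, bounding-box
    # fields and migrate() call are placeholders, not real Mediaflux services.

    BBOX = {"min_lon": 144.5, "max_lon": 145.5, "min_lat": -38.5, "max_lat": -37.5}

    assets = [
        {"id": 1001, "project": "coastal-survey", "lon": 144.9, "lat": -37.8, "tier": "tape"},
        {"id": 1002, "project": "coastal-survey", "lon": 150.3, "lat": -33.9, "tier": "tape"},
    ]

    def migrate(asset_id, tier):
        # Placeholder for whatever command the HSM integration actually issues.
        print(f"migrating asset {asset_id} to {tier}")

    for a in assets:
        in_bbox = (BBOX["min_lon"] <= a["lon"] <= BBOX["max_lon"]
                   and BBOX["min_lat"] <= a["lat"] <= BBOX["max_lat"])
        if a["project"] == "coastal-survey" and in_bbox and a["tier"] != "fast-disk":
            migrate(a["id"], "fast-disk")   # only asset 1001 qualifies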

Putting it All Together

These features can be used in combination to streamline the ingestion of data as illustrated below.

Optimised Ingest

Figure 2: Example optimised ingest storage workflow

The ingest process is as follows:

  1. The user drags and drops a directory, and the file system compiler traverses its contents
  2. The file system compiler packages the data into one or more archives at the specified compression level
  3. The file system compiler transmits the data to the server using parallel I/O
  4. The server optionally extracts the table of contents from the archive and stores it in the database for later browsing
  5. The data is committed to storage - the data is cached on the HSM disk as it arrives from the client, ensuring no extra copy overhead is incurred.

Optimised Egest

Figure 3: Example optimised egest storage workflow

The egest process is as follows:

  1. A user or application searches for data using a query. The content of the archive can be browsed using the on-line table of contents, without retrieving the data from the HSM
  2. The user requests specific files from within the archive
  3. The server requests the requisite byte ranges for those files from the storage system - HSM in this case
  4. The files are extracted (without decompressing) from the archive and transmitted to the client
  5. The data is re-inflated as it arrives at the network boundary of the client computer.
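
The browse-then-fetch pattern in these steps can be seen with a standard ZIP archive, one of the other supported formats: the table of contents is listed without reading the packaged data, and only the requested member is then extracted. The file and member names below are invented for illustration.

    # Browse an archive's table of contents, then extract only the requested member.
    # Uses a standard ZIP archive for illustration; file names are invented.

    import zipfile

    with zipfile.ZipFile("survey_2023.zip") as archive:
        # Steps 1-2: browse the table of contents and pick a file
        for info in archive.infolist():
            print(info.filename, info.file_size, "bytes")
        wanted = archive.infolist()[0].filename

        # Steps 3-5: only this member's bytes are read and inflated
        data = archive.read(wanted)
        print(f"retrieved {wanted}: {len(data)} bytes")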