Data Workflow

Data processes are orders of magnitude slower than computational processes. Mediaflux includes many features designed to optimise data workflow:

  • File Compilation Profiles (FCP)
  • Arcitecta Archives (AAR)
  • Parallel I/O
  • Hierarchical Storage Management (HSM).

File Compilation Profiles (FCP)

Users need to upload and download data ranging from a single file to thousands of files, and from a single byte to terabytes. File compilation profiles simplify and automate this task.

Users can simply drag directories (or files) into the system, select an FCP, and then have the system automatically interpret and optimally package the contents of the input data.

The file system compiler can:

  • Traverse directories of files
  • Group data into archives of varying levels of compression
  • Assign logical MIME types to the data
  • Extract metadata, including geospatial extents, from file paths, sidecar XML files and other contextual sources on the client, before any server-side content analysers extract further information.

For example, the military imaging format CIB normally has 1,000-1,500 files per image. An FCP can be configured to detect the occurrence of a CIB data set, and automatically ingest and bundle it into a single archive. This significantly reduces both network transmission overhead, and storage overhead, since only one (atomic) file is transmitted and stored instead of up to 1,500 separate files. The Arcitecta CIB content analyser extracts geospatial extents and other metadata.
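
To make the idea concrete, the sketch below mimics what such a profile might do on the client: recognise a multi-file dataset and bundle it into a single archive before upload. The directory test, archive format and function names are illustrative assumptions, not Mediaflux's FCP configuration syntax.

    # Illustrative sketch only, not Mediaflux's FCP mechanism. It mimics an
    # ingest-side rule that detects a multi-file dataset (here, a hypothetical
    # CIB directory layout) and bundles it into one archive before transmission,
    # so a single file is sent instead of up to 1,500.
    import tarfile
    from pathlib import Path

    def looks_like_cib(directory: Path) -> bool:
        # Hypothetical detector: assume a CIB dataset is a directory containing
        # an 'a.toc' table-of-contents file (assumption for illustration).
        return (directory / "a.toc").exists()

    def bundle_dataset(directory: Path, out_dir: Path) -> Path:
        # Package every file under the dataset directory into one archive.
        archive_path = out_dir / (directory.name + ".tar.gz")
        with tarfile.open(archive_path, "w:gz") as archive:
            archive.add(directory, arcname=directory.name)
        return archive_path

    def plan_ingest(directory: Path, out_dir: Path) -> list[Path]:
        # Return the files to transmit: one archive if the dataset is
        # recognised, otherwise the individual files.
        if looks_like_cib(directory):
            return [bundle_dataset(directory, out_dir)]
        return [p for p in directory.rglob("*") if p.is_file()]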

Arcitecta Archive

The Arcitecta Archive (AAR) is an archive format developed by Arcitecta to coalesce large numbers of small or large files. This can significantly reduce the number (and potentially the size) of files under management leading to substantial disk I/O and network I/O savings.

A typical example is where 10,237 files containing Long Mate Pair genome sequence data totalling 269GB are packaged into one file of less than 92GB.
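
For the figures quoted above, the arithmetic is straightforward:

    # Back-of-the-envelope arithmetic for the genome example above.
    original_files, original_gb, packed_gb = 10_237, 269, 92

    print(f"compression ratio : {original_gb / packed_gb:.1f} : 1")   # ~2.9 : 1
    print(f"space saved       : {1 - packed_gb / original_gb:.0%}")   # ~66%
    print(f"files to manage   : {original_files:,} reduced to 1")

In other words, the archive reduces the bytes transmitted and stored to roughly a third, and reduces the number of files under management by four orders of magnitude.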

AAR files support the following:

  • Coalesce any number of files into a single archive (AAR) - up to 2^63 bytes
  • Automatic re-inflation on extraction
  • Parallel compression and decompression, which drastically improves compression and decompression times on multi-CPU machines
  • Extraction of individual files from an AAR and transmission to a client without intermediate decompression
  • Table of contents can be extracted and stored locally
  • Archives can be split and merged to create derivative archives without additional decompression/recompression steps.

Figure 1: Arcitecta Archive (AAR) file supports parallel compression and decompression
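
The sketch below illustrates the general technique of block-parallel compression shown in Figure 1; it is not the AAR on-disk format. The input is cut into fixed-size chunks, each chunk is compressed on its own CPU core, and a small index records where each compressed chunk starts so that chunks can later be located and inflated independently. The chunk size and the use of zlib are assumptions for the example.

    import zlib
    from concurrent.futures import ProcessPoolExecutor

    CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per chunk (arbitrary choice)

    def compress_chunk(chunk: bytes) -> bytes:
        return zlib.compress(chunk, level=6)

    def parallel_compress(data: bytes) -> tuple[list[int], bytes]:
        # Cut the input into chunks and compress them on separate CPU cores.
        chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
        with ProcessPoolExecutor() as pool:
            compressed = list(pool.map(compress_chunk, chunks))
        # Record where each compressed chunk begins, so a reader can seek to
        # one chunk and inflate it without touching the rest.
        offsets, packed, position = [], bytearray(), 0
        for blob in compressed:
            offsets.append(position)
            packed.extend(blob)
            position += len(blob)
        return offsets, bytes(packed)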

AAR detects errors at the partial file level, enabling recovery of parts of an archive (non-corrupt files, or non-corrupt segments of files) in the event that part of the archive has become corrupted by the storage system(s).
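
The detail of AAR's error detection is not reproduced here, but the general principle of segment-level verification can be sketched as follows: each stored segment carries its own checksum, so a reader can discard corrupt segments and still recover every segment that verifies. The per-segment CRC32 is an assumption for illustration.

    import zlib

    def verify_segments(segments: list[bytes], checksums: list[int]) -> list[bytes]:
        # Return only the segments whose CRC32 still matches; corrupt segments
        # are dropped rather than failing the whole archive.
        return [data for data, expected in zip(segments, checksums)
                if zlib.crc32(data) == expected]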

Archive formats including ZIP, TAR.GZ and ISO images are also supported.

Parallel I/O

Mediaflux Desktop utilises parallel I/O for transmitting data to the Mediaflux server. This applies both to direct file transmission and to streaming output during on-the-fly archive generation.

Parallel I/O increases performance for wide area network transmission, and also improves performance for local area network transmission by ensuring the network is fully utilised. The parallel I/O system can be configured to specify the number of concurrent packet transmissions.

During upload to the server, data can be transmitted and arrive out of sequence; it is properly reordered within Mediaflux before being passed to the service that receives and processes the data.
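
As a rough illustration of how out-of-order arrival can be handled (this is not the Mediaflux wire protocol), the sketch below tags each chunk with a sequence number, sends chunks over several concurrent streams, and then writes them back in sequence order. The chunk size, stream count and send_chunk placeholder are assumptions.

    from concurrent.futures import ThreadPoolExecutor

    CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB per chunk (arbitrary choice)

    def send_chunk(sequence: int, payload: bytes) -> tuple[int, bytes]:
        # Placeholder for a real network send; returns what the server receives.
        return sequence, payload

    def upload_parallel(path: str, streams: int = 4) -> dict[int, bytes]:
        # Read the file as numbered chunks and transmit them concurrently.
        chunks, sequence = [], 0
        with open(path, "rb") as f:
            while payload := f.read(CHUNK_SIZE):
                chunks.append((sequence, payload))
                sequence += 1
        received: dict[int, bytes] = {}
        with ThreadPoolExecutor(max_workers=streams) as pool:
            for sequence, payload in pool.map(lambda c: send_chunk(*c), chunks):
                received[sequence] = payload  # chunks may complete out of order
        return received

    def reassemble(received: dict[int, bytes], out_path: str) -> None:
        # Restore the original order before handing the data on.
        with open(out_path, "wb") as out:
            for sequence in sorted(received):
                out.write(received[sequence])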

Hierarchical Storage Management

Hierarchical storage management (HSM) offers a tiered virtual storage environment, allowing data to be automatically directed to the most cost-effective storage tier. Mediaflux has the following capabilities in conjunction with third-party HSM:

  • Reporting the tier in which particular data is located
  • Explicitly migrating data from one tier to another.

These capabilities can be used with queries to allow metadata to drive HSM migration. For example, a query that utilises geospatial and other metadata might be constructed to migrate all data relating to a particular geospatial area for a project to high speed disk.
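
Conceptually, such a rule pairs a metadata query with an explicit migration call, as in the sketch below. The function names and query expression are hypothetical placeholders, not Mediaflux service calls.

    def query_assets(where: str) -> list[str]:
        # Placeholder: return identifiers of assets whose metadata matches.
        raise NotImplementedError("stand-in for the repository's query service")

    def migrate(asset_id: str, tier: str) -> None:
        # Placeholder: explicitly move one asset's content to the named tier.
        raise NotImplementedError("stand-in for the repository's migration call")

    def promote_project_area(project: str, bounding_box: str) -> None:
        # Use project and geospatial metadata to decide what to move to fast disk.
        for asset_id in query_assets(
            f"project='{project}' and extent intersects {bounding_box}"
        ):
            migrate(asset_id, tier="high-speed-disk")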

When a file is migrated on-line, either the entire file or specific byte ranges may be migrated.

Putting it All Together

These features can be used in combination to streamline the ingestion of data as illustrated below.

Optimised Ingest

Figure 2: Example optimised ingest storage workflow

The ingest process is as follows:

  1. The user simply drags and drops a directory; the file system compiler takes over from there
  2. The file system compiler packages data into one or more archives at the specified compression level
  3. The file system compiler transmits the data to the server using parallel I/O
  4. The server optionally extracts the table of contents from the archive and stores it in the database for later browsing (see the sketch after this list)
  5. The data is committed to storage - the data is cached on the HSM disk as it arrives from the client, ensuring no extra copy overhead is incurred.
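
As a rough sketch of step 4 (not Mediaflux's metadata store or the AAR layout), the code below records each archive member's name, size and byte offset in a small SQLite table; a plain, uncompressed tar archive is assumed so that member offsets map directly to bytes on storage.

    import sqlite3
    import tarfile

    def store_table_of_contents(archive_path: str, db_path: str = "toc.db") -> None:
        # Record each member's name, size and data offset so the archive can be
        # browsed later without reading it back from storage.
        db = sqlite3.connect(db_path)
        db.execute("CREATE TABLE IF NOT EXISTS toc "
                   "(archive TEXT, member TEXT, size INTEGER, offset INTEGER)")
        with tarfile.open(archive_path) as archive:  # plain tar assumed
            for member in archive.getmembers():
                db.execute("INSERT INTO toc VALUES (?, ?, ?, ?)",
                           (archive_path, member.name, member.size, member.offset_data))
        db.commit()
        db.close()
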
Optimised Egest

Figure 3: Example optimised egest storage workflow

The egest process is as follows:

  1. A user or application searches for data using a query. The content of the archive can be browsed using the on-line table of contents, without retrieving the data from the HSM
  2. The user requests specific files from within the archive
  3. The server requests the requisite byte ranges for those files from the storage system - HSM in this case
  4. The files are extracted (without decompressing) from the archive and transmitted to the client, as sketched after this list
  5. The data is re-inflated as it arrives at the network boundary of the client computer.
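
Building on the table-of-contents sketch above, the following illustrates steps 3 and 4 of the egest workflow: only the byte range for the requested member is read from storage and returned, without unpacking the rest of the archive. The SQLite schema, the plain tar assumption and the function name are again illustrative, not Mediaflux APIs.

    import sqlite3

    def read_member(archive_path: str, member: str, db_path: str = "toc.db") -> bytes:
        # Look up the member's byte range in the stored table of contents.
        db = sqlite3.connect(db_path)
        row = db.execute("SELECT size, offset FROM toc WHERE archive = ? AND member = ?",
                         (archive_path, member)).fetchone()
        db.close()
        if row is None:
            raise FileNotFoundError(member)
        size, offset = row
        # Read only that byte range; the rest of the archive (and the HSM tier
        # behind it) is never touched.
        with open(archive_path, "rb") as archive:
            archive.seek(offset)
            return archive.read(size)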