Organizations often focus on their high-performance computing (HPC) cluster(s) in their quest to reduce time to insight and stay ahead of the competition. Yet this singular focus on HPC cluster infrastructure can obscure the bigger picture: the need for data processing pipelines in which HPC processing is central, yet still only one step in a larger, holistic scheme for handling the data.
The quantum of HPC processing is typically the venerable batch-oriented “job,” yet before a job can run, its input data is often copied on an ad hoc basis from an archive or other persistent storage to scratch storage on the HPC cluster. Then, when the job completes, the results must be copied back out of scratch to an archive or other persistent storage, again usually on an ad hoc basis.
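To make the pattern concrete, the sketch below shows what this ad hoc stage-in / compute / stage-out workflow often looks like in practice. It is purely illustrative: the paths, the job script name, and the use of Slurm’s sbatch command are assumptions, not a prescription for any particular site.

```python
"""Illustrative sketch of the traditional ad hoc stage-in / compute / stage-out
pattern. Paths and the job script are hypothetical; a Slurm-managed cluster
(sbatch) is assumed."""

import shutil
import subprocess
from pathlib import Path

ARCHIVE = Path("/archive/project/input_dataset")    # hypothetical persistent storage
SCRATCH = Path("/scratch/project/run_001")          # hypothetical cluster scratch space
RESULTS = Path("/archive/project/results/run_001")  # hypothetical destination for outputs

# 1. Stage in: copy input data from persistent storage to scratch.
shutil.copytree(ARCHIVE, SCRATCH / "input", dirs_exist_ok=True)

# 2. Run the batch job; sbatch --wait blocks until the job finishes.
subprocess.run(["sbatch", "--wait", "job_script.sh"], cwd=SCRATCH, check=True)

# 3. Stage out: copy results back to persistent storage and free scratch.
shutil.copytree(SCRATCH / "output", RESULTS, dirs_exist_ok=True)
shutil.rmtree(SCRATCH)
```

Each of these steps is typically performed by hand or by a one-off script, which is exactly where delays, wasted scratch capacity, and human error tend to creep in.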
This data processing approach was first employed with the earliest “stored program” computers, such as the IBM 704, and has changed remarkably little since. Today, organizations can instead build automated data processing pipelines by employing a contemporary data management solution. We’ll examine how this alternative approach can dramatically increase HPC cluster throughput, reduce human error, minimize storage costs, and even improve the reproducibility of numerical experiments.