Doug Buell spent 25 years as Technical Director of Research Computing Services at Dana-Farber Cancer Institute, one of the world’s leading cancer research and treatment centers, where he built and maintained the data infrastructure supporting life-saving science. His team relied on Arcitecta’s Mediaflux data management platform extensively. Here, Doug shares what two and a half decades in research computing taught him about storage, metadata, and why decisions that feel administrative in the moment matter enormously in the long run.
I was a programmer when I walked into Dana-Farber 25 years ago — someone who made systems work, fixed what broke, and built tools when nothing existed. Twenty-five years later, I left with a very different understanding of what I’d actually been doing. The data wasn’t a byproduct of the research. It was the research.
Storage Isn’t Strategy
Early on, data at Dana-Farber was scattered across shared drives and floppy disks. My first major initiative was to bring order to that fragmentation and build a web-based system to organize hundreds of protocol documents and amendments. It was unglamorous work, but it showed me that the moment you stop actively managing your information, you’ve already lost control of it.
Years later, a power outage exposed how vulnerable our patchwork environment was: billions of files spread across aging servers with minimal redundancy. A single catastrophic event could have erased years of irreplaceable research. What we needed was a real strategy, not more storage.
We rebuilt around a three-tiered architecture:
Hot copy — day-to-day research access
Warm copy — operational recovery
Cold copy — long-term preservation
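As a rough illustration of the three tiers above, the sketch below models each dataset as a set of copies and reports which tiers are still missing one. The `Tier`, `Dataset`, and `missing_tiers` names are hypothetical, chosen for this example only. They do not describe Dana-Farber's actual tooling or any Mediaflux API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    HOT = "hot"    # day-to-day research access
    WARM = "warm"  # operational recovery
    COLD = "cold"  # long-term preservation


@dataclass
class Dataset:
    # Hypothetical record: a dataset name plus the tiers that hold a copy.
    name: str
    copies: set = field(default_factory=set)


def missing_tiers(dataset):
    """Return the tiers with no copy of this dataset, sorted by tier name."""
    return sorted(set(Tier) - dataset.copies, key=lambda tier: tier.value)


def is_fully_protected(dataset):
    """A dataset counts as protected only when all three tiers hold a copy."""
    return not missing_tiers(dataset)
```

A dataset that exists only on the hot tier would be reported as missing its cold and warm copies, which is the kind of gap an audit of a tiered environment is meant to surface.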
Amazon S3 Glacier Deep Archive was on our shortlist, but the retrieval costs added up fast, and multi-day turnaround times were unviable for active research. We ended up going with tape: predictable costs, fast retrieval, and no charges for accessing our own data. Not the answer anyone expected, but it solved the actual problem.
Mediaflux and the Metadata Imperative
We brought in Arcitecta’s Mediaflux to manage the tape environment, and it became the backbone of our file management strategy, tracking data across environments and pulling in rich metadata. Unfortunately, we didn’t take full advantage of it early on. We treated it as a drop-in replacement for the old system rather than leveraging its tagging and cataloging capabilities, which came back to bite us when we needed to find and interpret data quickly.
It also showed me something I would spend the next decade trying to get researchers to actually do: Tag data upon creation, not retroactively when someone needs to find it.
We still have reel-to-reel tapes from the 1970s stored off-site. They contain data. Nobody knows what data. There are no labels, no context, nothing to tell you what instrument captured it or what study it belonged to — just magnetic tape sitting in a box, taking up space. Today’s instruments generate multi-terabyte datasets in hours, and if nobody tags them at the point of creation, we end up with the same problem at a much larger scale.
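One lightweight way to picture tagging at the point of creation is a sidecar file written next to each new data file, with the write refused if basic context is missing. The sketch below is a minimal illustration under assumptions: the required fields (`instrument`, `study`, `owner`), the `.meta.json` suffix, and the `write_sidecar` helper are all hypothetical, and a platform such as Mediaflux would capture metadata through its own mechanisms instead.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical minimum context needed to interpret a file later.
REQUIRED_FIELDS = ("instrument", "study", "owner")


def write_sidecar(data_path, **tags):
    """Write <file>.meta.json beside the data file; fail if context is missing."""
    missing = [name for name in REQUIRED_FIELDS if not tags.get(name)]
    if missing:
        raise ValueError("missing required tags: " + ", ".join(missing))

    data_path = Path(data_path)
    record = dict(tags)
    record["file"] = data_path.name
    record["tagged_at"] = datetime.now(timezone.utc).isoformat()

    sidecar = data_path.with_name(data_path.name + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True))
    return sidecar
```

Calling `write_sidecar("run01.dat", instrument="scope-1", study="S1", owner="lab-a")` would produce `run01.dat.meta.json` in the same directory. Omitting any required field raises an error at creation time, not years later when someone tries to work out what the file contains.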
The Human Challenge
The hardest problems I dealt with weren’t on the server side; they were a matter of trust. Researchers are protective of their data (reasonably so, given what they’ve put into generating it), and when central systems let them down, they stopped trusting them. After one of our bigger power outages, a few labs quietly started building their own mini data centers. I can’t really blame them. But it created a fragmentation problem that took years to sort out.
Getting that trust back was slow, deliberate work. You can’t recover it quickly once it’s eroded.
When data lives across a dozen different places outside any central system, you can’t protect it, you can’t catalog it, and you certainly can’t build on it later.
Building for an AI-Ready Future
I’ve watched AI go from a buzzword to something researchers at Dana-Farber were actively building into their procedures. What that experience made clear to me is that the AI conversation is really a data conversation. Train a model on poorly tagged, incomplete, or inconsistent data, and the outputs reflect that. In a cancer research setting, that’s not an abstract concern.
The role I think research institutions should invest in more is what I’d call a ‘data librarian’: someone who knows the science well enough to understand what needs to be captured and knows the information architecture well enough to ensure it gets captured correctly.
What I Leave Behind
The final major decision I made before retiring was to purchase a new Spectra Logic tape library: automated, intelligent, energy-efficient, and fully air-gapped. It is an on-premises solution that delivers fast access at predictable costs, without egress penalties.
The storage challenges ahead are real. Datasets are growing larger, instruments are proliferating, and budgets aren’t scaling at the same pace. But the harder work is still cultural. The biggest hurdle is getting researchers to see that tagging their data at the point of creation is what makes it usable five years from now, when someone else wants to build on it.
The lesson I most want to pass along from 25 years of working on these problems is this: the technology will keep changing, but the underlying discipline of knowing what you have, where it is, and what it means is what makes any of it actually useful.
Guest Contributor
Doug Buell, Retired Technical Director, Research Computing Services, Dana-Farber Cancer Institute
