High Throughput Computing (HTC) and High Performance Computing (HPC) represent two very different computational models, both in implementation and in the resources they require.
Quoting XSEDE, part of the national cyber-infrastructure for high performance computing, "HPC codes … are tightly coupled MPI, OpenMP, GPGPU, and hybrid programs. These codes require many low latency interconnected nodes." Because of this interconnect, HPC resources tend to be expensive. Titan, a Cray XK7 that was ranked number one on the TOP500 list in November 2012, was a $90 million upgrade.
Top HPC supercomputers run codes that span the entire system. In many cases HPC codes are not very portable, since both MPI and GPU libraries often have library- and/or machine-specific components. For example, Titan uses both CPU and GPU nodes. VERA, a light water reactor simulation, will run on Titan, but "... the adaption to Titan's hybrid architecture is of greater difficulty than between previous CPU based supercomputers."
The Open Science Grid (OSG) is also part of the national cyber-infrastructure, but is dedicated to HTC. The OSG pool is a virtual cluster overlaid on top of OSG resources aggregated from many sites using the OSG Glidein Factory. By definition, a glidein allows the temporary addition of a grid resource to a local OSG pool. The OSG Glidein Factory is the glidein-producing infrastructure; it advertises itself and listens for requests from OSG Virtual Organizations (VOs) and outside HTC users. It provides a way to run programs that utilize the spare capacity of a large number of resources in various locations.
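In practice, users reach this aggregated capacity through HTCondor, the job scheduler underlying the OSG pool. As a rough sketch (the executable name and file layout here are hypothetical), a submit description like the following queues 1,000 independent jobs that glideins can pick up wherever spare capacity appears:

```
# analysis.sub - hypothetical HTCondor submit description
universe                = vanilla
executable              = run_analysis.sh
arguments               = $(Process)
output                  = out/job.$(Process).out
error                   = out/job.$(Process).err
log                     = analysis.log
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue 1000
```

Submitting this with `condor_submit analysis.sub` creates 1,000 jobs, each identified by `$(Process)` (0 through 999), with no assumption about where or when any of them will run.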
|OSG Factory Glidein|
Any resource under one or limited ownership - be it a car, a laptop, a cluster or a data center - inherently cannot be used 100% of the time. There is an enormous dormant capacity to be extracted from the pools managed by OSG, adding up to many millions of CPU hours. The OSG HTC technology brings forth this hidden power, elevating the utilization of its managed resources as close to 100% as possible.
HTC is, by design, a system built from unreliable components. Work is handed out to every available node, and the results eventually come back. If some nodes fail, their jobs can simply be restarted on a different system.
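This restart-on-failure pattern can be sketched in a few lines of Python. The failure rate and chunk naming below are purely illustrative, simulating nodes that randomly drop work; a real scheduler such as HTCondor does the detection and resubmission for you:

```python
import random

random.seed(0)  # make this illustrative run deterministic

def run_chunk(chunk_id):
    """Simulate running one independent work unit on an unreliable node.
    A hypothetical 30% of attempts 'lose' the node mid-job."""
    if random.random() < 0.3:
        raise RuntimeError(f"node running chunk {chunk_id} failed")
    return f"result-{chunk_id}"

def run_all(chunks, max_attempts=10):
    """Hand every chunk out; restart failed chunks elsewhere until all finish."""
    results = {}
    pending = list(chunks)
    for _ in range(max_attempts):
        still_pending = []
        for chunk in pending:
            try:
                results[chunk] = run_chunk(chunk)
            except RuntimeError:
                still_pending.append(chunk)  # resubmit on another "node"
        pending = still_pending
        if not pending:
            break
    return results

results = run_all(range(5))
```

Because each chunk is independent and idempotent, a failure costs only that chunk's work, not the whole run.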
Many science problems can be adapted to HTC, perhaps more easily than existing codes can be adapted for top-ranking HPC machines. HTC supports a new frame of mind that unleashes what I call "guerrilla" science. I am inspired by Greg Thain, an HTC evangelist at the University of Wisconsin:
As a researcher you are under constant pressure to deliver results with limited project funding. What would happen to your scientific project if computation were really cheap? (Because it is.) So try not to think of yourself as constrained by the amount of computation you have locally. What would happen if you could run 100,000 hours, or one million hours? This is research. This is cheap. You can take risks. If you used 100,000 hours and still don't get the expected results, you still have the ability to analyze what happened and try again.
My takeaway is that in high-end HPC there is serious work to be done to adapt a code to run efficiently from one supercomputer to another. In many cases - not all, but a surprisingly large number of them - it is easier to adapt the code for HTC by breaking the application into many chunks, each running for a few hours or less. It is worth considering.
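The chunking step itself is often trivial. As a minimal sketch (the task count and chunk size are made-up numbers), a million-point parameter sweep can be split into independent pieces, each sized to finish within a few hours on a borrowed node:

```python
def make_chunks(n_tasks, chunk_size):
    """Split a large parameter sweep into independent chunks, each small
    enough to run as one HTC job."""
    return [range(start, min(start + chunk_size, n_tasks))
            for start in range(0, n_tasks, chunk_size)]

# hypothetical sweep: 1,000,000 parameter points, 10,000 per job
chunks = make_chunks(n_tasks=1_000_000, chunk_size=10_000)
```

Each chunk becomes one job submission; the results are merged afterwards. No two chunks share state, which is exactly what lets them run anywhere, anytime, on whatever resource becomes free.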