An interview with Dr. Frank Wuerthwein. Can high energy particle physics change the way we do mainstream big data?
He is "developing, deploying, and now operating a worldwide distributed computing system for high throughput computing with large data volumes. In 2010, 'large' data volumes are measured in Petabytes. By 2020, he expects this to grow to Exabytes." He is a key management member of the Open Science Grid (OSG).
On September 25, 2013, at ISC Big Data'13 in Heidelberg, Germany, he will present a talk titled "Dynamically Creating Big Data Processing Centers – a Large Hadron Collider Case Study". We chatted about it, and our conversation is below.
Dynamic Data Center concept
Distributed Human Resources
M (Miha): What is the most significant thing about the paper you present at ISC Big Data'13?
FW (Frank Wuerthwein): The audience will be half university and half industry. I am trying to present the logic of what we do in particle physics: to describe the 30,000 ft picture and why we are doing that. When I start with the "why do we do what we do", I start with the question: "What is the most valuable asset we have?" Usually in supercomputing it is having the biggest computers money can buy. But for us the most valuable assets are human. This is the most important commodity we have. We did a study in CMS on how much money we spent on salaries, compared to the cost of computing resources. We spent about five times more on people than on computing resources per se.
So how do you maximize output, given that human effort is our most valuable commodity? I look at where all these people live. What are the organizational principles of our collaboration? How can technology support these principles, and thus support the productivity of the collaboration, rather than put up barriers?
M: Do you refer to positive user experience? What do you mean?
FW: It is not so much about user experience (of course we satisfy perhaps up to the 90th percentile, but not more); it is about how to organize humans around the resources for maximum productivity. CMS includes 2,000 people or so across 180 institutions in 40 countries. The computing infrastructure needs to both support centralized operations for the good of the collaboration as a whole, and reflect the distributed nature of the human capital.
M: How can you manage for distributed and centralized at the same time?
FW: We must be able to (1) dynamically add resources, because people all over the world may temporarily have access to resources they can contribute, or want to use for their local needs. We must be able to (2) dynamically switch allocations of resources from global to local and vice versa. There must be incentives to donate resources to the common good. And (3) it must be possible to use the software tools the collaboration provides on resources that are locally controlled. The allocations must ultimately be under local control in order to allow rapid changes without bureaucratic overhead. Humans who are local should not have to wait for a centralized resource allocation decision. Otherwise, they will hold on to their local resources, rather than sharing them freely when they don't need them.
M: So what you are saying is that the workflow is centralized, and everybody must fit into it, wherever they are?
FW: Yes and no. We have both centralized and local workflows. However, the switch between local and centralized workflows should be seamless. One should be able to donate resources and get them back in a very short time when needed.
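Editor's sketch: the three requirements Frank lists (add resources freely, switch them between global and local use, keep the switch under local control) can be illustrated with a small toy model. This is only an illustration under stated assumptions, not how OSG or HTCondor implement it; the `Resource` and `Federation` classes, their methods, and the rack names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """One contributed resource, e.g. a rack of worker nodes (hypothetical model)."""
    name: str
    owner: str            # the local group that controls this resource
    cores: int
    shared: bool = False  # True while donated to the global pool


@dataclass
class Federation:
    """Toy registry of resources; not a model of any real OSG or HTCondor API."""
    resources: dict = field(default_factory=dict)

    def add(self, resource: Resource) -> None:
        # (1) Anyone can add a resource; no central approval step is modeled.
        self.resources[resource.name] = resource

    def set_shared(self, name: str, requester: str, shared: bool) -> None:
        # (2) Only the local owner may switch a resource between local and global use.
        resource = self.resources[name]
        if requester != resource.owner:
            raise PermissionError(f"{requester} does not control {name}")
        resource.shared = shared

    def global_cores(self) -> int:
        # Cores currently available to the collaboration as a whole.
        return sum(r.cores for r in self.resources.values() if r.shared)

    def local_cores(self, owner: str) -> int:
        # Cores an owner is currently holding back for local work.
        return sum(r.cores for r in self.resources.values()
                   if r.owner == owner and not r.shared)


# Hypothetical example data.
fed = Federation()
fed.add(Resource("ucsd-rack-1", owner="ucsd", cores=512))
fed.add(Resource("fnal-rack-7", owner="fnal", cores=1024, shared=True))

fed.set_shared("ucsd-rack-1", requester="ucsd", shared=True)   # donate while idle
print(fed.global_cores())                                      # 1536
fed.set_shared("ucsd-rack-1", requester="ucsd", shared=False)  # reclaim immediately
print(fed.global_cores(), fed.local_cores("ucsd"))             # 1024 512
```

The point of the design is that `set_shared` checks only the local owner, so donating and reclaiming never waits on a central decision.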
Fig. 1: A design illustrating how raw data originating from CERN are processed at peak and at steady state
OSG = Open Science Grid, FNAL = Fermi National Accelerator Laboratory, SDSC = San Diego Supercomputer Center
Dynamic Data Centers
M: How does this tie in with the idea of Dynamic Data Centers?
FW: This is something the future will bring. It's a natural next step given that we have already created dynamic compute centers. We will show an example of what we have actually done. There is a bigger picture. It answers the question: "Why do we have a distributed infrastructure in the first place?" People sometimes ask: "Why don't you have all resources in one place on the planet?" It does not make sense to have a single big building with all the computing resources, just like it doesn't make sense to bring all the people into one place. Having a distributed architecture allows more resources, people and computers, to be more effective participants in the global CMS collaboration. Am I making sense?
M: Very much so. What you are saying is that not only the machines but the people too are distributed. This distribution of humans is as important as the distribution of machines. Right?
FW: Even if it were possible to put all computer resources in one place, it would not be desirable, because the skilled people do not live in one single place. You want to add more resources to the system, without having to ask a central authority for the green light. You should be able to come with a rack of hardware and say: "Now I want to add this to the global system, while I can still use my rack in whichever way I want to use it locally. But when I am not using it, you can have it." The transitions in and out of the infrastructure must be seamless.
M: So what is the difference between High Throughput Computing (HTC) and what you propose?
FW: HTC is the technology that makes this possible. I am trying to give people a sense of why HTC is the predominant computing paradigm in our field (high energy particle physics). Some people ask: "Why don't you use the powerful supercomputers you have access to?" "How can it be that these distributed resources are better for you?" I will address these questions in my talk. In essence, the short answer is that a distributed system maximizes human productivity. Then the ability to connect resources from all over the world is a tremendous advantage. This kind of resource organization, at least in our field, is highly desirable. Once you accept that, the idea of Dynamic Computer Centers makes a lot of sense.
M: What could be a more formal definition of what a Dynamic Computer Center is?
FW: I would describe everything you need to have a Dynamic Computer Center. You need disks (very large ones for huge data sets), you need networked access to these to stage in the data, and you need an output configuration where you place your processed data. All of these can be created out of the existing resources on the fly.
M: Why call it "Dynamic"?
FW: I mean you can take a resource at any time and use it, without needing pre-installed software on it, or with only a very limited amount of such software. So when I go away, the resource is "clean" for anybody else to have.
M: Is your work at UCSD on processing CMS data with the Gordon supercomputer as a node in OSG a good example of a Dynamic Data Center? Can one repeat this model in a different location?
FW: One can replicate this experience, because it is a matter of (1) making the basic APIs available and (2) making it easy for the hardware operator to support the APIs. It is essential to use the existing mechanism for logging in to the supercomputer (with all its cooling, electricity consumption, etc.) for the distributed access; this is ssh. We export the batch system to the outside world by interfacing with ssh. We used Bosco for this. Then we needed an interface to move data in and out. We used GridFTP. For applications we use things that work with the HTTP protocol. And so on. We can provide all the technical details to anyone interested, but the principles are very simple. Everything under the hood is abstracted away so we can mix resources as they become available. We only expose the HTCondor submission, which everyone in our world understands.
M: Would you talk about it at ISC BigData'13?
FW: I will stay away from technical details and talk about the high level ideas I described in this interview.
M: Who do you think can benefit most from the Dynamic Data Center concept you helped develop?
FW: Once we implemented and deployed this dynamically "open" architecture for CMS, we realized that it is easily open in a second dimension. Not only is it open towards participants within the CMS collaboration from all over the world, but it is also easily made open across the entire range of scientific endeavors and the basic principles transcend particle physics. Biologists, engineers, mathematicians, chemists, sociologists, etc., all benefit from the basic structure. Once you have this deployed for us, it can easily be opened up also for others.
And so today my biggest fascination comes from finding "new customers" outside of particle physics.
ISC Big Data, Heidelberg and Southern California
M: You were born in the Heidelberg area in Germany. What made you come to the US?
FW: I originally came to California as a post-doc at Caltech for a couple of years. But when you are born in a cold country, and then you have lived in Southern California once, it is very hard not to want to come back. I like surfing; this is how I became a "native" in San Diego.
Miha's Note: Frank has the surfing weather forecast on his web site.
I never finished my university degree in Germany. I was supposed to be in the US only for one year. In that year, with a scholarship to Cornell, I met my wife, and well, I never went back to Germany. I got my Ph.D. at Cornell, then went to Caltech for a couple of years, then I went to MIT for four years, then I came to UCSD (University of California, San Diego). I crossed the USA three times, coast to coast.
M: ISC Big Data'13 is the first Big Data conference of such magnitude in Europe. Half of the attendees are from the commercial world. What do you have to tell them?
FW: I don't know. When I heard about this conference, I thought it was a very interesting idea. I would expect to learn more than I bring in. It is not obvious to me how much we have in common. I want to discover what they do, they will discover what we do, and then we can define what common ground we share. Some of the most interesting things are not the talks, but the conversations during dinner, coffee breaks or in the corridors. Does it make sense?
M: It makes a lot of sense. This is how TOP500 was born a few decades ago.
FW: I want the broadest possible exposure, so that there is a maximum number of avenues through which they can engage with me. If I talk only about Dynamic Data Centers, I can miss out on ten other conversations which are worth having. For example, we have this dichotomy between structured and unstructured data. On one side you have Oracle-like structured data, and at the other extreme unstructured data, where you don't even know what to look for until you actually look for a specific purpose. I want to position my hundreds of Petabytes of data from particle physics in this continuum. I don't see this as an either/or. There is a lot of grey in between.
M: It seems the amount of data you process in CMS (the Compact Muon Solenoid experiment at the CERN physics laboratory) is way above what they are accustomed to today in the commercial world.
FW: To my knowledge, there are only two other places on the planet that have our volume of data: the National Security Agency (NSA) and Google. I am just guessing, as I don't know exactly how Google manages its data.
M: What would have to happen for you to consider ISC Big Data'13 a success?
FW: One of my collaborators will attend the conference to see what attracts him as a career: Academia or Industry. We want to discover how the commercial world values our work. If he has a clear picture, this would be a great success. For me personally, I am always looking for “new customers” as well as inspiration for doing things a different way.
M: How do you feel about going back to a conference in Heidelberg, in the country where you were born, after so many years in the US?
FW: I never actually gave a conference talk in Heidelberg. In all these years, I only went back once to give a seminar talk. For me, Heidelberg University is very prestigious, and in that sense, this is a very special occasion for me.