An Interview with David Ungar, IBM Research
Dr. David Ungar (D.U.) answers the questions of Miha Ahronovitz (M.A.) . This interview is a continuation of the article Many Core processors: Everything You Know (about Parallel Programming) Is Wrong!
David is a professor of computing languages, a highly regarded IBM researcher, a jazz singer, an encyclopedic mind and one of the advocates for a new emphasis in computer science, which he describes below
Crista Videira Lopes, Associate Professor on Computer Science at University of California, Irvine introduced David at SPLASH 2011 as "The one and only Ungar".
M.A. In a 1996 paper you and Randy Smith wrote: "Subjectivity is an inescapable aspect of two great natural systems: human language and the physical world". How did this evolve into the many cores research?
D.U. Count Alfred Korzybski, wrote a book Science and Sanity , published in 1933. He looked at the way people used language, and looked at the modern science of his day. He realized that common language usage was grounded in pre-science. Why not use language differently in order to think differently and discover new things? Human knowledge of the world is limited both by the imperfection of the human nervous system and by the structure of language.
I came across Korzybski's ideas from a science fiction book of A. E. van Vogt, The Pawns of Null-A (also published as The Players of Null-A) in 1956. One of Korzybski's key points was our nervous system constantly abstracts at every level.
We take a bunch of different things and we only look at their similarities, forgetting all the differences.
|This is an illusory object because different people|
viewing it will see different pictures, Some see
a pretty young woman, others an old woman.
|This is one the most famous quotes|
from Alfred Korzybski
So Randy and I said: “Lets put these concepts in a computational framework.” In the paper you mentioned Randall Smith and I described US, - a programming language that explicitly puts the viewpoint into the computational frame. Normally there's an object, a thing, you are throwing messages at the object, and the messages can carry other objects. But in US there is another object or perhaps many objects, that determine the context. A message in US is dispatched according to all of these objects: the receiver plus the context objects: So in US, introducing multiple, implicit cells, upon which the messages were dispatched, - in somewhat the same fashion as other multiple dispatch languages – led to a generalization from object oriented programming, to context-sensitive, or subjective programming.
Other people have done similar things, but - in my opinion, none was as elegant as the model described in the US paper we originally wrote in 1996.
M.A Do you mean the paper I mentioned in the question?
D.U. Yes. The difference between the procedural and object oriented programming, is one dimension of context. In procedural programming the subroutines are not associated with any object or class, and each time you run the same routine, the same thing happens. You can parametrize execution by what you pass through to the routines, but that’s about all you can do. In object oriented programming, there is always an object, which is the current receiver, but you can think of it as a context. Most messages are sent to current receiver, called "self" or "this". What that typical message send to the current receiver means is “Don’t change the context, do what I am asking in the current context”, but occasionally you send a message to an explicit receiver, a different receiver, not "self" or "this", and such an operation can be viewed as “Change the computational context - for a while. Run in that context. Put that object in control of the computation". What is important is that these contexts are implicit, as long as I am staying in the same computational context. That is, they are implicit, in many object oriented languages, such as Self, C++, Java, Java Script does (it follows C++ syntax) - the notion of context becomes implicit.The implicit nature of the syntax for a send to the current receiver is important, because it changes the way you think about things. The receiver becomes part of the background context.
So this is object oriented programming. Functional programming has no implicit context, object oriented programming has this one receiver and that’s it. Lots of folks have said; ” Let’s have multiple receivers”. The New Flavors extension of LISP (by Cannon and Moon, in Symbolics Zetalisp), was a relatively early language with multiple dispatches. But, what we did in Us, was we made those implicit. You have multiple implicit "selves", and they can take on different roles. For example, one role could be on behalf of whom you run this computation. Another one could be a point in time, you can say “run this computation as of January 1st, 2013”.
In US we have a nice elegant framework, in which to frame these computations.
The insight we had was that the multi-dispatch problem was identical to the multiple-inheritance problem. Searching through all these selves to find a method that matched, was in some sense isomorphic to the problem of sending a message to an object that can have many parents, each having a match to the message sent.. This was one of the most relevant insights.
This lead to the re-formulation of the problem; “Do I have many cells or do I have one that is the Cartesian product of these many cells?”. That is an interesting dual way to understand what US is all about.
M.A. How US teachings found their ways into many-cores research?
D.U. That did not lead us directly at all, but it was an example of being conscious of an abstraction. In a multi-core, or many-cores or distributing computing, each core is its own locus of reality. Jonathan Walpole (currently a professor of computer science at Portland State University) has a research area he named “relativistic programming”. It’s a wonderful, wonderful name, because it captures this idea that each core has its’ own frame of reference.
Paul McKenney (Distinguished Engineer with IBM Research in Portland, Oregon) open sourced his book “Is Parallel Programming Hard, And, If So, What Can You Do About It? “ The book is a wonderful expedition into the state-of-the-art on how to do modern parallel programming. When you read this parallel programming 101, it becomes clear that the order in which events happen in real time is different on different cores, unless you really work to make things consistent in the way they need to be. And any attempt to impose consistency, has a cost, explicit or implicit.
For example architectures like SPARC have three different STORE models, and presumably the more relaxed one run faster than the less relaxed ones. The new C++ 11 Standard has different ordering models, and again, the idea is the more relaxed one will give you a better performance. Can you tolerate the loss of coherence in different viewpoints? In the Renaissance project, we are specifically embracing the idea that every core computes in its own context. 20:31.
This is the long answer to your question
M.A. This is though the real answer. Albert Einstein once said, we should make things as simple as possible, but not simpler.
D.U. The same viewpoint applies to why I have been frustrated with most computer languages that static type checking. Because any use of something, is in a context, and in any given context, one only needs a subset of the invariant provided by that of that type, but if you start making unique types for every invariant, you have a huge number of types, and that becomes unmanageable. So typically type systems have types to represent clusters of invariants. The problem you find is that you want to use something somewhere, where the full types aren’t compatible, , but in that context it will work just fine, This is the common experience that SMALTALK, LISP, SELF, maybe Java Script programmers have, when they try to use a language that enforces static checking.
M.A. In parallel distributed computing, using a product like Grid Engine we got different results if the round robin servers were processing some sequential jobs mixed with parallel, or if we dedicated the servers to parallel MPI processing exclusively. We didn’t know why at the time. How can you explain this?
D.U. I think the problem you described, is that you have different results if you configure the servers differently. One of the principles in our project is what we call “end-to-end non-determinism”. The idea is if you go into these parallel systems to get performance – this holds for multi-core, many-core, distributed, then you need to take an approximate route. This means you need to have a way to teach the client to specify to your program what sort of level of approximation or accuracy is tolerable for a given output. For example if you ran a weather prediction system, a variance of 10% in 5 minutes in the probability of rain, is perfectly acceptable. If you use the output of the thing to make an irreversible decision, then you may want to have tighter values. The trade off is between accuracy and time and cost of the computation. The idea is to allow our clients to make those trade offs. There is no free lunch, because to really exploit parallel computational systems, it’s going to become much more valuable to be able to make these trade offs, than it has been before, because the cost of synchronization and communication , goes way up in proportion to the cost of computation
M.A. Can you describe something where accuracy is not important, specifically not generically?
D.U.: I think I just gave you one, weather forecasting. It depends what you mean by accuracy. For example, if I am doing plans on the future course of my business, and I am rolling up every individual transaction for the last five years, and if I have answers in the range of millions and billions dollars, then accuracy to a penny is probably not important. In fact the input data input have only certain accuracy anyway. If these decisions are made for a period of a day, if the numbers are off a little bit for a second, it is OK too.
M.A. NOAA wants to increase the warning period for a hurricane from the current 14 days to more. How can they use modern parallel programming with many-cores to achieve this?
D.U. I don’t know what their limitation is, whether is the computation, or the model or the fact that weather is inherently a chaotic system. I can’t answer your question, but the example I would give is an anthill. If you are at a picnic, you will see the ants arrive. Ants are not performing a deterministic computation to discover your picnic. What happens is there are always a relatively small number of ants randomly scouting, and when one ant finds a picnic, it it returns to the hill, while laying down a scent trail, then another ant follows that scent trail, laying down more scent if it finds food. So this is an example of an algorithm, where the accuracy is plenty good enough. Such systems need a large number of ants distributed in space; randomness is an essential ingredient, as every ant behaves slightly different from the other, then positive feedback (following the right scent trail and negative feedback. (so that not all ants scout in the same direction).
M. A. Nice metaphor, actually a real answer. How can we simulate this in many-cores?
M. A. Nice metaphor, actually a real answer. How can we simulate this in many-cores?
D.U. We need a large number of cores before “the anthill” algorithms begin to pay off. But I think as long as we have these four essential ingredients (1) scale (2) randomness, (3) positive feedback and (4) negative feedback we will get some interesting results.
M.A. I have heard a lot about strong scaling and weak scaling. What is your view on those definitions? Are they relevant in the future many-cores computations?
D.U. My understanding of “strong scaling “is speeding up a fixed workload, in proportion of the number of processors. “Weak scaling“ refers scaling the workload up and having it run just as efficiently as you increase the parallelism available.
What people want, some people do, is to run bigger problems, and this may be called “weak scaling”, but I think the real problems have a combination of both “strong” and “weak” scaling. They want the problems to run faster than before and they want to run larger problems than ever before. The key is to take a highly parallel system, and get that parallelism to pay off. This is something we have been struggling with in our field for many years. There are some special cases where the folks have succeeded, like SETI at Home, a shining example. However we want to penetrate the mainstream, - for example for a large business to be able to perform large business analytic computations - in a timely fashion - and have an increase in parallelism of economically viable systems pay off.
M.A. You mentioned in your SPLASH slide presentation in November 2011, that the hardware trends created the need for the many-cores new ideas and research . What about the actual needs of real customers forming large market segments?
D.U. Did the iPad happen because of hardware developments or because of a real need? How would you answer this question?
M.A. Steve Jobs inferring real needs from desires that customers could not utter by themselves. The enormous revenues from the sales of iPad proved Mr. Jobs right. To me, this is a product management success as well.
D.U. My question to you was getting at an interesting linguistic assumption. When you say “real needs”, underneath those words there is an assumption that the need is an objective thing in reality that you can measure with an instrument. You are abstracting away important aspects of context.
The way I would look at it is, as hardware changes, the economic trade off changes. The hardware is able to do different things for different costs compared to before. For example when IBM invented the hard disk drive, all of a sudden hardware became capable of storing larger amounts of data for a fraction of the price. What you call a real need, as I understand it, is an abstraction of purchasing decisions made by clients. The client’s context changes, and part of that context is what the hardware can deliver for what cost. As the the price performance envelope of hardware changes by many orders of magnitude over the years, things that can be performed by computers come to have more value to the client than the cost they had to lay out. Demand for computation increases as the price per unit of computation decreases.
This decrease is what ultimately the computer scientists and all computer companies try to do.
M.A, Is it better to use one thousand core processors instead of ten one hundred processors?
D.U. Why do you care? What you should have asked me, what is the cost and how long will it take to solve the problem by using a single thousand-core processor versus ten hundred core processors. After such a calculation, I am sure you will find that there exist problems out there where the usage of thousand-core processor will be the best solution
M.A Referring to the Golden Circle idea , answering well the question of “why,” as opposed to “what” and “how”, assures success of ideas and products. Can you comment?
D.U. I don’t buy the premise in the way you presented it. I think there are some things I don’t really care why, or there might be other things that I do. Take for example the electricity delivered to my house. I care about the voltage consistency, the frequency; pretty simply, it all boils down to a few plain parameters.
In IBM, Sam Palmisano, IBM Chairman, defined the three core values for IBM: (1) Dedication to every client's success (2) Innovation that matters, for our company and for the world (3) Trust and personal responsibility in all relationships. I really like these values, they resonate with me. The things I do and other people do in IBM, should be judged according to those values. Even companies such as Apple where Jobs said there are no focus groups, those companies wouldn’t succeed if people were not judging them as providing products that have value.
I really believe that harnessing the parallelism will let us deliver value. I think we will be able to perform jobs for our clients that need to be performed, giving them the differentiators that they need to care about.
M.A. I read last week an interview with Intel’s Brian David Johnson in Scientific American. Brian is a futurist and technology planner. He is paid to craft visions of “both Intel’s prospective technologies and what’s coming for the entire computing industry until the year 2020. How does it work in IBM?
D.U. I am one of many, many researchers that IBM has. Not every single researcher is going to be correct in his or her bet on what the future holds. But enough IBM researchers will be correct enough of the time, and the impact of their discoveries or inventions will be great enough, that as a whole it pays off. We all pursue what we think is right. The diversity in minds, in background and in thoughts, will ensure that we don't miss any key future developments. I think there is a very great chance that our work will pay off handsomely.
It’s been widely observed that cycle times have not shrunk very much. There could some disruptive innovation at circuit or technology level – IBM has folks working on that – that will change this, but for now this is our assumption, and it's held true for a few years.
The next bet we make is that special relativity will continue to hold. No one will discover a way to communicate between two cores faster than speed of light. That seems like a sound assumption.
The third bet we made is that feature sizes will continue to scale downwards, meaning people will be able to pack more devices, economically, in a single die. The original direction of the Renaissance project was looking at 1.000 cores per processor, but this assumption we are relaxing as we go along. Perhaps systems will have many dies, each containing tens or hundreds of cores. Either way, the more numerous the cores, the higher the communication cost, and the more the case for programming paradigms that require less communication.
We also see it is very difficult to program parallel systems in the exact way, although one can improve with experience. Then we see there many systems in nature that are massively parallel are functioning very well.
Because of all these assumptions, the likelihood for the payoff is strong. We will find ways to program systems, that use a lot of local interactions and lots of feedback loops, instead of global communications, global synchronization. Even the use of a local synchronization, could be too expensive.
M.A. Is there any one in IBM who performs the function of futurologist, similar to David Brian Jones in Intel ?
D.U. As I mentioned before, I am one of 3,000 researchers across IBM’s nine global labs and we are all focused on various strategic bets that may benefit the company, 5, 10, 15 to 30 years from now. John Kelly is the Senior VP and Director of Research in IBM. He directs the research effort and helps guide IBM's overall technical strategy. [Miha’s note: IBM has an annual income of $1B per year from IP alone].
Within IBM, the folks on the project (Sam Adams, Doug Kimelman, Mark Wegman and I) would like to make a persuasive case that convinces people, to incorporate these ideas into products. We want to see those products be wildly more successful than before.
Outside of IBM, all of us on the project (including Prof. Theo D’Hondt, Stefan Marr, and Mattias De Wael at Vrije Universiteit in Brussels; and Prof. Andrew Black, and Max Orhai at Portland State University) would like to encourage the formation of a new field or sub-discipline in Computer Science, the exploration of probabilistic data structures that sacrifice synchronization, but achieve better performance. We are organizing a workshop on this subject at the forthcoming ACM SPLASH 2012 conference. We are hoping to bring together researchers from all over the world. It is clear to me, and if I am right, it will become clear to lots of people that this is an important new direction for computer science.
M.A. Do you have description of this new computer science field, or sub-field?
D.U. Let me quote from the workshop proposal, written by Andrew Black, Theo D’Hondt, Doug Kimelman, Martin Rinard, and myself:
“Massively-parallel systems are coming: core counts keep rising – whether for conventional cores as in multicore and manycore systems, or specialized cores as in GPUs. Conventional wisdom has been to utilize this parallelism by reducing synchronization to the minimum required to preserve determinism—in particular, by eliminating data races. However, Amdahl’s law implies that on highly-parallel systems even a small amount of synchronization introduces serialization that limits scaling. Thus, the conventional wisdom is doomed to fail as it hits “the CAS ceiling.”
A new school of thought is arising: one that accepts and even embraces non determinism (including data races), and is in return able to dramatically reduce synchronization, or even eliminate it completely. However, this approach requires that we leave the realm of the certain and enter the realm of the merely probable.
How can we cast aside the certainty of truth, the security of correctness, the logic of a proof, and adopt a new way of thinking, where answers are good enough but not certain, and where many processors work together in parallel without quite knowing the states that the others are in? We may need some amount of synchronization, but how much? Or better yet, how little? What mental tools and linguistic devices can we give programmers to help them adapt to this challenge?“
M.A. Thank you very much. Is there anything you want to ad, that I missed asking?
|Dr. David Ungar, IBM Research|