An Interview with David Ungar, IBM Research
Dr. David Ungar (D.U.) answers questions from Miha Ahronovitz (M.A.). This interview is a continuation of the article Many Core processors: Everything You Know (about Parallel Programming) Is Wrong!
David is a professor of computing languages, a highly regarded IBM researcher, a jazz singer, an encyclopedic mind, and an advocate for a new emphasis in computer science, which he describes below.
Crista Videira Lopes, Associate Professor of Computer Science at the University of California, Irvine, introduced David at SPLASH 2011 as "The one and only Ungar".
M.A. In a 1996 paper, you and Randy Smith wrote: "Subjectivity is an inescapable aspect of two great natural systems: human language and the physical world". How did this evolve into the many-core research?
D.U. Count Alfred Korzybski wrote a book, Science and Sanity, published in 1933. He looked at the way people used language, and he looked at the modern science of his day. He realized that common language usage was grounded in pre-science. Why not use language differently in order to think differently and discover new things? Human knowledge of the world is limited both by the imperfection of the human nervous system and by the structure of language.
I came across Korzybski's ideas from a science fiction book by A. E. van Vogt, The Pawns of Null-A (also published as The Players of Null-A), in 1956. One of Korzybski's key points was that our nervous system constantly abstracts, at every level. We take a bunch of different things and look only at their similarities, forgetting all the differences.
[Image: the famous "young woman or old woman" illusion. It is an illusory object: different people viewing it see different pictures; some see a pretty young woman, others an old woman.]
[Image: one of the most famous quotes from Alfred Korzybski.]
So Randy and I said: "Let's put these concepts in a computational framework." In the paper you mentioned, Randall Smith and I described Us, a programming language that explicitly puts the viewpoint into the computational frame. Normally there is an object, a thing, and you throw messages at that object, and the messages can carry other objects. But in Us there is another object, or perhaps many objects, that determine the context. A message in Us is dispatched according to all of these objects: the receiver plus the context objects. So in Us, introducing multiple implicit selves upon which the messages are dispatched, in somewhat the same fashion as other multiple-dispatch languages, led to a generalization from object-oriented programming to context-sensitive, or subjective, programming.
Other people have done similar things, but in my opinion none was as elegant as the model described in the Us paper we originally wrote in 1996.
M.A. Do you mean the paper I mentioned in the question?
D.U. Yes. The difference between procedural and object-oriented programming is one dimension of context. In procedural programming, the subroutines are not associated with any object or class, and each time you run the same routine, the same thing happens. You can parametrize execution by what you pass to the routines, but that is about all you can do. In object-oriented programming there is always an object, the current receiver, and you can think of it as a context. Most messages are sent to the current receiver, called "self" or "this". What that typical message send to the current receiver means is "Don't change the context; do what I am asking in the current context." But occasionally you send a message to an explicit receiver, a different receiver, not "self" or "this", and such an operation can be viewed as "Change the computational context for a while. Run in that context. Put that object in control of the computation." What is important is that these contexts are implicit as long as I am staying in the same computational context. In many object-oriented languages, such as Self, C++, Java, and JavaScript (which follows C++ syntax), the notion of context is implicit. The implicit nature of the syntax for a send to the current receiver is important, because it changes the way you think about things. The receiver becomes part of the background context.
So this is object-oriented programming. Functional programming has no implicit context; object-oriented programming has this one receiver, and that's it. Lots of folks have said, "Let's have multiple receivers." The New Flavors extension of Lisp (by Cannon and Moon, in Symbolics Zetalisp) was a relatively early language with multiple dispatch. What we did in Us was make those receivers implicit. You have multiple implicit "selves", and they can take on different roles. For example, one role could be on whose behalf you run this computation. Another could be a point in time; you can say "run this computation as of January 1st, 2013".
In Us we have a nice, elegant framework in which to frame these computations.
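To make the idea concrete, here is a minimal sketch, written in C++ with hypothetical names rather than in Us itself (Us is a prototype-based dialect of Self, and there the context is implicit and dispatch is part of the language); here the context is passed explicitly, just to show the flavor of a message whose answer depends on the receiver plus context objects such as a point in time and an "on behalf of" role:

```cpp
#include <iostream>
#include <string>

// Hypothetical illustration: a "context" object that travels with every send.
// In Us the context objects are implicit; here we pass them explicitly.
struct Context {
    std::string as_of_date;   // a point in time the computation runs "as of"
    std::string on_behalf_of; // the user or role the computation runs for
};

struct Account {
    double balance_now  = 100.0;
    double balance_2013 = 40.0;

    // The answer depends on the receiver *and* the context, not the receiver alone.
    double balance(const Context& ctx) const {
        if (ctx.as_of_date == "2013-01-01") return balance_2013;
        return balance_now;
    }
};

int main() {
    Account acct;
    Context today{"today", "alice"};
    Context jan2013{"2013-01-01", "alice"};

    // Same receiver, same message, different answers in different contexts.
    std::cout << acct.balance(today)   << "\n"; // 100
    std::cout << acct.balance(jan2013) << "\n"; // 40
}
```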
The insight we had was that the multi-dispatch problem was identical to the multiple-inheritance problem. Searching through all these selves to find a method that matched was, in some sense, isomorphic to the problem of sending a message to an object that can have many parents, each having a match for the message sent. This was one of the most relevant insights. It led to a re-formulation of the problem: "Do I have many selves, or do I have one self that is the Cartesian product of these many selves?" That is an interesting dual way to understand what Us is all about.
M.A. How did the lessons of Us find their way into many-core research?
D.U. It did not lead us there directly at all, but it was an example of being conscious of an abstraction. In multi-core, many-core, or distributed computing, each core is its own locus of reality. Jonathan Walpole (currently a professor of computer science at Portland State University) has a research area he named "relativistic programming". It's a wonderful, wonderful name, because it captures this idea that each core has its own frame of reference.
Paul McKenney (a Distinguished Engineer with IBM in Portland, Oregon) open-sourced his book "Is Parallel Programming Hard, And, If So, What Can You Do About It?" The book is a wonderful expedition into the state of the art of modern parallel programming. When you read this parallel-programming 101, it becomes clear that the order in which events happen in real time is different on different cores, unless you really work to make things consistent in the way they need to be. And any attempt to impose consistency has a cost, explicit or implicit.
For example, architectures like SPARC have three different store-ordering models, and presumably the more relaxed ones run faster than the less relaxed ones. The new C++11 standard has different memory-ordering models, and again the idea is that the more relaxed ones will give you better performance. Can you tolerate the loss of coherence across different viewpoints? In the Renaissance project, we are specifically embracing the idea that every core computes in its own context.
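As a rough illustration of what those ordering choices look like to a programmer, here is a minimal C++11 sketch (not from the interview) contrasting relaxed and acquire/release atomics; the relaxed form gives the hardware and compiler the most freedom, at the price of weaker guarantees about what other cores observe:

```cpp
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int>  data{0};
std::atomic<bool> ready{false};

void producer() {
    data.store(42, std::memory_order_relaxed);
    // Release ordering: everything written before this store becomes visible
    // to a thread that acquires `ready`. If this were memory_order_relaxed,
    // the reader could see ready == true yet still read a stale `data`.
    ready.store(true, std::memory_order_release);
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    std::cout << data.load(std::memory_order_relaxed) << "\n"; // prints 42
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```

Each step toward stricter ordering buys consistency across viewpoints, and each step toward relaxation buys speed; that is exactly the trade-off being described.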
This is the long answer to your question.
M.A. It is, though, the real answer. Albert Einstein once said we should make things as simple as possible, but not simpler.
D.U. The same viewpoint applies to why I have been frustrated with most computer languages that use static type checking. Any use of something is in a context, and in any given context one only needs a subset of the invariants provided by that thing's type. But if you start making unique types for every invariant, you have a huge number of types, and that becomes unmanageable. So typically type systems have types that represent clusters of invariants. The problem you find is that you want to use something somewhere where the full types aren't compatible, but in that context it would work just fine. This is the common experience that Smalltalk, Lisp, Self, and maybe JavaScript programmers have when they try to use a language that enforces static checking.
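A small hypothetical C++ sketch of that experience: a nominally typed interface demands that a value satisfy the whole type, while a template, much like a dynamically typed message send, only requires the one operation that this particular context actually uses:

```cpp
#include <iostream>

// Nominal typing: to be used here, a class must implement the *entire*
// interface, even though this context only ever calls area().
struct Shape {
    virtual double area() const = 0;
    virtual double perimeter() const = 0;
    virtual ~Shape() = default;
};

// Structural, duck-typed style: the template only requires what this
// context actually uses, namely area().
template <typename T>
double report(const T& s) { return s.area(); }

struct Blob {
    double area() const { return 3.0; } // no perimeter(); cannot be a Shape
};

int main() {
    Blob b;
    std::cout << report(b) << "\n"; // fine: only area() is needed here
    // Shape* p = &b;               // would not compile: Blob is not a Shape
}
```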
M.A. In parallel distributed computing, using a product like Grid Engine, we got different results if the round-robin servers were processing some sequential jobs mixed with parallel jobs than if we dedicated the servers to parallel MPI processing exclusively. We didn't know why at the time. How can you explain this?
D.U. I think the problem you described is that you get different results if you configure the servers differently. One of the principles in our project is what we call "end-to-end non-determinism". The idea is that if you go to these parallel systems to get performance (and this holds for multi-core, many-core, and distributed systems), then you need to take an approximate route. This means you need a way for the client to specify to your program what level of approximation or accuracy is tolerable for a given output. For example, if you run a weather prediction system, a variance of 10% within 5 minutes in the probability of rain is perfectly acceptable. If you use the output to make an irreversible decision, then you may want tighter values. The trade-off is between accuracy and the time and cost of the computation. The idea is to allow our clients to make those trade-offs. There is no free lunch: to really exploit parallel computational systems, it is going to become much more valuable than before to be able to make these trade-offs, because the cost of synchronization and communication goes way up in proportion to the cost of computation.
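A minimal sketch of the idea, using a made-up C++ interface rather than anything from the Renaissance project: the client passes in the accuracy it is willing to pay for, and the computation stops as soon as that tolerance is met, so looser tolerance means less work:

```cpp
#include <cmath>
#include <iostream>
#include <random>

// Hypothetical illustration: estimate pi by random sampling, stopping as soon
// as the client's stated tolerance is met. The client, not the library,
// decides the accuracy/cost trade-off.
double estimate_pi(double tolerance, std::mt19937& rng) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    long long inside = 0, total = 0;
    double estimate = 0.0, error = 1.0;
    while (error > tolerance) {
        for (int i = 0; i < 10000; ++i, ++total) {
            double x = u(rng), y = u(rng);
            if (x * x + y * y <= 1.0) ++inside;
        }
        estimate = 4.0 * inside / total;
        error = 4.0 / std::sqrt(static_cast<double>(total)); // rough error bound
    }
    return estimate;
}

int main() {
    std::mt19937 rng(42);
    std::cout << estimate_pi(0.05,  rng) << "\n"; // cheap and rough
    std::cout << estimate_pi(0.005, rng) << "\n"; // roughly 100x more samples
}
```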
M.A. Can you describe something, specifically rather than generically, where accuracy is not important?
D.U. I think I just gave you one: weather forecasting. It depends what you mean by accuracy. For example, if I am making plans about the future course of my business, and I am rolling up every individual transaction for the last five years, and the answers are in the range of millions or billions of dollars, then accuracy to a penny is probably not important. In fact, the input data have only a certain accuracy anyway. And if these decisions are made for a period of a day, it is also OK if the numbers are off a little bit for a second.
M.A. NOAA wants to increase the warning period for a hurricane beyond the current 14 days. How can they use modern parallel programming with many cores to achieve this?
D.U. I don't know what their limitation is, whether it is the computation, or the model, or the fact that weather is inherently a chaotic system. I can't answer your question, but the example I would give is an anthill. If you are at a picnic, you will see the ants arrive. Ants are not performing a deterministic computation to discover your picnic. What happens is that there is always a relatively small number of ants randomly scouting, and when one ant finds a picnic it returns to the hill, laying down a scent trail; then other ants follow that scent trail, laying down more scent if they find food. So this is an example of an algorithm where the accuracy is plenty good enough. Such systems need a large number of ants distributed in space; randomness is an essential ingredient, as every ant behaves slightly differently from the others; then there is positive feedback (following the right scent trail) and negative feedback (so that not all ants scout in the same direction).
M.A. Nice metaphor, and actually a real answer. How can we simulate this on many cores?
D.U. We need a large number of cores before "anthill" algorithms begin to pay off. But I think that as long as we have these four essential ingredients, (1) scale, (2) randomness, (3) positive feedback, and (4) negative feedback, we will get some interesting results.
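A toy C++ sketch, hypothetical and far simpler than a real colony, of how those four ingredients combine: many ants, random scouting, positive feedback through a shared scent trail, and negative feedback through evaporation:

```cpp
#include <iostream>
#include <random>
#include <vector>

int main() {
    const int kCells = 100;                 // a 1-D world; the "picnic" is at cell 80
    const int kFood  = 80;
    const int kAnts  = 1000;                // ingredient 1: scale
    std::vector<double> scent(kCells, 0.0);
    std::mt19937 rng(7);
    std::uniform_int_distribution<int> any_cell(0, kCells - 1);
    std::uniform_real_distribution<double> coin(0.0, 1.0);

    for (int step = 0; step < 200; ++step) {
        double total = 0.0;
        for (double s : scent) total += s;

        for (int a = 0; a < kAnts; ++a) {
            int cell;
            // Ingredient 2: randomness -- some ants always scout at random.
            if (total > 0.0 && coin(rng) < 0.9) {
                // Ingredient 3: positive feedback -- follow scent with
                // probability proportional to its strength.
                double pick = coin(rng) * total, acc = 0.0;
                cell = kCells - 1;
                for (int c = 0; c < kCells; ++c) {
                    acc += scent[c];
                    if (acc >= pick) { cell = c; break; }
                }
            } else {
                cell = any_cell(rng);        // random scouting
            }
            if (cell == kFood) scent[cell] += 1.0; // lay scent where food is found
        }

        // Ingredient 4: negative feedback -- scent evaporates every step.
        for (double& s : scent) s *= 0.95;
    }

    std::cout << "scent at the food cell: " << scent[kFood] << "\n";
    std::cout << "scent at cell 0:        " << scent[0]     << "\n";
}
```

No ant knows where the picnic is, yet after a few hundred steps almost all of the scent sits on the food cell; the answer is approximate and non-deterministic, but plenty good enough.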
M.A. I have heard a lot about strong scaling and weak scaling. What is your view on those definitions? Are they relevant to future many-core computations?
D.U. My understanding of "strong scaling" is speeding up a fixed workload in proportion to the number of processors. "Weak scaling" refers to scaling the workload up and having it run just as efficiently as you increase the parallelism available.
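The two notions can be made concrete with the standard Amdahl and Gustafson formulas (a sketch, not something discussed in the interview): strong scaling of a fixed workload is limited by its serial fraction, while weak scaling grows the workload along with the processor count:

```cpp
#include <iostream>

// Strong scaling (Amdahl's law): fixed problem, more processors.
// Speedup is capped by the serial fraction s of the work.
double amdahl(double serial_fraction, int p) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p);
}

// Weak scaling (Gustafson's law): the problem grows with the processor
// count, so the parallel part stays busy and scaling is nearly linear.
double gustafson(double serial_fraction, int p) {
    return serial_fraction + (1.0 - serial_fraction) * p;
}

int main() {
    const double s = 0.05; // assume 5% of the work is serial (synchronization, etc.)
    const int procs[] = {10, 100, 1000};
    for (int p : procs) {
        std::cout << p << " cores: strong-scaling speedup "
                  << amdahl(s, p) << ", weak-scaling speedup "
                  << gustafson(s, p) << "\n";
    }
    // With 5% serial work, strong scaling can never exceed 20x,
    // no matter how many cores are added.
}
```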
What some people want is to run bigger problems, and this may be called "weak scaling", but I think real problems involve a combination of both "strong" and "weak" scaling. People want problems to run faster than before, and they want to run larger problems than ever before. The key is to take a highly parallel system and get that parallelism to pay off. This is something we have been struggling with in our field for many years. There are some special cases where folks have succeeded; SETI@home is a shining example. However, we want to penetrate the mainstream, for example for a large business to be able to perform large business-analytics computations in a timely fashion, and have the increase in parallelism of economically viable systems pay off.
M.A. You mentioned in your SPLASH presentation in November 2011 that hardware trends created the need for the new many-core ideas and research. What about the actual needs of real customers forming large market segments?
D.U. Did the iPad happen because of hardware developments
or because of a real need? How would you answer this question?
M.A. Steve Jobs inferred real needs from desires that customers could not articulate by themselves. The enormous revenues from iPad sales proved Mr. Jobs right. To me, this is a product-management success as well.
D.U. My question to you was getting at an interesting
linguistic assumption. When you say “real needs”, underneath
those words there is an assumption that the need is an objective thing
in reality that you can measure with an instrument. You are abstracting away
important aspects of context.
The way I would look at it is that as hardware changes, the economic trade-off changes. The hardware is able to do different things for different costs compared to before. For example, when IBM invented the hard disk drive, all of a sudden hardware became capable of storing larger amounts of data for a fraction of the price. What you call a real need, as I understand it, is an abstraction of purchasing decisions made by clients. The client's context changes, and part of that context is what the hardware can deliver at what cost. As the price-performance envelope of hardware changes by many orders of magnitude over the years, things that computers can perform come to have more value to the client than the cost they have to lay out. Demand for computation increases as the price per unit of computation decreases. Driving that decrease is ultimately what computer scientists and all computer companies try to do.
M.A. Is it better to use one thousand-core processor instead of ten one-hundred-core processors?
D.U. Why do you care? What you should have asked me is: what is the cost, and how long will it take, to solve the problem using a single thousand-core processor versus ten one-hundred-core processors? After such a calculation, I am sure you will find that there exist problems out there where the thousand-core processor will be the best solution.
M.A. According to the Golden Circle idea, answering the question of "why" well, as opposed to "what" and "how", assures the success of ideas and products. Can you comment?
D.U. I don't buy the premise in the way you presented it. I think there are some things where I don't really care about the why, and there might be other things where I do. Take, for example, the electricity delivered to my house. I care about the voltage consistency and the frequency; pretty simply, it all boils down to a few plain parameters.
At IBM, Sam Palmisano, IBM's Chairman, defined the three core values for IBM: (1) dedication to every client's success; (2) innovation that matters, for our company and for the world; (3) trust and personal responsibility in all relationships. I really like these values; they resonate with me. The things I do, and that other people do at IBM, should be judged according to those values. Even companies such as Apple, where Jobs said there are no focus groups, would not succeed if people did not judge them as providing products that have value.
I really believe that harnessing parallelism will let us deliver value. I think we will be able to perform jobs for our clients that need to be performed, giving them the differentiators that they care about.
M.A. I read last week an interview with Intel's Brian David Johnson in Scientific American. Brian is a futurist and technology planner. He is paid to craft visions of "both Intel's prospective technologies and what's coming for the entire computing industry until the year 2020." How does it work at IBM?
D.U. I am one of the many, many researchers that IBM has. Not every single researcher is going to be correct in his or her bet on what the future holds. But enough IBM researchers will be correct enough of the time, and the impact of their discoveries or inventions will be great enough, that as a whole it pays off. We all pursue what we think is right. The diversity of minds, backgrounds, and thoughts will ensure that we don't miss any key future developments. I think there is a very great chance that our work will pay off handsomely.
It has been widely observed that cycle times have not shrunk very much. There could be some disruptive innovation at the circuit or technology level (IBM has folks working on that) that will change this, but for now this is our assumption, and it has held true for a few years.
The next bet we make is that special relativity will continue to hold. No one will discover a way to communicate between two cores faster than the speed of light. That seems like a sound assumption.
The third bet we made is that feature sizes will continue to scale downwards, meaning people will be able to pack more devices, economically, onto a single die. The original direction of the Renaissance project was looking at 1,000 cores per processor, but we are relaxing that assumption as we go along. Perhaps systems will have many dies, each containing tens or hundreds of cores. Either way, the more numerous the cores, the higher the communication cost, and the stronger the case for programming paradigms that require less communication.
We also see that it is very difficult to program parallel systems in an exact way, although one can improve with experience. And we see that many systems in nature that are massively parallel function very well. Because of all these assumptions, the likelihood of a payoff is strong. We will find ways to program systems that use a lot of local interactions and lots of feedback loops instead of global communication and global synchronization. Even the use of local synchronization could be too expensive.
M.A. Is there anyone at IBM who performs the function of futurologist, similar to Brian David Johnson at Intel?
D.U. As I mentioned before, I am one of 3,000 researchers across IBM's nine global labs, and we are all focused on various strategic bets that may benefit the company 5, 10, 15, or 30 years from now. John Kelly is the Senior VP and Director of Research at IBM. He directs the research effort and helps guide IBM's overall technical strategy. [Miha's note: IBM has an income of $1B per year from IP alone.]
Within IBM, the folks on the project (Sam Adams, Doug Kimelman, Mark Wegman, and I) would like to make a persuasive case that convinces people to incorporate these ideas into products. We want to see those products be wildly more successful than before.
Outside of IBM, all of us on the project (including Prof. Theo D'Hondt, Stefan Marr, and Mattias De Wael at Vrije Universiteit in Brussels, and Prof. Andrew Black and Max Orhai at Portland State University) would like to encourage the formation of a new field or sub-discipline in computer science: the exploration of probabilistic data structures that sacrifice synchronization but achieve better performance. We are organizing a workshop on this subject at the forthcoming ACM SPLASH 2012 conference. We are hoping to bring together researchers from all over the world. It is clear to me, and if I am right it will become clear to lots of people, that this is an important new direction for computer science.
M.A. Do you have a description of this new computer science field, or sub-field?
D.U. Let me quote from the workshop proposal, written by Andrew Black, Theo D’Hondt, Doug Kimelman, Martin Rinard, and myself:
“Massively-parallel systems are coming: core counts keep rising – whether for conventional cores as in multicore and manycore systems, or specialized cores as in GPUs. Conventional wisdom has been to utilize this parallelism by reducing synchronization to the minimum required to preserve determinism—in particular, by eliminating data races. However, Amdahl’s law implies that on highly-parallel systems even a small amount of synchronization introduces serialization that limits scaling. Thus, the conventional wisdom is doomed to fail as it hits “the CAS ceiling.”
A new school of thought is arising: one that accepts and even embraces non determinism (including data races), and is in return able to dramatically reduce synchronization, or even eliminate it completely. However, this approach requires that we leave the realm of the certain and enter the realm of the merely probable.
How can we cast aside the certainty of truth, the security of correctness, the logic of a proof, and adopt a new way of thinking, where answers are good enough but not certain, and where many processors work together in parallel without quite knowing the states that the others are in? We may need some amount of synchronization, but how much? Or better yet, how little? What mental tools and linguistic devices can we give programmers to help them adapt to this challenge?“
M.A. Thank you very much. Is there anything you want to add that I missed asking?
[Photo: Dr. David Ungar, IBM Research]