Friday, December 28, 2012

How can I learn to be a good Product Manager?

This is my answer to a question from Quora.
I am surprised no one suggested the Haas Business School Product Management program, a five-day executive program. I am an alumnus; see the group on LinkedIn. The PM skill cannot be reduced to plain-vanilla bullet points (be this, be that) or vague fortune-cookie advice (learn to balance).
A good product manager creates desire (see Michelle Ahronovitz on Nir Eyal's "Creating Desire"), learns the user experience (as in Tristan Kromer's Customer Development and User Experience), and pays attention to what is invisible to the naked eye.
That's why developing products in corporations can be mildly to very frustrating. Just attending UC Berkeley brings a breath of fresh air, one I did not breathe at Sun Microsystems before its fall.
A Product Manager is in essence a proto-CEO of a startup company. You should go to Berkeley, even for 5 days, to see what I mean.

Wednesday, December 19, 2012

An interview with Miron Livny: Bosco, HTCondor and more

Miron Livny is a professor of Computer Science. He leads the Center for High Throughput Computing (CHTC) at the University of Wisconsin-Madison and serves as the technical director of the Open Science Grid (OSG).

As described on the HTCondor web site:
...most scientists are concerned with how many floating point operations per month or per year they can extract from their computing environment rather than the number of such operations the environment can provide them per second or minute. Floating point operations per second (FLOPS) has been the yardstick used by most High Performance Computing (HPC) efforts to evaluate their systems. Little attention has been devoted by the computing community to environments that can deliver large amounts of processing capacity over long periods of time. We refer to such environments as High Throughput Computing (HTC) environments.
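To make the distinction concrete, here is a back-of-the-envelope calculation (my own illustration, not from the HTCondor site): a machine that sustains a modest 1 GFLOPS around the clock still delivers an enormous number of operations per month.

```python
# Back-of-the-envelope: sustained speed vs. delivered monthly throughput
flops_sustained = 1e9                    # assume 1 GFLOPS sustained, an invented figure
seconds_per_month = 60 * 60 * 24 * 30    # roughly 2.6 million seconds
ops_per_month = flops_sustained * seconds_per_month
print(f"{ops_per_month:.2e} floating point operations per month")   # ~2.59e+15
```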
In essence, Miron's research is driven by the following challenge:
How can we accommodate an unbounded need for computing and an unbounded amount of data with an unbounded amount of resources?
This is a daring thought, and it inspired a fascinating conversation, triggered by the upcoming release of the Bosco tool by the Open Science Grid (OSG).

Professor Miron Livny
Miha: You first brought up HTC in a 1997 interview with HPCwire. When did the idea start?

Miron: My PhD at the Weizmann Institute of Science in Israel (1978) dealt with load balancing in distributed systems, a new field at the time. I looked into the undesired state of such a system: one of the many customers (users) is waiting for a resource, while one or more of the resources capable and willing to serve this customer sit idle, yet they cannot serve this user. Distributed computing, scheduling, and matchmaking absorbed me. I have been doing the same thing now for over thirty years.

Miha:  And then?

Miron: Then in 1983 I came to the University of Wisconsin. While at the Weizmann Institute, all the research was simulation work. I really started developing the technology when I came to the US. Here we developed the concept of separate, individual resource ownership. This was key. In my doctoral thesis, the system owned everything.

Miha: At Supercomputing events the popular perception is that TOP500 machines designed to maximize FLOPS are the most desirable to have. What is the best platform to maximize HTC?

Miron: In principle, I can run a lot of independent tasks on an IBM Blue Gene, for example. It may not be as easy, because Blue Gene does not have a full communication stack and people are running MPI. In fact, we work with IBM to add HTC capabilities to their machines. But is this platform the most cost-effective way of doing it?

A successful company such as Google realized that following the distributed computing principles that underpin HTC is the way to go. They decided to design a system where each component is assumed to be unreliable, and it is a miracle if it actually does what it is intended to do. From the beginning they envisaged a system built on unreliable components: work is handed out to every node, and the results eventually come back even if some of the nodes fail. In HPC, because of the MPI model, everything has to work; otherwise the whole thing falls apart.
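As a toy illustration of that principle (my own sketch, not Google's or HTCondor's code): hand work out to unreliable workers and simply requeue whatever fails, and the overall computation still completes.

```python
import random

def run_on_node(task):
    """Stand-in for sending a task to a worker node; here 20% of attempts fail."""
    return None if random.random() < 0.2 else task * task

def farm(tasks):
    """Keep handing out work until every task has a result, requeuing failures."""
    results, pending = {}, list(tasks)
    while pending:
        task = pending.pop()
        result = run_on_node(task)
        if result is None:
            pending.append(task)      # the node failed: give the task to another attempt
        else:
            results[task] = result
    return results

print(farm(range(10)))
```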

Miha: Did Google adopt the HTCondor open source code?

Miron: While they refer to our system in their publications as an example for the resource manager they developed, no, they did not use our software. One of the reasons for this, I believe, is that HTCondor does more than Google needs in their environment. They wanted something minimal, attuned to their needs and not carrying the extra luggage.

Miha: Unofficial studies estimate that last year Google Compute Engine had 800,000 cores. Today this number may be at least several million cores. It appears the decision to embrace these HTC principles paid off for them.

Miron: Google wrote everything in house. I visited them a few times to share my experiences, and some of my former students joined them. If you look at chapter 13 of the Grid book edited by Ian Foster et al., Rajesh Raman - who joined Google - and I described the principles of HTC, not the HTCondor technology.

We are computer scientists. We create computer science principles, not just develop a software package. Anyone who wants to use those principles is encouraged to create software tailored to their needs, if doing so is cost effective for them. It goes without saying that for Google it was!

Miha:  Do you have multiple adaptations of HTCondor for different users?

Miron: We are funded by the National Science Foundation (NSF) and the Department of Energy (DOE) to support scientific HTC. It is therefore natural that our work is driven by the needs of the main HTC science communities and motivated by emerging science disciplines that embrace HTC.

Miha: What about industry?

Miron: While we have a lot of users in industry, we are not developing HTCondor for their needs. HTCondor enables science, and one major discovery can have a long-term impact on the future of HTC.

Miha: One of your collaborators, Cycle Computing, has a powerful Amazon Web Services cluster, priced on demand per hour, that uses HTCondor. Are there other examples?

Miron: Red Hat offers support services for MRG, which uses HTCondor. For example, all of DreamWorks Animation's rendering farms are managed by HTCondor; it is a Red Hat account. Fedora and Debian also offer HTCondor.

An intuitive diagram of Bosco
Miha: Let's talk Bosco. Who had the idea to develop it separately from HTCondor?

Miron: The idea developed naturally. You just have to look at the HTCondor architecture under the hood. The philosophy was always "Submit locally and run globally" - namely, submit on your workstation and run on any workstation in the organization. Now, why don't we go to the researchers and say: "OK, we'll give you just this one piece of software, called Bosco, that runs on your side. You will learn and install Bosco in a few hours, and it will enable you to submit locally and run on machines all over the world!"

You can view Bosco as "My Personal High Throughput Manager".

Miha: This is the empathy Bosco seeks. Make the researchers happy. Offer them a simple interface to create a galaxy of HTC resources.

Miron: It is not just a job submission interface. It is a High-End HTC Manager.

I can share with you that I have a graduate student who is building a MATLAB-based piece of software that we believe will, in its final incarnation, manage more than 1 million jobs. But the user of this tool - yet unnamed - does not think in terms of jobs; he thinks about the points in the calculation of a multi-dimensional function. Each of the points is calculated by a separate HTC job, originated via MATLAB, but behind the scenes it is Bosco that does the heavy lifting. This combination of MATLAB with Bosco enables a researcher with very basic computing skills to invoke a computation where every point is an independent job running on different machines in different parts of the world.
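A rough sketch of the idea (my own illustration; the function, parameter names, and file format are invented, not the student's actual tool): each point of a multi-dimensional parameter grid becomes one independent job description that a manager such as Bosco can then run wherever resources are available.

```python
import itertools

# Invented parameter grid for a multi-dimensional function f(alpha, beta, gamma)
alphas = [0.1 * i for i in range(10)]
betas  = [0.5 * j for j in range(10)]
gammas = list(range(10))

# One independent job per grid point: 10 x 10 x 10 = 1,000 jobs in this toy example
with open("points.args", "w") as out:
    for point_id, (a, b, g) in enumerate(itertools.product(alphas, betas, gammas)):
        # a job manager would run something like "evaluate_f <alpha> <beta> <gamma>" per line
        out.write(f"{point_id} {a} {b} {g}\n")

print("wrote", len(alphas) * len(betas) * len(gammas), "independent job descriptions")
```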

Miha: If this works, it will be mind-boggling.

Miron: Bosco can help us by spreading the practical incarnation of the "Submit Locally, Run Globally" concept in HTC. If you submit, say, to SGE (Sun Grid Engine, in one of its many flavors), PBS, or LSF clusters, you can get in, but you cannot get out to another cluster. You are stuck with SGE or PBS or LSF. When you submit to Bosco, you can go everywhere. And that's the concept: Bosco helps the researcher build bridges to different on- and off-campus resources and thus improve throughput. Regardless of the software that manages these remote resources, Bosco should get your jobs there and bring back the results!

Miha: How did a scientist submit and manage jobs on different clusters before Bosco?

Miron: If you want to run two jobs, don't use Bosco, because you don't have a high throughput computing problem. But if you have 1,000 or 100,000 jobs, then you need a tool that will help you manage the entire high throughput computing activity. Without Bosco, when you want to use two clusters, you "ssh" to one cluster and submit jobs, and then "ssh" to the second one and submit some jobs. This is a manual process that is not effective and does not scale. With Bosco there is only one submission for all the jobs.

Miha: What exactly does this mean?

Miron: All the researcher sees is one Bosco interface where he writes his submission files. Bosco automatically sends the jobs to the remote clusters via SSH. It can even load-balance the workload among the remote clusters. It takes care of everything needed to work seamlessly on those remote clusters (passwords, port numbers). The scientist does not need to know the name of the remote machine or its exact location. The researchers - the users - do not have to install ANYTHING on the remote machines. There is no need to talk to the remote system administrator or ask anyone for any "special" treatment on those remote sites.

Many cluster administrators put a limit on how many jobs you can have queued. In such a case, you must submit in batches of the maximum allowed number of jobs - say 500, which for 100,000 jobs means 200 manual submissions. Bosco does this automatically; we call it throttling.
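To give a feel for what "one submission for all the jobs" means, here is a hedged sketch of an HTCondor-style submit description as Bosco might handle it (the cluster address, executable, and job count are placeholders of my own; the remote cluster is assumed to have been registered with Bosco beforehand, e.g. via its bosco_cluster command):

```
# Hypothetical Bosco/HTCondor submit description: one file queues 100,000 jobs
universe      = grid
grid_resource = batch pbs researcher@cluster.example.edu   # a remote PBS cluster added to Bosco
executable    = simulate
arguments     = $(Process)
output        = out.$(Process)
error         = err.$(Process)
log           = simulate.log
queue 100000
```

Bosco would then take care of shipping the jobs over SSH and throttling them to whatever the remote queue allows.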

System administration tasks for a Bosco  researcher
managing multiple clusters from all over the world

Miha: David Ungar, in an interview about how to program many-core processors, said that the precision of a calculation depends on how much money one needs to spend. How does it work in HTC?

Miron:  For many HTC applications throughput is linked to  accuracy / understanding / knowledge. Namely, if we can run more jobs (simulations / searches) we can establish more confidence in the results or explore more options / parameters / cases that will give us more information about the problem.

Miha: Is it possible to predict (one day) the minimum number of resources needed to deliver results with accuracy y within, say, x days?

Miron: Having clouds helps us with a cost model for the computing part of this challenge, something we did not have in the past. Today, if you give me a function that relates product quality to CPU hours, I can translate it into dollars using clouds. It is not a trivial function to create. Time to market can be a factor, too.
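As a toy sketch of that translation (both the quality-to-CPU-hours curve and the hourly price below are invented numbers, not anything Miron provided):

```python
def cpu_hours_for_quality(q, base_hours=1000.0):
    """Invented curve: required CPU hours grow without bound as target quality q -> 1."""
    return base_hours / (1.0 - q)

PRICE_PER_CORE_HOUR = 0.05   # assumed cloud price, dollars per core-hour

for q in (0.90, 0.95, 0.99):
    hours = cpu_hours_for_quality(q)
    print(f"quality {q:.2f}: {hours:,.0f} CPU hours ~ ${hours * PRICE_PER_CORE_HOUR:,.2f}")
```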

Miha: We defined Bosco in many ways, but we need a single definition of Bosco.

Miron: I totally agree.

Miha:   Can you help us spell this message?

Miron: I want the message to come from the Bosco team, and you are part of it. You are new, while I have been doing the same thing for thirty years.

Notes:

(1) Bosco is another capability provided by the Open Science Grid for high-end, high-throughput supercomputing jobs running anywhere in the world and managed from local campuses.

(2) Bosco beta v1.1 is available for download now

Monday, December 17, 2012

Why place a free open source product on very expensive infrastructure?

The supreme test for grid or cloud software is to have it run on Amazon Web Services (AWS), particularly for High Performance and High Throughput computing.

AWS has become the status symbol, and a very expensive one.

Cycle Computing


In April 2012, Cycle Computing's utility supercomputing offer - based on the open source HTCondor - delivered "a 50,000-core utility supercomputer in the Amazon Web Services (AWS) cloud for Schrödinger and Nimbus Discovery as customers." HPC in the Cloud reported that Cycle used 51,132 cores from 6,742 Amazon EC2 instances with 59 TB of memory.

Everybody said "Wow," but the service cost from Amazon alone was nearly five grand per hour ($4,828.85 per hour, to be precise). Amazon cleaned the table and took the bulk of the profits here.

In September 2012 the cost went down. The Cycle blog post "New CycleCloud HPC Cluster Is a Triple Threat: 30000 cores, $1279/Hour" mentions the lowered Amazon toll right in its title.

$1,300 per hour is still a considerable price to pay, fattening AWS' revenues.
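For a rough back-of-the-envelope comparison using the two figures quoted above (and assuming the hourly prices cover the full core counts):

```python
# Rough per-core-hour cost of the two Cycle Computing runs mentioned above
runs = {
    "April 2012 (51,132 cores)":     (4828.85, 51132),
    "September 2012 (30,000 cores)": (1279.00, 30000),
}
for name, (dollars_per_hour, cores) in runs.items():
    print(f"{name}: ${dollars_per_hour / cores:.4f} per core-hour")
# prints roughly $0.0944 and $0.0426 per core-hour, respectively
```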

As long as one pays, any HTCondor user can submit a job to AWS/EC2 by reading section 5.3.7 of the HTCondor manual.
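That section covers HTCondor's EC2 grid type; a minimal sketch of such a submit description might look like the following (the AMI, key file paths, and instance type are placeholders of my own, not values from the manual):

```
# Sketch of an HTCondor EC2 grid-universe submit description (placeholder values)
universe              = grid
grid_resource         = ec2 https://ec2.amazonaws.com/
executable            = my_ec2_job             # serves as a label for the instance
ec2_access_key_id     = /home/user/aws_access_key_file
ec2_secret_access_key = /home/user/aws_secret_key_file
ec2_ami_id            = ami-00000000            # placeholder AMI
ec2_instance_type     = m1.small
log                   = ec2_job.log
queue
```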

Note added December 18, 2012: The award-winning human stem cell research of Victor Ruotti from the Morgridge Institute has a reported cost of only $120 per hour. Why? According to the Cycle Computing blog, Victor used HTCondor - totally free and open source - not CycleCloud with Grid Engine, Torque, or other commercial grid software, which usually has a $99-per-core list-price annual subscription. Also, Cycle used Opscode Chef to configure the nodes' software, which according to its web site has a cost "from $120".

We have a Wild West world when we talk about costs in utility supercomputing. From nearly $5,000 to $120 per hour in six months? What is the real story? No one knows what we end up paying, unless Cycle donates ten grand, AWS donates $9,500, we use the free HTCondor, and voilà, we have $120 per hour.

AWS also has a hazy reputation for competing with the products and services of its own customers. It did this with many of its clients offering Hadoop, storage, and database implementations. The nicest story is how it competes with Netflix, its best customer, which developed its movie streaming technology on EC2.

Open Grid Scheduler/Grid Engine


For many, AWS has a reputation of offering, for all practical purposes, infinite resources. In the realm of High Throughput Computing there is no such thing as infinite resources.

Rayson Ho, the star open source developer of the Open Grid Scheduler / Grid Engine, reported in a November 21, 2012 blog post that they built an AWS EC2 cluster of 10,000 nodes, with instance sizes from t1.micro to c1.xlarge - 1 to 8 cores per node. They used a lot of spot instances, so the cost was not that high, as they paid the EC2 bill themselves.

They did not go beyond 10,000 nodes, for now, because spot instances - the only ones Rayson's team could afford - were very hard to get:
  • We kept sending spot requests to the us-east-1 region until "capacity-not-able" was returned to us. 
  • At peak rate, we were able to provision over 2,000 nodes in less than 30 minutes. In total, we spent less than 6 hours constructing, debugging the issue caused by the EBS volume limit, running a small number of Grid Engine tests, and taking down the cluster.
  • Instance boot time was independent of the instance type: EBS-backed c1.xlarge and t1.micro took roughly the same amount of time to boot.
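As a rough sketch of that provisioning loop (not Rayson's actual scripts; the region comes from his notes above, while the AMI, bid price, batch size, and instance type are placeholders), using the boto library that was current at the time:

```python
import boto.ec2

REGION, AMI, BID, BATCH = "us-east-1", "ami-00000000", "0.07", 100   # placeholder values

conn = boto.ec2.connect_to_region(REGION)
requests = conn.request_spot_instances(
    price=BID, image_id=AMI, count=BATCH, instance_type="c1.xlarge")

# Poll the requests; keep issuing new batches until EC2 reports no more spot capacity
ids = [r.id for r in requests]
fulfilled = [r for r in conn.get_all_spot_instance_requests(ids) if r.state == "active"]
print(len(fulfilled), "of", BATCH, "spot requests fulfilled so far")
```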

My Crystal Ball


Looking into my crystal ball, the pioneering work of Cycle and Open Grid Scheduler / Grid Engine provides great stepping stones that will deliver bigger and bigger clusters inside Amazon. But something bothers me:

  • there is a limit on how big these clusters can grow
  • there is a limit to how many resources AWS/EC2 can offer; for High Throughput Computing, AWS alone is not sufficient
  • I do not know how many will put up with AWS prices, with the risk of AWS competing with its own customers, and with the risk of proven, recurring outages
  • why place a free open source product on very expensive infrastructure and feed money to third parties, when the developers themselves don't make a cent?

I am astonished that no one I know has asked this question yet.


A tool like Bosco?


Imagine a tool that, with one single script, submits jobs both to the HTCondor-based Cycle cluster and to Open Grid Scheduler. This tool is called Bosco. As a leading scientist described this project:

Bosco can help us by spreading the practical incarnation of the "Submit Locally, Run Globally" concept in High Throughput Computing (HTC). If you submit, say, to SGE (Sun Grid Engine, in one of its many flavors), PBS, or LSF clusters, you can get in, but you cannot get out to another cluster. You are stuck with SGE or PBS or LSF. When you submit to Bosco, you can go out everywhere. And that's the concept: Bosco helps science a lot, because High End Science is about HTC.
Assuming AWS offered the resources for free, the tool could work right away. If you don't care what you pay, and you can receive the AWS invoice without having a heart attack :), sure, use Bosco. In real life, we need accounting and cost forecasting capabilities. AWS has created an entire cottage industry trying to do just that.

But for the future, the concept of adding clusters as easily as adding nodes to a cloud is the winning proposition.

Clouds are too small. We need to enable super clouds, where each node is a cluster - call it cloud-as-a-node to sound cool.

Notes:

(1)  Bosco beta v1.1 is available for download now. Try it!
(2) I am part of the tiny Bosco team, but the opinions expressed in this blog are entirely mine.


Saturday, December 15, 2012

Isaac Asimov's amazing foresight

This video attracted many people because they believe Asimov predicted the existence of Quora as a necessity. In reality, he predicted much more, so I place it here for our readers.

He said, more than two decades ago - before Yahoo, Google, etc. - that we would have the Internet and would educate anyone, including slower learners or older users, at their own pace. We would be able to ask questions and get answers on the computers all of us would own in our homes, the same way we own automobiles and air conditioning units.

Indirectly, he predicted the employment of highly functional autistic people.



See the classic question they ask Asimov: "What if computers dehumanize the human mind?" OMG, haven't I heard this before? Mr. Asimov looks modern, and he will remain so for many centuries. Twenty-five years ago Isaac Asimov was perceived as eccentric; his interviewer represented common sense.

Mr. Asimov owned a Radio Shack TRS-80 computer, which sold for $3,450 in 1980. He said:

"I do not fear computers. I fear the lack of them." 

Thursday, December 06, 2012

The Übercloud Experiment. An Interview

The Übercloud Experiment is the brainchild of Wolfgang Gentzsch and Burak Yenier.

Wolfgang Gentzsch is mostly based in Germany - he travels a lot. Among his many achievements, Wolfgang is the Chairman of the ISC Cloud Conference for HPC and Big Data in the Cloud and the main founder of Gridware Inc., acquired by Sun Microsystems in 2000. He is probably the Mr. HPC-Cloud worldwide. He lists - with an ingenious sense of humor - the 2010 Dilbert Award on his LinkedIn page. Just click the link.

Burak Yenier is a hands-on expert in what we today call an Internet-based cloud; before, it was called Application Service Provider (ASP) or e-commerce. Based in Silicon Valley for the last 13 years, Burak has worked in various start-ups, and he is now the VP of Operations (my translation: the guy who gets the job done) of a large San Francisco financial services company.

We talked over Skype, hours before the rains in Northern California brought a 30-hour power outage to my office. I call this luck.

Miha: How did you two meet?

Burak: I was puzzled about why cloud computing was not up and running or easily accessible in High Performance Computing (HPC). Wolfgang, whom I had not met before, seemed to be the one who might have the answer. I sent him an InMail via LinkedIn. Wolfgang responded immediately and said: "I will give you 1.5 hours right now. Let's talk!"

Wolfgang: Burak had some deep questions around "why is cloud adoption in HPC so slow?" I had no answers for some of his questions. I had no practical experience from solid case studies, because there aren't many. So we started an intense four-week discussion, trying to find proven facts we could use. Some answers we came across were from real experts who had proof to back up their statements, but many others believed they had an answer when all they had were suppositions. I traveled to the US to see Burak, and we spent a few days together in the same room, brainstorming. In the end, we decided to find out ourselves. We needed proven answers and came up with the idea for a real community project.

Miha: I like its name, The Übercloud Experiment, and its umlaut. Umlauts bring good fortune. You just completed Phase 1. How did it go?

HPC and Clouds. Courtesy: http://www.hpcadvisorycouncil.com/advanced_hpc_cloud.php
Burak: Our main objective in Phase 1 was to discover whether remote HPC cloud access could work given the right circumstances. As the word "cloud" has so many meanings, in The Übercloud Experiment it refers to remote access to one or many HPC clusters. We answered this conundrum at the end of Phase 1, and we can say: yes, it works. And it works not only for academic but also for commercial and industrial environments. We focused on the end-user requirements more than on performance measurements. We expected about ten participants from the Silicon Valley area, but we ended with one hundred and sixty participants from twenty countries. We expected five projects, and we ended with more than 20 projects. We concluded that remote HPC cloud access is not simple and not mature, but it certainly works and has tremendous up-scaling potential. The sheer enthusiasm of the participants does not cease to amaze us.

Miha: Awesome. What about Phase 2?

Wolfgang: In Phase 1, we assigned an entire team of experts to each end user. This was not simple, but it worked. We recruited IaaS experts, ISV experts, HPC experts, and so on as volunteers. This took a lot of time. But now we have a good idea of what works and what isn't working. Phase 2 started officially at SC'12 in Salt Lake City in November, but started de facto in December 2012. We already count two hundred and sixty participating organizations, an outstanding expansion of the community we created in Phase 1. Now we are trying to remove most, if not all, of the roadblocks that prevent the HPC cloud from being easy to use.

Miha: How does The Übercloud Experiment manage 260 or more participants?

Wolfgang: A challenge! But using what we learned in Phase 1, we implemented an easy-to-use and elegant online project management tool, Basecamp. Our communications are now efficient and scale up.

Miha: What motivates the participants? What do they hope will happen at the end of Phase 2?

Wolfgang: Every participant has future benefits in mind, depending on the role they play as a team member. The end users want to access more resources on demand at attractive pricing. The ISVs in crowded market segments, like Computer Aided Engineering (CAE), can devise more utility services and increase their revenues. The HPC consultants will recruit more customers for HPC cloud implementations.

Burak: We want to keep the experiment going. As long as our community is willing to keep this effort up, we continue. Later the community will look for commercial benefits, but nothing has been decided so far on how to explore this in our Übercloud Experiment.

Miha: Do you mean monetization is not part of The Übercloud Experiment?

Burak: We know that monetary exchanges have existed in remote HPC services for some time, from before we started our experiment. But we do not have the patterns, the user success stories that scale up, to make those business operations mainstream. That is what we have to figure out, and then the economic model will follow. This is a question of time rather than a question of how. We observe, report, and disseminate our discoveries to accelerate this process.

Miha: Do you  plan to implement some kind of certifications?

Burak: Not for now. The community is not mature enough. But each project implies a concept of informal certification, as users, providers, and consultants are testing each other out.

Miha: Did any third parties (investors, large companies, analyst groups) express interest in The Übercloud Experiment?

Wolfgang: No, but this may happen in the future. Burak and I are both pleasantly surprised at the success of The Übercloud Experiment. We asked ourselves: why? One reason, we think, is that the timing is perfect. The HPC community recognizes the progress of the cloud business in mainstream enterprises and is anxious to emulate it in HPC. Another reason is that the participants invest their own efforts and the resources they already have, and the project and each participant do not generate any measurable profits and are not penalized for failure. We hope that, in time, the collective intellectual knowledge of The Übercloud Experiment teams may become valuable equity.

Miha: Do you have the report for Phase 1 ready?

Wolfgang: Yes, the Executive Summary is already on our website. See also the HPCwire article "HPC as a Service: Lessons Learned".

Miha: Do you plan to offer this report for free?

Wolfgang: The detailed 75-page report will be available only to our participants.

Burak: We invite anyone interested in our work to join The Übercloud Experiment; this way, they will have access to all the reports and acquire first-hand experience.

Miha: How important is The Übercloud Experiment in your professional life?

Burak: We do this in our spare time, and we shift the work across time zones - Wolfgang in Germany and I in California. We have no conflicts with our day jobs. We put in money from our own pockets. We are having a lot of fun. The enthusiasm of our participants keeps us motivated. The best part of this project is the engagement of the HPC people. They come from research and industry; they are very clear thinkers and easy to work with. The challenge was that we were drinking water from a fire hose. We were expecting 10 participants, and now we have 260. It looked like a mess, but now we have an organized mess.

Wolfgang: For me, this is definitely the most exciting project at the moment. I have other interesting projects, yes, but The Übercloud Experiment stands out as a different, breakthrough kind. It has a new technology, a new idea, and we're talking to 260 people in 25 countries. The Übercloud Experiment is now also conquering the Life Sciences community. It nicely complements my work as the chairman of the ISC Cloud Conference.
Burak Yenier and Wolfgang Gentzsch
