Why place a free open source product on very expensive infrastructure?

The supreme test for grid or cloud software is to have it run on Amazon Web Services (AWS), particularly for High Performance and High Throughput computing.

AWS has become the status symbol, and a very expensive one.

Cycle Computing


In April 2012 Cycle Computing announced its Utility Supercomputing offering, based on the open source HTCondor: "a 50,000-core utility supercomputer in the Amazon Web Services (AWS) cloud" for Schrödinger and Nimbus Discovery as customers. HPC in the Cloud reported that Cycle used 51,132 cores from 6,742 Amazon EC2 instances with 59 TB of memory.

Everybody said "Wow," but the service cost from Amazon alone was nearly five grand per hour ($4,828.85 per hour, to be precise). Amazon cleaned the table and took the bulk of the profits here.

In September 2012 the cost went down. The Cycle blog post New CycleCloud HPC Cluster Is a Triple Threat: 30000 cores, $1279/Hour mentions the lowered Amazon toll right in the title.

$1,300 per hour is still a considerable price to pay, fattening AWS' revenues.

As long as one pays, any HTCondor user can submit jobs to AWS/EC2 by following section 5.3.7 of the HTCondor manual.
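For reference, a minimal submit description along the lines of that manual section might look like the sketch below; the service URL, key file paths, AMI and instance type are placeholders, not values from any of the runs discussed here.

    # HTCondor EC2 grid-universe job: a sketch of the manual's EC2 grid type.
    universe              = grid
    grid_resource         = ec2 https://ec2.us-east-1.amazonaws.com/
    executable            = my_ec2_instance      # used as a label; nothing runs locally
    ec2_access_key_id     = /home/me/aws_access_key_file
    ec2_secret_access_key = /home/me/aws_secret_key_file
    ec2_ami_id            = ami-12345678
    ec2_instance_type     = m1.small
    log                   = ec2_job.log
    queue

One condor_submit of this file starts, and bills, one EC2 instance on your own AWS account.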

Note added December 18, 2012: The award-winning human stem cell research of Victor Ruotti from the Morgridge Institute has a reported cost of only $120 per hour. Why? According to the Cycle Computing blog, Victor used HTCondor, which is totally free and open source, not CycleCloud with Grid Engine, Torque, or other commercial grid software that usually carries a list price of about $99 per core for an annual subscription. Also, Cycle used Opscode Chef to configure the nodes' software, which according to its web site is priced "from $120".

We have a Wild West world when we talk about costs in Utility Supercomputing. From $5,000 to $120 per hour in six months? What is the real story? No one knows what we end up paying, unless Cycle donates ten grand, AWS donates $9,500, we use the free HTCondor, and voila, we have $120 per hour.

AWS also has a hazy reputation of competing with the products and services of its own customers. It did this with many of its clients offering Hadoop, storage, and database implementations. The nicest story is how it competes with Netflix, its best customer, which developed its movie streaming technology on EC2.

Open Grid Scheduler/Grid Engine


For many, AWS has a reputation of offering, for all practical purposes, infinite resources. In the realm of High Throughput Computing there is no such thing as infinite resources.

Rayson Ho, the star developer of the open source Open Grid Scheduler / Grid Engine, reported in a November 21, 2012 blog post that they built an AWS EC2 cluster of 10,000 nodes, with instance sizes from t1.micro to c1.xlarge, that is, 1 to 8 cores per node. They used a lot of spot instances, so the cost was not that high, which mattered because they paid the EC2 bill themselves.

They did not go beyond 10,000 nodes, for now, because spot instances, the only ones Rayson's team could afford, were very hard to get (see the sketch after this list):
  • We kept sending spot requests to the us-east-1 region until "capacity-not-able" was returned to us. 
  • At peak rate, we were able to provision over 2,000 nodes in less than 30 minutes. In total, we spent less than 6 hours constructing, debugging the issue caused by the EBS volume limit, running a small number of Grid Engine tests, and taking down the cluster.
  • Instance boot time was independent of the instance type: EBS-backed c1.xlarge and t1.micro took roughly the same amount of time to boot.
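To make that provisioning loop concrete, here is a rough sketch of "keep asking until EC2 says no". It uses today's AWS CLI purely for illustration (the 2012 run predates the CLI and used its own tooling), and the batch size, bid price and AMI are made up:

    #!/bin/bash
    # Keep requesting spot capacity in us-east-1 until EC2 reports it has none left.
    REGION=us-east-1
    BATCH=100      # hypothetical instances per request
    PRICE=0.05     # hypothetical max bid, USD/hour

    while true; do
      REQ_IDS=$(aws ec2 request-spot-instances \
          --region "$REGION" \
          --spot-price "$PRICE" \
          --instance-count "$BATCH" \
          --launch-specification '{"ImageId":"ami-12345678","InstanceType":"c1.xlarge"}' \
          --query 'SpotInstanceRequests[].SpotInstanceRequestId' --output text)

      sleep 30     # give EC2 time to evaluate the requests

      STATUS=$(aws ec2 describe-spot-instance-requests \
          --region "$REGION" \
          --spot-instance-request-ids $REQ_IDS \
          --query 'SpotInstanceRequests[].Status.Code' --output text)

      # Stop once the status code signals exhausted capacity
      # (quoted above as "capacity-not-able").
      echo "$STATUS" | grep -q 'capacity-not' && break
    done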

My Crystal Ball


Looking into my crystal ball, the pioneering work of Cycle and Open Grid Scheduler / Grid Engine provides great stepping stones that will deliver bigger and bigger clusters inside Amazon. But something bothers me.

  • there is a limit on how big these clusters can grow
  • there is a limit to how many resources AWS/EC2 can offer; for High Throughput Computing, AWS alone is not sufficient
  • I do not know how many will put up with AWS prices, with the risk of AWS competing with its own customers, and with the risk of proven, recurring outages
  • why place a free open source product on very expensive infrastructure and feed money to third parties, when the developers themselves don't make a cent?

I am astonished that no one I know has asked this question yet.


A tool like Bosco?


Imagine a tool that, with one single script, submits jobs both to an HTCondor-based Cycle cluster and to Open Grid Scheduler. This tool is called Bosco. As a leading scientist described this project:

Bosco can help us by spreading the practical incarnation of the "Submit Locally, Run Globally" concept in High Throughput Computing (HTC). If you submit, say, to SGE (Sun Grid Engine in one of its many flavors), PBS, or LSF clusters, you can get in, but you cannot get out to another cluster. You are stuck with SGE or PBS or LSF. When you submit through Bosco, you can go out everywhere. And that's the concept: Bosco helps science a lot, because High End Science is about HTC.
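In practice the "submit locally, run globally" flow is small enough to show. A sketch, assuming a hypothetical PBS cluster login.cluster.example.edu and the command forms documented by Bosco at the time:

    # 1. Register the remote cluster once; Bosco installs its glue there over SSH.
    bosco_cluster --add jdoe@login.cluster.example.edu pbs

    # 2. Describe the job in an ordinary HTCondor submit file (bosco_test.sub):
    #      universe      = grid
    #      grid_resource = batch pbs jdoe@login.cluster.example.edu
    #      executable    = analysis.sh
    #      output        = job.out
    #      error         = job.err
    #      log           = job.log
    #      queue

    # 3. Submit locally; Bosco routes the job to the remote scheduler.
    condor_submit bosco_test.sub

Pointing the same submit file at a different cluster is a one-line change to grid_resource, which is exactly the "go out everywhere" property described above.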
Assuming AWS offers the resources for free, the tool can work right away. If you don't care what you pay, and you can open the AWS invoice without having a heart attack :), sure, use Bosco. In real life, we need accounting and cost forecasting capabilities. AWS has created an entire cottage industry trying to do just that.

But for the future, the concept of adding clusters as easily as adding nodes to a cloud is the winning proposition.

Clouds are too small. We need to enable super clouds, where each node is a cluster; call it cloud-as-a-node, to sound cool.

Notes:

(1)  Bosco beta v1.1 is available for download now. Try it!
(2) I am part of the tiny Bosco team, but the opinions expressed in this blog are entirely mine.

