There is an acute need to transform the classical high performance computing (HPC) skill set. Adrian Cockcroft of Battery Ventures says in a recent interview with InsideHPC:
"The biggest skill shortage in IT today is related to big data and data scientists. Most people in HPC have the analysis and math skills needed to meet this need, but you may need to retool from being an MPI programmer to learning R or Hadoop or Spark."
From Supercomputing 2014, Nicole Hemsoth reports yesterday morning (November 18, 2014) in HPCwire:
"As many are aware, the market has actually turned out to be something of a purgatory, with major technology shifts in processor, memory, accelerator, and storage still off on the horizon. ... (the HPC) market has ground to a halt. ... The global Top 500 list of the world’s fastest supercomputers remained almost entirely unchanged at the high end."
Readers of this blog have seen the warning signs that this was coming. In my 2012 post "Why HPC TOP500 never made any money and never will in its' present shape" I said that HPC never catered to real markets. Its practitioners just believed it did.
The BoscoR project, which started almost two years ago, showed the importance of learning R.
The BigData TechCon in Boston (April 2015) advertises itself as follows:
"Why not enhance your career skills and increase your future value by becoming a Big Data Expert? More than 40 expert speakers and instructors will teach you how to master Hadoop, Spark, NoSQL, Hive, R, Pig, MongoDB, Cassandra, and other Big Data technologies and put them to work in your company!"
Today every organization has to collect data to stay competitive. Organizations understand how to store it, retrieve it and slice it. The idea now is to understand the data itself, to detect patterns and trends that will help the organization get new customers or members.
Opportunity for HPC is in Big Data
High performance computing engineers are trained to solve complex data problems, an order of magnitude larger than anything imaginable. Here is a quote from an interview with Dr. Frank Wuerthwein, an expert in the search for new particle physics phenomena at the high energy frontier with the CMS detector at the LHC (Large Hadron Collider) at CERN:
"To my knowledge, there are only two other places on the planet that have our volume of data. The National Security Agency (NSA) and Google. I am just guessing, as I don't know exactly how Google manages its data."
Dr. Wuerthwein is one of the nearly six thousand scientists who worked for nearly twenty years to detect the legendary Higgs particle. The work led to the Nobel Prize in Physics in 2013.
However, no real-life company analyzing big data has the means or the patience to pay 6,000 scientists for 20 years to get an extraordinary result.
Some of the technologies used at CERN will reach the mainstream. Hadoop was born from the internal needs of Google and Yahoo, and this is how Cloudera, Hortonworks and MapR came to be.
But it is not easy to find top HPC engineers who start companies. I have seen this first hand. I can offer many explanations, but none matches my utter disbelief at a community of people who believe that life cannot be happier outside research and universities.
From Big Data to Executive Decision - marketing approach
This is a metric that depends on the level of confidence of people who are neither statisticians nor data scientists, yet make all the important decisions in corporations and government. At the top, we have the elite CEOs of the world. They must believe us in spite of the punch line "If you torture big data long enough, it will confess."
The Economist Intelligence Unit surveyed over 600 business leaders, across the globe and across industry sectors, about the use of Big Data in their organizations. In other words, non-technical interviewers asked mostly non-technical C-level executives how Big Data will affect them. The final result is an infographic like this:
[Infographic courtesy of Capgemini]
I look at this picture. It is nice and it has many beautiful colors, but it says absolutely nothing that would increase the credibility of a Big Data analysis. It lacks substance.
From Big Data to Executive Decision - engineering approach
I discovered an article on the blog of Evo Eftimov, a hands-on big data consultant from London, U.K.: "Big Data and Systems Explained in Simple Practical Terms".
This is his opening paragraph:
"This is a conceptual level overview intended to map and explain key concepts and thus facilitate the decision making process of senior executives, architects, business analysts and developers. Big Data stripped from hype and magic."
Then I read things that I guess most interviewers from The Economist Intelligence Unit did not know as clearly:
"What is Big Data?
Think about your current data sets (and/or new data sets you would like to start using) and the following possibilities:
- Each of your current data sets can get much longer. The implications are that they will contain many more historical patterns about the behavior of your business or physical processes. More patterns equates to more predictive analytics power. Longer data sets also equate to better situational awareness (different from predictive analytics), because they contain more information about broader set of events in the environment of interest.
- You can join each of your current data sets with other / additional data sets and thus obtain data sets with more variables / dimensions (conducive to more powerful models) as well as more opportunities for event correlations (cause-effect chains) across the different data sets.
Only then start thinking about MBs, GBs, TBs, PBs, etc and whether it can be stored on one or more computers in terms of how big, your big data really is."
Aha! So even before being amazed by PBs, just have a logic to organize the data sets for maximum predictability and situational awareness. The following is the clearest, shortest description of how many big data types we have and what "unstructured data" is:
"Big Data Types
- Big Data at rest / in storage
- Streaming Big Data (the key difference with the above is that often it is processed as a sequence of segments aka sliding window or even discrete data points)
As the content of each of the above can be (1) Structured (e.g. market data), (2) Semi-structured and (3) Unstructured (created without specific format in mind e.g. the content of tweets, blogs, forums / chat rooms and some types of documents such as web pages)"
Evo makes a startling statement: when it comes to big data algorithms, there is nothing new under the sun. These algorithms have been around for ages and were used for “small data”. His article describes those algorithms and then states:
"So for those wondering what exactly a “Data Scientist” is supposed to mean and provide, the answer is these are usually Ph.D. level people specializing in e.g. Statistical Analysis, Machine Learning and Natural Language Processing. Like the category 3 algorithms, they have been around for ages. The new glamour of big data have pulled them into the limelight and have given them a new title."
The article concludes:
"How do you Apply the Big Data Algorithms to Big Data for Maximum Performance?
Well, by applying the well known principle of parallel processing / algorithms. If you want something done faster, especially if it is also associated with the processing of large amount of data, then divide the processing and the data into chunks and then execute them in parallel."
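The principle quoted above (divide the data into chunks, then execute the chunks in parallel) can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not anything taken from Evo's article: the function names are hypothetical, a sum of squares stands in for a real analytics step, and a thread pool from the standard library is used to keep the sketch portable. For CPU-heavy work on real big data, one would more likely swap in a process pool or a cluster framework such as Spark or MPI.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_chunks):
    """Split a list into at most n_chunks contiguous pieces of near-equal size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def chunk_sum_of_squares(chunk):
    """The work done independently on one chunk (a stand-in for real analytics)."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_chunks=4, executor_cls=ThreadPoolExecutor):
    """Divide the data, process every chunk in a worker pool, combine the results."""
    chunks = split_into_chunks(data, n_chunks)
    # executor_cls is an assumption of this sketch: any executor with the
    # concurrent.futures interface (for example a process pool) could be used.
    with executor_cls(max_workers=n_chunks) as pool:
        partial_results = list(pool.map(chunk_sum_of_squares, chunks))
    # The combine step is cheap because each chunk was reduced to one number.
    return sum(partial_results)
```

Calling `parallel_sum_of_squares(list(range(10)), n_chunks=3)` splits the ten values into three chunks, reduces each chunk in its own worker, and adds the three partial sums. The pattern only works this simply because the per-chunk results can be combined independently of the order in which the workers finish.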
HPC offers Big Data processing at a scale never attempted before
If there is nothing new in the algorithms for processing data, and if data scientists have always existed under different names, what does HPC bring that is new to Big Data?
- Parallel processing
- Scale and size of the data
In the BoscoR project, R researchers stated "that distributed or parallel processing was the least common solution to their big data needs. This could be attributed to the difficulty of processing data with the R language on distributed resources, a challenge we set out to solve with BoscoR."
This calls for a new breed of HPC engineer, one who masters all the Big Data tools in use today. We also need well-financed startup companies to deliver Big Data processing and analytics services, the same way Amazon offers cloud services today.
Government and academia cannot lead this movement towards a new, financially independent HPC. If I had the power, I would hand a copy of Zero to One to every participant, visitor or passerby at SC'14, ISC'15 and other conferences.