Thursday, January 02, 2014

Why Big Data is not for everyone

In a recent article in Readme, we learn that Hadoop will finally bring data mainstream:
The dirty little secret of Hadoop has been just how dull many of its tasks have been. By far the biggest use for Hadoop to date has been as a "poor person's ETL"—that is, a form of data integration, at the risk of oversimplifying—rather than all the big, sexy data science we see constantly hyped.
The "hyped" Big Data is not trivial

Bradley Voytek, a UCSD neuroscience professor and Uber data evangelist, is one of the speakers in a webinar published on the skytree.net site, Machine Learning: How to Make it Work in Your Organization. He taught me a vivid lesson: ordinary people cannot make sense of big data by themselves.
  • It is foolish to believe that my data have a better understanding of the world than I do
  • It is arrogant to believe that the person who best knows what to do with my data is me.
Professor Bradley Voytek, Ph.D., and family
  • The more advanced the statistical method used, the fewer critics are available to be properly skeptical
  • The more advanced the statistical method used, the more likely the data analysts will be to use math as a shield
  • Any sufficiently advanced statistics can trick people into believing the results reflect truth
To illustrate, Bradley shows his calculation of how many people were born in the British Empire between September 4, 1752 and September 13, 1752. He took world birth data for that period, extrapolated, and then applied a percentage proportional to the British Empire's share of the then-known world population.

However, it was impossible for any citizen of the British Empire to be born between September 4, 1752 and September 13, 1752. From Wikipedia:
Year 1752 (MDCCLII) was a leap year starting on Saturday of the Gregorian calendar, and a leap year starting on Wednesday of the 11-day slower Julian calendar. In the British Empire, it was the only year with 355 days, as September 3 through September 13 were skipped.
Sometimes even the great algorithms we have can fail if we have no knowledge of the real world. It is important to know when our models work, but it is equally important to know when our models break.
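Here is a minimal Python sketch of the same trap, written by me and not taken from the webinar. Python's standard datetime module uses the proleptic Gregorian calendar, so it will happily build a date inside the gap. The helper names (existed_in_british_empire, count_british_days_1752) are hypothetical, and the skipped range is taken from the Wikipedia passage quoted above. The count simply tallies calendar labels for the year and ignores the Julian-to-Gregorian offset before the switch.

```python
from datetime import date, timedelta

# Days skipped in the British Empire in 1752, per the Wikipedia quote:
# September 3 through September 13 (inclusive).
SKIPPED_START = date(1752, 9, 3)
SKIPPED_END = date(1752, 9, 13)


def existed_in_british_empire(d: date) -> bool:
    """Return False for the 1752 calendar dates that were skipped."""
    return not (SKIPPED_START <= d <= SKIPPED_END)


def count_british_days_1752() -> int:
    """Count the calendar dates of 1752 that were not skipped."""
    d = date(1752, 1, 1)
    n = 0
    while d.year == 1752:
        if existed_in_british_empire(d):
            n += 1
        d += timedelta(days=1)
    return n


# The library accepts a date nobody in the British Empire lived through.
naive = date(1752, 9, 5)
print(naive.isoformat(), existed_in_british_empire(naive))  # 1752-09-05 False

# 366 leap-year dates minus the 11 skipped ones.
print(count_british_days_1752())  # 355
```

The library is not wrong; it simply has no idea which calendar a given country was using. That piece of real-world knowledge has to come from the analyst.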

By the way, Hadoop does not do any predictive analytics. It just collects the data, ready to be analyzed.

Skytree's CTO, Alexander Gray, says that no single ML (Machine Learning) algorithm is universally valid. Is your analysis parametric or non-parametric? Frequentist or Bayesian? If you rush to look up the definitions of these terms, you have proved Bradley Voytek right.

Alexander Gray, Ph.D., Skytree CTO
How many people viewed this YouTube webinar? I see only 194. That such a fascinating subject has only 194 viewers shows we are talking about a new elite.

They are the elite of big data scientists, able to use open source R for predictive analysis on clusters, or to hire Revolution Analytics (who actually use open source R to deliver more solid and easier-to-use predictive analytics). Or maybe they use Skytree Server - The Machine Learning Server, which, according to the web site, is:
The Hadoop ready enterprise Machine Learning platform. Delivering high performance Advanced Analytics for critical business issues, such as: churn prediction, fraud detection, lead scoring, customer segmentation, recommendations and more.
This product embodies the popular perception that we can press buttons to discover money-making opportunities from Machine Learning. Perhaps, for a reduced set of capabilities, this could be a solution that eliminates the need to consult the data science elite for some of the more day-to-day tasks.

Woody Allen says one of the secrets of success is just showing up. So, in spite of the rumored difficulties, Skytree at least showed up with a product that works for mere mortals and not just elites.