Hadoop 101 paper by Miha Ahronovitz and Kuldip Pabla
Originally written for Cloud Tutorial
What is Hadoop?
Kuldip Pabla, Ahrono & Associates
Hadoop is a fault-tolerant distributed system for data storage which is highly scalable. The scalability is the result of a Self-Healing High Bandwith Clustered Storage , known by the acronym of HDFS (Hadoop Distributed File System) and a specific fault-tolerant Distributed Processing, known as MapReduce.
Why have Hadoop as integral part of the enterprise IT?It processes and analyzes variety of new and older data to extract meaningful business operations wisdom. Traditionally data moves to the computation node. In Hadoop, data is processed where the data resides . The type of questions one Hadoop helps answer are:
What types of data we handle today?Human-generated data that fits well into relational tables or arrays. Examples are conventional transactions – purchase/sale, inventory/manufacturing, employment status change, etc. This is the core data managed by OLTP relational DBMS everywhere. In the last decade, humans generated other kinds of data as well, like text, documents (text or otherwise), pictures, videos, slideware. Traditional relational databases are a poor home for this kind of data because:
Example of Hadoop usageNetflix (NASDAQ: NFLX) is a service offering online flat rate DVD and Blu-ray disc rental-by-mail and video streaming in the United States. It has over 100,000 titles and 10 million subscribers. The company has 55 million discs and, on average, ships 1.9 million DVDs to customers each day. Netflix offers Internet video streaming, enabling the viewing of films directly on a PC or TV at home. Netflix’s movie recommendation algorithm uses Hive (underneath using Hadoop, HDFS, MapReduce) for query processing and Business Intelligence. Netflix collects all logs from website which are streaming logs collected using Hunu.
They parse 0.6TB of data running on Amazon S3 50 nodes. All data are processed for Business Intelligence using a software called MicroStrategy.
Hadoop challengesTraditionally, Hadoop was opened for developers. But the wide adoption and success of Hadoop depends on business users, not developers. Commercial distributions will have to mak
Servers running Hadoop at Yahoo.com
To best illustrate, here it is a quote from Yahoo Hadoop development team:
Hadoop Integration with resource management cloud softwareOne such example is Oracle Grid Engine 6.2 Update 5. Cycle Computing also announced an integration with Hadoop. It reduces the cost of running Apache Hadoop applications by enabling them to share resources with other data center applications, rather than having to maintain a dedicated cluster for running Hadoop applications. Here is a relevant customer quote “The Grid Engine software has dramatically lowered for us the cost of data intensive, Hadoop centered, computing. With its native understanding of HDFS data locality and direct support for Hadoop job submission, Grid Engine allows us to run Hadoop jobs within exactly the same scheduling and submission environment we use for traditional scalar and parallel loads. Before we were forced to either dedicate specialized clusters or to make use of convoluted, adhoc, integration schemes; solutions that were both expensive to maintain and inefficient to run. Now we have the best of both worlds: high flexibility within a single, consistent and robust, scheduling system"”
Getting Started with HadoopHadoop is an open source implementation of the MapReduce algorithms and distributed file system. Hadoop is primarily developed in Java. Writing a Java application, obviously, will give you much more control and presumably improved performance. However, it can be used with other environments including scripting languages using “streaming”. Streaming applications simply reads data from stdin and write their output to stdout.
Installing HadoopTo install Hadoop, you will need to download Hadoop Common (also referred as Hadoop Core) from http://hadoop.apache.org/common/. The binaries are available from Open Source under an Apache License. Once you have downloaded the Hadoop Common, follow the installation and configuration instructions.
Hadoop With Virtual MachineIf you have no experience playing with Hadoop, there is an easier way to install and experiment with Hadoop. Rather than installing a local copy of Hadoop, install a virtual machine from Yahoo! Virtual machine comes with Hadoop pre-installed and pre-configured and is almost ready to use. The virtual machine is available from their Hadoop tutorial. This tutorial includes well documented instructions for running the virtual machine and running Hadoop applications. The virtual machine, in addition to Hadoop, includes Eclipse IDE for writing Java based Hadoop applications.
Hadoop ClusterBy default, Hadoop distributions are configured to run on single machine and the Yahoo virtual machine is a good way to get going. However, the power of Hadoop comes from its inherent distributed nature and deploying distributed computing on a single machine misses its very point. For any serious processing with Hadoop, you’ll need many more machines. Amazon’s Elastic Compute Cloud (EC2) is perfect for this. An alternative option to running Hadoop on EC2 is to use the Cloudera distribution. And of course, you can set up your own cluster of Hadoop by following the Apache instructions. Resources
There is a large active developer community who created many scripted languages such as HBase, Hive, Pig and others). Cloudera, has a supported distribution.e it even easier for business analysts to use Hadoop.