Hadoop using Cloudera CDH

Cloudera’s Distribution including Apache Hadoop (CDH)
• A single, easy-to-install package from the Apache Hadoop core repository
• Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version
• 100% open source

• Apache Hadoop
• Apache Hive
• Apache Pig
• Apache HBase
• Apache ZooKeeper
• Apache Flume, Hue, Apache Oozie, Apache Sqoop, Apache Mahout

HP Vertica Hadoop Distributed File System (HDFS) Connector

HP Vertica was the first analytic database company to deliver a Hadoop Connector. HP Vertica now offers two connectors to transfer data seamlessly between Hadoop and HP Vertica:
  1. The Hadoop Distributed File System (HDFS) connector enables you to load data from HDFS using the HP Vertica native COPY facility. This mechanism simplifies and accelerates the process of loading data stored in HDFS without any MapReduce coding. The connector also ensures that data is loaded from the Hadoop cluster with the optimal amount of parallelism. By using the connector with the HP Vertica External Tables feature, you can even query data in HDFS without copying data into HP Vertica.
  2. The Hadoop & Pig Connector is bidirectional and enables you to move data from Hadoop to HP Vertica or vice versa via either MapReduce or Pig jobs.
With HP Vertica HDFS and Pig Connectors, you have unprecedented flexibility and speed in loading data from HDFS to the HP Vertica Analytics Platform and querying data from the HP Vertica Analytics Platform in Hadoop. The HP Vertica HDFS and Pig Connectors are open source, supported by HP Vertica, and available for download.
HP Vertica provides optimized JDBC and ODBC client drivers for most platforms including Windows, Linux, Solaris, AIX, and others.
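As a hedged sketch of the COPY facility and External Tables feature described above, loading and querying HDFS data might look like the following (the table names, namenode host, and file paths are hypothetical, and the exact connector syntax may differ by HP Vertica version):

```sql
-- Illustrative only: bulk-load CSV files from HDFS via the HDFS connector.
COPY sales_fact
SOURCE Hdfs(url='http://namenode:50070/webhdfs/v1/data/sales/*.csv')
DELIMITER ',';

-- Illustrative only: query data in place on HDFS without copying it
-- into HP Vertica, using an external table.
CREATE EXTERNAL TABLE sales_ext (id INT, amount FLOAT)
AS COPY SOURCE Hdfs(url='http://namenode:50070/webhdfs/v1/data/sales/*.csv');
```

Because COPY runs on the HP Vertica side, no MapReduce code is needed for the load itself.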

Companies that use Hadoop (Big Data)

The largest Hadoop deployments are probably at Yahoo! and Facebook.
On February 19, 2008, Yahoo! Inc. launched what it claimed was the world’s largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a more than 10,000 core Linux cluster and produces data that is used in every Yahoo! Web search query.

In June 2012 Facebook claimed that they had the largest Hadoop cluster in the world with 100 PB of storage.

American Airlines
Electronic Arts
the New York Times
Trend Micro
Aggregate Knowledge
Skybox Imaging
Gravity Interactive
CBS Interactive
PulsePoint
Huron Consulting Group
RapLeaf
Apollo Group Inc.

What is Apache Hadoop?

Hadoop is an open-source software framework for the distributed processing of large datasets.

Scalable data storage and processing

  • Open source Apache project
  • Harnesses the power of commodity servers
  • Distributed and fault-tolerant

“Core” Hadoop consists of two main parts

  • HDFS (storage)
  • MapReduce (processing)
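The two parts above fit together in the MapReduce model: mappers emit key/value pairs from data stored in HDFS, the framework shuffles them by key, and reducers aggregate each key's values. A toy sketch (plain Python standing in for the framework, not actual Hadoop code) of the classic word count:

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs, as a Hadoop mapper would.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts for one key, as a Hadoop reducer would.
    return key, sum(values)

lines = ["big data big cluster", "big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In real Hadoop the lines would be blocks of a file in HDFS and the map and reduce tasks would run in parallel across the cluster's nodes.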

Big data implementation lifecycle

Big Data has two lifecycle stages: implementation and management. The implementation stage includes the following activities:

  • Cluster sizing & scalability roadmap
  • Hardware architecture for cluster elements
  • Network architecture for Big Data
  • Storage architecture for Big Data
  • Information Security architecture for Big Data
  • Cluster Installation

The Big Data management stage includes the following activities:

  • Cluster configuration
  • Cluster Administration
  • Commissioning & de-commissioning of cluster nodes
  • Version upgrade
  • Data backup/restore
  • Data Archival
  • Cluster performance tuning
  • Security log collection and analysis

Big data software packages

Product and vendor:

  • Apache Hadoop suite, including HDFS and Pig Latin: open source, Apache Foundation ( http://hadoop.apache.org )
  • Cloudera Hadoop Distribution: Cloudera ( http://www.cloudera.com )
  • Hortonworks Hadoop Distribution: Hortonworks ( http://www.hortonworks.com )
  • Greenplum: EMC ( http://www.greenplum.com )
  • Netezza: IBM ( http://www-01.ibm.com/software/in/data/netezza/ )
  • InfoSphere: IBM ( http://www.ibm.com/software/data/infosphere )
  • Oracle Big Data Appliance: Oracle ( http://www.oracle.com/us/products/database/big-data-appliance/overview/index.html )
  • MongoDB: open source ( http://www.mongodb.org )
  • CouchDB: open source ( http://couchdb.apache.org )

Best Alternative to Hadoop

There are many alternatives to Hadoop, but they lag far behind: Hadoop is the undisputed leader in Big Data.

The most promising alternative to Hadoop is Spark.

Spark ( http://www.spark-project.org/ ) is another open-source system, developed at the UC Berkeley AMP Lab. Users include UC Berkeley, Conviva, Klout, and Quantifind, among others.

The claim is that it runs up to 100x faster than Hadoop MapReduce in scenarios such as iterative algorithms and interactive data mining. Spark is a general-purpose data processing engine, and is also the engine behind Shark, a fully Apache Hive-compatible data warehousing system that can run up to 100x faster than Hive. A comparison of logistic regression performance on Hadoop MapReduce and Spark is shown in the figure below (as advertised by the Spark project).

Spark benefits from in-memory processing, compared to Hadoop's disk-based approach: it can cache datasets in memory to speed up their reuse.
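The benefit of caching for iterative algorithms can be shown with a toy sketch (plain Python, not Spark code; the "expensive load" simulates re-reading a dataset from disk or HDFS each pass, which is what Spark's in-memory caching avoids):

```python
def load_dataset():
    # Stand-in for an expensive read of a large dataset from disk/HDFS.
    return [i * i for i in range(100_000)]

def iterate_without_cache(iterations):
    # Disk-based reuse: the dataset is re-read on every iteration.
    total = 0
    for _ in range(iterations):
        data = load_dataset()
        total += sum(data)
    return total

def iterate_with_cache(iterations):
    # In-memory reuse: load once, keep the dataset cached, iterate over it.
    data = load_dataset()
    total = 0
    for _ in range(iterations):
        total += sum(data)
    return total

# Both compute the same result; the cached version pays the load cost once.
assert iterate_with_cache(5) == iterate_without_cache(5)
```

In Spark the same idea is expressed by calling cache() or persist() on a dataset that an iterative algorithm will traverse many times.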

No single framework fits every kind of problem, so it is wise to evaluate related distributed frameworks such as Spark, which can provide solutions to specialized scenarios. Compatibility with Hadoop is a plus.


What is Big Data?

Gartner defines Big Data as high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

We use Big Data especially for unstructured data. An example of unstructured data is a collection of customers' social media posts gathered for analyzing customer behavior. About 80% of the data captured today is unstructured: sensor readings used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. All of this unstructured data is Big Data.

Organizations are discovering that important predictions can be made by sorting through and analyzing Big Data. However, since 80% of this data is unstructured, it must be formatted (or structured) in a way that makes it suitable for data mining and subsequent analysis. Hadoop is the most popular Big Data package: it is the core platform for structuring Big Data, and it solves the problem of making such data useful for analytics.
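What "structuring" unstructured data means can be sketched with a toy example: turning free-form social-media posts into records suitable for mining (the post format and extracted fields here are invented for illustration; a real pipeline would run this kind of extraction as a distributed job):

```python
import re

posts = [
    "2023-01-05 @alice: loving the new phone! #mobile",
    "2023-01-06 @bob: battery life is terrible #mobile #fail",
]

def structure(post):
    # Extract date, user, hashtags, and text into a structured record.
    date, rest = post.split(" ", 1)
    user = re.match(r"@(\w+):", rest).group(1)
    tags = re.findall(r"#(\w+)", rest)
    text = re.sub(r"#\w+", "", rest.split(": ", 1)[1]).strip()
    return {"date": date, "user": user, "tags": tags, "text": text}

records = [structure(p) for p in posts]
print(records[0]["user"], records[0]["tags"])  # alice ['mobile']
```

Once the raw posts are in this structured form, they can be loaded into an analytic store and queried like any other table.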