Apache Hadoop, an open source software framework this is very well designed to support data intensive distributed analytics involving thousands of nodes and petabytes of data as its core comprised of Hadoop distributed file systems and Map Reduce components. Fast, reliable analysis of unstructured and complex data made many enterprises to deploy Hadoop with their IT legacy systems. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.
Over the years, various components have been added and it is evolving to handle various data challenges and needs businesses are facing today.
Some of the major components are as follows:
HDFS – Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. The HDFS architecture deoptimizes time to access and optimizes time to read. The time required to access the first record is sacrificed to accelerate the time required to read a complete file. HDFS is therefore appropriate where a small number of large files are to be stored and processed.
HBase – HBase is the Hadoop database. HBase provides the capability to perform random read/write access to data. HBase is architected as a distributed, versioned, column-oriented database that is designed to store very large tables — billions of rows with millions of columns using a cluster of commodity hardware. HBase is layered over HDFS and exploits its distributed storage architecture.
MapReduce – Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. It essentially consists of jobs which have two functions, map and reduce, and a framework for running a large number of instances of these programs on commodity hardware. The map function reads a set of records from an input file, processes these records, and outputs a set of intermediate records. These output records take the generic form of (Key, Data). As part of the map function, a split function distributes the intermediate records across many buckets using a hash function. The reduce function then processes the intermediate records.
HIVE – Hive is described as a data warehouse infrastructure built on top of Hadoop. Hive actually implements a query language (Hive QL), based on the SQL syntax, that can be used to access and transform data held within HDFS. The execution of a Hive QL statement generates a MapReduce job to transform the data as required by the Hive QL statement. Translating this to the RDBMS vernacular, Hive QL can be considered part view and part stored procedure. Two differentiators between Hive QL and SQL are that Hive QL jobs are optimized for scalability (all rows returned) not latency (first row returned) and Hive QL implements a subset of the SQL language.