Hadoop and components

Apache Hadoop is an open source software framework designed to support data-intensive distributed analytics involving thousands of nodes and petabytes of data. At its core it comprises the Hadoop Distributed File System (HDFS) and the MapReduce component. Fast, reliable analysis of unstructured and complex data has led many enterprises to deploy Hadoop alongside their legacy IT systems. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.

 

Over the years, additional components have been added, and Hadoop continues to evolve to handle the data challenges and needs businesses face today.

 

Some of the major components are as follows:

HDFS – Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. The HDFS architecture trades access latency for read throughput: the time required to access the first record is sacrificed to accelerate the time required to read a complete file. HDFS is therefore appropriate where a small number of large files are to be stored and processed.
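As a minimal sketch of how an application uses HDFS, the Java snippet below writes a small file and streams it back through the Hadoop FileSystem API; the NameNode address and file path are placeholders, and block replication happens transparently underneath.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS replicates its blocks across data nodes behind the scenes.
        Path file = new Path("/data/example.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Stream the file back; HDFS reads are optimized for scanning whole files.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}
```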

 

HBase – HBase is the Hadoop database. HBase provides the capability to perform random read/write access to data. HBase is architected as a distributed, versioned, column-oriented database that is designed to store very large tables — billions of rows with millions of columns using a cluster of commodity hardware. HBase is layered over HDFS and exploits its distributed storage architecture.
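A minimal sketch of that random read/write access using the HBase Java client API; the subscribers table, the info column family and the row key are illustrative and assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("subscribers"))) {

            // Random write: one row keyed by subscriber id, one column in family "info".
            Put put = new Put(Bytes.toBytes("msisdn-491700000001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("plan"), Bytes.toBytes("prepaid"));
            table.put(put);

            // Random read of the same row.
            Result row = table.get(new Get(Bytes.toBytes("msisdn-491700000001")));
            System.out.println(
                Bytes.toString(row.getValue(Bytes.toBytes("info"), Bytes.toBytes("plan"))));
        }
    }
}
```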

 

MapReduce – Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. It essentially consists of jobs which have two functions, map and reduce, and a framework for running a large number of instances of these programs on commodity hardware. The map function reads a set of records from an input file, processes these records, and outputs a set of intermediate records. These output records take the generic form of (Key, Data). As part of the map function, a split function distributes the intermediate records across many buckets using a hash function. The reduce function then processes the intermediate records.
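The classic word-count job is a minimal sketch of this model, written against the Hadoop MapReduce Java API; the class names, job name and command-line input/output paths are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: read a line, emit (word, 1) intermediate records.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE); // partitioned to reducers by a hash of the key
            }
        }
    }

    // Reduce: sum the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```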

 

HIVE – Hive is described as a data warehouse infrastructure built on top of Hadoop. Hive actually implements a query language (Hive QL), based on the SQL syntax, that can be used to access and transform data held within HDFS. The execution of a Hive QL statement generates a MapReduce job to transform the data as required by the Hive QL statement. Translating this to the RDBMS vernacular, Hive QL can be considered part view and part stored procedure. Two differentiators between Hive QL and SQL are that Hive QL jobs are optimized for scalability (all rows returned) not latency (first row returned) and Hive QL implements a subset of the SQL language.
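A minimal sketch of how a client might submit a Hive QL statement through the HiveServer2 JDBC driver; the server address and the cdr table with its caller column are illustrative assumptions, and the query is compiled into a MapReduce job as described above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver:10000/default", "", ""); // placeholder host/port
             Statement stmt = conn.createStatement()) {

            // SQL-like Hive QL over data held in HDFS; executed as a MapReduce job.
            ResultSet rs = stmt.executeQuery(
                "SELECT caller, COUNT(*) AS calls FROM cdr GROUP BY caller");
            while (rs.next()) {
                System.out.println(rs.getString("caller") + "\t" + rs.getLong("calls"));
            }
        }
    }
}
```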

Use cases for big data in telecom

The following use cases have been identified for analytics on these data sets:

· CDR Analytics

o Regulatory Compliance / CDR secondary data store

o Mediation replacement

o EDW augmentation

· Digital Content Analytics

o Content Analytics

o Network Analytics with 4G/LTE

o Targeted Advertisement

· Service Request and log analytics

o Product Engineering – Defects analytics, Usability Analytics, Warranty/Repair

o Service Assurance – Service Desk Operational efficiency

· Customer Application Mining

o Document Management and search/Mining

· Public Safety – Video Surveillance

Further analysis of technology feasibility, current solution scenarios, repeatability of the opportunity and TCS customer cases narrowed the list down to the following use cases:

1. CDR Regulatory Compliance

2. Mediation Replacement

3. Digital Content analytics

4. Service Request and Log Analytics

The other use cases need to be analyzed further to see if a broader solution base and customer base can be derived for further solution development.

Big data implementation for telcos

Telcos have various areas where large volumes of unstructured and semi-structured data are generated at a fast pace or need to be processed with low latency.

 

The following data sets were analyzed for telcos:

 

Data Set      | Volume | Variety | Velocity | Remarks
Order         | Medium | Medium  | Low      | Customer application form is semi-structured
CDRs          | High   | Low     | High     | Billions of records per day
Payments      | Medium | Low     | Medium   |
Network Data  | High   | Medium  | High     | Mostly structured data for call usage; very high volumes; semi-structured if web data (deep packet inspection) is included
Subscriber    | Medium | Low     | Low      |
Products      | Low    | Low     | Low      | Telcos moving towards simpler products

 

It is clear from these data sets that CDRs, network data and web usage data (covering web surfing, social interactions and content consumption – video, audio, shopping) are the key data sets relevant for Big Data.

Emergence of Big Data – what is the new way?

Current Data Management in Telecom vs Big Data

Traditionally, telcos have used RDBMSs and appliances such as Teradata, Netezza and Exadata to process their large volumes of CDR and network data.

These solutions have evolved to handle mass data-processing requirements, but they bring a large set of obligations with them:
1. Cost, which is very high for any of these traditional appliance solutions – both one-time and AMC
2. Proprietary software and hardware, which make the CSP depend heavily on the vendor
3. Not all requirements can be met with a single implementation of the software, i.e. for every major requirement the CSP has to deploy a similar or identical solution, which means additional cost

Big Data in Telecom


Traditionally, telcos have handled large volumes of call detail records (CDRs), and with the emergence of 3G and 4G and smarter devices (smartphones, tablets), the number of CDRs continues to grow exponentially. For a telecom service provider with an average of 50 million subscribers, it is clear how ‘big’ the data handled by the Communication Service Provider (CSP) is, especially usage records, both for operational activities and for strategic decisions. In addition to this voluminous data, other challenges are:

· Data gets generated from different network elements

· Data gets stored in multiple places in multiple applications, for ease of management and to meet specific business requirements

Further, this data also carries valuable information about customers and their behaviour patterns, which can be leveraged to improve the top line, the bottom line and customer centricity.

Big Data analytics classification

Big Data analytics can be differentiated from traditional data-processing architectures along a number of dimensions:

Big Data analytics is most appropriate for addressing strategic decisions based on very large data volumes. It can handle larger volumes of data than enterprise data warehouses and can handle non-structured data (semi-structured, quasi-structured and unstructured). It is currently not suitable for real-time analytics. Complex processing of data can be enabled using Big Data technologies.


Big Data Key players

· IBM offers InfoSphere BigInsights, based on Hadoop, in both a basic and an enterprise edition.

· Google added AppEngine-MapReduce to support running Hadoop 0.20 programs on Google App Engine.

· Cloudera offers CDH (Cloudera’s Distribution including Apache Hadoop) and Cloudera Enterprise.

· In May 2011, MapR Technologies, Inc. announced the availability of their distributed file system and MapReduce engine, the MapR Distribution for Apache Hadoop.

· EMC released EMC Greenplum Community Edition and EMC Greenplum HD Enterprise Edition in May 2011. The community edition, with optional for-fee technical support, consists of Hadoop, HDFS, HBase, Hive, and the ZooKeeper configuration service.

· In June 2011, Yahoo! and Benchmark Capital formed Hortonworks Inc., whose focus is on making Hadoop more robust and easier to install, manage and use for enterprise users.

· In Oct 2011, Oracle announced the Big Data Appliance, which integrates Hadoop, Oracle Enterprise Linux, the R programming language, and a NoSQL database with the Exadata hardware.


3 V’s of Big data

Big data spans three dimensions: Variety, Velocity and Volume.

Variety – Big data extends beyond structured data, including unstructured data of all varieties: text, audio, video, click streams, log files and more.

Velocity – Often time-sensitive, big data must be used as it streams into the enterprise in order to maximize its value to the business.

Volume – Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.

Which data is big data?

Every day, we generate 2.5 quintillion bytes (2.5 million terabytes) of data, and 90% of the data in the world today has been created within the past two years. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos posted online, transaction records of online purchases, and cell phone GPS signals, to name a few. This data is Big Data.
Big data is more than a challenge; it is an opportunity to find insight in new and emerging types of data, to make business more agile, and to answer questions that, in the past, were beyond reach.

 

Big Data is fast emerging as a technology trend that lets organizations leverage large volumes of a variety of data (structured, unstructured and semi-structured), generated at high velocity, for analysis and for deriving better insights for their business.