What are NoSQL databases?

NoSQL stands for “Not Only SQL”, where SQL is “Structured Query Language”.

SQL databases – Relational and OLAP
Non-SQL databases – Key-value, Column-family, Graph, Document

Non-SQL, or non-relational, databases are still in a rapid growth stage compared with the already mature SQL stores. The advantages of non-SQL databases, helped by the open-source innovation around Hadoop since 2005, are their free upfront investment and flexible structures, which allow far more data that does not fit into a neat table to be captured.
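
To make "data that does not fit into a neat table" concrete, here is a minimal Python sketch. It assumes two made-up records (the field names are purely illustrative) and uses an in-memory SQLite table for the relational side and a plain dict of JSON strings as a stand-in for a key-value or document store. It is not how any particular NoSQL product works internally.

```python
import json
import sqlite3

# Hypothetical records: the field names are illustrative only.
records = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Lin", "tags": ["vip", "beta"], "address": {"city": "Oslo"}},
]

# Relational style: the columns are fixed up front, so any field
# outside the schema (tags, address) has nowhere to go.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
for r in records:
    conn.execute(
        "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
        (r["id"], r["name"], r.get("email")),
    )

# Document style: each record is stored whole under its key.
# A dict stands in for the store here (an assumption for this sketch).
document_store = {}
for r in records:
    document_store[r["id"]] = json.dumps(r)

relational_row = conn.execute(
    "SELECT id, name, email FROM users WHERE id = 2"
).fetchone()
document = json.loads(document_store[2])

print(relational_row)
print(document)
```

In this sketch the relational row for id 2 keeps only the id and name, with an empty email column, while the document keeps the tags and the nested address as well. That is the flexibility described above, and it comes at the price of a less predictable structure.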

SQL vs NoSQL databases – relational vs non-relational databases

  • Upfront cost: free to very expensive
  • Ongoing cost: lower operating costs from administration and analytics (SQL); higher operating costs from administration and analytics (NoSQL)
  • Processing speed: slow, mostly historical
  • Data volume: MB to TB (SQL); TB to ZB (NoSQL)
  • Data structure: structured data, as in an RDBMS (SQL); semi-structured to unstructured (NoSQL)
  • Schema: stable models (SQL); unstable, flat schemas (NoSQL)

Definition of Big Data

Big data is not a single technology but a combination of old and new technologies that help companies gain actionable insight. Big data is therefore the capability to manage a huge volume of disparate data, at the right speed and within the right time frame, to allow real-time analysis and reaction. Big data is typically broken down by three characteristics:

  • Volume: How much data
  • Velocity: How fast that data is processed
  • Variety: The various types of data

Although it’s convenient to reduce big data to the three Vs, doing so can be misleading and overly simplistic. For example, you may be managing a relatively small amount of very disparate, complex data, or you may be processing a huge volume of very simple data. That simple data may be all structured or all unstructured. Even more important is the fourth V: veracity. How accurate is that data in predicting business value? Do the results of a big data analysis actually make sense?

It is critical that you don’t underestimate the task at hand. Data must be verifiable in terms of both accuracy and context. An innovative business may want to analyze massive amounts of data in real time to quickly assess the value of a customer and the potential to make additional offers to that customer. It is necessary to identify the right amount and types of data that can be analyzed to impact business outcomes. Big data incorporates all data, including structured data and unstructured data from e-mail, social media, text streams, and more. This kind of data management requires that companies leverage both their structured and unstructured data.

3 V’s of Big Data

Big data is defined as any kind of data source that has at least three shared characteristics:

  • Extremely large Volumes of data (How much data)
  • Extremely high Velocity of data (How fast that data is processed)
  • Extremely wide Variety of data (The various types of data)

Programming language R

R is an open-source programming language and software environment for statistical computing and graphics. R is an implementation of the S programming language combined with lexical scoping semantics inspired by Scheme. R was created by Ross Ihaka and Robert Gentleman and is now developed by the R Core Team. The R environment is easily extended through packages distributed on CRAN (the Comprehensive R Archive Network).

Additional free resources include:
  • An Introduction to R, a basic introduction for beginners.
  • The R Language Definition, a more technical discussion of the R language itself.
  • Writing R Extensions, a development guide for R.
  • R Data Import/Export, a data import and export guide.
  • R Installation, an installation guide (from R source code).
  • R Internals, internal structures and coding guidelines.

IBM PureData System for Hadoop

IBM PureData System for Hadoop is part of the IBM PureSystems family of solutions (including IBM PureApplication System, IBM PureFlex System and PureData System) and is designed to help organizations embrace big data, cloud computing and mobile computing. According to IBM, the system has built-in data archive capabilities to analyze historical data and conduct real-time data analysis.

IBM PureData System for Hadoop is essentially an extension of IBM’s Hadoop-based platform (called InfoSphere BigInsights) that allows companies of all sizes to cost-effectively manage and analyze data, and to handle administration, workflow, provisioning and security.

What is Hive?

Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop. It provides tools for easy data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL (also known as HiveQL), that enables users familiar with SQL to query the data. At the same time, this language allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
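
To show what a mapper and a reducer do, here is a minimal pure-Python sketch of the MapReduce pattern. It does not use Hive or Hadoop. The table name page_views, the sample rows and the function names are all assumptions made for illustration. The job computes roughly what a SQL-like query such as SELECT page, COUNT(*) FROM page_views GROUP BY page would return.

```python
from collections import defaultdict

# Hypothetical rows standing in for records of a file stored in Hadoop.
rows = [
    {"page": "/home", "user": "a"},
    {"page": "/about", "user": "b"},
    {"page": "/home", "user": "c"},
]


def mapper(row):
    """Map step: emit one (key, value) pair per input row."""
    yield row["page"], 1


def reducer(key, values):
    """Reduce step: combine all values that share a key."""
    return key, sum(values)


def run_job(input_rows):
    """Run map, shuffle (group by key) and reduce in a single process."""
    groups = defaultdict(list)
    for row in input_rows:
        for key, value in mapper(row):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


counts = run_job(rows)
print(counts)
```

In a real cluster the map and reduce steps run in parallel on many machines and the framework handles the grouping between them. The sketch keeps everything in one process so that the division of labour between the two functions is easy to see.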

Google and Hadoop

Hadoop was heavily influenced by Google’s architecture – notably, the Google File System and MapReduce publications.

Hadoop saw early adoption by web companies such as Yahoo and Facebook.