LinkedIn’s graph networks, which capture complex first-, second-, and third-degree connectivity, are best stored in databases like Espresso DB or Neo Technology’s Neo4j.
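To make "first, second, and third degree connectivity" concrete, here is a minimal sketch of how connection degrees could be computed with a breadth-first search over an adjacency list. The `network` data and the `connection_degrees` helper are hypothetical and only illustrate the idea; this is not how LinkedIn or any graph database implements it.

```python
from collections import deque


def connection_degrees(graph, start, max_degree=3):
    """Return {member: degree} for everyone within max_degree hops of start."""
    degrees = {start: 0}
    queue = deque([start])
    while queue:
        member = queue.popleft()
        if degrees[member] == max_degree:
            continue  # do not expand beyond the requested degree
        for neighbor in graph.get(member, ()):
            if neighbor not in degrees:
                # First time we reach this member, so this is the shortest path.
                degrees[neighbor] = degrees[member] + 1
                queue.append(neighbor)
    del degrees[start]  # the starting member is not its own connection
    return degrees


# Hypothetical sample network; each connection is listed in both directions.
network = {
    "ana": ["ben", "cho"],
    "ben": ["ana", "dev"],
    "cho": ["ana"],
    "dev": ["ben", "eli"],
    "eli": ["dev", "fay"],
    "fay": ["eli"],
}

degrees = connection_degrees(network, "ana")
```

A graph database stores these relationships directly, so a traversal like this does not have to be rebuilt from join tables on every query.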
What are NoSQL databases?
NoSQL databases (Not Only “Structured Query Language”)
SQL databases – Relational and OLAP
Non-SQL databases – Key value, Column-Family, Graph, Document
Non-SQL, or non-relational, databases are still in a rapid growth stage compared with the already mature SQL stores. Their main advantages, driven by open-source innovation around Hadoop since 2005, are the absence of upfront investment and their flexible structures, which make it possible to capture far more data that does not fit into a neat table.
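As a rough illustration of the four non-relational models listed above, the sketch below expresses one hypothetical user profile in each shape using plain Python structures. All names and values are invented for the example, and real products (key-value, document, column-family, or graph stores) each have their own storage formats and APIs.

```python
import json

# Key-value: an opaque value looked up by a single key.
key_value_store = {"user:42": json.dumps({"name": "Ana", "city": "Lisbon"})}

# Document: nested, self-describing records; fields may differ per document.
document_store = [
    {"_id": 42, "name": "Ana", "city": "Lisbon", "skills": ["SQL", "R"]},
    {"_id": 43, "name": "Ben"},  # no city or skills, still a valid document
]

# Column-family: row key -> column family -> columns.
column_family_store = {
    "42": {
        "profile": {"name": "Ana", "city": "Lisbon"},
        "activity": {"last_login": "2014-01-01"},  # hypothetical value
    },
}

# Graph: nodes plus explicit, named relationships between them.
graph_store = {
    "nodes": {42: {"name": "Ana"}, 43: {"name": "Ben"}},
    "edges": [(42, "KNOWS", 43)],
}


def field_names(documents):
    """Collect the union of fields seen across documents of differing shape."""
    names = set()
    for doc in documents:
        names.update(doc)
    return sorted(names)


fields = field_names(document_store)
```

The two documents do not share the same fields, which is the kind of data "that does not fit into a neat table": a relational table would need nullable columns or a schema change to hold both.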
SQL vs NoSQL databases – relational vs non-relational databases
| | SQL databases | NoSQL databases |
|---|---|---|
| Databases | Oracle, Microsoft, MySQL | Hadoop, MongoDB, Cassandra, Redis |
| Upfront cost | Free to very expensive | Free |
| Ongoing cost | Lower operating costs from administration and analytics | Higher operating costs from administration and analytics |
| Throughput | High | Low |
| Latency (processing speed) | Real-time | Slow, mostly historical |
| Portability | MB to TB | TB to ZB |
| Storage | Centralized | Distributed |
| Structure | Structured data like RDBMS | Semi-structured to unstructured |
| Consistency | Stable models | Unstable flat schemas |
3 V’s of Big Data
Big data is defined as any kind of data source that has at least three shared characteristics:
- Extremely large Volumes of data (How much data)
- Extremely high Velocity of data (How fast that data is processed)
- Extremely wide Variety of data (The various types of data)
Definition of Big Data
Big data is not a single technology but a combination of old and new technologies that helps companies gain actionable insight. Therefore, big data is the capability to manage a huge volume of disparate data, at the right speed, and within the right time frame to allow real-time analysis and reaction. As noted above, big data is typically broken down by three characteristics:
- Volume: How much data
- Velocity: How fast that data is processed
- Variety: The various types of data
Although it’s convenient to simplify big data into the three Vs, it can be misleading and overly simplistic. For example, you may be managing a relatively small amount of very disparate, complex data or you may be processing a huge volume of very simple data. That simple data may be all structured or all unstructured. Even more important is the fourth V: veracity. How accurate is that data in predicting business value? Do the results of a big data analysis actually make sense?
Programming language R
R is an open source programming language and software environment for statistical computing and graphics. R is an implementation of the S programming language combined with lexical scoping semantics inspired by Scheme. R was created by Ross Ihaka and Robert Gentleman and is now developed by the R Development Core Team. The R environment is easily extended through a packaging system on CRAN.
- An Introduction to R, a basic introduction for beginners.
- The R Language Definition, a more technical discussion of the R language itself.
- Writing R Extensions, a development guide for R.
- R Data Import/Export, a data import and export guide.
- R Installation, an installation guide (from R source code).
- R Internals, internal structures and coding guidelines.
IBM PureData System for Hadoop
IBM PureData System for Hadoop is part of the IBM PureSystems family of solutions (including IBM PureApplication System, IBM PureFlex System and PureData System) and is designed to help organizations embrace big data, cloud computing and mobile computing. According to IBM, the system has built-in data archive capabilities to analyze historical data and conduct real-time data analysis.
IBM PureData System for Hadoop is essentially an extension of IBM’s Hadoop-based platform (called InfoSphere BigInsights) that allows companies of all sizes to cost-effectively manage and analyze data, and to handle administration, workflow, provisioning and security.
Hadoop distributions available in the market
Oracle and HP are partnered with Cloudera
Microsoft and Teradata are partnered with Hortonworks
IBM came up with its own Hadoop distribution (IBM InfoSphere BigInsights)
Intel offers its own Hadoop distribution
SAP is partnering with Intel and Hortonworks
What is Hive?
Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop. It provides tools for easy data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL (also known as HiveQL), that enables users familiar with SQL to query the data. At the same time, this language allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
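To show what a custom mapper and reducer do, here is a small single-process sketch of the MapReduce pattern in plain Python. It computes roughly what a SQL-like `SELECT word, COUNT(*) ... GROUP BY word` query would return. The function names and sample lines are hypothetical; this is not Hive's or Hadoop's API, only an illustration of the map, shuffle, and reduce steps.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce step: collapse all values for one key into a single total."""
    return word, sum(counts)


def run_job(lines):
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)  # shuffle: collect values per key
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical input; on a cluster these lines would come from files in Hadoop.
sample_lines = ["big data big insight", "data at rest"]
word_counts = run_job(sample_lines)
```

On a real cluster the map and reduce steps run in parallel across many machines and the framework performs the shuffle, but the division of work is the same.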
Google and Hadoop
Hadoop was heavily influenced by Google’s architecture, notably the Google File System and MapReduce publications.
It saw early adoption by web companies such as Yahoo and Facebook.