Hadoop Big Data Technology

Some basic concepts about Hadoop technology

Hadoop Overview:

Apache Hadoop is an open-source software framework, written in Java, for the distributed storage and distributed processing of very large data sets (Big Data) on computer clusters built from commodity hardware.
The core of Apache Hadoop consists of a storage part, the Hadoop Distributed File System (HDFS), and a processing part, MapReduce.
Hadoop splits files into large blocks (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x) and distributes the blocks among the nodes in the cluster. To process the data, Hadoop MapReduce ships packaged code (JAR files) to the nodes that hold the required blocks, and those nodes then process the data in parallel. This approach takes advantage of data locality, allowing the data to be processed faster and more efficiently via distributed processing than with a more conventional supercomputer architecture, which relies on a parallel file system where computation and data are connected via high-speed networking.
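As a rough illustration of how a client interacts with this storage layer, the sketch below writes a small file through the HDFS Java API and then asks the NameNode which hosts hold its blocks. The NameNode address (hdfs://namenode:9000) and the file path are placeholder assumptions for the example, not values from any particular cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; a real cluster would set this in core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/sample.txt");  // hypothetical path

        // Write a small file; HDFS splits larger files into fixed-size blocks.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hadoop");
        }

        // Ask the NameNode where each block of the file is stored.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            for (String host : block.getHosts()) {
                System.out.println("block hosted on: " + host);
            }
        }
    }
}

It is these block locations that the MapReduce scheduler consults when deciding where to run each task, which is how data locality is achieved in practice.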


Hadoop architecture:

Hadoop is composed of four core components: Hadoop Common, the Hadoop Distributed File System (HDFS), MapReduce, and YARN.
Hadoop Common – contains the libraries and utilities needed by other Hadoop modules;
Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and scheduling users’ applications; and
Hadoop MapReduce – a programming model for large-scale data processing.
 
Hadoop Common:
A module containing the utilities that support the other Hadoop components.
HDFS
A file system that provides reliable data storage and access across all the nodes in a Hadoop cluster. It links together the file systems on many local nodes to create a single file system.
Yet Another Resource Negotiator (YARN)
The next-generation MapReduce, which assigns CPU, memory and storage to applications running on a Hadoop cluster. It enables application frameworks other than MapReduce to run on Hadoop, opening up a wealth of possibilities.
MapReduce
A framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable, fault-tolerant manner.
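To make the programming model concrete, here is a minimal word-count job sketched against the org.apache.hadoop.mapreduce API (Hadoop 2.x style); the class names and the command-line input/output paths are illustrative assumptions, not part of any particular distribution.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Reusing the reducer as a combiner lets each node pre-aggregate its map output locally, cutting the volume of data shuffled across the network between the map and reduce phases.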
The term “Hadoop” often refers not just to the base modules above but also to the collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Spark, and others.
Pig
A high-level language (Pig Latin) designed to handle any type of data, helping users focus more on analyzing large data sets and less on writing map programs and reduce programs.
Hive
A data warehouse component for Hadoop that allows those fluent in SQL to write Hive Query Language (HQL) statements, which are similar to SQL statements; these are broken down into MapReduce jobs and executed across the cluster.
HBase
A column-oriented non-relational (NoSQL) database that runs on top of HDFS and is often used for sparse data sets; a short client-API sketch appears after this list.
Spark
An open-source cluster computing framework with in-memory analytics performance that is up to 100 times faster than MapReduce, depending on the application.
Sqoop
A tool for efficiently transferring bulk data between Hadoop and structured data stores such as relational databases.
Flume
A distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of log data. Its main goal is to deliver data from applications to HDFS.
HCatalog
A table and storage management service for Hadoop data that presents a table abstraction so the user does not need to know where or how the data is stored.
Jaql
A query language designed for JavaScript Object Notation (JSON), which is primarily used to analyze large-scale semi-structured data. Core features include user extensibility and parallelism.
Avro
An Apache open source project that provides data serialization and data exchange services for Hadoop.
Solr
An enterprise search tool from the Apache Lucene project that offers powerful search capabilities, including hit highlighting, as well as indexing capabilities, reliability and scalability, a central configuration system, and failover and recovery.
Kerberos
A network authentication protocol that works on the basis of “tickets” to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner.
ZooKeeper
A centralized infrastructure and set of services that enable synchronization across a cluster.
Oozie
A management application that simplifies workflow and coordination between MapReduce jobs.
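Because HBase (described above) exposes a plain Java client API, it lends itself to a short example. The sketch below performs a minimal put/get round trip using the HBase 1.x-style Connection/Table interface; the table name, column family, and row key are invented for illustration, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {  // hypothetical table

            // Write one cell: row "user1", column family "info", qualifier "email".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                    Bytes.toBytes("user1@example.com"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println("email: " + Bytes.toString(email));
        }
    }
}

Because columns are grouped into families and stored only when written, sparse rows cost nothing for the columns they leave empty, which is why HBase suits sparse data sets.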


Why use Hadoop?

Hadoop changes the economics and the dynamics of large-scale computing.
While large Web 2.0 companies such as Yahoo! and Facebook use Hadoop to store and manage their huge data sets, Hadoop has also proven valuable for many more traditional enterprises, based on its five big advantages.
Scalable
Hadoop is a highly scalable storage platform, because it can store and distribute very large data sets across hundreds of inexpensive servers that operate in parallel. Unlike traditional relational database management systems (RDBMSs), which struggle to scale to such volumes, Hadoop enables businesses to run applications on thousands of nodes involving thousands of terabytes of data.

Cost effective

Hadoop also offers a cost-effective storage solution for businesses’ exploding data sets. Scaling a traditional relational database management system to process such massive volumes of data is prohibitively expensive. To reduce costs, many companies in the past had to down-sample their data and classify it based on assumptions about which data was most valuable; the raw data would then be deleted, as it was too costly to keep. While this approach may have worked in the short term, it meant that when business priorities changed, the complete raw data set was no longer available. Hadoop, on the other hand, is designed as a scale-out architecture that can affordably store all of a company’s data for later use. The cost savings are staggering: instead of costing thousands to tens of thousands of pounds per terabyte, Hadoop offers computing and storage capabilities for hundreds of pounds per terabyte.

Flexible

Hadoop enables businesses to easily access new data sources and tap into different types of data (both structured and unstructured) to generate value from that data. This means businesses can use Hadoop to derive valuable business insights from data sources such as social media, email conversations or clickstream data. In addition, Hadoop can be used for a wide variety of purposes, such as log processing, recommendation systems, data warehousing, market campaign analysis and fraud detection.

Fast

Hadoop’s storage method is based on a distributed file system that ‘maps’ data wherever it is located on a cluster, and the tools for data processing usually run on the same servers where the data resides, resulting in much faster processing. When dealing with large volumes of unstructured data, Hadoop can efficiently process terabytes of data in minutes, and petabytes in hours.

Resilient to failure

A key advantage of using Hadoop is its fault tolerance. When data is sent to an individual node, that data is also replicated to other nodes in the cluster, which means that in the event of failure, there is another copy available for use.
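As a small illustration of how that replication is controlled, the sketch below raises the replication factor of a single file through the HDFS Java API; the file path is assumed to exist, and 3 is simply the usual HDFS default, used here as an example target.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The cluster-wide default lives in hdfs-site.xml as dfs.replication (commonly 3).
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/sample.txt");  // assumed to exist
        // Raise this file's replication so more copies survive node failures.
        fs.setReplication(file, (short) 3);

        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication factor: " + status.getReplication());
    }
}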
The MapR distribution goes beyond this by eliminating the NameNode and replacing it with a distributed ‘no NameNode’ architecture that provides true high availability, protecting against both single and multiple failures.
When it comes to handling large data sets in a safe and cost-effective manner, Hadoop has the advantage over relational database management systems, and its value for any size business will continue to increase as unstructured data continues to grow.

Prominent Users:

Yahoo!
On February 19, 2008, Yahoo! Inc. launched what it claimed was the world’s largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produces data used in every Yahoo! web search query.
There are multiple Hadoop clusters at Yahoo! and no HDFS file systems or MapReduce jobs are split across multiple datacenters. Every Hadoop cluster node bootstraps the Linux image, including the Hadoop distribution. Work that the clusters perform is known to include the index calculations for the Yahoo! search engine.
On June 10, 2009, Yahoo! made the source code of the version of Hadoop it runs in production available to the public. Yahoo! contributes all the work it does on Hadoop to the open-source community. The company’s developers also fix bugs, provide stability improvements internally, and release this patched source code so that other users may benefit from their effort.
Facebook:
In 2010, Facebook claimed to have the largest Hadoop cluster in the world, with 21 PB of storage. On June 13, 2012, they announced that the data had grown to 100 PB, and on November 8, 2012, they announced that the data gathered in the warehouse was growing by roughly half a PB per day.
Other users:
As of 2013, Hadoop adoption is widespread. For example, more than half of the Fortune 50 use Hadoop.


References:

http://www-01.ibm.com/software/data/infosphere/hadoop/ 
http://searchstorage.techtarget.com/tip/Hadoop-clusters-Benefits-and-challenges-for-big-data-analytics
http://www.itproportal.com/2013/12/20/big-data-5-major-advantages-of-hadoop/
http://en.wikipedia.org/wiki/Apache_Hadoop