Hadoop Distributed File System | HDFS


Design And Concepts

  • Hadoop is based on work done by Google in the late 1990s/early 2000s
  • Specifically, on papers describing the Google File System (GFS) published in 2003, and MapReduce published in 2004
  • Data is distributed and stored in a distributed file system called HDFS (Hadoop Distributed File System)
  • Data is spread among machines in advance
  • No data transfer over the network is required for initial processing
  • Computation happens where the data is stored, wherever possible
  • Individual nodes can work on data local to those nodes
  • Data is replicated multiple times on the system for increased availability and reliability
  • Developers need not worry about network programming

A distributed framework for processing and storing data, generally on commodity hardware.

  • Completely Open Source
  • Top level Apache project
  • Available for download from Apache Hadoop website http://hadoop.apache.org
  • Runs on Linux, Mac OS X, Windows, and Solaris.
  • Client apps can be written in various languages.



Features:

  • Scalable: store and process petabytes, scale by adding Hardware
  • Economical: 1000’s of commodity machines
  • Efficient: run tasks where data is located
  • Reliable: data is replicated, failed tasks are rerun

Primarily used for batch data processing, not real-time / user facing applications.

  • HDFS is responsible for storing data on the cluster
  • Data is split into blocks and distributed across multiple nodes in the cluster
  • Each block is typically 64 MB or 128 MB in size
  • Each block is replicated multiple times
    • Default is to replicate each block three times
    • Replicas are stored on different nodes
    • This ensures both reliability and availability
  • Files in HDFS are ‘write once’
    • No random writes to files are allowed
  • Optimized for large, streaming reads of files
    • Rather than random reads
  • Performs best with a ‘modest’ number of large files
    • Millions, rather than billions, of files
    • Each file typically 100 MB or more
  • “Classic” HDFS Architecture
    • NameNode (master)
    • DataNode (slave)
    • Secondary NameNode (master)
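The block-size and replication defaults above translate into simple storage arithmetic. A quick illustration in plain Java (not a Hadoop API; the 128 MB block size, replication factor 3, and 1 GB file size are just example values based on the defaults mentioned above):

```java
public class BlockMath {
    public static void main(String[] args) {
        long blockSize   = 128L * 1024 * 1024;   // 128 MB default block size
        long fileSize    = 1024L * 1024 * 1024;  // an example 1 GB file
        int  replication = 3;                    // default replication factor

        // Ceiling division: a file is split into whole blocks, with the
        // last block possibly smaller than blockSize.
        long blocks = (fileSize + blockSize - 1) / blockSize;

        // Raw storage consumed across the cluster, counting all replicas.
        long rawMB = fileSize * replication / (1024 * 1024);

        System.out.println(blocks); // 8  (blocks in the file)
        System.out.println(rawMB);  // 3072  (MB of raw storage used)
    }
}
```

This is why HDFS favors a modest number of large files: every block, however small, costs a metadata entry in the NameNode's RAM, so billions of tiny files exhaust the NameNode long before the disks fill up.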



NameNode – NN (master)

  • Stores all metadata
  • Information about file ownership and permissions
  • Information about file locations in HDFS
  • Names of the individual blocks
  • Locations of the blocks
  • File attributes, e.g. creation time, replication factor
  • Holds metadata in RAM for fast access
  • Changes to the metadata are made in RAM and are also written to a log file on disk called edits
  • Metadata is stored on disk and read when the NN starts up
  • Collects block reports from DataNodes on block locations
  • Replicates missing blocks

DataNode – DN (slave)

  • Actual contents of the files are stored as blocks on the slave nodes
  • Block creation, deletion and replication upon instruction from the NN
  • Blocks are simply files on the slave nodes’ underlying file system
  • Nothing on the slave node provides information about what underlying file the block is a part of
  • Each block is stored on multiple different nodes for redundancy
  • Communicates with the NN and periodically sends block reports
  • Clients access the blocks directly from data nodes for read and write

Secondary NameNode – 2NN (master)

  • Secondary NameNode (2NN) is not a failover NameNode
  • Performs memory-intensive administrative functions for the NameNode
  • Periodically combines a prior file system snapshot and edit log into a new snapshot
  • New snapshot is transmitted back to the NameNode



Basic File System Operations

File system operations include moving, deleting, reading, copying, and listing directories.
Applications can read and write HDFS files directly via the Java API.
Access to HDFS from the command line is achieved with the hadoop fs command:

  • hadoop fs -ls (list directory contents)
  • hadoop fs -mkdir (make directory)
  • hadoop fs -rmdir (remove directory)
  • hadoop fs -cat (view file contents)

The Java Interface


  • Applications can read and write HDFS files directly via the Java API
  • Static factory methods to obtain a FileSystem instance:
    • public static FileSystem get(Configuration conf) throws IOException
    • public static FileSystem get(URI uri, Configuration conf) throws IOException
  • Method to create a directory:
    • public boolean mkdirs(Path f) throws IOException
  • Method on FileSystem to permanently remove files or directories:
    • public boolean delete(Path f, boolean recursive) throws IOException
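Putting these calls together, a minimal sketch of creating and then removing a directory through the Java API. This assumes the Hadoop client libraries are on the classpath and a running cluster; the URI `hdfs://namenode:8020` and the path `/user/example/demo` are placeholders, not values from this document:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDirExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Placeholder URI; in practice the default comes from
        // fs.defaultFS in core-site.xml, via FileSystem.get(conf).
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path dir = new Path("/user/example/demo");

        boolean created = fs.mkdirs(dir);       // creates the directory and any missing parents
        System.out.println("created: " + created);

        boolean deleted = fs.delete(dir, true); // recursive = true removes contents as well
        System.out.println("deleted: " + deleted);

        fs.close();
    }
}
```

Note that delete is permanent from the API's point of view (unless the cluster has the HDFS trash feature enabled), so the recursive flag should be set with care.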