To learn Hadoop, you first need to understand Big Data and how Hadoop came into the picture. Then you should understand how the Hadoop architecture works with respect to HDFS (Hadoop Distributed File System), YARN (Yet Another Resource Negotiator) & Map Reduce.
Let us go through these in brief:
What is Big Data?
Big Data is a term used for collections of data sets that are so large and complex that they are difficult to store and process using available database management tools or traditional data processing applications. The challenges include capturing, storing, searching, sharing, transferring, analyzing and visualizing this data.
It is characterized by 5 V’s.
Volume: Volume refers to the ‘amount of data’, which is growing day by day at a very fast pace.
Velocity: Velocity is defined as the pace at which different sources generate the data every day. This flow of data is massive and continuous.
Variety: As there are many sources contributing to Big Data, the types of data they generate are different. The data can be structured, semi-structured or unstructured.
Value: It is all well and good to have access to big data, but unless we can turn it into value it is useless. Value means finding insights in the data and benefiting from them.
Veracity: Veracity refers to the doubt or uncertainty about the available data caused by inconsistency and incompleteness.
What is Hadoop & its architecture?
For storage, we use HDFS (Hadoop Distributed File System). The main components of HDFS are Name Node and Data Node.
Name Node: The Name Node is the master daemon that maintains and manages the Data Nodes (slave nodes). It records the metadata of all the files stored in the cluster, e.g. the location of the stored blocks, the size of the files, permissions, hierarchy, etc. It records each and every change that takes place to the file system metadata.
For example, if a file is deleted in HDFS, the Name Node will immediately record this in the Edit Log. It regularly receives a Heartbeat and a block report from all the Data Nodes in the cluster to ensure that the Data Nodes are alive. It keeps a record of all the blocks in HDFS and of the nodes on which these blocks are stored.
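The bookkeeping described above can be pictured with a small in-memory sketch. This is not Hadoop code: the class ToyNameNode, its method names and the 10-second heartbeat timeout are all hypothetical, chosen only to show an edit log, a block map and heartbeat tracking side by side.

```python
import time


class ToyNameNode:
    """In-memory stand-in for the Name Node's bookkeeping (illustrative only)."""

    def __init__(self, heartbeat_timeout=10.0):
        self.edit_log = []          # ordered record of every metadata change
        self.block_map = {}         # file path -> list of block ids
        self.block_locations = {}   # block id -> set of Data Node names
        self.last_heartbeat = {}    # Data Node name -> time of last heartbeat
        self.heartbeat_timeout = heartbeat_timeout  # assumed value, in seconds

    def create_file(self, path, block_ids):
        self.block_map[path] = list(block_ids)
        self.edit_log.append(("CREATE", path))

    def delete_file(self, path):
        # Forget the file's blocks and log the change as soon as it happens.
        for block_id in self.block_map.pop(path, []):
            self.block_locations.pop(block_id, None)
        self.edit_log.append(("DELETE", path))

    def receive_heartbeat(self, node, block_report, now=None):
        # A heartbeat shows the node is alive; the block report says what it holds.
        self.last_heartbeat[node] = time.time() if now is None else now
        for block_id in block_report:
            self.block_locations.setdefault(block_id, set()).add(node)

    def live_nodes(self, now=None):
        now = time.time() if now is None else now
        return sorted(
            node for node, seen in self.last_heartbeat.items()
            if now - seen <= self.heartbeat_timeout
        )


nn = ToyNameNode(heartbeat_timeout=10.0)
nn.create_file("/logs/a.txt", ["blk_1", "blk_2"])
nn.receive_heartbeat("dn1", ["blk_1"], now=100.0)
nn.receive_heartbeat("dn2", ["blk_1", "blk_2"], now=105.0)
print(nn.block_locations["blk_1"])   # both dn1 and dn2 hold blk_1
nn.delete_file("/logs/a.txt")
print(nn.edit_log)                   # CREATE followed by DELETE
print(nn.live_nodes(now=112.0))      # dn1 was last seen 12 s ago, so only dn2 is live
```

The explicit now argument is only there to make the example repeatable; a real cluster would of course use wall-clock time.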
Data Node: Data Nodes are the slave daemons, which run on each slave machine. The actual data is stored on Data Nodes. They are responsible for serving read and write requests from the clients. They are also responsible for creating, deleting and replicating blocks based on the decisions taken by the Name Node.
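To make the idea of blocks and replication concrete, here is a rough sketch of how a file could be cut into blocks and each block copied to several Data Nodes. The function names, the block size of 128, the replication factor of 3 and the round-robin placement are all assumptions for illustration; real HDFS chooses replica locations with a rack-aware policy, which this sketch does not attempt to model.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of `file_size` units is cut into."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])


def place_replicas(num_blocks, data_nodes, replication):
    """Assign each block to `replication` distinct Data Nodes, round-robin.

    Round-robin is a simple stand-in, not the placement policy HDFS uses.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough Data Nodes for the requested replication")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            data_nodes[(block + i) % len(data_nodes)] for i in range(replication)
        ]
    return placement


blocks = split_into_blocks(file_size=300, block_size=128)
print(blocks)  # [128, 128, 44]

placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
for block, nodes in placement.items():
    print(block, nodes)
```

With three copies of every block, losing any single Data Node still leaves two copies of each block it held.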
For processing, we use YARN (Yet Another Resource Negotiator), which manages the cluster's resources and schedules the applications that run on them. The components of YARN are Resource Manager and Node Manager.
Resource Manager: The Resource Manager is a cluster-level component (one for each cluster) and runs on the master machine. It manages resources and schedules applications running on top of YARN.
Node Manager: The Node Manager is a node-level component (one on each node) and runs on each slave machine. It is responsible for managing containers and monitoring resource utilization in each container. It also keeps track of node health and log management. It continuously communicates with the Resource Manager to remain up-to-date.
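The split of responsibilities between the two YARN components can be sketched as follows. Everything here is hypothetical: the classes ToyResourceManager and ToyNodeManager, the memory figures, and the "pick the node with the most free memory" rule are assumptions made for the example, and the real YARN schedulers are considerably more involved.

```python
class ToyNodeManager:
    """One per node: launches containers and tracks how much memory they use."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.total_mb = memory_mb
        self.containers = {}  # container id -> memory in MB

    def free_mb(self):
        return self.total_mb - sum(self.containers.values())

    def launch(self, container_id, memory_mb):
        self.containers[container_id] = memory_mb

    def report(self):
        # The status a node would keep sending to the Resource Manager.
        return {"node": self.name, "free_mb": self.free_mb(),
                "containers": len(self.containers)}


class ToyResourceManager:
    """One per cluster: decides which node gets each requested container."""

    def __init__(self, node_managers):
        self.node_managers = list(node_managers)
        self.next_id = 0

    def allocate(self, memory_mb):
        """Place a container on the node with the most free memory, or return None."""
        best = max(self.node_managers, key=lambda nm: nm.free_mb())
        if best.free_mb() < memory_mb:
            return None  # no node can currently fit the request
        self.next_id += 1
        container_id = f"container_{self.next_id}"
        best.launch(container_id, memory_mb)
        return container_id, best.name


nm1 = ToyNodeManager("nm1", memory_mb=4096)
nm2 = ToyNodeManager("nm2", memory_mb=2048)
rm = ToyResourceManager([nm1, nm2])

print(rm.allocate(3000))  # ('container_1', 'nm1')
print(rm.allocate(1500))  # ('container_2', 'nm2')
print(rm.allocate(2000))  # None, since neither node has 2000 MB free
print(nm1.report())
```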
With storage and resource management in place, you can process the data stored in HDFS in parallel using Map Reduce.
Map Reduce: Map Reduce is the core processing component of the Hadoop Ecosystem, as it provides the logic of processing. In other words, Map Reduce is a software framework which helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment. In a Map Reduce program, Map() and Reduce() are two functions. The Map function performs actions like filtering, grouping and sorting. The result generated by the Map function is a key-value pair (K, V), which acts as the input for the Reduce function. The Reduce function then aggregates and summarizes the values for each key.
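A word count is the usual way to see the Map and Reduce steps in action. The sketch below runs in a single Python process and does not use Hadoop at all; the function names map_fn, shuffle, reduce_fn and word_count are made up for the example, and the shuffle step stands in for the grouping by key that the framework performs between Map and Reduce on a real cluster.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key, values):
    """Reduce: aggregate all the values collected for one key."""
    return key, sum(values)


def word_count(lines):
    # On a cluster, each line (or block of lines) could be mapped on a different node.
    mapped = [pair for line in lines for pair in map_fn(line)]
    grouped = shuffle(mapped)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


counts = word_count(["big data is big", "hadoop stores big data"])
print(counts)  # {'big': 3, 'data': 2, 'hadoop': 1, 'is': 1, 'stores': 1}
```

Because each call to map_fn depends only on its own line, and each call to reduce_fn only on its own key, both stages can be spread across many machines, which is what makes the model suitable for parallel processing.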