Apache Hadoop is a framework for running computer clusters. It provides distributed data storage (through a distributed file system) and parallel data processing across the nodes of a cluster. You can think of Hadoop as an operating system for computer clusters. Like a conventional operating system, it consists of several components. The most important are:
- Hadoop Distributed File System (HDFS), which stores data distributed across the nodes of the cluster
- YARN, the platform that manages the cluster's computing resources and allocates them to the applications running on the nodes
- MapReduce, the programming model that enables parallel processing of data within the computer cluster
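
To make the MapReduce idea concrete, here is a minimal sketch of a word count in plain Python. It does not use the Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration, and a local thread pool stands in for the nodes of a cluster. It only shows the general pattern: each chunk of data is mapped independently, the intermediate results are grouped by key, and each group is reduced to a single value.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across every chunk."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values collected for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks, workers=2):
    # Assumption: threads stand in for cluster nodes; each one maps a chunk
    # independently of the others, as a node would with its local data.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    chunks = ["big data needs big clusters", "clusters process data in parallel"]
    print(word_count(chunks))
```

In a real cluster the chunks would be pieces of files stored in HDFS, and YARN would decide which nodes run the map and reduce tasks. The sketch leaves both out and keeps everything in one process.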
Hadoop is open-source software managed by the Apache Software Foundation and released under the Apache License. Nevertheless, several companies sell or rent out commercial distributions of it and provide support for them. Examples of such vendors are MapR and Hortonworks.
When Hadoop was released in 2006, it changed the world of high-volume data for the better. Before that, a few world-class technology companies already operated computer clusters with proprietary solutions to meet their own high-volume data storage and processing needs, but these solutions were out of reach for the rest of the world. Then came Hadoop, an open-source framework that made it economically viable for a wide range of companies to run computer clusters.
Hadoop is used when you are dealing with "big data", that is, when the requirements for storing and processing data exceed the capacity of a single computer and call for a computer cluster, and when you prefer open-source software to paying for a proprietary product.