Spark

Apache Spark is an open-source cluster computing framework for processing large amounts of data. It typically sits on top of a distributed file system such as HDFS. While its core functionality is Big Data processing on computer clusters, Spark increasingly offers other functionality, such as the MLlib library, which allows machine learning models to be trained in Spark. Spark has become so important to Big Data processing that an ecosystem of related tools has grown up around it. For example, Spark Streaming extends Spark so that it can process streaming data.

Spark has become one of the most important Big Data tools. It enables the processing, and increasingly the analysis, of Big Data on computer clusters, including the implementation of machine learning algorithms. A major advantage of Spark is that it offers programming interfaces for several languages, e.g. Python, R, Scala and Java. To put this in less technical terms, there is no need to learn another language: you can simply write Spark commands in, say, Python. Nevertheless, using Spark well requires a solid knowledge of how distributed systems work.
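To make the point about writing Spark commands in Python more concrete, here is a minimal sketch of a word count using PySpark, Spark's Python interface. It assumes the `pyspark` package is installed and runs Spark in local mode rather than on a real cluster. The function names, the application name and the sample data are hypothetical and chosen only for illustration. A plain-Python version is included for comparison, since it should produce the same counts on a single machine.

```python
from collections import Counter


def split_words(line):
    """Lowercase a line of text and split it on whitespace."""
    return line.lower().split()


def count_words_locally(lines):
    """Single-machine reference: count words with plain Python."""
    counts = Counter()
    for line in lines:
        counts.update(split_words(line))
    return dict(counts)


def count_words_with_spark(lines):
    """Same computation expressed with Spark's Python API.

    Assumes pyspark is installed. The import sits inside the function
    so that the plain-Python helper above works without Spark.
    """
    from pyspark.sql import SparkSession

    # "local[*]" runs Spark on this machine using all available cores;
    # on a cluster the master would typically come from the deployment.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("word-count-sketch")  # hypothetical application name
        .getOrCreate()
    )
    try:
        # Distribute the lines across partitions, then transform them.
        rdd = spark.sparkContext.parallelize(lines)
        counts = (
            rdd.flatMap(split_words)           # one record per word
               .map(lambda word: (word, 1))    # pair each word with a count of 1
               .reduceByKey(lambda a, b: a + b)  # sum the counts per word
        )
        # collect() brings the results back to the driver program.
        return dict(counts.collect())
    finally:
        spark.stop()
```

Calling `count_words_with_spark(["Spark is fast", "spark scales"])` should give the same dictionary as `count_words_locally` on the same input. The difference is that the Spark version splits the work across partitions, which is where the knowledge of distributed systems mentioned above starts to matter: `reduceByKey`, for instance, moves data between partitions, which becomes costly on large data sets.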

You can use Spark in almost any Big Data use case for data preparation, processing and, increasingly, machine learning. 
