What is MLlib?

MLlib is Apache Spark’s machine learning library. It is designed for scalability, simplicity and integration with other tools. The result is a library that is fast, scalable and usable from several languages. This lets data scientists focus on modelling and solving data problems instead of on the complexities that surround distributed data, such as configuration and infrastructure.

This scalable machine learning library consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as the optimization primitives that underlie them. MLlib integrates seamlessly with other Spark components such as DataFrames, Spark SQL and Spark Streaming. The library is also available in Databricks Runtime.

You can use the library from Python, Java and Scala in Spark applications, which makes it possible to include it in complete workflows. It interoperates with NumPy in Python (as of Spark 0.9) and with R libraries (as of Spark 1.5). You can use any Hadoop data source, such as HDFS, HBase or local files, which makes it easy to plug into Hadoop workflows. It also runs on Apache Mesos, on Kubernetes, standalone or in the cloud.

MLlib also supports data munging, preprocessing, model training and making predictions at scale. Models trained with the library can also be used to make predictions in Structured Streaming. In addition, Spark offers a versatile machine learning API for a range of tasks, from clustering and regression to deep learning.
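
The munging, preprocessing, training and prediction steps can be pictured with a small plain-Python sketch. This is only an illustration of the flow on a single machine; every function and field name below is hypothetical and none of it is the library's own API.

```python
# Plain-Python sketch of the munge -> preprocess -> train -> predict flow.
# All names are hypothetical; this is not MLlib code.

def munge(rows):
    """Drop records with a missing feature or label."""
    return [r for r in rows if r["hours"] is not None and r["passed"] is not None]

def fit_scaler(rows):
    """Learn the min and max of the feature so it can be rescaled to [0, 1]."""
    values = [r["hours"] for r in rows]
    return min(values), max(values)

def scale(value, scaler):
    """Apply the learned min-max rescaling to one value."""
    low, high = scaler
    return (value - low) / (high - low)

def train_threshold(rows, scaler):
    """Train a one-feature classifier: the midpoint between the class means."""
    pos = [scale(r["hours"], scaler) for r in rows if r["passed"]]
    neg = [scale(r["hours"], scaler) for r in rows if not r["passed"]]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict(hours, scaler, threshold):
    """Predict the label for a new, unscaled feature value."""
    return scale(hours, scaler) >= threshold

raw = [
    {"hours": 1.0, "passed": False},
    {"hours": 2.0, "passed": False},
    {"hours": None, "passed": True},   # dropped during munging
    {"hours": 8.0, "passed": True},
    {"hours": 9.0, "passed": True},
]
clean = munge(raw)
scaler = fit_scaler(clean)
threshold = train_threshold(clean, scaler)
```

Note that the scaler learned during training is reused at prediction time, so new data goes through the same preprocessing as the training data.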

Other Features


The library comes with high-quality algorithms that can run up to 100x faster than MapReduce in memory and up to 10x faster on disk. They leverage iteration and can yield better results than the one-pass approximations sometimes used on MapReduce.
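
To see why iteration matters, consider a minimal plain-Python sketch of batch gradient descent on a one-parameter least-squares fit. It is not MLlib code, and the data, learning rate and function names are assumptions chosen for illustration. A single pass over the data leaves the estimate well short of the true slope, while repeated passes converge to it.

```python
# Plain-Python illustration (not MLlib code): batch gradient descent
# for a one-parameter least-squares fit y ~ w * x.

def fit_slope(xs, ys, passes, learning_rate=0.05):
    """Return the slope w after `passes` full passes over the data."""
    w = 0.0
    n = len(xs)
    for _ in range(passes):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w

def mean_squared_error(w, xs, ys):
    """Average squared residual of the fit y ~ w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]   # generated by y = 3 * x

w_one_pass = fit_slope(xs, ys, passes=1)
w_many_passes = fit_slope(xs, ys, passes=50)
```

On this toy data each pass shrinks the remaining error in the slope by a constant factor, which is the kind of repeated refinement that benefits from keeping the data in memory between passes.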

Workflow Utilities

Some of the workflow utilities include the following:

• Feature transformations that include hashing, standardization and normalization
• Hyper-parameter tuning and model evaluation
• ML persistence that includes loading and saving pipelines and models.
• Distributed linear algebra such as PCA and SVD
• Statistics tasks such as hypothesis testing and summary statistics
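
The three feature transformations named above can be sketched in plain Python as follows. This is a single-machine illustration under stated assumptions, not the library's implementation: the CRC32 hash function, the use of the sample standard deviation and the helper names are all choices made for this example.

```python
import math
import zlib

# Plain-Python sketches of three feature transformations (not MLlib code).

def hash_features(tokens, num_buckets):
    """Hashing trick: map tokens to a fixed-length count vector."""
    vector = [0.0] * num_buckets
    for token in tokens:
        # CRC32 is an assumption here; any stable hash function would do.
        index = zlib.crc32(token.encode("utf-8")) % num_buckets
        vector[index] += 1.0
    return vector

def standardize(values):
    """Rescale values to zero mean and unit sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

hashed = hash_features(["spark", "mllib", "spark"], num_buckets=8)
standardized = standardize([1.0, 2.0, 3.0])
normalized = l2_normalize([3.0, 4.0])
```

Hashing fixes the vector length in advance without needing a vocabulary, at the cost of possible collisions between tokens that land in the same bucket.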


Some of the algorithms include:

• Frequent itemsets, association rules and sequential pattern mining
• Clustering, such as K-means and Gaussian mixtures
• Topic modelling with LDA (latent Dirichlet allocation)
• Gradient-boosted trees, decision trees and random forests
• Recommendations using ALS (Alternating Least Squares)
• Regression, such as generalized linear regression and survival regression
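
As a small illustration of the first item in this list, the sketch below counts frequent itemsets and computes the confidence of an association rule in plain Python. It is a brute-force count suitable only for tiny examples, not the algorithm a distributed library would use, and the transactions and function names are made up for this example.

```python
from itertools import combinations

# Brute-force frequent itemsets and rule confidence (not MLlib code).

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for itemsets that meet min_support.

    Enumerates every subset of every transaction, so the cost grows
    exponentially with transaction size.
    """
    counts = {}
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                counts[subset] = counts.get(subset, 0) + 1
    total = len(transactions)
    return {s: c / total for s, c in counts.items() if c / total >= min_support}

def rule_confidence(itemsets, antecedent, consequent):
    """confidence(A -> B) = support(A and B together) / support(A)."""
    both = tuple(sorted(set(antecedent) | set(consequent)))
    return itemsets[both] / itemsets[tuple(sorted(antecedent))]

transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
itemsets = frequent_itemsets(transactions, min_support=0.5)
butter_implies_bread = rule_confidence(itemsets, ["butter"], ["bread"])
```

Here every transaction that contains butter also contains bread, so the rule from butter to bread has the highest possible confidence, while the rule from bread to milk holds in only two of the three bread transactions.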

This library is developed and maintained as part of the Apache Spark project. It is tested and updated with each Spark release.