What is MLlib?

MLlib is Apache Spark's machine learning library. It is designed for scalability and simplicity and can be combined with other tools. Spark has created a library that is fast, scalable and compatible with several languages. This lets data scientists focus on modelling and solving data problems instead of on the complexities of distributed data, such as configuration and infrastructure.

This scalable machine learning library consists of common learning algorithms, including clustering, classification, regression, collaborative filtering and dimensionality reduction, as well as several underlying optimisation primitives. MLlib integrates seamlessly with other Spark components, including DataFrames, Spark SQL and Spark Streaming, and ships with the Databricks Runtime.

You can use the library from Python, Java and Scala, all of which work in Spark applications. In Python, it interoperates with NumPy (since Spark 0.9) and with R libraries (since Spark 1.5), which makes it easy to slot into complete data science workflows. You can use any Hadoop data source, such as local files, HBase and HDFS, and integrate MLlib into Hadoop workflows. It also runs on Apache Mesos, on Kubernetes, in the cloud or standalone.

MLlib also supports data manipulation, preprocessing, model training and predictions at scale. With Structured Streaming, you can use models trained with the library to make predictions on streaming data. In addition, it offers a uniform machine learning API across tasks such as clustering, regression and deep learning.

Other features of MLlib


The library's high-quality algorithms can run up to 100 times faster than the well-known MapReduce when working in memory, and about 10 times faster than MapReduce on disk. They exploit iteration, which lets them achieve better results than the one-pass approximations sometimes used with MapReduce.

Workflow utilities

Some of the workflow utilities include the following: 

  • Feature transformations such as hashing, standardisation and normalisation 
  • Hyperparameter tuning and model evaluation 
  • ML persistence, which includes loading and storing pipelines and models 
  • Distributed linear algebra such as PCA and SVD 
  • Statistical tasks such as hypothesis testing and summary statistics 


Some of the algorithms include: 

  • Frequent itemsets, association rules and sequential pattern mining 
  • Clustering algorithms such as Gaussian mixture models and K-means 
  • LDA (Latent Dirichlet Allocation) for topic modelling 
  • Gradient-boosted trees, decision trees and random forests 
  • Recommendations using ALS (Alternating Least Squares) 
  • Regression algorithms such as survival regression and generalised linear regression 

This library is developed and maintained as part of the Apache Spark project. It is tested and updated with each Spark release. 
