What is Pandas?

Pandas is a widely used open-source library for data analysis and manipulation in the Python programming language. Because of its fast, flexible and expressive data structures and its efficient analysis tools, it is frequently used in data science, machine learning and deep learning.

It provides a fast and expressive way to manipulate and analyse structured data and is easy to learn for anyone familiar with Python programming. Its integration with other libraries such as NumPy, Matplotlib and Seaborn also makes it a central part of a complete data analysis and visualisation workflow in Python.

The name "Pandas" is derived from the term "Panel Data", an econometrics term for multi-dimensional data sets that contain observations of the same subjects over several time periods. The library provides two primary data structures: Series (1-dimensional) and DataFrame (2-dimensional), both of which hold labelled data that can be inspected and manipulated.
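
As a minimal sketch of the two data structures, the following example builds a Series and a DataFrame by hand and selects values by label. The variable names and the weather values are made up for illustration only.

```python
import pandas as pd

# A Series is a one-dimensional labelled array (illustrative values).
temperatures = pd.Series(
    [21.5, 19.0, 23.1], index=["Mon", "Tue", "Wed"], name="temp_c"
)

# A DataFrame is a two-dimensional table; its columns share one index
# and may have different data types.
weather = pd.DataFrame(
    {"temp_c": [21.5, 19.0, 23.1], "rain": [False, True, False]},
    index=["Mon", "Tue", "Wed"],
)

# Data is selected by label rather than by position.
tuesday_temp = temperatures["Tue"]   # 19.0
wednesday_row = weather.loc["Wed"]   # one row, returned as a Series

print(tuesday_temp)
print(wednesday_row)
```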

Examples of the use of Pandas (Python)

Data cleansing

Pandas offers various functions for cleaning and pre-processing data, such as handling missing values, converting data types, removing duplicates and handling outliers.

import pandas as pd
# Load a sample dataset with missing values
df = pd.read_csv("data.csv")
# Handle missing values in numeric columns by replacing them with the column mean
df = df.fillna(df.mean(numeric_only=True))
# Convert column data type from string to integer
# (this fails if the column still contains missing values)
df['column_name'] = df['column_name'].astype(int)
# Remove duplicates from the DataFrame
df = df.drop_duplicates()
# Handle outliers by capping values at the 95th percentile of the column
upper_bound = df['column_name'].quantile(0.95)
df['column_name'] = df['column_name'].clip(upper=upper_bound)

Data aggregation

It is possible to perform various operations to aggregate and summarise data, such as groupby, pivot tables and resampling. These operations can be helpful in transforming raw data into useful insights.

import pandas as pd
# Load a sample dataset and parse the date column as datetimes
df = pd.read_csv("data.csv", parse_dates=['date_column'])
# Group the data by a categorical column and calculate the mean of a numeric column
grouped = df.groupby('column_name')['numeric_column'].mean()
# Create a pivot table to aggregate the data
pivot = df.pivot_table(index='column_name', values='numeric_column', aggfunc='mean')
# Resample the data to aggregate the numeric column by week
resampled = df.resample('W', on='date_column')['numeric_column'].sum()
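
To make the effect of these operations easier to see, here is a small self-contained sketch that does not depend on a CSV file. The `sales` table, its column names and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical sales data: two regions, four days spread over two weeks.
sales = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-03", "2024-01-08", "2024-01-10"]
    ),
    "region": ["north", "south", "north", "south"],
    "revenue": [100.0, 200.0, 300.0, 400.0],
})

# groupby: mean revenue per region (north 200.0, south 300.0)
mean_by_region = sales.groupby("region")["revenue"].mean()

# pivot_table: total revenue per region (north 400.0, south 600.0)
pivot = sales.pivot_table(index="region", values="revenue", aggfunc="sum")

# resample: total revenue per calendar week; "W" bins end on Sunday,
# so the result has one row for 2024-01-07 and one for 2024-01-14.
weekly = sales.resample("W", on="date")["revenue"].sum()

print(mean_by_region)
print(pivot)
print(weekly)
```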

Data visualisation

Pandas integrates well with popular data visualisation libraries such as Matplotlib, Seaborn and Plotly. This makes it easy to create bar charts, histograms, scatter plots and more to visualise and communicate insights from your data.

import pandas as pd
import matplotlib.pyplot as plt
# Load a sample dataset
df = pd.read_csv("data.csv")
# Create a bar plot to visualize the distribution of a categorical column
df['column_name'].value_counts().plot(kind='bar')
plt.show()
# Create a histogram to visualize the distribution of a numeric column
df['numeric_column'].plot(kind='hist')
plt.show()
# Create a scatter plot to visualize the relationship between two numeric columns
df.plot(x='numeric_column_1', y='numeric_column_2', kind='scatter')
plt.show()

Pandas vs. NumPy

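Pandas differs in some respects from NumPy, another popular library for numerical calculations in Python. While NumPy provides basic numerical operations, Pandas offers more advanced data analysis and manipulation capabilities. NumPy works mainly with arrays that are addressed by position and usually hold a single data type, while Pandas works with Series and DataFrames that are labelled and allow mixed data types. Pandas also provides built-in methods for detecting and handling missing values, such as isna(), fillna() and dropna(), whereas in NumPy missing values have to be managed by hand, for example with NaN and NaN-aware functions.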
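
The difference can be sketched with a short comparison. The names and heights below are invented, and the example only illustrates the default behaviour of the two libraries with respect to labels and missing values.

```python
import numpy as np
import pandas as pd

# NumPy: a homogeneous array addressed by position.
heights = np.array([1.70, np.nan, 1.82])
numpy_mean = heights.mean()          # NaN propagates into the result
numpy_nanmean = np.nanmean(heights)  # NaN has to be skipped explicitly

# Pandas: labelled data with columns of different types.
people = pd.DataFrame(
    {"name": ["Ada", "Ben", "Cy"], "height_m": [1.70, np.nan, 1.82]}
).set_index("name")

pandas_mean = people["height_m"].mean()        # skips NaN by default
missing_count = people["height_m"].isna().sum()
ada_height = people.loc["Ada", "height_m"]     # lookup by label

print(numpy_mean, numpy_nanmean, pandas_mean, missing_count, ada_height)
```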

Pandas vs. SQL

A significant difference between Pandas and SQL is that Pandas is a library for in-memory data processing, while SQL is a language for accessing and manipulating data stored in databases. SQL is better suited for working with large, persistently stored data sets, while Pandas is more flexible for interactive manipulation, exploration and analysis of data that fits into memory.
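
The two approaches can also be combined. The following sketch, which assumes a throwaway in-memory SQLite database with a made-up `orders` table, computes the same totals once with a SQL query inside the database and once with Pandas after loading the rows into memory.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database, used for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("bob", 25.0), ("alice", 5.0)],
)
conn.commit()

# SQL: the aggregation runs inside the database.
sql_totals = pd.read_sql_query(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY customer",
    conn,
)

# Pandas: all rows are loaded into memory first, then aggregated.
orders = pd.read_sql_query("SELECT * FROM orders", conn)
pandas_totals = (
    orders.groupby("customer", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "total"})
)
conn.close()

print(sql_totals)
print(pandas_totals)
```

For a table that is too large to fit into memory, the first variant is usually the safer choice, because only the aggregated result is transferred to Python.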