Pandas (Python)

What is Pandas?

Pandas is a widespread Open source library for data analysis and manipulation in the programming language Python. Due to the fast, flexible and meaningful data structures and efficient data analysis tools, it is often used in the areas of Data Science, machine learning and Deep Learning used.

It provides a fast and expressive way to manipulate and analyse structured data and is easy to learn for anyone familiar with Python programming. The integration with other libraries such as NumPy, Matplotlib and Seaborn also makes it a complete solution for the Data analysis and visualisation in Python.

The name "Pandas" is derived from the term "Panel Data", which refers to multi-dimensional data structures often used in econometrics. The library provides two primary data structures: Series (1-dimensional) and DataFrame (2-dimensional), both of which allow the display and manipulation of labelled data.

Examples of the use of Pandas (Python)

Data cleansing

Pandas offers various functions for cleaning and pre-processing data, such as handling missing values, converting data types, removing duplicates and handling outliers.

import pandas as pd
# Load a sample dataset with missing values
df = pd.read_csv("data.csv")
# Handle missing values by replacing them with the mean of the column
df.fillna(df.mean(), inplace=True)
# Convert column data type from string to integer
df['column_name'] = df['column_name'].astype(int)
# Remove duplicates from the DataFrame
df.drop_duplicates(inplace=True)
# Handle outliers by replacing them with the 95th percentile of the column
upper_bound = df['column_name'].quantile(0.95)
df['column_name'] = df['column_name'].apply(lambda x: upper_bound if x > upper_bound else x)

Data aggregation

It is possible to perform various operations to aggregate and summarise data, such as groupby, pivot tables and resampling. These operations can be helpful in transforming raw data into useful insights.

import pandas as pd
# Load a sample dataset
df = pd.read_csv("data.csv")
# Group the data by a categorical column and calculate the mean of a numeric column
grouped = df.groupby('column_name')['numeric_column'].mean()
# Create a pivot table to aggregate the data
pivot = df.pivot_table(index='column_name', values='numeric_column', aggfunc='mean')
# Resample the data to aggregate it by week
resampled = df.resample('W', on='date_column').sum()

Data visualisation

Pandas integrates well with popular data visualisation libraries such as Matplotlib, Seaborn and Plotly. This makes it easy to create bar charts, histograms, scatter plots and more to visualise and communicate insights from your data.

import pandas as pd
import matplotlib.pyplot as plt
# Load a sample dataset
df = pd.read_csv("data.csv")
# Create a bar plot to visualize the distribution of a categorical column
df['column_name'].value_counts().plot(kind='bar')
plt.show()
# Create a histogram to visualize the distribution of a numeric column
df['numeric_column'].plot(kind='hist')
plt.show()
# Create a scatter plot to visualize the relationship between two numeric columns
df.plot(x='numeric_column_1', y='numeric_column_2', kind='scatter')
plt.show()

Pandas vs. NumPy

Pandas differs in some respects from NumPy, another popular library for numerical calculations in Python. While NumPy provides basic numerical operations, Pandas offers more advanced data analysis and manipulation capabilities. NumPy works mainly with arrays, while Pandas works with rows and data frames that are labelled and allow mixed data types. Also, unlike NumPy, Pandas offers built-in handling of missing values.

Pandas vs. SQL

A significant difference between Pandas and SQL is that Pandas is a library for in-memory data processing, while SQL is a language for accessing and manipulating data stored in databases. SQL is better suited for working with large, persistently stored data sets, while Pandas is more flexible for fast and efficient data manipulation, exploration and analysis.

Application Programming Interface (API)

What is an application programming interface (API)?

API stands for Application Programming Interface and refers to a programming interface that enables the Communication between different applications enables. External programmes can gain access to certain components of a software via an API and transfer data.

Unlike with a binary interface, the programme connection takes place at source code level takes place. Operations are carried out via standard commands so that compatibility with different programming languages is guaranteed. Among other things, an API can be based on Databaseshard disks, graphics cards and user interfaces.

The advantage of a programming interface is the Easy integration of new application components into an existing system. In addition, APIs are usually documented in detail with their associated parameters.

How does an API work?

Programming interfaces (API) are used in particular by developers to allow their programmes to dock with another. A programming interface specifies how data can be received and sent. The commands and data types that an API accepts are defined in protocols. They are used by the corresponding components for uniform communication.

A basic distinction is made between Internal/private APIs and external/open APIs. Private programming interfaces can only be used by programmers within an organisation. This optimises work on internal company processes. In addition, they are protected from unauthorised access by certain security measures. External APIs are available to the public in directories for integration into other systems. However, sometimes the use of an API is restricted or subject to a fee.

Application areas for programming interfaces

APIs can be used to connect a wide range of processes:

Weather forecast

Global weather data from a wide range of international sources are retrieved via programming interfaces and can be displayed to the user via app on the smartphone.

Appointment booking

Service providers can use APIs to enable their customers to make bookings on online portals or search for specific services. These can be, for example, appointment information at doctors' surgeries or the comparison of flight prices. The website connects to the programming interfaces of the respective service providers and generates an overview with the most suitable options.

E-commerce

Retailers use APIs to control the inventory of their products and provide customers with information about availability.

What is the difference between API and REST API?

REST is an abbreviation for Representational State Transfer and refers to a software architecture, which is guided by the principles and behaviour of the World Wide Web. A REST API is a specific form of an APIused for data transfer on distributed systems. Compared to a general API, the REST architecture has the following features six design principleswhich must be adhered to by developers:

Uniform programming interface

The resources are accessible via a specific Uniform Resource Identifier (URI). Different operations can be performed using HTTP methods via the same URI. Suitable formats for resources are, for example, JSON, XML or text.

Independence of client and server

Client and server applications must be decoupled from each other. The client should need nothing more than the URI of the respective resource.

Cache

To increase the scalability of the server and improve the performance of the client, resources can be stored in the cache.

Statelessness

Rest APIs do not require information about sessions. If the server requires data about the client session, this is sent via a separate request.

Multi-layer system architecture

Between the client and the server, there may be a number of other applications that communicate with each other. The client cannot see through how many servers the response was transmitted.

Code on demand (optional)

In most cases, static resources are transferred via REST APIs. Sometimes, however, it can also be executable code such as Java applets. This should only be executed on demand.

First order logic

What is first-level predicate logic?

First-order logic (FOL) is a method based on mathematics for assigning unique properties to an object. Here, each sentence/statement is decomposed into its subject and its predicate. The relationship between them is done in first-level predicate logic by P(x), where P stands for predicate and the variable x for the corresponding subject.

It should be noted that the Predicates in First-Order Logic refer to only one subject at a time.. Unlike in linguistics, a predicate is not necessarily a verb, but merely provides relevant information about the subject in question. The use of the Predicates also allow relations to be established; for example, through comparisons (greater/smaller than, equal to, etc.).

In the first-level predicate logic, the Quantifiers and represented by the symbols ∀ (universal quantifier; read: "for all") and ∃ (existential quantifier; read: "it exists" or "for some"). The representation is done in First-Order Logic by mathematical symbols and consists of:

  • Terms: Human, animal, plant etc.
  • Names of objects. In the linguistic sense, these can be both objects and subjects!
  • Variables a, b, c, ..., x, y, z etc.

These stand for objects that are not yet known.

Predicates [red, fragrant, is a flower etc.] stand for properties and relations that are linguistically comparable to verbs or attributes.

Quantifiers [∀, ∃] allow statements about sets of objects for which the predicate applies.

Relations [∧ (and), ∨ (or), →(implies), ⇒ (follows from), ⇔ (is equivalent to), == (equality - operator)] give conclusions about relations.

Example of first level predicate logic

The rose is red.

P(x) = red(rose)

The rose is fragrant.

P(x) = fragrant(rose)

The rose is a flower.

P(x) = Flower(Rose)

We learn about the rose that it red is, Smells and a Flower is.

This results in ∀:

All Roses are red.

All Roses fragrant.

All Roses are Flowers.

However, not all roses are red and not every rose is fragrant.

That all roses are flowers, on the other hand, is a true statement.

∀(x) Rose(x) → Flower(x)

In order that the other two statements can be checked for their correctness, existential quantifiers are now used.

From the two statements:

"All the roses are red." and "All the roses are fragrant." are made by using ∃:

"Some roses are red." and "Some roses are fragrant."

To translate it into a first-order formula, we need to define a variable x:

A predicate A(x), where x the Rose and a predicate G(x), which corresponds for x is, red resp. Smells.

∃(x) Rose(x) → red(x)

resp.

∃(x) Rose(x) → fragrant(x)

This tells you that there are roses that are red are and roses exist that fragrant. It follows logically that there must also be roses that are not red or that are not fragrant.

Pathfinding

What is Pathfinding?

In computer science, pathfinding is understood to be algorithms with which the optimal path between two or more points is to be found. The optimal path can be defined on the basis of different parameters.

The optimal path always depends on the respective application. In addition to the shortest path, for example, the most cost-effective path can also be defined as the optimum. For example, other constraints such as avoiding certain waypoints or route sections can also influence the determination of the optimal path.

This behaviour is generally known in route planners when motorways or toll roads are to be avoided.

Algorithms

Depending on the requirements of the objective, various algorithms can be used within the framework of pathfinding.

3. A* algorithm This is what is known as an informed Search algorithmwhich determines the shortest graph between two points using an estimation function/heuristic. The search examines (from the starting point) the next node that is likely to lead quickly to the destination or reduce the distance to the destination node. If an examination of a node does not lead to the goal, it is marked as such and the search is continued with another node. According to this algorithm, the shortest path to the destination node is examined.

Another algorithm for determining the shortest path is the Dijkstra algorithm. This is not based on Heuristicsbut on the basis of the procedure that, starting from the starting node, all nodes with the shortest partial routes lead to each other to the optimal solution, in that the sum of the shortest partial routes leads to the overall shortest total path. Thus, the procedure works in the form of the most promising partial solution.

In contrast to the procedures described so far, the Bellman-Ford algorithm (or Moore-Bellman-Ford algorithm) also the consideration of graphs with negative edge weights to determine the shortest paths. This means that the costs (e.g. time) between two nodes can also be negative. However, it must be ensured that cycles through negative weights are excluded, as otherwise the path is reduced by repeatedly traversing the negative edge weights. All the approaches considered so far optimise the path as seen from a particular node.

The Min-plus matrix multiplication algorithm on the other hand, searches for the optimum of all node pairs in relation to each other.

The same applies to the Floyd-Warshall algorithm. The method makes use of the dynamic programming approach. In order to find the optimum, the overall problem is divided into similar subproblems and then, by solving and storing them, the overall problem is optimised. The operation is split into two, with Floyd's part ensuring that the shortest distances between nodes are calculated, while Warshall's part is responsible for constructing the shortest paths. The consideration of negative edge weights is also possible with this algorithm.

While some methods have the optimisation between two nodes as their objective, others optimise all nodes in relation to each other. This naturally leads to an increase in computing power. Therefore In addition to the consideration of the objective in the choice of the algorithm, the demand on resources is also a decisive factor.. In addition to the computing power, the required storage space and the runtime can also be relevant variables when choosing a method.

For some of the described methods, there are prefabricated algorithms that can be implemented in own solutions. For example, the library NetworkX can be implemented in Python and used as a framework for pathfinding problems.

Examples for the application of pathfinding in practice

The application possibilities of pathfinding are manifold. They range from simple as well as complex Controls and route planning in the computer game sector up to the solution of transport logistics problems and the optimisation of routing problems in the network sector.. To support the optimisation solutions, sub-areas of the Artificial intelligence be implemented.

As mentioned at the beginning, the optimum to be achieved can be defined individually. The limitation of the cost size can be represented by minimising the factor of time, money, intermediate stops and many other parameters.

PyTorch

PyTorch is a Open source framework for Machine Learning (machine learning) and is based on the programming language Python and the Torch library. It was developed in 2016 by a team of researchers for artificial intelligence by Facebook to improve the efficiency of developing and deploying research prototypes. PyTorch computes with tensors, which are accelerated by graphics processors (GPU for short). Over 200 different mathematical operations can be used with the framework.

Today, PyTorch is one of the most popular platforms for research in the field of Deep Learning and is mainly used for artificial intelligence (AI), data science and research. PyTorch is becoming increasingly popular because it makes it comparatively easy to create models for artificial neural networks (KNN) have created. PyTorch can also be used for reinforcement learning. It can be downloaded free of charge as open source from GitHub.

What is PyTorch Lightning?

PyTorch Lightning is a Open source library for Python and provides a high-level interface for PyTorch. The focus is on flexibility and performance to enable researchers, data scientists and machine learning engineers to create suitable and, most importantly, scalable ML systems. PyTorch Lightning is also available as open source for download from GitHub.

What are the features and benefits of PyTorch?

Dynamic graph calculation

The network behaviour can be changed spontaneously and the complete code does not have to be executed for this.

Automatic differentiation

Using backward sweeps in neural networks, the derivative of a function is calculated numerically.

User-friendly interface

It is called TorchScript and makes seamless switching between modes possible. It offers functionality, speed, flexibility and ease of use.

Python support

Since PyTorch is based on Python, it is easy to learn and programme and all libraries compatible with Python, such as NumPy or SciPy, can be used. Furthermore, uncomplicated debugging with Python tools is possible.

Scalability

It takes place on important Cloud platforms a good support and is therefore easy to scale.

Dataset and DataLoader

It is possible to create your own dataset for PyTorch to store all the necessary data. The dataset is managed by means of DataLoader. Among other things, the DataLoader can run through the data, manage batches and transform data.

In addition, PyTorch can export learning models in the Open Neural Network Exchange (ONNX) standard format and has a C++ front-end interface option.

What are examples of the use of PyTorch?

  • Object detection
  • Segmentation (semantic segmentation)
  • LSTM (Long Short-Term Memory)
  • Transformer

PyTorch vs. Tensorflow

Tensorflow is also a deep learning framework and was developed by Google. It has been around longer than PyTorch and therefore has a larger developer community and more documentation. Both frameworks have their advantages and disadvantages, as they are intended for different projects.

While Tensorflow defines the computational graphs in a static way, PyTorch takes a dynamic approach. Also, the dynamic graphs can be manipulated in real time with PyTorch and only at the end with Tensorflow. Therefore, PyTorch is particularly suitable for uncomplicated prototyping and research work due to its simple and easy handling. Tensorflow, on the other hand, is particularly suitable for projects that require scalable production models.

PyTorch vs. scikit-learn

Scikit-learn (also called Sklearn) is a free library for Python and specialises in machine learning. It offers a range of Classification-, Regression- and Clustering algorithms, such as Random Forest, Support vector machines or k-means. Scikit-learn provides an efficient and straightforward Data analysis and is particularly suitable for defining algorithms, but is rather unsuitable for end-to-end training of deep neural networks, for which, on the other hand, PyTorch can be used very well.