Plotting in with Pandas and Matplotlib

Plotting in with Pandas and Matplotlib#

In this tutorial, we’ll swiftly review the creation of various charts covered in our course lectures, including boxplots, histogram charts, barcharts, and more. While it’s possible to generate simple plots directly from pandas (you can practice for yourself), for finer control over multiple aspects of these plots, we’ll explore the utilization of the Matplotlib plotting module. Matplotlib stands out for its remarkable power and flexibility, as we’ll demonstrate throughout this tutorial.

Python has many nice, useful libraries that can be used for plotting. some of the most popular are as follows:

matplotlib: matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in python scripts, the python and ipython shell, web application servers, and six graphical user interface toolkits.
seaborn: Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
plotly: Plotly’s Python graphing library makes interactive, publication-quality graphs online. Examples of how to make line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, polar charts, and bubble charts.
bokeh: Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, and to extend this capability with high-performance interactivity over very large or streaming datasets.
ggplot: ggplot is a plotting system for Python based on R’s ggplot2 and the Grammar of Graphics. It is built for making profressional looking, plots quickly with minimal code.
holoviews: HoloViews is an open-source Python library designed to make data analysis and visualization seamless and simple. With HoloViews, you can usually express what you want to do in very few lines of code, letting you focus on what you are trying to explore and convey, not on the process of plotting.

Note

Explore the galleries and examples of different visualization libraries above to learn what’s possible to do in Python.

Anatomy of a Plot#

Before visualizing our data on a plot, let’s understand the components of a plot. We won’t delve into the details of different plot types in this tutorial but provide a brief introduction to various plots that can be created using Python and the essential elements of a plot.

Here is a list of several types of plots used to represent different kinds of data:

Despite the variety, most plots share common elements. Understanding basic terminology helps when creating or modifying plots. The figure below illustrates elements of a basic line plot.

Basic elements of a plot. (source: https://geo-python-site.readthedocs.io/en/latest/lessons/L7/plot-anatomy.html)

Common Plotting Terms#

These terms may vary slightly depending on the plotting library, and for this list, we use typical Matplotlib terms.

Term	Description
Axis	Graph axes, typically x, y, and z for 3D plots.
Title	Title of the plot.
Label	Name for the axis (e.g., xlabel or ylabel).
Legend	Legend for the plot.
Tick Label	Text or values represented on the axis.
Symbol	Symbol for data point(s) on a scatter plot, presented with different shapes/colors.
Size	Size of a point on a scatter plot, also used for text sizes on a plot.
Linestyle	The style of how the line should be drawn, solid or dashed, for example.
Linewidth	The width of a line on a plot.
Alpha	Transparency level of a filled element in a plot (values between 0.0 and 1.0).
Tick(s)	Refers to the tick marks on a plot.
Annotation	Text added to a plot.
Padding	The distance between an (axis/tick) label and the axis.

import warnings
warnings.filterwarnings("ignore")

Getting started#

Let’s start by importing pandas and matplotlib

# import required libraries

import pandas as pd
import matplotlib.pyplot as plt

Input data: Community Crime Statistics Map#

Our input data in this tutorial is a text file containing Building Permits by Community map in city of Calgary, Alberta, Canda retrieved from City of Calgary Open Data Portal:

File name: [Building_Permits_20240122.csv]
You can download the data from the link provided: City of Calgary Open Data Portal
Data is provides Building permit applications made to The City of Calgary’s Planning & Development department in 2023.
There are totally 22,073 rows and 30 columns in this dataset.

data = pd.read_csv('Building_Permits_20240122.csv')

data.head()

	PermitNum	StatusCurrent	AppliedDate	IssuedDate	CompletedDate	PermitType	PermitTypeMapped	PermitClass	PermitClassGroup	PermitClassMapped	...	CommunityCode	CommunityName	Latitude	Longitude	LocationCount	LocationTypes	LocationAddresses	LocationsWKT	LocationsGeoJSON	Point
0	BP2023-00001	Cancelled	2023/01/01	NaN	2023/08/04	Commercial / Multi Family Project	Building	3106 - Retail Shop	Commercial	Non-Residential	...	SNA	SUNALTA	51.037938	-114.095388	2.0	Titled Parcel;Building	1438 17 AV SW;1438 17 AV SW	MULTIPOINT (-114.09538783812974 51.03793752506...	{"type":"MultiPoint","coordinates":[[-114.0953...	POINT (-114.09538783812974 51.03793752506271)
1	BP2023-00002	Completed	2023/01/01	2023/01/24	2023/06/22	Residential Improvement Project	Building	1301 - Private Detached Garage	Garage	Residential	...	WWO	WOLF WILLOW	50.874928	-114.009338	2.0	Titled Parcel;Building	223 WOLF CREEK AV SE;223 WOLF CREEK AV SE	MULTIPOINT (-114.00933819969845 50.87492753090...	{"type":"MultiPoint","coordinates":[[-114.0093...	POINT (-114.00933819969845 50.874927530904024)
2	BP2023-00016	Completed	2023/01/02	2023/01/02	2023/05/12	Demolition	Demolition	1106 - House	Single Family	Residential	...	BRD	BRIDGELAND/RIVERSIDE	51.053518	-114.039212	1.0	Titled Parcel	205 9A ST NE	POINT (-114.03921222082343 51.05351846849098)	{"type":"Point","coordinates":[-114.0392122,51...	POINT (-114.03921222082343 51.05351846849098)
3	BP2023-00015	Completed	2023/01/02	2023/01/03	2023/11/16	Residential Improvement Project	Building	1301 - Private Detached Garage	Garage	Residential	...	DAL	DALHOUSIE	51.108398	-114.147277	2.0	Titled Parcel;Building	6204 DALBEATTIE HL NW;6204 DALBEATTIE HL NW	MULTIPOINT (-114.1472766668805 51.108397819725...	{"type":"MultiPoint","coordinates":[[-114.1472...	POINT (-114.1472766668805 51.10839781972519)
4	BP2023-00010	Issued Permit	2023/01/02	2023/01/03	NaN	Residential Improvement Project	Building	1101 - Basement Development	Single Family	Residential	...	SHN	SHAWNESSY	50.908529	-114.095202	2.0	Titled Parcel;Building	72 SHAWBROOKE MR SW;72 SHAWBROOKE MR SW	MULTIPOINT (-114.09520199036152 50.90852910976...	{"type":"MultiPoint","coordinates":[[-114.0952...	POINT (-114.09520199036152 50.90852910976423)

5 rows × 30 columns

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22073 entries, 0 to 22072
Data columns (total 30 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 PermitNum          22073 non-null  object 
 StatusCurrent      22073 non-null  object 
 AppliedDate        22073 non-null  object 
 IssuedDate         19574 non-null  object 
 CompletedDate      10433 non-null  object 
 PermitType         22073 non-null  object 
 PermitTypeMapped   22073 non-null  object 
 PermitClass        22072 non-null  object 
 PermitClassGroup   22073 non-null  object 
 PermitClassMapped  22073 non-null  object 
WorkClass          22073 non-null  object 
WorkClassGroup     22073 non-null  object 
WorkClassMapped    21533 non-null  object 
Description        20973 non-null  object 
ApplicantName      13679 non-null  object 
ContractorName     13282 non-null  object 
HousingUnits       22073 non-null  int64  
EstProjectCost     19612 non-null  float64
TotalSqFt          5634 non-null   float64
OriginalAddress    22066 non-null  object 
CommunityCode      22066 non-null  object 
CommunityName      22066 non-null  object 
Latitude           22073 non-null  float64
Longitude          22073 non-null  float64
LocationCount      22066 non-null  float64
LocationTypes      22066 non-null  object 
LocationAddresses  22066 non-null  object 
LocationsWKT       22066 non-null  object 
LocationsGeoJSON   22066 non-null  object 
Point              22073 non-null  object 
dtypes: float64(5), int64(1), object(24)
memory usage: 5.1+ MB

Quickly clean the data and remove unnecessary information#

# keep only the columns we need

data = data[['PermitNum', 'StatusCurrent', 'AppliedDate', 'IssuedDate', 'CompletedDate', 'PermitType', 'PermitClass', 'PermitClassGroup', 'WorkClassGroup', 'HousingUnits', 'EstProjectCost', 'CommunityCode', 'Latitude', 'Longitude', 'LocationCount']]

data.head()

	PermitNum	StatusCurrent	AppliedDate	IssuedDate	CompletedDate	PermitType	PermitClass	PermitClassGroup	WorkClassGroup	EstProjectCost	CommunityCode	Latitude	Longitude	LocationCount
0	BP2023-00001	Cancelled	2023/01/01	NaN	2023/08/04	Commercial / Multi Family Project	3106 - Retail Shop	Commercial	Improvement	NaN	SNA	51.037938	-114.095388	2.0
1	BP2023-00002	Completed	2023/01/01	2023/01/24	2023/06/22	Residential Improvement Project	1301 - Private Detached Garage	Garage	New	42530.0	WWO	50.874928	-114.009338	2.0
2	BP2023-00016	Completed	2023/01/02	2023/01/02	2023/05/12	Demolition	1106 - House	Single Family	Demolition	NaN	BRD	51.053518	-114.039212	1.0
3	BP2023-00015	Completed	2023/01/02	2023/01/03	2023/11/16	Residential Improvement Project	1301 - Private Detached Garage	Garage	New	46291.0	DAL	51.108398	-114.147277	2.0
4	BP2023-00010	Issued Permit	2023/01/02	2023/01/03	NaN	Residential Improvement Project	1101 - Basement Development	Single Family	Improvement	56141.0	SHN	50.908529	-114.095202	2.0

# how many null values are there?   

data.isnull().sum()

PermitNum               0
StatusCurrent           0
AppliedDate             0
IssuedDate           2499
CompletedDate       11640
PermitType              0
PermitClass             1
PermitClassGroup        0
WorkClassGroup          0
HousingUnits            0
EstProjectCost       2461
CommunityCode           7
Latitude                0
Longitude               0
LocationCount           7
dtype: int64

As in the next steps we want to plot the total project costs, we need them to be complete. So we remove all the rows with missing EstProjectCost.

# remove rows with null values in the EstProjectCost column

data = data.dropna(subset=['EstProjectCost'])

# how many null values are there now?

data.isnull().sum()

PermitNum               0
StatusCurrent           0
AppliedDate             0
IssuedDate           1592
CompletedDate       10781
PermitType              0
PermitClass             0
PermitClassGroup        0
WorkClassGroup          0
HousingUnits            0
EstProjectCost          0
CommunityCode           6
Latitude                0
Longitude               0
LocationCount           6
dtype: int64

Convert the AppliedDate column to datetime format so afterward we can work with time data

# convert the AppliedDate columns to datetime format
data['AppliedDate'] = pd.to_datetime(data['AppliedDate'])

Explatory Analysis by plotting charts#

Now lets try to create some charts to get more familiar with thr data and get more insights about different stats of data

Line Chart#

In the first step we want to draw a line chart to see how many building permits have been issued in each day.

We proceed to extract the date component from the 'AppliedDate' column and count the number of rows for each unique date. The groupby() method groups the DataFrame by the date component, and size() function counts the number of rows for each unique date. We then use reset_index() to reset the index of the resulting DataFrame and name='permit_count' to assign a new name to the count column.

# Create a new column called Datetime that contains only the date portion of the AppliedDate column

data['Date_daily'] = data['AppliedDate'].dt.date

# Group the data by Date_daily and count the number of rows in each group
daily_counts = data.groupby(data['Date_daily']).size()

# reset the index of the daily_counts dataframe and name the columns permit_count
daily_counts = daily_counts.reset_index(name='permit_count')

daily_counts.head()

	Date_daily	permit_count
0	2023-01-01	1
1	2023-01-02	17
2	2023-01-03	44
3	2023-01-04	40
4	2023-01-05	43

We initialize a Matplotlib figure with plt.figure(figsize=(12, 6)), setting the size to 12 inches in width and 6 inches in height. This helps in controlling the visual aspect of our plot.

Using plt.plot(), we create a line chart. The 'Date_daily' column is plotted on the x-axis, and the 'permit_count' column is plotted on the y-axis. We specify the line style, markers, and color to customize the appearance of the plot.

Note

We set labels for the x-axis (‘Date’) and y-axis (‘Number of Permits’). Additionally, we provide a title for the plot using plt.xlabel(), plt.ylabel(), and plt.title().

To improve the readability of the x-axis labels, we rotate them by 90 degrees using plt.xticks(rotation=90).

Finally, we display the plot using plt.show().

# Plotting with Matplotlib
plt.figure(figsize=(12, 6))  # Set the size of the figure


# Plotting the line chart
plt.plot(daily_counts['Date_daily'], daily_counts['permit_count'], linestyle='-', color='blue')


# Setting labels and title
plt.xlabel('Date')
plt.ylabel('Number of Permits')
plt.title('Daily Building Permits Over Time')

# Rotating x-axis labels for better readability
plt.xticks(rotation=90)

# Adding grid for better readability
plt.grid(True)

# Display the plot
plt.tight_layout()  # Adjust layout for better appearance

# Display the plot
plt.show()

../../_images/d1f2c4e2c7de798a4e109705d8483cbbe41a60126be3d3bab5532ca2500f4840.png

Bar Chart#

In the next step we try to draw a bar chart to check how many permit have been issued for each Permit Class Group.

As before, we use groupby() to group the DataFrame by the 'PermitClassGroup' column and calculate the size (number of rows) for each class group. Similarly, reset_index() is used to reset the index and name=’Count’ assigns a new name to the count column.

# Extract date and count the number of rows for each category
category_counts = data.groupby('PermitClassGroup').size().reset_index(name='Count')

# show the category_counts dataframe
category_counts.head()

	PermitClassGroup	Count
0	Additional Dwelling	97
1	Apartment	353
2	Commercial	2119
3	Garage	2633
4	General	91

To create a meaningful bar plot, we sort the DataFrame in descending order based on the ‘Count’ column.

# Sorting the DataFrame by 'Count' in descending order
category_counts = category_counts.sort_values(by='Count', ascending=False)

# show the category_counts dataframe
category_counts.head()

	PermitClassGroup	Count
10	Single Family	9642
3	Garage	2633
2	Commercial	2119
9	Secondary Suites	1655
14	Two Family	1213

The bar plot is then generated using plt.bar(). We use 'PermitClassGroup' on the x-axis and 'Count' on the y-axis

# Plotting with Matplotlib
plt.figure(figsize=(12, 8))  # Set the size of the figure

# Creating a bar plot
plt.bar(category_counts['PermitClassGroup'], category_counts['Count'], color='skyblue')

# Setting labels and title
plt.xlabel('Permit Class Group')
plt.ylabel('Number of permits')
plt.title('Distribution of permits by Permit Class Group')

# Rotating x-axis labels for better readability
plt.xticks(rotation=45, ha='right')

# Display the plot
plt.tight_layout()
plt.show()

../../_images/e8d7d51e03077c2b73db49f19fd0021130f41c2af5764fde60edf7e736b0a5f8.png

Pie Chart#

Now let’s practice Pie chart.

We want to check the type of permit 'WorkClassGroup' to check whether the permit is for improvement of new building. To do so, same as before we use groupby() to group the dataframe based on 'WorkClassGroup' and then count the number of permit in each group.

# group data by WorkClassGroup and count the number of rows in each group
workclass_counts = data.groupby('WorkClassGroup').size().reset_index(name='Count')

# show the workclass_counts dataframe
workclass_counts.head()

	WorkClassGroup	Count
0	Improvement	9440
1	New	9662
2	Unspecified	510

Like before, we use matplotlib to plot the pir chart.

In piechart we can set different stting such as, legend, labels forn size and weight, …

# Plotting with Matplotlib
plt.figure(figsize=(6, 6))  # Set the size of the figure

# Creating a pie chart with various setups
labels = workclass_counts['WorkClassGroup']

# Setting custom colors, autopct format, and shadow
plt.pie(workclass_counts['Count'], 
        autopct='%1.1f%%', # 1 decimal place for percentage
        textprops={'fontsize': 12}, # Setting fontsize of text labels
        radius=0.9, # Setting the radius of the pie chart
        labels=labels, # Setting the labels for each section
        )

# Setting title
plt.title('Distribution of permits based on work class group', fontsize=12, fontweight='bold')

# Adding legend with custom position
plt.legend(labels, loc='upper right', bbox_to_anchor=(1.3, 0.9), title='Work Class Group', title_fontsize=12)

# Display the plot
plt.show()

../../_images/447239e721710efea99dcd499e0daed64a112814419856a6258890251c8c4875.png

Historgram#

We can use histogram chart to check the distribution of 'EstProjectCost' to see how the project prices are distributed.

Histogram plot divides the data into bins and displays the frequency (or count) of values within each bin.

We use plt.hist() to create the histogram.

The data['EstProjectCost'] variable contains the data we want to visualize.

'bins' parameter controls the number of bins in the histogram.
'color' parameter sets the color of the bars.
'edgecolor' parameter sets the color of the edges of the bars. –>

# plot histogram of the EstProjectCost column
plt.figure(figsize=(12, 6))  # Set the size of the figure


# Plotting the histogram
plt.hist(data['EstProjectCost'], 
         bins=20, 
         color='skyblue', 
         edgecolor='black')

# Setting labels and title
plt.xlabel('Estimated Project Cost')
plt.ylabel('Number of permits')
plt.title('Distribution of permits by estimated project cost')

# Display the plot
plt.show()

../../_images/6c0df1b1d70fa2a67badb4bb70a869eac84a5317c01edb94d242c4a6bfcaabf0.png

Note

As we can see, the plot does not show very good result due to the x-axis that contains sparse data (some of the prices are around 80,000,000 which are only few and make the chart not readable. One approach is that filter the dataframe to less sparse prices. For example we choose to keep those rows that contain price less than 200,000)

We also can play with number of bins to have better chart. Let’s change number of bins to 200.

# filter the data to keep only rows where EstProjectCost is less than 2,000,000
data_less_than_2000000 = data[data['EstProjectCost'] < 2000000]

# plot histogram of the EstProjectCost column
plt.figure(figsize=(12, 6))  # Set the size of the figure


# Plotting the histogram
plt.hist(data_less_than_2000000['EstProjectCost'], 
         bins=200, 
         color='skyblue', 
         edgecolor='black')

# Setting labels and title
plt.xlabel('Estimated Project Cost')
plt.ylabel('Number of permits')
plt.title('Distribution of permits by estimated project cost')

# Display the plot
plt.show()

../../_images/2d9304fef6706b16d2f4e29d3cce075c5520af55205028e3163abd3f8a43412b.png

Box Plot#

Lets look at the distribution of the EstProjectCost column using a boxplot

# Boxplot of the EstProjectCost column

plt.figure(figsize=(6, 6))  # Set the size of the figure

# Plotting the boxplot
plt.boxplot(data['EstProjectCost'])

# Setting labels and title
plt.ylabel('Estimated Project Cost')
plt.title('Distribution of permits by estimated project cost')

# Display the plot
plt.show()

../../_images/d3ebe14ea9146f9cf7ee754396f23dc76a59168b81c10b72ef6cf37468c75ed7.png

As we can see, the number of outliars is too much and make the plot not readable. We can use showfliers=False to hide the outliars.

# Boxplot of the EstProjectCost column

plt.figure(figsize=(5, 5))  # Set the size of the figure

# Plotting the boxplot
plt.boxplot(data['EstProjectCost'],
            showfliers=False,
            )

# Setting labels and title
plt.ylabel('Estimated Project Cost')
plt.xlabel('All permits')
plt.title('Distribution of permits by estimated project cost')

# Display the plot
plt.show()

../../_images/a844ca2f3c3a46cfbb97a19457d48a397482d72d3cd92f98022019beabcda93f.png

Map Visualization#

Now let’s see how to plot latitude and longitude points on a Folium map.

Folium is the primary library for creating interactive maps. Let’s start with importing required libraries

Show the distribution of points of Map#

import folium

Next, we calculate the center of the map using the mean of latitude and longitude values.

# Calculate the center of the map
map_center = [data['Latitude'].mean(), data['Longitude'].mean()]

# print the map_center variable
print(map_center)

[np.float64(51.02548747000798), np.float64(-114.01941494735246)]

We now create a Folium Figure 'f' with a specified width and height. Then, we create the Folium map, specifying the initial zoom level, and the center of the map, and assign it to the figure object by .add_to(f).

# Create a folium map object
f = folium.Figure(width=800, height=500) # set figure size

# Create the Folium map
my_map = folium.Map(location=map_center, zoom_start=10).add_to(f)

In this step, we iterate through each row in the DataFrame (see the example in the Processing data with pandas) and add a Circle Marker for each point on the map. We can customize the appearance of the Circle Marker using parameters like radius, color, fill, fill_color, and fill_opacity. The popup displays information about each point.

Note

We use the first 10,000 rows and show the first 10,000 points one the map to avoid making the map heavy and crashing the PC!

# iterate over rows in dataframe
for index, row in data[:10000].iterrows(): # iterate over rows in dataframe
    folium.CircleMarker(location=[row['Latitude'], row['Longitude']],
                        radius=1,  # Specify the radius of the circle
                        color='blue',  # Specify the color of the circle
                        fill=True,  # Set fill to True
                        fill_color='blue',  # Specify the fill color of the circle
                        fill_opacity=0.6,  # Specify the fill opacity
                        popup=f"Point {index + 1}"
                        ).add_to(my_map) # add circle one by one on the map
# display map
f

Plotting in with Pandas and Matplotlib

Contents

Plotting in with Pandas and Matplotlib#

Anatomy of a Plot#

Common Plotting Terms#

Getting started#

Input data: Community Crime Statistics Map#

Quickly clean the data and remove unnecessary information#

Explatory Analysis by plotting charts#

Line Chart#

Bar Chart#

Pie Chart#

Historgram#

Box Plot#

Map Visualization#

Show the distribution of points of Map#