Processing data with pandas#

Attention

These tutorials are meant to help you with assignment tasks by giving hints, not to teach Python programming in a detailed step-by-step manner.

As mentioned in the first lecture of the course, this course requires basic knowledge of Python programming language.

If you are not familiar with them, try catching up with the basics as fast as you can. Some useful resources are:

pandas online documentation

Python for Data Analysis (3rd Edition, 2022)

import warnings
warnings.filterwarnings("ignore")

Our assignment 1 is about to learn how to read and explore data files in Python. We will focus on using pandas which is an open-source package for data analysis in Python. pandas is an excellent toolkit for working with real world data that often have a tabular structure (rows and columns).

Input data: Community Crime Statistics Map#

Our input data in this tutorial is a text file containing community crimes statistics map in city of Calgary, Alberta, Canda retrieved from City of Calgary Open Data Portal:

  • File name: [Community_Crime_Statistics_20240120.csv] (you can have a look at the file before reading it in using pandas!)

  • You can download the data from the link provided: City of Calgary Open Data Portal

  • Data is provided monthly by the Calgary Police Service. And includes the location of crime, time, category, crime count, and resident count.

  • There are totally 67,262 rows and 10 columns in this dataset.

Loading Data#

Next, we wll read the input data file, and store the contents of that file in a variable called data Using the pandas.read_csv() function:

# Importing the libraries
import pandas as pd
# load data

# Read the file using pandas
data = pd.read_csv("Community_Crime_Statistics_20240120.csv", sep=',')

Reading different file formats

Check out the pandas documentation about input and output functions and Chapter 6 in McKinney (2022) for more details about reading data.

Let’s now print the dataframe and see what it looks like.

data
Sector Community Name Category Crime Count Resident Count Date Year Month ID Community Center Point
0 NORTHWEST ARBOUR LAKE Theft OF Vehicle 2 10619.0 2022/04 2022 APR 2022-APR-ARBOUR LAKE-Theft OF Vehicle POINT (-114.20767498075155 51.1325947114686)
1 CENTRE BANFF TRAIL Theft OF Vehicle 2 4153.0 2023/10 2023 OCT 2023-OCT-BANFF TRAIL-Theft OF Vehicle POINT (-114.11512839716917 51.07421633024228)
2 EAST DOVER Theft OF Vehicle 5 10351.0 2022/12 2022 DEC 2022-DEC-DOVER-Theft OF Vehicle POINT (-113.99305400906283 51.02256772250409)
3 CENTRE GREENVIEW Assault (Non-domestic) 2 1906.0 2020/12 2020 DEC 2020-DEC-GREENVIEW-Assault (Non-domestic) POINT (-114.05746990262463 51.09485613506574)
4 NORTHWEST HAMPTONS Theft FROM Vehicle 3 7382.0 2019/08 2019 AUG 2019-AUG-HAMPTONS-Theft FROM Vehicle POINT (-114.14668419231347 51.14509283969437)
... ... ... ... ... ... ... ... ... ... ...
67930 SOUTH SOMERSET Assault (Non-domestic) 2 8543.0 2020/05 2020 MAY 2020-MAY-SOMERSET-Assault (Non-domestic) POINT (-114.08138084069279 50.89891142227545)
67931 CENTRE LOWER MOUNT ROYAL Assault (Non-domestic) 1 3457.0 2023/07 2023 JUL 2023-JUL-LOWER MOUNT ROYAL-Assault (Non-domestic) POINT (-114.08649939756594 51.0365031977319)
67932 NORTHEAST MARLBOROUGH PARK Break & Enter - Dwelling 1 8523.0 2021/04 2021 APR 2021-APR-MARLBOROUGH PARK-Break & Enter - Dwel... POINT (-113.94671708302872 51.059688495333184)
67933 SOUTH DEER RIDGE Break & Enter - Commercial 1 3973.0 2020/08 2020 AUG 2020-AUG-DEER RIDGE-Break & Enter - Commercial POINT (-114.02023404987288 50.926573869162)
67934 NORTH SAGE HILL Theft FROM Vehicle 1 7924.0 2020/11 2020 NOV 2020-NOV-SAGE HILL-Theft FROM Vehicle POINT (-114.14068609335015 51.175616972779984)

67935 rows × 10 columns

We can use the data.head() function of the pandas DataFrame object to quickly check the top rows. We can also check the last rows of the data using data.tail()

# print the first 5 rows of data
data.head() 
Sector Community Name Category Crime Count Resident Count Date Year Month ID Community Center Point
0 NORTHWEST ARBOUR LAKE Theft OF Vehicle 2 10619.0 2022/04 2022 APR 2022-APR-ARBOUR LAKE-Theft OF Vehicle POINT (-114.20767498075155 51.1325947114686)
1 CENTRE BANFF TRAIL Theft OF Vehicle 2 4153.0 2023/10 2023 OCT 2023-OCT-BANFF TRAIL-Theft OF Vehicle POINT (-114.11512839716917 51.07421633024228)
2 EAST DOVER Theft OF Vehicle 5 10351.0 2022/12 2022 DEC 2022-DEC-DOVER-Theft OF Vehicle POINT (-113.99305400906283 51.02256772250409)
3 CENTRE GREENVIEW Assault (Non-domestic) 2 1906.0 2020/12 2020 DEC 2020-DEC-GREENVIEW-Assault (Non-domestic) POINT (-114.05746990262463 51.09485613506574)
4 NORTHWEST HAMPTONS Theft FROM Vehicle 3 7382.0 2019/08 2019 AUG 2019-AUG-HAMPTONS-Theft FROM Vehicle POINT (-114.14668419231347 51.14509283969437)
# print the last 5 rows of data
data.tail()
Sector Community Name Category Crime Count Resident Count Date Year Month ID Community Center Point
67930 SOUTH SOMERSET Assault (Non-domestic) 2 8543.0 2020/05 2020 MAY 2020-MAY-SOMERSET-Assault (Non-domestic) POINT (-114.08138084069279 50.89891142227545)
67931 CENTRE LOWER MOUNT ROYAL Assault (Non-domestic) 1 3457.0 2023/07 2023 JUL 2023-JUL-LOWER MOUNT ROYAL-Assault (Non-domestic) POINT (-114.08649939756594 51.0365031977319)
67932 NORTHEAST MARLBOROUGH PARK Break & Enter - Dwelling 1 8523.0 2021/04 2021 APR 2021-APR-MARLBOROUGH PARK-Break & Enter - Dwel... POINT (-113.94671708302872 51.059688495333184)
67933 SOUTH DEER RIDGE Break & Enter - Commercial 1 3973.0 2020/08 2020 AUG 2020-AUG-DEER RIDGE-Break & Enter - Commercial POINT (-114.02023404987288 50.926573869162)
67934 NORTH SAGE HILL Theft FROM Vehicle 1 7924.0 2020/11 2020 NOV 2020-NOV-SAGE HILL-Theft FROM Vehicle POINT (-114.14068609335015 51.175616972779984)

Let’s see some basic info about the data (The number of column, rows, the data type of each column etc.)

# Check the data info
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67935 entries, 0 to 67934
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Sector                  67904 non-null  object 
 1   Community Name          67935 non-null  object 
 2   Category                67935 non-null  object 
 3   Crime Count             67935 non-null  int64  
 4   Resident Count          67866 non-null  float64
 5   Date                    67935 non-null  object 
 6   Year                    67935 non-null  int64  
 7   Month                   67935 non-null  object 
 8   ID                      67935 non-null  object 
 9   Community Center Point  67904 non-null  object 
dtypes: float64(1), int64(2), object(7)
memory usage: 5.2+ MB

Also some basic stats about numerical values in columns that contain numbers (Crime count, Resident count, Year)

# Cehck the data description
data.describe()
Crime Count Resident Count Year
count 67935.000000 67866.000000 67935.000000
mean 2.880651 6500.622653 2020.460146
std 3.685793 5455.404451 1.712744
min 1.000000 0.000000 2018.000000
25% 1.000000 2263.000000 2019.000000
50% 2.000000 5957.000000 2020.000000
75% 3.000000 9244.000000 2022.000000
max 110.000000 25710.000000 2023.000000

Basic Calculations in DataFrame#

One of the most common things to do in pandas is to create new columns based on calculations between different variables (columns).

Let’s say we want to create a new column in which we want to calculate the “Crime Count +1”. We can create a new column new_crime_count in our DataFrame by specifying the name of the column and giving it some default value (in this case the decimal number 0.0).

# Define a new column "ew_crime_count"
data["new_crime_count"] = 0.0

# Check how the dataframe looks like:
data
Sector Community Name Category Crime Count Resident Count Date Year Month ID Community Center Point new_crime_count
0 NORTHWEST ARBOUR LAKE Theft OF Vehicle 2 10619.0 2022/04 2022 APR 2022-APR-ARBOUR LAKE-Theft OF Vehicle POINT (-114.20767498075155 51.1325947114686) 0.0
1 CENTRE BANFF TRAIL Theft OF Vehicle 2 4153.0 2023/10 2023 OCT 2023-OCT-BANFF TRAIL-Theft OF Vehicle POINT (-114.11512839716917 51.07421633024228) 0.0
2 EAST DOVER Theft OF Vehicle 5 10351.0 2022/12 2022 DEC 2022-DEC-DOVER-Theft OF Vehicle POINT (-113.99305400906283 51.02256772250409) 0.0
3 CENTRE GREENVIEW Assault (Non-domestic) 2 1906.0 2020/12 2020 DEC 2020-DEC-GREENVIEW-Assault (Non-domestic) POINT (-114.05746990262463 51.09485613506574) 0.0
4 NORTHWEST HAMPTONS Theft FROM Vehicle 3 7382.0 2019/08 2019 AUG 2019-AUG-HAMPTONS-Theft FROM Vehicle POINT (-114.14668419231347 51.14509283969437) 0.0
... ... ... ... ... ... ... ... ... ... ... ...
67930 SOUTH SOMERSET Assault (Non-domestic) 2 8543.0 2020/05 2020 MAY 2020-MAY-SOMERSET-Assault (Non-domestic) POINT (-114.08138084069279 50.89891142227545) 0.0
67931 CENTRE LOWER MOUNT ROYAL Assault (Non-domestic) 1 3457.0 2023/07 2023 JUL 2023-JUL-LOWER MOUNT ROYAL-Assault (Non-domestic) POINT (-114.08649939756594 51.0365031977319) 0.0
67932 NORTHEAST MARLBOROUGH PARK Break & Enter - Dwelling 1 8523.0 2021/04 2021 APR 2021-APR-MARLBOROUGH PARK-Break & Enter - Dwel... POINT (-113.94671708302872 51.059688495333184) 0.0
67933 SOUTH DEER RIDGE Break & Enter - Commercial 1 3973.0 2020/08 2020 AUG 2020-AUG-DEER RIDGE-Break & Enter - Commercial POINT (-114.02023404987288 50.926573869162) 0.0
67934 NORTH SAGE HILL Theft FROM Vehicle 1 7924.0 2020/11 2020 NOV 2020-NOV-SAGE HILL-Theft FROM Vehicle POINT (-114.14068609335015 51.175616972779984) 0.0

67935 rows × 11 columns

Now let’s update the column new_crime_count by calculating the the column Crime Count + 1.

# Calculate `Crime Count` + 1
data["new_crime_count"] = data["Crime Count"] + 1

# Check the result
data.head()
Sector Community Name Category Crime Count Resident Count Date Year Month ID Community Center Point new_crime_count
0 NORTHWEST ARBOUR LAKE Theft OF Vehicle 2 10619.0 2022/04 2022 APR 2022-APR-ARBOUR LAKE-Theft OF Vehicle POINT (-114.20767498075155 51.1325947114686) 3
1 CENTRE BANFF TRAIL Theft OF Vehicle 2 4153.0 2023/10 2023 OCT 2023-OCT-BANFF TRAIL-Theft OF Vehicle POINT (-114.11512839716917 51.07421633024228) 3
2 EAST DOVER Theft OF Vehicle 5 10351.0 2022/12 2022 DEC 2022-DEC-DOVER-Theft OF Vehicle POINT (-113.99305400906283 51.02256772250409) 6
3 CENTRE GREENVIEW Assault (Non-domestic) 2 1906.0 2020/12 2020 DEC 2020-DEC-GREENVIEW-Assault (Non-domestic) POINT (-114.05746990262463 51.09485613506574) 3
4 NORTHWEST HAMPTONS Theft FROM Vehicle 3 7382.0 2019/08 2019 AUG 2019-AUG-HAMPTONS-Theft FROM Vehicle POINT (-114.14668419231347 51.14509283969437) 4

You can even calculate the sum of two columns. For example: data["new_column"] = data["Crime Count"] + data["Resident Count"]

Selecting rows and columns#

One common way of selecting only specific rows from your DataFrame is done via this syntax: dataframe[start_index:stop_index]

Let’s select the first six rows and assign them to a variable called selection:

# Select first five rows of dataframe using row index values
selection = data[0:6]

# Print `selection`
selection
Sector Community Name Category Crime Count Resident Count Date Year Month ID Community Center Point new_crime_count
0 NORTHWEST ARBOUR LAKE Theft OF Vehicle 2 10619.0 2022/04 2022 APR 2022-APR-ARBOUR LAKE-Theft OF Vehicle POINT (-114.20767498075155 51.1325947114686) 3
1 CENTRE BANFF TRAIL Theft OF Vehicle 2 4153.0 2023/10 2023 OCT 2023-OCT-BANFF TRAIL-Theft OF Vehicle POINT (-114.11512839716917 51.07421633024228) 3
2 EAST DOVER Theft OF Vehicle 5 10351.0 2022/12 2022 DEC 2022-DEC-DOVER-Theft OF Vehicle POINT (-113.99305400906283 51.02256772250409) 6
3 CENTRE GREENVIEW Assault (Non-domestic) 2 1906.0 2020/12 2020 DEC 2020-DEC-GREENVIEW-Assault (Non-domestic) POINT (-114.05746990262463 51.09485613506574) 3
4 NORTHWEST HAMPTONS Theft FROM Vehicle 3 7382.0 2019/08 2019 AUG 2019-AUG-HAMPTONS-Theft FROM Vehicle POINT (-114.14668419231347 51.14509283969437) 4
5 SOUTH LAKE BONAVISTA Theft OF Vehicle 2 10293.0 2022/08 2022 AUG 2022-AUG-LAKE BONAVISTA-Theft OF Vehicle POINT (-114.05133665127042 50.939803390308995) 3

It is also possible to control which columns are chosen when selecting a subset of rows. In this case we will use pandas.DataFrame.loc which selects data based on axis labels (row labels and column labels) with this syntax : .loc[start_index:stop_index, list_of_columns]

Let’s select “Category” and “Crime Count” values from rows 0-5:

# Select the two column values on rows 0-5
selection = data.loc[0:5, ["Category", "Crime Count"]]

# Print `selection`
selection
Category Crime Count
0 Theft OF Vehicle 2
1 Theft OF Vehicle 2
2 Theft OF Vehicle 5
3 Assault (Non-domestic) 2
4 Theft FROM Vehicle 3
5 Theft OF Vehicle 2

Filtering and updating#

Filter Columns:#

Sometimes we don’t need all the information in the data. We can select specific columns based on the column values.

For example, I don’t need the collumn Month and want to keep other columns. We can drop the column by using the .drop() function. (e.g. data.drop('Month', axis=1, inplace=True))

axis=1 means drop the column, inplace=True means drop the column in the original data, (for droping roaws we can use axis=0 which means drop the row)

# drop the extra columns
data.drop(['Month'], axis=1, inplace=True)

Note

Alternatively, instead of using .drop() to drop unneccesary columns, we can keep all the columns that we need. The basic syntax is dataframe[value], where value can be a single column name, or a list of column names. In our example, instead of dropping Month, I can keep all other columns with the mentioned suntax.

# Keep all columns except the Month column
data = data[['Sector', 'Community Name', 'Category', 'Crime Count', 'Resident Count', 'Date', 'Year', 'ID', 'Community Center Point']]
data.head()
Sector Community Name Category Crime Count Resident Count Date Year ID Community Center Point
0 NORTHWEST ARBOUR LAKE Theft OF Vehicle 2 10619.0 2022/04 2022 2022-APR-ARBOUR LAKE-Theft OF Vehicle POINT (-114.20767498075155 51.1325947114686)
1 CENTRE BANFF TRAIL Theft OF Vehicle 2 4153.0 2023/10 2023 2023-OCT-BANFF TRAIL-Theft OF Vehicle POINT (-114.11512839716917 51.07421633024228)
2 EAST DOVER Theft OF Vehicle 5 10351.0 2022/12 2022 2022-DEC-DOVER-Theft OF Vehicle POINT (-113.99305400906283 51.02256772250409)
3 CENTRE GREENVIEW Assault (Non-domestic) 2 1906.0 2020/12 2020 2020-DEC-GREENVIEW-Assault (Non-domestic) POINT (-114.05746990262463 51.09485613506574)
4 NORTHWEST HAMPTONS Theft FROM Vehicle 3 7382.0 2019/08 2019 2019-AUG-HAMPTONS-Theft FROM Vehicle POINT (-114.14668419231347 51.14509283969437)

Filter Rows#

One really useful feature in pandas is the ability to easily filter and select rows based on a conditional statement.

# Select rows that are in 2023
crimes_2023 = data.loc[data["Year"] > 2022]

# Print `crimes_2023`
crimes_2023.head()
Sector Community Name Category Crime Count Resident Count Date Year ID Community Center Point
1 CENTRE BANFF TRAIL Theft OF Vehicle 2 4153.0 2023/10 2023 2023-OCT-BANFF TRAIL-Theft OF Vehicle POINT (-114.11512839716917 51.07421633024228)
10 CENTRE ROSEDALE Break & Enter - Dwelling 1 1560.0 2023/04 2023 2023-APR-ROSEDALE-Break & Enter - Dwelling POINT (-114.07820713414625 51.062686663226856)
15 CENTRE TUXEDO PARK Theft OF Vehicle 3 5326.0 2023/08 2023 2023-AUG-TUXEDO PARK-Theft OF Vehicle POINT (-114.06104493215598 51.07421777158121)
30 NORTHEAST CALGARY INTERNATIONAL AIRPORT Theft FROM Vehicle 8 0.0 2023/05 2023 2023-MAY-CALGARY INTERNATIONAL AIRPORT-Theft F... POINT (-114.01171665690258 51.123556453987604)
96 CENTRE ALTADORE Theft OF Vehicle 1 6942.0 2023/04 2023 2023-APR-ALTADORE-Theft OF Vehicle POINT (-114.10078239468564 51.015954118298694)

Or even multiple conditions at the same time:

# filter the data to show crime with crime count greater than 5 for year 2023
crime_2023_selected = data.loc[(data["Year"] > 2022) & (data["Crime Count"] >= 5)]

# Print `crime_2023_selected`
crime_2023_selected.head()
Sector Community Name Category Crime Count Resident Count Date Year ID Community Center Point
30 NORTHEAST CALGARY INTERNATIONAL AIRPORT Theft FROM Vehicle 8 0.0 2023/05 2023 2023-MAY-CALGARY INTERNATIONAL AIRPORT-Theft F... POINT (-114.01171665690258 51.123556453987604)
13017 SOUTHEAST AUBURN BAY Theft FROM Vehicle 5 17607.0 2023/03 2023 2023-MAR-AUBURN BAY-Theft FROM Vehicle POINT (-113.95779514351926 50.89269507687)
20753 SOUTHEAST SETON Theft OF Vehicle 5 1134.0 2023/10 2023 2023-OCT-SETON-Theft OF Vehicle POINT (-113.94793123802683 50.874157662841256)
20934 SOUTHEAST OGDEN Theft FROM Vehicle 5 8576.0 2023/09 2023 2023-SEP-OGDEN-Theft FROM Vehicle POINT (-114.01244227109757 50.991878337981746)
29683 SOUTHEAST OGDEN Theft FROM Vehicle 7 8576.0 2023/11 2023 2023-NOV-OGDEN-Theft FROM Vehicle POINT (-114.01244227109757 50.991878337981746)

Data Cleaning#

Dealing with missing data#

As you have learned, most of the time when we work with data, we face missing values or incomplete data. We can check the number of missing values in Dataframe with this syntax:

# print the number of missing values in each column
data.isnull().sum() 
Sector                    31
Community Name             0
Category                   0
Crime Count                0
Resident Count            69
Date                       0
Year                       0
ID                         0
Community Center Point    31
dtype: int64

As we can see, in Column “Sector”, “Resident Count”, and “Community Center Pint” we have 31, 69, 31 missing values respectively.

OThere are many way to handle missing values based on the application. One way of cleaning data is to remover rows with missing values. If we want to remove the missing values based on column (only remove those rows that are missing the value in specific column) we can use this syntax: .dropna(subset=['column_name'], inplace=True)

# drop the rows with missing values for column "Community Center Point"
data.dropna(subset=['Community Center Point'], inplace=True)
# print the number of missing values in each column
data.isnull().sum() 
Sector                     0
Community Name             0
Category                   0
Crime Count                0
Resident Count            38
Date                       0
Year                       0
ID                         0
Community Center Point     0
dtype: int64

As we can see, we remove all the rows that had missing values for column Community Center Point. If we want to remove all the row that has any missing value, no matter in which column, we can use this syntax:

# remove all the rows with missing values
data.dropna(inplace=True)
# print the number of missing values in each column
data.isnull().sum() 
Sector                    0
Community Name            0
Category                  0
Crime Count               0
Resident Count            0
Date                      0
Year                      0
ID                        0
Community Center Point    0
dtype: int64

Removing Duplicates#

Duplicate rows may be found in a DataFrame for any number of reasons.

The DataFrame method duplicated returns a Boolean Series indicating whether each row is a duplicate (its column values are exactly equal to those in an earlier row) or not:

# check for duplicate rows
data.duplicated()
0        False
1        False
2        False
3        False
4        False
         ...  
67930     True
67931     True
67932     True
67933     True
67934     True
Length: 67866, dtype: bool
# count all the duplicate rows
data.duplicated().sum()
np.int64(673)

We have 673 rows with identical values. Relatedly, drop_duplicates returns a DataFrame with rows where the duplicated array is False filtered out:

# drop the duplicate rows
data.drop_duplicates(inplace=True)

# print the number of duplicate rows
data.duplicated().sum()
np.int64(0)

Note

You can check for more infromation about data cleaning with pandas by looking at : Data Cleaning and Preparation

Data type conversions#

There are occasions where you’ll need to convert data stored within a Series to another data type, for example, from floating point to integer.

For values in pandas DataFrames and Series, we can use the .astype() method.

## Data type conversions

data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 67193 entries, 0 to 67261
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Sector                  67193 non-null  object 
 1   Community Name          67193 non-null  object 
 2   Category                67193 non-null  object 
 3   Crime Count             67193 non-null  int64  
 4   Resident Count          67193 non-null  float64
 5   Date                    67193 non-null  object 
 6   Year                    67193 non-null  int64  
 7   ID                      67193 non-null  object 
 8   Community Center Point  67193 non-null  object 
dtypes: float64(1), int64(2), object(6)
memory usage: 5.1+ MB

As we can see the type of “Resident Count” is float64. We want to convert this column type to integer:

# convert the "Resident Count" to integer
data["Resident Count"] = data["Resident Count"].astype(int)

# print the data info
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 67193 entries, 0 to 67261
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Sector                  67193 non-null  object
 1   Community Name          67193 non-null  object
 2   Category                67193 non-null  object
 3   Crime Count             67193 non-null  int64 
 4   Resident Count          67193 non-null  int64 
 5   Date                    67193 non-null  object
 6   Year                    67193 non-null  int64 
 7   ID                      67193 non-null  object
 8   Community Center Point  67193 non-null  object
dtypes: int64(3), object(6)
memory usage: 5.1+ MB

Datetime#

In pandas, we can also convert dates and times into a new data type called datetime using the pandas.to_datetime function.

# Convert character strings to datetime
data["NEW_DATE"] = pd.to_datetime(data["Date"])

# print the data info
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 67193 entries, 0 to 67261
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Sector                  67193 non-null  object        
 1   Community Name          67193 non-null  object        
 2   Category                67193 non-null  object        
 3   Crime Count             67193 non-null  int64         
 4   Resident Count          67193 non-null  int64         
 5   Date                    67193 non-null  object        
 6   Year                    67193 non-null  int64         
 7   ID                      67193 non-null  object        
 8   Community Center Point  67193 non-null  object        
 9   NEW_DATE                67193 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(3), object(6)
memory usage: 5.6+ MB

As we can see, the new column NEW_DATE has the type of “datetime64” which is the format of datetime in pandas. From now on pandas underestand that this is a date and time and the sequence in time series matter.

Pandas Series datetime properties

There are several methods available for accessing information about the properties of datetime values. You can read more about datetime properties from the pandas documentation.

With the datetime column, we can now extract different time units using the pandas.Series.dt accessor.

# print NEW_DATE column 
data["NEW_DATE"].dt.year
0        2022
1        2023
2        2022
3        2020
4        2019
         ... 
67257    2023
67258    2023
67259    2023
67260    2023
67261    2023
Name: NEW_DATE, Length: 67193, dtype: int32
# print NEW_DATE column 
data["NEW_DATE"].dt.month
0         4
1        10
2        12
3        12
4         8
         ..
67257    12
67258    12
67259    12
67260    12
67261    12
Name: NEW_DATE, Length: 67193, dtype: int32
# print NEW_DATE column 
data["NEW_DATE"].dt.day
0        1
1        1
2        1
3        1
4        1
        ..
67257    1
67258    1
67259    1
67260    1
67261    1
Name: NEW_DATE, Length: 67193, dtype: int32

Iterating over rows#

We can use an operation on the DataFrame one row at a time using a for loop and the iterrows() method. This will allow us to iterate row by row using iterrows() in a for loop to repeat a given process for each row in a pandas DataFrame. Please note that iterating over rows is a rather inefficient approach, but it is still useful to understand the logic behind the iteration.

When using the iterrows() method it is important to understand that iterrows() accesses not only the values of one row, but also the index of the row as well.

Let’s start with a simple for loop that goes through each row in our DataFrame.

# Iterate over the rows
for idx, row in data.iterrows():

    # Print the index value
    print(f"Index: {idx} \n\n")


    # Print the row
    print(f"Row :\n\n{row}\n")

    break
Index: 0 


Row :

Sector                                                       NORTHWEST
Community Name                                             ARBOUR LAKE
Category                                              Theft OF Vehicle
Crime Count                                                          2
Resident Count                                                   10619
Date                                                           2022/04
Year                                                              2022
ID                               2022-APR-ARBOUR LAKE-Theft OF Vehicle
Community Center Point    POINT (-114.20767498075155 51.1325947114686)
NEW_DATE                                           2022-04-01 00:00:00
Name: 0, dtype: object

Breaking a loop

When developing code in a for loop, you do not always need to go through the entire loop in order to test things out. The break statement in Python terminates the current loop whereever it is placed and we can use it here just to check out the values on the first row (based on the first iteration in the for loop. This can be helpful when working with a large data file or dataset, because you might not want to print thousands of values to the screen! For more information, check out this tutorial.

We can see that the idx variable indeed contains the index value at position 0 (the first row) and the row variable contains all the data from that given row stored as a pandas Series.

Let’s now as an example create an empty column “Year Check” and check if crime year is greater than 2022 in a new column “Year Check” we enter “new”, else “old”.

# iterate in the dataframe and check if year is greater than 2022 in a new column "Year Check" enter "new" else "old"
for idx, row in data.iterrows():
    if row["Year"] > 2022:
        data.loc[idx, "Year Check"] = "new"
    else:
        data.loc[idx, "Year Check"] = "old"
        

Finally, let’s see how our DataFrame looks like now after the calculations above.

# print the first 5 rows of data
data.head()
Sector Community Name Category Crime Count Resident Count Date Year ID Community Center Point NEW_DATE Year Check
0 NORTHWEST ARBOUR LAKE Theft OF Vehicle 2 10619 2022/04 2022 2022-APR-ARBOUR LAKE-Theft OF Vehicle POINT (-114.20767498075155 51.1325947114686) 2022-04-01 old
1 CENTRE BANFF TRAIL Theft OF Vehicle 2 4153 2023/10 2023 2023-OCT-BANFF TRAIL-Theft OF Vehicle POINT (-114.11512839716917 51.07421633024228) 2023-10-01 new
2 EAST DOVER Theft OF Vehicle 5 10351 2022/12 2022 2022-DEC-DOVER-Theft OF Vehicle POINT (-113.99305400906283 51.02256772250409) 2022-12-01 old
3 CENTRE GREENVIEW Assault (Non-domestic) 2 1906 2020/12 2020 2020-DEC-GREENVIEW-Assault (Non-domestic) POINT (-114.05746990262463 51.09485613506574) 2020-12-01 old
4 NORTHWEST HAMPTONS Theft FROM Vehicle 3 7382 2019/08 2019 2019-AUG-HAMPTONS-Theft FROM Vehicle POINT (-114.14668419231347 51.14509283969437) 2019-08-01 old

Aggregating data in pandas by grouping#

Here, we will learn how to use pandas.DataFrame.groupby, which is a handy method for combining large amounts of data and computing statistics for subgroups.

In our case, we will use the groupby method to calculate the average number of crimes for each year through these three steps:

  1. Grouping the data based on the year

  2. Calculating the average for each year (each group)

  3. Storing those values into a new DataFrame called grouped_data

We have rows of crime data, and several observations per month. Our goal is to create an aggreated data frame that would have only one row per year. (average crime count per year)

To condense our data to yearly average values we can group our data based on the unique year.

We can use NEW_DATE column, which we previously define and datetime format. We try to group this column on yearly basis and calculate the average number of crime for each group.

# group the data by "Date" on yearly basis and calculate average of "Crime Count"
grouped_data = data.groupby(data["NEW_DATE"].dt.year)

Now, let’s explore the new variable grouped_data.

# print grouped data
print(f"Type of grouped data is : {type(grouped_data)}")

print(f"Length of grouped Data: {len(grouped_data)}\n")
Type of grouped data is : <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
Length of grouped Data: 6

We have a new object with type DataFrameGroupBy with 6 groups. Length of the grouped object should be the same as the number of unique values in the column we used for grouping. For each unique value, there is a group of data. (in this example we have data for 6 years)

Let’s explore our grouped data even further.

We can check the “names” of each group.

# Print grouoped data keys

grouped_data.groups.keys()
dict_keys([2018, 2019, 2020, 2021, 2022, 2023])

Note

It is also possible to create combinations of two or more columns on-the-fly when grouping the data:

# Group the data 
grouped = data.groupby(['column1', 'column2'])

Accessing data for one group#

Let us now check the contents for the group representing 2023 (the name of that group is 2023. We can get the values of that year from the grouped object using the get_group() method.

# Specify a month (as character string)
year = 2023

# Select the group
group_2023 = grouped_data.get_group(year)
# Let's see what we have
group_2023
Sector Community Name Category Crime Count Resident Count Date Year ID Community Center Point NEW_DATE Year Check
1 CENTRE BANFF TRAIL Theft OF Vehicle 2 4153 2023/10 2023 2023-OCT-BANFF TRAIL-Theft OF Vehicle POINT (-114.11512839716917 51.07421633024228) 2023-10-01 new
10 CENTRE ROSEDALE Break & Enter - Dwelling 1 1560 2023/04 2023 2023-APR-ROSEDALE-Break & Enter - Dwelling POINT (-114.07820713414625 51.062686663226856) 2023-04-01 new
15 CENTRE TUXEDO PARK Theft OF Vehicle 3 5326 2023/08 2023 2023-AUG-TUXEDO PARK-Theft OF Vehicle POINT (-114.06104493215598 51.07421777158121) 2023-08-01 new
30 NORTHEAST CALGARY INTERNATIONAL AIRPORT Theft FROM Vehicle 8 0 2023/05 2023 2023-MAY-CALGARY INTERNATIONAL AIRPORT-Theft F... POINT (-114.01171665690258 51.123556453987604) 2023-05-01 new
96 CENTRE ALTADORE Theft OF Vehicle 1 6942 2023/04 2023 2023-APR-ALTADORE-Theft OF Vehicle POINT (-114.10078239468564 51.015954118298694) 2023-04-01 new
... ... ... ... ... ... ... ... ... ... ... ...
67257 EAST SOUTHVIEW Break & Enter - Commercial 1 1805 2023/12 2023 2023-DEC-SOUTHVIEW-Break & Enter - Commercial POINT (-113.99733916298928 51.03415221260387) 2023-12-01 new
67258 NORTH SAGE HILL Theft FROM Vehicle 1 7924 2023/12 2023 2023-DEC-SAGE HILL-Theft FROM Vehicle POINT (-114.14068609335015 51.175616972779984) 2023-12-01 new
67259 CENTRE WINSTON HEIGHTS/MOUNTVIEW Break & Enter - Commercial 1 3635 2023/12 2023 2023-DEC-WINSTON HEIGHTS/MOUNTVIEW-Break & Ent... POINT (-114.04184874950579 51.075298802175126) 2023-12-01 new
67260 CENTRE SOUTH CALGARY Street Robbery 1 4442 2023/12 2023 2023-DEC-SOUTH CALGARY-Street Robbery POINT (-114.10207470148995 51.02680215136445) 2023-12-01 new
67261 NORTHEAST SKYLINE EAST Break & Enter - Commercial 1 0 2023/12 2023 2023-DEC-SKYLINE EAST-Break & Enter - Commercial POINT (-114.03893128621468 51.10015261262966) 2023-12-01 new

10777 rows × 11 columns

Ahaa! As we can see, a single group contains a DataFrame with values only for that specific year. Awesome!

This is really useful, because we can now use all the familiar DataFrame methods for calculating statistics, etc. for each specific group. We can, for example, calculate the average values for all variables using the statistical functions that we have seen already (e.g. mean, std, min, max, median, etc.).

In our example, we can use mean() function to calculate average of Crime Count in each group . Let’s calculate the mean for group_2023:

# Calculate the mean values all at one go
mean_values = group_2023["Crime Count"].mean()

# Let's see what we have
print(mean_values)
2.53085274195045

Aggregation#

Above, we saw how you can access data from a single group. In order to get information about all groups (all years) we can use a aggregation in groups.

# aggregate the grouped data by mean on only "Crime Count" column
new_aggregate_data = grouped_data.aggregate({"Crime Count": "mean"})

# print the new aggregate data
new_aggregate_data
Crime Count
NEW_DATE
2018 3.048227
2019 3.239705
2020 2.913442
2021 2.634971
2022 2.876989
2023 2.530853
# we can also use the agg function to calculate on multiple columns
new_aggregate_data_02 = grouped_data.agg({"Crime Count": "sum", "Resident Count": "mean"})

# print the new aggregate data
new_aggregate_data_02
Crime Count Resident Count
NEW_DATE
2018 35332 6596.933224
2019 38154 6536.832555
2020 31774 6580.625527
2021 28376 6502.551676
2022 32720 6380.497142
2023 27275 6385.120627

So, we can see that in 2019 we had relatively more crime happening!

Change column names#

We can also change column names in our new aggregated dataframe:

# change the column name of "Crime Count" to "Total Crime Count" and "Resident Count" to "Average Resident Count"
new_aggregate_data_02.rename(columns={"Crime Count": "Total Crime Count", "Resident Count": "Average Resident Count"}, inplace=True)

# print the new aggregate data
new_aggregate_data_02
Total Crime Count Average Resident Count
NEW_DATE
2018 35332 6596.933224
2019 38154 6536.832555
2020 31774 6580.625527
2021 28376 6502.551676
2022 32720 6380.497142
2023 27275 6385.120627

Note

More infromation about pandas grouping and aggregating is available in Chapter 10: Data Aggregation and Group Operations

data
Sector Community Name Category Crime Count Resident Count Date Year ID Community Center Point NEW_DATE Year Check
0 NORTHWEST ARBOUR LAKE Theft OF Vehicle 2 10619 2022/04 2022 2022-APR-ARBOUR LAKE-Theft OF Vehicle POINT (-114.20767498075155 51.1325947114686) 2022-04-01 old
1 CENTRE BANFF TRAIL Theft OF Vehicle 2 4153 2023/10 2023 2023-OCT-BANFF TRAIL-Theft OF Vehicle POINT (-114.11512839716917 51.07421633024228) 2023-10-01 new
2 EAST DOVER Theft OF Vehicle 5 10351 2022/12 2022 2022-DEC-DOVER-Theft OF Vehicle POINT (-113.99305400906283 51.02256772250409) 2022-12-01 old
3 CENTRE GREENVIEW Assault (Non-domestic) 2 1906 2020/12 2020 2020-DEC-GREENVIEW-Assault (Non-domestic) POINT (-114.05746990262463 51.09485613506574) 2020-12-01 old
4 NORTHWEST HAMPTONS Theft FROM Vehicle 3 7382 2019/08 2019 2019-AUG-HAMPTONS-Theft FROM Vehicle POINT (-114.14668419231347 51.14509283969437) 2019-08-01 old
... ... ... ... ... ... ... ... ... ... ... ...
67257 EAST SOUTHVIEW Break & Enter - Commercial 1 1805 2023/12 2023 2023-DEC-SOUTHVIEW-Break & Enter - Commercial POINT (-113.99733916298928 51.03415221260387) 2023-12-01 new
67258 NORTH SAGE HILL Theft FROM Vehicle 1 7924 2023/12 2023 2023-DEC-SAGE HILL-Theft FROM Vehicle POINT (-114.14068609335015 51.175616972779984) 2023-12-01 new
67259 CENTRE WINSTON HEIGHTS/MOUNTVIEW Break & Enter - Commercial 1 3635 2023/12 2023 2023-DEC-WINSTON HEIGHTS/MOUNTVIEW-Break & Ent... POINT (-114.04184874950579 51.075298802175126) 2023-12-01 new
67260 CENTRE SOUTH CALGARY Street Robbery 1 4442 2023/12 2023 2023-DEC-SOUTH CALGARY-Street Robbery POINT (-114.10207470148995 51.02680215136445) 2023-12-01 new
67261 NORTHEAST SKYLINE EAST Break & Enter - Commercial 1 0 2023/12 2023 2023-DEC-SKYLINE EAST-Break & Enter - Commercial POINT (-114.03893128621468 51.10015261262966) 2023-12-01 new

67193 rows × 11 columns

More Parctice on Grouping and Aggregating#

We can aggregate DataFrame by more than two columns at the same time.

For example, we want to see the number of crimes by category in each year. We can use .groupby() on two columns Categor and NEW_DATE

# group the data by NEW_DATE yearly and Category and calculate the sum of "Crime Count"
grouped_data_02 = data.groupby([data["NEW_DATE"].dt.year, "Category"])

# aggregate the grouped data by sum on only "Crime Count" column
new_aggregate_data_03 = grouped_data_02.aggregate({"Crime Count": "sum"})

# change the column name of "Crime Count" to "Total Crime Count"
new_aggregate_data_03.rename(columns={"Crime Count": "Total Crime Count"}, inplace=True)
# print the new aggregate data
new_aggregate_data_03
Total Crime Count
NEW_DATE Category
2018 Assault (Non-domestic) 3828
Break & Enter - Commercial 5054
Break & Enter - Dwelling 2693
Break & Enter - Other Premises 2103
Commercial Robbery 305
Street Robbery 658
Theft FROM Vehicle 12624
Theft OF Vehicle 6056
Violence Other (Non-domestic) 2011
2019 Assault (Non-domestic) 3913
Break & Enter - Commercial 5782
Break & Enter - Dwelling 2401
Break & Enter - Other Premises 2303
Commercial Robbery 382
Street Robbery 649
Theft FROM Vehicle 14723
Theft OF Vehicle 6028
Violence Other (Non-domestic) 1973
2020 Assault (Non-domestic) 3529
Break & Enter - Commercial 4203
Break & Enter - Dwelling 1686
Break & Enter - Other Premises 2500
Commercial Robbery 281
Street Robbery 525
Theft FROM Vehicle 12387
Theft OF Vehicle 4649
Violence Other (Non-domestic) 2014
2021 Assault (Non-domestic) 4036
Break & Enter - Commercial 3212
Break & Enter - Dwelling 1577
Break & Enter - Other Premises 1835
Commercial Robbery 359
Street Robbery 575
Theft FROM Vehicle 10158
Theft OF Vehicle 4445
Violence Other (Non-domestic) 2179
2022 Assault (Non-domestic) 4139
Break & Enter - Commercial 4273
Break & Enter - Dwelling 1586
Break & Enter - Other Premises 1948
Commercial Robbery 391
Street Robbery 536
Theft FROM Vehicle 12344
Theft OF Vehicle 5419
Violence Other (Non-domestic) 2084
2023 Assault (Non-domestic) 4576
Break & Enter - Commercial 3344
Break & Enter - Dwelling 1525
Break & Enter - Other Premises 1298
Commercial Robbery 311
Street Robbery 693
Theft FROM Vehicle 8951
Theft OF Vehicle 4448
Violence Other (Non-domestic) 2129