Processing data with pandas

Processing data with pandas#

Attention

These tutorials are meant to help you with assignment tasks by giving hints, not to teach Python programming in a detailed step-by-step manner.

As mentioned in the first lecture of the course, this course requires basic knowledge of Python programming language.

If you are not familiar with them, try catching up with the basics as fast as you can. Some useful resources are:

pandas online documentation

Python for Data Analysis (3rd Edition, 2022)

import warnings
warnings.filterwarnings("ignore")

Our assignment 1 is about to learn how to read and explore data files in Python. We will focus on using pandas which is an open-source package for data analysis in Python. pandas is an excellent toolkit for working with real world data that often have a tabular structure (rows and columns).

Input data: Community Crime Statistics Map#

Our input data in this tutorial is a text file containing community crimes statistics map in city of Calgary, Alberta, Canda retrieved from City of Calgary Open Data Portal:

File name: [Community_Crime_Statistics_20240120.csv] (you can have a look at the file before reading it in using pandas!)
You can download the data from the link provided: City of Calgary Open Data Portal
Data is provided monthly by the Calgary Police Service. And includes the location of crime, time, category, crime count, and resident count.
There are totally 67,262 rows and 10 columns in this dataset.

Loading Data#

Next, we wll read the input data file, and store the contents of that file in a variable called data Using the pandas.read_csv() function:

# Importing the libraries
import pandas as pd

# load data

# Read the file using pandas
data = pd.read_csv("Community_Crime_Statistics_20240120.csv", sep=',')

Reading different file formats

Check out the pandas documentation about input and output functions and Chapter 6 in McKinney (2022) for more details about reading data.

Let’s now print the dataframe and see what it looks like.

data

	Sector	Community Name	Category	Crime Count	Resident Count	Date	Year	Month	ID	Community Center Point
0	NORTHWEST	ARBOUR LAKE	Theft OF Vehicle	2	10619.0	2022/04	2022	APR	2022-APR-ARBOUR LAKE-Theft OF Vehicle	POINT (-114.20767498075155 51.1325947114686)
1	CENTRE	BANFF TRAIL	Theft OF Vehicle	2	4153.0	2023/10	2023	OCT	2023-OCT-BANFF TRAIL-Theft OF Vehicle	POINT (-114.11512839716917 51.07421633024228)
2	EAST	DOVER	Theft OF Vehicle	5	10351.0	2022/12	2022	DEC	2022-DEC-DOVER-Theft OF Vehicle	POINT (-113.99305400906283 51.02256772250409)
3	CENTRE	GREENVIEW	Assault (Non-domestic)	2	1906.0	2020/12	2020	DEC	2020-DEC-GREENVIEW-Assault (Non-domestic)	POINT (-114.05746990262463 51.09485613506574)
4	NORTHWEST	HAMPTONS	Theft FROM Vehicle	3	7382.0	2019/08	2019	AUG	2019-AUG-HAMPTONS-Theft FROM Vehicle	POINT (-114.14668419231347 51.14509283969437)
...	...	...	...	...	...	...	...	...	...	...
67930	SOUTH	SOMERSET	Assault (Non-domestic)	2	8543.0	2020/05	2020	MAY	2020-MAY-SOMERSET-Assault (Non-domestic)	POINT (-114.08138084069279 50.89891142227545)
67931	CENTRE	LOWER MOUNT ROYAL	Assault (Non-domestic)	1	3457.0	2023/07	2023	JUL	2023-JUL-LOWER MOUNT ROYAL-Assault (Non-domestic)	POINT (-114.08649939756594 51.0365031977319)
67932	NORTHEAST	MARLBOROUGH PARK	Break & Enter - Dwelling	1	8523.0	2021/04	2021	APR	2021-APR-MARLBOROUGH PARK-Break & Enter - Dwel...	POINT (-113.94671708302872 51.059688495333184)
67933	SOUTH	DEER RIDGE	Break & Enter - Commercial	1	3973.0	2020/08	2020	AUG	2020-AUG-DEER RIDGE-Break & Enter - Commercial	POINT (-114.02023404987288 50.926573869162)
67934	NORTH	SAGE HILL	Theft FROM Vehicle	1	7924.0	2020/11	2020	NOV	2020-NOV-SAGE HILL-Theft FROM Vehicle	POINT (-114.14068609335015 51.175616972779984)

67935 rows × 10 columns

We can use the data.head() function of the pandas DataFrame object to quickly check the top rows. We can also check the last rows of the data using data.tail()

# print the first 5 rows of data
data.head() 

	Sector	Community Name	Category	Crime Count	Resident Count	Date	Year	Month	ID	Community Center Point
0	NORTHWEST	ARBOUR LAKE	Theft OF Vehicle	2	10619.0	2022/04	2022	APR	2022-APR-ARBOUR LAKE-Theft OF Vehicle	POINT (-114.20767498075155 51.1325947114686)
1	CENTRE	BANFF TRAIL	Theft OF Vehicle	2	4153.0	2023/10	2023	OCT	2023-OCT-BANFF TRAIL-Theft OF Vehicle	POINT (-114.11512839716917 51.07421633024228)
2	EAST	DOVER	Theft OF Vehicle	5	10351.0	2022/12	2022	DEC	2022-DEC-DOVER-Theft OF Vehicle	POINT (-113.99305400906283 51.02256772250409)
3	CENTRE	GREENVIEW	Assault (Non-domestic)	2	1906.0	2020/12	2020	DEC	2020-DEC-GREENVIEW-Assault (Non-domestic)	POINT (-114.05746990262463 51.09485613506574)
4	NORTHWEST	HAMPTONS	Theft FROM Vehicle	3	7382.0	2019/08	2019	AUG	2019-AUG-HAMPTONS-Theft FROM Vehicle	POINT (-114.14668419231347 51.14509283969437)

# print the last 5 rows of data
data.tail()

	Sector	Community Name	Category	Crime Count	Resident Count	Date	Year	Month	ID	Community Center Point
67930	SOUTH	SOMERSET	Assault (Non-domestic)	2	8543.0	2020/05	2020	MAY	2020-MAY-SOMERSET-Assault (Non-domestic)	POINT (-114.08138084069279 50.89891142227545)
67931	CENTRE	LOWER MOUNT ROYAL	Assault (Non-domestic)	1	3457.0	2023/07	2023	JUL	2023-JUL-LOWER MOUNT ROYAL-Assault (Non-domestic)	POINT (-114.08649939756594 51.0365031977319)
67932	NORTHEAST	MARLBOROUGH PARK	Break & Enter - Dwelling	1	8523.0	2021/04	2021	APR	2021-APR-MARLBOROUGH PARK-Break & Enter - Dwel...	POINT (-113.94671708302872 51.059688495333184)
67933	SOUTH	DEER RIDGE	Break & Enter - Commercial	1	3973.0	2020/08	2020	AUG	2020-AUG-DEER RIDGE-Break & Enter - Commercial	POINT (-114.02023404987288 50.926573869162)
67934	NORTH	SAGE HILL	Theft FROM Vehicle	1	7924.0	2020/11	2020	NOV	2020-NOV-SAGE HILL-Theft FROM Vehicle	POINT (-114.14068609335015 51.175616972779984)

Let’s see some basic info about the data (The number of column, rows, the data type of each column etc.)

# Check the data info
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67935 entries, 0 to 67934
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Sector                  67904 non-null  object 
 1   Community Name          67935 non-null  object 
 2   Category                67935 non-null  object 
 3   Crime Count             67935 non-null  int64  
 4   Resident Count          67866 non-null  float64
 5   Date                    67935 non-null  object 
 6   Year                    67935 non-null  int64  
 7   Month                   67935 non-null  object 
 8   ID                      67935 non-null  object 
 9   Community Center Point  67904 non-null  object 
dtypes: float64(1), int64(2), object(7)
memory usage: 5.2+ MB

Also some basic stats about numerical values in columns that contain numbers (Crime count, Resident count, Year)

# Cehck the data description
data.describe()

	Crime Count	Resident Count	Year
count	67935.000000	67866.000000	67935.000000
mean	2.880651	6500.622653	2020.460146
std	3.685793	5455.404451	1.712744
min	1.000000	0.000000	2018.000000
25%	1.000000	2263.000000	2019.000000
50%	2.000000	5957.000000	2020.000000
75%	3.000000	9244.000000	2022.000000
max	110.000000	25710.000000	2023.000000

Basic Calculations in DataFrame#

One of the most common things to do in pandas is to create new columns based on calculations between different variables (columns).

Let’s say we want to create a new column in which we want to calculate the “Crime Count +1”. We can create a new column new_crime_count in our DataFrame by specifying the name of the column and giving it some default value (in this case the decimal number 0.0).

# Define a new column "ew_crime_count"
data["new_crime_count"] = 0.0

# Check how the dataframe looks like:
data

	Sector	Community Name	Category	Crime Count	Resident Count	Date	Year	Month	ID	Community Center Point	new_crime_count
0	NORTHWEST	ARBOUR LAKE	Theft OF Vehicle	2	10619.0	2022/04	2022	APR	2022-APR-ARBOUR LAKE-Theft OF Vehicle	POINT (-114.20767498075155 51.1325947114686)	0.0
1	CENTRE	BANFF TRAIL	Theft OF Vehicle	2	4153.0	2023/10	2023	OCT	2023-OCT-BANFF TRAIL-Theft OF Vehicle	POINT (-114.11512839716917 51.07421633024228)	0.0
2	EAST	DOVER	Theft OF Vehicle	5	10351.0	2022/12	2022	DEC	2022-DEC-DOVER-Theft OF Vehicle	POINT (-113.99305400906283 51.02256772250409)	0.0
3	CENTRE	GREENVIEW	Assault (Non-domestic)	2	1906.0	2020/12	2020	DEC	2020-DEC-GREENVIEW-Assault (Non-domestic)	POINT (-114.05746990262463 51.09485613506574)	0.0
4	NORTHWEST	HAMPTONS	Theft FROM Vehicle	3	7382.0	2019/08	2019	AUG	2019-AUG-HAMPTONS-Theft FROM Vehicle	POINT (-114.14668419231347 51.14509283969437)	0.0
...	...	...	...	...	...	...	...	...	...	...	...
67930	SOUTH	SOMERSET	Assault (Non-domestic)	2	8543.0	2020/05	2020	MAY	2020-MAY-SOMERSET-Assault (Non-domestic)	POINT (-114.08138084069279 50.89891142227545)	0.0
67931	CENTRE	LOWER MOUNT ROYAL	Assault (Non-domestic)	1	3457.0	2023/07	2023	JUL	2023-JUL-LOWER MOUNT ROYAL-Assault (Non-domestic)	POINT (-114.08649939756594 51.0365031977319)	0.0
67932	NORTHEAST	MARLBOROUGH PARK	Break & Enter - Dwelling	1	8523.0	2021/04	2021	APR	2021-APR-MARLBOROUGH PARK-Break & Enter - Dwel...	POINT (-113.94671708302872 51.059688495333184)	0.0
67933	SOUTH	DEER RIDGE	Break & Enter - Commercial	1	3973.0	2020/08	2020	AUG	2020-AUG-DEER RIDGE-Break & Enter - Commercial	POINT (-114.02023404987288 50.926573869162)	0.0
67934	NORTH	SAGE HILL	Theft FROM Vehicle	1	7924.0	2020/11	2020	NOV	2020-NOV-SAGE HILL-Theft FROM Vehicle	POINT (-114.14068609335015 51.175616972779984)	0.0

67935 rows × 11 columns

Now let’s update the column new_crime_count by calculating the the column Crime Count + 1.

# Calculate `Crime Count` + 1
data["new_crime_count"] = data["Crime Count"] + 1

# Check the result
data.head()

	Sector	Community Name	Category	Crime Count	Resident Count	Date	Year	Month	ID	Community Center Point	new_crime_count
0	NORTHWEST	ARBOUR LAKE	Theft OF Vehicle	2	10619.0	2022/04	2022	APR	2022-APR-ARBOUR LAKE-Theft OF Vehicle	POINT (-114.20767498075155 51.1325947114686)	3
1	CENTRE	BANFF TRAIL	Theft OF Vehicle	2	4153.0	2023/10	2023	OCT	2023-OCT-BANFF TRAIL-Theft OF Vehicle	POINT (-114.11512839716917 51.07421633024228)	3
2	EAST	DOVER	Theft OF Vehicle	5	10351.0	2022/12	2022	DEC	2022-DEC-DOVER-Theft OF Vehicle	POINT (-113.99305400906283 51.02256772250409)	6
3	CENTRE	GREENVIEW	Assault (Non-domestic)	2	1906.0	2020/12	2020	DEC	2020-DEC-GREENVIEW-Assault (Non-domestic)	POINT (-114.05746990262463 51.09485613506574)	3
4	NORTHWEST	HAMPTONS	Theft FROM Vehicle	3	7382.0	2019/08	2019	AUG	2019-AUG-HAMPTONS-Theft FROM Vehicle	POINT (-114.14668419231347 51.14509283969437)	4

You can even calculate the sum of two columns. For example: data["new_column"] = data["Crime Count"] + data["Resident Count"]

Selecting rows and columns#

One common way of selecting only specific rows from your DataFrame is done via this syntax: dataframe[start_index:stop_index]

Let’s select the first six rows and assign them to a variable called selection:

# Select first five rows of dataframe using row index values
selection = data[0:6]

# Print `selection`
selection

	Sector	Community Name	Category	Crime Count	Resident Count	Date	Year	Month	ID	Community Center Point	new_crime_count
0	NORTHWEST	ARBOUR LAKE	Theft OF Vehicle	2	10619.0	2022/04	2022	APR	2022-APR-ARBOUR LAKE-Theft OF Vehicle	POINT (-114.20767498075155 51.1325947114686)	3
1	CENTRE	BANFF TRAIL	Theft OF Vehicle	2	4153.0	2023/10	2023	OCT	2023-OCT-BANFF TRAIL-Theft OF Vehicle	POINT (-114.11512839716917 51.07421633024228)	3
2	EAST	DOVER	Theft OF Vehicle	5	10351.0	2022/12	2022	DEC	2022-DEC-DOVER-Theft OF Vehicle	POINT (-113.99305400906283 51.02256772250409)	6
3	CENTRE	GREENVIEW	Assault (Non-domestic)	2	1906.0	2020/12	2020	DEC	2020-DEC-GREENVIEW-Assault (Non-domestic)	POINT (-114.05746990262463 51.09485613506574)	3
4	NORTHWEST	HAMPTONS	Theft FROM Vehicle	3	7382.0	2019/08	2019	AUG	2019-AUG-HAMPTONS-Theft FROM Vehicle	POINT (-114.14668419231347 51.14509283969437)	4
5	SOUTH	LAKE BONAVISTA	Theft OF Vehicle	2	10293.0	2022/08	2022	AUG	2022-AUG-LAKE BONAVISTA-Theft OF Vehicle	POINT (-114.05133665127042 50.939803390308995)	3

It is also possible to control which columns are chosen when selecting a subset of rows. In this case we will use pandas.DataFrame.loc which selects data based on axis labels (row labels and column labels) with this syntax : .loc[start_index:stop_index, list_of_columns]

Let’s select “Category” and “Crime Count” values from rows 0-5:

# Select the two column values on rows 0-5
selection = data.loc[0:5, ["Category", "Crime Count"]]

# Print `selection`
selection

	Category	Crime Count
0	Theft OF Vehicle	2
1	Theft OF Vehicle	2
2	Theft OF Vehicle	5
3	Assault (Non-domestic)	2
4	Theft FROM Vehicle	3
5	Theft OF Vehicle	2

Filtering and updating#

Filter Columns:#

Sometimes we don’t need all the information in the data. We can select specific columns based on the column values.

For example, I don’t need the collumn Month and want to keep other columns. We can drop the column by using the .drop() function. (e.g. data.drop('Month', axis=1, inplace=True))

axis=1 means drop the column, inplace=True means drop the column in the original data, (for droping roaws we can use axis=0 which means drop the row)

# drop the extra columns
data.drop(['Month'], axis=1, inplace=True)

Note

Alternatively, instead of using .drop() to drop unneccesary columns, we can keep all the columns that we need. The basic syntax is dataframe[value], where value can be a single column name, or a list of column names. In our example, instead of dropping Month, I can keep all other columns with the mentioned suntax.

# Keep all columns except the Month column
data = data[['Sector', 'Community Name', 'Category', 'Crime Count', 'Resident Count', 'Date', 'Year', 'ID', 'Community Center Point']]

data.head()

	Sector	Community Name	Category	Crime Count	Resident Count	Date	Year	ID	Community Center Point
0	NORTHWEST	ARBOUR LAKE	Theft OF Vehicle	2	10619.0	2022/04	2022	2022-APR-ARBOUR LAKE-Theft OF Vehicle	POINT (-114.20767498075155 51.1325947114686)
1	CENTRE	BANFF TRAIL	Theft OF Vehicle	2	4153.0	2023/10	2023	2023-OCT-BANFF TRAIL-Theft OF Vehicle	POINT (-114.11512839716917 51.07421633024228)
2	EAST	DOVER	Theft OF Vehicle	5	10351.0	2022/12	2022	2022-DEC-DOVER-Theft OF Vehicle	POINT (-113.99305400906283 51.02256772250409)
3	CENTRE	GREENVIEW	Assault (Non-domestic)	2	1906.0	2020/12	2020	2020-DEC-GREENVIEW-Assault (Non-domestic)	POINT (-114.05746990262463 51.09485613506574)
4	NORTHWEST	HAMPTONS	Theft FROM Vehicle	3	7382.0	2019/08	2019	2019-AUG-HAMPTONS-Theft FROM Vehicle	POINT (-114.14668419231347 51.14509283969437)

Filter Rows#

One really useful feature in pandas is the ability to easily filter and select rows based on a conditional statement.

# Select rows that are in 2023
crimes_2023 = data.loc[data["Year"] > 2022]

# Print `crimes_2023`
crimes_2023.head()

	Sector	Community Name	Category	Crime Count	Resident Count	Date	Year	ID	Community Center Point
1	CENTRE	BANFF TRAIL	Theft OF Vehicle	2	4153.0	2023/10	2023	2023-OCT-BANFF TRAIL-Theft OF Vehicle	POINT (-114.11512839716917 51.07421633024228)
10	CENTRE	ROSEDALE	Break & Enter - Dwelling	1	1560.0	2023/04	2023	2023-APR-ROSEDALE-Break & Enter - Dwelling	POINT (-114.07820713414625 51.062686663226856)
15	CENTRE	TUXEDO PARK	Theft OF Vehicle	3	5326.0	2023/08	2023	2023-AUG-TUXEDO PARK-Theft OF Vehicle	POINT (-114.06104493215598 51.07421777158121)
30	NORTHEAST	CALGARY INTERNATIONAL AIRPORT	Theft FROM Vehicle	8	0.0	2023/05	2023	2023-MAY-CALGARY INTERNATIONAL AIRPORT-Theft F...	POINT (-114.01171665690258 51.123556453987604)
96	CENTRE	ALTADORE	Theft OF Vehicle	1	6942.0	2023/04	2023	2023-APR-ALTADORE-Theft OF Vehicle	POINT (-114.10078239468564 51.015954118298694)

Or even multiple conditions at the same time:

# filter the data to show crime with crime count greater than 5 for year 2023
crime_2023_selected = data.loc[(data["Year"] > 2022) & (data["Crime Count"] >= 5)]

# Print `crime_2023_selected`
crime_2023_selected.head()

	Sector	Community Name	Category	Crime Count	Resident Count	Date	Year	ID	Community Center Point
30	NORTHEAST	CALGARY INTERNATIONAL AIRPORT	Theft FROM Vehicle	8	0.0	2023/05	2023	2023-MAY-CALGARY INTERNATIONAL AIRPORT-Theft F...	POINT (-114.01171665690258 51.123556453987604)
13017	SOUTHEAST	AUBURN BAY	Theft FROM Vehicle	5	17607.0	2023/03	2023	2023-MAR-AUBURN BAY-Theft FROM Vehicle	POINT (-113.95779514351926 50.89269507687)
20753	SOUTHEAST	SETON	Theft OF Vehicle	5	1134.0	2023/10	2023	2023-OCT-SETON-Theft OF Vehicle	POINT (-113.94793123802683 50.874157662841256)
20934	SOUTHEAST	OGDEN	Theft FROM Vehicle	5	8576.0	2023/09	2023	2023-SEP-OGDEN-Theft FROM Vehicle	POINT (-114.01244227109757 50.991878337981746)
29683	SOUTHEAST	OGDEN	Theft FROM Vehicle	7	8576.0	2023/11	2023	2023-NOV-OGDEN-Theft FROM Vehicle	POINT (-114.01244227109757 50.991878337981746)

Data Cleaning#

Dealing with missing data#

As you have learned, most of the time when we work with data, we face missing values or incomplete data. We can check the number of missing values in Dataframe with this syntax:

# print the number of missing values in each column
data.isnull().sum() 

Sector                    31
Community Name             0
Category                   0
Crime Count                0
Resident Count            69
Date                       0
Year                       0
ID                         0
Community Center Point    31
dtype: int64

As we can see, in Column “Sector”, “Resident Count”, and “Community Center Pint” we have 31, 69, 31 missing values respectively.

OThere are many way to handle missing values based on the application. One way of cleaning data is to remover rows with missing values. If we want to remove the missing values based on column (only remove those rows that are missing the value in specific column) we can use this syntax: .dropna(subset=['column_name'], inplace=True)

# drop the rows with missing values for column "Community Center Point"
data.dropna(subset=['Community Center Point'], inplace=True)

# print the number of missing values in each column
data.isnull().sum() 

Sector                     0
Community Name             0
Category                   0
Crime Count                0
Resident Count            38
Date                       0
Year                       0
ID                         0
Community Center Point     0
dtype: int64

As we can see, we remove all the rows that had missing values for column Community Center Point. If we want to remove all the row that has any missing value, no matter in which column, we can use this syntax:

# remove all the rows with missing values
data.dropna(inplace=True)

# print the number of missing values in each column
data.isnull().sum() 

Sector                    0
Community Name            0
Category                  0
Crime Count               0
Resident Count            0
Date                      0
Year                      0
ID                        0
Community Center Point    0
dtype: int64

Removing Duplicates#

Duplicate rows may be found in a DataFrame for any number of reasons.

The DataFrame method duplicated returns a Boolean Series indicating whether each row is a duplicate (its column values are exactly equal to those in an earlier row) or not:

# check for duplicate rows
data.duplicated()

      False
      False
      False
      False
      False
         ...  
   True
   True
   True
   True
   True
Length: 67866, dtype: bool

# count all the duplicate rows
data.duplicated().sum()

np.int64(673)

We have 673 rows with identical values. Relatedly, drop_duplicates returns a DataFrame with rows where the duplicated array is False filtered out:

# drop the duplicate rows
data.drop_duplicates(inplace=True)

# print the number of duplicate rows
data.duplicated().sum()

np.int64(0)

Note

You can check for more infromation about data cleaning with pandas by looking at : Data Cleaning and Preparation

Data type conversions#

There are occasions where you’ll need to convert data stored within a Series to another data type, for example, from floating point to integer.

For values in pandas DataFrames and Series, we can use the .astype() method.

## Data type conversions

data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 67193 entries, 0 to 67261
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Sector                  67193 non-null  object 
 1   Community Name          67193 non-null  object 
 2   Category                67193 non-null  object 
 3   Crime Count             67193 non-null  int64  
 4   Resident Count          67193 non-null  float64
 5   Date                    67193 non-null  object 
 6   Year                    67193 non-null  int64  
 7   ID                      67193 non-null  object 
 8   Community Center Point  67193 non-null  object 
dtypes: float64(1), int64(2), object(6)
memory usage: 5.1+ MB

As we can see the type of “Resident Count” is float64. We want to convert this column type to integer:

# convert the "Resident Count" to integer
data["Resident Count"] = data["Resident Count"].astype(int)

# print the data info
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 67193 entries, 0 to 67261
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Sector                  67193 non-null  object
 1   Community Name          67193 non-null  object
 2   Category                67193 non-null  object
 3   Crime Count             67193 non-null  int64 
 4   Resident Count          67193 non-null  int64 
 5   Date                    67193 non-null  object
 6   Year                    67193 non-null  int64 
 7   ID                      67193 non-null  object
 8   Community Center Point  67193 non-null  object
dtypes: int64(3), object(6)
memory usage: 5.1+ MB

Datetime#

In pandas, we can also convert dates and times into a new data type called datetime using the pandas.to_datetime function.

# Convert character strings to datetime
data["NEW_DATE"] = pd.to_datetime(data["Date"])

# print the data info
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 67193 entries, 0 to 67261
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Sector                  67193 non-null  object        
 1   Community Name          67193 non-null  object        
 2   Category                67193 non-null  object        
 3   Crime Count             67193 non-null  int64         
 4   Resident Count          67193 non-null  int64         
 5   Date                    67193 non-null  object        
 6   Year                    67193 non-null  int64         
 7   ID                      67193 non-null  object        
 8   Community Center Point  67193 non-null  object        
 9   NEW_DATE                67193 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(3), object(6)
memory usage: 5.6+ MB

As we can see, the new column NEW_DATE has the type of “datetime64” which is the format of datetime in pandas. From now on pandas underestand that this is a date and time and the sequence in time series matter.

Pandas Series datetime properties

There are several methods available for accessing information about the properties of datetime values. You can read more about datetime properties from the pandas documentation.

With the datetime column, we can now extract different time units using the pandas.Series.dt accessor.

# print NEW_DATE column 
data["NEW_DATE"].dt.year

      2022
      2023
      2022
      2020
      2019
         ... 
  2023
  2023
  2023
  2023
  2023
Name: NEW_DATE, Length: 67193, dtype: int32

# print NEW_DATE column 
data["NEW_DATE"].dt.month

       4
      10
      12
      12
       8
         ..
  12
  12
  12
  12
  12
Name: NEW_DATE, Length: 67193, dtype: int32

# print NEW_DATE column 
data["NEW_DATE"].dt.day

      1
      1
      1
      1
      1
        ..
  1
  1
  1
  1
  1
Name: NEW_DATE, Length: 67193, dtype: int32

Iterating over rows#

We can use an operation on the DataFrame one row at a time using a for loop and the iterrows() method. This will allow us to iterate row by row using iterrows() in a for loop to repeat a given process for each row in a pandas DataFrame. Please note that iterating over rows is a rather inefficient approach, but it is still useful to understand the logic behind the iteration.

When using the iterrows() method it is important to understand that iterrows() accesses not only the values of one row, but also the index of the row as well.

Let’s start with a simple for loop that goes through each row in our DataFrame.

# Iterate over the rows
for idx, row in data.iterrows():

    # Print the index value
    print(f"Index: {idx} \n\n")


    # Print the row
    print(f"Row :\n\n{row}\n")

    break

Index: 0 


Row :

Sector                                                       NORTHWEST
Community Name                                             ARBOUR LAKE
Category                                              Theft OF Vehicle
Crime Count                                                          2
Resident Count                                                   10619
Date                                                           2022/04
Year                                                              2022
ID                               2022-APR-ARBOUR LAKE-Theft OF Vehicle
Community Center Point    POINT (-114.20767498075155 51.1325947114686)
NEW_DATE                                           2022-04-01 00:00:00
Name: 0, dtype: object

Breaking a loop

When developing code in a for loop, you do not always need to go through the entire loop in order to test things out. The break statement in Python terminates the current loop whereever it is placed and we can use it here just to check out the values on the first row (based on the first iteration in the for loop. This can be helpful when working with a large data file or dataset, because you might not want to print thousands of values to the screen! For more information, check out this tutorial.

We can see that the idx variable indeed contains the index value at position 0 (the first row) and the row variable contains all the data from that given row stored as a pandas Series.

Let’s now as an example create an empty column “Year Check” and check if crime year is greater than 2022 in a new column “Year Check” we enter “new”, else “old”.

# iterate in the dataframe and check if year is greater than 2022 in a new column "Year Check" enter "new" else "old"
for idx, row in data.iterrows():
    if row["Year"] > 2022:
        data.loc[idx, "Year Check"] = "new"
    else:
        data.loc[idx, "Year Check"] = "old"
        

Finally, let’s see how our DataFrame looks like now after the calculations above.

# print the first 5 rows of data
data.head()

	Sector	Community Name	Category	Crime Count	Resident Count	Date	Year	ID	Community Center Point	NEW_DATE	Year Check
0	NORTHWEST	ARBOUR LAKE	Theft OF Vehicle	2	10619	2022/04	2022	2022-APR-ARBOUR LAKE-Theft OF Vehicle	POINT (-114.20767498075155 51.1325947114686)	2022-04-01	old
1	CENTRE	BANFF TRAIL	Theft OF Vehicle	2	4153	2023/10	2023	2023-OCT-BANFF TRAIL-Theft OF Vehicle	POINT (-114.11512839716917 51.07421633024228)	2023-10-01	new
2	EAST	DOVER	Theft OF Vehicle	5	10351	2022/12	2022	2022-DEC-DOVER-Theft OF Vehicle	POINT (-113.99305400906283 51.02256772250409)	2022-12-01	old
3	CENTRE	GREENVIEW	Assault (Non-domestic)	2	1906	2020/12	2020	2020-DEC-GREENVIEW-Assault (Non-domestic)	POINT (-114.05746990262463 51.09485613506574)	2020-12-01	old
4	NORTHWEST	HAMPTONS	Theft FROM Vehicle	3	7382	2019/08	2019	2019-AUG-HAMPTONS-Theft FROM Vehicle	POINT (-114.14668419231347 51.14509283969437)	2019-08-01	old

Aggregating data in pandas by grouping#

Here, we will learn how to use pandas.DataFrame.groupby, which is a handy method for combining large amounts of data and computing statistics for subgroups.

In our case, we will use the groupby method to calculate the average number of crimes for each year through these three steps:

Grouping the data based on the year
Calculating the average for each year (each group)
Storing those values into a new DataFrame called grouped_data

We have rows of crime data, and several observations per month. Our goal is to create an aggreated data frame that would have only one row per year. (average crime count per year)

To condense our data to yearly average values we can group our data based on the unique year.

We can use NEW_DATE column, which we previously define and datetime format. We try to group this column on yearly basis and calculate the average number of crime for each group.

# group the data by "Date" on yearly basis and calculate average of "Crime Count"
grouped_data = data.groupby(data["NEW_DATE"].dt.year)

Now, let’s explore the new variable grouped_data.

# print grouped data
print(f"Type of grouped data is : {type(grouped_data)}")

print(f"Length of grouped Data: {len(grouped_data)}\n")

Type of grouped data is : <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
Length of grouped Data: 6

We have a new object with type DataFrameGroupBy with 6 groups. Length of the grouped object should be the same as the number of unique values in the column we used for grouping. For each unique value, there is a group of data. (in this example we have data for 6 years)

Let’s explore our grouped data even further.

We can check the “names” of each group.

# Print grouoped data keys

grouped_data.groups.keys()

dict_keys([2018, 2019, 2020, 2021, 2022, 2023])

Note

It is also possible to create combinations of two or more columns on-the-fly when grouping the data:

# Group the data 
grouped = data.groupby(['column1', 'column2'])

Accessing data for one group#

Let us now check the contents for the group representing 2023 (the name of that group is 2023. We can get the values of that year from the grouped object using the get_group() method.

# Specify a month (as character string)
year = 2023

# Select the group
group_2023 = grouped_data.get_group(year)

# Let's see what we have
group_2023

	Sector	Community Name	Category	Crime Count	Resident Count	Date	Year	ID	Community Center Point	NEW_DATE	Year Check
1	CENTRE	BANFF TRAIL	Theft OF Vehicle	2	4153	2023/10	2023	2023-OCT-BANFF TRAIL-Theft OF Vehicle	POINT (-114.11512839716917 51.07421633024228)	2023-10-01	new
10	CENTRE	ROSEDALE	Break & Enter - Dwelling	1	1560	2023/04	2023	2023-APR-ROSEDALE-Break & Enter - Dwelling	POINT (-114.07820713414625 51.062686663226856)	2023-04-01	new
15	CENTRE	TUXEDO PARK	Theft OF Vehicle	3	5326	2023/08	2023	2023-AUG-TUXEDO PARK-Theft OF Vehicle	POINT (-114.06104493215598 51.07421777158121)	2023-08-01	new
30	NORTHEAST	CALGARY INTERNATIONAL AIRPORT	Theft FROM Vehicle	8	0	2023/05	2023	2023-MAY-CALGARY INTERNATIONAL AIRPORT-Theft F...	POINT (-114.01171665690258 51.123556453987604)	2023-05-01	new
96	CENTRE	ALTADORE	Theft OF Vehicle	1	6942	2023/04	2023	2023-APR-ALTADORE-Theft OF Vehicle	POINT (-114.10078239468564 51.015954118298694)	2023-04-01	new
...	...	...	...	...	...	...	...	...	...	...	...
67257	EAST	SOUTHVIEW	Break & Enter - Commercial	1	1805	2023/12	2023	2023-DEC-SOUTHVIEW-Break & Enter - Commercial	POINT (-113.99733916298928 51.03415221260387)	2023-12-01	new
67258	NORTH	SAGE HILL	Theft FROM Vehicle	1	7924	2023/12	2023	2023-DEC-SAGE HILL-Theft FROM Vehicle	POINT (-114.14068609335015 51.175616972779984)	2023-12-01	new
67259	CENTRE	WINSTON HEIGHTS/MOUNTVIEW	Break & Enter - Commercial	1	3635	2023/12	2023	2023-DEC-WINSTON HEIGHTS/MOUNTVIEW-Break & Ent...	POINT (-114.04184874950579 51.075298802175126)	2023-12-01	new
67260	CENTRE	SOUTH CALGARY	Street Robbery	1	4442	2023/12	2023	2023-DEC-SOUTH CALGARY-Street Robbery	POINT (-114.10207470148995 51.02680215136445)	2023-12-01	new
67261	NORTHEAST	SKYLINE EAST	Break & Enter - Commercial	1	0	2023/12	2023	2023-DEC-SKYLINE EAST-Break & Enter - Commercial	POINT (-114.03893128621468 51.10015261262966)	2023-12-01	new

10777 rows × 11 columns

Ahaa! As we can see, a single group contains a DataFrame with values only for that specific year. Awesome!

This is really useful, because we can now use all the familiar DataFrame methods for calculating statistics, etc. for each specific group. We can, for example, calculate the average values for all variables using the statistical functions that we have seen already (e.g. mean, std, min, max, median, etc.).

In our example, we can use mean() function to calculate average of Crime Count in each group . Let’s calculate the mean for group_2023:

# Calculate the mean values all at one go
mean_values = group_2023["Crime Count"].mean()

# Let's see what we have
print(mean_values)

2.53085274195045

Aggregation#

Above, we saw how you can access data from a single group. In order to get information about all groups (all years) we can use a aggregation in groups.

# aggregate the grouped data by mean on only "Crime Count" column
new_aggregate_data = grouped_data.aggregate({"Crime Count": "mean"})

# print the new aggregate data
new_aggregate_data

	Crime Count
NEW_DATE
2018	3.048227
2019	3.239705
2020	2.913442
2021	2.634971
2022	2.876989
2023	2.530853

# we can also use the agg function to calculate on multiple columns
new_aggregate_data_02 = grouped_data.agg({"Crime Count": "sum", "Resident Count": "mean"})

# print the new aggregate data
new_aggregate_data_02

	Crime Count	Resident Count
NEW_DATE
2018	35332	6596.933224
2019	38154	6536.832555
2020	31774	6580.625527
2021	28376	6502.551676
2022	32720	6380.497142
2023	27275	6385.120627

So, we can see that in 2019 we had relatively more crime happening!

Change column names#

We can also change column names in our new aggregated dataframe:

# change the column name of "Crime Count" to "Total Crime Count" and "Resident Count" to "Average Resident Count"
new_aggregate_data_02.rename(columns={"Crime Count": "Total Crime Count", "Resident Count": "Average Resident Count"}, inplace=True)

# print the new aggregate data
new_aggregate_data_02

	Total Crime Count	Average Resident Count
NEW_DATE
2018	35332	6596.933224
2019	38154	6536.832555
2020	31774	6580.625527
2021	28376	6502.551676
2022	32720	6380.497142
2023	27275	6385.120627

Note

More infromation about pandas grouping and aggregating is available in Chapter 10: Data Aggregation and Group Operations

data

	Sector	Community Name	Category	Crime Count	Resident Count	Date	Year	ID	Community Center Point	NEW_DATE	Year Check
0	NORTHWEST	ARBOUR LAKE	Theft OF Vehicle	2	10619	2022/04	2022	2022-APR-ARBOUR LAKE-Theft OF Vehicle	POINT (-114.20767498075155 51.1325947114686)	2022-04-01	old
1	CENTRE	BANFF TRAIL	Theft OF Vehicle	2	4153	2023/10	2023	2023-OCT-BANFF TRAIL-Theft OF Vehicle	POINT (-114.11512839716917 51.07421633024228)	2023-10-01	new
2	EAST	DOVER	Theft OF Vehicle	5	10351	2022/12	2022	2022-DEC-DOVER-Theft OF Vehicle	POINT (-113.99305400906283 51.02256772250409)	2022-12-01	old
3	CENTRE	GREENVIEW	Assault (Non-domestic)	2	1906	2020/12	2020	2020-DEC-GREENVIEW-Assault (Non-domestic)	POINT (-114.05746990262463 51.09485613506574)	2020-12-01	old
4	NORTHWEST	HAMPTONS	Theft FROM Vehicle	3	7382	2019/08	2019	2019-AUG-HAMPTONS-Theft FROM Vehicle	POINT (-114.14668419231347 51.14509283969437)	2019-08-01	old
...	...	...	...	...	...	...	...	...	...	...	...
67257	EAST	SOUTHVIEW	Break & Enter - Commercial	1	1805	2023/12	2023	2023-DEC-SOUTHVIEW-Break & Enter - Commercial	POINT (-113.99733916298928 51.03415221260387)	2023-12-01	new
67258	NORTH	SAGE HILL	Theft FROM Vehicle	1	7924	2023/12	2023	2023-DEC-SAGE HILL-Theft FROM Vehicle	POINT (-114.14068609335015 51.175616972779984)	2023-12-01	new
67259	CENTRE	WINSTON HEIGHTS/MOUNTVIEW	Break & Enter - Commercial	1	3635	2023/12	2023	2023-DEC-WINSTON HEIGHTS/MOUNTVIEW-Break & Ent...	POINT (-114.04184874950579 51.075298802175126)	2023-12-01	new
67260	CENTRE	SOUTH CALGARY	Street Robbery	1	4442	2023/12	2023	2023-DEC-SOUTH CALGARY-Street Robbery	POINT (-114.10207470148995 51.02680215136445)	2023-12-01	new
67261	NORTHEAST	SKYLINE EAST	Break & Enter - Commercial	1	0	2023/12	2023	2023-DEC-SKYLINE EAST-Break & Enter - Commercial	POINT (-114.03893128621468 51.10015261262966)	2023-12-01	new

67193 rows × 11 columns

More Parctice on Grouping and Aggregating#

We can aggregate DataFrame by more than two columns at the same time.

For example, we want to see the number of crimes by category in each year. We can use .groupby() on two columns Categor and NEW_DATE

# group the data by NEW_DATE yearly and Category and calculate the sum of "Crime Count"
grouped_data_02 = data.groupby([data["NEW_DATE"].dt.year, "Category"])

# aggregate the grouped data by sum on only "Crime Count" column
new_aggregate_data_03 = grouped_data_02.aggregate({"Crime Count": "sum"})

# change the column name of "Crime Count" to "Total Crime Count"
new_aggregate_data_03.rename(columns={"Crime Count": "Total Crime Count"}, inplace=True)

# print the new aggregate data
new_aggregate_data_03

		Total Crime Count
NEW_DATE	Category
2018	Assault (Non-domestic)	3828
	Break & Enter - Commercial	5054
	Break & Enter - Dwelling	2693
	Break & Enter - Other Premises	2103
	Commercial Robbery	305
	Street Robbery	658
	Theft FROM Vehicle	12624
	Theft OF Vehicle	6056
	Violence Other (Non-domestic)	2011
2019	Assault (Non-domestic)	3913
	Break & Enter - Commercial	5782
	Break & Enter - Dwelling	2401
	Break & Enter - Other Premises	2303
	Commercial Robbery	382
	Street Robbery	649
	Theft FROM Vehicle	14723
	Theft OF Vehicle	6028
	Violence Other (Non-domestic)	1973
2020	Assault (Non-domestic)	3529
	Break & Enter - Commercial	4203
	Break & Enter - Dwelling	1686
	Break & Enter - Other Premises	2500
	Commercial Robbery	281
	Street Robbery	525
	Theft FROM Vehicle	12387
	Theft OF Vehicle	4649
	Violence Other (Non-domestic)	2014
2021	Assault (Non-domestic)	4036
	Break & Enter - Commercial	3212
	Break & Enter - Dwelling	1577
	Break & Enter - Other Premises	1835
	Commercial Robbery	359
	Street Robbery	575
	Theft FROM Vehicle	10158
	Theft OF Vehicle	4445
	Violence Other (Non-domestic)	2179
2022	Assault (Non-domestic)	4139
	Break & Enter - Commercial	4273
	Break & Enter - Dwelling	1586
	Break & Enter - Other Premises	1948
	Commercial Robbery	391
	Street Robbery	536
	Theft FROM Vehicle	12344
	Theft OF Vehicle	5419
	Violence Other (Non-domestic)	2084
2023	Assault (Non-domestic)	4576
	Break & Enter - Commercial	3344
	Break & Enter - Dwelling	1525
	Break & Enter - Other Premises	1298
	Commercial Robbery	311
	Street Robbery	693
	Theft FROM Vehicle	8951
	Theft OF Vehicle	4448
	Violence Other (Non-domestic)	2129

Processing data with pandas

Contents

Processing data with pandas#

Input data: Community Crime Statistics Map#

Loading Data#

Basic Calculations in DataFrame#

Selecting rows and columns#

Filtering and updating#

Filter Columns:#

Filter Rows#

Data Cleaning#

Dealing with missing data#

Removing Duplicates#

Data type conversions#

Datetime#

Iterating over rows#

Aggregating data in pandas by grouping#

Accessing data for one group#

Aggregation#

Change column names#

More Parctice on Grouping and Aggregating#