Processing data with pandas#
Attention
These tutorials are meant to help you with assignment tasks by giving hints, not to teach Python programming in a detailed step-by-step manner.
As mentioned in the first lecture of the course, this course requires basic knowledge of Python programming language.
If you are not familiar with them, try catching up with the basics as fast as you can. Some useful resources are:
import warnings
warnings.filterwarnings("ignore")
Our assignment 1 is about to learn how to read and explore data files in Python. We will focus on using pandas which is an open-source package for data analysis in Python. pandas is an excellent toolkit for working with real world data that often have a tabular structure (rows and columns).
Input data: Community Crime Statistics Map#
Our input data in this tutorial is a text file containing community crimes statistics map in city of Calgary, Alberta, Canda retrieved from City of Calgary Open Data Portal:
File name: [Community_Crime_Statistics_20240120.csv] (you can have a look at the file before reading it in using pandas!)
You can download the data from the link provided: City of Calgary Open Data Portal
Data is provided monthly by the Calgary Police Service. And includes the location of crime, time, category, crime count, and resident count.
There are totally 67,262 rows and 10 columns in this dataset.
Loading Data#
Next, we wll read the input data file, and store the contents of that file in a variable called data Using the pandas.read_csv() function:
# Importing the libraries
import pandas as pd
# load data
# Read the file using pandas
data = pd.read_csv("Community_Crime_Statistics_20240120.csv", sep=',')
Reading different file formats
Check out the pandas documentation about input and output functions and Chapter 6 in McKinney (2022) for more details about reading data.
Let’s now print the dataframe and see what it looks like.
data
| Sector | Community Name | Category | Crime Count | Resident Count | Date | Year | Month | ID | Community Center Point | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NORTHWEST | ARBOUR LAKE | Theft OF Vehicle | 2 | 10619.0 | 2022/04 | 2022 | APR | 2022-APR-ARBOUR LAKE-Theft OF Vehicle | POINT (-114.20767498075155 51.1325947114686) |
| 1 | CENTRE | BANFF TRAIL | Theft OF Vehicle | 2 | 4153.0 | 2023/10 | 2023 | OCT | 2023-OCT-BANFF TRAIL-Theft OF Vehicle | POINT (-114.11512839716917 51.07421633024228) |
| 2 | EAST | DOVER | Theft OF Vehicle | 5 | 10351.0 | 2022/12 | 2022 | DEC | 2022-DEC-DOVER-Theft OF Vehicle | POINT (-113.99305400906283 51.02256772250409) |
| 3 | CENTRE | GREENVIEW | Assault (Non-domestic) | 2 | 1906.0 | 2020/12 | 2020 | DEC | 2020-DEC-GREENVIEW-Assault (Non-domestic) | POINT (-114.05746990262463 51.09485613506574) |
| 4 | NORTHWEST | HAMPTONS | Theft FROM Vehicle | 3 | 7382.0 | 2019/08 | 2019 | AUG | 2019-AUG-HAMPTONS-Theft FROM Vehicle | POINT (-114.14668419231347 51.14509283969437) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 67930 | SOUTH | SOMERSET | Assault (Non-domestic) | 2 | 8543.0 | 2020/05 | 2020 | MAY | 2020-MAY-SOMERSET-Assault (Non-domestic) | POINT (-114.08138084069279 50.89891142227545) |
| 67931 | CENTRE | LOWER MOUNT ROYAL | Assault (Non-domestic) | 1 | 3457.0 | 2023/07 | 2023 | JUL | 2023-JUL-LOWER MOUNT ROYAL-Assault (Non-domestic) | POINT (-114.08649939756594 51.0365031977319) |
| 67932 | NORTHEAST | MARLBOROUGH PARK | Break & Enter - Dwelling | 1 | 8523.0 | 2021/04 | 2021 | APR | 2021-APR-MARLBOROUGH PARK-Break & Enter - Dwel... | POINT (-113.94671708302872 51.059688495333184) |
| 67933 | SOUTH | DEER RIDGE | Break & Enter - Commercial | 1 | 3973.0 | 2020/08 | 2020 | AUG | 2020-AUG-DEER RIDGE-Break & Enter - Commercial | POINT (-114.02023404987288 50.926573869162) |
| 67934 | NORTH | SAGE HILL | Theft FROM Vehicle | 1 | 7924.0 | 2020/11 | 2020 | NOV | 2020-NOV-SAGE HILL-Theft FROM Vehicle | POINT (-114.14068609335015 51.175616972779984) |
67935 rows × 10 columns
We can use the data.head() function of the pandas DataFrame object to quickly check the top rows. We can also check the last rows of the data using data.tail()
# print the first 5 rows of data
data.head()
| Sector | Community Name | Category | Crime Count | Resident Count | Date | Year | Month | ID | Community Center Point | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NORTHWEST | ARBOUR LAKE | Theft OF Vehicle | 2 | 10619.0 | 2022/04 | 2022 | APR | 2022-APR-ARBOUR LAKE-Theft OF Vehicle | POINT (-114.20767498075155 51.1325947114686) |
| 1 | CENTRE | BANFF TRAIL | Theft OF Vehicle | 2 | 4153.0 | 2023/10 | 2023 | OCT | 2023-OCT-BANFF TRAIL-Theft OF Vehicle | POINT (-114.11512839716917 51.07421633024228) |
| 2 | EAST | DOVER | Theft OF Vehicle | 5 | 10351.0 | 2022/12 | 2022 | DEC | 2022-DEC-DOVER-Theft OF Vehicle | POINT (-113.99305400906283 51.02256772250409) |
| 3 | CENTRE | GREENVIEW | Assault (Non-domestic) | 2 | 1906.0 | 2020/12 | 2020 | DEC | 2020-DEC-GREENVIEW-Assault (Non-domestic) | POINT (-114.05746990262463 51.09485613506574) |
| 4 | NORTHWEST | HAMPTONS | Theft FROM Vehicle | 3 | 7382.0 | 2019/08 | 2019 | AUG | 2019-AUG-HAMPTONS-Theft FROM Vehicle | POINT (-114.14668419231347 51.14509283969437) |
# print the last 5 rows of data
data.tail()
| Sector | Community Name | Category | Crime Count | Resident Count | Date | Year | Month | ID | Community Center Point | |
|---|---|---|---|---|---|---|---|---|---|---|
| 67930 | SOUTH | SOMERSET | Assault (Non-domestic) | 2 | 8543.0 | 2020/05 | 2020 | MAY | 2020-MAY-SOMERSET-Assault (Non-domestic) | POINT (-114.08138084069279 50.89891142227545) |
| 67931 | CENTRE | LOWER MOUNT ROYAL | Assault (Non-domestic) | 1 | 3457.0 | 2023/07 | 2023 | JUL | 2023-JUL-LOWER MOUNT ROYAL-Assault (Non-domestic) | POINT (-114.08649939756594 51.0365031977319) |
| 67932 | NORTHEAST | MARLBOROUGH PARK | Break & Enter - Dwelling | 1 | 8523.0 | 2021/04 | 2021 | APR | 2021-APR-MARLBOROUGH PARK-Break & Enter - Dwel... | POINT (-113.94671708302872 51.059688495333184) |
| 67933 | SOUTH | DEER RIDGE | Break & Enter - Commercial | 1 | 3973.0 | 2020/08 | 2020 | AUG | 2020-AUG-DEER RIDGE-Break & Enter - Commercial | POINT (-114.02023404987288 50.926573869162) |
| 67934 | NORTH | SAGE HILL | Theft FROM Vehicle | 1 | 7924.0 | 2020/11 | 2020 | NOV | 2020-NOV-SAGE HILL-Theft FROM Vehicle | POINT (-114.14068609335015 51.175616972779984) |
Let’s see some basic info about the data (The number of column, rows, the data type of each column etc.)
# Check the data info
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67935 entries, 0 to 67934
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sector 67904 non-null object
1 Community Name 67935 non-null object
2 Category 67935 non-null object
3 Crime Count 67935 non-null int64
4 Resident Count 67866 non-null float64
5 Date 67935 non-null object
6 Year 67935 non-null int64
7 Month 67935 non-null object
8 ID 67935 non-null object
9 Community Center Point 67904 non-null object
dtypes: float64(1), int64(2), object(7)
memory usage: 5.2+ MB
Also some basic stats about numerical values in columns that contain numbers (Crime count, Resident count, Year)
# Cehck the data description
data.describe()
| Crime Count | Resident Count | Year | |
|---|---|---|---|
| count | 67935.000000 | 67866.000000 | 67935.000000 |
| mean | 2.880651 | 6500.622653 | 2020.460146 |
| std | 3.685793 | 5455.404451 | 1.712744 |
| min | 1.000000 | 0.000000 | 2018.000000 |
| 25% | 1.000000 | 2263.000000 | 2019.000000 |
| 50% | 2.000000 | 5957.000000 | 2020.000000 |
| 75% | 3.000000 | 9244.000000 | 2022.000000 |
| max | 110.000000 | 25710.000000 | 2023.000000 |
Basic Calculations in DataFrame#
One of the most common things to do in pandas is to create new columns based on calculations between different variables (columns).
Let’s say we want to create a new column in which we want to calculate the “Crime Count +1”. We can create a new column new_crime_count in our DataFrame by specifying the name of the column and giving it some default value (in this case the decimal number 0.0).
# Define a new column "ew_crime_count"
data["new_crime_count"] = 0.0
# Check how the dataframe looks like:
data
| Sector | Community Name | Category | Crime Count | Resident Count | Date | Year | Month | ID | Community Center Point | new_crime_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NORTHWEST | ARBOUR LAKE | Theft OF Vehicle | 2 | 10619.0 | 2022/04 | 2022 | APR | 2022-APR-ARBOUR LAKE-Theft OF Vehicle | POINT (-114.20767498075155 51.1325947114686) | 0.0 |
| 1 | CENTRE | BANFF TRAIL | Theft OF Vehicle | 2 | 4153.0 | 2023/10 | 2023 | OCT | 2023-OCT-BANFF TRAIL-Theft OF Vehicle | POINT (-114.11512839716917 51.07421633024228) | 0.0 |
| 2 | EAST | DOVER | Theft OF Vehicle | 5 | 10351.0 | 2022/12 | 2022 | DEC | 2022-DEC-DOVER-Theft OF Vehicle | POINT (-113.99305400906283 51.02256772250409) | 0.0 |
| 3 | CENTRE | GREENVIEW | Assault (Non-domestic) | 2 | 1906.0 | 2020/12 | 2020 | DEC | 2020-DEC-GREENVIEW-Assault (Non-domestic) | POINT (-114.05746990262463 51.09485613506574) | 0.0 |
| 4 | NORTHWEST | HAMPTONS | Theft FROM Vehicle | 3 | 7382.0 | 2019/08 | 2019 | AUG | 2019-AUG-HAMPTONS-Theft FROM Vehicle | POINT (-114.14668419231347 51.14509283969437) | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 67930 | SOUTH | SOMERSET | Assault (Non-domestic) | 2 | 8543.0 | 2020/05 | 2020 | MAY | 2020-MAY-SOMERSET-Assault (Non-domestic) | POINT (-114.08138084069279 50.89891142227545) | 0.0 |
| 67931 | CENTRE | LOWER MOUNT ROYAL | Assault (Non-domestic) | 1 | 3457.0 | 2023/07 | 2023 | JUL | 2023-JUL-LOWER MOUNT ROYAL-Assault (Non-domestic) | POINT (-114.08649939756594 51.0365031977319) | 0.0 |
| 67932 | NORTHEAST | MARLBOROUGH PARK | Break & Enter - Dwelling | 1 | 8523.0 | 2021/04 | 2021 | APR | 2021-APR-MARLBOROUGH PARK-Break & Enter - Dwel... | POINT (-113.94671708302872 51.059688495333184) | 0.0 |
| 67933 | SOUTH | DEER RIDGE | Break & Enter - Commercial | 1 | 3973.0 | 2020/08 | 2020 | AUG | 2020-AUG-DEER RIDGE-Break & Enter - Commercial | POINT (-114.02023404987288 50.926573869162) | 0.0 |
| 67934 | NORTH | SAGE HILL | Theft FROM Vehicle | 1 | 7924.0 | 2020/11 | 2020 | NOV | 2020-NOV-SAGE HILL-Theft FROM Vehicle | POINT (-114.14068609335015 51.175616972779984) | 0.0 |
67935 rows × 11 columns
Now let’s update the column new_crime_count by calculating the the column Crime Count + 1.
# Calculate `Crime Count` + 1
data["new_crime_count"] = data["Crime Count"] + 1
# Check the result
data.head()
| Sector | Community Name | Category | Crime Count | Resident Count | Date | Year | Month | ID | Community Center Point | new_crime_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NORTHWEST | ARBOUR LAKE | Theft OF Vehicle | 2 | 10619.0 | 2022/04 | 2022 | APR | 2022-APR-ARBOUR LAKE-Theft OF Vehicle | POINT (-114.20767498075155 51.1325947114686) | 3 |
| 1 | CENTRE | BANFF TRAIL | Theft OF Vehicle | 2 | 4153.0 | 2023/10 | 2023 | OCT | 2023-OCT-BANFF TRAIL-Theft OF Vehicle | POINT (-114.11512839716917 51.07421633024228) | 3 |
| 2 | EAST | DOVER | Theft OF Vehicle | 5 | 10351.0 | 2022/12 | 2022 | DEC | 2022-DEC-DOVER-Theft OF Vehicle | POINT (-113.99305400906283 51.02256772250409) | 6 |
| 3 | CENTRE | GREENVIEW | Assault (Non-domestic) | 2 | 1906.0 | 2020/12 | 2020 | DEC | 2020-DEC-GREENVIEW-Assault (Non-domestic) | POINT (-114.05746990262463 51.09485613506574) | 3 |
| 4 | NORTHWEST | HAMPTONS | Theft FROM Vehicle | 3 | 7382.0 | 2019/08 | 2019 | AUG | 2019-AUG-HAMPTONS-Theft FROM Vehicle | POINT (-114.14668419231347 51.14509283969437) | 4 |
You can even calculate the sum of two columns. For example: data["new_column"] = data["Crime Count"] + data["Resident Count"]
Selecting rows and columns#
One common way of selecting only specific rows from your DataFrame is done via this syntax: dataframe[start_index:stop_index]
Let’s select the first six rows and assign them to a variable called selection:
# Select first five rows of dataframe using row index values
selection = data[0:6]
# Print `selection`
selection
| Sector | Community Name | Category | Crime Count | Resident Count | Date | Year | Month | ID | Community Center Point | new_crime_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NORTHWEST | ARBOUR LAKE | Theft OF Vehicle | 2 | 10619.0 | 2022/04 | 2022 | APR | 2022-APR-ARBOUR LAKE-Theft OF Vehicle | POINT (-114.20767498075155 51.1325947114686) | 3 |
| 1 | CENTRE | BANFF TRAIL | Theft OF Vehicle | 2 | 4153.0 | 2023/10 | 2023 | OCT | 2023-OCT-BANFF TRAIL-Theft OF Vehicle | POINT (-114.11512839716917 51.07421633024228) | 3 |
| 2 | EAST | DOVER | Theft OF Vehicle | 5 | 10351.0 | 2022/12 | 2022 | DEC | 2022-DEC-DOVER-Theft OF Vehicle | POINT (-113.99305400906283 51.02256772250409) | 6 |
| 3 | CENTRE | GREENVIEW | Assault (Non-domestic) | 2 | 1906.0 | 2020/12 | 2020 | DEC | 2020-DEC-GREENVIEW-Assault (Non-domestic) | POINT (-114.05746990262463 51.09485613506574) | 3 |
| 4 | NORTHWEST | HAMPTONS | Theft FROM Vehicle | 3 | 7382.0 | 2019/08 | 2019 | AUG | 2019-AUG-HAMPTONS-Theft FROM Vehicle | POINT (-114.14668419231347 51.14509283969437) | 4 |
| 5 | SOUTH | LAKE BONAVISTA | Theft OF Vehicle | 2 | 10293.0 | 2022/08 | 2022 | AUG | 2022-AUG-LAKE BONAVISTA-Theft OF Vehicle | POINT (-114.05133665127042 50.939803390308995) | 3 |
It is also possible to control which columns are chosen when selecting a subset of rows. In this case we will use pandas.DataFrame.loc which selects data based on axis labels (row labels and column labels) with this syntax : .loc[start_index:stop_index, list_of_columns]
Let’s select “Category” and “Crime Count” values from rows 0-5:
# Select the two column values on rows 0-5
selection = data.loc[0:5, ["Category", "Crime Count"]]
# Print `selection`
selection
| Category | Crime Count | |
|---|---|---|
| 0 | Theft OF Vehicle | 2 |
| 1 | Theft OF Vehicle | 2 |
| 2 | Theft OF Vehicle | 5 |
| 3 | Assault (Non-domestic) | 2 |
| 4 | Theft FROM Vehicle | 3 |
| 5 | Theft OF Vehicle | 2 |
Filtering and updating#
Filter Columns:#
Sometimes we don’t need all the information in the data. We can select specific columns based on the column values.
For example, I don’t need the collumn Month and want to keep other columns. We can drop the column by using the .drop() function. (e.g. data.drop('Month', axis=1, inplace=True))
axis=1 means drop the column, inplace=True means drop the column in the original data, (for droping roaws we can use axis=0 which means drop the row)
# drop the extra columns
data.drop(['Month'], axis=1, inplace=True)
Note
Alternatively, instead of using .drop() to drop unneccesary columns, we can keep all the columns that we need. The basic syntax is dataframe[value], where value can be a single column name, or a list of column names. In our example, instead of dropping Month, I can keep all other columns with the mentioned suntax.
# Keep all columns except the Month column
data = data[['Sector', 'Community Name', 'Category', 'Crime Count', 'Resident Count', 'Date', 'Year', 'ID', 'Community Center Point']]
data.head()
| Sector | Community Name | Category | Crime Count | Resident Count | Date | Year | ID | Community Center Point | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | NORTHWEST | ARBOUR LAKE | Theft OF Vehicle | 2 | 10619.0 | 2022/04 | 2022 | 2022-APR-ARBOUR LAKE-Theft OF Vehicle | POINT (-114.20767498075155 51.1325947114686) |
| 1 | CENTRE | BANFF TRAIL | Theft OF Vehicle | 2 | 4153.0 | 2023/10 | 2023 | 2023-OCT-BANFF TRAIL-Theft OF Vehicle | POINT (-114.11512839716917 51.07421633024228) |
| 2 | EAST | DOVER | Theft OF Vehicle | 5 | 10351.0 | 2022/12 | 2022 | 2022-DEC-DOVER-Theft OF Vehicle | POINT (-113.99305400906283 51.02256772250409) |
| 3 | CENTRE | GREENVIEW | Assault (Non-domestic) | 2 | 1906.0 | 2020/12 | 2020 | 2020-DEC-GREENVIEW-Assault (Non-domestic) | POINT (-114.05746990262463 51.09485613506574) |
| 4 | NORTHWEST | HAMPTONS | Theft FROM Vehicle | 3 | 7382.0 | 2019/08 | 2019 | 2019-AUG-HAMPTONS-Theft FROM Vehicle | POINT (-114.14668419231347 51.14509283969437) |
Filter Rows#
One really useful feature in pandas is the ability to easily filter and select rows based on a conditional statement.
# Select rows that are in 2023
crimes_2023 = data.loc[data["Year"] > 2022]
# Print `crimes_2023`
crimes_2023.head()
| Sector | Community Name | Category | Crime Count | Resident Count | Date | Year | ID | Community Center Point | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | CENTRE | BANFF TRAIL | Theft OF Vehicle | 2 | 4153.0 | 2023/10 | 2023 | 2023-OCT-BANFF TRAIL-Theft OF Vehicle | POINT (-114.11512839716917 51.07421633024228) |
| 10 | CENTRE | ROSEDALE | Break & Enter - Dwelling | 1 | 1560.0 | 2023/04 | 2023 | 2023-APR-ROSEDALE-Break & Enter - Dwelling | POINT (-114.07820713414625 51.062686663226856) |
| 15 | CENTRE | TUXEDO PARK | Theft OF Vehicle | 3 | 5326.0 | 2023/08 | 2023 | 2023-AUG-TUXEDO PARK-Theft OF Vehicle | POINT (-114.06104493215598 51.07421777158121) |
| 30 | NORTHEAST | CALGARY INTERNATIONAL AIRPORT | Theft FROM Vehicle | 8 | 0.0 | 2023/05 | 2023 | 2023-MAY-CALGARY INTERNATIONAL AIRPORT-Theft F... | POINT (-114.01171665690258 51.123556453987604) |
| 96 | CENTRE | ALTADORE | Theft OF Vehicle | 1 | 6942.0 | 2023/04 | 2023 | 2023-APR-ALTADORE-Theft OF Vehicle | POINT (-114.10078239468564 51.015954118298694) |
Or even multiple conditions at the same time:
# filter the data to show crime with crime count greater than 5 for year 2023
crime_2023_selected = data.loc[(data["Year"] > 2022) & (data["Crime Count"] >= 5)]
# Print `crime_2023_selected`
crime_2023_selected.head()
| Sector | Community Name | Category | Crime Count | Resident Count | Date | Year | ID | Community Center Point | |
|---|---|---|---|---|---|---|---|---|---|
| 30 | NORTHEAST | CALGARY INTERNATIONAL AIRPORT | Theft FROM Vehicle | 8 | 0.0 | 2023/05 | 2023 | 2023-MAY-CALGARY INTERNATIONAL AIRPORT-Theft F... | POINT (-114.01171665690258 51.123556453987604) |
| 13017 | SOUTHEAST | AUBURN BAY | Theft FROM Vehicle | 5 | 17607.0 | 2023/03 | 2023 | 2023-MAR-AUBURN BAY-Theft FROM Vehicle | POINT (-113.95779514351926 50.89269507687) |
| 20753 | SOUTHEAST | SETON | Theft OF Vehicle | 5 | 1134.0 | 2023/10 | 2023 | 2023-OCT-SETON-Theft OF Vehicle | POINT (-113.94793123802683 50.874157662841256) |
| 20934 | SOUTHEAST | OGDEN | Theft FROM Vehicle | 5 | 8576.0 | 2023/09 | 2023 | 2023-SEP-OGDEN-Theft FROM Vehicle | POINT (-114.01244227109757 50.991878337981746) |
| 29683 | SOUTHEAST | OGDEN | Theft FROM Vehicle | 7 | 8576.0 | 2023/11 | 2023 | 2023-NOV-OGDEN-Theft FROM Vehicle | POINT (-114.01244227109757 50.991878337981746) |
Data Cleaning#
Dealing with missing data#
As you have learned, most of the time when we work with data, we face missing values or incomplete data. We can check the number of missing values in Dataframe with this syntax:
# print the number of missing values in each column
data.isnull().sum()
Sector 31
Community Name 0
Category 0
Crime Count 0
Resident Count 69
Date 0
Year 0
ID 0
Community Center Point 31
dtype: int64
As we can see, in Column “Sector”, “Resident Count”, and “Community Center Pint” we have 31, 69, 31 missing values respectively.
OThere are many way to handle missing values based on the application. One way of cleaning data is to remover rows with missing values. If we want to remove the missing values based on column (only remove those rows that are missing the value in specific column) we can use this syntax: .dropna(subset=['column_name'], inplace=True)
# drop the rows with missing values for column "Community Center Point"
data.dropna(subset=['Community Center Point'], inplace=True)
# print the number of missing values in each column
data.isnull().sum()
Sector 0
Community Name 0
Category 0
Crime Count 0
Resident Count 38
Date 0
Year 0
ID 0
Community Center Point 0
dtype: int64
As we can see, we remove all the rows that had missing values for column Community Center Point.
If we want to remove all the row that has any missing value, no matter in which column, we can use this syntax:
# remove all the rows with missing values
data.dropna(inplace=True)
# print the number of missing values in each column
data.isnull().sum()
Sector 0
Community Name 0
Category 0
Crime Count 0
Resident Count 0
Date 0
Year 0
ID 0
Community Center Point 0
dtype: int64
Removing Duplicates#
Duplicate rows may be found in a DataFrame for any number of reasons.
The DataFrame method duplicated returns a Boolean Series indicating whether each row is a duplicate (its column values are exactly equal to those in an earlier row) or not:
# check for duplicate rows
data.duplicated()
0 False
1 False
2 False
3 False
4 False
...
67930 True
67931 True
67932 True
67933 True
67934 True
Length: 67866, dtype: bool
# count all the duplicate rows
data.duplicated().sum()
np.int64(673)
We have 673 rows with identical values. Relatedly, drop_duplicates returns a DataFrame with rows where the duplicated array is False filtered out:
# drop the duplicate rows
data.drop_duplicates(inplace=True)
# print the number of duplicate rows
data.duplicated().sum()
np.int64(0)
Note
You can check for more infromation about data cleaning with pandas by looking at : Data Cleaning and Preparation
Data type conversions#
There are occasions where you’ll need to convert data stored within a Series to another data type, for example, from floating point to integer.
For values in pandas DataFrames and Series, we can use the .astype() method.
## Data type conversions
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 67193 entries, 0 to 67261
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sector 67193 non-null object
1 Community Name 67193 non-null object
2 Category 67193 non-null object
3 Crime Count 67193 non-null int64
4 Resident Count 67193 non-null float64
5 Date 67193 non-null object
6 Year 67193 non-null int64
7 ID 67193 non-null object
8 Community Center Point 67193 non-null object
dtypes: float64(1), int64(2), object(6)
memory usage: 5.1+ MB
As we can see the type of “Resident Count” is float64. We want to convert this column type to integer:
# convert the "Resident Count" to integer
data["Resident Count"] = data["Resident Count"].astype(int)
# print the data info
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 67193 entries, 0 to 67261
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sector 67193 non-null object
1 Community Name 67193 non-null object
2 Category 67193 non-null object
3 Crime Count 67193 non-null int64
4 Resident Count 67193 non-null int64
5 Date 67193 non-null object
6 Year 67193 non-null int64
7 ID 67193 non-null object
8 Community Center Point 67193 non-null object
dtypes: int64(3), object(6)
memory usage: 5.1+ MB
Datetime#
In pandas, we can also convert dates and times into a new data type called datetime using the pandas.to_datetime function.
# Convert character strings to datetime
data["NEW_DATE"] = pd.to_datetime(data["Date"])
# print the data info
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 67193 entries, 0 to 67261
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sector 67193 non-null object
1 Community Name 67193 non-null object
2 Category 67193 non-null object
3 Crime Count 67193 non-null int64
4 Resident Count 67193 non-null int64
5 Date 67193 non-null object
6 Year 67193 non-null int64
7 ID 67193 non-null object
8 Community Center Point 67193 non-null object
9 NEW_DATE 67193 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(3), object(6)
memory usage: 5.6+ MB
As we can see, the new column NEW_DATE has the type of “datetime64” which is the format of datetime in pandas. From now on pandas underestand that this is a date and time and the sequence in time series matter.
Pandas Series datetime properties
There are several methods available for accessing information about the properties of datetime values. You can read more about datetime properties from the pandas documentation.
With the datetime column, we can now extract different time units using the pandas.Series.dt accessor.
# print NEW_DATE column
data["NEW_DATE"].dt.year
0 2022
1 2023
2 2022
3 2020
4 2019
...
67257 2023
67258 2023
67259 2023
67260 2023
67261 2023
Name: NEW_DATE, Length: 67193, dtype: int32
# print NEW_DATE column
data["NEW_DATE"].dt.month
0 4
1 10
2 12
3 12
4 8
..
67257 12
67258 12
67259 12
67260 12
67261 12
Name: NEW_DATE, Length: 67193, dtype: int32
# print NEW_DATE column
data["NEW_DATE"].dt.day
0 1
1 1
2 1
3 1
4 1
..
67257 1
67258 1
67259 1
67260 1
67261 1
Name: NEW_DATE, Length: 67193, dtype: int32
Iterating over rows#
We can use an operation on the DataFrame one row at a time using a for loop and the iterrows() method. This will allow us to iterate row by row using iterrows() in a for loop to repeat a given process for each row in a pandas DataFrame. Please note that iterating over rows is a rather inefficient approach, but it is still useful to understand the logic behind the iteration.
When using the iterrows() method it is important to understand that iterrows() accesses not only the values of one row, but also the index of the row as well.
Let’s start with a simple for loop that goes through each row in our DataFrame.
# Iterate over the rows
for idx, row in data.iterrows():
# Print the index value
print(f"Index: {idx} \n\n")
# Print the row
print(f"Row :\n\n{row}\n")
break
Index: 0
Row :
Sector NORTHWEST
Community Name ARBOUR LAKE
Category Theft OF Vehicle
Crime Count 2
Resident Count 10619
Date 2022/04
Year 2022
ID 2022-APR-ARBOUR LAKE-Theft OF Vehicle
Community Center Point POINT (-114.20767498075155 51.1325947114686)
NEW_DATE 2022-04-01 00:00:00
Name: 0, dtype: object
Breaking a loop
When developing code in a for loop, you do not always need to go through the entire loop in order to test things out.
The break statement in Python terminates the current loop whereever it is placed and we can use it here just to check out the values on the first row (based on the first iteration in the for loop.
This can be helpful when working with a large data file or dataset, because you might not want to print thousands of values to the screen!
For more information, check out this tutorial.
We can see that the idx variable indeed contains the index value at position 0 (the first row) and the row variable contains all the data from that given row stored as a pandas Series.
Let’s now as an example create an empty column “Year Check” and check if crime year is greater than 2022 in a new column “Year Check” we enter “new”, else “old”.
# iterate in the dataframe and check if year is greater than 2022 in a new column "Year Check" enter "new" else "old"
for idx, row in data.iterrows():
if row["Year"] > 2022:
data.loc[idx, "Year Check"] = "new"
else:
data.loc[idx, "Year Check"] = "old"
Finally, let’s see how our DataFrame looks like now after the calculations above.
# print the first 5 rows of data
data.head()
| Sector | Community Name | Category | Crime Count | Resident Count | Date | Year | ID | Community Center Point | NEW_DATE | Year Check | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NORTHWEST | ARBOUR LAKE | Theft OF Vehicle | 2 | 10619 | 2022/04 | 2022 | 2022-APR-ARBOUR LAKE-Theft OF Vehicle | POINT (-114.20767498075155 51.1325947114686) | 2022-04-01 | old |
| 1 | CENTRE | BANFF TRAIL | Theft OF Vehicle | 2 | 4153 | 2023/10 | 2023 | 2023-OCT-BANFF TRAIL-Theft OF Vehicle | POINT (-114.11512839716917 51.07421633024228) | 2023-10-01 | new |
| 2 | EAST | DOVER | Theft OF Vehicle | 5 | 10351 | 2022/12 | 2022 | 2022-DEC-DOVER-Theft OF Vehicle | POINT (-113.99305400906283 51.02256772250409) | 2022-12-01 | old |
| 3 | CENTRE | GREENVIEW | Assault (Non-domestic) | 2 | 1906 | 2020/12 | 2020 | 2020-DEC-GREENVIEW-Assault (Non-domestic) | POINT (-114.05746990262463 51.09485613506574) | 2020-12-01 | old |
| 4 | NORTHWEST | HAMPTONS | Theft FROM Vehicle | 3 | 7382 | 2019/08 | 2019 | 2019-AUG-HAMPTONS-Theft FROM Vehicle | POINT (-114.14668419231347 51.14509283969437) | 2019-08-01 | old |
Aggregating data in pandas by grouping#
Here, we will learn how to use pandas.DataFrame.groupby, which is a handy method for combining large amounts of data and computing statistics for subgroups.
In our case, we will use the groupby method to calculate the average number of crimes for each year through these three steps:
Grouping the data based on the year
Calculating the average for each year (each group)
Storing those values into a new DataFrame called
grouped_data
We have rows of crime data, and several observations per month. Our goal is to create an aggreated data frame that would have only one row per year. (average crime count per year)
To condense our data to yearly average values we can group our data based on the unique year.
We can use NEW_DATE column, which we previously define and datetime format. We try to group this column on yearly basis and calculate the average number of crime for each group.
# group the data by "Date" on yearly basis and calculate average of "Crime Count"
grouped_data = data.groupby(data["NEW_DATE"].dt.year)
Now, let’s explore the new variable grouped_data.
# print grouped data
print(f"Type of grouped data is : {type(grouped_data)}")
print(f"Length of grouped Data: {len(grouped_data)}\n")
Type of grouped data is : <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
Length of grouped Data: 6
We have a new object with type DataFrameGroupBy with 6 groups. Length of the grouped object should be the same as the number of unique values in the column we used for grouping. For each unique value, there is a group of data. (in this example we have data for 6 years)
Let’s explore our grouped data even further.
We can check the “names” of each group.
# Print grouoped data keys
grouped_data.groups.keys()
dict_keys([2018, 2019, 2020, 2021, 2022, 2023])
Note
It is also possible to create combinations of two or more columns on-the-fly when grouping the data:
# Group the data
grouped = data.groupby(['column1', 'column2'])
Accessing data for one group#
Let us now check the contents for the group representing 2023 (the name of that group is 2023. We can get the values of that year from the grouped object using the get_group() method.
# Specify a month (as character string)
year = 2023
# Select the group
group_2023 = grouped_data.get_group(year)
# Let's see what we have
group_2023
| Sector | Community Name | Category | Crime Count | Resident Count | Date | Year | ID | Community Center Point | NEW_DATE | Year Check | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | CENTRE | BANFF TRAIL | Theft OF Vehicle | 2 | 4153 | 2023/10 | 2023 | 2023-OCT-BANFF TRAIL-Theft OF Vehicle | POINT (-114.11512839716917 51.07421633024228) | 2023-10-01 | new |
| 10 | CENTRE | ROSEDALE | Break & Enter - Dwelling | 1 | 1560 | 2023/04 | 2023 | 2023-APR-ROSEDALE-Break & Enter - Dwelling | POINT (-114.07820713414625 51.062686663226856) | 2023-04-01 | new |
| 15 | CENTRE | TUXEDO PARK | Theft OF Vehicle | 3 | 5326 | 2023/08 | 2023 | 2023-AUG-TUXEDO PARK-Theft OF Vehicle | POINT (-114.06104493215598 51.07421777158121) | 2023-08-01 | new |
| 30 | NORTHEAST | CALGARY INTERNATIONAL AIRPORT | Theft FROM Vehicle | 8 | 0 | 2023/05 | 2023 | 2023-MAY-CALGARY INTERNATIONAL AIRPORT-Theft F... | POINT (-114.01171665690258 51.123556453987604) | 2023-05-01 | new |
| 96 | CENTRE | ALTADORE | Theft OF Vehicle | 1 | 6942 | 2023/04 | 2023 | 2023-APR-ALTADORE-Theft OF Vehicle | POINT (-114.10078239468564 51.015954118298694) | 2023-04-01 | new |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 67257 | EAST | SOUTHVIEW | Break & Enter - Commercial | 1 | 1805 | 2023/12 | 2023 | 2023-DEC-SOUTHVIEW-Break & Enter - Commercial | POINT (-113.99733916298928 51.03415221260387) | 2023-12-01 | new |
| 67258 | NORTH | SAGE HILL | Theft FROM Vehicle | 1 | 7924 | 2023/12 | 2023 | 2023-DEC-SAGE HILL-Theft FROM Vehicle | POINT (-114.14068609335015 51.175616972779984) | 2023-12-01 | new |
| 67259 | CENTRE | WINSTON HEIGHTS/MOUNTVIEW | Break & Enter - Commercial | 1 | 3635 | 2023/12 | 2023 | 2023-DEC-WINSTON HEIGHTS/MOUNTVIEW-Break & Ent... | POINT (-114.04184874950579 51.075298802175126) | 2023-12-01 | new |
| 67260 | CENTRE | SOUTH CALGARY | Street Robbery | 1 | 4442 | 2023/12 | 2023 | 2023-DEC-SOUTH CALGARY-Street Robbery | POINT (-114.10207470148995 51.02680215136445) | 2023-12-01 | new |
| 67261 | NORTHEAST | SKYLINE EAST | Break & Enter - Commercial | 1 | 0 | 2023/12 | 2023 | 2023-DEC-SKYLINE EAST-Break & Enter - Commercial | POINT (-114.03893128621468 51.10015261262966) | 2023-12-01 | new |
10777 rows × 11 columns
Ahaa! As we can see, a single group contains a DataFrame with values only for that specific year. Awesome!
This is really useful, because we can now use all the familiar DataFrame methods for calculating statistics, etc. for each specific group. We can, for example, calculate the average values for all variables using the statistical functions that we have seen already (e.g. mean, std, min, max, median, etc.).
In our example, we can use mean() function to calculate average of Crime Count in each group . Let’s calculate the mean for group_2023:
# Calculate the mean values all at one go
mean_values = group_2023["Crime Count"].mean()
# Let's see what we have
print(mean_values)
2.53085274195045
Aggregation#
Above, we saw how you can access data from a single group. In order to get information about all groups (all years) we can use a aggregation in groups.
# aggregate the grouped data by mean on only "Crime Count" column
new_aggregate_data = grouped_data.aggregate({"Crime Count": "mean"})
# print the new aggregate data
new_aggregate_data
| Crime Count | |
|---|---|
| NEW_DATE | |
| 2018 | 3.048227 |
| 2019 | 3.239705 |
| 2020 | 2.913442 |
| 2021 | 2.634971 |
| 2022 | 2.876989 |
| 2023 | 2.530853 |
# we can also use the agg function to calculate on multiple columns
new_aggregate_data_02 = grouped_data.agg({"Crime Count": "sum", "Resident Count": "mean"})
# print the new aggregate data
new_aggregate_data_02
| Crime Count | Resident Count | |
|---|---|---|
| NEW_DATE | ||
| 2018 | 35332 | 6596.933224 |
| 2019 | 38154 | 6536.832555 |
| 2020 | 31774 | 6580.625527 |
| 2021 | 28376 | 6502.551676 |
| 2022 | 32720 | 6380.497142 |
| 2023 | 27275 | 6385.120627 |
So, we can see that in 2019 we had relatively more crime happening!
Change column names#
We can also change column names in our new aggregated dataframe:
# change the column name of "Crime Count" to "Total Crime Count" and "Resident Count" to "Average Resident Count"
new_aggregate_data_02.rename(columns={"Crime Count": "Total Crime Count", "Resident Count": "Average Resident Count"}, inplace=True)
# print the new aggregate data
new_aggregate_data_02
| Total Crime Count | Average Resident Count | |
|---|---|---|
| NEW_DATE | ||
| 2018 | 35332 | 6596.933224 |
| 2019 | 38154 | 6536.832555 |
| 2020 | 31774 | 6580.625527 |
| 2021 | 28376 | 6502.551676 |
| 2022 | 32720 | 6380.497142 |
| 2023 | 27275 | 6385.120627 |
Note
More infromation about pandas grouping and aggregating is available in Chapter 10: Data Aggregation and Group Operations
data
| Sector | Community Name | Category | Crime Count | Resident Count | Date | Year | ID | Community Center Point | NEW_DATE | Year Check | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NORTHWEST | ARBOUR LAKE | Theft OF Vehicle | 2 | 10619 | 2022/04 | 2022 | 2022-APR-ARBOUR LAKE-Theft OF Vehicle | POINT (-114.20767498075155 51.1325947114686) | 2022-04-01 | old |
| 1 | CENTRE | BANFF TRAIL | Theft OF Vehicle | 2 | 4153 | 2023/10 | 2023 | 2023-OCT-BANFF TRAIL-Theft OF Vehicle | POINT (-114.11512839716917 51.07421633024228) | 2023-10-01 | new |
| 2 | EAST | DOVER | Theft OF Vehicle | 5 | 10351 | 2022/12 | 2022 | 2022-DEC-DOVER-Theft OF Vehicle | POINT (-113.99305400906283 51.02256772250409) | 2022-12-01 | old |
| 3 | CENTRE | GREENVIEW | Assault (Non-domestic) | 2 | 1906 | 2020/12 | 2020 | 2020-DEC-GREENVIEW-Assault (Non-domestic) | POINT (-114.05746990262463 51.09485613506574) | 2020-12-01 | old |
| 4 | NORTHWEST | HAMPTONS | Theft FROM Vehicle | 3 | 7382 | 2019/08 | 2019 | 2019-AUG-HAMPTONS-Theft FROM Vehicle | POINT (-114.14668419231347 51.14509283969437) | 2019-08-01 | old |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 67257 | EAST | SOUTHVIEW | Break & Enter - Commercial | 1 | 1805 | 2023/12 | 2023 | 2023-DEC-SOUTHVIEW-Break & Enter - Commercial | POINT (-113.99733916298928 51.03415221260387) | 2023-12-01 | new |
| 67258 | NORTH | SAGE HILL | Theft FROM Vehicle | 1 | 7924 | 2023/12 | 2023 | 2023-DEC-SAGE HILL-Theft FROM Vehicle | POINT (-114.14068609335015 51.175616972779984) | 2023-12-01 | new |
| 67259 | CENTRE | WINSTON HEIGHTS/MOUNTVIEW | Break & Enter - Commercial | 1 | 3635 | 2023/12 | 2023 | 2023-DEC-WINSTON HEIGHTS/MOUNTVIEW-Break & Ent... | POINT (-114.04184874950579 51.075298802175126) | 2023-12-01 | new |
| 67260 | CENTRE | SOUTH CALGARY | Street Robbery | 1 | 4442 | 2023/12 | 2023 | 2023-DEC-SOUTH CALGARY-Street Robbery | POINT (-114.10207470148995 51.02680215136445) | 2023-12-01 | new |
| 67261 | NORTHEAST | SKYLINE EAST | Break & Enter - Commercial | 1 | 0 | 2023/12 | 2023 | 2023-DEC-SKYLINE EAST-Break & Enter - Commercial | POINT (-114.03893128621468 51.10015261262966) | 2023-12-01 | new |
67193 rows × 11 columns
More Parctice on Grouping and Aggregating#
We can aggregate DataFrame by more than two columns at the same time.
For example, we want to see the number of crimes by category in each year.
We can use .groupby() on two columns Categor and NEW_DATE
# group the data by NEW_DATE yearly and Category and calculate the sum of "Crime Count"
grouped_data_02 = data.groupby([data["NEW_DATE"].dt.year, "Category"])
# aggregate the grouped data by sum on only "Crime Count" column
new_aggregate_data_03 = grouped_data_02.aggregate({"Crime Count": "sum"})
# change the column name of "Crime Count" to "Total Crime Count"
new_aggregate_data_03.rename(columns={"Crime Count": "Total Crime Count"}, inplace=True)
# print the new aggregate data
new_aggregate_data_03
| Total Crime Count | ||
|---|---|---|
| NEW_DATE | Category | |
| 2018 | Assault (Non-domestic) | 3828 |
| Break & Enter - Commercial | 5054 | |
| Break & Enter - Dwelling | 2693 | |
| Break & Enter - Other Premises | 2103 | |
| Commercial Robbery | 305 | |
| Street Robbery | 658 | |
| Theft FROM Vehicle | 12624 | |
| Theft OF Vehicle | 6056 | |
| Violence Other (Non-domestic) | 2011 | |
| 2019 | Assault (Non-domestic) | 3913 |
| Break & Enter - Commercial | 5782 | |
| Break & Enter - Dwelling | 2401 | |
| Break & Enter - Other Premises | 2303 | |
| Commercial Robbery | 382 | |
| Street Robbery | 649 | |
| Theft FROM Vehicle | 14723 | |
| Theft OF Vehicle | 6028 | |
| Violence Other (Non-domestic) | 1973 | |
| 2020 | Assault (Non-domestic) | 3529 |
| Break & Enter - Commercial | 4203 | |
| Break & Enter - Dwelling | 1686 | |
| Break & Enter - Other Premises | 2500 | |
| Commercial Robbery | 281 | |
| Street Robbery | 525 | |
| Theft FROM Vehicle | 12387 | |
| Theft OF Vehicle | 4649 | |
| Violence Other (Non-domestic) | 2014 | |
| 2021 | Assault (Non-domestic) | 4036 |
| Break & Enter - Commercial | 3212 | |
| Break & Enter - Dwelling | 1577 | |
| Break & Enter - Other Premises | 1835 | |
| Commercial Robbery | 359 | |
| Street Robbery | 575 | |
| Theft FROM Vehicle | 10158 | |
| Theft OF Vehicle | 4445 | |
| Violence Other (Non-domestic) | 2179 | |
| 2022 | Assault (Non-domestic) | 4139 |
| Break & Enter - Commercial | 4273 | |
| Break & Enter - Dwelling | 1586 | |
| Break & Enter - Other Premises | 1948 | |
| Commercial Robbery | 391 | |
| Street Robbery | 536 | |
| Theft FROM Vehicle | 12344 | |
| Theft OF Vehicle | 5419 | |
| Violence Other (Non-domestic) | 2084 | |
| 2023 | Assault (Non-domestic) | 4576 |
| Break & Enter - Commercial | 3344 | |
| Break & Enter - Dwelling | 1525 | |
| Break & Enter - Other Premises | 1298 | |
| Commercial Robbery | 311 | |
| Street Robbery | 693 | |
| Theft FROM Vehicle | 8951 | |
| Theft OF Vehicle | 4448 | |
| Violence Other (Non-domestic) | 2129 |