3. Filtering Data

Filtering means limiting rows and/or columns. Filtering is clearly central to any data analysis.

3.1. Preliminaries

I include the data import and library import commands at the start of each lesson so that the lessons are self-contained.

import pandas as pd
bank = pd.read_csv('Data/Bank.csv')

3.2. Boolean vectors

Filtering in Pandas relies heavily on the concept of a Boolean vectors. To illustrate, consider the following expression:

bank['Gender'] == 'Female'
0      False
1       True
2       True
3       True
4      False
       ...  
203    False
204    False
205    False
206    False
207     True
Name: Gender, Length: 208, dtype: bool

The expression tests whether each value of the Gender column is equal to the string “Female”. As in R and other languages, the logical comparison operator is double equals (“==”). The result of the expression is a vector of trues and falses corresponding to whether each of the 209 values of Gender is equal to “Female”.

Now, once we have the vector of 209 values of true or false (the Boolean vector), we can apply that to the original data frame. If the first value in the Boolean vector is true, the first row of the data frame is returned; if the first value is false, the row is skipped.

This is what it looks like:

bank[bank['Gender'] == "Female"]
Employee EducLev JobGrade YrHired YrBorn Gender YrsPrior PCJob Salary
1 2 1 1 81 57 Female 1 No 39.1
2 3 1 1 83 60 Female 0 No 33.2
3 4 2 1 87 55 Female 7 No 30.6
5 6 3 1 92 71 Female 0 No 30.5
6 7 3 1 91 68 Female 0 No 30.0
... ... ... ... ... ... ... ... ... ...
186 187 5 5 86 58 Female 2 No 50.0
187 188 5 5 83 49 Female 2 No 61.8
188 189 4 5 79 52 Female 0 No 43.0
190 191 5 5 86 58 Female 6 No 58.5
207 208 5 6 62 33 Female 0 No 30.0

140 rows × 9 columns

If you find this notation too cumbersome, you can assign the results to a separate data frame that contains only the 140 female employees:

FemaleEmployees = bank[bank['Gender'] == "Female"]
type(FemaleEmployees)
pandas.core.frame.DataFrame

The Python type() method is used to make sure the result is a Pandas data frame. It is, so we can calculate the mean salary for female employees:

FemaleEmployees['Salary'].mean()
37.20992857142858

Python has some basic built-in functions that can be applied to the core data types, such as integers, floating point numbers, and so on. For example, if I want to take the result of mean() and round it to two decimals, I can wrap the whole expression inside the round() function:

round(FemaleEmployees['Salary'].mean(),2)
37.21

3.3. Complex filtering criteria

Boolean vectors can be created by combining conditions with & (and) and | (or). The only trick is that each condition must be in parentheses:

(bank['Gender'] == 'Female') & (bank['JobGrade'] == 1)
0      False
1       True
2       True
3       True
4      False
       ...  
203    False
204    False
205    False
206    False
207    False
Length: 208, dtype: bool

The vector can then be applied to the whole data set to filter the data frame to female employees with job grade 1. Rather than listing the results, I call the shape property to confirm that only 48 employees are included in the resulting vector.

bank[(bank['Gender'] == 'Female') & (bank['JobGrade'] == 1)].shape
(48, 9)

3.4. Filtering by a list

It is not uncommon when you have categorical data to need to filter or recode on a specific list of values. To reuse the example used previously, assume we want to create a list of managerial employees. The easiest way to do this is to use a greater-than condition:

bank[bank['JobGrade'] >= 4]
Employee EducLev JobGrade YrHired YrBorn Gender YrsPrior PCJob Salary
145 146 5 4 90 62 Male 3 No 44.5
146 147 5 4 91 65 Male 1 No 41.0
147 148 5 4 89 58 Male 3 No 44.0
148 149 5 4 89 65 Male 0 No 44.0
149 150 5 4 90 63 Female 4 No 42.5
... ... ... ... ... ... ... ... ... ...
203 204 3 6 61 35 Male 0 No 95.0
204 205 5 6 59 34 Male 0 No 97.0
205 206 5 6 63 33 Male 0 No 88.0
206 207 5 6 60 36 Male 0 No 94.0
207 208 5 6 62 33 Female 0 No 30.0

63 rows × 9 columns

But an alternative approach (and the only approach that works with categorical data) is to create a list and use the isin() method to check membership in the list. This gives the same result as above.

mgmt = [4,5,6]
bank[bank['JobGrade'].isin(mgmt)]
Employee EducLev JobGrade YrHired YrBorn Gender YrsPrior PCJob Salary
145 146 5 4 90 62 Male 3 No 44.5
146 147 5 4 91 65 Male 1 No 41.0
147 148 5 4 89 58 Male 3 No 44.0
148 149 5 4 89 65 Male 0 No 44.0
149 150 5 4 90 63 Female 4 No 42.5
... ... ... ... ... ... ... ... ... ...
203 204 3 6 61 35 Male 0 No 95.0
204 205 5 6 59 34 Male 0 No 97.0
205 206 5 6 63 33 Male 0 No 88.0
206 207 5 6 60 36 Male 0 No 94.0
207 208 5 6 62 33 Female 0 No 30.0

63 rows × 9 columns