3. Filtering Data¶

Filtering means limiting rows and/or columns. Filtering is clearly central to any data analysis.

3.1. Preliminaries¶

I include the data import and library import commands at the start of each lesson so that the lessons are self-contained.

import pandas as pd
bank = pd.read_csv('Data/Bank.csv')

3.2. Boolean vectors¶

Filtering in Pandas relies heavily on the concept of a Boolean vectors. To illustrate, consider the following expression:

bank['Gender'] == 'Female'

0      False
1       True
2       True
3       True
4      False
       ...  
203    False
204    False
205    False
206    False
207     True
Name: Gender, Length: 208, dtype: bool

The expression tests whether each value of the Gender column is equal to the string “Female”. As in R and other languages, the logical comparison operator is double equals (“==”). The result of the expression is a vector of trues and falses corresponding to whether each of the 209 values of Gender is equal to “Female”.

Now, once we have the vector of 209 values of true or false (the Boolean vector), we can apply that to the original data frame. If the first value in the Boolean vector is true, the first row of the data frame is returned; if the first value is false, the row is skipped.

This is what it looks like:

bank[bank['Gender'] == "Female"]

	Employee	EducLev	JobGrade	YrHired	YrBorn	Gender	YrsPrior	PCJob	Salary
1	2	1	1	81	57	Female	1	No	39.1
2	3	1	1	83	60	Female	0	No	33.2
3	4	2	1	87	55	Female	7	No	30.6
5	6	3	1	92	71	Female	0	No	30.5
6	7	3	1	91	68	Female	0	No	30.0
...	...	...	...	...	...	...	...	...	...
186	187	5	5	86	58	Female	2	No	50.0
187	188	5	5	83	49	Female	2	No	61.8
188	189	4	5	79	52	Female	0	No	43.0
190	191	5	5	86	58	Female	6	No	58.5
207	208	5	6	62	33	Female	0	No	30.0

140 rows × 9 columns

If you find this notation too cumbersome, you can assign the results to a separate data frame that contains only the 140 female employees:

FemaleEmployees = bank[bank['Gender'] == "Female"]
type(FemaleEmployees)

pandas.core.frame.DataFrame

The Python type() method is used to make sure the result is a Pandas data frame. It is, so we can calculate the mean salary for female employees:

FemaleEmployees['Salary'].mean()

37.20992857142858

Python has some basic built-in functions that can be applied to the core data types, such as integers, floating point numbers, and so on. For example, if I want to take the result of mean() and round it to two decimals, I can wrap the whole expression inside the round() function:

round(FemaleEmployees['Salary'].mean(),2)

37.21

3.3. Complex filtering criteria¶

Boolean vectors can be created by combining conditions with & (and) and | (or). The only trick is that each condition must be in parentheses:

(bank['Gender'] == 'Female') & (bank['JobGrade'] == 1)

0      False
1       True
2       True
3       True
4      False
       ...  
203    False
204    False
205    False
206    False
207    False
Length: 208, dtype: bool

The vector can then be applied to the whole data set to filter the data frame to female employees with job grade 1. Rather than listing the results, I call the shape property to confirm that only 48 employees are included in the resulting vector.

bank[(bank['Gender'] == 'Female') & (bank['JobGrade'] == 1)].shape

(48, 9)

3.4. Filtering by a list¶

It is not uncommon when you have categorical data to need to filter or recode on a specific list of values. To reuse the example used previously, assume we want to create a list of managerial employees. The easiest way to do this is to use a greater-than condition:

bank[bank['JobGrade'] >= 4]

	Employee	EducLev	JobGrade	YrHired	YrBorn	Gender	YrsPrior	PCJob	Salary
145	146	5	4	90	62	Male	3	No	44.5
146	147	5	4	91	65	Male	1	No	41.0
147	148	5	4	89	58	Male	3	No	44.0
148	149	5	4	89	65	Male	0	No	44.0
149	150	5	4	90	63	Female	4	No	42.5
...	...	...	...	...	...	...	...	...	...
203	204	3	6	61	35	Male	0	No	95.0
204	205	5	6	59	34	Male	0	No	97.0
205	206	5	6	63	33	Male	0	No	88.0
206	207	5	6	60	36	Male	0	No	94.0
207	208	5	6	62	33	Female	0	No	30.0

63 rows × 9 columns

But an alternative approach (and the only approach that works with categorical data) is to create a list and use the isin() method to check membership in the list. This gives the same result as above.

mgmt = [4,5,6]
bank[bank['JobGrade'].isin(mgmt)]

	Employee	EducLev	JobGrade	YrHired	YrBorn	Gender	YrsPrior	PCJob	Salary
145	146	5	4	90	62	Male	3	No	44.5
146	147	5	4	91	65	Male	1	No	41.0
147	148	5	4	89	58	Male	3	No	44.0
148	149	5	4	89	65	Male	0	No	44.0
149	150	5	4	90	63	Female	4	No	42.5
...	...	...	...	...	...	...	...	...	...
203	204	3	6	61	35	Male	0	No	95.0
204	205	5	6	59	34	Male	0	No	97.0
205	206	5	6	63	33	Male	0	No	88.0
206	207	5	6	60	36	Male	0	No	94.0
207	208	5	6	62	33	Female	0	No	30.0