{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Filtering Data\n", "Filtering means limiting rows and/or columns. Filtering is clearly central to any data analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preliminaries\n", "I include the data import and library import commands at the start of each lesson so that the lessons are self-contained." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "bank = pd.read_csv('Data/Bank.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Boolean vectors\n", "Filtering in Pandas relies heavily on the concept of a _Boolean vectors_. To illustrate, consider the following expression:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 True\n", "3 True\n", "4 False\n", " ... \n", "203 False\n", "204 False\n", "205 False\n", "206 False\n", "207 True\n", "Name: Gender, Length: 208, dtype: bool" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bank['Gender'] == 'Female'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The expression tests whether each value of the Gender column is equal to the string \"Female\". As in R and other languages, the logical comparison operator is double equals (\"==\"). The result of the expression is a vector of trues and falses corresponding to whether each of the 209 values of Gender is equal to \"Female\".\n", "\n", "Now, once we have the vector of 209 values of true or false (the Boolean vector), we can apply that to the original data frame. If the first value in the Boolean vector is true, the first row of the data frame is returned; if the first value is false, the row is skipped.\n", "\n", "This is what it looks like:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
EmployeeEducLevJobGradeYrHiredYrBornGenderYrsPriorPCJobSalary
12118157Female1No39.1
23118360Female0No33.2
34218755Female7No30.6
56319271Female0No30.5
67319168Female0No30.0
..............................
186187558658Female2No50.0
187188558349Female2No61.8
188189457952Female0No43.0
190191558658Female6No58.5
207208566233Female0No30.0
\n", "

140 rows × 9 columns

\n", "
" ], "text/plain": [ " Employee EducLev JobGrade YrHired YrBorn Gender YrsPrior PCJob \\\n", "1 2 1 1 81 57 Female 1 No \n", "2 3 1 1 83 60 Female 0 No \n", "3 4 2 1 87 55 Female 7 No \n", "5 6 3 1 92 71 Female 0 No \n", "6 7 3 1 91 68 Female 0 No \n", ".. ... ... ... ... ... ... ... ... \n", "186 187 5 5 86 58 Female 2 No \n", "187 188 5 5 83 49 Female 2 No \n", "188 189 4 5 79 52 Female 0 No \n", "190 191 5 5 86 58 Female 6 No \n", "207 208 5 6 62 33 Female 0 No \n", "\n", " Salary \n", "1 39.1 \n", "2 33.2 \n", "3 30.6 \n", "5 30.5 \n", "6 30.0 \n", ".. ... \n", "186 50.0 \n", "187 61.8 \n", "188 43.0 \n", "190 58.5 \n", "207 30.0 \n", "\n", "[140 rows x 9 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bank[bank['Gender'] == \"Female\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you find this notation too cumbersome, you can assign the results to a separate data frame that contains only the 140 female employees:\n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "FemaleEmployees = bank[bank['Gender'] == \"Female\"]\n", "type(FemaleEmployees)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Python `type()` method is used to make sure the result is a Pandas data frame. It is, so we can calculate the mean salary for female employees:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "37.20992857142858" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "FemaleEmployees['Salary'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python has some basic built-in functions that can be applied to the core data types, such as integers, floating point numbers, and so on. For example, if I want to take the result of `mean()` and round it to two decimals, I can wrap the whole expression inside the `round()` function:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "37.21" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "round(FemaleEmployees['Salary'].mean(),2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Complex filtering criteria\n", "Boolean vectors can be created by combining conditions with & (and) and | (or). The only trick is that each condition must be in parentheses:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 True\n", "3 True\n", "4 False\n", " ... \n", "203 False\n", "204 False\n", "205 False\n", "206 False\n", "207 False\n", "Length: 208, dtype: bool" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(bank['Gender'] == 'Female') & (bank['JobGrade'] == 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The vector can then be applied to the whole data set to filter the data frame to female employees with job grade 1. Rather than listing the results, I call the `shape` property to confirm that only 48 employees are included in the resulting vector.\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(48, 9)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bank[(bank['Gender'] == 'Female') & (bank['JobGrade'] == 1)].shape\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering by a list\n", "It is not uncommon when you have categorical data to need to filter or recode on a specific list of values. To reuse the example used previously, assume we want to create a list of managerial employees. The easiest way to do this is to use a greater-than condition:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
EmployeeEducLevJobGradeYrHiredYrBornGenderYrsPriorPCJobSalary
145146549062Male3No44.5
146147549165Male1No41.0
147148548958Male3No44.0
148149548965Male0No44.0
149150549063Female4No42.5
..............................
203204366135Male0No95.0
204205565934Male0No97.0
205206566333Male0No88.0
206207566036Male0No94.0
207208566233Female0No30.0
\n", "

63 rows × 9 columns

\n", "
" ], "text/plain": [ " Employee EducLev JobGrade YrHired YrBorn Gender YrsPrior PCJob \\\n", "145 146 5 4 90 62 Male 3 No \n", "146 147 5 4 91 65 Male 1 No \n", "147 148 5 4 89 58 Male 3 No \n", "148 149 5 4 89 65 Male 0 No \n", "149 150 5 4 90 63 Female 4 No \n", ".. ... ... ... ... ... ... ... ... \n", "203 204 3 6 61 35 Male 0 No \n", "204 205 5 6 59 34 Male 0 No \n", "205 206 5 6 63 33 Male 0 No \n", "206 207 5 6 60 36 Male 0 No \n", "207 208 5 6 62 33 Female 0 No \n", "\n", " Salary \n", "145 44.5 \n", "146 41.0 \n", "147 44.0 \n", "148 44.0 \n", "149 42.5 \n", ".. ... \n", "203 95.0 \n", "204 97.0 \n", "205 88.0 \n", "206 94.0 \n", "207 30.0 \n", "\n", "[63 rows x 9 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bank[bank['JobGrade'] >= 4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But an alternative approach (and the _only_ approach that works with categorical data) is to create a list and use the `isin()` method to check membership in the list. This gives the same result as above." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
EmployeeEducLevJobGradeYrHiredYrBornGenderYrsPriorPCJobSalary
145146549062Male3No44.5
146147549165Male1No41.0
147148548958Male3No44.0
148149548965Male0No44.0
149150549063Female4No42.5
..............................
203204366135Male0No95.0
204205565934Male0No97.0
205206566333Male0No88.0
206207566036Male0No94.0
207208566233Female0No30.0
\n", "

63 rows × 9 columns

\n", "
" ], "text/plain": [ " Employee EducLev JobGrade YrHired YrBorn Gender YrsPrior PCJob \\\n", "145 146 5 4 90 62 Male 3 No \n", "146 147 5 4 91 65 Male 1 No \n", "147 148 5 4 89 58 Male 3 No \n", "148 149 5 4 89 65 Male 0 No \n", "149 150 5 4 90 63 Female 4 No \n", ".. ... ... ... ... ... ... ... ... \n", "203 204 3 6 61 35 Male 0 No \n", "204 205 5 6 59 34 Male 0 No \n", "205 206 5 6 63 33 Male 0 No \n", "206 207 5 6 60 36 Male 0 No \n", "207 208 5 6 62 33 Female 0 No \n", "\n", " Salary \n", "145 44.5 \n", "146 41.0 \n", "147 44.0 \n", "148 44.0 \n", "149 42.5 \n", ".. ... \n", "203 95.0 \n", "204 97.0 \n", "205 88.0 \n", "206 94.0 \n", "207 30.0 \n", "\n", "[63 rows x 9 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mgmt = [4,5,6]\n", "bank[bank['JobGrade'].isin(mgmt)]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }