{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Filtering Data\n", "Filtering means limiting rows and/or columns. Filtering is clearly central to any data analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preliminaries\n", "I include the data import and library import commands at the start of each lesson so that the lessons are self-contained." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "bank = pd.read_csv('Data/Bank.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Boolean vectors\n", "Filtering in Pandas relies heavily on the concept of a _Boolean vectors_. To illustrate, consider the following expression:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 True\n", "3 True\n", "4 False\n", " ... \n", "203 False\n", "204 False\n", "205 False\n", "206 False\n", "207 True\n", "Name: Gender, Length: 208, dtype: bool" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bank['Gender'] == 'Female'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The expression tests whether each value of the Gender column is equal to the string \"Female\". As in R and other languages, the logical comparison operator is double equals (\"==\"). The result of the expression is a vector of trues and falses corresponding to whether each of the 209 values of Gender is equal to \"Female\".\n", "\n", "Now, once we have the vector of 209 values of true or false (the Boolean vector), we can apply that to the original data frame. If the first value in the Boolean vector is true, the first row of the data frame is returned; if the first value is false, the row is skipped.\n", "\n", "This is what it looks like:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | Employee | \n", "EducLev | \n", "JobGrade | \n", "YrHired | \n", "YrBorn | \n", "Gender | \n", "YrsPrior | \n", "PCJob | \n", "Salary | \n", "
---|---|---|---|---|---|---|---|---|---|
1 | \n", "2 | \n", "1 | \n", "1 | \n", "81 | \n", "57 | \n", "Female | \n", "1 | \n", "No | \n", "39.1 | \n", "
2 | \n", "3 | \n", "1 | \n", "1 | \n", "83 | \n", "60 | \n", "Female | \n", "0 | \n", "No | \n", "33.2 | \n", "
3 | \n", "4 | \n", "2 | \n", "1 | \n", "87 | \n", "55 | \n", "Female | \n", "7 | \n", "No | \n", "30.6 | \n", "
5 | \n", "6 | \n", "3 | \n", "1 | \n", "92 | \n", "71 | \n", "Female | \n", "0 | \n", "No | \n", "30.5 | \n", "
6 | \n", "7 | \n", "3 | \n", "1 | \n", "91 | \n", "68 | \n", "Female | \n", "0 | \n", "No | \n", "30.0 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
186 | \n", "187 | \n", "5 | \n", "5 | \n", "86 | \n", "58 | \n", "Female | \n", "2 | \n", "No | \n", "50.0 | \n", "
187 | \n", "188 | \n", "5 | \n", "5 | \n", "83 | \n", "49 | \n", "Female | \n", "2 | \n", "No | \n", "61.8 | \n", "
188 | \n", "189 | \n", "4 | \n", "5 | \n", "79 | \n", "52 | \n", "Female | \n", "0 | \n", "No | \n", "43.0 | \n", "
190 | \n", "191 | \n", "5 | \n", "5 | \n", "86 | \n", "58 | \n", "Female | \n", "6 | \n", "No | \n", "58.5 | \n", "
207 | \n", "208 | \n", "5 | \n", "6 | \n", "62 | \n", "33 | \n", "Female | \n", "0 | \n", "No | \n", "30.0 | \n", "
140 rows × 9 columns
\n", "\n", " | Employee | \n", "EducLev | \n", "JobGrade | \n", "YrHired | \n", "YrBorn | \n", "Gender | \n", "YrsPrior | \n", "PCJob | \n", "Salary | \n", "
---|---|---|---|---|---|---|---|---|---|
145 | \n", "146 | \n", "5 | \n", "4 | \n", "90 | \n", "62 | \n", "Male | \n", "3 | \n", "No | \n", "44.5 | \n", "
146 | \n", "147 | \n", "5 | \n", "4 | \n", "91 | \n", "65 | \n", "Male | \n", "1 | \n", "No | \n", "41.0 | \n", "
147 | \n", "148 | \n", "5 | \n", "4 | \n", "89 | \n", "58 | \n", "Male | \n", "3 | \n", "No | \n", "44.0 | \n", "
148 | \n", "149 | \n", "5 | \n", "4 | \n", "89 | \n", "65 | \n", "Male | \n", "0 | \n", "No | \n", "44.0 | \n", "
149 | \n", "150 | \n", "5 | \n", "4 | \n", "90 | \n", "63 | \n", "Female | \n", "4 | \n", "No | \n", "42.5 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
203 | \n", "204 | \n", "3 | \n", "6 | \n", "61 | \n", "35 | \n", "Male | \n", "0 | \n", "No | \n", "95.0 | \n", "
204 | \n", "205 | \n", "5 | \n", "6 | \n", "59 | \n", "34 | \n", "Male | \n", "0 | \n", "No | \n", "97.0 | \n", "
205 | \n", "206 | \n", "5 | \n", "6 | \n", "63 | \n", "33 | \n", "Male | \n", "0 | \n", "No | \n", "88.0 | \n", "
206 | \n", "207 | \n", "5 | \n", "6 | \n", "60 | \n", "36 | \n", "Male | \n", "0 | \n", "No | \n", "94.0 | \n", "
207 | \n", "208 | \n", "5 | \n", "6 | \n", "62 | \n", "33 | \n", "Female | \n", "0 | \n", "No | \n", "30.0 | \n", "
63 rows × 9 columns
\n", "\n", " | Employee | \n", "EducLev | \n", "JobGrade | \n", "YrHired | \n", "YrBorn | \n", "Gender | \n", "YrsPrior | \n", "PCJob | \n", "Salary | \n", "
---|---|---|---|---|---|---|---|---|---|
145 | \n", "146 | \n", "5 | \n", "4 | \n", "90 | \n", "62 | \n", "Male | \n", "3 | \n", "No | \n", "44.5 | \n", "
146 | \n", "147 | \n", "5 | \n", "4 | \n", "91 | \n", "65 | \n", "Male | \n", "1 | \n", "No | \n", "41.0 | \n", "
147 | \n", "148 | \n", "5 | \n", "4 | \n", "89 | \n", "58 | \n", "Male | \n", "3 | \n", "No | \n", "44.0 | \n", "
148 | \n", "149 | \n", "5 | \n", "4 | \n", "89 | \n", "65 | \n", "Male | \n", "0 | \n", "No | \n", "44.0 | \n", "
149 | \n", "150 | \n", "5 | \n", "4 | \n", "90 | \n", "63 | \n", "Female | \n", "4 | \n", "No | \n", "42.5 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
203 | \n", "204 | \n", "3 | \n", "6 | \n", "61 | \n", "35 | \n", "Male | \n", "0 | \n", "No | \n", "95.0 | \n", "
204 | \n", "205 | \n", "5 | \n", "6 | \n", "59 | \n", "34 | \n", "Male | \n", "0 | \n", "No | \n", "97.0 | \n", "
205 | \n", "206 | \n", "5 | \n", "6 | \n", "63 | \n", "33 | \n", "Male | \n", "0 | \n", "No | \n", "88.0 | \n", "
206 | \n", "207 | \n", "5 | \n", "6 | \n", "60 | \n", "36 | \n", "Male | \n", "0 | \n", "No | \n", "94.0 | \n", "
207 | \n", "208 | \n", "5 | \n", "6 | \n", "62 | \n", "33 | \n", "Female | \n", "0 | \n", "No | \n", "30.0 | \n", "
63 rows × 9 columns
\n", "