{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Importing Data\n", "One of the most magical things about R is its ability to handle matrices of data using the data frame object (or tibble in the Tidyverse). The `pandas` library gives Python much the same functionality.\n", "\n", "Recall that a data frame is an object consisting of rows and columns (just like a spreadsheet or database table). In addition to storing the data, the data frame object also has useful properties and methods that can be accessed through code. Thus, the first step in data analysis is load data from a file into a data frame object.\n", "\n", "## Loading the Pandas library\n", "A Python library contains things (objects, code) that is not part of the core Python language but is nonetheless useful to some community of users. Libraries save us from re-inventing the wheel: Once someone has created a library and made it available for download, we can use the contents of the library in our own Python programs.\n", "\n", "For example, to use the `pandas.DataFrame` object in our programs, we must first import the Pandas library into our environment.\n", "\n", "Start by creating a new notebook in Jupyter and moving to the first cell:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "import pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is now possible to directly reference `pandas.DataFrame` in my scripts and Jupyter will know what I am talking about. Note, however, that programmers are lazy, and typing the word \"pandas\" over and over again is seen as an unnecessary burden. As such, programmers typically _alias_ the library when they import it:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, \"pd\" is the alias for \"pandas\". 
So I can write `pd.DataFrame` in my code rather than `pandas.DataFrame`. Not sure what I will do with all the time I save.\n", "\n", "Finally, if I plan to use the DataFrame object frequently in my code, I can call it out by name using the `from` keyword:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "from pandas import DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now I can type `DataFrame` rather than `pd.DataFrame`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a DataFrame object\n", "Python is an object-oriented programming language. You create an _instance_ of an _object_ by calling its _constructor_, typically with some arguments.\n", "\n", "To illustrate, let's create a data frame called \"bank\" that contains data from a CSV file stored on my computer in the \"Data\" folder. To do this, we call the Pandas `read_csv` method:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
EmployeeEducLevJobGradeYrHiredYrBornGenderYrsPriorPCJobSalary
01319269Male1No32.0
12118157Female1No39.1
23118360Female0No33.2
34218755Female7No30.6
45319267Male0No29.0
\n", "
" ], "text/plain": [ " Employee EducLev JobGrade YrHired YrBorn Gender YrsPrior PCJob \\\n", "0 1 3 1 92 69 Male 1 No \n", "1 2 1 1 81 57 Female 1 No \n", "2 3 1 1 83 60 Female 0 No \n", "3 4 2 1 87 55 Female 7 No \n", "4 5 3 1 92 67 Male 0 No \n", "\n", " Salary \n", "0 32.0 \n", "1 39.1 \n", "2 33.2 \n", "3 30.6 \n", "4 29.0 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bank = pd.read_csv('Data/Bank.csv')\n", "bank.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the `head(n)` method can be called on the new DataFrame `bank`. Head simply prints out the first _n_ rows of the data frame so you can see if it imported okay." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Hint:\n", " Pointing to a file on your computer is always tricky when you don't have a pop-up file system chooser like you do in most Windows applications. Some basic hints:\n", "
    \n", "
  1. If practical, put the data file in the same folder as your Jupyter notebook. In the example above, I created a subfolder called "Data" and put the bank data there. This makes it easier for you to remember (and type) the path to the data file.
  2. If your data must reside somewhere else on your computer or network: be very careful when typing. Cut and paste the path from Windows Explorer if required.
  3. Beware the slashes/backslashes. Unix-like operating systems separate directory names with a slash "/" whereas Microsoft Windows uses a backslash "\". In general, Python separates folders with forward slashes. The backslash is used in Python as the 'escape' character. If you cut and paste from Windows Explorer, you must either change all the backslashes to slashes or add the "r" prefix to your pathname to indicate a "raw" string of text (Python will ignore the special meaning of backslashes). So either of the following should work for a more complex file location:
    1. pd.read_csv('C:/Users/Michael Brydon/Data/Bank.csv') — all forward slashes
    2. pd.read_csv(r'C:\Users\Michael Brydon\Data/Bank.csv') — mixture of slashes and backslashes with the "r" prefix
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a DataFrame object from an Excel file\n", "The Pandas library supports importing from many different data sources, including CSV, Excel, JSON, or even direct database access over a network.\n", "\n", "Here we use `read_excel` instead of `read_csv`. It is the same data, but in a different file format on my computer:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
EmployeeEducLevJobGradeYrHiredYrBornGenderYrsPriorPCJobSalary
01319269Male1No32.0
12118157Female1No39.1
23118360Female0No33.2
34218755Female7No30.6
45319267Male0No29.0
\n", "
" ], "text/plain": [ " Employee EducLev JobGrade YrHired YrBorn Gender YrsPrior PCJob \\\n", "0 1 3 1 92 69 Male 1 No \n", "1 2 1 1 81 57 Female 1 No \n", "2 3 1 1 83 60 Female 0 No \n", "3 4 2 1 87 55 Female 7 No \n", "4 5 3 1 92 67 Male 0 No \n", "\n", " Salary \n", "0 32.0 \n", "1 39.1 \n", "2 33.2 \n", "3 30.6 \n", "4 29.0 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bankxl = pd.read_excel('Data/Bank.xlsx')\n", "bankxl.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Running `head()` after importing is a good idea just to make sure you imported what you think you imported. If not, you will have to either:\n", "- edit your source file (Excel) to make sure it has the necessary tabular structure (with the first row as headings, and so on)\n", "- add arguments to your `read_...` method to make sure you are starting at the correct row, reading the correct spreadsheet, and so on.\n", "\n", "Here is where Google-ing and learning Python go hand in hand. If worst comes to worst, read [the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Confirming data frame properties\n", "As noted previously, objects in Python expose useful properties and methods. For example, we can confirm the size of the `bank` data frame with the `shape` property, which gives us the number of rows (209) and columns (9):" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(208, 9)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bank.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the `describe()` method to generate some summary statistics:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
EmployeeEducLevJobGradeYrHiredYrBornYrsPriorSalary
count208.000000208.000000208.000000208.000000208.000000208.000000208.000000
mean104.5000003.1586542.75961585.32692354.6057692.37500039.921923
std60.1885921.4674641.5665296.98783210.3189883.13523711.256154
min1.0000001.0000001.00000056.00000030.0000000.00000026.700000
25%52.7500002.0000001.00000082.00000047.7500000.00000033.000000
50%104.5000003.0000003.00000087.00000056.5000001.00000037.000000
75%156.2500005.0000004.00000090.00000063.0000004.00000044.000000
max208.0000005.0000006.00000093.00000073.00000018.00000097.000000
\n", "
" ], "text/plain": [ " Employee EducLev JobGrade YrHired YrBorn YrsPrior \\\n", "count 208.000000 208.000000 208.000000 208.000000 208.000000 208.000000 \n", "mean 104.500000 3.158654 2.759615 85.326923 54.605769 2.375000 \n", "std 60.188592 1.467464 1.566529 6.987832 10.318988 3.135237 \n", "min 1.000000 1.000000 1.000000 56.000000 30.000000 0.000000 \n", "25% 52.750000 2.000000 1.000000 82.000000 47.750000 0.000000 \n", "50% 104.500000 3.000000 3.000000 87.000000 56.500000 1.000000 \n", "75% 156.250000 5.000000 4.000000 90.000000 63.000000 4.000000 \n", "max 208.000000 5.000000 6.000000 93.000000 73.000000 18.000000 \n", "\n", " Salary \n", "count 208.000000 \n", "mean 39.921923 \n", "std 11.256154 \n", "min 26.700000 \n", "25% 33.000000 \n", "50% 37.000000 \n", "75% 44.000000 \n", "max 97.000000 " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bank.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, notice the difference between a _property_ like `shape` with no parentheses and a _method_ like `describe()` with parentheses for passing arguments." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Hint:\n", " Press the Shift-Tab key while within the parantheses of a Python method for a list of the possible arguments. Hit Shift-Tab twice to get more detailed help.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Second, by default, the `describe()` method only summarizes the numerical columns. Recall that our data frame consists of nine columns. To see the issue, run the `info()` method:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 208 entries, 0 to 207\n", "Data columns (total 9 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Employee 208 non-null int64 \n", " 1 EducLev 208 non-null int64 \n", " 2 JobGrade 208 non-null int64 \n", " 3 YrHired 208 non-null int64 \n", " 4 YrBorn 208 non-null int64 \n", " 5 Gender 208 non-null object \n", " 6 YrsPrior 208 non-null int64 \n", " 7 PCJob 208 non-null object \n", " 8 Salary 208 non-null float64\n", "dtypes: float64(1), int64(6), object(2)\n", "memory usage: 14.8+ KB\n" ] } ], "source": [ "bank.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we see the nine columns in the data frame listed by data type. Six are integer values, one (Salary) is a floating point number, and two (Gender, PCJob) are \"Objects\" (string objects):" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting strings to categories\n", "Recall in R that we made a distinction between _strings_ (character data) like a person's name and *factors*, which look like strings but have a limited set of repeated values. Color is a factor, for example, because it takes on a few known values like \"red\", \"green\", and \"blue\". Similarly, a column called `car_brand` would be a factor because it would contain a finite number of string values, such as \"Ford\", \"Tesla\", \"Ferarri\".\n", "\n", "Pandas permits the same distinction, but instead uses the term _category_ instead of _factor_. 
This is telling: R uses the old statistical technical term \"factor\" whereas Pandas/Python uses the more straightforward term \"category\". This is the difference between the two languages in a nutshell.\n", "\n", "The process for replacing the two (string) \"Object\" columns with categories is similar to the one we used in R. The key is understanding how to reference columns in Python. Two possibilities:\n", "- square bracket notation: `bank['Gender']`\n", "- dot notation: `bank.Gender`\n", "\n", "Of these two, square bracket notation is slightly more flexible because it permits column names with spaces, e.g., `dataframe['column name']`. The dot notation of this would fail because Python has no way of knowing what the space after \"column\" means: `dataframe.column name`.\n", "\n", "Once we know how to reference a column (or a \"Series\" in Pandas-speak), we can run the type conversion method and specify \"category\" as the output data type:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Male\n", "1 Female\n", "2 Female\n", "3 Female\n", "4 Male\n", " ... \n", "203 Male\n", "204 Male\n", "205 Male\n", "206 Male\n", "207 Female\n", "Name: Gender, Length: 208, dtype: category\n", "Categories (2, object): [Female, Male]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bank['Gender'].astype('category')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note at the bottom of the summary that the `Gender` category consists of two values: Female and Male.\n", "\n", "Of course, running the command above just lists the converted values; it does not do anything to the \"bank\" data frame. 
To _replace_ the existing column in the data frame, we use the assignment operator:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "bank['Gender'] = bank['Gender'].astype('category')\n", "bank['PCJob'] = bank['PCJob'].astype('category')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can confirm the result by running the `describe()` method again, but this time using an argument to ensure we get only the categorical columns:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GenderPCJob
count208208
unique22
topFemaleNo
freq140189
\n", "
" ], "text/plain": [ " Gender PCJob\n", "count 208 208\n", "unique 2 2\n", "top Female No\n", "freq 140 189" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bank.describe(include='category')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This output tells us that we have 208 rows of data. Both the \"Gender\" and \"PCJob\" columns have two unique categorical values and the most frequent are \"Female\" and \"No\" respectively. 140 of the 208 employees are female while 189/208 do not have PC jobs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Type conversion using programming constructs\n", "As an aside, it is possible to do the same type conversions _en masse_ in Python using core programming constructs such as lists, iterators, and looping. Specifically, for a larger data set with many columns, it might make sense to do the following:\n", "1. make a list of the columns to convert\n", "2. iterate through each column making the conversion assignment, as above\n", "3. 
confirm the conversions by running the `info()` method\n", "\n", "Use of an iterated list is not really worth it in this case (two lines of code using either approach):" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 208 entries, 0 to 207\n", "Data columns (total 9 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Employee 208 non-null int64 \n", " 1 EducLev 208 non-null int64 \n", " 2 JobGrade 208 non-null int64 \n", " 3 YrHired 208 non-null int64 \n", " 4 YrBorn 208 non-null int64 \n", " 5 Gender 208 non-null category\n", " 6 YrsPrior 208 non-null int64 \n", " 7 PCJob 208 non-null category\n", " 8 Salary 208 non-null float64 \n", "dtypes: category(2), float64(1), int64(6)\n", "memory usage: 12.1 KB\n" ] } ], "source": [ "for col in ['Gender', 'PCJob']:\n", " bankxl[col] = bankxl[col].astype('category')\n", " \n", "bankxl.info()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }
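The loop-based conversion can be sketched end to end with a small, made-up stand-in for the bank data (the values below are illustrative only):

```python
import pandas as pd

# A tiny, hypothetical data frame with two string columns
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female'],
    'PCJob': ['No', 'No', 'Yes'],
    'Salary': [32.0, 39.1, 33.2],
})

# Step 1: list the columns to convert;
# steps 2-3: loop over the list, converting each column in turn
for col in ['Gender', 'PCJob']:
    df[col] = df[col].astype('category')

print(df.dtypes)  # Gender and PCJob are now 'category'
```

With only two columns the loop saves nothing, but the same pattern scales to dozens of columns without any extra lines of code.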