{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Gap Analysis with Categorical Variables\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preliminaries¶\n",
    "I include the data import and library import commands at the start of each lesson so that the lessons are self-contained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from scipy import stats\n",
    "bank = pd.read_csv('Data/Bank.csv')\n",
    "\n",
    "# Recode JobGrade to Manager\n",
    "grades = [1,2,3,4,5,6]\n",
    "status = [\"non-mgmt\", \"non-mgmt\", \"non-mgmt\", \"non-mgmt\", \"mgmt\", \"mgmt\"]\n",
    "bank['Manager'] = bank['JobGrade'].replace(grades, status)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating a contingency table\n",
    "Pandas has a very simple contingency table feature.  Below, I specify the two variables of interest (Gender and Manager) and set `margins=True` so I get marginal totals (\"All\")."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Manager</th>\n",
       "      <th>mgmt</th>\n",
       "      <th>non-mgmt</th>\n",
       "      <th>All</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Gender</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Female</th>\n",
       "      <td>10</td>\n",
       "      <td>130</td>\n",
       "      <td>140</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Male</th>\n",
       "      <td>25</td>\n",
       "      <td>43</td>\n",
       "      <td>68</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>All</th>\n",
       "      <td>35</td>\n",
       "      <td>173</td>\n",
       "      <td>208</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Manager  mgmt  non-mgmt  All\n",
       "Gender                      \n",
       "Female     10       130  140\n",
       "Male       25        43   68\n",
       "All        35       173  208"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "contab_freq = pd.crosstab(\n",
    "    bank['Gender'],\n",
    "    bank['Manager'],\n",
    "    margins = True\n",
    "   )\n",
    "contab_freq"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Showing row percentages\n",
    "Typically, showing frequencies is less useful than relative frequencies.  Here, I am interested in the row percentages: what is the probability that a female is a manager versus the probability a male is a manager.\n",
    "\n",
    "We can get relative frequencies using the `normalize` argument. If `normalize  = True`, then we get the relative frequency in each cell relative to the total number of employees.  This is not very useful.  What we want instead is to normalize by row.  The parameter for this is: `normalize = 'index'`.  Why \"index\" instead of \"row\"?  Because each row has a row number (or index)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>Manager</th>\n",
       "      <th>mgmt</th>\n",
       "      <th>non-mgmt</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Gender</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Female</th>\n",
       "      <td>0.071429</td>\n",
       "      <td>0.928571</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Male</th>\n",
       "      <td>0.367647</td>\n",
       "      <td>0.632353</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>All</th>\n",
       "      <td>0.168269</td>\n",
       "      <td>0.831731</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Manager      mgmt  non-mgmt\n",
       "Gender                     \n",
       "Female   0.071429  0.928571\n",
       "Male     0.367647  0.632353\n",
       "All      0.168269  0.831731"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "conttab_relfreq = pd.crosstab(\n",
    "    bank['Gender'],\n",
    "    bank['Manager'],\n",
    "    margins = True,\n",
    "    normalize='index'\n",
    "   )\n",
    "conttab_relfreq"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, each row sums to 100%.  Thus, for the total set of female employees, 7\\% are managers and 94\\% are non-managers.  For males, 37\\% are managers and 63\\% are non-managers.  The advantage of this presentation is that these percentages are directly comparable even though the majority (140/208) employees of the bank are female.\n",
    "\n",
    "## Chi-squared test of independence\n",
    "The row percentages leave us with the impression that managerial status _depends_ on gender.  We can test this more formally using the $\\chi^2$ (/ˈkaɪ skweə(r)) test of independence.\n",
    "\n",
    "Scipy has a method called `chi2_contingency()` that takes a contingency table of observed frequencies as input. Note that this table cannot include marginal totals or marginal frequencies.  Instead, it must consist of _m_ x _n_ observations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(26.617776266575998,\n",
       " 2.479518719230249e-07,\n",
       " 1,\n",
       " array([[ 23.55769231, 116.44230769],\n",
       "        [ 11.44230769,  56.55769231]]))"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "contab_obs = pd.crosstab(\n",
    "    bank['Gender'],\n",
    "    bank['Manager'],\n",
    "    margins = False)\n",
    "chi = stats.chi2_contingency(contab_obs)\n",
    "chi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output of the `chi2_contingency()` method is not particularly attractive but it contains what we need:\n",
    "1. The first line is the $\\chi^2$ statistic, which we can safely ignore\n",
    "2. The second line is the probability of getting a $\\chi^2$ statistic that large _if_ the two variables are independent. This _p_-value is very small ($10^{-7}$) so we conclude there is almost zero chance that gender and managerial status are independent at this bank.\n",
    "3. The third line is the degrees of freedom, which we can safely ignore.\n",
    "4. The remainder of the output is a matrix showing the _expected_ frequencies under the assumption in independence.  These expected values are quite different from the observed values above.\n",
    "\n",
    "You may notice that the $\\chi^2$ statistic and _p_-value are different from those provided by R.  This is because scipy defaults to the \"Pearson's Chi-squared test with Yates' continuity correction\" version of the test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}