Lesson 7 Gap Analysis with Categorical Variables
The focus of this tutorial is on “associations” between two categorical variables. The scenario we will examine is the following:
Are managerial status and gender at the bank related?
Naturally, we should expect managerial status and gender to be independent—there should be no relationship between the two. If there is a relationship between the two, we have a gap (a situation that requires managerial attention).
7.1 Categorical Variables
7.1.1 Preliminaries
The two variables we are interested in (managerial status and gender) are categorical. We already have a text-based gender variable in the Bank data set. We can map the ordinal variable JobGrade to a binary categorical variable using the procedure described in the lesson on recoding. I call the new variable “Mgmt.text” to indicate that it is a text variable.
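A minimal sketch of such a recoding, using a small stand-in data frame rather than the real Bank data. The name bank_demo and the cutoff of JobGrade 4 and above for management are assumptions made purely for illustration; substitute whatever rule the recoding lesson defines.

```r
# Toy stand-in for the Bank data; only the columns needed here.
bank_demo <- data.frame(
  Gender   = c("Female", "Male", "Female", "Male"),
  JobGrade = c(1, 5, 4, 2)
)

# ASSUMPTION: grades at or above `mgmt_cutoff` count as management.
mgmt_cutoff <- 4
bank_demo$Mgmt.text <- ifelse(bank_demo$JobGrade >= mgmt_cutoff, "mgmt", "non-mgmt")

bank_demo
```

Storing the result as text (rather than 0/1) means the labels appear directly in the tables that follow.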
7.2 Tables in R
7.2.1 Simple Table
R has a built-in table function that creates an n x m “pivot table”. In this case, the table will show the number of employees in each combination of managerial status and gender. The table function needs to know the two categorical variables (in this case, Bank$Gender and Bank$Mgmt.text):
##
## mgmt non-mgmt
## Female 10 130
## Male 25 43
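If you do not have the Bank data loaded, the counts above can be reproduced from a stand-in data frame. The sketch below rebuilds one row per employee from the four cell counts shown in the output and passes the two columns to table. The names bank_demo and demo_table are hypothetical and not part of the original data set.

```r
# Rebuild one row per employee from the four cell counts in the table above.
cell_counts <- c(10, 130, 25, 43)
bank_demo <- data.frame(
  Gender    = rep(c("Female", "Female", "Male", "Male"), times = cell_counts),
  Mgmt.text = rep(c("mgmt", "non-mgmt", "mgmt", "non-mgmt"), times = cell_counts)
)

# The first argument defines the rows, the second the columns.
demo_table <- table(bank_demo$Gender, bank_demo$Mgmt.text)
demo_table
```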
7.2.2 Frequency Versus Proportions
The table above provides a good summary of the data, but it is not useful for comparison purposes because we have different numbers of male and female employees. To assess whether the two variables are independent, we need to look at proportions. The easiest way to do this in R is to assign the table to a new object and use the table object as input to other R functions:
mytable <- table(Bank$Gender, Bank$Mgmt.text)
addmargins(mytable, margin=c(1,2)) ## just to show the marginal frequencies
##
## mgmt non-mgmt Sum
## Female 10 130 140
## Male 25 43 68
## Sum 35 173 208
##
## mgmt non-mgmt
## Female 0.04807692 0.62500000
## Male 0.12019231 0.20673077
By default, prop.table calculates the relative frequency of each cell. For example, there are 130 employees who are both female and non-mgmt out of a total of 208 employees. The relative frequency of Female + non-mgmt is thus 130/208 = 0.625.
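The 130/208 calculation can be checked directly. The sketch below re-enters the cell counts by hand so that it runs without the Bank data; with the data loaded, mytable would come from table(Bank$Gender, Bank$Mgmt.text) instead. The name cell_props is introduced here for illustration.

```r
# Cell counts copied from the frequency table (matrix() fills column by column).
mytable <- as.table(matrix(
  c(10, 25, 130, 43), nrow = 2,
  dimnames = list(c("Female", "Male"), c("mgmt", "non-mgmt"))
))

# With no margin argument, each cell is divided by the grand total (208).
cell_props <- prop.table(mytable)
cell_props["Female", "non-mgmt"]   # 130 / 208
sum(cell_props)                    # all cells together sum to 1
```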
A more useful measure in this case is the row percentages: of the 140 females, what proportion are management? And how does that compare to the proportion for the 68 male employees?
You can tell prop.table to use marginal totals rather than the grand total when calculating proportions. Given the convention that tables are described as rows x columns, we set the value of the margin argument to 1 for row percentages and 2 for column percentages:
##
## mgmt non-mgmt
## Female 0.07142857 0.92857143
## Male 0.36764706 0.63235294
The 130 female non-managers out of a marginal total of 140 female employees give a proportion of 0.929; that is, 92.9% of females are non-managers. We can make this look a bit more like percentages by multiplying by 100 and rounding the resulting table:
##
## mgmt non-mgmt
## Female 7 93
## Male 37 63
This makes it clear that the proportion of females who are managers in the bank is quite different from the proportion of males who are managers. The sum of the percentages in each row is 100%.
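The row proportions and rounded percentages shown above can be reproduced as follows. This is a sketch that re-enters the counts by hand so it runs on its own; row_props and row_pct are names introduced here for illustration.

```r
# Cell counts copied from the frequency table (matrix() fills column by column).
mytable <- as.table(matrix(
  c(10, 25, 130, 43), nrow = 2,
  dimnames = list(c("Female", "Male"), c("mgmt", "non-mgmt"))
))

# margin = 1 divides each cell by its row total (140 females, 68 males).
row_props <- prop.table(mytable, margin = 1)

# Multiply by 100 and round to get whole-number percentages.
row_pct <- round(100 * row_props)
row_pct

# Each row of proportions sums to 1 (that is, 100%).
rowSums(row_props)
```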
7.3 Contingency Tables
The built-in table function in R is a good start, but it would be a lot of additional work to use it to do full contingency tables and chi-squared tests of independence. Fortunately, we can use third-party packages to perform specialized analyses. For example, the “gmodels” package provides contingency table functionality that is almost identical to that of SAS.
7.3.1 Using gmodels
gmodels is not installed by default, so use RStudio’s Packages tab to install it. The procedure is identical to that used to install the tidyverse packages in the data tutorial.
7.3.2 Using the gmodels CrossTable Function
The gmodels package has a CrossTable function. A brief explanation of the options I have selected:
- As before, the first two arguments are the columns containing categorical data
- expected=TRUE adds the expected cell counts and runs a chi-squared test of independence
- prop.t=FALSE turns off the cell-level proportions. These are just clutter for my purposes.
- prop.c=FALSE turns off column percentages.
- prop.chisq=FALSE suppresses the contribution to the chi-squared statistic in each cell. Normally I like to show this, but I want to keep things simple for now.
# 2-Way Cross Tabulation
library(gmodels)
CrossTable(Bank$Gender, Bank$Mgmt.text, expected=TRUE, prop.t=FALSE, prop.c=FALSE, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Expected N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 208
##
##
## | Bank$Mgmt.text
## Bank$Gender | mgmt | non-mgmt | Row Total |
## -------------|-----------|-----------|-----------|
## Female | 10 | 130 | 140 |
## | 23.558 | 116.442 | |
## | 0.071 | 0.929 | 0.673 |
## -------------|-----------|-----------|-----------|
## Male | 25 | 43 | 68 |
## | 11.442 | 56.558 | |
## | 0.368 | 0.632 | 0.327 |
## -------------|-----------|-----------|-----------|
## Column Total | 35 | 173 | 208 |
## -------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 28.69528 d.f. = 1 p = 8.470998e-08
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 26.61778 d.f. = 1 p = 2.479519e-07
##
##
To walk through this: The legend at the top of the (rather ugly) output tells you what is in each cell:
- The observed joint frequency. The data set contains 10 employees who are both Female and Mgmt.
- The expected joint frequency given the assumption of independence. This tells us that if gender and management status are independent, we should expect 23.558 (roughly 24) employees who are both Female and Mgmt. The fact that the observed frequency is much lower than the expected frequency suggests that the two variables are not independent in the Bank data set.
- The row percentages tell you, for example, what proportion of females (the row) belong to the Mgmt and non-Mgmt classes. In this case, 7.1% of females are managers. If you skip down a row, you see that 37% of males are managers.
- The column percentages (suppressed in this output by prop.c=FALSE) are summed in the other direction. They tell you for a particular column (e.g., managers) what proportion are female and male. Here 29% of managers are female and 71% are male.
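The expected counts and column percentages described in the list above can be checked by hand in base R. The sketch below re-enters the observed counts so that it runs without the Bank data, and uses the standard formula under independence: expected count = row total x column total / grand total. The names observed, expected, and col_props are introduced here for illustration.

```r
# Observed counts copied from the CrossTable output (matrix() fills column by column).
observed <- as.table(matrix(
  c(10, 25, 130, 43), nrow = 2,
  dimnames = list(c("Female", "Male"), c("mgmt", "non-mgmt"))
))

# Expected count for each cell if the two variables were independent.
n <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n
round(expected, 3)   # Female/mgmt is 140 * 35 / 208

# margin = 2 divides each cell by its column total, giving column proportions.
col_props <- prop.table(observed, margin = 2)
round(100 * col_props)
```

Comparing observed with expected cell by cell shows where the departure from independence is concentrated.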
You can eyeball this all you want, but the real question is whether there is statistical evidence that Gender and Management Status are NOT independent. The chi-squared independence test at the bottom of the output provides the answer.
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 28.7 d.f. = 1 p = 8.471e-08
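The p-value is very small (p = 8.471e-08, or roughly 0.00000008). This means that if gender and managerial status were truly independent, a table that departs this far from the expected frequencies would almost never occur by chance. We therefore reject the hypothesis of independence and conclude that the two variables are dependent (associated) in some way. This is a gap.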
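As a cross-check, the same test can be run without gmodels by using base R's chisq.test on the observed counts. This is a sketch with the counts re-entered by hand; the names observed, pearson, and yates are introduced here for illustration. For a 2 x 2 table, chisq.test applies Yates' continuity correction by default, so correct = FALSE is needed to get the plain Pearson statistic.

```r
# Observed counts copied from the CrossTable output (matrix() fills column by column).
observed <- matrix(
  c(10, 25, 130, 43), nrow = 2,
  dimnames = list(c("Female", "Male"), c("mgmt", "non-mgmt"))
)

# Plain Pearson chi-squared test (no continuity correction).
pearson <- chisq.test(observed, correct = FALSE)
pearson$statistic
pearson$p.value

# Default behaviour for a 2 x 2 table: Yates' continuity correction.
yates <- chisq.test(observed)
yates$statistic
yates$p.value
```

Both versions lead to the same conclusion here; the correction only makes the test slightly more conservative.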