4. Recoding Data¶
4.1. Preliminaries¶
I include the data import and library import commands at the start of each lesson so that the lessons are self-contained.
import pandas as pd
bank = pd.read_csv('Data/Bank.csv')
4.2. Appending a column¶
As in R, we can add a column (“Series”) to our Pandas data frame. In the code below, I add a new column called “Dummy” and set every value in the series to zero.
bank['Dummy'] = 0
bank.head()
Employee | EducLev | JobGrade | YrHired | YrBorn | Gender | YrsPrior | PCJob | Salary | Dummy | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 3 | 1 | 92 | 69 | Male | 1 | No | 32.0 | 0 |
1 | 2 | 1 | 1 | 81 | 57 | Female | 1 | No | 39.1 | 0 |
2 | 3 | 1 | 1 | 83 | 60 | Female | 0 | No | 33.2 | 0 |
3 | 4 | 2 | 1 | 87 | 55 | Female | 7 | No | 30.6 | 0 |
4 | 5 | 3 | 1 | 92 | 67 | Male | 0 | No | 29.0 | 0 |
Setting all values of “Dummy” to a constant value is not very useful, so I can drop the column using the drop
method. Like many functions in Pandas, drop requires an axis
argument (where 0=row and 1=column). The inplace = True
argument is also common in Pandas: it is equivalent to bank = bank.drop(...)
. That is, it ensures the changes are not part of a new data frame but are written back to the original data frame.
bank.drop('Dummy', axis=1, inplace=True)
bank.head()
Employee | EducLev | JobGrade | YrHired | YrBorn | Gender | YrsPrior | PCJob | Salary | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 3 | 1 | 92 | 69 | Male | 1 | No | 32.0 |
1 | 2 | 1 | 1 | 81 | 57 | Female | 1 | No | 39.1 |
2 | 3 | 1 | 1 | 83 | 60 | Female | 0 | No | 33.2 |
3 | 4 | 2 | 1 | 87 | 55 | Female | 7 | No | 30.6 |
4 | 5 | 3 | 1 | 92 | 67 | Male | 0 | No | 29.0 |
4.3. Recoding using the ternary operator¶
Recoding is easy in R because R naturally manages arrays and vectors. Based on our experience with R, we might expect the following expression to work. The core of the expression is Python’s inline if
statement (or ternary operator), which takes the form:
<return value if true> if <logical expression> else <return value if false>
To remap “Female” and “Male” to 1 and 0, we might think we could use the following ternary operator:
1 if bank['Gender'] == "Female" else 0
Unfortunately, although this approach works magically in R, it does not work in Python. This is because the ternary operator does not work on the entire bank[‘Gender’] series. Of course, we have some alternatives.
4.3.1. The numpy where method¶
Numpy is another useful Python library (which means it has to be imported before it is used in our code). Its where()
method works the same as the ternary operator, except it works with arrays of data:
where(<logical condition>, <value if true>, <value if false>)
import numpy as np
bank['GenderDummy_F'] = np.where(bank['Gender'] == "Female", 1, 0)
bank.head()
Employee | EducLev | JobGrade | YrHired | YrBorn | Gender | YrsPrior | PCJob | Salary | GenderDummy_F | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 3 | 1 | 92 | 69 | Male | 1 | No | 32.0 | 0 |
1 | 2 | 1 | 1 | 81 | 57 | Female | 1 | No | 39.1 | 1 |
2 | 3 | 1 | 1 | 83 | 60 | Female | 0 | No | 33.2 | 1 |
3 | 4 | 2 | 1 | 87 | 55 | Female | 7 | No | 30.6 | 1 |
4 | 5 | 3 | 1 | 92 | 67 | Male | 0 | No | 29.0 | 0 |
4.3.2. Applying a function¶
Pandas has a special method called apply()
which applies an expression to each element of the Series object. Which expression? The easiest way to see how this works is to start with a parameterized function that implements the if/then logic. What follows is a standard function declaration in Python. The code defines a new function called “my_recode” which takes a single parameter “gender”. The function returns a 1 or 0 depending on the value passed to it:
def my_recode(gender):
if gender == "Female":
return 1
else:
return 0
Once defined, we can call the function anywhere within our notebook. The code below tests the function over the expect range of inputs. We see that we get 1 and 0 in response, as expected:
my_recode("Female"), my_recode("Male")
(1, 0)
Now we can use the Pandas apply()
method to call the function for each value of the “Gender” column:
bank['GenderDummy_F'] = bank['Gender'].apply(my_recode)
bank.head()
Employee | EducLev | JobGrade | YrHired | YrBorn | Gender | YrsPrior | PCJob | Salary | GenderDummy_F | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 3 | 1 | 92 | 69 | Male | 1 | No | 32.0 | 0 |
1 | 2 | 1 | 1 | 81 | 57 | Female | 1 | No | 39.1 | 1 |
2 | 3 | 1 | 1 | 83 | 60 | Female | 0 | No | 33.2 | 1 |
3 | 4 | 2 | 1 | 87 | 55 | Female | 7 | No | 30.6 | 1 |
4 | 5 | 3 | 1 | 92 | 67 | Male | 0 | No | 29.0 | 0 |
4.3.3. Applying a lambda function¶
A slightly more elegant approach is to apply a lambda function in Python. A lambda function is simply a short, anonymous (unnamed), inline function. It saves us from having to define a separate function (as we did with my_recode
). In addition, the lambda function makes the argument (in this case x) explicit. The explicit, non-array argument allows us to use the ternary operator:
bank['GenderDummy_F'] = bank['Gender'].apply(lambda x: 1 if x == "Female" else 0)
bank.head()
Employee | EducLev | JobGrade | YrHired | YrBorn | Gender | YrsPrior | PCJob | Salary | GenderDummy_F | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 3 | 1 | 92 | 69 | Male | 1 | No | 32.0 | 0 |
1 | 2 | 1 | 1 | 81 | 57 | Female | 1 | No | 39.1 | 1 |
2 | 3 | 1 | 1 | 83 | 60 | Female | 0 | No | 33.2 | 1 |
3 | 4 | 2 | 1 | 87 | 55 | Female | 7 | No | 30.6 | 1 |
4 | 5 | 3 | 1 | 92 | 67 | Male | 0 | No | 29.0 | 0 |
The obvious advantage with the apply()
method is that the function (be it explicitly named or lambda) can be arbitrarily complex.
4.4. Replacing values from a list¶
Pandas has a replace()
method that can take lists. For example, we could create a list of job grades (1-6) and a corresponding list of “managerial status” for each of the job grades. Thus, when the replace()
method sees a job grade of 1, it replaces it with the corresponding value in the other list.
grades = [1,2,3,4,5,6]
status = ["non-mgmt", "non-mgmt", "non-mgmt", "non-mgmt", "mgmt", "mgmt"]
bank['Manager'] = bank['JobGrade'].replace(grades, status)
bank[170:175]
Employee | EducLev | JobGrade | YrHired | YrBorn | Gender | YrsPrior | PCJob | Salary | GenderDummy_F | Manager | |
---|---|---|---|---|---|---|---|---|---|---|---|
170 | 171 | 2 | 4 | 79 | 42 | Female | 1 | No | 45.5 | 1 | non-mgmt |
171 | 172 | 3 | 4 | 84 | 58 | Female | 0 | No | 44.5 | 1 | non-mgmt |
172 | 173 | 2 | 4 | 82 | 55 | Female | 2 | No | 51.2 | 1 | non-mgmt |
173 | 174 | 5 | 5 | 88 | 61 | Male | 0 | No | 47.5 | 0 | mgmt |
174 | 175 | 5 | 5 | 87 | 58 | Female | 0 | No | 44.5 | 1 | mgmt |
Here I create a list of six job grades and six managerial statuses (the lists have to be the same length and the i th job grade has to correspond to the i th managerial status). Since the inline = True
argument is not passed to replace()
, no change is made to the underlying “Job Grade” column. Instead, I assign the output of the replace()
method to a new column called “Manager”.
Instead of calling head()
(or tail()
) to preview the results, I use Python’s slice to show rows 170-175. This gives me a sample of managerial and non-managerial employees.
Of course, it doesn’t take much imagination to see how the replace()
function could be used to create dummy variables. Returning to the “Gender” example:
genders=["Female", "Male"]
dummy_vars=[1,0]
bank['GenderDummy_F'] = bank['Gender'].replace(genders, dummy_vars)
bank.head()
Employee | EducLev | JobGrade | YrHired | YrBorn | Gender | YrsPrior | PCJob | Salary | GenderDummy_F | Manager | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 3 | 1 | 92 | 69 | Male | 1 | No | 32.0 | 0 | non-mgmt |
1 | 2 | 1 | 1 | 81 | 57 | Female | 1 | No | 39.1 | 1 | non-mgmt |
2 | 3 | 1 | 1 | 83 | 60 | Female | 0 | No | 33.2 | 1 | non-mgmt |
3 | 4 | 2 | 1 | 87 | 55 | Female | 7 | No | 30.6 | 1 | non-mgmt |
4 | 5 | 3 | 1 | 92 | 67 | Male | 0 | No | 29.0 | 0 | non-mgmt |
4.5. Logging variables¶
As we have seen, we occasionally want to transform a numerical column in order to increase the linearity of out models. For this, we can use the numpy log()
function, which returns the natural (base \(e\)) logarithm:
bank['logSalary'] = np.log(bank['Salary'])
If we want, we can plot the results. In this case, a log transform does not really improve the normality of the salary data. The underlying issue appears to be bimodality—there are actually two salary distributions: workers and managers.
import seaborn as sns
sns.kdeplot(x=bank['logSalary'], shade=True, linewidth=2);