4. Recoding Data

4.1. Preliminaries

I include the data import and library import commands at the start of each lesson so that the lessons are self-contained.

import pandas as pd
bank = pd.read_csv('Data/Bank.csv')

4.2. Appending a column

As in R, we can add a column (“Series”) to our Pandas data frame. In the code below, I add a new column called “Dummy” and set every value in the series to zero.

bank['Dummy'] = 0
bank.head()
Employee EducLev JobGrade YrHired YrBorn Gender YrsPrior PCJob Salary Dummy
0 1 3 1 92 69 Male 1 No 32.0 0
1 2 1 1 81 57 Female 1 No 39.1 0
2 3 1 1 83 60 Female 0 No 33.2 0
3 4 2 1 87 55 Female 7 No 30.6 0
4 5 3 1 92 67 Male 0 No 29.0 0

Setting all values of “Dummy” to a constant value is not very useful, so I can drop the column using the drop method. Like many functions in Pandas, drop requires an axis argument (where 0=row and 1=column). The inplace = True argument is also common in Pandas: it is equivalent to bank = bank.drop(...). That is, it ensures the changes are not part of a new data frame but are written back to the original data frame.

bank.drop('Dummy', axis=1, inplace=True)
bank.head()
Employee EducLev JobGrade YrHired YrBorn Gender YrsPrior PCJob Salary
0 1 3 1 92 69 Male 1 No 32.0
1 2 1 1 81 57 Female 1 No 39.1
2 3 1 1 83 60 Female 0 No 33.2
3 4 2 1 87 55 Female 7 No 30.6
4 5 3 1 92 67 Male 0 No 29.0

4.3. Recoding using the ternary operator

Recoding is easy in R because R naturally manages arrays and vectors. Based on our experience with R, we might expect the following expression to work. The core of the expression is Python’s inline if statement (or ternary operator), which takes the form: <return value if true> if <logical expression> else <return value if false>

To remap “Female” and “Male” to 1 and 0, we might think we could use the following ternary operator: 1 if bank['Gender'] == "Female" else 0

Unfortunately, although this approach works magically in R, it does not work in Python. This is because the ternary operator does not work on the entire bank[‘Gender’] series. Of course, we have some alternatives.

4.3.1. The numpy where method

Numpy is another useful Python library (which means it has to be imported before it is used in our code). Its where() method works the same as the ternary operator, except it works with arrays of data: where(<logical condition>, <value if true>, <value if false>)

import numpy as np
bank['GenderDummy_F'] = np.where(bank['Gender'] == "Female", 1, 0)
bank.head()
Employee EducLev JobGrade YrHired YrBorn Gender YrsPrior PCJob Salary GenderDummy_F
0 1 3 1 92 69 Male 1 No 32.0 0
1 2 1 1 81 57 Female 1 No 39.1 1
2 3 1 1 83 60 Female 0 No 33.2 1
3 4 2 1 87 55 Female 7 No 30.6 1
4 5 3 1 92 67 Male 0 No 29.0 0

4.3.2. Applying a function

Pandas has a special method called apply() which applies an expression to each element of the Series object. Which expression? The easiest way to see how this works is to start with a parameterized function that implements the if/then logic. What follows is a standard function declaration in Python. The code defines a new function called “my_recode” which takes a single parameter “gender”. The function returns a 1 or 0 depending on the value passed to it:

def my_recode(gender):
    if gender == "Female":
        return 1
    else:
        return 0

Once defined, we can call the function anywhere within our notebook. The code below tests the function over the expect range of inputs. We see that we get 1 and 0 in response, as expected:

my_recode("Female"), my_recode("Male")
(1, 0)

Now we can use the Pandas apply() method to call the function for each value of the “Gender” column:

bank['GenderDummy_F'] = bank['Gender'].apply(my_recode)
bank.head()
Employee EducLev JobGrade YrHired YrBorn Gender YrsPrior PCJob Salary GenderDummy_F
0 1 3 1 92 69 Male 1 No 32.0 0
1 2 1 1 81 57 Female 1 No 39.1 1
2 3 1 1 83 60 Female 0 No 33.2 1
3 4 2 1 87 55 Female 7 No 30.6 1
4 5 3 1 92 67 Male 0 No 29.0 0

4.3.3. Applying a lambda function

A slightly more elegant approach is to apply a lambda function in Python. A lambda function is simply a short, anonymous (unnamed), inline function. It saves us from having to define a separate function (as we did with my_recode). In addition, the lambda function makes the argument (in this case x) explicit. The explicit, non-array argument allows us to use the ternary operator:

bank['GenderDummy_F'] = bank['Gender'].apply(lambda x: 1 if x == "Female" else 0)
bank.head()
Employee EducLev JobGrade YrHired YrBorn Gender YrsPrior PCJob Salary GenderDummy_F
0 1 3 1 92 69 Male 1 No 32.0 0
1 2 1 1 81 57 Female 1 No 39.1 1
2 3 1 1 83 60 Female 0 No 33.2 1
3 4 2 1 87 55 Female 7 No 30.6 1
4 5 3 1 92 67 Male 0 No 29.0 0

The obvious advantage with the apply() method is that the function (be it explicitly named or lambda) can be arbitrarily complex.

4.4. Replacing values from a list

Pandas has a replace() method that can take lists. For example, we could create a list of job grades (1-6) and a corresponding list of “managerial status” for each of the job grades. Thus, when the replace() method sees a job grade of 1, it replaces it with the corresponding value in the other list.

grades = [1,2,3,4,5,6]
status = ["non-mgmt", "non-mgmt", "non-mgmt", "non-mgmt", "mgmt", "mgmt"]

bank['Manager'] = bank['JobGrade'].replace(grades, status)
bank[170:175]
Employee EducLev JobGrade YrHired YrBorn Gender YrsPrior PCJob Salary GenderDummy_F Manager
170 171 2 4 79 42 Female 1 No 45.5 1 non-mgmt
171 172 3 4 84 58 Female 0 No 44.5 1 non-mgmt
172 173 2 4 82 55 Female 2 No 51.2 1 non-mgmt
173 174 5 5 88 61 Male 0 No 47.5 0 mgmt
174 175 5 5 87 58 Female 0 No 44.5 1 mgmt

Here I create a list of six job grades and six managerial statuses (the lists have to be the same length and the i th job grade has to correspond to the i th managerial status). Since the inline = True argument is not passed to replace(), no change is made to the underlying “Job Grade” column. Instead, I assign the output of the replace() method to a new column called “Manager”.

Instead of calling head() (or tail()) to preview the results, I use Python’s slice to show rows 170-175. This gives me a sample of managerial and non-managerial employees.

Of course, it doesn’t take much imagination to see how the replace() function could be used to create dummy variables. Returning to the “Gender” example:

genders=["Female", "Male"]
dummy_vars=[1,0]

bank['GenderDummy_F'] = bank['Gender'].replace(genders, dummy_vars)
bank.head()
Employee EducLev JobGrade YrHired YrBorn Gender YrsPrior PCJob Salary GenderDummy_F Manager
0 1 3 1 92 69 Male 1 No 32.0 0 non-mgmt
1 2 1 1 81 57 Female 1 No 39.1 1 non-mgmt
2 3 1 1 83 60 Female 0 No 33.2 1 non-mgmt
3 4 2 1 87 55 Female 7 No 30.6 1 non-mgmt
4 5 3 1 92 67 Male 0 No 29.0 0 non-mgmt

4.5. Logging variables

As we have seen, we occasionally want to transform a numerical column in order to increase the linearity of out models. For this, we can use the numpy log() function, which returns the natural (base \(e\)) logarithm:

bank['logSalary'] = np.log(bank['Salary'])

If we want, we can plot the results. In this case, a log transform does not really improve the normality of the salary data. The underlying issue appears to be bimodality—there are actually two salary distributions: workers and managers.

import seaborn as sns
sns.kdeplot(x=bank['logSalary'], shade=True, linewidth=2);
_images/05_recode_25_0.png