1. Importing Data¶
One of the most magical things about R is its ability to handle tabular data using the data frame object (or tibble in the Tidyverse). The pandas library gives Python much the same functionality.
Recall that a data frame is an object consisting of rows and columns (just like a spreadsheet or database table). In addition to storing the data, the data frame object also has useful properties and methods that can be accessed through code. Thus, the first step in data analysis is to load data from a file into a data frame object.
1.1. Loading the Pandas library¶
A Python library contains things (objects, code) that are not part of the core Python language but are nonetheless useful to some community of users. Libraries save us from re-inventing the wheel: once someone has created a library and made it available for download, we can use the contents of the library in our own Python programs.
For example, to use the pandas.DataFrame object in our programs, we must first import the Pandas library into our environment.
Start by creating a new notebook in Jupyter and entering the following in the first cell:
import pandas
It is now possible to reference pandas.DataFrame directly in my scripts, and Jupyter will know what I am talking about. Note, however, that programmers are lazy, and typing the word “pandas” over and over again is seen as an unnecessary burden. As such, programmers typically alias the library when they import it:
import pandas as pd
Here, “pd” is the alias for “pandas”, so I can write pd.DataFrame in my code rather than pandas.DataFrame. Not sure what I will do with all the time I save.
Finally, if I plan to use the DataFrame object frequently in my code, I can call it out by name using the from keyword:
from pandas import DataFrame
Now I can type DataFrame rather than pd.DataFrame.
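The three import styles can be placed side by side; whichever spelling you choose, the names all point at the same class (a quick sketch, assuming Pandas is installed):

```python
import pandas                     # full name: pandas.DataFrame
import pandas as pd               # alias: pd.DataFrame
from pandas import DataFrame      # bare name: DataFrame

# All three names are bound to the same class object.
print(pandas.DataFrame is pd.DataFrame is DataFrame)  # True
```

The choice is purely about readability and typing effort; it has no effect on behavior.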
1.2. Creating a DataFrame object¶
Python is an object-oriented programming language. You create an instance of an object by calling its constructor, typically with some arguments.
To illustrate, let’s create a data frame called “bank” that contains data from a CSV file stored on my computer in the “Data” folder. To do this, we call the Pandas read_csv method:
bank = pd.read_csv('Data/Bank.csv')
bank.head()
|   | Employee | EducLev | JobGrade | YrHired | YrBorn | Gender | YrsPrior | PCJob | Salary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 1 | 92 | 69 | Male | 1 | No | 32.0 |
| 1 | 2 | 1 | 1 | 81 | 57 | Female | 1 | No | 39.1 |
| 2 | 3 | 1 | 1 | 83 | 60 | Female | 0 | No | 33.2 |
| 3 | 4 | 2 | 1 | 87 | 55 | Female | 7 | No | 30.6 |
| 4 | 5 | 3 | 1 | 92 | 67 | Male | 0 | No | 29.0 |
Note that the head(n) method can be called on the new DataFrame bank. It simply prints out the first n rows of the data frame so you can check whether the data imported okay.
- If practical, put the data file in the same folder as your Jupyter notebook. In the example above, I created a subfolder called "Data" and put the bank data there. This makes it easier for you to remember (and type) the path to the data file.
- If your data must reside somewhere else on your computer or network: Be very careful when typing. Cut and paste the path from Windows Explorer if required.
- Beware the slashes/backslashes. Unix-like operating systems separate directory names with a slash "/" whereas Microsoft Windows uses a backslash "\". In general, Python separates folders with forward slashes. The backslash is used in Python as the 'escape' character. If you cut and paste from Windows Explorer, you must either change all the backslashes to slashes or add the "r" prefix to your pathname to indicate a "raw" string of text (Python will ignore the special meaning of backslashes). So either of the following should work for a more complex file location:
pd.read_csv('C:/Users/Michael Brydon/Data/Bank.csv') — all forward slashes
pd.read_csv(r'C:\Users\Michael Brydon\Data/Bank.csv') — mixture of slashes and backslashes with the "r" prefix
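A third way to sidestep the slash problem entirely is Python’s built-in pathlib module, which assembles paths using the correct separator for whatever operating system the code happens to run on. A minimal sketch, using the same 'Data/Bank.csv' location as above:

```python
from pathlib import Path

# Build the path piece by piece; pathlib inserts the right separator
# ("/" on Unix-like systems, "\" on Windows) automatically.
data_file = Path('Data') / 'Bank.csv'
print(data_file.name)    # Bank.csv

# read_csv accepts Path objects directly:
# bank = pd.read_csv(data_file)
```

This also makes notebooks more portable between Windows and Mac/Linux machines.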
1.3. Creating a DataFrame object from an Excel file¶
The Pandas library supports importing from many different data sources, including CSV, Excel, and JSON files, and even direct database access over a network.
Here we use read_excel instead of read_csv. It is the same data, but in a different file format on my computer:
bankxl = pd.read_excel('Data/Bank.xlsx')
bankxl.head()
|   | Employee | EducLev | JobGrade | YrHired | YrBorn | Gender | YrsPrior | PCJob | Salary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 1 | 92 | 69 | Male | 1 | No | 32.0 |
| 1 | 2 | 1 | 1 | 81 | 57 | Female | 1 | No | 39.1 |
| 2 | 3 | 1 | 1 | 83 | 60 | Female | 0 | No | 33.2 |
| 3 | 4 | 2 | 1 | 87 | 55 | Female | 7 | No | 30.6 |
| 4 | 5 | 3 | 1 | 92 | 67 | Male | 0 | No | 29.0 |
Running head() after importing is a good idea just to make sure you imported what you think you imported. If not, you will have to either:
- edit your source file (Excel) to make sure it has the necessary tabular structure (with the first row as headings, and so on), or
- add arguments to your read_... method to make sure you are starting at the correct row, reading the correct worksheet, and so on.

This is where Googling and learning Python go hand in hand. If worst comes to worst, read the documentation.
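To see how those extra arguments work without a spreadsheet handy, here is a small sketch using read_csv on an in-memory "file" with two junk lines above the real header; read_excel accepts the same skiprows and header arguments, plus sheet_name= to pick a specific worksheet:

```python
import io
import pandas as pd

# A tiny CSV with two junk lines above the real header, mimicking
# a spreadsheet export that starts with a title block.
raw = io.StringIO(
    "Bank salary survey\n"
    "Exported last year\n"
    "Employee,Salary\n"
    "1,32.0\n"
    "2,39.1\n"
)

# skiprows jumps over the junk; header=0 then treats the first
# remaining row as column headings.
df = pd.read_csv(raw, skiprows=2, header=0)
print(df.shape)   # (2, 2)
```

The same approach (adjusting skiprows until head() looks right) usually fixes a messy import faster than editing the source file.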
1.4. Confirming data frame properties¶
As noted previously, objects in Python expose useful properties and methods. For example, we can confirm the size of the bank data frame with the shape property, which gives us the number of rows (208) and columns (9):
bank.shape
(208, 9)
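Because shape is an ordinary (rows, columns) tuple, it can be unpacked into separate variables, which is handy when a script needs the row count on its own. A minimal sketch with a throwaway data frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# shape is a plain tuple, so standard tuple unpacking works.
rows, cols = df.shape
print(rows, cols)   # 3 2
```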
We can use the describe() method to generate some summary statistics:
bank.describe()
|   | Employee | EducLev | JobGrade | YrHired | YrBorn | YrsPrior | Salary |
|---|---|---|---|---|---|---|---|
| count | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 | 208.000000 |
| mean | 104.500000 | 3.158654 | 2.759615 | 85.326923 | 54.605769 | 2.375000 | 39.921923 |
| std | 60.188592 | 1.467464 | 1.566529 | 6.987832 | 10.318988 | 3.135237 | 11.256154 |
| min | 1.000000 | 1.000000 | 1.000000 | 56.000000 | 30.000000 | 0.000000 | 26.700000 |
| 25% | 52.750000 | 2.000000 | 1.000000 | 82.000000 | 47.750000 | 0.000000 | 33.000000 |
| 50% | 104.500000 | 3.000000 | 3.000000 | 87.000000 | 56.500000 | 1.000000 | 37.000000 |
| 75% | 156.250000 | 5.000000 | 4.000000 | 90.000000 | 63.000000 | 4.000000 | 44.000000 |
| max | 208.000000 | 5.000000 | 6.000000 | 93.000000 | 73.000000 | 18.000000 | 97.000000 |
First, notice the difference between a property like shape with no parentheses and a method like describe() with parentheses for passing arguments.
Second, by default, the describe() method only summarizes the numerical columns. Recall that our data frame consists of nine columns. To see the issue, run the info() method:
bank.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 208 entries, 0 to 207
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Employee 208 non-null int64
1 EducLev 208 non-null int64
2 JobGrade 208 non-null int64
3 YrHired 208 non-null int64
4 YrBorn 208 non-null int64
5 Gender 208 non-null object
6 YrsPrior 208 non-null int64
7 PCJob 208 non-null object
8 Salary 208 non-null float64
dtypes: float64(1), int64(6), object(2)
memory usage: 14.8+ KB
Here we see the nine columns in the data frame listed by data type. Six are integer values, one (Salary) is a floating point number, and two (Gender, PCJob) are “Objects” (string objects).
1.5. Converting strings to categories¶
Recall in R that we made a distinction between strings (character data) like a person’s name and factors, which look like strings but have a limited set of repeated values. Color is a factor, for example, because it takes on a few known values like “red”, “green”, and “blue”. Similarly, a column called car_brand would be a factor because it would contain a finite number of string values, such as “Ford”, “Tesla”, and “Ferrari”.
Pandas permits the same distinction, but uses the term category instead of factor. This is telling: R uses the old statistical technical term “factor” whereas Pandas/Python uses the more straightforward term “category”. This is the difference between the two languages in a nutshell.
The process for replacing the two (string) “Object” columns with categories is similar to the one we used in R. The key is understanding how to reference columns in Python. There are two possibilities:
- square bracket notation: bank['Gender']
- dot notation: bank.Gender

Of these two, square bracket notation is slightly more flexible because it permits column names with spaces, e.g., dataframe['column name']. The dot notation version of this would fail because Python has no way of knowing what the space after “column” means: dataframe.column name.
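The equivalence of the two notations (and the limitation of dot notation) can be checked directly; a small sketch with a made-up data frame containing a spaced column name:

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female'],
                   'Years Prior': [1, 0]})

# For simple names, the two notations return the same Series.
print(df['Gender'].equals(df.Gender))   # True

# Only square brackets can handle a name with a space:
print(df['Years Prior'].sum())          # 1
# df.Years Prior                        # SyntaxError
```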
Once we know how to reference a column (or a “Series” in Pandas-speak), we can run the type conversion method and specify “category” as the output data type:
bank['Gender'].astype('category')
0 Male
1 Female
2 Female
3 Female
4 Male
...
203 Male
204 Male
205 Male
206 Male
207 Female
Name: Gender, Length: 208, dtype: category
Categories (2, object): [Female, Male]
Note at the bottom of the summary that the Gender category consists of two values: Female and Male.
Of course, running the command above just lists the converted values; it does not do anything to the “bank” data frame. To replace the existing column in the data frame, we use the assignment operator:
bank['Gender'] = bank['Gender'].astype('category')
bank['PCJob'] = bank['PCJob'].astype('category')
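After the conversion, the column gains a .cat accessor that exposes category-specific properties, which is a quick way to double-check what astype('category') produced. A sketch on a small stand-in for the bank data:

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Female']})
df['Gender'] = df['Gender'].astype('category')

# The .cat accessor works only on category columns.
print(df['Gender'].dtype)                  # category
print(list(df['Gender'].cat.categories))   # ['Female', 'Male']
```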
We can confirm the result by running the describe() method again, but this time using an argument to ensure we get only the categorical columns:
bank.describe(include='category')
|   | Gender | PCJob |
|---|---|---|
| count | 208 | 208 |
| unique | 2 | 2 |
| top | Female | No |
| freq | 140 | 189 |
This output tells us that we have 208 rows of data. Both the “Gender” and “PCJob” columns have two unique categorical values, and the most frequent are “Female” and “No” respectively: 140 of the 208 employees are female, while 189 of the 208 do not have PC jobs.
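describe() reports only the most frequent value per column; to see the full frequency table behind numbers like 140 and 189, value_counts() is the usual tool. Sketched here on a tiny stand-in column:

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['Female'] * 3 + ['Male'] * 2})
df['Gender'] = df['Gender'].astype('category')

# value_counts() lists every category with its frequency,
# not just the top one.
counts = df['Gender'].value_counts()
print(counts['Female'], counts['Male'])   # 3 2
```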
1.6. Type conversion using programming constructs¶
As an aside, it is possible to do the same type conversions en masse in Python using core programming constructs such as lists and loops. Specifically, for a larger data set with many columns, it might make sense to do the following:
- make a list of the columns to convert
- loop through the list, applying the conversion assignment, as above, to each column in turn

A loop is not really worth it in this case (two lines of code using either approach):
for col in ['Gender', 'PCJob']:
    bankxl[col] = bankxl[col].astype('category')
bankxl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 208 entries, 0 to 207
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Employee 208 non-null int64
1 EducLev 208 non-null int64
2 JobGrade 208 non-null int64
3 YrHired 208 non-null int64
4 YrBorn 208 non-null int64
5 Gender 208 non-null category
6 YrsPrior 208 non-null int64
7 PCJob 208 non-null category
8 Salary 208 non-null float64
dtypes: category(2), float64(1), int64(6)
memory usage: 12.1 KB
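For a data set with many string columns, even the list itself can be generated automatically: the select_dtypes() method finds every “object” (string) column, so nothing has to be named by hand. A sketch, using a miniature version of the bank data:

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': [1, 2, 3],
    'Gender': ['Male', 'Female', 'Female'],
    'PCJob': ['No', 'No', 'Yes'],
})

# Convert every string ('object') column to a category in one pass.
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype('category')

converted = df.select_dtypes(include='category').columns.tolist()
print(converted)   # ['Gender', 'PCJob']
```

Be careful with this shortcut on real data: genuinely free-form text columns (names, addresses) would be converted too, so inspect the result with info() afterwards.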