Pandas - Data Frames
Pandas is a library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. Its name is a play on the phrase "Python data analysis" itself.
Series
The first main data type we will learn about for pandas is the series data type.
A series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates the NumPy array from a series, is that a series can have axis labels, meaning it can be indexed by a label, instead of just a number location. It also doesn't need to hold numeric data, it can hold any arbitrary Python Object.
Let's import Pandas and explore the series object through some examples.
Constucting a Series
A series can be created from a list, numpy array, or a dictionary.
Output:
0 10
1 20
2 30
dtype: int64
We could create this same series from a numpy array.
In either case the data [10,20,30]
in the series is indexed by 0, 1, 2. We could create our own labels which could be used as indices. Similar to a dictionary.
Output:
a 10
b 20
c 30
dtype: int64
In fact, we could create a series directly from a dictionary.
Output:
a 10
b 20
c 30
dtype: int64
Pandas can hold a variety of object types as data in a series.
Using an Index
The key to using a series is understanding its index. Pandas makes use of these index names or numbers by allowing for fast look ups of information (works like a hash table or dictionary).
Let's see some examples of how to grab information from a Series. Let us create two series, ser1
and ser2
.
Output:
Harry 10
Hermione 20
Ron 30
Neville 40
dtype: int64
Output:
Harry 55
Hagrid 65
Hermione 75
Albus 85
dtype: int64
Indexing can be used to select specific data.
Output:
55
Note that the original numerical index is still available to use, and moreover we can use slicing:
Output:
Harry 10
Hermione 20
dtype: int64
Fancy indexing is available too.
Output:
Harry 10
Ron 30
dtype: int64
Operations can also be done on series based off the index.
Output:
Albus NaN
Hagrid NaN
Harry 65.0
Hermione 95.0
Neville NaN
Ron NaN
dtype: float64
Data Frames
A Data Frame in pandas is a table where each column is a series, and all series share the same index. This is best seen in terms of an example.
W | X | Y | Z | |
---|---|---|---|---|
A | 0.302665 | 1.693723 | -1.706086 | -1.159119 |
B | -0.134841 | 0.390528 | 0.166905 | 0.184502 |
C | 0.807706 | 0.072960 | 0.638787 | 0.329646 |
D | -0.497104 | -0.754070 | -0.943406 | 0.484752 |
E | -0.116773 | 1.901755 | 0.238127 | 1.996652 |
Selecting Columns
Each column (W, X, Y, Z) in this data frame is indexed by A, B, C, D, E.
Output:
A 0.302665
B -0.134841
C 0.807706
D -0.497104
E -0.116773
Name: W, dtype: float64
We can select multiple columns from this data frame by passing in a list of column names.
This returns the data frame consisting only of columns W and Z.
Adding and Dropping Columns
A new column can be added by calling the data frame on the new column index, and assigning a list of values.
W | X | Y | Z | ZZ | |
---|---|---|---|---|---|
A | 0.302665 | 1.693723 | -1.706086 | -1.159119 | 1 |
B | -0.134841 | 0.390528 | 0.166905 | 0.184502 | 2 |
C | 0.807706 | 0.072960 | 0.638787 | 0.329646 | 3 |
D | -0.497104 | -0.754070 | -0.943406 | 0.484752 | 4 |
E | -0.116773 | 1.901755 | 0.238127 | 1.996652 | 5 |
A column can be dropped by using the drop()
method. This doesn't delete the column, just drops it from view. In other words, drop()
does not affect the data frame in place. By default the drop()
method drops the row of a specified index (an axis = 0
object), to drop a column we need to specify axis=1
.
The data frame still has column ZZ, we just dropped it from view. To delete the column from the data frame (i.e. to drop the column in place) we set the inplace
argument to true.
As another example, we create a new column which is the sum of two existing columns.
W | X | Y | Z | Y+Z | |
---|---|---|---|---|---|
A | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -2.865205 |
B | -0.134841 | 0.390528 | 0.166905 | 0.184502 | 0.351406 |
C | 0.807706 | 0.072960 | 0.638787 | 0.329646 | 0.968433 |
D | -0.497104 | -0.754070 | -0.943406 | 0.484752 | -0.458655 |
E | -0.116773 | 1.901755 | 0.238127 | 1.996652 | 2.234779 |
Selecting Rows
Rows can be selected by either using .loc[]
with the index name, or .iloc[]
with the index number.
Both
and return the same output:Output:
W 0.807706
X 0.072960
Y 0.638787
Z 0.329646
Y+Z 0.968433
Name: C, dtype: float64
Note that a row is also a series indexed by the column names.
Multiple rows can be selected by passing in a list of row indices into .loc[]
or .iloc[]
.
W | X | Y | Z | Y+Z | |
---|---|---|---|---|---|
A | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -2.865205 |
C | 0.807706 | 0.072960 | 0.638787 | 0.329646 | 0.968433 |
Selecting Entries and Submatrices
Similar to how it is done in NumPy we can select entries at specific locations, as well as submatrices (entries in specific rows and columns).
Output:
0.302665
W | X | |
---|---|---|
A | 0.302665 | 1.693723 |
C | 0.807706 | 0.072960 |
Conditional Selection
An important feature of pandas is conditional selection using bracket notation, very similar to NumPy.
First we call up our data frame from last section (we've deleted the column Y+Z).
W | X | Y | Z | |
---|---|---|---|---|
A | 0.302665 | 1.693723 | -1.706086 | -1.159119 |
B | -0.134841 | 0.390528 | 0.166905 | 0.184502 |
C | 0.807706 | 0.072960 | 0.638787 | 0.329646 |
D | -0.497104 | -0.754070 | -0.943406 | 0.484752 |
E | -0.116773 | 1.901755 | 0.238127 | 1.996652 |
We can construct a boolean data frame as follows.
W | X | Y | Z | |
---|---|---|---|---|
A | True | True | False | False |
B | False | True | True | True |
C | True | True | True | True |
D | False | False | False | True |
E | False | True | True | True |
This boolean frame can be used to select data from the frame.
W | X | Y | Z | |
---|---|---|---|---|
A | 0.302665 | 1.693723 | NaN | NaN |
B | NaN | 0.390528 | 0.166905 | 0.184502 |
C | 0.807706 | 0.072960 | 0.638787 | 0.329646 |
D | NaN | NaN | NaN | 0.484752 |
E | NaN | 1.901755 | 0.238127 | 1.996652 |
Filtering data by boolean values is most commonly used by filtering out rows when a specified column has a false value. Let's see an example of this.
Output:
A True
B False
C True
D False
E False
Name: W, dtype: bool
Then we use this boolean series to mask the entire data frame, which will only return rows of index \(0\) and 2. In other words, we are asking only for those rows in the data frame for which column W has positive values.
W | X | Y | Z | |
---|---|---|---|---|
A | 0.302665 | 1.693723 | -1.706086 | -1.159119 |
C | 0.807706 | 0.072960 | 0.638787 | 0.329646 |
As another example, let's grab all rows of the data frame where column Z has value less than \(0\).
W | X | Y | Z | |
---|---|---|---|---|
A | 0.302665 | 1.693723 | -1.706086 | -1.159119 |
The object returned from this conditional selection is a data frame, so we could in turn use data frame commands on it such as selecting a column. For example, let's look at the data frame where the W values are negative, and ask for columns X and Y of this.
X | Y | |
---|---|---|
B | 0.390528 | 0.166905 |
D | -0.754070 | -0.943406 |
E | 1.901755 | 0.238127 |
Conditionals can be joined together using and (&
) and or (|
) operators.
W | X | Y | Z | |
---|---|---|---|---|
A | 0.302665 | 1.693723 | -1.706086 | -1.159119 |
More Indexing
Let's discuss some more features of indexing, including resetting the index or setting it something else. We'll also talk about index hierarchy.
We call up our original data frame from last section.
W | X | Y | Z | |
---|---|---|---|---|
A | 0.302665 | 1.693723 | -1.706086 | -1.159119 |
B | -0.134841 | 0.390528 | 0.166905 | 0.184502 |
C | 0.807706 | 0.072960 | 0.638787 | 0.329646 |
D | -0.497104 | -0.754070 | -0.943406 | 0.484752 |
E | -0.116773 | 1.901755 | 0.238127 | 1.996652 |
To reset the index of the data frame back to \(0\) to \(n\) we call the reset_index
method. By default this does not occur in place (use the inplace=True
optional argument to make the change permanent).
index | W | X | Y | Z | |
---|---|---|---|---|---|
0 | A | 0.302665 | 1.693723 | -1.706086 | -1.159119 |
1 | B | -0.134841 | 0.390528 | 0.166905 | 0.184502 |
2 | C | 0.807706 | 0.072960 | 0.638787 | 0.329646 |
3 | D | -0.497104 | -0.754070 | -0.943406 | 0.484752 |
4 | E | -0.116773 | 1.901755 | 0.238127 | 1.996652 |
Now we'll add a new column of provinces, which we will convert to an index column.
W | X | Y | Z | Prov | |
---|---|---|---|---|---|
A | 0.302665 | 1.693723 | -1.706086 | -1.159119 | BC |
B | -0.134841 | 0.390528 | 0.166905 | 0.184502 | AB |
C | 0.807706 | 0.072960 | 0.638787 | 0.329646 | SK |
D | -0.497104 | -0.754070 | -0.943406 | 0.484752 | ON |
E | -0.116773 | 1.901755 | 0.238127 | 1.996652 | QB |
W | X | Y | Z | |
---|---|---|---|---|
Prov | ||||
BC | 0.302665 | 1.693723 | -1.706086 | -1.159119 |
AB | -0.134841 | 0.390528 | 0.166905 | 0.184502 |
SK | 0.807706 | 0.072960 | 0.638787 | 0.329646 |
ON | -0.497104 | -0.754070 | -0.943406 | 0.484752 |
QB | -0.116773 | 1.901755 | 0.238127 | 1.996652 |
Notice that reset_index
copies the current index into a new column before overwriting it with the numerical index. On the other hand, set_index
does not create a copy of the current index, it just overwrites it.
Multi-Index and Index Hierarchy
In this section we give an example of a data frame with a multi-index (or index hierarchy). In future sections this will be created much more naturally from a larger data frame, but for now we will build from scratch.
outside = ['G1','G1','G1','G2','G2','G2'] # level 0 index
inside = [1,2,3,1,2,3] # level 1 index
hier_index = list(zip(outside,inside)) # create list of tuples
hier_index = pd.MultiIndex.from_tuples(hier_index)
We can grab the sub-frame indexed by the outer index 'G1':
We can also grab rows/columns of this sub-frame by the usual indexing commands:
Output:
A 0.907969
B 0.503826
Name: 2, dtype: float64
We can name the multi-indexes using the index.names
methods.
Cross Section
To capture a cross section of rows or columns from a data frame with multi-level index we can use the xs
method. As an example we will capture the rows of the data frame with index Num
equal to 1.
Missing Data
Pandas will automatically fill in missing values with a null or not a number type data object, denoted NaN
. In this section we see how to drop or replace 'NaN' values.
import numpy as np
import pandas as pd
d = {'A':[1,2,np.nan],'B':[5,np.nan,np.nan],'C':[1,2,3]}
df = pd.DataFrame(d)
To drop any rows (axis = 0)
or columns (axis=1
) use the dropna
method.
The optional thresh
argument can specify how many non-NaN values there must be in order to keep the row. The example below keeps a row if it has at lease 2 non-NaN values.
Te fill in missing data use the fill
method. In the example below we fill in missing values with 0.
As another example, let's fill in the third entry in column A with the mean of that column.
Output:
0 1.0
1 2.0
2 1.5
Name: A, dtype: float64
Groupby
The groupby
method involves some combination of splitting the data frame, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df
The groupby
method creates a new DataFrameGroupBy
object.
Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9210b90950>
We can call aggregate functions on this object. For example, sum
, mean
, median
, std
, max
, min
, count
. What is returned is a data frame indexed by the groupby column entries.
The count()
function counts the number of instances in a column that were grouped together.
Typical usage of groupby
would be all in one line, for example if we wanted the total sales for FB:
Output:
Sales 593
Name: FB, dtype: int64
The describe
method is a workhorse for a groupby object, it will give you lots of information for the data that is grouped.
The data could be reformatted if you prefer to have the companies as columns and the data as rows.
Concatenation, Merging, and Joining
There are 3 main ways of combining DataFrames together: Merging, Joining and Concatenating. In this section we will discuss these 3 methods with examples.
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']},
index=[4, 5, 6, 7])
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
'B': ['B8', 'B9', 'B10', 'B11'],
'C': ['C8', 'C9', 'C10', 'C11'],
'D': ['D8', 'D9', 'D10', 'D11']},
index=[8, 9, 10, 11])
Concatenation
Concatenation basically glues together data frames. Keep in mind that dimensions should match along the axis you are concatenating on. Use pd.concat
and pass in a list of data frames to concatenate together:
Merge
Merging basically glues together data frames, but matches data based on a key column (e.g. think merging two grade tables together and matching data based on student numbers).
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
Merging can be done on more than one key column.
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
Join
Joining is a convenient method for combining the columns of two potentially differently-indexed data frames into a single result data frame.
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']},
index=['K0', 'K2', 'K3'])
Operations
In this section we cover a variety of operations that can be done on data frames.
import numpy as np
import pandas as pd
``
```py
df = pd.DataFrame({'col1':[1,2,3,4],
'col2':[444,555,666,444],
'col3':['abc','def','ghi','xyz']})
df
The .unique()
method for a series will return a numpy array of all the unique values in that series.
Output:
array([444, 555, 666])
The .nunique()
method returns the number of unique entries.
Output:
3
The .value_counts()
method for a series will return a series which counts the number of times each item appears in the column.
Output:
444 2
555 1
666 1
Name: col2, dtype: int64
Apply
The .apply()
method for a series will apply any python function (user defined or built-in) to the values of the series.
Output:
0 2
1 5
2 10
3 17
Name: col1, dtype: int64
We could have done this using a lambda function instead.
Output:
0 2
1 5
2 10
3 17
Name: col1, dtype: int64
As another example we apply the length function to the column with string values.
Output:
0 3
1 3
2 3
3 3
Name: col3, dtype: int64
Removing Rows and Columns
To remove a row (axis=0
) or column (axis=1
) use the drop()
method. To permanently remove the row/column use the inplace = True
optional argument.
Retrieve Column and Index Names
Output:
Index(['col1', 'col2', 'col3'], dtype='object')
Output:
RangeIndex(start=0, stop=4, step=1)
Here is an example where we change the index from the default range index.
df['key'] = ['A','B','C','D'] # add a new column
df.set_index('key').index # set new index, then call index method
Output:
Index(['A', 'B', 'C', 'D'], dtype='object', name='key')
Sorting
Finding Null Values
Let's consider a dataframe with NaN
values.
df = pd.DataFrame({'col1':[1,2,3,np.nan],
'col2':[np.nan,555,666,444],
'col3':['abc','def','ghi','xyz']})
We can fill in null values using the fillna()
method.
To make this a permanent change to the dataframe use argmument inplace = True
.
Pivot Table
A pivot table reorganizes data in an extensive table so that statistics can be summarized. Here we do a small example where we use the pivot_table
method to create a data frame with multi-level indices.
data = {'A':['foo','foo','foo','bar','bar','bar'],
'B':['one','one','two','two','one','one'],
'C':['x','y','x','y','x','y'],
'D':[1,3,2,5,4,1]}
df = pd.DataFrame(data)
Data Input and Output
In this section we read data from a variety of sources: csv, excel, html, and similarly write data frames to each of these types of objects.
Read/write functionality for csv and excel is built into pandas. For html we require four additional libraries:
- sqlalchemy (Python SQL Toolkit and Object Relational Mapper)
- lxml (mature binding for the libxml2 and libxslt libraries)
- html5lib (html5lib is a pure-python library for parsing HTML)
- BeautifulSoup4 (library for scraping from websites)
If you are using Anaconda's distribution of python each library can be installed by:
conda install <name of library>
Note: You may need to install the xlrd
library too. You will be prompted if this wasn't automatically installed, and can install it the same way the other four libraries were installed.
Once installed we import them into our notebook as follows.
As usual, make sure to import numpy and pandas.
CSV Files
First we read a csv file into a data frame. The csv file is as follows.
Download: df-example.csv
a,b,c,d
0,1,2,3
4,5,6,7
8,9,10,11
12,13,14,15
To read the csv file contents into a data frame we use the read_csv()
method.
To write a data frame to a csv file use the to_csv()
method on the data frame.
Excel Files
Pandas can read and write excel files, keep in mind, this only imports data, not formulas or images - having images or macros may cause this read_excel
method to crash. To import an excel spreadsheet into pandas first notice what sheet your data is in.
Download: df-excel_sample.xlsx
To write a data frame to an excel spreadsheet use the to_excel()
method.
HTML Files
Pandas can read table tags off of an html document. The object returned is a list of data frames, one data frame per table tag found on the page. For example, we grab the tables from Jaaps puzzle page (the third table on the page is the one we want):
url: https://www.jaapsch.net/puzzles/fifteen.htm
SQL Files
:: UNDER CONSTRUCTION ::
Exercises
Download for Question 1: df-ex-salaries.csv
Download for Question 2: df-ex-ecommerce_purchases.csv
-
San Francisco Salaries: In this first exercise we will be using a publicly available data set of San Francisco government employees. The data set was obtained from https://www.kaggle.com/kaggle/sf-salaries.
Complete the tasks outlined in below. The tasks will get more challenging as you go along.-
Import pandas as np.
-
Read
df-ex-salaries.csv
into a data frame calledsal
. -
Check the head of the data frame.
-
Use the
info()
method to find out how many entries there are.Output <class 'pandas.core.frame.DataFrame'> RangeIndex: 148654 entries, 0 to 148653 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 148654 non-null int64 1 EmployeeName 148654 non-null object 2 JobTitle 148654 non-null object 3 BasePay 148045 non-null float64 4 OvertimePay 148650 non-null float64 5 OtherPay 148650 non-null float64 6 Benefits 112491 non-null float64 7 TotalPay 148654 non-null float64 8 TotalPayBenefits 148654 non-null float64 9 Year 148654 non-null int64 10 Notes 0 non-null float64 11 Agency 148654 non-null object 12 Status 0 non-null float64 dtypes: float64(8), int64(2), object(3) memory usage: 14.7+ MB
-
What is the average
BasePay
?
Ans:66325.44884050643
-
What is the highest amount of OvertimePay in the dataset?
Ans:245131.88
-
What is the job title of JOSEPH DRISCOLL? (Note: there is also a lowercase Joseph Driscoll which we don't want.)
-
How much does JOSEPH DRISCOLL make (including benefits)?
-
What is the name of the highest paid person (including benefits)?
Ans:NATHANIEL FORD
-
What is the name of the lowest paid person (including benefits)?
Ans:Joe Lopez
-
What is the average (mean) BasePay of all employees per year (2011-2014)?
-
How many unique job titles are there?
Ans:2159
-
What are the top 5 most common jobs?
-
How many Job Titles were represented by only one person in 2013? (e.g. Job Titles with only one occurrence in 2013?)
Ans:202
-
How many people have the word Chief in their job title?
Ans:627
-
Is there a correlation between length of the Job Title string and Salary?
-
-
Ecommerce Purchases:
Download: df-ex-ecommerce_purchases.csv- Import pandas and read in the Ecommerce Purchases csv file and set it to a DataFrame called
ecom
. - Check the head of the DataFrame.
- How many rows and columns are there?
- What is the average Purchase Price?
Ans:50.347302
- What were the highest and lowest purchase prices?
Ans:99.99
and0
- How many people have English 'en' as their Language of choice on the website?
Ans:1098
- How many people have the job title of "Lawyer"?
Ans:30
- How many people made the purchase during the AM and how many people made the purchase during PM?
(Hint: use
value_counts()
.)
Ans:PM 5068
andAM 4932
- What are the 5 most common Job Titles?
- Someone made a purchase that came from Lot: "90 WT" , what was the Purchase Price for this transaction?
Ans:75.1
- What is the email of the person with the following Credit Card Number: 4926535242672853?
Ans: bondellen@williams-garza.com - How many people have American Express as their Credit Card Provider and made a purchase above $95 ?
- Hard: How many people have a credit card that expires in 2025?
- Hard: What are the top 5 most popular email providers/hosts (e.g. gmail.com, yahoo.com, etc...)?
- Import pandas and read in the Ecommerce Purchases csv file and set it to a DataFrame called