Pandas - Data Frames

Pandas is a library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. Its name is a play on the phrase "Python data analysis" itself.

Series

The first main data type we will learn about for pandas is the series data type.

A series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates the NumPy array from a series, is that a series can have axis labels, meaning it can be indexed by a label, instead of just a number location. It also doesn't need to hold numeric data, it can hold any arbitrary Python Object.

Let's import Pandas and explore the series object through some examples.

import numpy as np
import pandas as pd

Constucting a Series

A series can be created from a list, numpy array, or a dictionary.

my_list = [10,20,30]
pd.Series(data = my_data)   # "data =" isn't necessary to write

Output:
0    10
1    20
2    30
dtype: int64

We could create this same series from a numpy array.

arr = np.array([10,20,30])
pd.Series(arr)

In either case the data [10,20,30] in the series is indexed by 0, 1, 2. We could create our own labels which could be used as indices. Similar to a dictionary.

labels = ['a','b','c']
pd.Series(data = my_list, index = labels)

Output:
a    10
b    20
c    30
dtype: int64

In fact, we could create a series directly from a dictionary.

d = { 'a':10 , 'b':20 , 'c':30 }
pd.Series(d)

Output:
a    10
b    20
c    30
dtype: int64

Pandas can hold a variety of object types as data in a series.

Using an Index

The key to using a series is understanding its index. Pandas makes use of these index names or numbers by allowing for fast look ups of information (works like a hash table or dictionary).

Let's see some examples of how to grab information from a Series. Let us create two series, ser1 and ser2.

ser1 = pd.Series([10,20,30,40],
         index = ['Harry', 'Hermione', 'Ron', 'Neville'])

Output:
Harry       10
Hermione    20
Ron         30
Neville     40
dtype: int64

ser2 = pd.Series([55,65,75,85],
         index = ['Harry', 'Hagrid', 'Hermione', 'Albus'])
ser2

Output:
Harry       55
Hagrid     65
Hermione    75
Albus       85
dtype: int64

Indexing can be used to select specific data.

ser1['Harry']

Output:
55

Note that the original numerical index is still available to use, and moreover we can use slicing:

ser1[0:2]

Output:
Harry       10
Hermione    20
dtype: int64

Fancy indexing is available too.

ser1[['Harry','Ron']]

Output:
Harry    10
Ron      30
dtype: int64

Operations can also be done on series based off the index.

ser1 + ser2

Output:
Albus       NaN
Hagrid      NaN
Harry       65.0
Hermione    95.0
Neville     NaN
Ron         NaN
dtype: float64

Data Frames

A Data Frame in pandas is a table where each column is a series, and all series share the same index. This is best seen in terms of an example.

import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)

df = pd.DataFrame(randn(5,4), ['A','B','C','D','E'],['W','X','Y','Z'])
df

	W	X	Y	Z
A	0.302665	1.693723	-1.706086	-1.159119
B	-0.134841	0.390528	0.166905	0.184502
C	0.807706	0.072960	0.638787	0.329646
D	-0.497104	-0.754070	-0.943406	0.484752
E	-0.116773	1.901755	0.238127	1.996652

Selecting Columns

Each column (W, X, Y, Z) in this data frame is indexed by A, B, C, D, E.

df['W']

Output:
A    0.302665
B   -0.134841
C    0.807706
D   -0.497104
E   -0.116773
Name: W, dtype: float64

We can select multiple columns from this data frame by passing in a list of column names.

df[['W','Z']]

This returns the data frame consisting only of columns W and Z.

Adding and Dropping Columns

A new column can be added by calling the data frame on the new column index, and assigning a list of values.

df['ZZ'] = [1,2,3,4,5]

	W	X	Y	Z	ZZ
A	0.302665	1.693723	-1.706086	-1.159119	1
B	-0.134841	0.390528	0.166905	0.184502	2
C	0.807706	0.072960	0.638787	0.329646	3
D	-0.497104	-0.754070	-0.943406	0.484752	4
E	-0.116773	1.901755	0.238127	1.996652	5

A column can be dropped by using the drop() method. This doesn't delete the column, just drops it from view. In other words, drop() does not affect the data frame in place. By default the drop() method drops the row of a specified index (an axis = 0 object), to drop a column we need to specify axis=1.

df.drop('ZZ', axis=1)  # drops column ZZ

The data frame still has column ZZ, we just dropped it from view. To delete the column from the data frame (i.e. to drop the column in place) we set the inplace argument to true.

df.drop('ZZ', axis=1, inplace=True)  # deletes column ZZ

As another example, we create a new column which is the sum of two existing columns.

df['Y+Z'] = df['Y'] + df['Z']

	W	X	Y	Z	Y+Z
A	0.302665	1.693723	-1.706086	-1.159119	-2.865205
B	-0.134841	0.390528	0.166905	0.184502	0.351406
C	0.807706	0.072960	0.638787	0.329646	0.968433
D	-0.497104	-0.754070	-0.943406	0.484752	-0.458655
E	-0.116773	1.901755	0.238127	1.996652	2.234779

Selecting Rows

Rows can be selected by either using .loc[] with the index name, or .iloc[] with the index number.

Both

df.loc['C']

and

df.iloc[2]  # row C is the row of index 2

return the same output:

Output:
W      0.807706
X      0.072960
Y      0.638787
Z      0.329646
Y+Z    0.968433
Name: C, dtype: float64

Note that a row is also a series indexed by the column names.

Multiple rows can be selected by passing in a list of row indices into .loc[] or .iloc[].

df.loc[['A','C']]

	W	X	Y	Z	Y+Z
A	0.302665	1.693723	-1.706086	-1.159119	-2.865205
C	0.807706	0.072960	0.638787	0.329646	0.968433

Selecting Entries and Submatrices

Similar to how it is done in NumPy we can select entries at specific locations, as well as submatrices (entries in specific rows and columns).

df.loc['A','W']

Output:
0.302665

df.loc[['A','C'],['W','X']]

	W	X
A	0.302665	1.693723
C	0.807706	0.072960

Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy.

First we call up our data frame from last section (we've deleted the column Y+Z).

df

	W	X	Y	Z
A	0.302665	1.693723	-1.706086	-1.159119
B	-0.134841	0.390528	0.166905	0.184502
C	0.807706	0.072960	0.638787	0.329646
D	-0.497104	-0.754070	-0.943406	0.484752
E	-0.116773	1.901755	0.238127	1.996652

We can construct a boolean data frame as follows.

df>0

	W	X	Y	Z
A	True	True	False	False
B	False	True	True	True
C	True	True	True	True
D	False	False	False	True
E	False	True	True	True

This boolean frame can be used to select data from the frame.

df[df>0]

	W	X	Y	Z
A	0.302665	1.693723	NaN	NaN
B	NaN	0.390528	0.166905	0.184502
C	0.807706	0.072960	0.638787	0.329646
D	NaN	NaN	NaN	0.484752
E	NaN	1.901755	0.238127	1.996652

Filtering data by boolean values is most commonly used by filtering out rows when a specified column has a false value. Let's see an example of this.

df['W']>0

Output:
A     True
B    False
C     True
D    False
E    False
Name: W, dtype: bool

Then we use this boolean series to mask the entire data frame, which will only return rows of index $0$ and 2. In other words, we are asking only for those rows in the data frame for which column W has positive values.

df[df['W']>0]

	W	X	Y	Z
A	0.302665	1.693723	-1.706086	-1.159119
C	0.807706	0.072960	0.638787	0.329646

As another example, let's grab all rows of the data frame where column Z has value less than $0$.

df[df['Z']<0]

	W	X	Y	Z
A	0.302665	1.693723	-1.706086	-1.159119

The object returned from this conditional selection is a data frame, so we could in turn use data frame commands on it such as selecting a column. For example, let's look at the data frame where the W values are negative, and ask for columns X and Y of this.

df[df['W']<0][['X','Y']]

	X	Y
B	0.390528	0.166905
D	-0.754070	-0.943406
E	1.901755	0.238127

Conditionals can be joined together using and (&) and or (|) operators.

df[(df['W']>0) & (df['X']>1)]

	W	X	Y	Z
A	0.302665	1.693723	-1.706086	-1.159119

More Indexing

Let's discuss some more features of indexing, including resetting the index or setting it something else. We'll also talk about index hierarchy.

We call up our original data frame from last section.

df

	W	X	Y	Z
A	0.302665	1.693723	-1.706086	-1.159119
B	-0.134841	0.390528	0.166905	0.184502
C	0.807706	0.072960	0.638787	0.329646
D	-0.497104	-0.754070	-0.943406	0.484752
E	-0.116773	1.901755	0.238127	1.996652

To reset the index of the data frame back to $0$ to $n$ we call the reset_index method. By default this does not occur in place (use the inplace=True optional argument to make the change permanent).

df.reset_index()

	index	W	X	Y	Z
0	A	0.302665	1.693723	-1.706086	-1.159119
1	B	-0.134841	0.390528	0.166905	0.184502
2	C	0.807706	0.072960	0.638787	0.329646
3	D	-0.497104	-0.754070	-0.943406	0.484752
4	E	-0.116773	1.901755	0.238127	1.996652

Now we'll add a new column of provinces, which we will convert to an index column.

new_ind = "BC AB SK ON QB".split()  # make a list of Provices
df['Prov'] = new_ind
df

	W	X	Y	Z	Prov
A	0.302665	1.693723	-1.706086	-1.159119	BC
B	-0.134841	0.390528	0.166905	0.184502	AB
C	0.807706	0.072960	0.638787	0.329646	SK
D	-0.497104	-0.754070	-0.943406	0.484752	ON
E	-0.116773	1.901755	0.238127	1.996652	QB

df.set_index('Prov')

	W	X	Y	Z
Prov
BC	0.302665	1.693723	-1.706086	-1.159119
AB	-0.134841	0.390528	0.166905	0.184502
SK	0.807706	0.072960	0.638787	0.329646
ON	-0.497104	-0.754070	-0.943406	0.484752
QB	-0.116773	1.901755	0.238127	1.996652

Notice that reset_index copies the current index into a new column before overwriting it with the numerical index. On the other hand, set_index does not create a copy of the current index, it just overwrites it.

Multi-Index and Index Hierarchy

In this section we give an example of a data frame with a multi-index (or index hierarchy). In future sections this will be created much more naturally from a larger data frame, but for now we will build from scratch.

import numpy as np
import pandas as pd
from numpy.random import randn

outside = ['G1','G1','G1','G2','G2','G2'] # level 0 index
inside = [1,2,3,1,2,3]                    # level 1 index
hier_index = list(zip(outside,inside))    # create list of tuples
hier_index = pd.MultiIndex.from_tuples(hier_index)

df = pd.DataFrame(randn(6,2),hier_index,['A','B'])
df

Graph Graph

We can grab the sub-frame indexed by the outer index 'G1':

df.loc['G1']

Graph Graph

We can also grab rows/columns of this sub-frame by the usual indexing commands:

df.loc['G1'].loc[2]

Output:
A    0.907969
B    0.503826
Name: 2, dtype: float64

We can name the multi-indexes using the index.names methods.

df.index.names = ['Groups','Num']
df

Graph Graph

Cross Section

To capture a cross section of rows or columns from a data frame with multi-level index we can use the xs method. As an example we will capture the rows of the data frame with index Num equal to 1.

df.xs(1,level='Num')

Missing Data

Pandas will automatically fill in missing values with a null or not a number type data object, denoted NaN. In this section we see how to drop or replace 'NaN' values.

import numpy as np
import pandas as pd

d = {'A':[1,2,np.nan],'B':[5,np.nan,np.nan],'C':[1,2,3]}
df = pd.DataFrame(d)

Graph Graph

To drop any rows (axis = 0) or columns (axis=1) use the dropna method.

df.dropna(axis=0)

Graph Graph

df.dropna(axis=1)

Graph Graph

The optional thresh argument can specify how many non-NaN values there must be in order to keep the row. The example below keeps a row if it has at lease 2 non-NaN values.

df.dropna(axis=0,thresh=2)

Graph Graph

Te fill in missing data use the fill method. In the example below we fill in missing values with 0.

df.fillna(value = 0)   # could also use df.fillna(0)

Graph Graph

As another example, let's fill in the third entry in column A with the mean of that column.

df['A'].fillna(value = df['A'].mean())

Output:
0    1.0
1    2.0
2    1.5
Name: A, dtype: float64

Groupby

The groupby method involves some combination of splitting the data frame, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

import numpy as np
import pandas as pd

data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df

Graph Graph

The groupby method creates a new DataFrameGroupBy object.

byComp = df.groupby('Company')
byComp

Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9210b90950>

We can call aggregate functions on this object. For example, sum, mean, median, std, max, min, count. What is returned is a data frame indexed by the groupby column entries.

byComp.mean()

Graph Graph

The count() function counts the number of instances in a column that were grouped together.

byComp.count()

Graph Graph

Typical usage of groupby would be all in one line, for example if we wanted the total sales for FB:

df.groupby('Company').sum().loc['FB']

Output:
Sales    593
Name: FB, dtype: int64

The describe method is a workhorse for a groupby object, it will give you lots of information for the data that is grouped.

df.groupby('Company').describe()

Graph Graph

The data could be reformatted if you prefer to have the companies as columns and the data as rows.

df.groupby('Company').describe().transpose()

Graph Graph

Concatenation, Merging, and Joining

There are 3 main ways of combining DataFrames together: Merging, Joining and Concatenating. In this section we will discuss these 3 methods with examples.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])

df1

Graph Graph

Concatenation

Concatenation basically glues together data frames. Keep in mind that dimensions should match along the axis you are concatenating on. Use pd.concat and pass in a list of data frames to concatenate together:

pd.concat([df1,df2,df3])

Graph Graph

Merge

Merging basically glues together data frames, but matches data based on a key column (e.g. think merging two grade tables together and matching data based on student numbers).

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                       'A': ['A0', 'A1', 'A2', 'A3'],
                       'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']})

pd.merge(left,right,how='inner',on='key')

Graph Graph

Merging can be done on more than one key column.

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                        'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                         'C': ['C0', 'C1', 'C2', 'C3'],
                         'D': ['D0', 'D1', 'D2', 'D3']})

Graph Graph

pd.merge(left, right, on=['key1', 'key2'])

Graph Graph

Join

Joining is a convenient method for combining the columns of two potentially differently-indexed data frames into a single result data frame.

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])

Graph Graph

left.join(right)

Graph Graph

Operations

In this section we cover a variety of operations that can be done on data frames.

import numpy as np
import pandas as pd
``

```py
df = pd.DataFrame({'col1':[1,2,3,4],
        'col2':[444,555,666,444],
        'col3':['abc','def','ghi','xyz']})
df

Graph Graph

The .unique() method for a series will return a numpy array of all the unique values in that series.

df['col2'].unique()

Output:
array([444, 555, 666])

The .nunique() method returns the number of unique entries.

df['col2'].nunique()

Output:
3

The .value_counts() method for a series will return a series which counts the number of times each item appears in the column.

df['col2'].value_counts()

Output:
444    2
555    1
666    1
Name: col2, dtype: int64

Apply

The .apply() method for a series will apply any python function (user defined or built-in) to the values of the series.

def my_funct(x):
    return x*x+1

df['col1'].apply(my_funct)

Output:
0     2
1     5
2    10
3    17
Name: col1, dtype: int64

We could have done this using a lambda function instead.

df['col1'].apply(lambda x: x*x+1)

Output:
0     2
1     5
2    10
3    17
Name: col1, dtype: int64

As another example we apply the length function to the column with string values.

df['col3'].apply(len)

Output:
0    3
1    3
2    3
3    3
Name: col3, dtype: int64

Removing Rows and Columns

To remove a row (axis=0) or column (axis=1) use the drop() method. To permanently remove the row/column use the inplace = True optional argument.

df.drop('col3', axis=1)

Graph Graph

Retrieve Column and Index Names

df.columns

Output:
Index(['col1', 'col2', 'col3'], dtype='object')

df.index

Output:
RangeIndex(start=0, stop=4, step=1)

Here is an example where we change the index from the default range index.

df['key'] = ['A','B','C','D']  # add a new column
df.set_index('key').index  # set new index, then call index method

Output:
Index(['A', 'B', 'C', 'D'], dtype='object', name='key')

Sorting

df.sort_values('col2')

Graph Graph

df.sort_values('col2', ascending=False)

Graph Graph

Finding Null Values

Let's consider a dataframe with NaN values.

df = pd.DataFrame({'col1':[1,2,3,np.nan],
                   'col2':[np.nan,555,666,444],
                   'col3':['abc','def','ghi','xyz']})

Graph Graph

df.isnull()

Graph Graph

We can fill in null values using the fillna() method.

df.fillna('FILL')

Graph Graph

To make this a permanent change to the dataframe use argmument inplace = True.

Pivot Table

A pivot table reorganizes data in an extensive table so that statistics can be summarized. Here we do a small example where we use the pivot_table method to create a data frame with multi-level indices.

data = {'A':['foo','foo','foo','bar','bar','bar'],
     'B':['one','one','two','two','one','one'],
       'C':['x','y','x','y','x','y'],
       'D':[1,3,2,5,4,1]}

df = pd.DataFrame(data)

Graph Graph

df.pivot_table(values='D',index=['A', 'B'],columns=['C'])

Graph Graph

Data Input and Output

In this section we read data from a variety of sources: csv, excel, html, and similarly write data frames to each of these types of objects.

Read/write functionality for csv and excel is built into pandas. For html we require four additional libraries:

sqlalchemy (Python SQL Toolkit and Object Relational Mapper)
lxml (mature binding for the libxml2 and libxslt libraries)
html5lib (html5lib is a pure-python library for parsing HTML)
BeautifulSoup4 (library for scraping from websites)

If you are using Anaconda's distribution of python each library can be installed by:

conda install <name of library>

Note: You may need to install the xlrd library too. You will be prompted if this wasn't automatically installed, and can install it the same way the other four libraries were installed.

Once installed we import them into our notebook as follows.

import sqlalchemy
import lxml
import html5lib
from bs4 import BeautifulSoup

As usual, make sure to import numpy and pandas.

import numpy as np
import pandas as pd

CSV Files

First we read a csv file into a data frame. The csv file is as follows.

Download: df-example.csv

a,b,c,d
0,1,2,3
4,5,6,7
8,9,10,11
12,13,14,15

To read the csv file contents into a data frame we use the read_csv() method.

df = pd.read_csv('df-example')
df

Graph Graph

To write a data frame to a csv file use the to_csv() method on the data frame.

df.to_csv('examplewrite',index=False)

Excel Files

Pandas can read and write excel files, keep in mind, this only imports data, not formulas or images - having images or macros may cause this read_excel method to crash. To import an excel spreadsheet into pandas first notice what sheet your data is in.

Download: df-excel_sample.xlsx

pd.read_excel('df-excel_sample.xlsx',sheet_name='Sheet1')

To write a data frame to an excel spreadsheet use the to_excel() method.

df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

HTML Files

Pandas can read table tags off of an html document. The object returned is a list of data frames, one data frame per table tag found on the page. For example, we grab the tables from Jaaps puzzle page (the third table on the page is the one we want):

url: https://www.jaapsch.net/puzzles/fifteen.htm

df = pd.read_html('https://www.jaapsch.net/puzzles/fifteen.htm')
df[2]

Graph Graph

SQL Files

:: UNDER CONSTRUCTION ::

Exercises

Download for Question 1: df-ex-salaries.csv
Download for Question 2: df-ex-ecommerce_purchases.csv

San Francisco Salaries: In this first exercise we will be using a publicly available data set of San Francisco government employees. The data set was obtained from https://www.kaggle.com/kaggle/sf-salaries.
Complete the tasks outlined in below. The tasks will get more challenging as you go along.

Import pandas as np.
Read df-ex-salaries.csv into a data frame called sal.
Check the head of the data frame.

Use the info() method to find out how many entries there are.

Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Id                148654 non-null  int64  
 1   EmployeeName      148654 non-null  object
 2   JobTitle          148654 non-null  object
 3   BasePay           148045 non-null  float64
 4   OvertimePay       148650 non-null  float64
 5   OtherPay          148650 non-null  float64
 6   Benefits          112491 non-null  float64
 7   TotalPay          148654 non-null  float64
 8   TotalPayBenefits  148654 non-null  float64
 9   Year              148654 non-null  int64  
 10  Notes             0 non-null       float64
 11  Agency            148654 non-null  object
 12  Status            0 non-null       float64
dtypes: float64(8), int64(2), object(3)
memory usage: 14.7+ MB

What is the average BasePay?
Ans: 66325.44884050643
What is the highest amount of OvertimePay in the dataset?
Ans: 245131.88
What is the job title of JOSEPH DRISCOLL? (Note: there is also a lowercase Joseph Driscoll which we don't want.)
```
 24    CAPTAIN, FIRE SUPPRESSION
Name: JobTitle, dtype: object
```

How much does JOSEPH DRISCOLL make (including benefits)?

24    270324.91
Name: TotalPayBenefits, dtype: float64

What is the name of the highest paid person (including benefits)?
Ans: NATHANIEL FORD
What is the name of the lowest paid person (including benefits)?
Ans: Joe Lopez

What is the average (mean) BasePay of all employees per year (2011-2014)?

Year
2011    63595.956517
2012    65436.406857
2013    69630.030216
2014    66564.421924
Name: BasePay, dtype: float64

How many unique job titles are there?
Ans: 2159

What are the top 5 most common jobs?

Transit Operator                7036
Special Nurse                   4389
Registered Nurse                3736
Public Svc Aide-Public Works    2518
Police Officer 3                2421
Name: JobTitle, dtype: int64

How many Job Titles were represented by only one person in 2013? (e.g. Job Titles with only one occurrence in 2013?)
Ans: 202
How many people have the word Chief in their job title?
Ans: 627
Is there a correlation between length of the Job Title string and Salary?

Ecommerce Purchases:
Download: df-ex-ecommerce_purchases.csv
1. Import pandas and read in the Ecommerce Purchases csv file and set it to a DataFrame called ecom.
2. Check the head of the DataFrame.
3. How many rows and columns are there?
4. What is the average Purchase Price?
  Ans: 50.347302
5. What were the highest and lowest purchase prices?
  Ans: 99.99 and 0
6. How many people have English 'en' as their Language of choice on the website?
  Ans: 1098
7. How many people have the job title of "Lawyer"?
  Ans: 30
8. How many people made the purchase during the AM and how many people made the purchase during PM? (Hint: use value_counts().)
  Ans: PM 5068 and AM 4932
9. What are the 5 most common Job Titles?
10. Someone made a purchase that came from Lot: "90 WT" , what was the Purchase Price for this transaction?
  Ans: 75.1
11. What is the email of the person with the following Credit Card Number: 4926535242672853?
  Ans: bondellen@williams-garza.com
12. How many people have American Express as their Credit Card Provider and made a purchase above $95 ?
13. Hard: How many people have a credit card that expires in 2025?
14. Hard: What are the top 5 most popular email providers/hosts (e.g. gmail.com, yahoo.com, etc...)?