{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Importing Data\n", "One of the most magical things about R is its ability to handle matrices of data using the data frame object (or tibble in the Tidyverse). The `pandas` library gives Python much the same functionality.\n", "\n", "Recall that a data frame is an object consisting of rows and columns (just like a spreadsheet or database table). In addition to storing the data, the data frame object also has useful properties and methods that can be accessed through code. Thus, the first step in data analysis is load data from a file into a data frame object.\n", "\n", "## Loading the Pandas library\n", "A Python library contains things (objects, code) that is not part of the core Python language but is nonetheless useful to some community of users. Libraries save us from re-inventing the wheel: Once someone has created a library and made it available for download, we can use the contents of the library in our own Python programs.\n", "\n", "For example, to use the `pandas.DataFrame` object in our programs, we must first import the Pandas library into our environment.\n", "\n", "Start by creating a new notebook in Jupyter and moving to the first cell:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "import pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is now possible to directly reference `pandas.DataFrame` in my scripts and Jupyter will know what I am talking about. Note, however, that programmers are lazy, and typing the word \"pandas\" over and over again is seen as an unnecessary burden. As such, programmers typically _alias_ the library when they import it:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, \"pd\" is the alias for \"pandas\". 
So I can write `pd.DataFrame` in my code rather than `pandas.DataFrame`. Not sure what I will do with all the time I save.\n", "\n", "Finally, if I plan to use the DataFrame object frequently in my code, I can call it out by name using the `from` keyword:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "from pandas import DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now I can type `DataFrame` rather than `pd.DataFrame`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a DataFrame object\n", "Python is an object-oriented programming language. You create an _instance_ of an _object_ by calling its _constructor_, typically with some arguments.\n", "\n", "To illustrate, let's create a data frame called \"bank\" that contains data from a CSV file stored on my computer in the \"Data\" folder. To do this, we call the Pandas `read_csv` method:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
EmployeeEducLevJobGradeYrHiredYrBornGenderYrsPriorPCJobSalary
01319269Male1No32.0
12118157Female1No39.1
23118360Female0No33.2
34218755Female7No30.6
45319267Male0No29.0
\n", "
" ], "text/plain": [ " Employee EducLev JobGrade YrHired YrBorn Gender YrsPrior PCJob \\\n", "0 1 3 1 92 69 Male 1 No \n", "1 2 1 1 81 57 Female 1 No \n", "2 3 1 1 83 60 Female 0 No \n", "3 4 2 1 87 55 Female 7 No \n", "4 5 3 1 92 67 Male 0 No \n", "\n", " Salary \n", "0 32.0 \n", "1 39.1 \n", "2 33.2 \n", "3 30.6 \n", "4 29.0 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bank = pd.read_csv('Data/Bank.csv')\n", "bank.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the `head(n)` method can be called on the new DataFrame `bank`. Head simply prints out the first _n_ rows of the data frame so you can see if it imported okay." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Hint:\n", " Pointing to a file on your computer is always tricky when you don't have a pop-up file system chooser like you do in most Windows applications. Some basic hints:\n", "
    \n", "
  1. If practical, put the data file in the same folder as your Jupyter notebook. In the example above, I created a subfolder called "Data" and put the bank data there. This makes it easier for you to remember (and type) the path to the data file.
  2. If your data must reside somewhere else on your computer or network: be very careful when typing. Cut and paste the path from Windows Explorer if required.
  3. Beware the slashes/backslashes. Unix-like operating systems separate directory names with a slash "/" whereas Microsoft Windows uses a backslash "\". In general, Python separates folders with forward slashes. The backslash is used in Python as the 'escape' character. If you cut and paste from Windows Explorer, you must either change all the backslashes to slashes or add the "r" prefix to your pathname to indicate a "raw" string of text (Python will ignore the special meaning of backslashes). So either of the following should work for a more complex file location:
    1. pd.read_csv('C:/Users/Michael Brydon/Data/Bank.csv') — all forward slashes
    2. pd.read_csv(r'C:\Users\Michael Brydon\Data/Bank.csv') — mixture of slashes and backslashes with the "r" prefix
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a DataFrame object from an Excel file\n", "The Pandas library supports importing from many different data sources, including CSV, Excel, JSON, or even direct database access over a network.\n", "\n", "Here we use `read_excel` instead of `read_csv`. It is the same data, but in a different file format on my computer:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
EmployeeEducLevJobGradeYrHiredYrBornGenderYrsPriorPCJobSalary
01319269Male1No32.0
12118157Female1No39.1
23118360Female0No33.2
34218755Female7No30.6
45319267Male0No29.0
\n", "
" ], "text/plain": [ " Employee EducLev JobGrade YrHired YrBorn Gender YrsPrior PCJob \\\n", "0 1 3 1 92 69 Male 1 No \n", "1 2 1 1 81 57 Female 1 No \n", "2 3 1 1 83 60 Female 0 No \n", "3 4 2 1 87 55 Female 7 No \n", "4 5 3 1 92 67 Male 0 No \n", "\n", " Salary \n", "0 32.0 \n", "1 39.1 \n", "2 33.2 \n", "3 30.6 \n", "4 29.0 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bankxl = pd.read_excel('Data/Bank.xlsx')\n", "bankxl.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Running `head()` after importing is a good idea just to make sure you imported what you think you imported. If not, you will have to either:\n", "- edit your source file (Excel) to make sure it has the necessary tabular structure (with the first row as headings, and so on)\n", "- add arguments to your `read_...` method to make sure you are starting at the correct row, reading the correct spreadsheet, and so on.\n", "\n", "Here is where Google-ing and learning Python go hand in hand. If worst comes to worst, read [the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Confirming data frame properties\n", "As noted previously, objects in Python expose useful properties and methods. For example, we can confirm the size of the `bank` data frame with the `shape` property, which gives us the number of rows (209) and columns (9):" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(208, 9)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bank.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the `describe()` method to generate some summary statistics:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
EmployeeEducLevJobGradeYrHiredYrBornYrsPriorSalary
count208.000000208.000000208.000000208.000000208.000000208.000000208.000000
mean104.5000003.1586542.75961585.32692354.6057692.37500039.921923
std60.1885921.4674641.5665296.98783210.3189883.13523711.256154
min1.0000001.0000001.00000056.00000030.0000000.00000026.700000
25%52.7500002.0000001.00000082.00000047.7500000.00000033.000000
50%104.5000003.0000003.00000087.00000056.5000001.00000037.000000
75%156.2500005.0000004.00000090.00000063.0000004.00000044.000000
max208.0000005.0000006.00000093.00000073.00000018.00000097.000000
\n", "
" ], "text/plain": [ " Employee EducLev JobGrade YrHired YrBorn YrsPrior \\\n", "count 208.000000 208.000000 208.000000 208.000000 208.000000 208.000000 \n", "mean 104.500000 3.158654 2.759615 85.326923 54.605769 2.375000 \n", "std 60.188592 1.467464 1.566529 6.987832 10.318988 3.135237 \n", "min 1.000000 1.000000 1.000000 56.000000 30.000000 0.000000 \n", "25% 52.750000 2.000000 1.000000 82.000000 47.750000 0.000000 \n", "50% 104.500000 3.000000 3.000000 87.000000 56.500000 1.000000 \n", "75% 156.250000 5.000000 4.000000 90.000000 63.000000 4.000000 \n", "max 208.000000 5.000000 6.000000 93.000000 73.000000 18.000000 \n", "\n", " Salary \n", "count 208.000000 \n", "mean 39.921923 \n", "std 11.256154 \n", "min 26.700000 \n", "25% 33.000000 \n", "50% 37.000000 \n", "75% 44.000000 \n", "max 97.000000 " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bank.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, notice the difference between a _property_ like `shape` with no parentheses and a _method_ like `describe()` with parentheses for passing arguments." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Hint:\n", " Press the Shift-Tab key while within the parantheses of a Python method for a list of the possible arguments. Hit Shift-Tab twice to get more detailed help.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Second, by default, the `describe()` method only summarizes the numerical columns. Recall that our data frame consists of nine columns. To see the issue, run the `info()` method:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 208 entries, 0 to 207\n", "Data columns (total 9 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Employee 208 non-null int64 \n", " 1 EducLev 208 non-null int64 \n", " 2 JobGrade 208 non-null int64 \n", " 3 YrHired 208 non-null int64 \n", " 4 YrBorn 208 non-null int64 \n", " 5 Gender 208 non-null object \n", " 6 YrsPrior 208 non-null int64 \n", " 7 PCJob 208 non-null object \n", " 8 Salary 208 non-null float64\n", "dtypes: float64(1), int64(6), object(2)\n", "memory usage: 14.8+ KB\n" ] } ], "source": [ "bank.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we see the nine columns in the data frame listed by data type. Six are integer values, one (Salary) is a floating point number, and two (Gender, PCJob) are \"Objects\" (string objects):" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting strings to categories\n", "Recall in R that we made a distinction between _strings_ (character data) like a person's name and *factors*, which look like strings but have a limited set of repeated values. Color is a factor, for example, because it takes on a few known values like \"red\", \"green\", and \"blue\". Similarly, a column called `car_brand` would be a factor because it would contain a finite number of string values, such as \"Ford\", \"Tesla\", \"Ferarri\".\n", "\n", "Pandas permits the same distinction, but instead uses the term _category_ instead of _factor_. 
This is telling: R uses the old statistical technical term \"factor\" whereas Pandas/Python uses the more straightforward term \"category\". This is the difference between the two languages in a nutshell.\n", "\n", "The process for replacing the two (string) \"Object\" columns with categories is similar to the one we used in R. The key is understanding how to reference columns in Python. Two possibilities:\n", "- square bracket notation: `bank['Gender']`\n", "- dot notation: `bank.Gender`\n", "\n", "Of these two, square bracket notation is slightly more flexible because it permits column names with spaces, e.g., `dataframe['column name']`. The dot notation of this would fail because Python has no way of knowing what the space after \"column\" means: `dataframe.column name`.\n", "\n", "Once we know how to reference a column (or a \"Series\" in Pandas-speak), we can run the type conversion method and specify \"category\" as the output data type:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Male\n", "1 Female\n", "2 Female\n", "3 Female\n", "4 Male\n", " ... \n", "203 Male\n", "204 Male\n", "205 Male\n", "206 Male\n", "207 Female\n", "Name: Gender, Length: 208, dtype: category\n", "Categories (2, object): [Female, Male]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bank['Gender'].astype('category')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note at the bottom of the summary that the `Gender` category consists of two values: Female and Male.\n", "\n", "Of course, running the command above just lists the converted values; it does not do anything to the \"bank\" data frame. 
To _replace_ the existing column in the data frame, we use the assignment operator:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "bank['Gender'] = bank['Gender'].astype('category')\n", "bank['PCJob'] = bank['PCJob'].astype('category')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can confirm the result by running the `describe()` method again, but this time using an argument to ensure we get only the categorical columns:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GenderPCJob
count208208
unique22
topFemaleNo
freq140189
\n", "
" ], "text/plain": [ " Gender PCJob\n", "count 208 208\n", "unique 2 2\n", "top Female No\n", "freq 140 189" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bank.describe(include='category')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This output tells us that we have 208 rows of data. Both the \"Gender\" and \"PCJob\" columns have two unique categorical values and the most frequent are \"Female\" and \"No\" respectively. 140 of the 208 employees are female while 189/208 do not have PC jobs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Type conversion using programming constructs\n", "As an aside, it is possible to do the same type conversions _en masse_ in Python using core programming constructs such as lists, iterators, and looping. Specifically, for a larger data set with many columns, it might make sense to do the following:\n", "1. make a list of the columns to convert\n", "2. iterate through each column making the conversion assignment, as above\n", "3. 
confirm the conversions by running the `info()` method\n", "\n", "Use of an iterated list is not really worth it in this case (two lines of code using either approach):" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 208 entries, 0 to 207\n", "Data columns (total 9 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Employee 208 non-null int64 \n", " 1 EducLev 208 non-null int64 \n", " 2 JobGrade 208 non-null int64 \n", " 3 YrHired 208 non-null int64 \n", " 4 YrBorn 208 non-null int64 \n", " 5 Gender 208 non-null category\n", " 6 YrsPrior 208 non-null int64 \n", " 7 PCJob 208 non-null category\n", " 8 Salary 208 non-null float64 \n", "dtypes: category(2), float64(1), int64(6)\n", "memory usage: 12.1 KB\n" ] } ], "source": [ "for col in ['Gender', 'PCJob']:\n", " bankxl[col] = bankxl[col].astype('category')\n", " \n", "bankxl.info()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }
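The loop-based conversion can be sketched end to end with a small, made-up stand-in for the bank data (the values below are illustrative only):

```python
import pandas as pd

# A tiny, hypothetical data frame with two string columns
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female'],
    'PCJob': ['No', 'No', 'Yes'],
    'Salary': [32.0, 39.1, 33.2],
})

# Step 1: list the columns to convert;
# steps 2-3: loop over the list, converting each column in turn
for col in ['Gender', 'PCJob']:
    df[col] = df[col].astype('category')

print(df.dtypes)  # Gender and PCJob are now 'category'
```

With only two columns the loop saves nothing, but the same pattern scales to dozens of columns without any extra lines of code.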