2. Basic Visualization

In this tutorial we show how Python and its graphics libraries can be used to create the two most common types of distributional plots: histograms and boxplots.

2.1. Preliminaries

I include the data import and library import commands at the start of each lesson so that the lessons are self-contained.

import pandas as pd
bank = pd.read_csv('Data/Bank.csv')

2.2. Basic descriptive statistics

Pandas provides basic descriptive statistics as methods of the Series object. Recall that each DataFrame object consists of multiple Series (columns). Thus, the average salary for bank employees can be found as:

bank['Salary'].mean()
39.921923076923086

Similarly, using a variable to save some typing:

sal = bank['Salary']
sal.min(), sal.mean(), sal.median(), sal.max() 
(26.7, 39.921923076923086, 37.0, 97.0)

Or, recall, we can get a statistical summary of all numerical columns using the describe() method:

bank.describe()
         Employee     EducLev    JobGrade     YrHired      YrBorn    YrsPrior      Salary
count  208.000000  208.000000  208.000000  208.000000  208.000000  208.000000  208.000000
mean   104.500000    3.158654    2.759615   85.326923   54.605769    2.375000   39.921923
std     60.188592    1.467464    1.566529    6.987832   10.318988    3.135237   11.256154
min      1.000000    1.000000    1.000000   56.000000   30.000000    0.000000   26.700000
25%     52.750000    2.000000    1.000000   82.000000   47.750000    0.000000   33.000000
50%    104.500000    3.000000    3.000000   87.000000   56.500000    1.000000   37.000000
75%    156.250000    5.000000    4.000000   90.000000   63.000000    4.000000   44.000000
max    208.000000    5.000000    6.000000   93.000000   73.000000   18.000000   97.000000
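
When only a few of these statistics are needed for one column, they can be requested in a single call. The following is a minimal sketch; to keep it runnable without Data/Bank.csv it uses a small made-up stand-in for the Salary column, so the values are illustrative only and are not the bank data.

```python
import pandas as pd

# Hypothetical stand-in for bank['Salary']; these values are made up
# for illustration and are NOT taken from Bank.csv.
sal = pd.Series([26.7, 33.0, 37.0, 44.0, 97.0], name='Salary')

# agg() accepts a list of statistic names and returns a Series
# indexed by those names.
summary = sal.agg(['min', 'mean', 'median', 'max'])
print(summary)

# describe() also works on a single column (Series).
print(sal.describe())
```

With the real data, replacing the stand-in with sal = bank['Salary'] should give the same layout of results.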

2.3. Histograms in Seaborn

Two graphics libraries are in common use in Python: Matplotlib and Seaborn. Seaborn is built on top of Matplotlib and addresses a few specific graphics challenges, including histograms and boxplots. As such, we will restrict our attention here to Seaborn.

2.4. Loading the library

As before, we must load a library before we can use it. Seaborn is typically aliased as sns, but this is just a convention.

import seaborn as sns

2.5. Creating a histogram

Histograms are created in Seaborn using the histplot() (histogram plot) function. The syntax of Seaborn is closer to R than to typical Python: rather than calling a method on the data, we call a function from the Seaborn module (sns) and pass the data (here, one column of the data frame) as an argument.

sns.histplot(x=bank['Salary'])
<AxesSubplot:xlabel='Salary', ylabel='Count'>
_images/03_visualization_12_1.png

A few things to notice about this output:

  • The histplot() function returns an AxesSubplot object, which the notebook echoes above the plot. Since we don’t need this (or even need to know what it is), we can clean up our output by ending each Seaborn (or Matplotlib) call with a semicolon.

  • Seaborn guesses at a good number of bins. It appears to be more than the default in R. But recall that the point of a histogram is to get a rough sense of the shape of the distribution of the variable. We can certainly change the number of bins (to say 10 or 12), but it is not critical.

We can pass some arguments to the function to get a more elaborate histogram. Turning on the kernel density estimate (kde=True) adds a smoothed “kernel density” line, as in SAS EG.

sns.histplot(x=bank['Salary'], bins=10, kde=True);
_images/03_visualization_14_0.png

Of course, it is possible to change colors, and so on. I have split the more detailed function call below over multiple lines, which is more readable and more in keeping with R-style coding.

sns.histplot(x=bank['Salary'], 
             bins=10, kde=False,
             stat="probability",
             color='green' 
            );
_images/03_visualization_16_0.png

2.6. Creating a boxplot

Creating a boxplot in Seaborn is very simple:

sns.boxplot(x=bank['Salary']);
_images/03_visualization_18_0.png

If you prefer a vertical orientation, you can pass your data as the y variable instead of the x variable used above. Also, notice that Seaborn does not provide an indicator of the mean by default; passing showmeans=True adds a marker for it. Skewed data such as this pulls the mean away from the median, and I like to eyeball the difference between the two measures.

sns.boxplot(y=bank['Salary'],  color='lightgreen', showmeans=True);
_images/03_visualization_20_0.png
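
To connect the boxplot to the numbers, the sketch below works through the arithmetic using the Salary quartiles from the describe() output in section 2.2. It assumes Seaborn's default whisker rule of 1.5 times the interquartile range (whis=1.5); if a different whis value is passed to boxplot(), the fences change accordingly.

```python
# Salary statistics copied from the describe() output in section 2.2.
q1, median, q3 = 33.0, 37.0, 44.0
sal_min, sal_max = 26.7, 97.0
mean = 39.921923

# The box spans the interquartile range (IQR).
iqr = q3 - q1

# Assumed default rule (whis=1.5): points beyond these fences are drawn
# individually as outliers. The whiskers themselves stop at the most
# extreme data points that still lie inside the fences.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

has_low_outliers = sal_min < lower_fence
has_high_outliers = sal_max > upper_fence

# A mean above the median is consistent with right-skewed salaries.
mean_minus_median = mean - median

print(iqr, lower_fence, upper_fence)
print(has_low_outliers, has_high_outliers, mean_minus_median > 0)
```

Under that assumption the maximum salary lies above the upper fence while the minimum lies inside the lower one, so outlier points are expected only at the high end of the plot.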