Seaborn - Data Visualization
Seaborn is a statistical visualization library built on top of matplotlib, and is designed to work well with pandas DataFrame objects.
Distribution Plots
We'll start with a built-in data set from the seaborn library.
We can plot a distribution (histogram) for univariate numerical data:
import seaborn as sns
tips = sns.load_dataset('tips')
sns.displot(tips['total_bill'], kde=False)
sns.displot(tips['total_bill'], bins=30) # second image below
For bivariate numerical data we can plot the two distributions together using jointplot.
To visualize pairwise relationships between numerical data across an entire data frame we can use pairplot.
This draws a jointplot-style scatter plot for every pair of numerical columns in the data frame, and arranges the plots in a (symmetric) grid. The diagonal shows a univariate histogram for each column.
Using the optional hue argument on a categorical column will colour the data points according to their categorical value.
Kernel Density Estimates
# Don't worry about understanding this code!
# It's just for the diagram below
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
#Create dataset
dataset = np.random.randn(25)
# Create another rugplot
sns.rugplot(dataset);
# Set up the x-axis for the plot
x_min = dataset.min() - 2
x_max = dataset.max() + 2
# 100 equally spaced points from x_min to x_max
x_axis = np.linspace(x_min,x_max,100)
# Set up the bandwidth, for info on this:
url = 'http://en.wikipedia.org/wiki/Kernel_density_estimation#Practical_estimation_of_the_bandwidth'
bandwidth = ((4*dataset.std()**5)/(3*len(dataset)))**.2
# Create an empty kernel list
kernel_list = []
# Plot each basis function
for data_point in dataset:
    # Create a kernel for each point and append to list
    kernel = stats.norm(data_point,bandwidth).pdf(x_axis)
    kernel_list.append(kernel)
    #Scale for plotting
    kernel = kernel / kernel.max()
    kernel = kernel * .4
    plt.plot(x_axis,kernel,color = 'grey',alpha=0.5)
plt.ylim(0,1)
# To get the kde plot we can sum these basis functions.
# Plot the sum of the basis function
sum_of_kde = np.sum(kernel_list,axis=0)
# Plot figure
fig = plt.plot(x_axis,sum_of_kde,color='steelblue')
# Add the initial rugplot
sns.rugplot(dataset, color='steelblue')
# Get rid of y-tick marks
plt.yticks([])
# Set title
plt.suptitle("Sum of the Basis Functions")
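The diagram code sums the kernels without normalising them, which is fine for a picture but leaves a curve whose area is the number of data points rather than 1. As a rough numpy-only sketch (the gaussian_pdf helper and the wider grid are our own choices, not part of the code above), dividing the sum by the number of points gives a curve that integrates to approximately 1, as a density estimate should.

```python
import numpy as np

rng = np.random.default_rng(0)
dataset = rng.standard_normal(25)

# Same rule-of-thumb bandwidth as in the diagram code.
bandwidth = ((4 * dataset.std() ** 5) / (3 * len(dataset))) ** 0.2

# Evaluation grid, wide enough that the Gaussian tails are negligible.
x_axis = np.linspace(dataset.min() - 5, dataset.max() + 5, 2000)


def gaussian_pdf(x, mean, sd):
    """Normal density with the given mean and standard deviation."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))


# One kernel per data point: rows are kernels, columns are grid points.
kernels = np.array([gaussian_pdf(x_axis, p, bandwidth) for p in dataset])

# Averaging (sum divided by n) turns the sum of kernels into a density.
kde = kernels.mean(axis=0)

# Riemann-sum approximation of the area under the curve.
area = np.sum(kde) * (x_axis[1] - x_axis[0])
print(f"bandwidth = {bandwidth:.3f}, area under KDE = {area:.4f}")
```

A larger bandwidth gives a smoother curve and a smaller one follows the individual points more closely; the area stays at 1 either way.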
Categorical Plots
Bar Plots
# displays the standard deviation of total_bill for each sex
sns.barplot(x='sex', y='total_bill', data=tips,estimator=np.std)
Box and Whisker Plots
Violin Plots
Strip Plots
Swarm Plots
sns.violinplot(x='day',y='total_bill',data=tips)
sns.swarmplot(x='day',y='total_bill',data=tips, color='black')
Cat Plots
catplot is the general type of plot for categorical data. All the specific commands above are just a type of catplot.
sns.catplot(x='day', y='total_bill', data=tips, kind='bar')
sns.catplot(x='day', y='total_bill', data=tips, kind='violin')
sns.catplot(x='day', y='total_bill', data=tips, kind='strip',
            hue='sex', dodge=True)
Matrix Plots
For the plots we will explore in this section we need to restructure our tables so each row and column represent a variable. In the case of the tips data set we'll look at a simple example where we construct a correlation table. Notice each row corresponds to a variable, and so does each column.
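A correlation table of this kind can be sketched as follows. The data frame is a made-up stand-in with column names borrowed from tips (synthetic values, so nothing needs downloading), and select_dtypes is used so that the categorical column is left out before computing correlations.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for tips: three numerical columns and one categorical.
rng = np.random.default_rng(2)
total_bill = rng.uniform(5, 50, size=80)
demo = pd.DataFrame({
    "total_bill": total_bill,
    "tip": 0.15 * total_bill + rng.normal(0, 1, size=80),
    "size": rng.integers(1, 7, size=80),
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], size=80),
})

# Keep only the numerical columns, then compute pairwise correlations.
# Rows and columns of the result are both labelled by variable name.
corr = demo.select_dtypes("number").corr()

# Colour each cell by its value and print the value inside the cell.
ax = sns.heatmap(corr, annot=True, cmap="coolwarm")
```

The table is square and symmetric, with ones on the diagonal, which is the matrix form a heatmap expects.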
We use a pivot table to restructure the flights data: rows correspond to months, columns to years, and the values come from the passengers column.
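The restructuring step might look like the sketch below. It uses a tiny illustrative table with the year, month and passengers column names rather than the full flights data, so it runs offline; the same pivot_table call should apply to the real data frame.

```python
import pandas as pd
import seaborn as sns

# Tiny illustrative long-format table: one row per (year, month) pair.
flights = pd.DataFrame({
    "year": [1949, 1949, 1949, 1950, 1950, 1950],
    "month": ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    "passengers": [112, 118, 132, 115, 126, 141],
})

# Rows become months, columns become years, cells hold passenger counts.
table = flights.pivot_table(index="month", columns="year", values="passengers")

# The matrix-shaped table can be passed straight to heatmap.
ax = sns.heatmap(table, cmap="magma")
```

Because the months here are plain strings, the pivot table sorts them alphabetically; an ordered categorical month column would keep calendar order.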
Grid
iris = sns.load_dataset('iris')
# create an empty grid of axes to plot on, store it in variable g
g = sns.PairGrid(iris)
g.map(plt.scatter)
g = sns.PairGrid(iris, hue='species')
g.map_diag(sns.histplot)
g.map_upper(plt.scatter)
g.map_lower(sns.kdeplot)
g = sns.FacetGrid(tips, col='time', row='sex')
g.map(sns.histplot,'total_bill', color = 'steelblue')
g = sns.FacetGrid(tips, col='time', row='sex')
g.map(plt.scatter,'total_bill','tip', color = 'forestgreen')
Regression Plots
In this section we explore the lmplot command for producing a linear model (regression) plot over a scatter plot.
import seaborn as sns
tips = sns.load_dataset('tips')
sns.lmplot(x='total_bill', y='tip', data=tips)
Under the hood lmplot is calling matplotlib, so we can pass parameters straight through to it using the kws arguments (for example scatter_kws and line_kws).
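As a hedged sketch of passing matplotlib parameters through lmplot, the snippet below uses the scatter_kws and line_kws dictionaries on a made-up data frame (synthetic values, column names borrowed from tips). The particular marker size, transparency and line width are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for tips: tip is roughly 15% of the bill plus noise.
rng = np.random.default_rng(3)
total_bill = rng.uniform(5, 50, size=100)
demo = pd.DataFrame({
    "total_bill": total_bill,
    "tip": 0.15 * total_bill + rng.normal(0, 1, size=100),
})

g = sns.lmplot(
    x="total_bill", y="tip", data=demo,
    scatter_kws={"s": 40, "alpha": 0.6},  # forwarded to the scatter points
    line_kws={"linewidth": 2},            # forwarded to the regression line
)
```

Keys in these dictionaries are matplotlib keyword names, so 's' is the marker size used by plt.scatter rather than a seaborn-specific option.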
We can produce a FacetGrid by using the col and row parameters.
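A small sketch of the faceted version follows, again on a made-up data frame with column names borrowed from tips (synthetic values). Each combination of the row and col categories gets its own scatter plot and regression line.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for tips with two categorical columns to facet on.
rng = np.random.default_rng(4)
n = 120
total_bill = rng.uniform(5, 50, size=n)
demo = pd.DataFrame({
    "total_bill": total_bill,
    "tip": 0.15 * total_bill + rng.normal(0, 1, size=n),
    "time": rng.choice(["Lunch", "Dinner"], size=n),
    "sex": rng.choice(["Male", "Female"], size=n),
})

# One column of panels per 'time' value, one row per 'sex' value.
g = sns.lmplot(x="total_bill", y="tip", data=demo, col="time", row="sex")
```

The returned object is a FacetGrid, so its axes attribute holds a 2 x 2 array of matplotlib axes here.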
Style and Colour
Exercises
We will be working with the famous Titanic data set for these exercises. Later on in the Machine Learning section of the course, we will revisit this data and use it to predict the survival of passengers. For now, we'll just focus on visualizing the data with seaborn:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('whitegrid')
titanic = sns.load_dataset('titanic')
titanic.head()
Recreate the plots below using the titanic dataframe. There are very few hints, since most of the plots can be done with just one or two lines of code and a hint would basically give away the solution. Pay careful attention to the x and y labels for hints.
1.
2.
3.
4.
5.
6.
7.