Pandas - Data Visualization
For a very quick analysis of a pandas DataFrame, we can use the plotting methods built into pandas objects.
The Data
We'll use some CSV files of fake data that we can read in as DataFrames.
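As a minimal sketch of that loading step, the snippet below reads CSV text into a DataFrame with pd.read_csv. The column names, values and the variable name df1 are made up for illustration; in practice you would pass the path of one of the csv files instead of the in-memory text.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for one of the fake data files.
csv_text = """A,B,C
0.1,1.5,3.0
0.4,2.5,1.0
0.9,0.5,2.0
"""

# pd.read_csv accepts a file path or any file-like object.
df1 = pd.read_csv(io.StringIO(csv_text))

print(df1.head())
print(df1.dtypes)
```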
Style Sheets
Matplotlib has style sheets you can use to make your plots look a little nicer. These style sheets include bmh, fivethirtyeight, ggplot and more. They basically create a set of style rules that your plots follow.
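Here is how to use them.
Before calling plt.style.use(), your plots use the default Matplotlib style.
To change the style, call plt.style.use() with the name of a style sheet, for example plt.style.use('ggplot').
Plots created after that call pick up the new style.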
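The before-and-after comparison can be sketched roughly as follows. The DataFrame here is random data invented for the example, and the Agg backend line is only there so the snippet runs without a display; in a notebook you would drop it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data for illustration only.
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.standard_normal((200, 2)), columns=["A", "B"])

# Histogram drawn with whatever style is active before the change.
ax_before = df1["A"].hist()
plt.close("all")

# Switch to the ggplot style sheet; plots created from now on follow it.
plt.style.use("ggplot")
ax_after = df1["A"].hist()

# plt.style.available lists every style name you can pass to plt.style.use().
print(plt.style.available)
```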
Other options are bmh, dark_background and fivethirtyeight.
Let's stick with the ggplot style and actually show you how to use pandas' built-in plotting capabilities!
Plot Types
There are several plot types built into pandas, most of them statistical plots by nature:
df.plot.area
df.plot.barh
df.plot.density
df.plot.hist
df.plot.line
df.plot.scatter
df.plot.bar
df.plot.box
df.plot.hexbin
df.plot.kde
df.plot.pie
You can also just call df.plot(kind='hist') or replace that kind argument with any of the key terms shown in the list above (e.g. 'box', 'bar', etc.).
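To illustrate that the two spellings are interchangeable, here is a small sketch that draws the same histogram both ways and then a box plot through the kind argument. The data is random and the column names are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data for illustration only.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.standard_normal((100, 2)), columns=["a", "b"])

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))

# Accessor form: df[...].plot.hist(...)
df["a"].plot.hist(bins=20, ax=left, title="plot.hist()")

# Equivalent form using the kind argument.
df["a"].plot(kind="hist", bins=20, ax=right, title="plot(kind='hist')")

# Any other key term from the list works the same way, e.g. a box plot.
box_ax = df.plot(kind="box")
```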
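In a scatter plot, you can use c to colour the points based on another column's value, and cmap to indicate which colormap to use.
Or use s to set the size of the points based on another column. Pass s as an array of sizes computed from that column rather than just the name of the column, which older pandas versions do not accept.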
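A minimal sketch of these scatter options, assuming a DataFrame with made-up columns A, B and C. The scaling factor of 200 is an arbitrary choice to make the markers visible.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data in [0, 1) so that the marker sizes are non-negative.
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.random((50, 3)), columns=["A", "B", "C"])

# c: column used for the colour of each point, cmap: colormap to apply,
# s: array of marker sizes, here derived from column C.
ax = df.plot.scatter(x="A", y="B", c="C", cmap="coolwarm", s=df["C"] * 200)
```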
A hexagonal bin plot is useful for bivariate data, and is an alternative to a scatter plot.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])
df.plot.hexbin(x='a', y='b', gridsize=25, cmap='Oranges')
Exercises
Use the df3 data set to replicate the following plots.
Download: df3.csv
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 4 columns):
a 500 non-null float64
b 500 non-null float64
c 500 non-null float64
d 500 non-null float64
dtypes: float64(4)
memory usage: 15.7 KB
- Recreate this scatter plot of b vs a. Note the colour and size of the points. Also note the figure size. See if you can figure out how to stretch it in a similar fashion. Remember back to your matplotlib lecture.
- Create a histogram of the 'a' column.
- These plots are okay, but they don't look very polished. Use style sheets to set the style to 'ggplot' and redo the histogram from above. Also figure out how to add more bins to it.
- Create a boxplot comparing the a and b columns.
- Create a kde plot of the 'd' column.
- Figure out how to increase the linewidth and make the linestyle dashed. (Note: you would usually not dash a kde plot line.)
- Create an area plot of all the columns for just the rows up to 30. (Hint: use .loc)
- Note: you may find this one really hard, so reference the solutions if you can't figure it out! Notice how the legend in the previous figure overlapped some of the actual diagram. Can you figure out how to display the legend outside of the plot? Try searching Google for a good Stack Overflow link on this topic.
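If you get stuck on the last exercise, one possible approach (not necessarily the one used in the solutions) is to anchor the legend to a point just outside the axes with bbox_to_anchor. The sketch below uses random data as a stand-in for df3.csv, so the plot will not match the exercise figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in with the same shape and column names as df3.
rng = np.random.default_rng(3)
df3 = pd.DataFrame(rng.random((500, 4)), columns=list("abcd"))

# Area plot of all columns for the rows up to label 30.
ax = df3.loc[0:30].plot.area(alpha=0.4)

# Place the legend's centre-left corner at the right edge of the axes,
# which puts the whole legend box outside the plotting area.
legend = ax.legend(loc="center left", bbox_to_anchor=(1.0, 0.5))

# Shrink the axes so the legend still fits inside the figure.
ax.figure.tight_layout()
```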