Data visualisation using Matplotlib
OverviewTeaching: 25 min
Exercises: 25 minQuestions
How can I create simple visualisations of my data?Objectives
Import pyplot from the matplotlib library
Create simple plots using pyplot
Matplotlib is a Python graphical library that can be used to produce a variety of different graph types.
The pandas library contains very tight integration with matplotlib. There are functions in pandas that automatically call matplotlib functions to produce graphs.
Although we are using Matplotlib in this episode, pandas can make use of several other graphical libraries available from within Python such as ggplot2 and seaborn.
The matplotlib library can be imported using any of the import techniques we have seen. As
pandas is generally imported with
import panas as pd, you will find that
matplotlib is most commonly imported with
import matplotlib as plt where ‘plt’ is the alias.
In addition to importing the library, in a Jupyter notebook environment we need to tell Jupyter that when we produce a graph we want it to be display the graph in a cell in the notebook just like any other results. To do this we use the
%matplotlib inline directive.
If you forget to do this, you will have to add
plt.show() to see the graphs.
import matplotlib.pyplot as plt %matplotlib inline
Numpy is another Python library. It is used for multi-dimensional array processing. In our case we just want to use it for its useful random number generation functions which we will use to create some fake data to demonstrate some of the graphing functions of matplotlib.
We will use the alias
np, following convention.
will generate 20 random numbers between 0 and 1.
We are using these to create a pandas Series of values.
A bar chart only needs a single set of values. Each ‘bar’ represents the value from the Series of values. A pandas Series (and a Dataframe) have a method called ‘plot’. We only need to tell plot what kind of graph we want.
The ‘x’ axis represents the index values of the Series
import numpy as np import pandas as pd np.random.seed(12345) # set a seed value to ensure reproducibility of the plots s = pd.Series(np.random.rand(20) ) #s # plot the bar chart s.plot(kind='bar')
Internally the pandas ‘plot’ method has called the ‘bar’ method of matplotlib and provided a set of parameters, including the pandas.Series s to generate the graph.
We can use matplotlib directly to produce a similar graph. In this case we need to pass two parameters, the number of bars we need and the pandas Series holding the values.
We also have to explicitly call the
show() function to produce the graph.
plt.bar(range(len(s)), s) plt.show()
Compare the two graphs we have just drawn. How do they differ? Are the differences significant?
Most importantly the data in the graphs is the same. There are cosmetic differentces in the scale points in the x and y axis and in the width of the bars.
The width of the bars can be changed with a parameter in the ‘bar’ function
plt.bar(range ( len ( s )), s, width = 0.5) # the default width is 0.8
We can plot histograms in a similar way, directly from pandas and also from Matplotlib
The pandas way
s = pd.Series(np.random.rand(20)) # plot the bar chart s.plot(kind='hist')
and the matplotlib way
For the Histogram, each data point is allocated to 1 of 10 (by default) equal ‘bins’ of equal size (range of numbers) which are indicated along the x axis and the number of points (frequency) is shown on the y axis.
In this case the graphs are almost identical. The only difference being in the first graph the y axis has a label ‘Frequency’ associated with it.
We can fix this with a call to the
plt.ylabel('Frequency') plt.hist(s) plt.show()
In general most graphs can be broken down into a series of elements which, although typically related in some way, can all exist independently of each other. This allows us to create the graph in a rather piecemeal fashion.
The labels (if any) on the x and y axis are independent of the data values being represented. The title and the legend are also independent objects within the overall graph.
In matplotlib you create the graph by providing values for all of the individual components you choose to include. When you are ready, you call the
Using this same approach we can plot two sets of data on the same graph
We will use a scatter plot to demonstrate some of the available features.
For a scatter plot we need two sets of data points one for the x values and the other for the y values.
The scatter plot requires the x and y coordinates of each of the points being plotted. To provide this we will generate two series of random data one for the x coordinates and the other for the y coordinates
We will generate two sets of points and plot them on the same graph.
We will also add other common features like a title, a legend and labels on the x and y axis.
# Generate some date for 2 sets of points. x1 = pd.Series(np.random.rand(20) - 0.5 ) y1 = pd.Series(np.random.rand(20) - 0.5 ) x2 = pd.Series(np.random.rand(20) + 0.5 ) y2 = pd.Series(np.random.rand(20) + 0.5 ) # Add some features plt.title('Scatter Plot') plt.ylabel('Range of y values') plt.xlabel('Range of x values') # plot the points in a scatter plot plt.scatter(x1,y1, c='red', label='Red Range' ) # 'c' parameter is the colour and 'label' is the text for the legend plt.scatter(x2,y2, c='blue', label='Blue Range') plt.legend( loc=4 ) # the locations 1,2,3 and 4 are top-right, top-left, bottom-left and bottom-right # Show the graph with the two sets of points plt.show()
In the call to the
scatter method, the
label parameter values are used by the legend.
color parameter can be set to any color matplotlib recognises. Full details of the available colours are available in the matplotlib website. The markers section will tell you what markers you can use instead of the default ‘dots’. There is also an
s (size) parameter which allows you to change the size of the marker.
In the scatterplot the s parameter determines the size of the dots. s can be a simple numeric value, say s=100, which will produce dots all of the same size. However you can pass a list of values (or a pandas Series) to provide sizes for the individual dots. This approach is very common as it allows us to provide an extra variable worth of information on the graph.
Modify the code we used for the scatter plot to include a size value for each of the points in the series being plotted. (The downside is that some of the smaller dots may be completely covered by the larger dots. To try and highlight when this has happened we can change the opacity of the dots.)
Find out which parameter controls the opacity of the dots ( clue - it is not called opacity), add it to you code and set it > to a reasonable value .
# Generate some data for 2 sets of points. # and additional data for the sizes - suitably scaled x1 = pd.Series(np.random.rand(20) - 0.5 ) y1 = pd.Series(np.random.rand(20) - 0.5 ) z1 = pd.Series(np.random.rand(20)*200 ) x2 = pd.Series(np.random.rand(20) + 0.5 ) y2 = pd.Series(np.random.rand(20) + 0.5 ) z2 = pd.Series(np.random.rand(20)*200 ) # Add some features plt.title('Scatter Plot') plt.ylabel('Range of y values') plt.xlabel('Range of x values') # plot the points in a scatter plot plt.scatter(x1,y1, c='red', label='Red Range', s=z1, alpha=0.5 ) # 's' parameter is the dot size plt.scatter(x2,y2, c='blue', label='Blue Range', s=z2, alpha=0.5) # 'alpha' is the opacity plt.legend( loc=4 ) plt.show()
A boxplot provides a simple representation of a variety of statistical qualities of a single set of data values.
x = pd.Series(np.random.standard_normal(256)) # Show a boxplot of the data plt.boxplot(x) plt.show()
A common use of the boxplot is to compare the statistical variations across a set of variables.
The variables can be an independent series or columns of a Dataframe.
df = pd.DataFrame(np.random.normal(size=(100,5)), columns=list('ABCDE')) # creating a Dataframe directly with pandas plt.boxplot(df.A, labels = 'A') plt.show()
Can you change the code above so that columns
Dare all displayed on the same graph?
df = pd.DataFrame(np.random.normal(size=(100,5)), columns=list('ABCDE')) plt.boxplot([df.A, df.C, df.D], labels = ['A', 'C', 'D']) plt.show()
The boxplot function cannot accept a whole Dataframe. The code
df = pd.DataFrame(np.random.normal(size=(100,5)), columns=list('ABCDE')) plt.boxplot(df) plt.show()
However we can use the pandas plot method
df = pd.DataFrame(np.random.normal(size=(100,5)), columns=list('ABCDE')) df.plot(kind = 'box', return_type='axes') # the return_type='axes' is only needed for forward compatibility
We can add a title to the above by adding the
title parameter. However there are no parameters for adding the axis labels.
To add labels we can use matplotlib directly.
df = pd.DataFrame(np.random.normal(size=(100,5)), columns=list('ABCDE')) df.plot(kind = 'box', return_type='axes') plt.title('Box Plot') plt.xlabel('xlabel') plt.ylabel('ylabel') plt.show()
Saving a graph
If you wish to save your graph as an image you can do so using the
savefig() function. The image can be saved as a pdf, jpg or png file by changing the file extension.
df = pd.DataFrame(np.random.normal(size=(100,5)), columns=list('ABCDE')) df.plot(kind = 'box', return_type='axes') plt.title('Box Plot') plt.xlabel('xlabel') plt.ylabel('ylabel') #plt.show() plt.savefig('boxplot_from_df.pdf')
Graphs can be drawn directly from pandas, but it still uses matplotlib
Different graph types have different data requirements
Graphs are created from a variety of discrete components placed on a ‘canvas’, you don’t have to use them all
Plotting multiple graphs on a single ‘canvas’ is possible