Data Visualisation with Seaborn and Matplotlib

Author: Olatomiwa Bifarin.

Read as a draft notebook

In [258]:
# Disable warnings in Anaconda
import warnings
warnings.filterwarnings('ignore')

# Matplotlib forms the basis for visualization in Python
import matplotlib.pyplot as plt

# We will use the Seaborn library
import seaborn as sns
sns.set()
sns.set_context('talk', font_scale=0.8)

# Render inline figures at retina (2x) resolution so they look sharp and legible
%config InlineBackend.figure_format = 'retina'

import pandas as pd

1. Data Setup

In [259]:
data = pd.read_csv("goodreads_data.csv") 
In [260]:
data.head(2)
Out[260]:
Book Id Title Author Additional Authors ISBN MyRating AverageRating Publisher Binding Pages Year Published PublicationYear YearRead DateRead Date Added Bookshelves Bookshelves with positions ExclusiveShelf My Review ReadCount
0 17397466 An Introduction to Statistical Learning: With ... Gareth James Trevor Hastie, Robert Tibshirani, Daniela Witten 1461471370 0 4.60 Springer Hardcover 426.0 2017.0 2013.0 NaN NaN 5/7/18 currently-reading currently-reading (#5) currently-reading NaN 1
1 24203476 The Self-made Billionaire Effect Deluxe: How E... John Sviokla Mitch Cohen NaN 0 3.75 Portfolio ebook 198.0 2014.0 2014.0 NaN NaN 9/25/19 currently-reading currently-reading (#4) currently-reading NaN 1

The dimensions of the dataset (rows, columns).

In [261]:
print(data.shape)
(284, 20)

Here are the features in my Goodreads dataset.

In [262]:
data.columns
Out[262]:
Index(['Book Id', 'Title', 'Author', 'Additional Authors', 'ISBN', 'MyRating',
       'AverageRating', 'Publisher', 'Binding', 'Pages', 'Year Published',
       'PublicationYear', 'YearRead', 'DateRead', 'Date Added', 'Bookshelves',
       'Bookshelves with positions', 'ExclusiveShelf', 'My Review',
       'ReadCount'],
      dtype='object')

Let's look at the data types.

In [263]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284 entries, 0 to 283
Data columns (total 20 columns):
Book Id                       284 non-null int64
Title                         284 non-null object
Author                        284 non-null object
Additional Authors            70 non-null object
ISBN                          239 non-null object
MyRating                      284 non-null int64
AverageRating                 284 non-null float64
Publisher                     271 non-null object
Binding                       277 non-null object
Pages                         276 non-null float64
Year Published                276 non-null float64
PublicationYear               260 non-null float64
YearRead                      156 non-null float64
DateRead                      153 non-null object
Date Added                    284 non-null object
Bookshelves                   128 non-null object
Bookshelves with positions    128 non-null object
ExclusiveShelf                284 non-null object
My Review                     84 non-null object
ReadCount                     284 non-null int64
dtypes: float64(5), int64(3), object(12)
memory usage: 44.5+ KB
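
The non-null counts reported by info() can also be turned around into missing-value counts per column. The following is a minimal sketch on a toy frame; the values are hypothetical and only the column names are borrowed from the dataset above.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values (not the real Goodreads export).
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Pages": [120.0, np.nan, 300.0, np.nan],
    "YearRead": [2016.0, 2017.0, np.nan, 2018.0],
})

# Number of missing entries per column: the complement of info()'s non-null counts.
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)
```

Applied to the real frame, the same call on data should report 8 missing values for Pages, since info() lists 276 non-null entries out of 284 rows.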
In [264]:
cols_to_use = ['Title', 'Author', 'MyRating', 'AverageRating', 
              'Pages', 'PublicationYear', 'YearRead', 
               'DateRead', 'ExclusiveShelf', 'ReadCount']
In [265]:
df=data[cols_to_use]
df.head()
Out[265]:
Title Author MyRating AverageRating Pages PublicationYear YearRead DateRead ExclusiveShelf ReadCount
0 An Introduction to Statistical Learning: With ... Gareth James 0 4.60 426.0 2013.0 NaN NaN currently-reading 1
1 The Self-made Billionaire Effect Deluxe: How E... John Sviokla 0 3.75 198.0 2014.0 NaN NaN currently-reading 1
2 How to Stop Worrying and Start Living Dale Carnegie 5 4.12 358.0 1944.0 2018.0 8/10/18 read 2
3 Enchiridion Epictetus 5 4.23 64.0 125.0 2016.0 5/18/16 read 2
4 Everyday Calculus: Discovering the Hidden Math... Oscar E. Fernandez 0 3.50 150.0 2014.0 NaN NaN currently-reading 1

A dataframe for books read

In [266]:
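# Keep only books that have been read at least once and that I have rated.
# The second drop must start from dfRead, otherwise the first filter is discarded.
dfRead = df.drop(df.index[df.ReadCount == 0])
dfRead = dfRead.drop(dfRead.index[dfRead.MyRating == 0])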
In [267]:
dfRead.shape
Out[267]:
(154, 10)
In [268]:
dfRead.head()
Out[268]:
Title Author MyRating AverageRating Pages PublicationYear YearRead DateRead ExclusiveShelf ReadCount
2 How to Stop Worrying and Start Living Dale Carnegie 5 4.12 358.0 1944.0 2018.0 8/10/18 read 2
3 Enchiridion Epictetus 5 4.23 64.0 125.0 2016.0 5/18/16 read 2
6 Shoe Dog: A Memoir by the Creator of NIKE Phil Knight 5 4.48 400.0 2016.0 2019.0 9/17/19 read 1
9 Socrates en Òrúnmìlà. Wat we van Afrikaanse fi... Sophie Bosede Oluwole 5 3.35 224.0 NaN 2019.0 8/30/19 read 1
10 The Spy with No Name Jeff Maysh 5 3.35 NaN NaN 2019.0 8/8/19 read 1
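
The same "read and rated" filter can be expressed as a single boolean mask, which avoids chaining two drop calls. This is a sketch on a toy frame with hypothetical values; toy, mask, and toy_read are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the notebook's.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "MyRating": [5, 0, 4, 0],
    "ReadCount": [1, 1, 2, 0],
})

# Keep rows that were read at least once AND carry a non-zero rating.
mask = (toy["ReadCount"] != 0) & (toy["MyRating"] != 0)

# .copy() gives an independent frame, so later column edits do not touch toy.
toy_read = toy[mask].copy()
print(toy_read)
```

Combining both conditions in one mask makes it harder to accidentally overwrite the result of the first filter with the second.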

2. Histogram and Density Plots

The distribution of the average Goodreads user rating for the books I have read

In [289]:
dfRead.AverageRating.hist(figsize=(10,6), bins=20);
In [291]:
_, axes = plt.subplots(1, 2, sharey=True, figsize=(15, 6))

dfRead.AverageRating.plot(kind='density', ax=axes[0]);
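# distplot is deprecated since seaborn 0.11; histplot with kde=True draws the same kind of figure
sns.histplot(dfRead.AverageRating, kde=True, stat='density', ax=axes[1]);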
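
A density plot draws a kernel density estimate: one small Gaussian bump per observation, averaged into a smooth curve. The sketch below computes such an estimate by hand with NumPy. The helper gaussian_kde_1d is a hypothetical name, the bandwidth of 0.2 is an arbitrary assumption, and the sample is just the handful of AverageRating values visible in the head() outputs above, so the curve will not match the plots drawn from the full data.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on the points in grid."""
    samples = np.asarray(samples, dtype=float)
    # Standardised distance between every grid point and every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average the bumps and rescale so the curve integrates to 1.
    return kernels.mean(axis=1) / bandwidth

# AverageRating values visible in the notebook's head() outputs (illustrative subset).
ratings = np.array([4.60, 3.75, 4.12, 4.23, 3.50, 4.48, 3.35, 3.35])

grid = np.linspace(2.0, 6.0, 801)
density = gaussian_kde_1d(ratings, grid, bandwidth=0.2)  # bandwidth is an assumed value

# Riemann-sum check that the estimate behaves like a probability density.
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve and a larger one a smoother curve; pandas and seaborn choose a bandwidth automatically, so their output will differ from this fixed-bandwidth sketch.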

3. Bar Plots

Number of books read per year

In [271]:
# Sum only the numeric columns; text columns such as Title cannot be meaningfully summed
df_2 = dfRead.groupby('YearRead').sum(numeric_only=True)
df_2.reset_index(inplace=True)
# Convert float columns to integers
df_2 = df_2.astype({'YearRead': 'int32', 'Pages': 'int32', 'ReadCount': 'int32'})
df_2 = df_2[:-1]  # Drop the last row of the dataframe, which holds the 2019 data
splot = sns.barplot(x='YearRead', y='ReadCount', data=df_2);
for p in splot.patches:
    splot.annotate(format(p.get_height()), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', xytext = (6, 5), 
                   textcoords = 'offset points')

Number of book pages read per year

In [272]:
# Sum only the numeric columns; text columns such as Title cannot be meaningfully summed
df_2 = dfRead.groupby('YearRead').sum(numeric_only=True)
df_2.reset_index(inplace=True)
# Convert float columns to integers
df_2 = df_2.astype({'YearRead': 'int32', 
                    'Pages': 'int32', 
                    'ReadCount': 'int32'})
df_2 = df_2[:-1]  # Drop the last row of the dataframe, which holds the 2019 data
splot = sns.barplot(x='YearRead', y='Pages', data=df_2);
for p in splot.patches:
    splot.annotate(format(p.get_height()), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', xytext = (6, 5), 
                   textcoords = 'offset points')
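
Both bar charts above rest on the same per-year aggregation. The sketch below shows that step in isolation using named aggregation on a toy frame with hypothetical values; the output column names books, reads, and pages are illustrative and not part of the notebook. It also makes one subtlety visible: summing ReadCount counts re-reads, whereas counting titles counts distinct rows.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the notebook's.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "YearRead": [2016.0, 2016.0, 2017.0, 2017.0],
    "Pages": [100.0, 250.0, 300.0, 50.0],
    "ReadCount": [1, 2, 1, 1],
})

per_year = (
    toy.groupby("YearRead")
       .agg(
           books=("Title", "count"),      # number of rows (books) per year
           reads=("ReadCount", "sum"),    # total reads, re-reads included
           pages=("Pages", "sum"),        # total pages per year
       )
       .reset_index()
       .astype({"YearRead": "int32", "pages": "int32"})
)
print(per_year)
```

Naming each aggregate explicitly keeps text columns out of the sum and documents which quantity a chart is actually plotting.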

4. Box, Swarm, and Violin Plots

Boxplot showing number of books I read in a year (2014-2018)

In [286]:
sns.boxplot(x='ReadCount', data=df_2);
In [275]:
df_2['ReadCount'].describe()
Out[275]:
count     5.000000
mean     28.400000
std       8.905055
min      15.000000
25%      24.000000
50%      32.000000
75%      34.000000
max      37.000000
Name: ReadCount, dtype: float64
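
With only five observations, the quartiles reported by describe() coincide with the sorted values themselves, so the five yearly totals can be read off the summary as 15, 24, 32, 34, and 37 (sorted by size; which year each belongs to is not recoverable from the summary). The sketch below recomputes the pieces a box plot draws from those values, assuming the common 1.5 x IQR whisker rule that Matplotlib and seaborn use by default.

```python
import pandas as pd

# Yearly totals reconstructed from the describe() output, sorted by value (not by year).
counts = pd.Series([15, 24, 32, 34, 37], name="ReadCount")

# The box spans the first to third quartile, with a line at the median.
q1, median, q3 = counts.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences would be drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = counts[(counts < lower_fence) | (counts > upper_fence)]

print(q1, median, q3, iqr, lower_fence, upper_fence, len(outliers))
```

Since every value lies inside the fences, the whiskers extend to the minimum and maximum and no outlier markers are expected.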

Box, Swarm, and Violin plots showing the number of books I read in a year (2014-2018)

In [285]:
_, axes = plt.subplots(1, 3, sharex = True, sharey=True, figsize=(8, 5))
sns.boxplot(data=df_2['ReadCount'], ax=axes[0]);
sns.swarmplot(data=df_2['ReadCount'], ax=axes[1], color=".25", size=8);
sns.violinplot(data=df_2['ReadCount'], ax=axes[2]);

5. Scatter Plot

Scatter plot of the average book rating against the number of pages

In [277]:
plt.scatter(dfRead['AverageRating'], dfRead['Pages']);
plt.xlabel('Average Rating')
plt.ylabel('Number of Pages in a Book')
plt.title('Scatter Plot of Average Book Rating and Page Numbers');
In [278]:
# Recent seaborn releases require x and y to be passed as keyword arguments
sns.lmplot(x='AverageRating', y='Pages', 
              data=df, hue='ReadCount', fit_reg=False);
plt.xlabel('Average Rating')
plt.ylabel('Number of Pages in a Book')
plt.title('Scatter Plot of Average Book Rating and Page Numbers');
In [280]:
sns.jointplot(x='AverageRating', y='Pages', 
              data=dfRead, kind='scatter');
In [281]:
sns.jointplot(x='AverageRating', y='Pages', 
              data=dfRead, kind='kde');

6. Correlation Matrix

In [282]:
correlation = ['MyRating', 'AverageRating', 'Pages', 
               'PublicationYear', 'ReadCount']
In [283]:
# Calculate and plot
corr_matrix = dfRead[correlation].corr()
sns.heatmap(corr_matrix);
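
Each cell of the heatmap is a Pearson correlation coefficient, which is what DataFrame.corr() computes by default. The sketch below works one coefficient out by hand and compares it with pandas, using the AverageRating and Pages values from the first five rows shown in df.head() above. Five rows are far too few to say anything about the real relationship; the point is only to show the calculation.

```python
import numpy as np
import pandas as pd

# First five rows of df.head() shown earlier (illustrative subset only).
toy = pd.DataFrame({
    "AverageRating": [4.60, 3.75, 4.12, 4.23, 3.50],
    "Pages": [426.0, 198.0, 358.0, 64.0, 150.0],
})

x = toy["AverageRating"].to_numpy()
y = toy["Pages"].to_numpy()

# Pearson r: covariance of x and y divided by the product of their spreads.
dx = x - x.mean()
dy = y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# pandas computes the same quantity for every pair of numeric columns.
r_pandas = toy.corr().loc["AverageRating", "Pages"]

print(r_manual, r_pandas)
```

On the real data, columns such as Pages and PublicationYear contain missing values; corr() excludes those pairwise, so different cells of the matrix may be based on different numbers of books.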

7. Scatter Plot Matrix

In [284]:
sns.pairplot(df[['Pages','PublicationYear', 'AverageRating']]);