Pandas

Author: Olatomiwa Bifarin.

Read as draft

1 | What is Pandas?

Pandas is an open source python libraries that is used for data handling and manipulation. it was developed to work with the Numpy library. And it is helpful to think of pandas as a building block for data wrangling and analysis in python. So, we are talking about a very essential tool here.

If you don't have pandas installed on your system, you can install the library using the terminal via the following command line:

conda install pandas

Ananconda will take care of the rest.

Installed Pandas Version

In [44]:
pd.__version__
Out[44]:
'0.23.4'

2 | Introducing Series and Dataframes

The main data structures are Series and Dataframe. As the name suggests, series is a one dimensional data structure (with an indexed array): you have an index, you have a value. While dataframe is multidimensional - you can think of dataframe as having indexes with many instances.

Let's take a look at these things more closely.

2.1 | Series

In [45]:
# import numpy and pandas libraries
import numpy as np
import pandas as pd

Declare a series

A series of age

In [46]:
age = pd.Series([31, 26, 37, 49, 57])
age
Out[46]:
0    31
1    26
2    37
3    49
4    57
dtype: int64

There you go! and then you see how pandas, by default it assign an index to these values. However, let's see if we can do that ourselves, we don't really need pandas's help. Afterall, we need names of actual people to match this age.

In [47]:
age = pd.Series([31, 26, 37, 49, 57], 
             index=['tobi', 'ife', 'ayo', 'olori', 'ope'])
age
Out[47]:
tobi     31
ife      26
ayo      37
olori    49
ope      57
dtype: int64

Simple.

Select values

Now let's see if I can programmatically select, say ayo's age. In other words, let's select an internal element.

In [48]:
age['ayo']
Out[48]:
37

Olori's age

In [49]:
age['olori']
Out[49]:
49

And we can even do both at the same time.

In [50]:
age[['ayo','olori']]
Out[50]:
ayo      37
olori    49
dtype: int64

Change values

Let's say I just suddenly realize now that we got Olori's age wrong, she is actually 22. There is a way we assign values to the element without declaring the entire series again.

In [51]:
age['olori'] = 22
age
Out[51]:
tobi     31
ife      26
ayo      37
olori    22
ope      57
dtype: int64

Filter Series

Now, I want the series of those greater than the age of 30. That one too is possible.

In [52]:
age[age > 30]
Out[52]:
tobi    31
ayo     37
ope     57
dtype: int64

Mathematics on Series

In [53]:
age * 2
Out[53]:
tobi      62
ife       52
ayo       74
olori     44
ope      114
dtype: int64
In [54]:
age ** 2
Out[54]:
tobi      961
ife       676
ayo      1369
olori     484
ope      3249
dtype: int64

and so on

2.2 | Dataframes

If you have ever seen a spreadsheet filled with data before, then you know what a dataframe is. That is a dataframe - an extended form of series.

And I can show you.

Declaring Dataframes

This can be done by passing dictionaries into DataFrame( ).

In [55]:
# our dictionary: age_updated
age_updated = {'name' : ['tobi', 'ife', 'ayo', 'olori', 'ope'],
            'age' : [31, 26, 37, 49, 57], 
            'height(cm)' : [1.6, 1.8, 0.72, 1.72, 1.9], 
            'BMI': [31.1, 28.6, 25.7, 25.9, 25]}

dfage = pd.DataFrame(age_updated)
dfage
Out[55]:
name age height(cm) BMI
0 tobi 31 1.60 31.1
1 ife 26 1.80 28.6
2 ayo 37 0.72 25.7
3 olori 49 1.72 25.9
4 ope 57 1.90 25.0

Selecting specific Columns in Dataframes

Now, I want to see only the name, height, and the associated BMI in a new dataframe. In other words, I want to select specific columns from our dfage dataframe.

In [56]:
newframe = pd.DataFrame(dfage, columns=['name','height(cm)','BMI'])
newframe
Out[56]:
name height(cm) BMI
0 tobi 1.60 31.1
1 ife 1.80 28.6
2 ayo 0.72 25.7
3 olori 1.72 25.9
4 ope 1.90 25.0

3 | Main Dataframe functionalities

Now, to delve deeper into the main dataframe functionalities, let's take, hopefully, a more exicting dataset.

In [57]:
# import seaborn library
import seaborn as sns
sns.get_dataset_names(); # to check for the names of dataset available on seaborn. 

I like how this exercise data set looks, so let's use it. And we can use the head( ) method to peep at the top-most section of the dataframe. We can do the same for the 'tail' too.

In [58]:
exercise = sns.load_dataset('exercise');
exercise.head()
Out[58]:
Unnamed: 0 id diet pulse time kind
0 0 1 low fat 85 1 min rest
1 1 1 low fat 85 15 min rest
2 2 1 low fat 88 30 min rest
3 3 2 low fat 90 1 min rest
4 4 2 low fat 92 15 min rest
In [59]:
exercise.tail()
Out[59]:
Unnamed: 0 id diet pulse time kind
85 85 29 no fat 135 15 min running
86 86 29 no fat 130 30 min running
87 87 30 no fat 99 1 min running
88 88 30 no fat 111 15 min running
89 89 30 no fat 150 30 min running

3.1 Columns

What are the columns in my (exercise) dataframe? Yes, we can do that too.

In [60]:
exercise.columns
Out[60]:
Index(['Unnamed: 0', 'id', 'diet', 'pulse', 'time', 'kind'], dtype='object')

3.2 Dataframe dimensions

In [61]:
exercise.shape
Out[61]:
(90, 6)

The output states it clearly 90 rows, 6 columns.

3.3 Dataframe Information and Statistics

In [62]:
exercise.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 6 columns):
Unnamed: 0    90 non-null int64
id            90 non-null int64
diet          90 non-null category
pulse         90 non-null int64
time          90 non-null category
kind          90 non-null category
dtypes: category(3), int64(3)
memory usage: 2.7 KB

category and int64 are different data types. The technical name of pandas dataframe is 'pandas.core.frame.DataFrame'. The total number of data columns is indicated: 6. The range index, and even the memory usage.

In [63]:
exercise.describe()
Out[63]:
Unnamed: 0 id pulse
count 90.000000 90.000000 90.000000
mean 44.500000 15.500000 99.700000
std 26.124701 8.703932 14.858471
min 0.000000 1.000000 80.000000
25% 22.250000 8.000000 90.250000
50% 44.500000 15.500000 96.000000
75% 66.750000 23.000000 103.000000
max 89.000000 30.000000 150.000000

The describe( ) method gives you all basic statistically information: counts, mean, standard deviation, minimum and maximum values, the 0.25 and 0.75 quartiles. If you also have missing values, it lets you know.

Now notice that the category data types are misssing in this data analysis. This is because by default the describe( ) method only takes into consideration the numerical data types.

Now let's see if we can get it to consider our categorical datasets, all we need is just the include parameter.

In [64]:
exercise.describe(include=['category'])
Out[64]:
diet time kind
count 90 90 90
unique 2 3 3
top low fat 30 min running
freq 45 30 30

3.4 Dropping Columns

If you take a look at our exercise dataframe, the first column. I don't like that column, it pisses me off, so let's see if we can drop it off. And I will give my new dataframe a new name - df.

In [65]:
df = exercise.drop(['Unnamed: 0'], axis=1)
df.head()
Out[65]:
id diet pulse time kind
0 1 low fat 85 1 min rest
1 1 low fat 85 15 min rest
2 1 low fat 88 30 min rest
3 2 low fat 90 1 min rest
4 2 low fat 92 15 min rest

3.5 Value Counts

In [66]:
df['diet'].value_counts()
Out[66]:
low fat    45
no fat     45
Name: diet, dtype: int64
In [67]:
df['diet'].value_counts(normalize=True)
Out[67]:
low fat    0.5
no fat     0.5
Name: diet, dtype: float64

3.6 Sorting Columns

Let's call our df dataframe again - just the 'head'.

In [68]:
df.head(3)
Out[68]:
id diet pulse time kind
0 1 low fat 85 1 min rest
1 1 low fat 85 15 min rest
2 1 low fat 88 30 min rest

Let's take pulse as an example, let's say I want to sort my entire dataframe based on the pulse. This is how you do it:

In [69]:
df.sort_values(by='pulse', ascending=True).head(10)
Out[69]:
id diet pulse time kind
9 4 low fat 80 1 min rest
10 4 low fat 82 15 min rest
11 4 low fat 83 30 min rest
16 6 no fat 83 15 min rest
15 6 no fat 83 1 min rest
45 16 no fat 84 1 min walking
32 11 low fat 84 30 min walking
17 6 no fat 84 30 min rest
0 1 low fat 85 1 min rest
1 1 low fat 85 15 min rest

This could be done on multiple columns:

In [70]:
df.sort_values(by=['pulse', 'time'],
               ascending=[False, True]).head()
Out[70]:
id diet pulse time kind
89 30 no fat 150 30 min running
77 26 no fat 143 30 min running
80 27 no fat 140 30 min running
83 28 no fat 140 30 min running
85 29 no fat 135 15 min running

3.7 Indexing and Retrieval

To index and retrieve the first and the last rows of our dataframe, we can use the following df [:1] and df [-1:] respectively

In [71]:
df[:1]
Out[71]:
id diet pulse time kind
0 1 low fat 85 1 min rest
In [72]:
df[-1:]
Out[72]:
id diet pulse time kind
89 30 no fat 150 30 min running

To index walking in the kind column, and retrieve just that dataframe, we can do the following:

In [73]:
df[df['kind'] == 'walking'].head(6)
Out[73]:
id diet pulse time kind
30 11 low fat 86 1 min walking
31 11 low fat 86 15 min walking
32 11 low fat 84 30 min walking
33 12 low fat 93 1 min walking
34 12 low fat 103 15 min walking
35 12 low fat 104 30 min walking

Note the use of boolean indexing here:

In [74]:
df['kind'] == 'walking';

Other ways for indexing include: loc and the iloc methods. loc index by column names, while iloc index by numbers. Let's see how it works. And notice how the first number of the range is inclusive the the last one is not!

In [75]:
df.iloc[30:33, 1:4]
Out[75]:
diet pulse time
30 low fat 86 1 min
31 low fat 86 15 min
32 low fat 84 30 min
In [76]:
df.loc[30:33, 'diet':'time']
Out[76]:
diet pulse time
30 low fat 86 1 min
31 low fat 86 15 min
32 low fat 84 30 min
33 low fat 93 1 min

3.8 Renaming Columns

In [77]:
df.columns
Out[77]:
Index(['id', 'diet', 'pulse', 'time', 'kind'], dtype='object')
In [78]:
df = df.rename({'id':'No', 'diet':'food'}, axis=1)
df.columns
Out[78]:
Index(['No', 'food', 'pulse', 'time', 'kind'], dtype='object')

To add prefix or suffix to name

In [82]:
df.add_prefix('X').head(1)
Out[82]:
XNo Xfood Xpulse Xtime Xkind
0 1 low fat 85 1 min rest
In [84]:
df.add_suffix('Y').head(1)
Out[84]:
NoY foodY pulseY timeY kindY
0 1 low fat 85 1 min rest

Now let's change things back.

In [86]:
df = df.rename({'No':'id', 'food':'diet'}, axis=1)
df.columns
Out[86]:
Index(['id', 'diet', 'pulse', 'time', 'kind'], dtype='object')
In [ ]: