This lesson is in the early stages of development (Alpha version)

# Data Aggregation using Pandas

## Overview

Teaching: 20 min
Exercises: 10 min
Questions
• How can I summarise the data in a data frame?

Objectives
• Access and summarize data stored in a Data Frame

• Perform basic mathematical operations and summary statistics on data in a Pandas Data Frame

• Understand missing data

• Changing to and from ‘NaN’ values

## Using Pandas functions to summarise data in a Data Frame

For variables which contain numerical values we are often interested in various statistical measures relating to those values. For categorical variable we are often interested in the how many of each unique values are present in the dataset.

We shall use the SAFI_results.csv dataset to demonstrate how we can obtain these pieces of information

``````import pandas as pd
df_SAFI
``````

For numeric variables we can obtain a variety of basic statistical information by using the `describe()` method.

``````df_SAFI.describe()
``````

This can be done for the Dataframe as a whole, in which case some of the results might have no sensible meaning. If there are any missing values, represented in the display as `NaN` you will get a warning message.

You can also `.describe()` on a single variable basis.

``````df_SAFI['B_no_membrs'].describe()
``````

There are also a set of methods which allow us to obtain individual values.

``````print(df_SAFI['B_no_membrs'].min())
print(df_SAFI['B_no_membrs'].max())
print(df_SAFI['B_no_membrs'].mean())
print(df_SAFI['B_no_membrs'].std())
print(df_SAFI['B_no_membrs'].count())
print(df_SAFI['B_no_membrs'].sum())
``````
``````2
19
7.190839694656488
3.1722704895263734
131
942
``````

Unlike the `describe()` method which converts the variable to a float (when it was originally an integer), the individual summary methods only does so for the returned result if needed.

We can do the same thing for the `E19_period_use` variable

``````print(df_SAFI['E19_period_use'].min())
print(df_SAFI['E19_period_use'].max())
print(df_SAFI['E19_period_use'].mean())
print(df_SAFI['E19_period_use'].std())
print(df_SAFI['E19_period_use'].count())
print(df_SAFI['E19_period_use'].sum())
``````
``````1.0
45.0
12.043478260869565
8.583030848015385
92
1108.0
``````

## Exercise

Compare the count values returned for the `B_no_membrs` and the `E19_period_use` variables.

1. Why do you think they are different?
2. How does this affect the calculation of the mean values?

## Solution

1. We know from when we originally displayed the contents of the `df_SAFI` Dataframe that there are 131 rows in it. This matches the value for the `B_no_membrs` count. The count for `E19_period_use` however is only 92. If you look at the values in the `E19_period_use` column using
``````df_SAFI['E19_period_use']
``````

you will see that there are several `NaN` values. They also occurred when we used `describe()` on the full Dataframe. `NaN` stands for Not a Number, ie. the value is missing. There are only 92 non-missing values and this is what is reported by the `count()` method. This value is also used in the calculation of the mean and std values.

## Dealing with missing values

We can find out how many variables in our Dataframe contains any `NaN` values with the code

``````df_SAFI.isnull().sum()
``````
``````Column1                             0
A01_interview_date                  0
A03_quest_no                        0
A04_start                           0
...
``````

or for a specific variable

``````df_SAFI['E19_period_use'].isnull().sum()
``````
``````39
``````

Data from most sources has the potential to include missing data. Whether or not this presents a problem at all depends on what you are planning to do.

We have been using data from two very different sources.

The SN7577 dataset is provided by the UK Data Service. Datasets from the UK data Service, have already been ‘cleaned’ and it is unlikely that there will be any genuinely missing data. However you may find that data which was missing has been replaced with a value such as ‘-1’ or ‘Not Specified’. In cases like these it may be appropriate to replace these values with ‘NaN’ before you try to process the data further.

The SAFI dataset we have been using comes from a project called ‘Studying African Farmer-led Irrigation’. The data for this project is questionnaire based, but rather than using a paper-based questionnaire, it has been created and is completed electronically via an app on a smartphone. This provides flexibility in the design and presentation of the questionnaire; a section of the questionnaire may only be presented depending on the answer given to some preceding question. This means that there can quite legitimately be a set of ‘NaN’ values in a record (one complete questionnaire) where you would still consider the record to be complete.

We have already seen how we can check for missing values. There are three other actions we need to be able to do:

1. Remove complete rows which contain `NaN`
2. Replace `NaN` with a value of our choice
3. Replace specific values with `NaN`

With these options we can ensure that the data is suitable for the further processing we have planned.

### Completely remove rows with NaNs

The `dropna()` method will delete all rows if any of the variables contain an `NaN`. For some datasets this may be acceptable. You will need to take care that you have enough rows left for your analysis to have meaning.

``````df_SAFI = pd.read_csv("SAFI_results.csv")
print(df_SAFI.shape)
df_SAFI.dropna(inplace=True)
print(df_SAFI.shape)
``````
``````(131, 55)
(0, 55)
``````

Because there are variables in the SAFI dataset which are all `NaN` using the `dropna()` method effectively deletes all of the rows from the Dataframe, probably not what you wanted. Instead we can use the `notnull()` method as a row selection criteria and delete the rows where a specific variable has `NaN` values.

``````df_SAFI = pd.read_csv("SAFI_results.csv")
print(df_SAFI.shape)
df_SAFI = df_SAFI[(df_SAFI['E_no_group_count'].notnull())]
print(df_SAFI.shape)
``````
``````(131, 55)
(39, 55)
``````

### Replace NaN with a value of our choice

The `E19_period_use` variable answers the question: “For how many years have you been irrigating the land?”. In some cases the land is not irrigated and these are represented by NaN in the dataset. So when we run

``````df_SAFI['E19_period_use'].describe()
``````

we get a count value of 92 and all of the other statistics are based on this count value.

Now supposing that instead of NaN the interviewer entered a value of 0 to indicate the land which is not irrigated has been irrigated for 0 years, technically correct.

To see what happens we can convert all of the NaN values in the `E19_period_use` column to 0 with the following code:

``````df_SAFI['E19_period_use'].fillna(0, inplace=True)
``````

If we now run the `describe()` again you can see that all of the statistic have been changed because the calculations are NOW based on a count of 131. Probably not what we would have wanted.

Conveniently this allows us to demonstrate our 3rd action.

### Replace specific values with NaN

Although we can recognise `NaN` with methods like `isnull()` or `dropna()` actually creating a `NaN` value and putting it into a Dataframe, requires the `numpy` module. The following code will replace our 0 values with `NaN`. We can demonstrate that this has occurred by running the `describe()` again and see that we now have our original values back.

``````import numpy as np
df_SAFI['E19_period_use'].replace(0, np.NaN, inplace = True)
df_SAFI['E19_period_use'].describe()
``````

## Categorical variables

For categorical variables, numerical statistics don’t make any sense. For a categorical variable we can obtain a list of unique values used by the variable by using the `unique()` method.

``````df_SAFI = pd.read_csv("SAFI_results.csv")
pd.unique(df_SAFI['C01_respondent_roof_type'])
``````
``````array(['grass', 'mabatisloping', 'mabatipitched'], dtype=object)
``````

Knowing all of the unique values is useful but what is more useful is knowing how many occurrences of each there are. In order to do this we can use the `groupby` method.

Having performed the `groupby()` we can them `describe()` the results. The format is similar to that which we have seen before except that the ‘grouped by’ variable appears to the left and there is a set of statistics for each unique value of the variable.

``````grouped_data = df_SAFI.groupby('C01_respondent_roof_type')
grouped_data.describe()
``````

You can group by more than one variable at a time by providing them as a list.

``````grouped_data = df_SAFI.groupby(['C01_respondent_roof_type', 'C02_respondent_wall_type'])
grouped_data.describe()
``````

You can also obtain individual statistics if you want.

``````A11_years_farm = df_SAFI.groupby(['C01_respondent_roof_type', 'C02_respondent_wall_type'])['A11_years_farm'].count()
A11_years_farm
``````
``````C01_respondent_roof_type  C02_respondent_wall_type
grass                     burntbricks                 22
muddaub                     42
sunbricks                    9
mabatipitched             burntbricks                  6
muddaub                      3
...
``````

## Exercise

1. Read in the SAFI_results.csv dataset.
2. Get a list of the different `C01_respondent_roof_type` values.
3. Groupby `C01_respondent_roof_type` and describe the results.
4. Remove rows with NULL values for `E_no_group_count`.
5. repeat steps 2 & 3 and compare the results.

## Solution

``````# Steps 1 and 2
import numpy as np
print(df_SAFI.shape)
print(pd.unique(df_SAFI['C01_respondent_roof_type']))
``````
``````# Step 3
grouped_data = df_SAFI.groupby('C01_respondent_roof_type')
grouped_data.describe()
``````
``````# steps 4 and 5
df_SAFI = df_SAFI[(df_SAFI['E_no_group_count'].notnull())]
grouped_data = df_SAFI.groupby('C01_respondent_roof_type')
print(df_SAFI.shape)
print(pd.unique(df_SAFI['C01_respondent_roof_type']))
grouped_data.describe()
``````

`E_no_group_count` is related to whether or not farm plots are irrigated or not. It has no obvious connection to farm buildings. By restricting the data to non-irrigated plots we have accidentally? removed one of the roof_types completely.

## Key Points

• Summarising numerical and categorical variables is a very common requirement

• Missing data can interfere with how statistical summaries are calculated

• Missing data can be replaced or created depending on requirement

• Summarising or aggregation can be done over single or multiple variables at the same time