Last updated on 2023-05-02 | Edit this page
- How can I summarise the data in my tables
- Use the Distinct keyword to get a unique set of values
- Use the ‘group by’ clause to summarise data
- Use built-in statistical functions to provide column summaries
- Use the ‘having’ clause to provide selection criteria to the summary values
- Understand the difference between the ‘where’ and the ‘having’ clauses
Aggregate functions are used perform some kind of mathematical or statistical calculation across a group of rows. The rows in each group are determined by the different values in a specified column or columns. Alternatively you can aggregate across the entire table.
If we wanted to know the minimum, average and maximum values of the ‘A11_years_farm’ column across the whole Farms table, we could write a query such as this;
This sort of query provides us with a general view of the values for a particular column or field across the whole table.
avg are built-in
aggregate functions in SQLite (and any other SQL database system). There
are other such functions available. A complete list can be found in the
SQLite documentation here.
It is more likely that we would want to find such values for a range, or multiple ranges of rows where each range is determined by the values of some other column in the table. Before we do this we will look at how we can find out what different values are contained in a given column.
For the SAFI survey, it was known in advance all of the possible values that certain variables of columns could contain. For example the ‘A06_province’, ‘A07_district’, ‘A08_ward’ and ‘A09_village’ variables could only ever contain a few specific values.
As the SAFI survey was delivered via an Android phone app. It was possible to create the app so that the possible values could be selected from a dropdown list, eliminating any possibility of typing errors. For the ‘A06_province’ there were only three possibilities, but by the time we get down to ‘A09_villages’, a far more specific geography, it would not have been possible to anticipate in advance all of the possible values (village names) and so the values for this field were manually typed in.
To obtain a list of unique values in a particular column we can use
Using the Farms table we will obtain a list of all of the different values of the ‘A06_province’ column contained in the table.
SELECT DISTINCT A06_province
We can see from the results of running this that all 3 values are represented and that there is no missing data in this field.
However if we run a similar query for ‘A09_village’
SELECT DISTINCT A09_village
The problem with allowing free-form text quite obvious. Having two villages, one called ‘Massequece’ and the other called ‘Massequese’ is unlikely.
Detecting this type of problem in a large dataset can be very difficult if you are just ‘eyeballing’ the content. This small SQL query makes it very clear, and in the OpenRefine lesson we provide approaches to detecting and correcting such errors. SQL is not the best tool for correcting this type of error.
You can have more than one column name after the
DISTINCT keyword. In which case the results will include a
row for each unique combination of the columns
Write a query that will return all of the different combinations of the ‘A06_province’, ‘A07_district’, ‘A08_ward’ and ‘A09_village’ columns in the Farms table.
When looking at the results, you may have noticed that they are not in any sorted order. Re-write the query so that the values of the four columns are returned in alphabetical order.
SELECT DISTINCT A06_province, A07_district, A08_ward, A09_village
ORDER BY A06_province, A07_district, A08_ward, A09_village;
Just knowing the combinations is of limited use. You really want to
know How many of each of the values there are. To do
this we use the
GROUP BY clause.
count(*) AS How_many
GROUP BY A08_ward;
This query tells us how many records in the table have each different value in the ‘A08_ward’ column.
In the first example of this episode, three aggregations were
performed over the single column ‘A11_years_farm’. In addition to
calculating multiple aggregation values over a single column, it is also
possible to aggregate over multiple columns by specifying them in all in
SELECT clause and the
GROUP BY clause.
The grouping will take place based on the order of the columns listed
GROUP BY clause. There will be one row returned for
each unique combination of the columns mentioned in the
GROUP BY clause
What is not allowed is specifying a non-aggregated column in the select clause which is not mentioned in the group by clause.
A09_village,count(*) AS How_many
GROUP BY A06_province, A07_district, A08_ward, A09_village
min(A11_years_farm) AS min,
max(A11_years_farm) AS max,
avg(A11_years_farm) AS avg,
count(*) AS how_many
WHERE A07_district = 'Nhamatanda'
GROUP BY A09_village;
Notice that you can use the ‘A07_district’ column in the
WHERE clause but it doesn’t have to appear in the
In order to filter the rows returned in a non-aggregated query we
WHERE clause. For an aggregated query the
equivalent is the
You use the
HAVING clause by providing it with a filter
expression which references one or more of the aggregated columns.
HAVING clause you can use the column alias to refer
to the aggregated column.
min(A11_years_farm) AS min_years,
max(A11_years_farm) AS max_years,
count(*) AS how_many_farms
GROUP BY A08_ward
HAVING how_many_farms > 2;
In this example we want to remove the wards which only have one or two farms.
count(*) AS how_many
GROUP BY D_curr_crop
HAVING how_many > 100