Exploring Data with OpenRefine
Overview
Teaching: 20 min
Exercises: 15 minQuestions
How can we import our data into OpenRefine?
How can we summarise our data?
How can we find errors in our data?
How can we edit data to fix errors?
How can we convert column data from one data type to another?
Objectives
Create a new OpenRefine project from a CSV file
Learn about different types of facets and how they can be used to summarise data of different data types
Lesson
Creating a Project
Start OpenRefine, which will open in your browser (at the address http://127.0.0.0:3333
). Once OpenRefine is launched in your
browser, the left margin has options to Create Project
, Open Project
, or Import Project
. Here we will create a new
project and import our portal rodents data.
- Click
Create Project
from the left margin and select thenThis Computer
(because you’re uploading data from your computer). - Click
Choose Files
and browse to where you stored the filePortal_rodents_19772002_scinameUUIDs.csv
. Select the file and clickOpen
, or just double-click on the filename. - Click
Next>>
under the browse button to upload the data into OpenRefine. -
On the next screen, OpenRefine will present you with a preview of your data. You can check here for obvious errors, if, for example, your file was tab-delimited rather than comma-delimited, the preview would look strange (and you could correct it by choosing the correct separator and clicking the
Update Preview
button on the right. If you selected the wrong file, click<<Start Over
at the top left. - In the middle of the page, will be a set of options (
Character encoding
, etc.). Make sure the tick box next toTrim leading & trailing whitespace from strings
is not ticked. (We’re going to need the leading whitespace in one of our examples.) - If all looks well, click
Create Project>>
in the top right. You will be presented with a view onto your data. This is OpenRefine!
OpenRefine does not modify your original dataset
Once your data is imported into a project - OpenRefine leaves your raw data intact and works on a copy which it creates inside the newly created project. All the data transformation and cleaning steps you apply will be performed on this copy and you can easily undo any changes too.
Data file types supported
OpenRefine can import a variety of different file types, including tab separated (
tsv
), comma separated (csv
), Excel (xls
,xlsx
), JSON, XML, RDF as XML, and Google Spreadsheets. See the Create a project by importing data page for more information.
Let’s now start exploring and getting a higher-level overview of our data - summarising and looking for potential outliers and errors.
Data Faceting
Facets are one of the most useful features of OpenRefine. Data faceting is a process of exploring data by applying multiple filters to investigate its composition. It also allows you to identify a subset of data that you wish to change in bulk.
OpenRefine Wiki: Faceting
Full documentation on faceting can be found at Exploring facets: Faceting
A facet
groups all the like values that appear in a column, and allows you to filter the data by those values. It also allows you to edit values across many records at the same time.
One type of facet is called a ‘Text facet’. This groups all the identical text values in a column and lists each value with the number of records it appears in. The facet information always appears in the left hand panel in the OpenRefine interface.
Here we will use faceting to look for potential errors in data entry in the scientificName
column.
- Scroll over to the
scientificName
column. -
Click the down arrow and choose
Facet
>Text facet
. -
In the left panel, you’ll now see a box containing every unique value in the
scientificName
column along with a number representing how many times that value occurs in the column. - Try sorting this facet by name and by count. Do you notice any problems with the data? What are they?
- Hover the mouse over one of the names in the
facet
list. You should see that you have anedit
function available. - You could use this to fix an error immediately, and OpenRefine will ask whether you want to make the same correction to every value it finds like that one. But OpenRefine offers even better ways to find and fix these errors, which we’ll use instead. We’ll learn about these when we talk about clustering.
There will be several near-identical entries in scientificName
. For example, there is one entry for Ammospermophilis harrisi
and
one entry for Ammospermophilus harrisii
. These are both misspellings of Ammospermophilus harrisi
. We will see how to correct these
misspelled and mistyped entries in a later exercise.
More on Facets
OpenRefine Documentation: Exploring Facets
As well as ‘Text facets’ OpenRefine also supports a range of other types of facet. These include:
- Numeric facets
- Timeline facets (for dates)
- Custom facets
- Scatterplot facets
Numeric and Scatterplot facets display graphs instead of lists of values. The numeric facet graph includes ‘drag and drop’ controls you can use to set a start and end range to filter the data displayed. These facets are explored further in Examining Numbers in OpenRefine
Custom facets are a range of different types of facets. Some of the default custom facets are:
- Word facet - this breaks down text into words and counts the number of records each word appears in
- Duplicates facet - this results in a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if the value in the selected column is an exact match for a value in the same column in another row
- Text length facet - creates a numeric facet based on the length (number of characters) of the text in each row for the selected column. This can be useful for spotting incorrect or unusual data in a field where specific lengths are expected (e.g. if the values are expected to be years, any row with a text length more than 4 for that column is likely to be incorrect)
- Facet by blank - a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if they have no data present in that column. This is useful when looking for rows missing key data.
Facets are intended to group together common values and OpenRefine limits the number of values allowed in a single facet to ensure the software does not perform slowly or run out of memory. If you create a facet where there are many unique values (for example, a facet on a ‘book title’ column in a data set that has one row per book) the facet created will be very large and may either slow down the application, or OpenRefine will not create the facet.
Exercise
Using faceting, find out how many years are represented in the census.
Which years have the most and least observations?
Is the column formatted as Number, Date, or Text? How does changing the format change the faceting display?
Solution
- For the column
yr
doFacet
>Text facet
. A box will appear in the left panel showing that there are 26 unique entries in this column.- After creating a facet, click
Sort by count
in the facet box. The year with the most observations is 1997. The least is 1977.- By default, the column
yr
is formatted as Text. You can change the format by doingEdit cells
>Common transforms
>To number
. DoingFacet
>Numeric facet
creates a box in the left panel that shows a histogram of the number of entries per year. Notice that the data is shown as a number, not a date. If you instead transform the column to a date, the program will assume all entries are on January 1st of the year.
Key Points
Faceting can identify errors or outliers in data