Data Cleaning with OpenRefine for Ecologists: All in One View

Content from Introduction

Last updated on 2024-05-02 | Edit this page

Overview

Questions

How is OpenRefine useful?

Objectives

Describe OpenRefine’s uses and applications.
Differentiate data cleaning from data organization.
Experiment with OpenRefine’s user interface.
Locate helpful resources to learn more about OpenRefine.

Lesson

Motivations for the OpenRefine Lesson

Data is often very messy, and this tool saves a lot of time on cleaning headaches.
Data cleaning steps often need repeating with multiple files. It is important to know what you did to your data. This makes it possible for you to repeat these steps again with similarly structured data. OpenRefine is perfect for speeding up repetitive tasks by replaying previous actions on multiple datasets.
Additionally, journals, granting agencies, and other institutions are requiring documentation of the steps you took when working with your data. With OpenRefine, you can capture all actions applied to your raw data and share them with your publication as supplemental material.
Any operation that changes the data in OpenRefine can be reversed or undone.
Some concepts such as clustering algorithms are quite complex, but with OpenRefine we can introduce them, use them, and show their power.

Note: You must export your modified dataset to a new file: OpenRefine does not save over the original source file. All changes are stored in the OpenRefine project.

Before we get started

The following setup is necessary before we can get started (see the instructions here.)

What is OpenRefine?

OpenRefine is a Java program that runs on your machine (not in the cloud): it is a desktop application that uses your web browser as a graphical interface. No internet connection is needed, and none of the data or commands you enter in OpenRefine are sent to a remote server.
OpenRefine does not modify your original dataset. All actions can be reversed in OpenRefine and you can capture all the actions applied to your data and share this documentation with your publication as supplemental material.
OpenRefine saves as you go. You can return to the project at any time to pick up where you left off or export your data to a new file.
OpenRefine can be used to standardise and clean data across your file.

It can also help you

Get an overview of a data set
Resolve inconsistencies in a data set
Help you split data up into more granular parts
Match local data up to other data sets
Enhance a data set with data from other sources
Save a set of data cleaning steps to replay on multiple files

OpenRefine is a powerful, free, and open source tool with a large growing community of practice. More help can be found at https://openrefine.org.

Features

Open source (source on GitHub).
A large growing community, from novice to expert, ready to help.

More Information on OpenRefine

You can find out a lot more about OpenRefine at the official user manual docs.openrefine.org. There is a user forum that can answer a lot of beginner questions and problems. Recipes, scripts, projects, and extensions are available to add functionality to OpenRefine. These can be copied into your OpenRefine instance to run on your dataset.

Key Points

OpenRefine is a powerful, free and open source tool that can be used for data cleaning.
OpenRefine will automatically track any steps you take in working with your data.

Content from Importing Data to OpenRefine

Last updated on 2024-05-02 | Edit this page

Overview

Questions

How can we import our data into OpenRefine?

Objectives

Create a new OpenRefine project from a CSV file

Importing data

What kinds of data files can I import?

There are several options for getting your data set into OpenRefine. You can import files in a variety of formats including:

Comma-separated values (CSV) or tab-separated values (TSV)
Text files
Fixed-width columns
JSON
XML
OpenDocument spreadsheet (ODS)
Excel spreadsheet (XLS or XLSX)
RDF data (JSON-LD, N3, N-Triples, Turtle, RDF/XML)
Wikitext

See the Create a project by importing data page in the OpenRefine manual for more information.

Create your first OpenRefine project (using provided data)

Start OpenRefine, which will open in your browser (at the address http://127.0.0.0:3333). Once OpenRefine is launched in your browser, the left margin has options to:

Create Project
Open Project
Import Project
Language Settings

Click Create Project from the left margin and select then This Computer (because you’re uploading data from your computer).
Click Choose Files and browse to where you stored the file Portal_rodents_19772002_simplified.csv. Select the file and click Open, or just double-click on the filename.

Click Next>> under the browse button to upload the data into OpenRefine.
On the next screen, OpenRefine will present you with a preview of your data. You can check here for obvious errors, if, for example, your file was tab-delimited rather than comma-delimited, the preview would look strange (and you could correct it by choosing the correct separator and clicking the Update Preview button on the right. If you selected the wrong file, click <<Start Over at the top left.
In the middle of the page, will be a set of options (Character encoding, etc.). Make sure the tick box next to Trim leading & trailing whitespace from strings is not ticked. (We’re going to need the leading whitespace in one of our examples.)

If all looks well, click Create Project>> in the top right. You will be presented with a view onto your data. This is OpenRefine!

The columns are all imported as text, even the columns with numbers. We will see how to format the numeric columns in the next episode.

OpenRefine does not modify your original dataset

Once your data is imported into a project - OpenRefine leaves your raw data intact and works on a copy which it creates inside the newly created project. All the data transformation and cleaning steps you apply will be performed on this copy and you can undo any changes too.

Key Points

Use the Create Project option to import data
You can control how data imports using options on the import screen
Several files types may be imported into OpenRefine

Content from Exploring Data with OpenRefine

Last updated on 2023-09-22 | Edit this page

Overview

Questions

How can we summarise our data?
How can we find errors in our data?
How can we edit data to fix errors?
How can we convert column data from one data type to another?

Objectives

Learn about different types of facets and how they can be used to summarise data of different data types

Facets are one of the most useful features of OpenRefine. Data faceting is a process of exploring data by applying multiple filters to investigate its composition. It also allows you to identify a subset of data that you wish to change in bulk.

A facet groups all the like values that appear in a column, and allows you to filter the data by those values. It also allows you to edit values across many records at the same time.

Exploring text columns

One type of facet is called a ‘Text facet’. This groups all the identical text values in a column and lists each value with the number of records it appears in. The facet information always appears in the left hand panel in the OpenRefine interface.

Here we will use faceting to look for potential errors in data entry in the scientificName column.

Scroll over to the scientificName column.
Click the down arrow and choose Facet > Text facet.

In the left panel, you’ll now see a box containing every unique value in the scientificName column along with a number representing how many times that value occurs in the column.

Faceting results on the column scientificName

Try sorting this facet by name and by count. Do you notice any problems with the data? What are they?
Hover the mouse over one of the names in the facet list. You should see that you have an edit function available.
You could use this to fix an error immediately, and OpenRefine will ask whether you want to make the same correction to every value it finds like that one. But OpenRefine offers even better ways to find and fix these errors, which we’ll use instead. We’ll learn about these when we talk about clustering.

Exercise

Using faceting, find out how many years are represented in the census.
Which years have the most and least observations?
Is the column formatted as Number, Date, or Text?

Show me the solution

For the column yr do Facet > Text facet. A box will appear in the left panel showing that there are 16 unique entries in this column.
After creating a facet, click Sort by count in the facet box. The year with the most observations is 1978. The least is 1993.
By default, the column yr is formatted as Text.

Exploring numeric columns

When a table is imported into OpenRefine, all columns are treated as having text values. We can transform columns to other data types (e.g. number or date) using the Edit cells > Common transforms feature. Here we will experiment changing columns to numbers and see what additional capabilities that grants us.

Sometimes there are non-number values or blanks in a column which may represent errors in data entry and we want to find them. We can do that with a Numeric facet.

Create a numeric facet for the column yr. The facet will be empty because OpenRefine sees all the values as text.

To transform cells in the yr column to numbers, click the down arrow for that column, then Edit cells > Common transforms… > To number. You will notice the yr values change from left-justified to right-justified, and black to green color.

Exercise

The dataset included other numeric columns that we will explore in this exercise:

period - Unique number assigned to each survey period
plot - Plot number animal was caught on, from 1 to 24
recordID - Unique record ID number to facilitate quick reference to particular entry

Transform the columns period, plot, and recordID from text to numbers.

How does changing the format change the faceting display for the yr column?
Can all columns be transformed to numbers?

Show me the solution

Displaying a Numeric facet of yr shows a histogram of the number of entries per year. Notice that the data is shown as a number, not a date. If you instead transform the column to a date, the program will assume all entries are on January 1st of the year.

Only observations that include only numerals (0-9) can be transformed to numbers. If you apply a number transformation to a column that doesn’t meet this criteria, and then click the Undo / Redo tab, you will see a step that starts with Text transform on 0 cells. This means that the data in that column was not transformed.

The next exercise will explore what happens when a numeric column contains values that are not numbers.

Exercise

For a column you transformed to numbers, edit one or two cells, replacing the numbers with text (such as abc) or blank (no number or text).
Use the pulldown menu to apply a numeric facet to the column you edited. The facet will appear in the left panel.
Notice that there are several checkboxes in this facet: Numeric, Non-numeric, Blank, and Error. Below these are counts of the number of cells in each category. You should see checks for Non-numeric and Blank if you changed some values.
Experiment with checking or unchecking these boxes to select subsets of your data.

When done examining the numeric data, remove this facet by clicking the x in the upper left corner of its panel. Note that this does not undo the edits you made to the cells in this column.

Examine a pair of numeric columns using scatterplots

Now that we have multiple columns representing numbers, we can see how they relate to one another using the scatterplot facet. Select a numeric column, for example recordID, and use the pulldown menu to > Facet > Scatterplot facet. A new window called Scatterplot Matrix will appear. There are squares for each pair of numeric columns organized in an upper right triangle. Each square has little dots for the cell values from each row.

Click the image of the scatterplot between recordID and yr to select this one for the facet.

Exercise

Click in the scatterplot facet in the lower left margin and drag to highlight a rectangle. How does this change the data rows displayed?

More Details on Faceting

Full documentation on faceting can be found at Exploring facets: Faceting

Key Points

Faceting can identify errors or outliers in data

Content from Transforming Data

Last updated on 2024-05-02 | Edit this page

Overview

Questions

How can we transform our data to correct errors?

Objectives

Learn about clustering and how it is applied to group and edit typos
Split values from one column into multiple columns
Manipulate data using previous cleaning steps with undo/redo
Remove leading and trailing white spaces from cells

Data splitting

We can split data from one column into multiple columns if the parts are separated by a common separator (say a comma, or a space).

Let us suppose we want to split the scientificName column into separate columns, one for genus and one for species.
Click the down arrow next to the scientificName column. Choose Edit Column > Split into several columns...
In the pop-up, in the Separator box, replace the comma with a space (the box will look empty when you’re done).
Important! Uncheck the box that says Remove this column.
Click OK. You should get some new columns called scientificName 1, scientificName 2, scientificName 3, and scientificName 4.
Notice that in some cases these newly created columns are empty (you can check by text faceting the column). Why? What do you think we can do to fix it?

The entries that have data in scientificName 3 and scientificName 4 but not the first two scientificName columns had an extra space at the beginning of the entry. Leading and trailing white spaces are very difficult to notice when cleaning data manually. This is another advantage of using OpenRefine to clean your data - this process can be automated.

In newer versions of OpenRefine (from version 3.4.1) there is now an option to clean leading and trailing white spaces from all data when importing the data initially and creating the project.

Exercise

Look at the data in the column coordinates and split these values to obtain latitude and longitude. Make sure that the option for Guess cell type is checked and that Remove this column is not. Rename the new columns.

What type of data does OpenRefine assign to the new colunms?

Show me the solution

Both new columns will appear with green text, indicating they are numeric. The option for Guess cell type allowed OpenRefine to guess that these values were numeric.

Undoing / Redoing actions

It is common while exploring and cleaning a dataset to make a mistake or decide to change the order of the process you wish to conduct. OpenRefine provides Undo and Redo operations to roll back your changes.

Click Undo / Redo in the left side of the screen. All the changes you have made will appear in the left-hand panel. The current stage in the data processing is highlighted in blue (i.e. step 4. in the screenshot below). As you click on the different stages in the process, the step identified in blue will change and, far more importantly, the data will revert to that stage in the processing.

We want to undo the splitting of the column scientificName. Select the stage just before the split occurred and the new scientificName columns will disappear.
Notice that you can still click on the last stage and make the columns reappear, and toggle back and forth between these states. You can also select the state more than one steps back and revert to that state.
Let’s leave the dataset in the state before scientificNames was split.

Trimming leading and trailing whitespace

Words with spaces at the beginning or end are particularly hard for humans to identify from strings without these spaces (as we have seen with the scientificName column). However, blank spaces can make a big difference to computers, so we usually want to remove them.

In the header for the column scientificName, choose Edit cells > Common transforms > Trim leading and trailing whitespace.
Notice that the Split step has now disappeared from the Undo / Redo pane on the left and is replaced with a Text transform on 2 cells

Exercise

Repeat the splitting of column scientificName exercise after trimming the whitespace.

Show me the solution

On the scientificName column, click the down arrow next to the scientificName column and choose Edit Column > Split into several columns... from the drop down menu. Use a blank character as a separator, as before. You should now get only two columns scientificName 1 and scientificName 2.

Renaming columns

We now have the genus and species parts neatly separated into 2 columns - scientificName 1 and scientificName 2. We want to rename these as genus and species, respectively.

Let’s first rename the scientificName 1 column. On the column, click the down arrow and then Edit column > Rename this column.
Type “genus” into the box that appears.

Exercise

Try to change the name of the scientificName 2 column to species. What problem do you encounter? How can you fix the problem?

Show me the solution

On the scientificName 2 column, click the down arrow and then Edit column > Rename this column.
Type “species” into the box that appears.
A pop-up will appear that says Another column already named species. This is because there is another column with the same name where we’ve recorded the species abbreviation.
You can use another name for the scientificName 2 or change the name of the species column and then rename the scientificName 2 column.

Edit the name of the species column to species_abbreviation. Then, rename scientificName 2 to species.

Combining columns to create new ones

The date for each row in the data file is split in three columns: dy (day), mo (month), and yr (year). We can create a new column with the date in the format we want by combining these columns.

Click on the menu for the yr column and select Edit column > Join columns....
In the window that opens up, check the boxes next to the columns yr, mo, and dy.
Enter - as a separator.
Select the option Write result in new column named and write date as the name for the new column.
Click OK

You can change the order of the columns by dragging the columns in the left side of the window.

Once the new column is created, convert it to date using Edit cells > Common transforms > To date. Now you can explore the data using a timeline facet. Create the new facet by clicking on the menu for the column date and select Facet > Timeline facet.

Data clustering

Clustering allows you to find groups of entries that are not identical but are sufficiently similar that they may be alternative representations of the same thing (term or data value). For example, the two strings New York and new york are very likely to refer to the same concept and just have a capitalization differences. Likewise, Björk and Bjork probably refer to the same person. These kinds of variations occur a lot in scientific data. Clustering gives us a tool to resolve them.

OpenRefine provides different clustering algorithms. The best way to understand how they work is to experiment with them.

The dataset has several near-identical entries in scientificName. For example, there are two misspellings of Ammospermophilus harrisii:

Ammospermophilis harrisi and
Ammospermophilus harrisi

If you removed it, reinstate the scientificName text facet (you can also remove all the other facets to gain some space). In the scientificName text facet box - click the Cluster button.
In the resulting pop-up window, you can change the Method and the Keying Function. Try different combinations to see what different mergers of values are suggested.
If you select the key collision method and the metaphone3 keying function. It should identify one cluster:

Note that the New Cell Value column displays the new name that will replace the value in all the cells in the group. You can change this if you wish to choose a different value than the suggested one.
Tick the Merge? checkbox beside each group, then click Merge selected & Close to apply the corrections to the dataset and close the window.
The text facet of scientificName will update to show the new summary of the column. It will now have ten options:

Facet of scientificName after clustering

Clustering Documentation

Full documentation on clustering can be found at the OpenRefine Clustering Methods In-depth page of the OpenRefine manual.

Key Points

Clustering can identify outliers in data and help us fix errors in bulk

Content from Filtering and Sorting with OpenRefine

Last updated on 2023-05-09 | Edit this page

Overview

Questions

How can we select only a subset of our data to work with?
How can we sort our data?

Objectives

Employ text filter or include/exclude to filter to a subset of rows.
Sort tables by a column.
Sort tables by multiple columns.

Filtering

Sometimes you want to view and work only with a subset of data or apply an operation only to a subset. You can do this by applying various filters to your data.

One way to filter down our data is to use the include or exclude buttons on the entries in a text facet. If you still have your text facet for scientificName, you can use it. If you’ve closed that facet, recreate it by selecting Facet > Text facet on the scientificName column.

In the text facet, hover over one of the names, e.g. Baiomys taylori. Notice that when you hover over it, there are buttons to the right for edit and include.
Whilst hovering over Baiomys taylori, move to the right and click the include option. This will include this species, as signified by the name of the species changing from blue to orange, and new options of edit and exclude will be presented. Note that in the top of the page, “33 matching rows” is now displayed instead of “790 rows”.
You can include other species in your current filter - e.g. click on Chaetodipus baileyi in the same way to include it in the filter.
Alternatively, you can click the name of the species to include it in the filter instead of clicking the include/exclude buttons. This will include the selected species and exclude all others options in a single step, which can be useful.
Click include and exclude on the other species and notice how the entries appear and disappear from the data table to the right.

Click on Reset at the top-right of the facet before continuing to the next step.

Text filters

One way to filter data is to create a text filter on a column. Close all facets you may have created previously and reinstate the text facet on the scientificName column.

Click the down arrow next to scientificName > Text filter. A scientificName filter will appear on the left margin below the text facet.
Type in bai into the text box in the filter and press return. At the top of the page it will report that, out of the 790 rows in the raw data, there are 35 rows in which the text has been found within the scientificName column (and these rows will be selected for the subsequent steps).

Near the top of the screen, change Show: to 50. This way, you will see all the matching rows in a single page.

Exercise

What scientific names are selected by this text filter?
How would you restrict the filter to one of the species selected?

Show me the solution

Do Facet > Text facet on the scientificName column after filtering. This will show that two names match your filter criteria. They are Baiomys taylori and Chaetodipus baileyi.
To restrict to only one of these two species, you could:

Check the case sensitive box within the scientificName facet. Once you do this, you will see that using the upper-case Bai will only > > return Baiomys taylori, while using lower-case bai will only return Chaetodipus baileyi.
You could include more letters in your filter (i.e. typing baio will exclusively return Baiomys taylori, while bail will only return Chaetodipus baileyi).

Important: Make sure both species (Baiomys taylori and Chaetodipus baileyi) are included in your filtered dataset before continuing with the rest of the exercises.

Sort

Sorting data is a useful practice for detecting outliers in data - potential errors and blanks will sort to the top or the bottom of your data.

You can sort the data in a column by using the drop-down menu available in that column. There you can sort by text, numbers, dates or booleans (TRUE or FALSE values). You can also specify what order to put Blanks and Errors in the sorted results.

If this is your first time sorting this table, then the drop-down menu for the selected column shows Sort.... Select what you would like to sort by (such as numbers). Additional options will then appear for you to fine-tune your sorting.

Exercise

Sort by month. How can you ensure that months are in order?

Show me the solution

In the mo column, select Sort... > numbers and select smallest first. The months are listed from 1 (for January) through 12 (December).

If you try to re-sort a column that you have already used, the drop-down menu changes slightly, to > Sort without the ..., to remind you that you have already used this column. It will give you additional options:

Sort > Sort... - This option enables you to modify your original sort.
Sort > Reverse - This option allows you to reverse the order of the sort.
Sort > Remove sort - This option allows you to undo your sort.

Exercise

Sort the data by plot. What year(s) were observations recorded for plot 1 in this filtered dataset?

Show me the solution

In the plot column, select Sort... > numbers and select smallest first. The years represented include 1990 and 1995.

Sorting by multiple columns

You can sort by multiple columns by performing sort on additional columns. The sort will depend on the order in which you select columns to sort. To restart the sorting process with a particular column, check the sort by this column alone box in the Sort pop-up menu.

Exercise

You might like to look for trends in your data by month of collection across years.

How do you sort your data by month?
How would you do this differently if you were instead trying to see all of your entries in chronological order?

Show me the solution

For the mo column, click on Sort... and then numbers. This will group all entries made in, for example, January, together, regardless of the year that entry was collected.
For the yr column, click on Sort > Sort... > numbers and select sort by this column alone. This will undo the sorting by month step. Once you’ve sorted by yr you can then apply another sorting step to sort by month within year. To do this for the mo column, click on Sort > numbers but do not select sort by this column alone. To ensure that all entries are shown chronologically, you will need to also sort by days within each month. Click on the dy column then Sort > numbers. Your data should now be in chronological order.

If you go back to one of the already sorted columns and select > Sort > Remove sort, that column is removed from your multiple sort. If it is the only column sorted, then data reverts to its original order.

Exercise

Sort by year, month and day in some order. Be creative: try sorting as numbers or text, and in reverse order (largest to smallest or z to a).

Use Sort > Remove sort to remove the sort on the second of three columns. Notice how that changes the order.

Key Points

OpenRefine provides various ways to sort and filter data without affecting the raw data.

Content from Reconciliation of Values

Last updated on 2023-05-09 | Edit this page

Overview

Questions

How can we reconcile scientific names against a taxonomy?

Objectives

Add external reconciliation services.
Cleanup scientific names by matching them to a taxonomy.
Identify and correct misspelled or incorrect names for a taxon.

Reconciliation of names

Reconciliation services allow you to lookup terms from your data in OpenRefine against external services, and use values from the external services in your data. The OpenRefine manual provides more information about the reconciliation feature.

Reconciliation services can be more sophisticated and often quicker than using the method described above to retrieve data from a URL. However, to use the Reconciliation function in OpenRefine requires the external resource to support the necessary service for OpenRefine to work with, which means unless the service you wish to use supports such a service you cannot use the Reconciliation approach.

There are a few services where you can find an OpenRefine Reconciliation option available. For example Wikidata has a reconciliation service at https://wikidata.reconci.link/.

Reconcile scientific names with the Encyclopedia of Life (EOL)

We can use the Encyclopedia of Life Reconciliation service to find the taxonomic matches to the names in the dataset.

In the scientificName column use the dropdown menu to choose ‘Reconcile->Start Reconciling’
If this is the first time you’ve used this particular reconciliation service, you’ll need to add the details of the service now

Click ‘Add Standard Service…’ and in the dialogue that appears enter:
- https://eol.org/api/reconciliation

You should now see a heading in the list on the left hand side of the Reconciliation dialogue called “Encyclopedia of Life”
Click on this to choose to use this reconciliation service
In the middle box in the reconciliation dialogue you may get asked what type of ‘entity’ you want to reconcile to - that is, what type of thing are you looking for. The list will vary depending on what reconciliation service you are using.

In this case, the only option is “Taxon”

In the box on the righthand side of the reconciliation dialogue you can choose if other columns are used to help the reconciliation service make a match - however it is sometimes hard to tell what use (if any) the reconciliation service makes of these additional columns
At the bottom of the reconciliation dialogue there is the option to “Auto-match candidates with high confidence”. This can be a time saver, but in this case you are going to uncheck it, so you can see the results before a match is made
Now click ‘Start Reconciling’

Reconciliation is an operation that can take a little time if you have many values to look up.

Once the reconciliation has completed two Facets should be created automatically:

scientificName: Judgement
scientificName: best candidate’s score

These are two of several specific reconciliation facets and actions that you can get from the ‘Reconcile’ menu (from the column drop down menu).

Close the ‘scientificName: best candidate’s score’ facet, but leave the ‘scientificName: Judgement’ facet open.

If you look at the scientificName column, you should see some cells have found one or more matches - the potential matches are shown in a list in each cell. Next to each potential match there is a ‘tick’ and a ‘double tick’. To accept a reconciliation match you can use the ‘tick’ options in cells. The ‘tick’ accepts the match for the single cell, the ‘double tick’ accepts the match for all identical cells.

Create a text facet on the scientificName column
Choose Ammospermophilus harrisii

In the scientificName column you should be able to see the various potential matches. Clicking on a match will take you to the EOL page for that entity

Click a ‘double tick’ in one of the scientificName column cells for the option “Ammospermophilus harrisii (Audubon & Bachman 1854)”
This will accept this as a match for all cells with this value - you should see the other options disappear

There are two things that reconciliation can do for you. Firstly it gets a standard form of the name or label for the entity. Secondly it gets an ID for the entity - in this case a page and numeric id for the scientific name in EOL. This is hidden in the default view, but can be extracted:

In the scientificName column use the dropdown menu to choose Reconcile > Add entity identifiers column...
Give the column the name “EOL-ID”
This will create a new column that contains the EOL ID for the matched entity

Column with the identifiers from reconciliation

Reconcile country, state, and counties against Wikidata

The country field contains several alternative ways to indicate the United States of America, including:

US
UNITED STATES
United States of America

Exercise

If Wikidata does not appear in the list of reconciliation services, add the standard service using the URL: https://wikidata.reconci.link/en/api

Reconcile the columns county, state, and country using Wikidata.
- If a type does not show in the list, search using the Reconcile against type box:
  - country: country (Q6256)
  - state: U.S. state (Q35657)
  - county: county of the United States (Q47168)
Mouseover the options listed to see a preview of the entity suggested
- For the cells with multiple options, choose one of the suggested values and click the double checkmark button to apply to all cells with the same value
Click the menu of the country column and select Edit column > Add column based on this column...
Enter “reconciled_country” in the field for New column name
In the Expression box, enter the following GREL: cell.recon.best.name
Click OK

This will create a new column with the reconciled names for the countries. Create a text facet to see that there are a single name for each country.

In the cases where OpenRefine did not select a match automatically, are the options relevant?
Why do some cells in the county column have many options?

Show me the solution

The options may be for the same place, but with different wording, like Hernando County for Hernando.
Several county names exist in multiple states. You can mouseover each option and find the correct one that matches the state.

Key Points

OpenRefine can look up existing reconciliation services to enrich data

Content from Looking Up Data

Last updated on 2023-05-09 | Edit this page

Overview

Questions

How do I fetch data from an Application Programming Interface (API) to be used in OpenRefine?
How do I reconcile my data by comparing it to authoritative datasets

Objectives

Use URLs to fetch data from the web based on columns in an OpenRefine project
Add columns to parse JSON data returned by web services

Looking up data from a URL

OpenRefine can retrieve data from URLs. This can be used in various ways, including looking up additional information from a remote service, based on information in your OpenRefine data. As an example, you can look up the scientific names in a dataset against the taxonomy of the Global Biodiversity Information Facility (GBIF), and retrieve additional information such as higher taxonomy and identifiers.

Typically this is a two step process, firstly a step to retrieve data from a remote service, and secondly to extract the relevant information from the data you have retrieved.

To retrieve data from an external source, use the drop down menu at any column heading and select Edit column > Add column by fetching URLs....

This will prompt you for a GREL expression to create a URL. This URL will use existing values in your data to build a query. When the query runs OpenRefine will request each URL (for each line) and retrieve whatever data is returned (this may often be structured data, but could be HTML). The data retrieved will be stored in a cell in the new column that has been added to the project. You can then use OpenRefine transformations to extract relevant information from the data that has been retrieved. Two specific OpenRefine functions used for this are:

parseHtml()
parseJson()

The parseHtml() function can also be used to extract data from XML.

Retrieving higher taxonomy from GBIF

In this case we are going to use the GBIF API. Note that API providers may impose rate limits or have other requirements for using their data, so it’s important to check the site’s documentation. To reduce the impact on the service, use a value of 500 in the Throttle Delay setting to specify the number of milliseconds between requests.

The syntax for requesting species information from GBIF is http://api.gbif.org/v1/species/match?name={name} where {name} is replaced with the scientific name in the dataset.

Using the dropdown menu of scientificName, select Edit column > Add column by fetching URLs...
In the New column name field, enter “gbif_JSON”
In the field for Throttle delay enter 500
In the expression box type the GREL "http://api.gbif.org/v1/species/match?name="+escape(value,'url')

At this point, your screen should be similar to this:
Click ‘OK’

You should see a message at the top on the OpenRefine screen indicating it is fetching some data, with progress showing the percentage of the proportion of rows of data successfully being fetched. Wait for this to complete.

At this point you should have a new column containing a long text string in a format called ‘JSON’ (this stands for JavaScript Object Notation, although very rarely spelt out in full). The results should look like this figure:

Example of the results of fetching data from GBIF

OpenRefine has a function for extracting data from JSON (sometimes referred to as ‘parsing’ the JSON). The parseJson() function is explained in more detail at the Format-based functions page.

In the new column you’ve just added use the dropdown menu to access Edit column > Add column based on this column...
Add the new column name: “gbif_family”
In the Expression box type the GREL value.parseJson().get("family")
You should see in the Preview the taxonomic family of the scientific names displays, similar to this screen:

The reason for using Add column based on this column is that this allows you to retain the full JSON and extract further data from it if you need to. If you only wanted the taxonomic family and did not need any other information from the JSON you could use Edit cells > Transform... with the same GREL expression.

Key Points

OpenRefine can look up custom URLs to fetch data based on what’s in an OpenRefine project
Such API calls can be custom built

Content from Exporting Data Cleaning Steps

Last updated on 2023-05-09 | Edit this page

Overview

Questions

How can we document the data-cleaning steps we’ve applied to our data?
How can we apply these steps to additional data sets?

Objectives

Describe how OpenRefine generates JSON code.
Demonstrate ability to export JSON code from OpenRefine.
Save JSON code from an analysis.
Apply saved JSON code to an analysis.

Export the steps to clean up and enhance the dataset

As you conduct your data cleaning and preliminary analysis, OpenRefine saves every change you make to the dataset. These changes are saved in a format known as JSON (JavaScript Object Notation). You can export this JSON script and apply it to other data files. If you had 20 files to clean, and they all had the same type of errors (e.g. species name misspellings, leading white spaces), and all files had the same column names, you could save the JSON script, open a new file to clean in OpenRefine, paste in the script and run it. This gives you a quick way to clean all of your related data.

In the Undo / Redo section, click Extract..., and select the steps that you want to apply to other datasets by clicking the check boxes.
Copy the code from the right hand panel and paste it into a text editor (like NotePad on Windows or TextEdit on Mac). Make sure it saves as a plain text file. In TextEdit, do this by selecting Format > Make plain text and save the file as a txt file.

Let’s practice running these steps on a new dataset. We’ll test this on an uncleaned version of the dataset we’ve been working with.

Exercise

Download an uncleaned version of the dataset from the Setup page or use the version of the raw dataset you saved to your computer.
Start a new project in OpenRefine with this file and name it something different from your existing project.
Click the Undo / Redo tab > Apply and paste in the contents of txt file with the JSON code.
Click Perform operations. The dataset should now be the same as your other cleaned dataset.

For convenience, we used the same dataset. In reality you could use this process to clean related datasets. For example, data that you had collected over different fieldwork periods or data that was collected by different researchers (provided everyone uses the same column headings).

Reproducible science

Now, that you know how scripts work, you may wonder how to use them in your own scientific research. For inspiration, you can read more about the succesful application of the reproducible science principles in archaeology or marine ecology:

Marwick et al. (2017) Computational Reproducibility in Archaeological Research: Basic Principles and a Case Study of Their Implementation
Stewart Lowndes et al. (2017) Our path to better science in less time using open data science tools

Key Points

All changes are being tracked in OpenRefine, and this information can be used for scripts for future analyses or reproducing an analysis.
Scripts can (and should) be published together with the dataset as part of the digital appendix of the research output.

Content from Exporting and Saving Data from OpenRefine

Last updated on 2023-09-23 | Edit this page

Overview

Questions

How can we save and export our cleaned data from OpenRefine?

Objectives

Export cleaned data from an OpenRefine project.
Save an OpenRefine project.

Exporting cleaned data

Once you are done cleaning your data, you will likely want to export it.

Click Export in the top right and select the file type you want to export the data in. Tab-separated value (tsv) or Comma-separated value (csv) would be good choices.

That file will be exported to your default Downloads directory. That file can then be opened in a spreadsheet program or imported into programs like R or Python.

Saving and exporting a project

In OpenRefine you can save or export the project. This means you’re saving the data and all the information about the cleaning and data transformation steps you’ve done. Once you’ve saved a project, you can open it up again and be just where you stopped before.

Saving

By default OpenRefine is saving your project. If you close OpenRefine and open it up again, you’ll see a list of your projects. You can click on any one of them to open it up again.

To find the location on your computer where the files are saved, see the OpenRefine manual page for where the data is stored.

Exporting project files

You can also export a project. This is helpful, for instance, if you wanted to send your raw data and cleaning steps to a collaborator, or share this information as a supplement to a publication.

Click the Export button in the top right and select OpenRefine project archive to file.
A tar.gz file will download to your default Download directory. The tar.gz extension tells you that this is a compressed file, which means that this file contains multiple files.
If you open the compressed file (if on Windows, you will need a tool like 7-Zip), you should see:

A history folder which contains several zip files. Each of these files itself contains a change.txt file.
- These change.txt files are the records of each individual transformation that you did to your data.
A data.zip file. When expanded, this zip file includes a file called data.txt which is a copy of your raw data.
You may also see other files.

Importing project files

You can import an existing project into OpenRefine by clicking Open... in the upper right, then clicking on Import Project on the left-side menu. Click Choose File and select the tar.gz project file. This project will include all of the raw data and cleaning steps that were part of the original project.

Key Points

OpenRefine can save the clean data to a number of formats.
Cleaned data or entire projects can be exported from OpenRefine.
Projects can be shared with collaborators, enabling them to see, reproduce and check all data cleaning steps you performed.

Content from Other Resources in OpenRefine

Last updated on 2023-05-09 | Edit this page

Overview

Questions

What other resources are available for working with OpenRefine?

Objectives

Explore the online resources available for more information on OpenRefine.
Identify other resources about OpenRefine.

Other resources about OpenRefine

OpenRefine is more than a simple data cleaning tool. People are using it for all sorts of activities. Here are some other resources that might prove useful.

OpenRefine has documentation and other resources:

In addition, see these other useful resources:

Use of OpenRefine with GBIF - PDF guide
Getting started with OpenRefine by Thomas Padilla
Cleaning Data with OpenRefine by Seth van Hooland, Ruben Verborgh and Max De Wilde
Identifying potential headings for Authority work using III Sierra, MS Excel and OpenRefine
Free your metadata website
Cleaning Data with OpenRefine by John Little
Grateful Data is a fun site with many resources devoted to OpenRefine, including a nice tutorial.

Key Points

Other examples and resources online are good for learning more about OpenRefine

Overview

Questions

Objectives

Lesson

Motivations for the OpenRefine Lesson

Before we get started

What is OpenRefine?

It can also help you

Features

More Information on OpenRefine

Key Points

Overview

Questions

Objectives

Importing data

What kinds of data files can I import?

Create your first OpenRefine project (using provided data)

OpenRefine does not modify your original dataset

Key Points

Overview

Questions

Objectives

Exploring data with facets

Exploring text columns

Facets and large datasets

Exercise

Show me the solution

Exploring numeric columns

Numeric facet

Exercise

Show me the solution

Exercise

Examine a pair of numeric columns using scatterplots

Exercise

More Details on Faceting

Key Points

Overview

Questions

Objectives

Data splitting

Exercise

Show me the solution

Undoing / Redoing actions

Trimming leading and trailing whitespace

Exercise

Show me the solution

Renaming columns

Exercise

Show me the solution

Combining columns to create new ones

Data clustering

Clustering Documentation

Key Points

Overview

Questions

Objectives

Filtering

Including/excluding data entries on facets

Text filters

Exercise

Show me the solution

Filters vs. facets

Sort

Exercise

Show me the solution

Exercise

Show me the solution

Sorting by multiple columns

Exercise

Show me the solution

Exercise

Key Points

Overview

Questions

Objectives

Reconciliation of names

Reconcile scientific names with the Encyclopedia of Life (EOL)

Reconcile country, state, and counties against Wikidata

Exercise

Show me the solution