Data Cleaning with OpenRefine for Ecologists: Instructor Notes

Lesson motivation and learning objectives

Lesson design

Technical tips and tricks

Ask the learners to install and test running the program before the workshop. This will reveal usual problems that can be fixed before the workshop.

OpenRefine version

The current version of the lesson has been tested with OpenRefine 3.6.2. Using older versions is not recommended, behavior may be different and cause confusion in the learners.

Windows users should download the “Windows kit with embedded Java” version from the Download page to avoid having to install Java.

Mac users can use the kit available in the Download page, Java does not have to be installed separately.

Linux users must install Java in their system.

Creating a Project

Start the program. (Double-click on the openrefine.exe file. A command line box might show up, this needs to be left open, and OpenRefine will open in the browser).

Note the file types OpenRefine handles: TSV, CSF, *SV, Excel (.xls .xlsx), JSON, XML, RDF as XML, Google Data documents. Support for other formats can be added with OpenRefine extensions.

In this first step, we’ll browse our computer to the sample data file for this lesson (If you haven’t already, download the data from: https://ndownloader.figshare.com/files/7823341). In this case, I’ve modified the Portal_rodents.csv file. I added several columns: scientificName, locality, county, state, country and I generated several more columns in the lesson itself (JSON, decimalLatitude, decimalLongitude). Data in locality, county, country, JSON, decimalLatitude and decimalLongitude are contrived and are in no way related to the original dataset.

Once OpenRefine is open, you’ll be asked if you want to Create, Open, or Import a Project.

Faceting

Exploring data by applying multiple filters

OpenRefine supports faceted browsing as a mechanism for

Typically, you create a facet on a particular column. The facet summarizes the cells in that column to give you a big picture of that column, and allows you to filter to some subset of rows for which their cells in that column satisfy some constraint. That’s a bit abstract, so let’s jump into some examples.

More on faceting

Clustering

In OpenRefine, clustering refers to the operation of “finding groups of different values that might be alternative representations of the same thing”. For example, the two strings “New York” and “new york” are very likely to refer to the same concept and just have capitalization differences. Likewise, “Gödel” and “Godel” probably refer to the same person.

One of the most magical bits of OpenRefine, the moment you realize what you’ve been missing. OpenRefine has several clustering algorithms built in. Experiment with them, and learn more about these algorithms and how they work.

More on clustering

Split / Undo - Redo

If data in a column needs to be split into multiple columns, and the strings in the cells are separated by a common separator (say a comma, or a space), you can use that separator to divide up the bits into their own columns.

Common problems