Summary and Setup

A part of the data workflow is preparing the data for analysis. Some of this involves data cleaning, where errors in the data are identified and corrected or formatting is made consistent. This step must be taken with the same care and attention to reproducibility as the analysis.

OpenRefine (formerly Google Refine) is a powerful free and open source tool for working with messy data: cleaning it and transforming it from one format into another.

Learning objectives


By the end of this lesson, you will be able to:

  • create, export and import a project in OpenRefine
  • view and work on subsets of rows using facets and text filters
  • reduce variations in data through clustering, bulk editing and transformations
  • undo and redo actions and export the history of actions
  • save cleaned data in a widely supported file format

This lesson will teach you to use OpenRefine to effectively clean and format data and automatically track any changes that you make. Many people report that this tool saves them months of work that they would otherwise spend making these edits by hand.

Importantly, this lesson does not cover all of OpenRefine’s functionality, and it does not correct every error in the provided dataset.

Getting Started


Data Carpentry’s teaching is hands-on, so participants are encouraged to use their own computers to ensure the proper setup of tools for an efficient workflow.

These lessons assume no prior knowledge of the skills or tools.

To use these materials most effectively, please install everything before working through this lesson.

Data

The data for this lesson is a part of the Data Carpentry Social Sciences workshop. It is a teaching version of the Studying African Farmer-Led Irrigation (SAFI) database. The SAFI dataset represents interviews of farmers in two countries in eastern sub-Saharan Africa (Mozambique and Tanzania). These interviews were conducted between November 2016 and June 2017 and probed household features (e.g. construction materials used, number of household members), agricultural practices (e.g. water usage), and assets (e.g. number and types of livestock).

The data used in this lesson is a subset of the teaching version that has been intentionally ‘messed up’ for this lesson.

Download the data file to your computer.

Software

For this lesson you will need OpenRefine (formerly Google Refine) and a web browser. Basic installation steps are provided on this page. The OpenRefine installation manual provides more details about installation, upgrades and configuration.

Note: OpenRefine is a Java program that runs on your machine (not in the cloud). Its interface opens in your browser, but no web connection is needed for this lesson.

Administrator rights

You do not need administrative rights on the computer to install OpenRefine. However, if anti-malware software blocks OpenRefine when you try to start it, you may need administrative rights to allow OpenRefine to run. OpenRefine is safe to run.

Windows

  • Check that you have Firefox, Edge, Opera or Chrome browsers installed and set as your default browser. OpenRefine runs in your default browser. It will not run correctly in Internet Explorer.

  • Download the software from openrefine.org.

  • Unzip the downloaded file into a directory by right-clicking and selecting “Extract…”. Name that directory something like OpenRefine.

    Long paths

    The path to the directory you extract the application files into should be short, because some of OpenRefine’s files have very long names. If the path is too long, OpenRefine cannot start.

  • Go to your newly created OpenRefine directory.

  • Launch OpenRefine by opening openrefine.exe. This will launch a command prompt window, but you can ignore that and wait for the browser to launch.

  • If you see Internet Explorer start, or OpenRefine does not automatically open for you, point one of the supported browsers at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.
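
The long-path note above can be checked before launching. The following is a minimal Python sketch, not part of the OpenRefine distribution, that reports the longest file path under the extraction directory. It assumes the traditional Windows path limit of 260 characters, which may not apply to every Windows configuration, and `C:\OpenRefine` is a hypothetical location used only for illustration.

```python
import os

# Assumption: 260 characters is the traditional Windows path limit (MAX_PATH).
# Some Windows configurations allow longer paths, so treat this as a guide.
WINDOWS_MAX_PATH = 260


def longest_path(root):
    """Return the longest absolute file path under root (or root itself)."""
    longest = os.path.abspath(root)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            candidate = os.path.abspath(os.path.join(dirpath, name))
            if len(candidate) > len(longest):
                longest = candidate
    return longest


def path_is_too_long(root, limit=WINDOWS_MAX_PATH):
    """Return True if any file path under root reaches the assumed limit."""
    return len(longest_path(root)) >= limit


if __name__ == "__main__":
    install_dir = r"C:\OpenRefine"  # hypothetical extraction directory
    worst = longest_path(install_dir)
    print(f"Longest path has {len(worst)} characters: {worst}")
    if path_is_too_long(install_dir):
        print("Consider extracting OpenRefine closer to the drive root.")
```

If the script reports a path near the limit, extracting the files into a shorter directory is likely the simplest fix.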

Mac

  • Check that you have Firefox, Edge, Opera or Chrome browsers installed and set as your default browser. OpenRefine runs in your default browser. It will not run correctly in Internet Explorer.
  • Download the software from openrefine.org.
  • Unzip the downloaded file into a directory by double-clicking it. Name that directory something like OpenRefine.
  • Go to your newly created OpenRefine directory.
  • Drag the OpenRefine app into the Applications folder.
  • Launch OpenRefine: Control-click the app icon, then choose “Open” from the shortcut menu. For troubleshooting help, see the Apple support page.
  • If you are using a different browser than listed above, or if OpenRefine does not automatically open for you, point your browser at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.

Linux

  • Check that you have Firefox or Chrome browsers installed and set as your default browser. OpenRefine runs in your default browser.
  • Download the software from openrefine.org.
  • Unzip the downloaded file into a directory. Name that directory something like OpenRefine.
  • Go to your newly created OpenRefine directory.
  • Launch OpenRefine by typing ./refine into the terminal within the OpenRefine directory.
  • If you are using a different browser than listed above, or if OpenRefine does not automatically open for you, point your browser at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.
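
On any of the platforms above, if the browser does not open on its own, it can help to confirm that OpenRefine is answering at the local address before troubleshooting further. The following is a minimal Python sketch under stated assumptions: the function name `openrefine_is_reachable` is hypothetical, and the check only confirms that something responds to HTTP at the given address, not that the responder is OpenRefine.

```python
import http.client
import urllib.request


def openrefine_is_reachable(url="http://127.0.0.1:3333/", timeout=2.0):
    """Return True if an HTTP server answers successfully at url.

    This only shows that something is listening at the address;
    it does not verify that the server is OpenRefine.
    """
    # Bypass any configured proxy, because the address is on this machine.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a malformed reply.
        return False


if __name__ == "__main__":
    if openrefine_is_reachable():
        print("A server is responding at http://127.0.0.1:3333/")
    else:
        print("Nothing is responding yet; check that OpenRefine has started.")
```

Because the address is the loopback address of your own machine, this check works without an internet connection.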

Exiting OpenRefine

To exit OpenRefine, close all the browser tabs or windows, then navigate to the command line window. To close this window and ensure OpenRefine exits properly, hold down [control] and press [c] on your keyboard. This will save all changes to your projects.

Close the browser window or tab first so that you are not actively using OpenRefine when you stop the server. This prevents unsaved changes from being lost. After stopping the server, you can safely close the terminal or command prompt window.