OpenRefine for Social Science Data

Introduction

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What is OpenRefine useful for?

Objectives
  • Describe OpenRefine’s uses and applications.

  • Differentiate data cleaning from data organization.

  • Experiment with OpenRefine’s user interface.

  • Locate helpful resources to learn more about OpenRefine.

Lesson

Motivations for the OpenRefine Lesson

Features

Before we get started

Note: this is a Java program that runs on your machine (not in the cloud). It runs inside your browser, but no web connection is needed.

Follow the Setup instructions to install OpenRefine.

If after installation and running OpenRefine, it does not automatically open for you, point your browser at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.

Getting help for OpenRefine

You can find out a lot more about OpenRefine at http://openrefine.org and check out some great introductory videos. These videos and others on OpenRefine can also be found on YouTube by searching under ‘OpenRefine’. There is a Google Group that can answer a lot of beginner questions and problems. Information can also be found on StackOverflow where you can find a lot of help. As with other programs of this type, OpenRefine libraries are available too, where you can find a script you need and copy it into your OpenRefine instance to run it on your dataset.

Key Points

  • OpenRefine is a powerful, free, and open source tool that can be used for data cleaning.

  • OpenRefine will automatically track any steps allowing you to backtrack as needed and providing a record of all work done.


Working with OpenRefine

Overview

Teaching: 15 min
Exercises: 20 min
Questions
  • How can we bring our data into OpenRefine?

  • How can we sort and summarize our data?

  • How can we find and correct errors in our raw data?

Objectives
  • Create a new OpenRefine project from a CSV file.

  • Understand potential problems with file headers.

  • Use facets to summarize data from a column.

  • Use clustering to detect possible typing errors.

  • Understand that there are different clustering algorithms which might give different results.

  • Employ drop-downs to remove white spaces from cells.

  • Manipulate data using previous steps with undo/redo.

Lesson

Creating a new OpenRefine project

In Windows, you can start the OpenRefine program by double-clicking on the openrefine.exe file. Java services will start automatically on your machine, and OpenRefine will open in your browser. On a Mac, OpenRefine can be launched from your Applications folder. If you are using Linux, you will need to navigate to your OpenRefine directory in the command line and run ./refine.

OpenRefine can import a variety of file types, including tab separated (tsv), comma separated (csv), Excel (xls, xlsx), JSON, XML, RDF as XML, and Google Spreadsheets. See the OpenRefine Create a Project by Importing Data page for more information.

In this first step, we’ll browse our computer to the sample data file for this lesson. In this case, we will be using data obtained from interviews of farmers in two countries in eastern sub-Saharan Africa (Mozambique and Tanzania). Instructions on downloading the data are available here.

Once OpenRefine is launched in your browser, the left margin has options to Create Project, Open Project, or Import Project. Here we will create a new project:

  1. Click Create Project and select Get data from This Computer.
  2. Click Choose Files and select the file SAFI_openrefine.csv that you downloaded in the setup step. Click Open or double-click on the filename.
  3. Click Next>> under the browse button to upload the data into OpenRefine.
  4. OpenRefine gives you a preview - a chance to show you it understood the file. If, for example, your file was really tab-delimited, the preview might look strange. You would then choose the correct separator in the box shown and click Update Preview (middle right). If this is the wrong file, click <<Start Over (upper left). There are also options to indicate whether the dataset has column headers included and whether OpenRefine should skip a number of rows before reading the data. Parse Options

  5. If all looks well, click Create Project>> (upper right).

Note that at step 1, you could upload data in a standard form from a web address by selecting Get data from Web Addresses (URLs). However, this won’t work for all URLs.

Using Facets

Exploring data by applying multiple filters

Facets are one of the most useful features of OpenRefine and can help both get an overview of the data in a project as well as help you bring more consistency to the data. OpenRefine supports faceted browsing as a mechanism for

A ‘Facet’ groups all the like values that appear in a column, and then allows you to filter the data by these values and edit values across many records at the same time.

One type of Facet is called a ‘Text facet’. This groups all the identical text values in a column and lists each value with the number of records it appears in. The facet information always appears in the left hand panel in the OpenRefine interface.

Here we will use faceting to look for potential errors in data entry in the village column.

  1. Scroll over to the village column.
  2. Click the down arrow and choose Facet > Text facet.
  3. In the left panel, you’ll now see a box containing every unique value in the village column along with a number representing how many times that value occurs in the column.
  4. Try sorting this facet by name and by count. Do you notice any problems with the data? What are they?
  5. Hover the mouse over one of the names in the Facet list. You should see that you have an edit function available.
  6. You could use this to fix an error immediately, and OpenRefine will ask whether you want to make the same correction to every value it finds like that one. But OpenRefine offers even better ways to find and fix these errors, which we’ll use instead. We’ll learn about these when we talk about clustering.

Solution

  • Chirdozo is likely a mis-entry of Chirodzo.
  • Ruca is likely a mis-entry of Ruaca.
  • Ruaca - Nhamuenda and Ruaca-Nhamuenda refer to the same place (differ only by spaces around the hyphen). You might also wonder if both of these are the same as Ruaca. We will see how to correct these misspelled and mistyped entries in a later exercise.
  • The entry 49 is almost certainly an error but you will not be able to fix it by reference to other data.

Exercise

  1. Using faceting, find out how many different interview_date values there are in the survey results.

  2. Is the column formatted as Text or Date?

  3. Use faceting to produce a timeline display for interview_date. You will need to use Edit cells > Common transforms > To date to convert this column to dates.

  4. During what period were most of the interviews collected?

Solution

For the column interview_date do Facet > Text facet. A box will appear in the left panel showing that there are 19 unique entries in this column. By default, the column interview_date is formatted as Text. You can change the format by doing Edit cells > Common transforms > To date. Notice the the values in the column turn green. Doing Facet > Timeline facet creates a box in the left panel that shows a histogram of the number of entries for each date.

Most of the data was collected in November of 2016.

More on Facets

OpenRefine Manual: Facets

As well as ‘Text facets’ Refine also supports a range of other types of facet. These include:

  • Numeric facets
  • Timeline facets (for dates)
  • Custom facets
  • Scatterplot facets

Numeric and Scatterplot facets display graphs instead of lists of values. The numeric facet graph includes ‘drag and drop’ controls you can use to set a start and end range to filter the data displayed. These facets are explored further in Examining Numbers in OpenRefine

Custom facets are a range of different types of facets. Some of the default custom facets are:

  • Word facet - this breaks down text into words and counts the number of records each word appears in
  • Duplicates facet - this results in a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if the value in the selected column is an exact match for a value in the same column in another row
  • Text length facet - creates a numeric facet based on the length (number of characters) of the text in each row for the selected column. This can be useful for spotting incorrect or unusual data in a field where specific lengths are expected (e.g. if the values are expected to be years, any row with a text length more than 4 for that column is likely to be incorrect)
  • Facet by blank - a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if they have no data present in that column. This is useful when looking for rows missing key data.

Using clustering to detect possible typing errors

In OpenRefine, clustering means “finding groups of different values that might be alternative representations of the same thing”. For example, the two strings New York and new york are very likely to refer to the same concept and just have capitalization differences. Likewise, Gödel and Godel probably refer to the same person. Clustering is a very powerful tool for cleaning datasets which contain misspelled or mistyped entries. OpenRefine has several clustering algorithms built in. Experiment with them, and learn more about these algorithms and how they work.

  1. In the village Text Facet we created in the step above, click the Cluster button.
  2. In the resulting pop-up window, you can change the Method and the Keying Function. Try different combinations to see what different mergers of values are suggested.
  3. Select the key collision method and metaphone3 keying function. It should identify two clusters.
  4. Click the Merge? box beside each cluster, then click Merge Selected and Recluster to apply the corrections to the dataset.
  5. Try selecting different Methods and Keying Functions again, to see what new merges are suggested.
  6. You should find that using the default settings, no more clusters are found, for example to merge Ruaca-Nhamuenda with Ruaca or Chirdozo with Chirodzo. (Note that the nearest neighbor method with ppm distance, radius ≥ 4, and block chars ≤ 4 will find these clusters, as well as other settings with levenshtein distance)
  7. To merge these values we will hover over them in the village text facet, select edit, and manually change the names. Change Chirdozo to Chirodzo and Ruaca-Nhamuenda to Ruaca. You should now have four clusters: Chirodzo, God, Ruaca and 49.

Important: If you Merge using a different method or keying function, or more times than described in the instructions above, your solutions for later exercises will not be the same as shown in those exercise solutions.

Different clustering algorithms

The technical details of how the different clustering algorithms work can be found at the link below.

More on clustering

Transforming data

The data in the items_owned column is a set of items in a list. The list is in square brackets and each item is in single quotes. Before we split the list into individual items in the next section, we first want to remove the brackets and the quotes.

  1. Click the down arrow at the top of the items_owned column. Choose Edit Cells > Transform...
  2. This will open up a window into which you can type a GREL expression. GREL stands for General Refine Expression Language. OR_Transform

  3. First we will remove all of the left square brackets ([). In the Expression box type value.replace("[", "") and click OK.

  4. What the expression means is this: Take the value in each cell in the selected column and replace all of the “[” with “” (i.e. nothing - delete).

  5. Click OK. You should see in the items_owned column that there are no longer any left square brackets.

Exercise

Use this same strategy to remove the single quote marks ('), the right square brackets (]), and spaces from the items_owned column.

Solution

  1. value.replace("'", "")
  2. value.replace("]", "")
  3. value.replace(" ", "") You should now have a list of items separated by semi-colons (;).

Now that we have cleaned out extraneous characters from our items_owned column, we can use a text facet to see which items were commonly owned or rarely owned by the interview respondents.

  1. Click the down arrow at the top of the items_owned column. Choose Facet > Custom text facet...
  2. In the Expression box, type value.split(";").
  3. Click OK.

You should now see a new text facet box in the left-hand pane.

Exercise

Which two items are the most commonly owned? Which are the two least commonly owned?

Solution

Select Sort by: count. The most commonly owned items are mobile phone and radio, the least commonly owned are cars and computers.

Exercise

Perform the same clean up steps and customized text faceting for the months_lack_food column. Which month(s) were farmers more likely to lack food?

Solution

All four cleaning steps can be performed by combining .replace statements. The command is: value.replace("[", "").replace("]", "").replace(" ", "").replace("'", "") This can also be done in four separate steps if preferred. November was the most common month for respondents to lack food.

Exercise

Perform the same clean up steps for the months_no_water, liv_owned, res_change, and no_food_mitigation columns. Hint: To reuse a GREL command, click the History tab and then click Reuse next to the command you would like to apply to that column.

Using undo and redo

It’s common while exploring and cleaning a dataset to discover after you’ve made a change that you really should have done something else first. OpenRefine provides Undo and Redo operations to make this easy.

Exercise

  1. Click where it says Undo / Redo on the left side of the screen. All the changes you have made so far are listed here.
  2. Click on the step that you want to go back to, in this case go back several steps to before you had done any text transformation.
  3. Visually confirm that those columns now contain the special characters that we had removed previously.
  4. Notice that you can still click on the later steps to Redo the actions. Before moving on to the next lesson, redo all the steps in your analysis so that all of the columns you modified are lacking in square brackets, spaces, and single quotes.

Trim Leading and Trailing Whitespace

Words with spaces at the beginning or end are particularly hard for we humans to tell from strings without, but the blank characters will make a difference to the computer. We usually want to remove these. As of version 3.4 of OpenRefine, the option to trim leading and trailing whitespaces is present at the moment of importing the data (see image at the top of this page).

If you unchecked that box when importing data, or if leading or trailing whitespaces were introduced while splitting columns, or other operations, OpenRefine also provides a tool to remove blank characters from the beginning and end of any entries that have them.

  1. Edit the village on the first row to introduce a space at the end, set to God .
  2. Create a new text facet for the village column. You should now see two different entries for God, one of those has a trailing whitespace.
  3. To remove the whitespace, choose Edit cells > Common transforms > Trim leading and trailing whitespace.
  4. You should now see only four choices in your text facet again.

Key Points

  • OpenRefine can import a variety of file types.

  • OpenRefine can be used to explore data using filters.

  • Clustering in OpenRefine can help to identify different values that might mean the same thing.

  • OpenRefine can transform the values of a column.


Filtering and Sorting with OpenRefine

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • How can we select only a subset of our data to work with?

  • How can we sort our data?

Objectives
  • Filter to a subset of rows by text filter or include/exclude.

  • Sort table by a column.

  • Sort by multiple columns.

Lesson

Filtering

There are many entries in our data table. We can filter it to work on a subset of the data in the list for the next set of operations. Please ensure you perform this step to save time during the class.

  1. Click the down arrow next to respondent_roof_type > Text filter. A respondent_roof_type facet will appear on the left margin.
  2. Type in mabat and press return. There are 58 matching rows of the original 131 rows (and these rows are selected for the subsequent steps).
  3. At the top, change the view to Show 50 rows. This way you will see most of the matching rows.

Exercise

  1. What roof types are selected by this procedure?
  2. How would you restrict this to only one of the roof types?

Solution

  1. Do Facet > Text facet on the respondent_roof_type column after filtering. This will show that two names match your filter criteria. They are mabatipitched and mabatisloping.
  2. To restrict to only one of these two roof types, you could include more letters in your filter.

Excluding entries

In addition to the simple text filtering we used above, another way to narrow our filter is to include and/or exclude entries in a facet. You will see the include or exclude options if you hover over the name in the facet window.

If you still have your facet for respondent_roof_type, you can use it, or use drop-down menu > Facet > Text facet to create a new facet. Only the entries with names that agree with your Text filter will be included in this facet.

Faceting and filtering look very similar. A good distinction is that faceting gives you an overview description of all of the data that is currently selected, while filtering allows you to select a subset of your data for analysis.

Exercise

Use include / exclude to select only entries from one of these two roof types.

Solution

  1. In the facet (left margin), click on one of the names, such as mabatisloping. Notice that when you click on the name, or hover over it, there are entries to the right for edit and include.
  2. Click include. This will explicitly include this roof type, and exclude others that are not explicitly included. Notice that the option now changes to exclude.
  3. Click include and exclude on the other roof type and notice how the two entries appear and disappear from the table.

Remove the filter before moving on so that you again have the full dataset of 131 records.

Sort

You can sort the data by a column by using the drop-down menu in that column. There you can sort by text, numbers, dates or booleans (TRUE or FALSE values). You can also specify what order to put Blanks and Errors in the sorted results.

If this is your first time sorting this table, then the drop-down menu for the selected column shows Sort.... Select what you would like to sort by (such as numbers). Additional options will then appear for you to fine-tune your sorting.

Exercise

Sort the data by gps_Altitude. Do you think the first few entries may have incorrect altitudes?

Solution

In the gps_Altitude column, select Sort... > numbers and select smallest first. The first few values are all 0. The altitudes are more likely ‘missing’ than incorrect. The survey is delivered by Smartphone with the gps information added automatically by the app. The lack of an altitude value suggests that the smartphone was unable to provide it and it defaulted to 0.

If you try to re-sort a column that you have already used, the drop-down menu changes slightly, to > Sort without the ..., to remind you that you have already used this column. It will give you additional options:

Sorting by multiple columns

You can sort by multiple columns by performing sort on additional columns. The sort will depend on the order in which you select columns to sort. To restart the sorting process with a particular column, check the sort by this column alone box in the Sort pop-up menu.

If you go back to one of the already sorted columns and select > Sort > Remove sort, that column is removed from your multiple sort. If it is the only column sorted, then data reverts to its original order.

Exercise

We discovered in an earlier lesson that the value for one of the village entries was given as 49. This is clearly wrong. By looking at the GPS coordinates for the entries of the other villages can we decide what village the data in that column was collected from?

  1. Sort on gps_Latitude as a number with the smallest first.
  2. Add a sort on gps_Longitude as a number with the smallest first.
  3. Using the drop down arrow on the village column, select Edit column > Move column to end. This will allow you to compare village names with GPS coordinates.
  4. Scroll through the entries until you find village 49. Can you tell from it’s GPS coordinates which village it belong to?
  5. Now sort only by interview_date as date. Move the village column to the start of the table. Does the row where village is 49 group with one particular village? Is it the same village as when comparing GPS coordinates?

Solution

The interview data for that row is in a small cluster of Chirodzo interviews when sorting by GPS coordinates. When sorting by interview date, it is also with Chirodzo interviews. In fact, only Chirodzo had interviews conducted on that date.

Perform a text facet on the village column and change 49 to the village name that was determined in the previous exercise. You should now have only three village names.

Key Points

  • OpenRefine provides a way to sort and filter data without affecting the raw data.


Examining Numbers in OpenRefine

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • How can we convert a column from one data type to another?

  • How can we find non-numeric values in a column that should contain numbers?

Objectives
  • Transform a text column into a number column.

  • Identify and modify non-numeric values in a numeric column using facets.

Lesson

Numbers

When a table is imported into OpenRefine, all columns are treated as containing text values. We saw earlier how we can sort column values as numbers, but this does not change the cells in a column from text to numbers. Rather, this interprets the values as numbers for the purposes of sorting but keeps the underlying data type as is. We can, however, transform columns from text to other data types (e.g. number or date) using the Edit cells > Common transforms feature. Here we will experiment changing columns to numbers and see what additional capabilities that grants us.

Be sure to remove any Text filter facets you have enabled from the left panel so that we can examine our whole dataset. You can remove an existing facet by clicking the x in the upper left of that facet window.

To transform cells in the years_farm column to numbers, click the down arrow for that column, then Edit cells > Common transforms… > To number. You will notice the years_farm values change from left-justified to right-justified, and black to green in color.

Exercise

Transform three more columns, no_membrs, years_liv, and buildings_in_compound, from text to numbers. Can all columns be transformed to numbers? - Try it with village for example.

Solution

Only observations that include only numerals (0-9) can be transformed to numbers. If you apply a number transformation to a column that doesn’t meet this criteria, and then click the Undo / Redo tab, you will see a step that starts with Text transform on 0 cells. This means that the data in that column was not transformed.

Numeric facet

Sometimes non-numeric values or blanks occur in a column where numbers are expected. Such values may represent errors in data entry, and we want to find them. We can do that with a Numeric facet.

Exercise

  1. For a column you transformed to numbers, edit one or two cells, replacing the numbers with text (such as abc) or blank (no number or text). You will need to change the Data type to text using the drop-down menu.
  2. Use the column pulldown menu to apply a numeric facet to the column you edited. The facet will appear in the left panel.
  3. Notice that there are several checkboxes in this facet: Numeric, Non-numeric, Blank, and Error. Below these checkboxes are counts of the number of cells in each category. You should see checks for Non-numeric and Blank if you changed some values.
  4. Experiment with checking or unchecking these boxes to select subsets of your data.

When done examining the numeric data, remove this facet by clicking the x in the upper left corner of its panel. Note that this does not undo the edits you made to the cells in this column. Use the Undo / Redo function to reverse these changes.

Key Points

  • OpenRefine also provides ways to to examine and clean numerical data.


Using scripts

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • How can we document the data-cleaning steps we’ve applied to our data?

  • How can we apply these steps to additional data sets?

Objectives
  • Describe how OpenRefine generates JSON code representing the work done in an analysis session.

  • Demonstrate ability to export this JSON code from OpenRefine.

  • Demonstrate ability to import a JSON code file to apply the analysis to another dataset.

Lesson

How OpenRefine records what you have done

As you conduct your data cleaning and preliminary analysis, OpenRefine saves every change you make to the dataset. These changes are saved in a format known as JSON (JavaScript Object Notation). You can export this JSON script and apply it to other data files. If you had 20 files to clean, and they all had the same type of errors (e.g. misspellings, leading white spaces), and all files had the same column names, you could save the JSON script, open a new file to clean in OpenRefine, paste in the script and run it. This gives you a quick way to clean all of your related data.

Saving your work as a script

  1. In the Undo / Redo section, click Extract..., and select the steps that you want to apply to other datasets by clicking the check boxes.

History

  1. Copy the code from the right hand panel and paste it into a text editor (like NotePad on Windows or TextEdit on Mac). Make sure it saves as a plain text file. In TextEdit, do this by selecting Format > Make plain text and save the file as a .txt file.

Importing a script to use against another dataset

Let’s practice running these steps on a new dataset. We’ll test this on an uncleaned version of the dataset we’ve been working with.

  1. Start a new project in OpenRefine using the messy dataset you downloaded before. Give the project a new name.
  2. Click the Undo / Redo tab > Apply and paste in the contents of .txt file with the JSON code.
  3. Click Perform operations. The dataset should now be the same as your other cleaned dataset.

For convenience, we used the same dataset. In reality you could use this process to clean related datasets. For example, data that you had collected over different fieldwork periods or data that was collected by different researchers (provided everyone uses the same column headings). The data in this file was generated from an eSurvey system with the actual survey being delivered centrally to a smartphone, so the column headings are pretty much guaranteed to be the same.

Key Points

  • All changes are being tracked in OpenRefine, and this information can be used for scripts for future analyses or reproducing an analysis.


Exporting and Saving Data from OpenRefine

Overview

Teaching: 10 min
Exercises: 5 min
Questions
  • How can we save and export our cleaned data from OpenRefine?

Objectives
  • Save an OpenRefine project.

  • Export cleaned data from an OpenRefine project.

Lesson

Saving and Exporting a Project

In OpenRefine you can save or export the project. This means you’re saving the data and all the information about the cleaning and data transformation steps you’ve done. Once you’ve saved a project, you can open it up again and be just where you stopped before.

Saving

By default OpenRefine is saving your project continuously. If you close OpenRefine and open it up again, you’ll see a list of your projects. You can click on any one of them to open it up again.

Exporting

You can also export a project. This is helpful, for instance, if you wanted to send your raw data and cleaning steps to a collaborator, or share this information as a supplement to a publication.

  1. Click the Export button in the top right and select OpenRefine project archive to file.
  2. A tar.gz file will download to your default Download directory. Depending on your browser you may have to confirm that you want to save the file. The tar.gz extension tells you that this is a compressed file. The downloaded tar.gz file is actually a folder of files which have been compressed. Linux and Mac machines will have software installed to automatically expand this type of file when you double-click on it. For Windows based machines you may have to install a utility like ‘7-zip’ in order to expand the file and see the files in the folder.
  3. After you have expanded the file look at the files that appear in this folder. What files are here? What information do you think these files contain?

Solution

You should see:

  • a history folder which contains a collection of zip files. Each of these files itself contains a change.txt file. These change.txt files are the records of each individual transformation that you did to your data.
  • a data.zip file. When expanded, this zip file includes a file called data.txt which is a copy of your raw data. You may also see other files.

You can import an existing project into OpenRefine by clicking Open... in the upper right > Import Project and selecting the tar.gz project file. This project will include all of the raw data and cleaning steps that were part of the original project.

Exporting Cleaned Data

You can also export just your cleaned data, rather than the entire project.

  1. Click Export in the top right and select the file type you want to export the data in. Tab-separated values (tsv) or Comma-separated values (csv) would be good choices.
  2. That file will be exported to your default Download directory. That file can then be opened in a spreadsheet program or imported into programs like R or Python, which we’ll be discussing later in our workshop.

Remember from our lesson on Spreadsheets that using widely-supported, non-proprietary file formats like tsv or csv improves the ability of yourself and others to use your data.

Key Points

  • Cleaned data or entire projects can be exported from OpenRefine.

  • Projects can be shared with collaborators, enabling them to see, reproduce and check all data cleaning steps you performed.


Other Resources in OpenRefine

Overview

Teaching: 5 min
Exercises: 5 min
Questions
  • What other resources are available for working with OpenRefine?

Objectives
  • Understand that there are many online resources available for more information on OpenRefine.

  • Identify other resources about OpenRefine.

Lesson

Using online resources to get help with OpenRefine

OpenRefine is more than a simple data cleaning tool. People are using it for all sorts of activities. Here are some other resources that might prove useful.

OpenRefine has its own web site with documentation and a book:

In addition, see these other useful resources:

There are more advanced uses of OpenRefine, such as bringing in column or cell data using web locators (URLs or APIs). The links above can give you a start on your journey.

Exercise

Visit one of these sites and share what you find with another person.

Key Points

  • Other examples and resources online are good for learning more about OpenRefine.