Overview
Teaching: 10 min
Exercises: 0 min
Questions
What is OpenRefine useful for?
Objectives
Describe OpenRefine’s uses and applications.
Differentiate data cleaning from data organization.
Experiment with OpenRefine’s user interface.
Locate helpful resources to learn more about OpenRefine.
Motivations for the OpenRefine Lesson
- Data is often very messy, and this tool saves a lot of time on cleaning headaches.
- Data cleaning steps often need to be repeated across multiple files, so it is important to know what you did to your data. OpenRefine records your actions, which makes it easy to repeat them on similarly structured data and speeds up repetitive tasks by replaying previous actions on multiple datasets.
- Additionally, journals, granting agencies, and other institutions are requiring documentation of the steps you took when working with your data. With OpenRefine, you can capture all actions applied to your raw data and share them with your publication as supplemental material.
- Any operation that changes the data in OpenRefine can be easily reversed or undone.
- Some concepts, such as clustering algorithms, are quite complex, but OpenRefine makes it easy to introduce them, use them, and show their power.
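To make the idea of clustering less abstract, here is a minimal Python sketch of key-collision "fingerprint" clustering, one of the methods described in the OpenRefine documentation. It is a simplified illustration, not OpenRefine's actual implementation, and the publisher names are invented for the example. Each value is reduced to a normalized key (trimmed, lowercased, stripped of accents and punctuation, with its words de-duplicated and sorted), and values that share a key are grouped as likely variants of the same thing.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key so that near-duplicate spellings collide."""
    # Trim surrounding whitespace and lowercase the text.
    text = value.strip().lower()
    # Decompose accented characters, then drop anything that is not ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split into words, drop duplicate words, sort them, and rejoin.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values by fingerprint; keep groups with more than one distinct spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical publisher names, invented for this example.
names = [
    "Oxford University Press",
    "oxford university press.",
    "Press, Oxford University",
    "Elsevier",
]
print(cluster(names))
```

The three Oxford variants land in one cluster because they all reduce to the same key, while "Elsevier" has no variants and is left out. In OpenRefine you would review such a cluster and choose one spelling to merge the values into.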
Note: You must export your modified dataset to a new file, because OpenRefine does not write back into your original sources. Your changes are kept in the OpenRefine project, but if you don’t export them they will not appear in any data file.
Before we get started
The following setup is necessary before we can get started (see the setup instructions).
Do you need help with any of the following?
- Download and install OpenRefine 3.4.1 from https://openrefine.org/download.html
- Download this data file and save it to your desktop
- If OpenRefine does not open automatically in your browser after you install and start it, point your browser at http://127.0.0.1:3333/ or http://localhost:3333/ to open the interface.
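If the page does not load, it can help to check whether anything is answering at those addresses. The following Python sketch is one way to do that, assuming OpenRefine is listening on port 3333 as in the addresses above. The helper names `openrefine_urls` and `is_reachable` are made up for this example and are not part of OpenRefine.

```python
import http.client
import urllib.error
import urllib.request

DEFAULT_PORT = 3333  # port used in the addresses given in this lesson


def openrefine_urls(port=DEFAULT_PORT):
    """Return the two local addresses where the interface is expected."""
    return [f"http://127.0.0.1:{port}/", f"http://localhost:{port}/"]


def is_reachable(url, timeout=2.0):
    """Return True if an HTTP server answers at url without an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or HTTP error: treat as not running.
        return False


if __name__ == "__main__":
    for url in openrefine_urls():
        status = "responding" if is_reachable(url) else "not responding"
        print(f"{url} is {status}")
```

If both addresses report "not responding", OpenRefine is probably not running yet. Start it again and re-run the check before opening the address in your browser.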
What is OpenRefine?
- OpenRefine is a Java program that runs on your machine (not in the cloud): it is a desktop application that uses your web browser as a graphical interface. No internet connection is needed, and none of the data or commands you enter in OpenRefine are sent to a remote server.
- OpenRefine does not modify your original dataset. All actions are easily reversed in OpenRefine, and you can capture all the actions applied to your data and share this documentation with your publication as supplemental material.
- OpenRefine saves as you go. You can return to the project at any time to pick up where you left off or export your data to a new file.
- OpenRefine can be used to standardise and clean data across your file.
It can also help you:
- Get an overview of a data set
- Resolve inconsistencies in a data set
- Split data up into more granular parts
- Match local data up to other data sets
- Enhance a data set with data from other sources
- Save a set of data cleaning steps to replay on multiple files
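To give a sense of what a saved set of cleaning steps can look like, the Python sketch below reads a small operation history and lists its steps. The sample is hand-written for illustration and only modeled on the JSON that OpenRefine lets you extract from a project's Undo / Redo history. The column names, the expression, and the exact fields shown are assumptions, so check them against an export from your own project.

```python
import json

# Hand-written sample modeled on an extracted operation history.
# Column names and the expression are hypothetical.
SAMPLE_HISTORY = """
[
  {
    "op": "core/text-transform",
    "columnName": "Publisher",
    "expression": "value.trim()",
    "description": "Text transform on cells in column Publisher using expression value.trim()"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Date",
    "newColumnName": "PublicationDate",
    "description": "Rename column Date to PublicationDate"
  }
]
"""


def summarize_history(history_json):
    """Return one numbered, human-readable line per recorded operation."""
    operations = json.loads(history_json)
    return [
        f"{number}. {operation['op']}: {operation['description']}"
        for number, operation in enumerate(operations, start=1)
    ]


for line in summarize_history(SAMPLE_HISTORY):
    print(line)
```

A summary like this is a readable record of what was done to the data, and the JSON itself is what you would apply to another project with the same column structure to replay the steps.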
OpenRefine is a powerful, free, and open source tool with a large and growing community of practice. More help can be found at https://openrefine.org.
More Information on OpenRefine
You can find out a lot more about OpenRefine in the official user manual at docs.openrefine.org. There is a Google Group that can answer a lot of beginner questions and problems. OpenRefine recipes, scripts, projects, and extensions are also available for you to copy into your OpenRefine instance and run on your dataset.
- Open source (source on GitHub).
- A large and growing community, from novice to expert, ready to help.
OpenRefine is a powerful, free and open source tool that can be used for data cleaning.
OpenRefine will automatically track any steps you take in working with your data.