Scripts from OpenRefine
Overview
Teaching: 10 min
Exercises: 5 min
Questions
How can we document the data-cleaning steps we’ve applied to our data?
How can we apply these steps to additional data sets?
Objectives
Describe how OpenRefine generates JSON code.
Demonstrate ability to export JSON code from OpenRefine.
Save JSON code from an analysis.
Apply saved JSON code to an analysis.
- In the Undo / Redo section, click Extract..., and select the steps that you want to apply to other datasets by clicking the check boxes.
- Copy the code from the right-hand panel and paste it into a text editor (like Notepad on Windows or TextEdit on Mac). Make sure it saves as a plain text file. In TextEdit, do this by selecting Make plain text, then save the file as a txt file.
Let’s practice running these steps on a new dataset. We’ll test this on an uncleaned version of the dataset we’ve been working with.
- Download an uncleaned version of the dataset: https://ndownloader.figshare.com/files/7823341 or use the version of the raw dataset you saved to your computer.
- Start a new project in OpenRefine with this file and name it something different from your existing project.
- Click the Undo / Redo tab > Apply and paste in the contents of the txt file with the JSON code.
- Click Perform operations. The dataset should now be the same as your other cleaned dataset.
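Before pasting a saved script into OpenRefine, it can help to confirm that the text file still holds valid JSON and to see which steps it contains. The following is a minimal sketch in Python, not part of OpenRefine itself. The script text, the column names (`scientificName`, `yr`) and the helper `summarize_script` are made up for illustration; a real script would be read from your own txt file.

```python
import json

# Hypothetical example of text copied from the Extract... dialog.
# The column names and expressions here are invented for illustration.
saved_script = """
[
  {
    "op": "core/text-transform",
    "engineConfig": {"facets": [], "mode": "row-based"},
    "columnName": "scientificName",
    "expression": "value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column scientificName using expression value.trim()"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "yr",
    "newColumnName": "year",
    "description": "Rename column yr to year"
  }
]
"""


def summarize_script(text):
    """Parse the saved JSON and return one summary line per recorded step."""
    # json.loads raises json.JSONDecodeError if the text is not valid JSON,
    # for example if the editor saved it as rich text.
    operations = json.loads(text)
    if not isinstance(operations, list):
        raise ValueError("expected a JSON list of operations")
    return [
        f"{number}. {op.get('op', '?')}: {op.get('description', '')}"
        for number, op in enumerate(operations, start=1)
    ]


summary = summarize_script(saved_script)
for line in summary:
    print(line)
```

If the file was accidentally saved as rich text rather than plain text, the parse step fails with an error instead of printing the list of steps.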
For convenience, we used the same dataset. In reality you could use this process to clean related datasets. For example, data that you had collected over different fieldwork periods or data that was collected by different researchers (provided everyone uses the same column headings).
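Because a saved script refers to columns by name, one way to check whether a related dataset is compatible is to compare the column names used in the script with the header row of the new file. The sketch below is an illustration under stated assumptions, not an OpenRefine feature: it assumes that operations name their input columns under the keys `columnName`, `oldColumnName` or `baseColumnName`, and that new columns appear under `newColumnName`. The helper name `missing_columns` and the sample data are hypothetical.

```python
import csv
import io
import json

# Assumption: these keys hold the names of columns an operation reads from.
INPUT_KEYS = ("columnName", "oldColumnName", "baseColumnName")


def missing_columns(script_text, csv_file):
    """Return sorted column names the script needs but the dataset lacks."""
    operations = json.loads(script_text)
    known = set(next(csv.reader(csv_file)))  # header row of the new dataset
    missing = set()
    for op in operations:
        for key in INPUT_KEYS:
            if key in op and op[key] not in known:
                missing.add(op[key])
        # Columns created by an earlier step are available to later steps.
        if "newColumnName" in op:
            known.add(op["newColumnName"])
    return sorted(missing)


# Hypothetical script and dataset header, for illustration only.
example_script = json.dumps([
    {"op": "core/column-rename", "oldColumnName": "yr", "newColumnName": "year"},
    {"op": "core/text-transform", "columnName": "year", "expression": "value.toNumber()"},
    {"op": "core/text-transform", "columnName": "scientificName", "expression": "value.trim()"},
])
example_csv = io.StringIO("recordID,yr,species\n1,1977,DM\n")

result = missing_columns(example_script, example_csv)
print(result)
```

An empty list suggests the headings match. Any names that are listed point to columns that would need renaming before the script is applied.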
Now that you know how scripts work, you may wonder how to use them in your own scientific research. For inspiration, you can read more about the successful application of reproducible science principles in archaeology and marine ecology:
- Marwick et al. (2017) Computational Reproducibility in Archaeological Research: Basic Principles and a Case Study of Their Implementation
- Stewart Lowndes et al. (2017) Our path to better science in less time using open data science tools
All changes are being tracked in OpenRefine, and this information can be used for scripts for future analyses or reproducing an analysis.
Scripts can (and should) be published together with the dataset as part of the digital appendix of the research output.