Summary and Schedule
OpenRefine is a powerful, free, open source tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data.
In scientific research, data is rarely perfect. Information is often collected from many different sources, such as archives, museums, or fieldwork, and combined into a single dataset. This can lead to a variety of problems: names spelled in different ways, missing or inconsistent values, dates in different formats, or duplicate entries. Sometimes data is entered manually, which can introduce typos or errors. Other times, data comes from automated exports or digitization projects, which may not follow a consistent structure. As a result, researchers often face the challenge of working with data that is not ready for analysis.
Before you can draw meaningful conclusions from your data, it is essential to clean and organize it. This process helps ensure that your analysis is accurate and reliable. OpenRefine is designed to make this step easier, even for those with no technical background. With OpenRefine, you can quickly identify and fix errors, standardize formats, and prepare your data for further research.
This lesson was especially developed for researchers in the humanities who want to learn how to improve the quality of their research data. It is designed for participants with no previous experience.
Learning objectives
By the end of this lesson, you will be able to:
- start a new OpenRefine project and import data
- work with a subset of your data with the help of facets and filters
- correct errors and reduce variations in your data through facets and clustering
- transform your data for future analysis
- undo and redo your cleaning steps
- enrich your data with the help of a reconciliation service
- save and export cleaned data as well as data cleaning steps
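To give a feel for what clustering (one of the objectives above) does with spelling variants, here is a minimal Python sketch of a key-collision approach. It is a simplified approximation of the "fingerprint" idea, not OpenRefine's actual implementation, and the sample names are made up for illustration. You do not need to write any code to follow this lesson.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key so that spelling variants collide."""
    # Trim, lowercase, and split accented letters into base letter + accent.
    text = unicodedata.normalize("NFKD", value.strip().lower())
    # Drop the accent marks, keeping the base letters.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove punctuation, keeping letters, digits, and whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Sort the unique tokens so that word order no longer matters.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values sharing a fingerprint; keep groups with real variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical sample values, not taken from the lesson dataset.
names = ["Vincent van Gogh", "van Gogh, Vincent", " vincent van gogh ", "Claude Monet"]
print(cluster(names))
```

The three "van Gogh" variants end up in one group because they share the same key, while "Claude Monet" has no variants and is left out.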
Setup Instructions | Download files required for the lesson |
Duration: 00h 00m | 1. Introduction to OpenRefine | What is OpenRefine and how can it help with messy data in your research? What kinds of tasks and analyses can you perform with OpenRefine? |
Duration: 00h 00m | 2. Importing Data and Getting to Know the OpenRefine User Interface | How do I start a new project in OpenRefine? How do I import a CSV file? What options and settings are available during import? |
Duration: 00h 00m | 3. Exploring Data | What options does OpenRefine offer for data exploration? What is a facet and how does it help me explore data? How do facets differ from filters? |
Duration: 00h 00m | 4. Custom Facets and GREL | When do we need a custom facet instead of a built-in one? How can GREL help us filter or transform data more flexibly? |
Duration: 00h 00m | 5. Transforming Data | How can we clean and standardize ArtistBio values in OpenRefine? What is the difference between finding issues (facets) and fixing them (transformations & clustering)? |
Duration: 00h 00m | 6. Reconciling Data with External Data Sources | What does it mean to reconcile data? Why is reconciliation useful in humanities research? How can we use OpenRefine to enrich our dataset with identifiers and structured information? |
Duration: 00h 35m | 7. Undo, Redo, and Exporting Workflows | How can we go back to an earlier step if we realize we made a mistake? How can we save our cleaning process to repeat it later or share it with colleagues? |
Duration: 00h 35m | 8. Resources for Future Self-study | TODO |
Duration: 00h 35m | Finish |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
Getting started
To follow this lesson, you need to install OpenRefine on your computer and download a data file.
Dataset
The dataset used in this lesson is a subset of the Metropolitan Museum of Art’s Open Access Initiative dataset, which contains information about the objects in the museum (e.g. title, culture, artist biography). For this lesson, the number of columns has been reduced and the data has been intentionally ‘messed up’ a little.
Download the CSV data file to your computer.
Software Setup
For this lesson you will need OpenRefine and a web browser. Note: OpenRefine is a Java program that runs on your own machine (not in the cloud). It uses your browser as its interface, but no internet connection is needed to run it.
- Check that you have Firefox, Edge, Opera, Chrome, Chromium, or Safari installed and set as your default browser. OpenRefine runs in your default browser. It will not run correctly in Internet Explorer, and it occasionally has issues in Firefox.
- Download the software from openrefine.org/download and check below for further instructions depending on your operating system.
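Because OpenRefine runs locally, your browser talks to an address on your own computer, not to a website on the internet. The optional Python sketch below checks whether that local address responds. It assumes OpenRefine is listening on its default address, http://127.0.0.1:3333/. If your installation uses a different port, change the URL. The function name `is_running` is made up for this example.

```python
import urllib.error
import urllib.request

# Assumed default address of a locally running OpenRefine instance.
DEFAULT_URL = "http://127.0.0.1:3333/"


def is_running(url: str = DEFAULT_URL, timeout: float = 2.0) -> bool:
    """Return True if something answers with HTTP 200 at the given URL."""
    # The address is local, so bypass any proxy configured on the system.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat as not running.
        return False


if __name__ == "__main__":
    if is_running():
        print("OpenRefine appears to be running at", DEFAULT_URL)
    else:
        print("Nothing is answering at", DEFAULT_URL)
```

A `True` result only shows that some program answers at that address. Opening the address in your browser confirms that it is OpenRefine.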
Getting help
If you encounter problems installing or running OpenRefine, a good source of support is the OpenRefine mailing list and user forum. When searching, include your operating system (Windows, macOS, or Linux) to find the answers most relevant to your issue.
You may also want to check the Stack Overflow OpenRefine tag.
For more details about installation, upgrades, and configuration, the OpenRefine installation manual is a good resource.