Introduction to OpenRefine
Last updated on 2026-02-24
Estimated time: 0 minutes
Overview
Questions
- What is OpenRefine and how can it help with messy data?
- What kinds of tasks and analyses can you perform with OpenRefine?
Objectives
- Identify typical problems in cultural heritage datasets
- Describe how OpenRefine supports exploratory data cleaning
Discussion: Do you often work with digital data in your research, your studies or your work?
For example, cleaning it, processing it, converting it or analyzing it? Do you have an idea what problems could arise when working with data, especially external data?
Before you can answer research questions, you first need to understand and clean your data. In the humanities, you might work with lists of artworks, artists, historical events, or other information collected from museums, archives, or fieldwork. Often, this data is stored in spreadsheets or tables, and at first glance, it may seem tidy. But as you look closer, you may notice small issues: names spelled in different ways, missing details, or dates written in various formats. These challenges are common and can make it difficult to analyze or share your data.
OpenRefine is a free, open-source tool designed to help you work with messy data. Think of it as a workbench for your data: a place where you can clean, organize, and explore information, even if you have no technical background. OpenRefine runs locally on your computer and opens in your web browser, providing a user-friendly interface that guides you through each step. Working in a browser might suggest otherwise, but nothing from your dataset is sent to the internet.
Our Dataset
Throughout this lesson, you will use a sample dataset from the Metropolitan Museum of Art’s (The Met) Open Access Initiative. The dataset includes artwork titles, creators, production dates, materials, and locations — fields that are typical for cultural heritage collections. Even if you have never worked with data before, you will see how OpenRefine can make your research easier.
In many digital humanities projects, a significant amount of time is spent preparing and cleaning data before analysis. It’s not always the most exciting part, but it is essential. With OpenRefine, you will learn how to think about data organization and develop practices for more effective data cleaning. By the end of this lesson, you will be able to clean, explore, and analyze structured cultural heritage data. You don’t need to be a technical expert; you only need to be curious and willing to try something new.
Challenge: Spot the Messy Data
Look at the small sample below. It contains only a few records from
The Met dataset you will work with later.
Identify anything that might cause problems during analysis.
| Title | Artist Display Name | Object Date | Object Name | City | Tags | Medium |
|---|---|---|---|---|---|---|
| Tile | J. and J. G. Low Art Tile Works | ca. 1884 | Tile | Chelsea | | Earthenware |
| Cabrette | Joseph Bechonnet | 19th century | Cabrette | Effiat, Puy-de-Dôme | Animals | various material |
| “A weaver of dreams” | Margaret Neilson Armstrong; G.P. Putnam & Co., New York; Myrtle Reed | 1911 | | New York | | |
| Design for a shawl with scrolling paisley leaves and Indian flowers | Fleury Chavant; Georges Schlatter; J.E.G.; Herault | [after 1844] | Book Print Ornament, Architectur | Paris | | Lithograph |
| Nouveau Cayer de Paysages à l’usage des personnes qui apprennent le Dessin | J. B. Crépy | 1781 | Book | Paris, France | | Etching, printed in red |
Questions to discuss:
- What inconsistencies or formatting issues can you spot?
- Which values might make filtering or sorting difficult?
- Are there entries where you would want to investigate further before
analysis?
- Why might these issues matter later in OpenRefine?
- Missing data are sometimes represented by blank cells, N/A or 0.
- Is the title the same as the object name? In some rows they differ; in others they are the same.
- The Artist Display Name sometimes contains more than one person or organization.
- Object dates are often not given as a specific year.
- The cities in the table could be ambiguous: is it Chelsea in the UK or Chelsea in the US? You may be able to derive a unique location from information about the artist.
- Paris and Paris, France refer to the same place, but are recorded differently.
- Some titles are enclosed in quotation marks, while others are not.
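You do not need to write any code for this lesson, but for curious readers, two of the observations above can be sketched in a few lines of Python. This is only an illustration using values copied from the sample table. The function names `count_values` and `split_multivalued` are made up for this sketch and are not part of OpenRefine.

```python
from collections import Counter

# Values copied from the City column of the sample table.
cities = ["Chelsea", "Effiat, Puy-de-Dôme", "Paris", "Paris, France"]

# One Artist Display Name cell that holds several names.
artists_cell = "Fleury Chavant; Georges Schlatter; J.E.G.; Herault"


def count_values(values):
    """Count how often each distinct value occurs (hypothetical helper)."""
    return Counter(values)


def split_multivalued(cell, separator=";"):
    """Split one cell into a list of trimmed names (hypothetical helper)."""
    return [part.strip() for part in cell.split(separator) if part.strip()]


city_counts = count_values(cities)
artist_names = split_multivalued(artists_cell)

# "Paris" and "Paris, France" are counted as two different values,
# although they refer to the same place.
print(city_counts)

# The single cell actually contains four separate names.
print(artist_names)
```

Counting distinct values makes the two spellings of Paris visible side by side, and splitting on the semicolon shows how many creators hide in a single cell.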
Advantages of OpenRefine
With OpenRefine, you can import your data, discover patterns, fix mistakes, and transform your information so it’s ready for analysis or sharing. You don’t need to know how to code or use complicated software. OpenRefine is built for researchers who want to focus on their work, not on technical details.
One of the strengths of OpenRefine is its ability to help you explore your data and perform simple analysis right from the start. You can quickly filter and sort your data, group similar entries, and visualize distributions to spot trends or outliers. This makes it easy to get a sense of your dataset before diving deeper into research questions. You can also use built-in functions to split or merge columns, remove duplicates, and transform data formats, making your information more consistent and reliable.
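To give a rough idea of what "grouping similar entries" can mean, here is a minimal Python sketch of a key-based grouping approach. It is a simplified illustration of the general idea, not OpenRefine's actual implementation. The functions `fingerprint` and `group_similar` are names chosen for this sketch, and the spelling variants of "J. B. Crépy" are invented for the example.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a simplified comparison key for a text value."""
    text = value.strip().lower()
    # Decompose accented characters and drop the accent marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation such as periods and commas.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Sort the unique words so that word order no longer matters.
    return " ".join(sorted(set(text.split())))


def group_similar(values):
    """Group values that share a key; keep only groups with several spellings."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return {key: vals for key, vals in groups.items() if len(set(vals)) > 1}


# "J. B. Crépy" and "Joseph Bechonnet" come from the sample table;
# the other two spellings are invented variants for illustration.
names = ["J. B. Crépy", "j b crepy", "Crépy, J. B.", "Joseph Bechonnet"]
clusters = group_similar(names)
print(clusters)
```

All three spellings of Crépy end up under the same key, so they can be reviewed together and merged into one form. In OpenRefine you do this through the interface, without writing any code.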
OpenRefine supports a wide range of data formats, including CSV, Excel, and JSON, and can connect to online sources and databases. It also allows you to match your data against external databases, such as Wikidata, to enrich your dataset with additional information. Because OpenRefine is open source, it can be extended with add-ons and custom scripts, giving you even more possibilities. The active community around OpenRefine has developed many plugins that add new features, such as connecting to other data sources, exporting to different formats, or automating repetitive tasks.
Key Points
- OpenRefine is a free, open-source tool for cleaning, organizing, and exploring messy data.
- You can easily import, filter, sort, and analyze your data, even without technical experience.
- OpenRefine supports many data formats and can be extended with add-ons and custom scripts for even more possibilities.
- Using OpenRefine helps you prepare your data for analysis, supporting transparent and reproducible research practices.