Introduction to OpenRefine

Last updated on 2026-02-24

Overview

Questions

  • What is OpenRefine and how can it help with messy data?
  • What kinds of tasks and analyses can you perform with OpenRefine?

Objectives

  • Identify typical problems in cultural heritage datasets
  • Describe how OpenRefine supports exploratory data cleaning

Discussion

Discussion: Do you often work with digital data in your research, your studies or your work?

For example, do you clean, process, convert, or analyze it? What problems do you think could arise when working with data, especially data from external sources?

Before you can answer research questions, you first need to understand and clean your data. In the humanities, you might work with lists of artworks, artists, historical events, or other information collected from museums, archives, or fieldwork. Often, this data is stored in spreadsheets or tables, and at first glance, it may seem tidy. But as you look closer, you may notice small issues: names spelled in different ways, missing details, or dates written in various formats. These challenges are common and can make it difficult to analyze or share your data.

OpenRefine is a free, open-source tool designed to help you work with messy data. Think of OpenRefine as a workbench for your data, a place where you can clean, organize, and explore information, even if you have no technical background. OpenRefine runs locally on your computer and opens in your web browser, providing a user-friendly interface that guides you through each step. Working in a web browser might suggest that your data goes online, but nothing from your dataset is sent to the internet unless you choose to connect to an external service.

Our Dataset


Throughout this lesson, you will use a sample dataset from the Metropolitan Museum of Art’s (The Met) Open Access Initiative. The dataset includes artwork titles, creators, production dates, materials, and locations — fields that are typical for cultural heritage collections. Even if you have never worked with data before, you will see how OpenRefine can make your research easier.

In many digital humanities projects, a significant amount of time is spent preparing and cleaning data before analysis. It’s not always the most exciting part, but it is essential. With OpenRefine, you will learn how to think about data organization and develop practices for more effective data cleaning. By the end of this lesson, you will be able to clean, explore, and analyze structured cultural heritage data. You don’t need to be a technical expert, just curious and willing to try something new.

Discussion

Challenge: Spot the Messy Data

Look at the small sample below. It contains only a few records from The Met dataset you will work with later.
Identify anything that might cause problems during analysis.

Title | Artist Display Name | Object Date | Object Name | City | Tags | Medium
------|---------------------|-------------|-------------|------|------|-------
Tile | J. and J. G. Low Art Tile Works | ca. 1884 | Tile | Chelsea | | Earthenware
Cabrette | Joseph Bechonnet | 19th century | Cabrette | Effiat, Puy-de-Dôme | Animals | various material
“A weaver of dreams” | Margaret Neilson Armstrong; G.P. Putnam & Co., New York; Myrtle Reed | 1911 | | New York | |
Design for a shawl with scrolling paisley leaves and Indian flowers | Fleury Chavant; Georges Schlatter; J.E.G.; Herault | [after 1844] | Book Print Ornament, Architectur | Paris | | Lithograph
Nouveau Cayer de Paysages à l’usage des personnes qui apprennent le Dessin | J. B. Crépy | 1781 | Book | Paris, France | | Etching, printed in red

Questions to discuss:

  1. What inconsistencies or formatting issues can you spot?
  2. Which values might make filtering or sorting difficult?
  3. Are there entries where you would want to investigate further before analysis?
  4. Why might these issues matter later in OpenRefine?

Some issues you might notice:

  • Missing data are sometimes represented by blank cells, N/A or 0.
  • Is the title the same as the object name? In some rows they differ; in others they are the same.
  • The Artist Display Name sometimes lists more than one person or organization in a single cell.
  • Object dates are often not given as a specific year.
  • The cities in the table can be ambiguous: is it Chelsea in the UK or Chelsea in the US? You may be able to derive a unique location from information about the artist.
  • Paris and Paris, France refer to the same place, but are recorded differently.
  • Some titles are enclosed in quotation marks, while others are not.
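
You do not need to write code to follow this lesson, but for readers who are curious, two of the issues above can be expressed as simple checks. The following is a minimal Python sketch, not part of OpenRefine: the field names `artist` and `date` and the helper functions are made up for this illustration, and the values are copied from the sample table.

```python
import re

# Values copied from the sample table; the short field names
# "artist" and "date" are assumptions made for this illustration.
records = [
    {"artist": "J. and J. G. Low Art Tile Works", "date": "ca. 1884"},
    {"artist": "Joseph Bechonnet", "date": "19th century"},
    {"artist": "Margaret Neilson Armstrong; G.P. Putnam & Co., New York; Myrtle Reed",
     "date": "1911"},
    {"artist": "Fleury Chavant; Georges Schlatter; J.E.G.; Herault",
     "date": "[after 1844]"},
    {"artist": "J. B. Crépy", "date": "1781"},
]


def is_plain_year(value):
    """Return True only if the value is exactly four digits, e.g. '1911'."""
    return re.fullmatch(r"\d{4}", value.strip()) is not None


def split_names(value):
    """Split a cell on semicolons and drop empty pieces."""
    return [name.strip() for name in value.split(";") if name.strip()]


def describe_issues(record):
    """List the issues this sketch knows how to detect in one record."""
    issues = []
    if not is_plain_year(record["date"]):
        issues.append("date is not a plain year")
    if len(split_names(record["artist"])) > 1:
        issues.append("more than one name in artist field")
    return issues


for record in records:
    print(record["date"], "->", describe_issues(record))
```

Checks like these only flag the rows that need attention; deciding what "ca. 1884" or "[after 1844]" should become remains a judgment call for the researcher.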

Advantages of OpenRefine


With OpenRefine, you can import your data, discover patterns, fix mistakes, and transform your information so it’s ready for analysis or sharing. You don’t need to know how to code or use complicated software. OpenRefine is built for researchers who want to focus on their work, not on technical details.

One of the strengths of OpenRefine is its ability to help you explore your data and perform simple analysis right from the start. You can quickly filter and sort your data, group similar entries, and visualize distributions to spot trends or outliers. This makes it easy to get a sense of your dataset before diving deeper into research questions. You can also use built-in functions to split or merge columns, remove duplicates, and transform data formats, making your information more consistent and reliable.
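
To give a rough idea of what "grouping similar entries" can mean, here is a minimal Python sketch of key-based grouping: each value is reduced to a simplified key, and values that share a key are treated as likely variants of each other. This is a simplified illustration, not OpenRefine's own implementation; the function names and the sample values are assumptions made for this example.

```python
import re
import unicodedata
from collections import defaultdict


def simple_key(value):
    """Reduce a value to a comparison key (hypothetical helper).

    Steps: strip accents, lowercase, remove punctuation,
    then sort the unique words so that word order does not matter.
    """
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    cleaned = re.sub(r"[^\w\s]", "", ascii_text.lower())
    return " ".join(sorted(set(cleaned.split())))


def group_similar(values):
    """Group values that end up with the same key."""
    groups = defaultdict(list)
    for value in values:
        groups[simple_key(value)].append(value)
    return dict(groups)


# Made-up spelling variants, not taken from the lesson's dataset.
values = ["Paris", "paris ", "PARIS.", "Crépy, J. B.", "J. B. Crepy"]

for key, members in group_similar(values).items():
    print(repr(key), "<-", members)
```

Note that a key like this would not group "Paris" with "Paris, France", because the second value contains an extra word. Cases like that still need a human decision.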

OpenRefine supports a wide range of data formats, including CSV, Excel, and JSON, and can connect to online sources and databases. It also allows you to match your data against external databases, such as Wikidata, to enrich your dataset with additional information. Because OpenRefine is open source, it can be extended with add-ons and custom scripts, giving you even more possibilities. The active community around OpenRefine has developed many add-ons that provide new features, such as connecting to other data sources, exporting to different formats, or automating repetitive tasks.

Key Points

  • OpenRefine is a free, open-source tool for cleaning, organizing, and exploring messy data.
  • You can easily import, filter, sort, and analyze your data, even without technical experience.
  • OpenRefine supports many data formats and can be extended with add-ons and custom scripts for even more possibilities.
  • Using OpenRefine helps you prepare your data for analysis, supporting transparent and reproducible research practices.