All in One View

Content from Introduction to OpenRefine


Last updated on 2026-06-24 | Edit this page

Overview

Questions

  • What is OpenRefine and how can it help with messy data?
  • What kinds of tasks and analyses can you perform with OpenRefine?

Objectives

  • Identify typical problems in cultural heritage datasets
  • Describe how OpenRefine supports exploratory data cleaning
Discussion

Do you often work with digital data in your research, your studies or your work?

For example, cleaning it, processing it, converting it or analyzing it? Do you have an idea what problems could arise when working with data, especially external data?

Before you can answer research questions, you first need to understand and clean your data. In the humanities, you might work with lists of artworks, artists, historical events, or other information collected from museums, archives, or fieldwork. Often, this data is stored in spreadsheets or tables, and at first glance, it may seem tidy. But as you look closer, you may notice small issues: names spelled in different ways, missing details, or dates written in various formats. These challenges are common and can make it difficult to analyze or share your data.

OpenRefine is a free, open-source tool designed to help you work with messy data. Think of OpenRefine as a workbench for your data, a place where you can clean, organize, and explore information, even if you have no technical background. OpenRefine runs locally on your computer and opens in your web browser, providing a user-friendly interface that guides you through each step. Although OpenRefine runs in your web browser, nothing from your dataset is sent to the internet — everything runs locally on your computer.

Our Dataset


Throughout this lesson, you use a sample dataset from the Metropolitan Museum of Art (The Met) Open Access Initiative. The dataset includes artwork titles, creators, production dates, materials, and locations — fields that are typical for cultural heritage collections. Even if you have never worked with data before, you see how OpenRefine can make your research easier.

In many digital humanities projects, a significant amount of time is spent preparing and cleaning data before analysis. It’s not always the most exciting part, but it is essential. With OpenRefine, you learn how to think about data organization and develop practices for more effective data cleaning. By the end of this lesson, you will be able to clean, explore, and analyze structured cultural heritage data. You don’t need to be a technical expert, just curious and willing to try something new.

Discussion

Spot the Messy Data

Look at the small sample below. It contains only a few records from The Met dataset you work with later.
Identify anything that might cause problems during analysis.

Title Artist Display Name Object Date Object Name City Tags Medium
Tile J. and J. G. Low Art Tile Works ca. 1884 Tile Chelsea 0 Earthenware
Cabrette Joseph Bechonnet 19th century Cabrette Effiat, Puy-de-Dôme Animals various material
“A weaver of dreams” Margaret Neilson Armstrong; G.P. Putnam & Co., New York; Myrtle Reed 1911 New York
Design for a shawl with scrolling paisley leaves and Indian flowers Fleury Chavant; Georges Schlatter; J.E.G.; Herault [after 1844] Book Print Ornament, Architectur Paris N/A Lithograph
Nouveau Cayer de Paysages à l’usage des personnes qui apprennent le Dessin J. B. Crépy 1781 Book Paris, France Etching, printed in red

Questions to discuss:

  1. What inconsistencies or formatting issues can you spot?
  2. Which values might make filtering or sorting difficult?
  3. Are there entries where you would want to investigate further before analysis?
  4. Why might these issues matter later in OpenRefine?
  • Missing data are sometimes represented by blank cells, N/A or 0.
  • Is the title the same as the object name? In some rows they differ in others they are the same.
  • The Artist Display Namecolumn sometimes contains more people.
  • Object dates are often not recorded as a specific year.
  • The cities in the table could be ambiguous, is it Chelsea in the UK or Chelsea in the US? Additional information about the artist might help identify the correct location.
  • Paris and Paris, France refer to the same place, but are recorded differently.
  • Some titles are enclosed in quotation marks, while others are not.

Advantages of OpenRefine


With OpenRefine, you can import your data, discover patterns, fix mistakes, and transform your information so it’s ready for analysis or sharing. You don’t need to know how to code or use complicated software. OpenRefine is built for researchers who want to focus on their work, not on technical details.

One of the strengths of OpenRefine is its ability to help you explore your data and perform simple analysis right from the start. You can quickly filter and sort your data, group similar entries, and visualize distributions to spot trends or outliers. This makes it easy to get a sense of your dataset before diving deeper into research questions. You can also use built-in functions to split or merge columns, remove duplicates, and transform data formats, making your information more consistent and reliable.

OpenRefine supports a wide range of data formats, including CSV, Excel, and JSON, and can connect to online sources and databases. It also allows you to match your data against external databases, such as Wikidata, to enrich your dataset with additional information. Because OpenRefine is open source, it can be extended with add-ons and custom scripts, giving you even more possibilities. The active community around OpenRefine has developed many plugins that add new features, such as connecting to other data sources, exporting to different formats, or automating repetitive tasks.

Key Points
  • OpenRefine is a free, open-source tool for cleaning, organizing, and exploring messy data.
  • You can easily import, filter, sort, and analyze your data, even without technical experience.
  • OpenRefine supports many data formats and can be extended with add-ons and custom scripts for even more possibilities.
  • Using OpenRefine helps you prepare your data for analysis, supporting transparent and reproducible research practices.

Content from Importing Data and Getting to Know the OpenRefine User Interface


Last updated on 2026-06-24 | Edit this page

Overview

Questions

  • How do I start a new project in OpenRefine?
  • How do I import a CSV file?
  • What options and settings are available during import?
  • How is the user interface structured?

Objectives

  • Create and configure a new OpenRefine project
  • Evaluate import settings using the preview pane and import a CSV file
  • Identify the core interface components and explain their functions

Importing Data


In this episode, you import the Metropolitan Museum of Art dataset used throughout the lesson. To begin, open OpenRefine. When you start OpenRefine, a window in your web browser (at the address http://127.0.0.1:3333/) opens displaying the start page.

Screenshot of the Import Data Window
Screenshot of the Import Data Window

There are various options in the left-hand bar. You can:

  • Create a new project and load data.
  • Open an existing project.
  • Import an existing project.
  • Change the language of OpenRefine.
  • Manage extensions.

OpenRefine structures your work in projects. To begin working, you first need to create a new project and import The Met dataset. If you pause your work on the project (the data and the changes you made) and want to continue later on, you can choose Open project. If a colleague sends you an OpenRefine project, you can import it under Import project.

  1. Click on Create Project and then on Get data from This Computer.
  2. Here click Browse, locate the dataset met_dataset_oa.csv on your computer and select it.
  3. Click on Next and upload the data into OpenRefine.
  4. On the next page OpenRefine shows you a preview of your data, allowing you to check that everything looks correct before you proceed.

Below the data preview, you find various import settings for how the data should be loaded. These settings have a direct effect on the preview above, allowing us to immediately check whether the settings are correct. This is especially important when files use non-standard formats, as incorrect settings may result in a distorted table structure.

Callout

What kinds of data files can I import?

There are several options for getting your dataset into OpenRefine. You can import files in a variety of formats including:

  • Comma-separated values (CSV) or text-separated values (TSV)
  • Text files
  • JSON (javascript object notation)
  • XML (extensible markup language)
  • RDF (resource description framework)
  • OpenDocument spreadsheet (ODS) or Excel spreadsheet (XLS or XLSX)

If needed, you can change the format on the left side under Parse data as. For more information see the Create a project by importing data page in the OpenRefine manual.

Challenge

Import as Interpretation

Let’s test how sensitive the import process is to incorrect settings. Try changing the separator and the character encoding.

In the import preview, change the column separator from comma to semicolon and the character encoding to US-ASCII and UTF-8. Observe the preview carefully.

  1. How does the table structure change?
  2. How many columns are displayed now?
  3. Why does this happen?
  4. Regarding the character encoding, what difference can be identified?
  5. What would happen if you created the project without correcting this setting?
  1. The table structure collapses.
  2. Most or all values appear in a single column.
  3. This happens because the dataset is comma-separated. When the separator does not match the file structure, OpenRefine cannot divide the data into columns correctly.
  4. With the character encodiung US-ASCII umlauts, accents and dashes are not displayed correctly. In contrast everything looks fine using UTF-8.
  5. The dataset would be imported incorrectly, making further analysis difficult or impossible.

Sometimes, data files include extra header lines or notes at the top. You can tell OpenRefine to skip these lines, so only the actual data is imported. For Excel files, you can select which sheet to import. For CSVs, you can preview and adjust how columns are interpreted. There are many different options depending on the file format and the dataset. In our case, the default settings are sufficient.

Callout

Why import settings matter?

Import settings determine how OpenRefine interprets your file and turns it into a table. If these settings are incorrect, data may be split into the wrong columns, rows may be skipped, or values may be read incorrectly. Correct settings help prevent problems later in your workflow and save time.

  1. Once you are happy with the preview and settings, you can change your project name above the preview and click Create Project.
  2. OpenRefine loads your data into its workspace on the next page.

Overview of the OpenRefine interface


Screenshot of the user interface
Screenshot of the user interface

The OpenRefine interface is organized around a central workspace. The main window displays the data in a tabular format called the grid, with rows and columns similar to a spreadsheet: A row represents a record in your dataset, and every column represents a type of information. Above the table in the grid header, you can choose how many rows are shown at once, and you can scroll through the columns within the table.

Each column header has a small arrow. Clicking this arrow opens a drop-down menu with actions that apply only to that column, such as sorting, faceting, and editing its values: These are the actions you learn about in the following chapters

On the left-hand side, the Facet/Filter tab shows all active filters and facets. These tools allow you to explore the dataset and to see how your actions affect it. The Undo/Redo tab records every change you applied to the data. From here, you can step backward or forward through your changes. Note: All changes are stored within the OpenRefine project; the original file remains unchanged. You will use both tabs a lot in the following episodes.

Lastly, in the right corner of the project bar, the menu provides access to project-level actions. When you click on Open..., you return to the start page. The Help button links to the official OpenRefine documentation. If you encounter problems in the future, the official documentation is a useful starting point.

Now that we have successfully imported the dataset and understood the interface, we can begin exploring the data itself.

Key Points
  • OpenRefine organizes your work in projects
  • You can import data from different sources and in different formats into OpenRefine
  • Adjust import settings to ensure your data is read correctly and preview the results before starting
  • The main components of the user interface are the grid, the grid header, the project bar, and Facet/Filter as well as Undo/Redo tab

Content from Exploring Data


Last updated on 2026-06-24 | Edit this page

Overview

Questions

  • What options does OpenRefine offer for data exploration?
  • What is a facet and how does it help me explore data?
  • How do facets differ from filters?
  • What data types exist in OpenRefine?

Objectives

  • Define faceting and identify when to use it.
  • Define filtering and identify when to use it.
  • Create a text facet to get an overview of values in a column.
  • Transform values from one data type to another.
  • Use Split multi-valued cells to prepare data for accurate faceting and later analysis.

Facets


When we work with data in OpenRefine, one of the first challenges is to make sense of the information that has been imported. Looking at rows of raw data rarely gives us much insight, especially if the dataset is large. What we need is a way of quickly summarizing values, spotting patterns, and finding potential problems such as inconsistent spellings or unexpected categories.

OpenRefine provides a set of tools for this under the name facets. Faceting allows us to group together all of the different values that appear in a column and to see how often they occur. A facet creates a kind of “interactive summary” of the data: it lists all unique values in a column and shows how many times each one appears.

Text Facet


The most commonly used facet type is called text facet. This facet type applies to all string values. We create our first text facet together for the column Department. Open the column menu with the small arrow next to the column name and choose Facet → Text facet. A new panel appeara on the left side of the screen showing the unique values in the column and how often they occur. You can sort the values alphabetically or by frequency. You can also hover over values to edit them directly. This simple step immediately transforms a spreadsheet with hundreds of rows into a clear summary of categories and helps identify initial inconsistencies.

Both “Arts of Africa Oceania and the Americas” and “Arts of Africa, Oceania and the Americas” appear in the panel. Since these values most likely refer to the same department, the version without the comma is probably a data entry error. To correct this, identify the incorrect value in the facet list, hover over it, click edit, and change it. This merges all records under the correct spelling by just one action.

You can also click on one department to see only those rows in the table, or select multiple values to combine them by clicking include. Ensure that you undo your selection by pressing exclude, otherwise you only continue working with a small subset of the data.

Challenge

First Insights into Your Dataset & Correct Errors

Create a text facet on the columns:

  • Is Public Domain

  • Object Name

  • City.

  1. How many unique values are listed in each column?
  2. What is the most common value in Object Name / City and how often does it appear?
  3. Can you spot and fix the problems in the Is Public Domain column?
  1. 4 / 389 / 321 different values.
  2. The most frequent value is Book with 493 counts and Paris with 439 counts.
  3. There are inconsistent capitalizations: “False”, “false” and “true”, “True”. You can decide for one spelling and edit the others with the edit button.

The difference between Facets and Filters

It is useful to distinguish between facets and filters. Both are ways of focusing on a subset of the data, but they work differently. A filter is like a search box: you type in a word or part of a word, and OpenRefine hides all rows that do not match. A facet, by contrast, gives you an overview of all the distinct values in a column (as you saw in the Challenge above) and lets you select them interactively.

For example, in the Met dataset, the column Department might contain categories such as “Medieval Art”, “Islamic Arts,” or “Drawings & Prints.” A text facet on this column immediately shows you how many artworks belong to each department. By clicking on one or several values in the facet, you can quickly restrict your view to only those artworks, and then easily switch back to the full dataset. Using this functionality, you can filter your data through the facet interface.

Another way to filter values in a column is to use a text filter. When you choose Filter → Text filter, a search field appears on the left side. You can type a term to restrict the dataset to matching rows.

Challenge

Combining Facets and Text Filters

How many prints are in the department “Drawings and Prints”? What difficulties arise when you only use filtering via the include function of the facet?

First, create a text facet on the column Department and filter for the department “Drawings and Prints” via include. Then you can use a text facet on the column Object Name. When you include the facet “Print”, you find 281 matching records. However, the results only include exact matches with “Print” not simliar values such as “Print collection ornament & architecture” and other similar ones.

To filter for all prints in the department, you can use a text filter and type “print”. There are 665 prints in the department.

Rows, Records, and Multi-Valued Cells

Up to this point, we have assumed that every cell in a column contains only a single value. In real-world data, however, that is often not the case. In the Met dataset, the Artist Display Name column sometimes contains two or more names, separated by a | like Horace Harral|Joseph Wolf. This is problematic for faceting. If we want to find out which artist appears most frequently in the dataset, this structure makes it difficult. That’s because if we apply a text facet to the Artist Display Name column as it stands, OpenRefine treats the entire string as one value. The facet then lists the artists as if it were a single artist, which is clearly not what we want. What we need instead is to treat each artist as a separate value: Horace Harral and Joseph Wolf.

To understand what happens when we correct this, it helps to know about the distinction between rows and records in OpenRefine. By default, data is displayed as rows, one after the other. But OpenRefine can also treat a group of rows as belonging to the same record. When we split a cell that contains multiple values, OpenRefine creates additional rows within the same record. This means that the number of rows increases, while the number of records remains unchanged.

  1. Switch to records at the top left of the grid. This makes it easier to see what happens after the split.
  2. Open the column menu for Artist Display Name, choose Edit cells → Split multi-valued cells…. Enter the separator | in the box, and click OK.
  3. Create a text facet on Artist Display Name again. This time you see the names of individual artists listed separately. Each can now be counted and selected on its own.

If you now switch back to the rows view you will see that the number of rows increased to 4,644.

Challenge

Split multi-valued cells

The column Tags also contains multiple values in some cells. Split the cells for this column and create a text facet.

  • Which separator do you need?

  • Which tag appears most frequently in the dataset?

  • Do you notice duplicate values in the facet list?

  • When splitting the cells, you should use a pipe (|) as the separator.

  • After splitting the values and creating a text facet, the most frequent tag in the dataset is “Men”.

  • The facet may show duplicate values. These appear as separate values because some tags contain leading whitespace. To fix this you can Edit cells → Common transforms → Trim leading and trailing whitespace. This removes spaces at the beginning or end of the cells. After trimming the whitespace, the duplicate values merge into a single tag.

Rejoining Values

Sometimes you may want to put the data back into its original form. After cleaning or analyzing, OpenRefine allows you to join split rows back into a single cell. To do this, return to the column menu, select Edit cells → Join multi-valued cells…, and specify a separator again. This is useful if you need to export the data in a more compact form later.

Other Facet Types


So far we have used text facets to explore categorical values such as departments or object types. OpenRefine also offers other facet types that help explore numeric values like creation year of an artwork.

Let us try this out with the Object Begin Date column from the Met dataset which contains the year an artwork was created. Open the column menu and choose Facet → Numeric facet. You might expect to see the different values, but instead, OpenRefine displays a message such as “No numeric values.” This tells us that the values in the column are not actually recognized as numbers, even though they look like numbers in the table.

This situation is common when importing data: numbers are often stored as strings (that is, as text), so OpenRefine does not treat them as numeric values. We need to transform them first.

To fix this, open the column menu for Object Begin Date again and choose:
Edit cells → Common transforms → To number.

OpenRefine now attempts to parse each cell in that column as a number. If the value is a valid number, the cell is converted; if not (for example, if the cell is empty or contains text like “unknown”), it becomes a blank cell.

You see a small change in the formatting of the numbers: they are now right-aligned, which is OpenRefine’s way of indicating that they are numeric rather than text.

With the column properly converted, repeat the earlier step:
Facet → Numeric facet.

This time, OpenRefine shows a histogram with a slider to explore ranges of values from year 0 up to year 2100. The histogram groups the year values into intervals of 100 years, giving you an overview of how artworks are distributed by year of their creation.

By dragging the slider handles, you can focus on particular ranges. For example, you might restrict the view to artworks before 1,500. You see that the dataset contains only 42 matching objects. Most artworks therefore date from after the 15th century; this is evident at a glance.

Numeric facets therefore serve two purposes at once: they help you explore distributions, and they highlight anomalies that need cleaning.

Callout

OpenRefine also offers other facet types such as a timeline facet for date values or a scatterplot facet for exploring the relationship between two numeric columns. In addition, OpenRefine provides a facet type that helps you identify duplicate values in a column: Facet → Customized facets → Duplicate facet. Duplicate facets can be useful, for example, when working with columns that are expected to contain unique identifiers.

The best way to understand these facets is simply to experiment with them.

Challenge

Numeric Facet

Turn the values in the column Accession Year into a numeric facet. In which decade did the Met collection acquire the most artworks? Why are there non-numeric values? Can you spot the error?

In the 1940s. There are 5 non-numeric values. You can take a look at them by ticking Non-numeric. It looks like someone wrote an “O” instead of a “0”.

Detect Blank Values


Not every object contains information in all columns. Often, the data is incomplete. Knowing how incomplete the data is can often be important for planning a later analysis. We can quickly identify the gaps by selecting Facet → Customised facets → Facet by blank. This gives us an output with “true” and “false” on the left. By selecting “false”, we exclude rows with missing values from the subset. If we do this for all columns, we only get the data records that are complete.

Challenge

Identify complete records

How many records in the dataset contain information on both Culture and Tags?

First apply Facet by blank to Culture and select “false”. Then apply Facet by blank to Tags and select “false”. Only the remaining 175 records contain values in both columns. This is quite a small number, considering that the dataset contains 2,076 records.

Key Points
  • Facets provide an interactive overview of the values in a column and help you explore your data.
  • Multi-valued cells must be split before accurate faceting is possible.
  • Numeric and Timeline facets require converting text values into numbers or dates first.

Content from Custom Facets and GREL


Last updated on 2026-06-24 | Edit this page

Overview

Questions

  • When do we need a custom facet instead of a built-in one?
  • How can GREL help us explore and classify data more flexibly?

Objectives

  • Understand what a custom facet is and how it differs from standard facets.
  • Learn to write simple GREL expressions for filtering and classifying data.
  • Create custom facets using GREL expressions to group and analyse data in new ways.

Creating a Custom Facet with GREL


Discussion

How can we tell whether a cell contains one artist or several artists?

How can we check the column Artist Display Name to find out whether one or more people were involved in creating an artwork without splitting it? Look carefully at a few cells. What character consistently separates multiple names?

So far, we have explored facets that you can create by clicking through the menu—text, numeric, and timeline facets. These are powerful, but sometimes an exploration or cleaning task requires a rule that is not built in. In those cases, OpenRefine lets us define our own facets using small expressions written in GREL. Don’t worry if these terms are new to you. We demonstrate them with an example based on the dataset you already explored in the previous episode.

  1. Open the column menu for Artist Display Name.

  2. Choose Facet → Custom text facet…

  1. Enter the following expression:

    value.contains("|")

    This expression asks, “Does the cell contain a pipe?” For each row, OpenRefine evaluates the expression and returns either true or false.

  2. Click OK. In the left panel, you now see a facet with two categories: true and false.

This small expression creates a logic-based facet that is not available as a built-in facet. It does not modify the data. It simply checks whether the condition is true or false for each row.

Why this works: A Custom facet runs your expression on every row, groups the results, and lets you filter by the outcome. You can write tests that return booleans (true/false), strings (e.g., normalized categories), or even numbers — OpenRefine facets whatever the expression returns.

Challenge

Finding Titles with Quotation Marks

Create a custom text facet on the Title column and determine how many titles contain quotation marks. Then inspect a few examples and discuss why quotation marks might have been used.

Tip: You need to escape the quotation mark in the expression using a backslash (\).

The expression is:

value.contains("\"")

It returns 79 rows with the value true, meaning that 79 titles contain quotation marks. Looking at several examples suggests that quotation marks are often used when the title refers to a component of a larger work, such as an illustration in a book. This information could be important for a later analysis.

What Is GREL?


GREL stands for General Refine Expression Language. It is a small, specialized language used inside OpenRefine to:

  • inspect cell values
  • transform text and numbers
  • check conditions
  • extract patterns
  • create new values on the fly

GREL looks like code, but many useful expressions are short and readable. You can think of them as tiny instructions that tell OpenRefine how to interpret or transform the value currently in a cell (that value is referred to as value inside GREL). Many of the actions we have already performed can also be expressed directly in GREL, but to make things easier the most common functions are already built in. The menu simply provides shortcuts for the most common actions.

Challenge

Detecting unusually long titles

Very long artwork titles may indicate data issues, such as multiple titles stored in one cell or comments included in the title field, but they can also reflect descriptive cataloguing practices.

Create a custom text facet on the column Title using the following GREL expression:

if(value.length() > 40, "long title", "short title")
  • What does this expression do in your own words?

  • How many long titles are there in the dataset?

The expression works as follows:

  1. Inspect the cell with value.length(), which calculates the number of characters in the title.

  2. The if() function checks whether the title has more than 40 characters (value.length() > 40).

  3. Produce a new output. If the condition is true, the expression returns “long title”, otherwise it returns “short title”. OpenRefine then groups rows according to these generated values.

This means the facet does not group by the original cell content, but by values that are generated by the expression.

There are 1009 long titles in the dataset. Looking at these entries shows that many titles include a description in addition to the actual title.

Coming back to our column Artist Display Name, you can also ask a more complex question: How many artists contributed to each artwork? This question cannot easily be answered using the menu interface alone. Instead of simply asking whether a pipe exists, we can count how many artist names are stored in a cell. To do this, we first split the text into a list and then count how many elements the list contains.

value.split("|").length()

In this expression, we chain two operations together. First, split() creates a list of names. Then .length() counts how many elements are in that list. In the window where you enter your GREL function, you can see the result it produces as you type.

Working with GREL always starts with a question about the data. That’s why we’re going to take a look at the most common GREL functions and what they do.

Callout

Callout: GREL-Functions

Good places to look up your problem and the corresponding GREL function are:

Some useful functions include:

  • value.toLowercase() – lowercase the text.
  • value.trim() – remove spaces at start/end.
  • value.length() – number of characters.
  • value.contains("text")true if “text” occurs.
  • value.startsWith("A"), value.endsWith(".") – prefix/suffix checks.
  • value.replace("old","new") – literal replace.
  • value.replace(/\s+/," ") – regex replace (collapse multiple spaces).
  • value.split(";") – split into an array on ;.
  • array.join("|") – join array back to a string.
Key Points
  • Custom facets group data using computed results from a GREL expression, not only the original cell values.
  • GREL is a lightweight language that allows you to inspect, classify, and analyse data inside OpenRefine.
  • Custom facets let you ask flexible questions about your data, such as identifying multiple creators or unusually long titles.
  • With conditional expressions like if(), you can define new categories that support deeper exploration and data-quality checks.
  • GREL functions can be chained together to answer more complex questions about your data.

Content from Transforming Data


Last updated on 2026-06-24 | Edit this page

Overview

Questions

  • How can GREL be used to extract information?
  • What is the difference between modifying existing values and creating new columns?
  • How can clustering help identify inconsistent values in a dataset?

Objectives

  • Apply GREL transformations to extract information.
  • Create new columns based on existing data.
  • Use arrays and indexing to select specific parts of a value.
  • Use clustering to identify and review inconsistent values.

From Exploring Data to Cleaning Data


In the previous episodes, we used built-in and custom facets to explore our dataset and identify potential issues. Facets help us find patterns and problems, but they do not change the data itself. For further analysis, it is often useful to reorganize and standardize the data.

In this episode, we move from exploration to cleaning. We use transformations, column operations, and clustering to modify and standardize our data.

Discussion

What information is stored in Artist Display Bio?

Look at several values in the column.

  • What different kinds of information are stored together?
  • What problems could this cause for later analysis?
  • How could we separate these different pieces of information?

Splitting Multi-Valued Cells

The first step is to separate information about individual artists. As you learned in previous episodes, the pipe character (|) is used to separate information about multiple artists within a single cell.

To separate these artists into individual rows:

  1. Open the column menu for Artist Display Bio.

  2. Choose Edit cells → Split multi-valued cells...

  3. Enter: |.

  4. Confirm with OK.

You notice that some artists have only partial information, while others have much more detailed entries. Most complete entries follow a pattern similar to: Nationality, Place Year–Year Place.

Creating New Columns and Extracting Information

Now we can begin extracting specific pieces of information. To preserve the original data, we create a new column rather than modifying the existing one. This is an important distinction: unlike the previous operation, which changed the structure of the dataset by creating new rows, we are now creating an additional column while keeping the original values unchanged.

Challenge

How to extract the Nationality?

Look at the column Artist Display Bio. How could you separate the nationality from the remaining information?

Most values follow the pattern: Nationality, additional biographical information

We can therefore split the text at the first comma and keep only the first element.

  1. Open the column menu for Artist Display Bio .

  2. Choose Edit column → Add column based on this column...

  3. A new window appears.

  1. Name the new column Nationality.
  2. Enter the GREL expression and check what happens in the Preview tab below
value.split(",")

The cell content is transformed from text into an array with three elements:

American (born Germany), Frankfurt-am-Main 1863–1962 Staten Island, New York

[ "American (born Germany)", " Frankfurt-am-Main 1863–1962 Staten Island", " New York" ]

The list is called an array and stores the elements or values in a specific order. In OpenRefine, arrays are displayed unsing square brackets with elements separated by commas: [value1, value2, value3]. Each element has a position, called an index. Counting starts at 0, so the first element is accessed with [0], the second with [1], and so on. Since we are interested in the nationality, we only want the first element of the array.

  1. Add [0] to the end of the GREL expression.

  2. Click OK to create the new column.

Challenge

Additional cleaning

Look at the new Nationality column. It seems that the cells contain more than just nationalities. Discuss the following questions:

  • Which additional information is still present?
  • Can you identify recurring patterns?
  • How might these patterns be removed?

You do not need to propose a GREL expression. A description in natural language is sufficient.

Common issues include:

  • Birth information in parentheses, such as (born Germany).
  • Uncertainty is marked with (?)
  • Entries containing dates instead of nationalities.
  • Entries containing place names where no nationality was recorded.
  • Leading or trailing whitespace introduced during splitting.
  • Information about active periods, such as active 16th century.

Data cleaning is rarely finished after a single transformation. Manual review is still necessary, but transformations can greatly reduce the amount of work required.

Challenge

Extracting Life Dates

Once we have extracted the nationality, we can use a similar strategy to isolate other information stored in the same field. Life dates are another important piece of information that may be useful for later analysis.

Create a new column containing only the life dates of the artists. Look again at the values in Artist Display Bio. Can you create a new column that contains only the birth and death years?

Most life dates contain digits and a dash separating birth and death year.

One possible solution is:

value.replace(/[^0-9–-]/,"")

Explanation: value.replace(pattern, replacement) replaces all parts of a value that match a given pattern.

In this example:

  • / ... / indicates that the pattern is written as a regular expression.
  • [0-9–-] matches any digit (09) as well as dashes (- and ).
  • ^ inside the square brackets reverses the selection. It means “anything except” the listed characters.

Therefore: [^0-9–-] means: match every character that is not a number or a dash.

The replacement string is empty (""), so all matching characters are removed.

Clustering


Even after applying transformations, some inconsistencies remain. Not all problems can be solved with rules or regular expressions. Sometimes values differ only slightly because of spelling variations, abbreviations, or typing mistakes. In these cases, clustering can help identify potentially equivalent values. It identifies and normalizes variations — especially when different inconsistencies appear similar but not identical. We demonstrate clustering on the column Object Name.

  1. Open the column menu for Object Name.

  2. Choose Edit cells → Cluster and edit....

  3. A new window appears. Click Cluster.

Screenshot of the Cluster and edit column …
Screenshot of the Cluster and edit column …

The clustering window displays one suggested cluster per row. For each cluster the variations of a similar value with the count of rows they appear are shown on the left; a field on the right allows you to select or edit a preferred value.

For every suggested cluster you can decide to merge the values into a standardized form, to ignore the cluster if the variations are meaningful or to edit only some entries of the cluster. Clustering never changes anything automatically. OpenRefine simply helps you notice patterns you would otherwise miss.

Different clustering methods and keying functions identify different kinds of similarities. You do not need to understand the underlying algorithms. In practice, it is often sufficient to experiment with several methods and compare the results.

Challenge

Cluster the Classification

Look at the column Classification and try clustering the values by using different clustering methods and keying functions

  • Which method produces the most useful suggestions?
  • Which values are grouped together?
  • What preprocessing step could improve the clustering results?

There is no single correct answer. Different clustering methods produce different suggestions.

One useful preprocessing step is to split multi-valued cells before clustering. If several classifications are stored together in one cell, clustering has difficulty identifying similarities between individual values.

Key Points
  • GREL expressions can be used to extract, modify, and standardize information.
  • Creating new columns preserves the original data and makes transformations easier to review.
  • The split() function creates arrays that can be accessed using positions such as [0] and [1].
  • Structured information can be extracted from text using GREL expressions and pattern matching.
  • Data cleaning often requires multiple transformation steps and manual review.
  • Clustering helps identify potentially equivalent values and supports manual standardization.

Content from Reconciling Data with External Data Sources


Last updated on 2026-06-24 | Edit this page

Overview

Questions

  • What does it mean to reconcile data?
  • Why is reconciliation useful in humanities research?
  • How can we use OpenRefine to enrich our dataset with identifiers and structured information?

Objectives

  • Understand the concept of data reconciliation.
  • Reconcile artist names with an authority database.
  • Add stable identifiers to the dataset.

Why use reconciliation?


Up to this point we have cleaned and explored our dataset. We standardized values, split columns, and corrected inconsistencies. However, the values in the table are still plain text labels. For example in the column Artist Display Name we find values such as “Frank Lloyd Wright” and “Jean Le Pautre”.

For a human reader these values clearly represent people. A computer, however, only sees text. It cannot know whether “Frank Lloyd Wright” refers to the famous architect, another person with the same name, or a variant spelling used in another dataset.

This becomes a problem when we want to combine datasets, search across collections, or enrich our data with additional information. Computers need stable identifiers, not just names.

Reconciliation connects these text labels to authority records. Instead of working only with the name written in the dataset, we link the value to a stable identifier in an authority database.

For example, artists in our dataset can be linked to records in the Getty Union List of Artist Names (ULAN). Each person in ULAN has a unique identifier that distinguishes them from every other individual in the database: Frank Lloyd Wright (ULAN 500020307) or Jean le Pautre (ULAN 500000036).

When a name in our dataset is reconciled with an authority record, we are essentially answering the question: Which exact person in this authority database corresponds to the name written in our dataset?

Challenge

Quick check

Open the Union List of Artist Names (https://www.getty.edu/research/tools/vocabularies/ulan/) and search for “Frank Lloyd Wright” and “Jean Le Pautre”.

  • Why is it not always obvious which authority record is the correct one?

  • What information in the authority record helps you decide?

  • What kinds of information are available in ULAN that are not present in our dataset?

There are several search results for both. You have to check exactly which result is the right. ULAN records often contains additional structured information such as birth and death dates, occupations, as well as different spelling of names.

Authority searches may return multiple results like in these cases. To identify the correct person you need to compare additional information such as life dates, occupations, or alternative spellings in the authority data and your dataset.

Reconciling with OpenRefine


Because reconciliation can be computationally expensive, we will first work with a subset of the dataset. Make a text facet in the column Departmentand form a subset from the department “Drawings and Prints”.

  1. Open the menu on the column Artist Display Name and select Reconcile → Start reconciling…

  2. A new window appears, where you select Discover services... and a new browser tab opens with all the possible reconciliation services in OpenRefine. Search for “Getty ULAN” and copy the URL “https://services.getty.edu/vocab/reconcile/”.

  3. Now return to your other browser tab, select Add standard service... and paste the copied URL into the appearing field. Select Add service.

  4. Select the service and click on Next.

  5. Select ULAN search, then click Start reconciling....

Screenshot of the Reconciliation Service WindowScreenshot of the Getty Reconciliation Service

OpenRefine now sends each name in the column to the Getty database and suggests possible matches.

Challenge

Compare Reconciliation Services

So far, you have reconciled artist names against Getty ULAN. OpenRefine can also connect to many other authority databases – like Wikidata, VIAF (Virtual Authority File) or GND (Integrated Authority File). Add a second reconciliation service, such as Wikidata, VIAF or the Integrated Authority File (GND) and reconcile the column Artist Display Name again.

  • Which reconciliation service did you choose?
  • Does it return the same matches as ULAN?
  • What additional information is available in the new authority database?

To add another service:

  1. Open Reconcile → Start reconciling…
  2. Select Discover services...
  3. Search for a reconciliation service.
  4. Copy the service URL and add it via Add standard service....
  5. Start a new reconciliation process.

Different services may return different matches and provide different metadata. For example, Wikidata often includes links to many external databases, images, and biographical information, while VIAF and GND focus on authority control in libraries and archives.

The most useful service depends on your research question and the type of information you want to add to your dataset.

Reviewing the matches

If OpenRefine finds a clear match, the reconciliation is applied automatically. If several possible matches exist, OpenRefine shows multiple candidates. Hovering over one of the names displays some information to help you decide which person is correct. You can also go directly to the entire database page to obtain even more information. Once you have found the correct person, you can either reconcile all cells with this name or just this one. Notice that OpenRefine also displays a confidence score. While high-confidence matches are often correct, they should still be reviewed, especially when several people share the same name.

Screenshot of the Reconciliation Candidates Review
Screenshot of the Reconciliation Candidates Review
Challenge

Matchmaking

Find names in the column Artist Display Name where OpenRefine suggests multiple matches.

Look carefully at the candidate entries.

  • What information helps you choose the correct match?

  • What might make a match ambiguous?

Helpful clues include life dates, nationality, occupations, or alternative spellings.
Ambiguity often occurs when several people share the same name or when the dataset contains little contextual information.

Adding identifiers

Reconciliation links are stored inside OpenRefine but are not automatically included when exporting the dataset. To preserve them we add an identifier column.

  1. Open the column menu Artist Display Name and choose Reconcile → Add entity identifiers column
  2. Name the column something like “Artist_ULAN_ID”

The identifier column may not look very meaningful at first glance. However, identifiers are often more useful than names because they remain stable even when labels change. They allow different datasets to refer to the same person unambiguously.

Key Points
  • Reconciliation links text strings to unique identifiers in external databases.
  • This makes your dataset more precise, reusable, and comparable across projects.
  • OpenRefine suggests matches, but users should always review and confirm them.
  • Identifier columns preserve these links when exporting the dataset.

Content from Exporting and Importing Data and Workflows


Last updated on 2026-06-24 | Edit this page

Overview

Questions

  • How can we go back to an earlier step if we realize we made a mistake?
  • How can we save our cleaning process to repeat it later or share it with colleagues?
  • How can we export the cleaned data?

Objectives

  • Use the Undo/Redo tab to reverse mistakes.
  • Export and Import workflows.
  • Understand the value of transparency and reproducibility in data cleaning.
  • Export data in different formats.

Undo/Redo


On the left-hand side of the OpenRefine interface, you find the Undo/Redo tab. This tab lists every action you have taken since the project was created. Each action has a short label, such as “Text transform on 2000 cells… or “Split multi-valued cells in column nationality”.

  1. Click on the Undo/Redo tab in the left sidebar.
  2. You see a list of all your steps in order. The most recent one is at the bottom.
  3. By selecting an earlier step in the list, you can roll the dataset back to exactly how it looked at that moment.

This is like having a time machine for your dataset: you can test transformations freely without the fear of making permanent mistakes. And if you change your mind, you can always jump back to any earlier state. When cleaning messy data, we rarely get everything right on the first try.

Example: After splitting the Artist Display Bio column into multiple parts, you might notice that the country information was separated cleanly, but the century data became fragmented and less useful. Using Undo/Redo, you can jump back to the step before the split and try a different approach.

Exporting and Importing Workflows


Undo/Redo does more than let you move backwards. It also keeps track of your entire cleaning process as a set of instructions. OpenRefine can export these instructions as a JSON file. This file does not contain the cleaned data itself, but only the cleaning steps which were applied to the data.

  1. Go to the Undo/Redo tab.
  2. Click on the button Extract....
  3. A dialog opens showing all the processing steps in JSON on the right side. You can select which steps to include into the JSON by selecting the checkboxes on the left side.

Screenshot of the Workflow Window 4. Save the processing steps to a JSON file by clicking Export or copy it manually to your clipboard and paste it into a file on your computer.

Later, you or someone else can import this workflow into another OpenRefine project by clicking Apply... in the Undo/Redo tab in the other project. OpenRefine replays the exact same steps on the new dataset.

This feature is especially powerful in research, where transparency and reproducibility are essential. Instead of describing vaguely what was done, we can share the precise workflow that produced our dataset. Other researchers can review it, replicate it, or adapt it to their own data.

Callout

Workflows as Reusable Recipes

Think of Undo/Redo export files as recipes. Just like a recipe tells you how to combine ingredients to bake a cake, an OpenRefine workflow tells you how to transform raw data into a cleaned dataset. If you don’t like the taste, you can always tweak the recipe.

Exporting data


The cleaned dataset can be exported in different formats like tsv, csv, html depending on your further planned work with the data. You find these options under Export on the right side of the project bar. Here you also have the option to export the entire OpenRefine project (data and processing history) for sharing it with colleagues or as a project backup. In this case you select OpenRefine project to archive file and an archive file (.tar.gz) will be downloaded to your computer.

Callout

Exporting pitfall

Ensure that no filters are active so that the entire dataset is exported. You can check this by looking at the information above the table, for example, “10 matching rows (1999 in total)”. In this case, only the subset of 10 data rows will be exported. Filters are easy to forget, especially when you have been exploring subsets of the data.

Discussion

Reuse a Workflow

One of OpenRefine’s strengths is that cleaning steps can be reused on similar datasets. Create a new OpenRefine project and import the original data set again. Open the Undo/Redo tab and apply the workflow JSON file you exported earlier. Compare the new project with your previous project.

  • Which changes were reproduced automatically?
  • Did the workflow recreate the same columns, transformations, and cleaning steps?
  • Why is this useful when working with larger datasets or collaborating with other researchers?
Key Points
  • OpenRefine records every transformation you make.
  • The Undo/Redo tab lets you move backward and forward through your cleaning process.
  • Workflows can be exported as JSON and reapplied to other projects, ensuring transparency and reproducibility.