Content from Introduction to OpenRefine


Last updated on 2025-08-15 | Edit this page

Overview

Questions

  • What is OpenRefine and how can it help with messy data your research?
  • What kinds of tasks and analyses can you perform with OpenRefine?

Objectives

  • Explain what OpenRefine is
  • Explain the key-features of OpenRefine
Discussion

Discussion: Do you often work with digital data in your research, your studies or your work?

For example, cleaning it, processing it, converting it or analysing it?

Good data organization is the foundation of any research project. In the humanities, you might work with lists of artworks, artists, historical events, or other information collected from museums, archives, or fieldwork. Often, this data is stored in spreadsheets or tables, and at first glance, it may seem tidy. But as you look closer, you may notice small issues: names spelled in different ways, missing details, or dates written in various formats. These challenges are common and can make it difficult to analyze or share your data.

OpenRefine is a free, open-source tool designed to help you work with messy data. Think of OpenRefine as a workbench for your data, a place where you can clean, organize, and explore information, even if you have no technical background. OpenRefine runs locally on your computer and opens in your web browser, providing a user-friendly interface that guides you through each step. Working in your web browser might be confusing, but nothing of your data will be in the internet, everything runs local.

With OpenRefine, you can import your data, discover patterns, fix mistakes, and transform your information so it’s ready for analysis or sharing. You don’t need to know how to code or use complicated software. OpenRefine is built for researchers who want to focus on their work, not on technical details.

One of the strengths of OpenRefine is its ability to help you explore your data and perform simple analyses right from the start. You can quickly filter and sort your data, group similar entries, and visualize distributions to spot trends or outliers. This makes it easy to get a sense of your dataset before diving deeper into research questions. You can also use built-in functions to split or merge columns, remove duplicates, and transform data formats, making your information more consistent and reliable.

OpenRefine supports a wide range of data formats, including CSV, Excel, and JSON, and can connect to online sources and databases. Its reconciliation feature allows you to match your data against external databases, such as Wikidata, to enrich your dataset with additional information. Because OpenRefine is open source, it can be extended with add-ons and custom scripts, giving you even more possibilities. The active community around OpenRefine has developed many plugins that add new features, such as connecting to other data sources, exporting to different formats, or automating repetitive tasks.

Throughout this lesson, you will use a sample dataset from the Museum of Modern Art (MoMA). This real-world data will help you learn how to use OpenRefine to solve common problems in humanities research. Even if you have never worked with data before, you will see how OpenRefine can make your research easier, more accurate, and more enjoyable.

Much of your time as a researcher will be spent in the initial cleaning stage, where you need to organize and clean your data before you can analyze it. It’s not always the most exciting part, but it is essential. With OpenRefine, you will learn how to think about data organization and develop practices for more effective data cleaning. By the end of this lesson, you will feel confident using OpenRefine to clean, explore, and analyze your own data. You don’t need to be a technical expert—just curious and willing to try something new.

Let’s get started!

Key Points
  • OpenRefine is a free, open-source tool for cleaning, organizing, and exploring messy data.
  • You can easily import, filter, sort, and analyze your data, even without technical experience.
  • OpenRefine supports many data formats and can be extended with add-ons and custom scripts for even more possibilities.
  • Using OpenRefine helps you prepare your data for analysis, making your research more accurate, efficient, and enjoyable.

Content from Importing Data and Getting to Know the OpenRefine User Interface


Last updated on 2025-08-15 | Edit this page

Overview

Questions

  • How do I start a new project in OpenRefine?
  • How do I import a CSV file?
  • What options and settings are available during import?

Objectives

  • Start a new OpenRefine project.
  • Import data from a CSV file

To begin working with your data in OpenRefine, you first need to create a new project. Imagine you have downloaded a CSV file containing a sample of the Museum of Modern Art (MoMA) collection. This file might include information about artworks, artists, dates, and more—ready to be explored and cleaned.

When you open OpenRefine in your web browser, you are greeted by these start page.

Import Data Window
Import Data Window

There are already various options in the left-hand bar. You can:

  • Create a new project and load data.
  • Open an existing project.
  • Import an existing project.
  • Change the language of OpenRefine.
  • Manage extensions.

We will not be looking at extensions in this lesson, but they offer the possibility to customise OpenRefine in even greater detail and expand it with additional functions. We click on Create Project and then on Get data from * This Computer . Here we click on Browse and search for our file, select it and click on next . OpenRefine will show you a preview of your data, allowing you to check that everything looks correct before you proceed.

Below the data preview, we now find various settings options for how the data should be loaded. This has a direct effect on the preview, allowing us to check immediately whether the settings have been selected correctly. This is important when files have special formats, so that they can be converted into the correct table structure. For example, if your CSV uses a separator other than a comma (for example, a semicolon or tab), you can specify this so that columns are recognized correctly. Sometimes, data files include extra header lines or notes at the top. You can tell OpenRefine to skip these lines, so only the actual data is imported. For Excel files, you can select which sheet to import. For CSVs, you can preview and adjust how columns are interpreted. Once you are happy with the preview and settings, you can give your project a name and click Create Project. In our case everything should be fine and we can go on without chaniging any settings. OpenRefine will load your data into its workspace, displaying it in a clear, table-like view.

Import Data Window
Import Data Window

The user interface is designed to be intuitive. In general, it is structured so that we can see the data we currently have access to in the large main window in the middle. There, we can set how many rows we want to see at once and whether we want to see rows or records displayed. We will understand what this difference means later. The arrows next to the column labels open the functions that OpenRefine offers us for changing and processing our data. These always refer to the column we open them from. On the left-hand side, we can see exactly what we can apply or what we have created. We will also clarify and understand this in detail in a moment. The Undo/Redo column allows us to undo steps, but also to export and reuse entire processing pipelines. In the upper right corner, there are additional options where we can create a new project or export our current data. The Help button takes us directly to the OpenRefine documentation. There, all functions are also described in detail.

By starting with a CSV file from the MoMA dataset, you have taken the first step toward organizing and understanding your data. OpenRefine’s flexible import options and user-friendly interface make it easy to get started, no matter your technical background.

Key Points
  • You can import data from different different formats in OpenRefine
  • Adjust import settings to ensure your data is read correctly and preview the results before starting.
  • Functions to work with your data are used from the Arrow Buttons next to the collumn header

Content from Exploring Data


Last updated on 2025-08-18 | Edit this page

Overview

Questions

  • What options does OpenRefine offer for data exploration?
  • What is a facet and how does it help me explore data?
  • How do facets differ from filters?

Objectives

  • Define faceting and identify when to use it.
  • Create a basic Text facet to get an overview of values in a column.
  • Use Split multi-valued cells to prepare data for accurate faceting and later analysis.

When we work with data in OpenRefine, one of the first challenges is to make sense of the information that has been imported. Looking at rows of raw data rarely gives us much insight, especially if the dataset is large. What we need is a way of quickly summarizing values, spotting patterns, and finding potential problems such as inconsistent spellings or unexpected categories.

OpenRefine provides a set of tools for this under the name facets. Faceting allows us to group together all of the different values that appear in a column and to see how often they occur. A facet creates a kind of “interactive summary” of the data: it lists all unique values in a column, shows how many times each one appears, and lets us filter the table to only those rows that match. This makes it possible to move from a confusing spreadsheet view to a clear overview with just a few clicks. We also already have the option of performing some cleaning tasks on the data set, even though this is not the main purpose of this tool.

Facets


It is useful to distinguish between facets and filters. Both are ways of focusing on a subset of the data, but they work differently. A filter is like a search box: you type in a word or part of a word, and OpenRefine hides all rows that do not match. A facet, by contrast, gives you an overview of all the distinct values in a column and lets you select them interactively.

For example, in the Museum of Modern Art dataset, the column Department might contain categories such as “Architecture & Design,” “Painting & Sculpture,” or “Drawings & Prints.” A text facet on this column will immediately show you how many artworks belong to each department. By clicking on one or several values in the facet, you can quickly restrict your view to only those artworks, and then easily switch back to the full dataset.

Text Facet


Let us try this in practice. Once your MoMA sample dataset is loaded, find the column named Department. Open the column menu with the small awrrow next to the column name and choose Facet → Text facet. A new panel will appear on the left side of the screen.

In this panel you now see a list of all the different departments represented in your data, along with the number of artworks in each. You might see something like:

  • Drawings and Prints (412)
  • Architecture & Design (163)
  • Painting & Sculpture (228)

The numbers will vary depending on your sample, but the idea is always the same: OpenRefine has grouped all artworks by their department and counted them. You can now click on one department to see only those rows in the table, or select multiple values to combine them by clicking include.

In the different departments, you can see that there is both ‘Film’ and ‘Films’. Here, it is important to know whether this is a mistake or whether these are deliberately two different categories. In our example, this is clearly a spelling mistake, which we can correct right here. In practice, it is often difficult to assess whether these are mistakes or deliberate category names. This makes it important to be familiar with the data and the underlying real concepts in order to be able to clean them up. To do clean this mistake, we click on ‘Edit’ next to “Films” and remove the ‘s’, which applies to every line. We have now completed our first clean-up.

This simple step immediately transforms a spreadsheet with hundreds of rows into a clear summary of categories and also helps to detect first inconsistencies.

Rows, Records, and Multi-Valued Cells

Up to this point, we have assumed that every cell in a column contains only a single value. In real-world data, however, that is often not the case. In the MoMA dataset, the Artist column sometimes contains two or more names, separated by a comma. For example, one artwork might have its artist recorded as:

Pablo Picasso, Georges Braque

This is problematic for faceting. If we apply a text facet to the Artist column as it stands, OpenRefine will treat the entire string above as one value. The facet will then list “Pablo Picasso; Georges Braque” as if it were a single artist, which is clearly not what we want. What we need instead is to treat Pablo Picasso and Georges Braque as separate values, each counted individually.

To understand what happens when we correct this, it helps to know about the distinction between Rows and Records in OpenRefine. By default, data is displayed as rows, one after the other. But OpenRefine can also treat a group of rows as belonging to the same record. When we split a cell that contains multiple values, OpenRefine creates additional rows within the same record. That means the number of rows goes up, but the number of records stays the same.

Let us now prepare the Artist column for faceting by splitting its multi-valued cells.

  1. First, look through a few rows in the Artist column and identify whether any cells contain more than one artist. These are usually separated by a comma (,) in the MoMA dataset.
  2. Switch to Records view (at the top left of the grid, choose “Show as: Records”). This makes it easier to see what happens after the split.
  3. Open the column menu for Artist, choose Edit cells → Split multi-valued cells…. Enter the seperator , in the box, and click OK. Ignore the opther options.
  4. OpenRefine now divides each cell containing multiple artists into separate rows. The total number of rows increases, but the record count stays constant.
  5. Create a Text facet on Artist again. This time you will see the names of individual artists listed separately. Each can now be counted and selected on its own.
Discussion

Exercise: Split multi-valued cells

It is important to note that OpenRefine only splits the column you apply the function to. Other columns are left untouched. This means that if another column also contains multiple values that correspond to those in Artist, you need to split that column separately, for example Nationality. Look in the data and split the other columns if needed.

This small step has a big impact: it ensures that our summaries and analyses represent the data correctly. Instead of hiding two names inside one cell, we now see each artist clearly and can identify patterns such as which artists appear most often in the dataset.

Rejoining Values

Sometimes you may want to put the data back into its original form. After cleaning or analyzing, OpenRefine allows you to join split rows back into a single cell. To do this, return to the column menu, select Edit cells → Join multi-valued cells…, and specify a separator such as a comma character again (,). This is useful if you need to export the data in a more compact form later.

Numeric Facet


Numeric facets are designed for columns that contain numbers, such as dimensions of artworks. They let you interactively explore ranges of values with a histogram and slider.

Let us try this out with the Height (cm) column from the MoMA dataset. Open the column menu and choose Facet → Numeric facet. You might expect to see a histogram, but instead, OpenRefine shows only a message such as “No numeric values.” This tells us that the values in the column are not actually recognized as numbers, even though they look like numbers in the table.

This situation is common when importing data: numbers are often stored as strings (that is, as text), so OpenRefine does not treat them as numeric values. We need to convert them first.

To fix this, open the column menu for Height (cm) again and choose:
Edit cells → Common transforms → To number.

OpenRefine now attempts to parse each cell in that column as a number. If the value is a valid number, the cell is converted; if not (for example, if the cell is empty or contains text like “unknown”), it becomes a blank cell.

You will see a small change in the formatting of the numbers: they are now right-aligned, which is OpenRefine’s way of indicating that they are numeric rather than text.

With the column properly converted, repeat the earlier step:
Facet → Numeric facet.

This time, OpenRefine shows a histogram with a slider. The histogram groups all the height values into ranges, giving you an overview of how artworks are distributed by size.

By dragging the slider handles, you can focus on particular ranges. For example, you might restrict the view to artworks shorter than 20 cm, or zoom in on those taller than 300 cm. This is especially helpful for spotting unusual values. If you see a very high number, such as “20000 cm,” that might indicate a data entry error.

Numeric facets therefore serve two purposes at once: they help you explore distributions, and they highlight anomalies that need cleaning.

Timeline Facet


Some columns represent dates, such as DateAcquired or the year of creation of the artwork. For these, OpenRefine offers the Timeline facet.

Before you can use a timeline facet, you need to ensure that the column is stored as a date type. You can do this by choosing Edit cells → Common transforms → To date from the column menu. Once converted, you can create a Timeline facet.

The result looks similar to the numeric facet but with dates on the horizontal axis. OpenRefine displays a histogram showing how many rows fall into each time period. You can then drag the slider handles to focus on a specific date range.

For example, a timeline facet on DateAcquired could reveal when most of the artworks in your sample entered the MoMA collection. You might see peaks in certain decades, which may correspond to acquisition campaigns or curatorial trends. By selecting a particular time range, you can then explore only those artworks acquired during that period.

Scatterplot Facet


Sometimes we want to explore the relationship between two numeric columns at the same time. For this, OpenRefine provides the Scatterplot facet.

To create one, you need to apply it to a numeric column (for instance Height (cm)), and then choose a second numeric column for the other axis (for example Width (cm)). OpenRefine will then display a scatterplot diagram, where each point corresponds to one row (in this case, one artwork).

Looking at the scatterplot can help you see whether there are clusters of values or unusual outliers. For example, if most works have height and width under 200 cm, but one point is far out at 500 cm by 800 cm, that tells you there is either a very large artwork or possibly an error in the data.

The scatterplot facet is not as commonly used as the text or numeric facet, but it is a powerful way to visualize relationships directly within OpenRefine without having to export the data to another tool.

Key Points

Content from Custom Facets and GREL


Last updated on 2025-08-19 | Edit this page

Overview

Questions

  • When do we need a custom facet instead of a built-in one?
  • How can GREL help us filter or transform data more flexibly?

Objectives

  • Understand what a custom facet is and how it differs from standard facets.
  • Learn to write simple GREL expressions for filtering and transformation.
  • Apply a custom facet on the Artist column to distinguish between single and multiple creators.

So far, we have explored facets that you can create by clicking through the menu—text, numeric, and timeline facets. These are powerful, but sometimes our exploration or cleaning task needs a rule that isn’t built in. In those cases, OpenRefine lets us define our own facets using GREL (General Refine Expression Language).

GREL looks like code, but many useful expressions are short and readable. You can think of them as tiny instructions that tell OpenRefine how to interpret or transform the value currently in a cell (that value is referred to as value inside GREL). Every funciton we already used could be written in GREL as well, but to make things eaisier the most common functions are already built it.

If, for example, we now want to examine our data set to see which works of art were created by only one or by then one, we could create a custom facet to filter by this criteria.

Discussion

Question: How could we filter the column to find out whether one or more people were involved in creating a work of art??

Look at the column and consider how cells that contain one or more values differ. We can then use GREL to filter based on this difference.

Creating a Custom Facet with GREL


  1. Open the column menu for Artist.

  2. Choose Facet → Custom text facet…

  3. In the dialog, type the following GREL expression:

    value.contains(",")

    This asks, “Does the cell contain a comma?” For each row, OpenRefine evaluates the expression and returns either true or false.

  4. Click OK. In the left panel, you now see a facet with two buckets: true and false.

That’s it—you just created a logic-based facet that the standard menus couldn’t provide.

Why this works: A Custom Facet runs your expression on every row, groups the results, and lets you filter by the outcome. You can write tests that return booleans (true/false), strings (e.g., normalized categories), or even numbers—OpenRefine will facet whatever the expression returns.

Callout

Callout: GREL-Functions

  • value.toLowercase() – lowercase the text.
  • value.toUppercase() – uppercase the text.
  • value.trim() – remove spaces at start/end.
  • value.length() – number of characters.
  • value.contains("text")true if “text” occurs.
  • value.startsWith("A"), value.endsWith(".") – prefix/suffix checks.
  • value.replace("old","new") – literal replace.
  • value.replace(/\s+/," ") – regex replace (collapse multiple spaces).
  • value.split(";") – split into an array on ;.
  • array.join("|") – join array back to a string.
  • value.toNumber() – parse numbers (non-numeric becomes blank).
  • value.toDate() – parse dates (useful before Timeline facets).
  • value.toString() – convert back to text.
  • isBlank(value)true for empty/whitespace-only.
  • isNull(value)true for null values (missing).
  • type(value) – returns “string”, “number”, “date”, etc.
  • value.toDate() – parse a date (from strings like 1954-05-01).
  • value.toDate().toString("yyyy") – format the date to year.
  • now() – current date/time (useful for comparisons).

Content from Transforming Data


Last updated on 2025-08-19 | Edit this page

Overview

Questions

  • How can we clean and standardize ArtistBio values in OpenRefine?
  • What is the difference between finding issues (facets) and fixing them (transformations & clustering)?

Objectives

  • Remove literal characters with GREL replacements.
  • Split ArtistBio into nationality and life‑dates
  • Inspect and normalize a column.

Facets help us find patterns; transformations and clustering help us fix them consistently. In the MoMA sample, the column ArtistBio stores nationality plus life information in a compact string. Typical examples in your file look like:

(American, born 1936)
(American, born Italy. 1919–2013)
(American, born Germany. 1886–1969)
(French, 1857–1927)
(Ivorian, 1923–2014)

As previously noted, some columns also contain information about multiple artists, which is why we first need to split them.

Discussion

Discussion: By which seperator can we split the ArtistBio Column?

It may happen that different data items are not separated by a comma or a semicolon, in which case creativity is required to determine how to resolve these data items.

You can separate the data at the string “) (”.

Next, we will simplify the text, then split nationality from the remaining details, and finally use clustering to review and standardize the results.

Transformation with GREL


We start with a so called literal replacement.

  1. Open the column menu for ArtistBio.
  2. Choose Edit cells → Transform….
  3. Enter:
value.replace("(", "").replace(")", "")
  1. Click OK.

What this does: It deletes ( and ) everywhere and leaves the rest untouched. More precisely, it replaces the characters with nothing. The value.replace() function requires two inputs, which are passed in parentheses. One is the character to be replaced, and the other is the character with which it is replaced. In our case, we simply leave the second part blank, which results in the character being deleted.

After this step, examples look like:

American, born 1936
American, born Italy. 1919–2013
French, 1857–1927
Callout

Add a quick Text facet on ArtistBio to confirm that parentheses are gone.

Edit Columns


So far, we have applied transformations that modified the contents of cells and rows. But OpenRefine also allows us to restructure and transform our data through columns. For example creating new columns from existing ones. When working with tables, it makes sense to process them in such a way that each cell contains only one piece of information. This is useful if you want to analyze or clean different parts of the data separately, instead of working in one long text string.

The values, at the moment, are generally in the form:

<Nationality>, <rest>

By splitting at the comma, we can move the nationality into its own column and leave the rest (birth/death information, places, etc.) in another. This makes it much easier to work with later.

  1. Open the column menu ArtistBio → Edit column → Split into several columns…
  2. Separator: “,”
  3. Split into: leave the default (OpenRefine will automatically create ArtistBio 1 and ArtistBio 2 … ArtistBio X).
  4. Confirm with OK.

After splitting, your dataset will look something like this:

  • ArtistBio 1 → nationality (e.g., American, French, Ivorian, …)

  • ArtistBio 2 → the remaining details (e.g., born 1936, born Germany. 1886–1969, 1857–1927)

If cells have more then one comma, OpenRefine will create as much columns as needed. This step shows how OpenRefine is not only about cleaning text but also about reshaping your data, which is just as important when preparing a dataset for analysis.

Clustering


Even after splitting, the second part (ArtistBio 2) still contains a mix of formats:

  • born 1936

  • born Italy. 1919–2013

  • 1857–1927

  • 1923–2014

This is where clustering helps.

What is clustering?

Imagine you have a long list of words that are almost—but not quite—the same: colour, color, colr. A computer can group them together by applying a rule that says “treat things that sound or look similar as if they belonged to one bucket.” That’s clustering. OpenRefine offers different clustering methods. They don’t just compare characters literally but transform them into simplified “keys.” Rows with the same key end up in the same cluster.

You can think of it like:

  • Writing all words in block capitals and removing vowels: COLOR and COLOUR both become CLR.
  • Or sounding the word out loud: Smith and Smyth both sound like SMITH.

This is how OpenRefine groups values that are close enough, even if not exactly the same.

How to use it here:

  1. Column menu ArtistBio 2 → Edit cells → Cluster and edit…

  2. Method: Key collision

  3. Keying function: Metaphone 3

What you’ll see

In the clustering window, OpenRefine shows suggested groups of values. For example:

  • born Germany. 1886–1969 might be grouped with slight spelling variants like born Germany 1886-1969.

  • Different years won’t cluster together, but you’ll see repeated phrases like born Italy or born Germany.

  • Plain ranges like 1857–1927 won’t be grouped with 1923–2014, because the numbers differ, but you can still skim them quickly.

Clustering is not about fixing everything automatically. It’s about seeing the patterns and deciding what you want to merge. For example, you might merge born Germany. 1886–1969 and born Germany 1886–1969 into one consistent form.

Content from Reconciling Data with External Data Sources


Last updated on 2025-08-19 | Edit this page

Overview

Questions

  • What does it mean to reconcile data?
  • Why is reconciliation useful in humanities research?
  • How can we use OpenRefine to enrich our dataset with identifiers and structured information?

Objectives

  • Understand the concept of data reconciliation.
  • Reconcile names and places.
  • Add stable identifiers (IDs).

So far, we have used OpenRefine to look at and clean our dataset: splitting columns, removing unwanted characters, and clustering values. These steps improve the quality of our data, but our values are still just strings, plain text without a deeper connection to knowledge outside our file.

Reconciliation is the process of linking these strings to stable, external identifiers in authority databases such as Wikidata, the Getty vocabularies, or other domain-specific repositories. Instead of simply having the text Pablo Picasso, reconciliation can connect our cell to the unique Wikidata item Q5593. This turns our dataset into something that can be connected and compared with other datasets and research around the world.

You can think of reconciliation as asking a librarian: “I have this name written here – which exact person in your catalog does it refer to?” The librarian might return a short stack of cards with possible matches, and you confirm the right one. Once linked, the reference is unambiguous and stable.

In humanities datasets, names and places are central. But names are often ambiguous:

  • Variant spellings: Shakespeare, Shakespear, Shakspeare
  • Common names: John Smith
  • Different languages: Munich vs. München

If we keep these as plain text, any comparison across collections or projects becomes unreliable. But if we reconcile to shared identifiers (like Wikidata QIDs), we can:

  • Connect our dataset to others, regardless of spelling differences.
  • Enrich our data with structured information (e.g., dates of birth, countries, occupations).
  • Support reproducible analysis by referring to stable, citable identifiers rather than local labels.

Reconciliation therefore transforms a local, isolated dataset into part of a larger knowledge graph.

Reconciling with OpenRefine


OpenRefine makes reconciliation simple and interactive:

  • It provides built-in or addable reconciliation services (such as Wikidata).
  • It lets you review and confirm matches cell by cell, or accept high-confidence matches in bulk.
  • It allows you to pull in identifiers, labels, and even additional properties as new columns.

This combination of automation and human oversight is powerful: the machine proposes matches, but the researcher remains in control of what is accepted.

We will reconcile two columns in our dataset:

  1. Artist – the name of the artist.
  2. Nationality – the country information we previously separated from the biography.

Reconciling the Artist column

  1. Open the menu on ArtistReconcile → Start reconciling…
  2. Select Wikidata as the reconciliation service. If it does not appear, add it via Add Standard Service… and paste the URL: https://wikidata.reconci.link/en/api.
  3. In the type field, type and select Human (Q5). This tells Wikidata we are looking specifically for people.
  4. Click Start reconciling.

OpenRefine now sends each name in the column to Wikidata and retrieves possible matches.

Reviewing the matches

If the assignment is clear the reconciliation is complete. However, it is often the case that it is not clear and requires manual checking. If there are several candidates to choose from and OpenRefine is unsure which one is correct, all options are displayed in the respective cell. Hovering over one of the names will display some information to help you decide which person is correct. You can also go directly to the entire database page to obtain even more information. Once you have found the correct person, you can either reconcile all cells with this name or just this one.

This is like being handed several possible business cards for the same name. Your task is to select the one that fits the person in your dataset.

Callout

Use birth/death dates and occupations in the description to disambiguate common names.

Adding identifiers

The links now looks very good and can already be used in OpenRefine. However, if we export the file, the reconciliation disappears again, as in its current state it only works in OpenRefine itself. We therefore need to add another column with the assigned ID so that it can also be used outside of OpenRefine. We do this as follows:

  1. Column menu → Artist → Reconcile → Add entry identifiers column.
  2. Give it a name, for example Artist_ID.
  3. Click OK.

Now, every artist is linked to a stable identifier.

Reconciling the Nationality column (countries)

Now we can reconcile these values as well:

  1. Column menu → Nationality → Reconcile → Start reconciling…
  2. Choose Wikidata.
  3. Set the type to Country (Q6256).
  4. Start reconciliation.

This ensures that different spellings or forms like USA, United States, and United States of America all link to the same stable identifier: United States of America (Q30).

Key Points
  • Reconciliation links text strings to unique identifiers in external databases.
  • This makes your dataset more precise, reusable, and comparable across projects.
  • OpenRefine provides a structured workflow for reconciliation: propose → review → confirm → enrich.
  • The human researcher stays in control: machines suggest, but you decide.

Content from Undo, Redo, and Exporting Workflows


Last updated on 2025-08-26 | Edit this page

Overview

Questions

  • How can we go back to an earlier step if we realize we made a mistake?
  • How can we save our cleaning process to repeat it later or share it with colleagues?

Objectives

  • Use the Undo/Redo panel in OpenRefine to reverse mistakes.
  • Export and Import workflows.
  • Understand the value of transparency and reproducibility in data cleaning.

When cleaning messy data, we rarely get everything right on the first try. In the earlier chapters of this lesson, we:

  • Removed extra details from the ArtistBio column.
  • Split the column into separate parts, such as nationality.
  • Applied clustering to standardize values.
  • Reconciled artist names against an external authority.

All of these steps helped us improve our dataset. But what if, after clustering, we realize that some merges were wrong? Or what if we decide later that splitting the ArtistBio column at commas wasn’t the best idea? Instead of starting all over again, OpenRefine allows us to step backwards and explore different alternatives.

Undo/Redo


On the left-hand side of the OpenRefine interface, you will find the Undo/Redo panel. This panel lists every action you have taken since the project was created. Each action has a short label, such as “Text transform on 2000 cells… or “Split multu-valued cells in column nationality”.

  1. Click on the Undo/Redo tab in the left sidebar.
  2. You will see a list of all your steps in order. The most recent one is at the bottom.
  3. By selecting an earlier step in the list, you can roll the dataset back to exactly how it looked at that moment.

This is like having a time machine for your dataset: you can test transformations freely without the fear of making permanent mistakes. And if you change your mind, you can always jump back to any earlier state.

Example: After splitting the ArtistBio column into multiple parts, you might notice that the country information was separated cleanly, but the century data became fragmented and less useful. Using Undo/Redo, you can jump back to the step before the split and try a different approach.

Exporting and Importing Workflows


Undo/Redo does more than let you move backwards. It also keeps track of your entire cleaning process as a set of instructions. OpenRefine can export these instructions as a JSON file. This file is not the cleaned data itself, but the recipe of how the cleaning was done with the used data.

  1. Go to the Undo/Redo panel.
  2. Click on the link Extract….
  3. A dialog will open showing all the steps in JSON format. You can select which steps to include.
  4. Click Copy to clipboard or save the JSON to a file.

Later, you or someone else can import this workflow into another OpenRefine project:

  1. In the Undo/Redo panel of the other project, click Apply….
  2. Paste the JSON recipe or upload the file.
  3. OpenRefine will replay the exact same steps on the new dataset.

This feature is especially powerful in humanities research, where transparency and reproducibility are essential. Instead of describing vaguely what was done, we can share the precise workflow that produced our dataset. Other researchers can review it, replicate it, or adapt it to their own data.

Callout

Callout: Workflows as Reusable Recipes

Think of Undo/Redo export files as recipes. Just like a recipe tells you how to combine ingredients to bake a cake, an OpenRefine workflow tells you how to transform raw data into a cleaned dataset. If you don’t like the taste, you can always tweak the recipe.

Key Points
  • OpenRefine records every transformation you make.
  • The Undo/Redo panel lets you move backward and forward through your cleaning process.
  • Workflows can be exported as JSON and reapplied to other projects, ensuring transparency and reproducibility.

Content from Resources for Future Self-study


Last updated on 2025-07-01 | Edit this page

Overview

Questions

  • TODO

Objectives

  • TODO
Key Points