Exploring Data
Last updated on 2026-06-24 | Edit this page
Overview
Questions
- What options does OpenRefine offer for data exploration?
- What is a facet and how does it help me explore data?
- How do facets differ from filters?
- What data types exist in OpenRefine?
Objectives
- Define faceting and identify when to use it.
- Define filtering and identify when to use it.
- Create a text facet to get an overview of values in a column.
- Transform values from one data type to another.
- Use Split multi-valued cells to prepare data for accurate faceting and later analysis.
Facets
When we work with data in OpenRefine, one of the first challenges is to make sense of the information that has been imported. Looking at rows of raw data rarely gives us much insight, especially if the dataset is large. What we need is a way of quickly summarizing values, spotting patterns, and finding potential problems such as inconsistent spellings or unexpected categories.
OpenRefine provides a set of tools for this under the name facets. Faceting allows us to group together all of the different values that appear in a column and to see how often they occur. A facet creates a kind of “interactive summary” of the data: it lists all unique values in a column and shows how many times each one appears.
Text Facet
The most commonly used facet type is called text facet. This facet
type applies to all string values. We create our first text facet
together for the column Department. Open the column menu
with the small arrow next to the column name and choose
Facet → Text facet. A new panel appeara on the left side of
the screen showing the unique values in the column and how often they
occur. You can sort the values alphabetically or by frequency. You can
also hover over values to edit them directly. This simple step
immediately transforms a spreadsheet with hundreds of rows into a clear
summary of categories and helps identify initial inconsistencies.
Both “Arts of Africa Oceania and the Americas” and “Arts of Africa, Oceania and the Americas” appear in the panel. Since these values most likely refer to the same department, the version without the comma is probably a data entry error. To correct this, identify the incorrect value in the facet list, hover over it, click edit, and change it. This merges all records under the correct spelling by just one action.
You can also click on one department to see only those rows in the table, or select multiple values to combine them by clicking include. Ensure that you undo your selection by pressing exclude, otherwise you only continue working with a small subset of the data.
First Insights into Your Dataset & Correct Errors
Create a text facet on the columns:
Is Public DomainObject NameCity.
- How many unique values are listed in each column?
- What is the most common value in
Object Name/Cityand how often does it appear? - Can you spot and fix the problems in the
Is Public Domaincolumn?
- 4 / 389 / 321 different values.
- The most frequent value is Book with 493 counts and Paris with 439 counts.
- There are inconsistent capitalizations: “False”, “false” and “true”, “True”. You can decide for one spelling and edit the others with the edit button.
The difference between Facets and Filters
It is useful to distinguish between facets and filters. Both are ways of focusing on a subset of the data, but they work differently. A filter is like a search box: you type in a word or part of a word, and OpenRefine hides all rows that do not match. A facet, by contrast, gives you an overview of all the distinct values in a column (as you saw in the Challenge above) and lets you select them interactively.
For example, in the Met dataset, the column Department
might contain categories such as “Medieval Art”, “Islamic Arts,” or
“Drawings & Prints.” A text facet on this column immediately shows
you how many artworks belong to each department. By clicking on one or
several values in the facet, you can quickly restrict your view to only
those artworks, and then easily switch back to the full dataset. Using
this functionality, you can filter your data through the facet
interface.
Another way to filter values in a column is to use a text filter.
When you choose Filter → Text filter, a search field
appears on the left side. You can type a term to restrict the dataset to
matching rows.
Combining Facets and Text Filters
How many prints are in the department “Drawings and Prints”? What difficulties arise when you only use filtering via the include function of the facet?
First, create a text facet on the column Department and
filter for the department “Drawings and Prints” via include. Then you
can use a text facet on the column Object Name. When you
include the facet “Print”, you find 281 matching records. However, the
results only include exact matches with “Print” not simliar values such
as “Print collection ornament & architecture” and other similar
ones.
To filter for all prints in the department, you can use a text filter and type “print”. There are 665 prints in the department.
Rows, Records, and Multi-Valued Cells
Up to this point, we have assumed that every cell in a column
contains only a single value. In real-world data, however, that is often
not the case. In the Met dataset, the Artist Display Name
column sometimes contains two or more names, separated by a
| like Horace Harral|Joseph Wolf. This is
problematic for faceting. If we want to find out which artist appears
most frequently in the dataset, this structure makes it difficult.
That’s because if we apply a text facet to the
Artist Display Name column as it stands, OpenRefine treats
the entire string as one value. The facet then lists the artists as if
it were a single artist, which is clearly not what we want. What we need
instead is to treat each artist as a separate value:
Horace Harral and Joseph Wolf.
To understand what happens when we correct this, it helps to know about the distinction between rows and records in OpenRefine. By default, data is displayed as rows, one after the other. But OpenRefine can also treat a group of rows as belonging to the same record. When we split a cell that contains multiple values, OpenRefine creates additional rows within the same record. This means that the number of rows increases, while the number of records remains unchanged.
- Switch to
recordsat the top left of the grid. This makes it easier to see what happens after the split. - Open the column menu for
Artist Display Name, chooseEdit cells → Split multi-valued cells…. Enter the separator|in the box, and clickOK. - Create a text facet on
Artist Display Nameagain. This time you see the names of individual artists listed separately. Each can now be counted and selected on its own.
If you now switch back to the rows view you will see
that the number of rows increased to 4,644.
Split multi-valued cells
The column Tags also contains multiple values in some
cells. Split the cells for this column and create a text facet.
Which separator do you need?
Which tag appears most frequently in the dataset?
Do you notice duplicate values in the facet list?
When splitting the cells, you should use a pipe (
|) as the separator.After splitting the values and creating a text facet, the most frequent tag in the dataset is “Men”.
The facet may show duplicate values. These appear as separate values because some tags contain leading whitespace. To fix this you can
Edit cells → Common transforms → Trim leading and trailing whitespace. This removes spaces at the beginning or end of the cells. After trimming the whitespace, the duplicate values merge into a single tag.
Rejoining Values
Sometimes you may want to put the data back into its original form.
After cleaning or analyzing, OpenRefine allows you to join split rows
back into a single cell. To do this, return to the column menu, select
Edit cells → Join multi-valued cells…, and specify a
separator again. This is useful if you need to export the data in a more
compact form later.
Other Facet Types
So far we have used text facets to explore categorical values such as departments or object types. OpenRefine also offers other facet types that help explore numeric values like creation year of an artwork.
Let us try this out with the Object Begin Date column
from the Met dataset which contains the year an artwork was created.
Open the column menu and choose Facet → Numeric facet. You
might expect to see the different values, but instead, OpenRefine
displays a message such as “No numeric values.” This tells us that the
values in the column are not actually recognized as numbers, even though
they look like numbers in the table.
This situation is common when importing data: numbers are often stored as strings (that is, as text), so OpenRefine does not treat them as numeric values. We need to transform them first.
To fix this, open the column menu for Object Begin Date
again and choose:Edit cells → Common transforms → To number.
OpenRefine now attempts to parse each cell in that column as a number. If the value is a valid number, the cell is converted; if not (for example, if the cell is empty or contains text like “unknown”), it becomes a blank cell.
You see a small change in the formatting of the numbers: they are now right-aligned, which is OpenRefine’s way of indicating that they are numeric rather than text.
With the column properly converted, repeat the earlier step:Facet → Numeric facet.
This time, OpenRefine shows a histogram with a slider to explore ranges of values from year 0 up to year 2100. The histogram groups the year values into intervals of 100 years, giving you an overview of how artworks are distributed by year of their creation.
By dragging the slider handles, you can focus on particular ranges. For example, you might restrict the view to artworks before 1,500. You see that the dataset contains only 42 matching objects. Most artworks therefore date from after the 15th century; this is evident at a glance.
Numeric facets therefore serve two purposes at once: they help you explore distributions, and they highlight anomalies that need cleaning.
OpenRefine also offers other facet types such as a timeline facet for
date values or a scatterplot facet for exploring the relationship
between two numeric columns. In addition, OpenRefine provides a facet
type that helps you identify duplicate values in a column:
Facet → Customized facets → Duplicate facet. Duplicate
facets can be useful, for example, when working with columns that are
expected to contain unique identifiers.
The best way to understand these facets is simply to experiment with them.
Numeric Facet
Turn the values in the column Accession Year into a
numeric facet. In which decade did the Met collection acquire the most
artworks? Why are there non-numeric values? Can you spot the error?
In the 1940s. There are 5 non-numeric values. You can take a look at them by ticking Non-numeric. It looks like someone wrote an “O” instead of a “0”.
Detect Blank Values
Not every object contains information in all columns. Often, the data
is incomplete. Knowing how incomplete the data is can often be important
for planning a later analysis. We can quickly identify the gaps by
selecting Facet → Customised facets → Facet by blank. This
gives us an output with “true” and “false” on the left. By selecting
“false”, we exclude rows with missing values from the subset. If we do
this for all columns, we only get the data records that are
complete.
Identify complete records
How many records in the dataset contain information on both
Culture and Tags?
First apply Facet by blank to Culture and
select “false”. Then apply Facet by blank to
Tags and select “false”. Only the remaining 175 records
contain values in both columns. This is quite a small number,
considering that the dataset contains 2,076 records.
- Facets provide an interactive overview of the values in a column and help you explore your data.
- Multi-valued cells must be split before accurate faceting is possible.
- Numeric and Timeline facets require converting text values into numbers or dates first.