Key Points
Introduction to OpenRefine
- OpenRefine is a free, open-source tool for cleaning, organizing, and exploring messy data.
- You can easily import, filter, sort, and analyze your data, even without technical experience.
- OpenRefine supports many data formats and can be extended with add-ons and custom scripts for even more possibilities.
- Using OpenRefine helps you prepare your data for analysis, supporting transparent and reproducible research practices.
Importing Data and Getting to Know the OpenRefine User Interface
- OpenRefine organizes your work in projects.
- You can import data from different sources and in different formats into OpenRefine.
- Adjust the import settings so that your data is read correctly, and preview the results before creating the project.
- The main components of the user interface are the grid, the grid header, the project bar, and the Facet/Filter and Undo/Redo tabs.
Exploring Data
- Facets provide an interactive overview of the values in a column and help you explore your data.
- Multi-valued cells must be split before accurate faceting is possible.
- Numeric and Timeline facets require converting text values into numbers or dates first.
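
The points above can be illustrated outside OpenRefine with a minimal Python sketch. It is not how OpenRefine implements facets; it only mimics the idea that a facet counts distinct values, and that multi-valued cells distort those counts until they are split. The sample values, the ";" separator, and the function names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical column values; ";" as the multi-value separator is an assumption.
subjects = ["history; maps", "maps", "history", "maps; travel"]


def text_facet(values):
    """Count how often each distinct value occurs, similar to a text facet."""
    return Counter(values)


def split_multi_valued(values, separator=";"):
    """Split multi-valued cells into one trimmed value per entry."""
    return [part.strip() for cell in values for part in cell.split(separator)]


# Faceting the raw cells treats "history; maps" as one single value.
raw_facet = text_facet(subjects)

# Faceting after splitting counts each subject on its own.
split_facet = text_facet(split_multi_valued(subjects))

# A numeric facet needs numbers, so text values are converted first.
years = [int(text) for text in ["1901", "1850", "1923"]]

print(raw_facet)
print(split_facet)
print(min(years), max(years))
```

In the raw facet, "maps" is counted only where it stands alone in a cell; after splitting, every occurrence is counted.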
Custom Facets and GREL
- Custom facets group data using computed results from a GREL expression, not only the original cell values.
- GREL is a lightweight expression language that allows you to inspect, classify, and analyze data inside OpenRefine.
- Custom facets let you ask flexible questions about your data, such as identifying multiple creators or unusually long titles.
- With conditional expressions like if(), you can define new categories that support deeper exploration and data-quality checks.
- GREL functions can be chained together to answer more complex questions about your data.
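
The logic behind a custom facet can be sketched in Python as follows. This is an analogue, not GREL: it computes a category for each row and then counts the categories, which is roughly what a custom facet built on a conditional expression does. The column names, the ";" separator, and the 40-character threshold are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical rows; column names and values are illustrative only.
rows = [
    {"title": "Atlas", "creator": "Smith, A."},
    {
        "title": "A very long descriptive title about early maps of the world",
        "creator": "Lee, B.; Cho, C.",
    },
    {"title": "Travel notes", "creator": "Diaz, D.; Eze, E.; Fox, F."},
]


def creator_category(cell, separator=";"):
    """Chain three steps: split the cell, count the parts, apply a condition."""
    count = len([part for part in cell.split(separator) if part.strip()])
    return "multiple creators" if count > 1 else "single creator"


def title_category(title, limit=40):
    """Flag titles longer than the (assumed) limit."""
    return "long title" if len(title) > limit else "normal title"


# A custom facet counts the computed categories, not the original cell values.
creator_facet = Counter(creator_category(row["creator"]) for row in rows)
title_facet = Counter(title_category(row["title"]) for row in rows)

print(creator_facet)
print(title_facet)
```

Each function plays the role of one expression: the value goes in, a category comes out, and the facet shows how many rows fall into each category.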
Transforming Data
- GREL expressions can be used to extract, modify, and standardize information.
- Creating new columns preserves the original data and makes transformations easier to review.
- The split() function creates arrays whose elements can be accessed using positions such as [0] and [1].
- Structured information can be extracted from text using GREL expressions and pattern matching.
- Data cleaning often requires multiple transformation steps and manual review.
- Clustering helps identify potentially equivalent values and supports manual standardization.
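
The transformation ideas above can be sketched in Python. The example splits a value and reads positions [0] and [1], extracts a year with a regular expression, and groups spelling variants with a simplified key-collision approach. The fingerprint function here is a reduced approximation of the idea behind key-collision clustering, not OpenRefine's actual implementation, and all sample values and function names are assumptions.

```python
import re
from collections import defaultdict

# Splitting produces a list whose elements are read by position.
parts = [part.strip() for part in "Smith, Anna".split(",")]
family_name, given_name = parts[0], parts[1]


def extract_year(text):
    """Return the first standalone four-digit number in the text, or None."""
    match = re.search(r"\b(\d{4})\b", text)
    return match.group(1) if match else None


def fingerprint(value):
    """Simplified key: lowercase, drop punctuation, sort the unique tokens."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values that share a key; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical publisher names with inconsistent spelling.
publishers = ["Oxford Univ. Press", "oxford univ press", "Press, Oxford Univ", "Penguin"]
clusters = cluster(publishers)

print(family_name, given_name)
print(extract_year("Published in London, 1887."))
print(clusters)
```

The clusters are only candidates. Deciding which spelling becomes the standard form, and whether the grouped values really are equivalent, remains a manual review step.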
Reconciling Data with External Data Sources
- Reconciliation links text strings to unique identifiers in external databases.
- This makes your dataset more precise, reusable, and comparable across projects.
- OpenRefine suggests matches, but users should always review and confirm them.
- Identifier columns preserve these links when exporting the dataset.
Exporting and Importing Data and Workflows
- OpenRefine records every transformation you make.
- The Undo/Redo tab lets you move backward and forward through your cleaning process.
- Workflows can be exported as JSON and reapplied to other projects, ensuring transparency and reproducibility.
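
To give a sense of what an exported workflow looks like, the Python sketch below parses a small, hand-written approximation of an operation history and lists its steps. The JSON is trimmed to a few fields and is an assumption for illustration: real exports contain additional fields, and the column names and expressions shown here are hypothetical.

```python
import json

# Hand-written, simplified approximation of an exported operation history.
# Column names and expressions are hypothetical; real exports hold more fields.
history_json = """
[
  {
    "op": "core/text-transform",
    "columnName": "title",
    "expression": "grel:value.trim()",
    "description": "Trim whitespace in column title"
  },
  {
    "op": "core/column-addition",
    "baseColumnName": "creator",
    "newColumnName": "first_creator",
    "expression": "grel:value.split(';')[0]",
    "description": "Create column first_creator based on column creator"
  }
]
"""


def summarize(operations):
    """Return one numbered, human-readable line per recorded step."""
    return [
        f"{number}. {operation['op']}: {operation['description']}"
        for number, operation in enumerate(operations, start=1)
    ]


operations = json.loads(history_json)
for line in summarize(operations):
    print(line)
```

Because the history is plain JSON, it can be stored alongside the dataset and read by people or scripts, which is what makes the cleaning process reviewable and repeatable.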